Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding
Yuqing Wang Computer Science Department University of California, Santa Barbara
Yun Zhao Meta Platforms, Inc.
Linda Petzold Computer Science Department University of California, Santa Barbara
Abstract
Large language models (LLMs) have made significant progress in various domains, including healthcare. However, the specialized nature of clinical language understanding tasks presents unique challenges and limitations that warrant further investigation. In this study, we conduct a comprehensive evaluation of state-of-the-art LLMs, namely GPT-3.5, GPT-4, and Bard, within the realm of clinical language understanding tasks. These tasks span a diverse range, including named entity recognition, relation extraction, natural language inference, semantic textual similarity, document classification, and question-answering. We also introduce a novel prompting strategy, self-questioning prompting (SQP), tailored to enhance the performance of LLMs by eliciting informative questions and answers pertinent to the clinical scenarios at hand. Our evaluation highlights the importance of employing task-specific learning strategies and prompting techniques, such as SQP, to maximize the effectiveness of LLMs in healthcare-related tasks. Our study emphasizes the need for cautious implementation of LLMs in healthcare settings, ensuring a collaborative approach with domain experts and continuous verification by human experts to achieve responsible and effective use, ultimately contributing to improved patient care. Our code is available at https://github.com/EternityYW/LLM_healthcare.
1. Introduction
Recent advancements in clinical language understanding hold the potential to revolutionize healthcare by facilitating the development of intelligent systems that support decision-making (Lederman et al., 2022; Zuheros et al., 2021), expedite diagnostics (Wang and Lin, 2022; Wang et al., 2022b), and improve patient care (Christensen et al., 2002). Such systems could assist healthcare professionals in managing the ever-growing body of medical literature, interpreting complex patient records, and developing personalized treatment plans (Pivovarov and Elhadad, 2015; Zeng et al., 2021). State-of-the-art large language models (LLMs) like OpenAI’s GPT-3.5 and GPT-4 (OpenAI, 2023), and Google
AI’s Bard (Elias, 2023), have gained significant attention for their remarkable performance across diverse natural language understanding tasks, such as sentiment analysis, machine translation, text summarization, and question-answering (Zhong et al., 2023; Jiao et al., 2023; Wang et al., 2023). However, a comprehensive evaluation of their effectiveness in the specialized healthcare domain, with its unique challenges and complexities, remains necessary.
The healthcare domain presents distinct challenges, including handling specialized medical terminology, managing the ambiguity and variability of clinical language, and meeting the high demands for reliability and accuracy in critical tasks. Although existing research has explored the application of LLMs in healthcare, the focus has typically been on a limited set of tasks or learning strategies. For example, studies have investigated tasks like medical concept extraction, patient cohort identification, and drug-drug interaction prediction, primarily relying on supervised learning approaches (Vilar et al., 2018; Gehrmann et al., 2018; Afshar et al., 2019). In this study, we broaden this scope by evaluating LLMs on various clinical language understanding tasks, including natural language inference (NLI), document classification, semantic textual similarity (STS), question-answering (QA), named entity recognition (NER), and relation extraction.
Furthermore, the exploration of learning strategies such as few-shot learning, transfer learning, and unsupervised learning in the healthcare domain has been relatively limited. Similarly, the impact of diverse prompting techniques on improving model performance in clinical tasks has not been extensively examined, leaving room for a comprehensive comparative study.
In this study, we aim to bridge this gap by evaluating the performance of state-of-the-art LLMs on a range of clinical language understanding tasks. LLMs offer the exciting prospect of in-context few-shot learning via prompting, enabling task completion without fine-tuning separate language model checkpoints for each new challenge. In this context, we propose a novel prompting strategy called self-questioning prompting (SQP) to enhance these models’ effectiveness across various tasks. Our empirical evaluations demonstrate the potential of SQP as a promising technique for improving LLMs in the healthcare domain. Furthermore, by pinpointing tasks where the models excel and those where they struggle, we highlight the need for addressing specific challenges such as wording ambiguity, lack of context, and negation handling, while emphasizing the importance of responsible LLM implementation and collaboration with domain experts in healthcare settings.
In summary, our contributions are threefold:
negation, emphasizing the need for a cautious approach when employing LLMs in healthcare as a supplement to human expertise.
Generalizable Insights about Machine Learning in the Context of Healthcare
Our study presents a comprehensive evaluation of state-of-the-art LLMs in the healthcare domain, examining their capabilities and limitations across a variety of clinical language understanding tasks. We develop and demonstrate the efficacy of our self-questioning prompting (SQP) strategy, which involves generating context-specific questions and answers to guide the model towards a better understanding of clinical scenarios. This tailored learning approach significantly enhances LLM performance in healthcare-focused tasks. Our in-depth error analysis on the most challenging task shared by all models uncovers unique difficulties encountered by each model, such as wording ambiguity, lack of context, and negation issues. These findings emphasize the need for a cautious approach when implementing LLMs in healthcare as a complement to human expertise. We underscore the importance of integrating domain-specific knowledge, fostering collaborations among researchers, practitioners, and domain experts, and employing task-oriented prompting techniques like SQP. By addressing these challenges and harnessing the potential benefits of LLMs, we can contribute to improved patient care and clinical decision-making in healthcare settings.
2. Related Work
In this section, we review the relevant literature on large language models applied to clinical language understanding tasks in healthcare, as well as existing prompting strategies.
2.1. Large Language Models in Healthcare
The advent of the Transformer architecture (Vaswani et al., 2017) revolutionized the field of natural language processing, paving the way for the development of large-scale pre-trained language models such as base BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019). In the healthcare domain, domain-specific adaptations of BERT, such as BioBERT (Lee et al., 2020) and ClinicalBERT (Alsentzer et al., 2019), have been introduced to tackle various clinical language understanding tasks. More recently, GPT-3.5 and its successor GPT-4, launched by OpenAI (OpenAI, 2023), as well as Bard, developed by Google AI (Elias, 2023), have emerged as state-of-the-art LLMs, showcasing impressive capabilities in a wide range of applications, including healthcare (Biswas, 2023; Kung et al., 2023; Patel and Lam, 2023; Singhal et al., 2023).
Clinical language understanding is a critical aspect of healthcare informatics, focused on extracting meaningful information from diverse sources, such as electronic health records (Juhn and Liu, 2020), scientific articles (Grabar et al., 2021), and patient-authored text data (Mukhiya et al., 2020). This domain encompasses various tasks, including NER (Nayel and Shashirekha, 2017), relation extraction (Lv et al., 2016), NLI (Romanov and Shivade, 2018), STS (Wang et al., 2020), document classification (Hassanzadeh et al., 2018), and QA (Soni and Roberts, 2020). Prior work has demonstrated the effectiveness of domain-specific models in achieving improved performance on these tasks compared to general-purpose counterparts (Peng et al., 2019; Mascio et al., 2020; Digan et al., 2021). However, challenges posed by complex medical terminologies, the need for precise inference, and the reliance on domain-specific knowledge can limit their effectiveness (Shen et al., 2023). In this work, we address some of these limitations by conducting a comprehensive evaluation of state-of-the-art LLMs on a diverse set of clinical language understanding tasks, focusing on their performance and applicability within healthcare settings.
2.2. Prompting Strategies
Prompting strategies, often used in conjunction with few-shot or zero-shot learning (Brown et al., 2020; Kojima et al., 2022), guide and refine the behavior of LLMs to improve performance on various tasks. In these learning paradigms, LLMs are conditioned on a limited number of examples in the form of prompts, enabling them to generalize and perform well on the target task. Standard prompting techniques (Brown et al., 2020) involve providing an LLM with a clear and concise prompt, often in the form of a question or statement, which directs the model towards the desired output. Another approach, known as chain-of-thought prompting (Wei et al., 2022; Kojima et al., 2022), leverages a series of interconnected prompts to generate complex reasoning or multi-step outputs. While these existing prompting strategies have shown considerable success, their effectiveness can be limited by the quality and informativeness of the prompts (Wang et al., 2022a), which may not always capture the intricate nuances of specialized domains like healthcare. Motivated by these limitations, we propose a novel prompting strategy called self-questioning prompting (SQP). SQP aims to enhance the performance of LLMs by generating informative questions and answers related to the given clinical scenarios, thus addressing the unique challenges of the healthcare domain and contributing to improved task-specific performance.
3. Self-Questioning Prompting
Complex problems can be daunting, but they can often be solved by breaking them down into smaller parts and asking questions to clarify understanding and explore different aspects. Inspired by this human-like reasoning process, we introduce a novel method called self-questioning prompting (SQP) for LLMs. SQP aims to enhance model performance by encouraging models to be more aware of their own thinking processes, enabling them to better understand relevant concepts and develop deeper comprehension. This is achieved through the generation of targeted questions and answers that provide additional context and clarification, ultimately leading to improved performance on various tasks. The general construction process of SQP for a task, as shown in Figure 1, involves identifying key information in the input text, generating targeted questions to clarify understanding, using the questions and answers to enrich the context of the task prompt, and tailoring the strategy to meet the unique output requirements of each task. For a better understanding of the general construction procedure, consider an example prompt for the NLI task:
- Key Information: Given two clinical sentences, {sentence_1} and {sentence_2}, the model is asked to “Generate questions about the medical situations described”. This prompt guides the model to identify important elements.
- Question Generation: Following the first prompt, the model creates questions about the identified details, solidifying its grasp on the context.
- Enriching Context: The model is then prompted to “Answer these questions using basic medical knowledge and use the insights to evaluate their relationship”. This prompt instructs the model to deepen its understanding.
- Task-Specific Strategy: Lastly, the model follows the prompt to “Categorize the relationship between {sentence_1} and {sentence_2} as entailment if {sentence_2} logically follows {sentence_1}, contradiction if they oppose each other, or neutrality if unrelated”. This directly links the task requirements with the model’s understanding.
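The four construction steps above can be sketched as a simple prompt-assembly function. This is a minimal illustration only: the wording paraphrases the quoted instructions, and the function name and exact formatting are assumptions, not the paper's released templates.

```python
def build_sqp_nli_prompt(sentence_1: str, sentence_2: str) -> str:
    """Assemble a self-questioning prompt for NLI, following the four
    construction steps: key information, question generation, context
    enrichment, and the task-specific output requirement."""
    steps = [
        # Steps 1-2: present the key information and elicit targeted questions.
        f"Sentence 1: {sentence_1}",
        f"Sentence 2: {sentence_2}",
        "Generate questions about the medical situations described.",
        # Step 3: answer the questions to enrich the context.
        "Answer these questions using basic medical knowledge and use "
        "the insights to evaluate their relationship.",
        # Step 4: task-specific output requirement.
        "Categorize the relationship between sentence 1 and sentence 2 "
        "as entailment if sentence 2 logically follows sentence 1, "
        "contradiction if they oppose each other, or neutrality if unrelated.",
    ]
    return "\n".join(steps)

prompt = build_sqp_nli_prompt(
    "The patient denies chest pain.",
    "The patient has no cardiac symptoms.",
)
print(prompt)
```

The assembled string is then sent to the LLM as a single prompt; per-task templates differ only in the key-information and output-requirement steps.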

Figure 1: Construction process of self-questioning prompting (SQP).
In Table 1, we compare the proposed SQP with existing prompting methods, including standard prompting and chain-of-thought prompting, highlighting the differences in guidelines and purposes for each strategy. Subsequently, we present the SQP templates for six clinical language understanding tasks. The core self-questioning process is highlighted in each template, as shown in Figure 2. The SQP templates were developed through a combination of consultations with healthcare professionals and iterative testing. We evaluated multiple prompt candidates, with the best-performing templates chosen for use in the study. In the case of few-shot examples, the SQP QA pairs were annotated by healthcare professionals for model input. These underscored and bold parts illustrate how SQP generates targeted questions and answers related to the tasks, which guide the model’s reasoning, leading to improved task performance. By incorporating this self-questioning process into the prompts, SQP enables the model to utilize its knowledge more effectively and adapt to a wide range of clinical tasks.
4. Datasets
We utilize a wide range of biomedical and clinical language understanding datasets for our experiments. These datasets encompass various tasks, including NER (NCBI-Disease (Dogan et al., 2014) and BC5CDR-Chem (Li et al., 2016)), relation extraction (i2b2 2010-Relation (Uzuner et al., 2011) and SemEval 2013-DDI (Segura-Bedmar et al., 2013)), STS (BIOSSES (Sogancioglu et al., 2017)), NLI (MedNLI (Romanov and Shivade, 2018)), document classification (i2b2 2006-Smoking (Uzuner et al., 2006)), and QA (bioASQ 10b-Factoid (Tsatsaronis et al., 2015)). Among these tasks, STS (BIOSSES) is a regression task, while the rest are classification tasks. Table 2 offers a comprehensive overview of the tasks and datasets. For NER tasks, we adopt the BIO tagging scheme, where ‘B’ represents the beginning of an entity, ‘I’ signifies the continuation of an entity, and ‘O’ denotes the absence of an entity. The output

Figure 2: Self-questioning prompting (SQP) templates for six clinical language understanding tasks, with the core self-questioning process underscored and bolded. These components represent the generation of targeted questions and answers, guiding the model’s reasoning and enhancing task performance.
Table 1: Comparison among standard prompting, chain-of-thought prompting, and self-questioning prompting.
| Prompting Strategy | Guideline | Purpose |
|---|---|---|
| Standard | Use a direct, concise prompt for the desired task. | To obtain a direct response from the model. |
| Chain-of-Thought | Create interconnected prompts guiding the model through logical reasoning. | To engage the model's reasoning by breaking down complex tasks. |
| Self-Questioning | Generate targeted questions and use answers to guide the task response. | To deepen the model's understanding and enhance performance. |
column in Table 2 presents specific classes, scores, or tagging schemes associated with each task.
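As a concrete illustration of the BIO scheme described above, a multi-token entity mention receives one ‘B’ tag followed by ‘I’ tags. The helper below and its example sentence are invented for illustration, not taken from the datasets:

```python
def spans_to_bio(tokens, entity_spans):
    """Convert (start, end) token spans into BIO tags.
    entity_spans: list of (start, end) pairs, end-exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                 # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                 # continuation of the entity
    return tags

tokens = ["Patient", "diagnosed", "with", "type", "2", "diabetes", "."]
# The disease mention "type 2 diabetes" spans token positions 3-5.
print(spans_to_bio(tokens, [(3, 6)]))
# → ['O', 'O', 'O', 'B', 'I', 'I', 'O']
```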
For relation extraction, SemEval 2013-DDI requires identifying one of the following labels: Advice, Effect, Mechanism, or Int. In the case of i2b2 2010-Relation, it necessitates predicting relationships such as Treatment Improves Medical Problem (TrIP), Treatment Worsens Medical Problem (TrWP), Treatment Causes Medical Problem (TrCP), Treatment is Administered for Medical Problem (TrAP), Treatment is Not Administered because of Medical Problem (TrNAP), Test Reveals Medical Problem (TeRP), Test Conducted to Investigate Medical Problem (TeCP), or Medical Problem Indicates Medical Problem (PIP).
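The two label inventories above can be collected into a small output-validation helper, which is useful when parsing free-text LLM responses back into dataset labels. The set names and the `dataset` argument convention below are illustrative assumptions:

```python
# Label inventories for the two relation-extraction datasets, as
# enumerated in the dataset descriptions above.
SEMEVAL_DDI_LABELS = {"Advice", "Effect", "Mechanism", "Int"}
I2B2_RELATION_LABELS = {
    "TrIP",   # Treatment Improves Medical Problem
    "TrWP",   # Treatment Worsens Medical Problem
    "TrCP",   # Treatment Causes Medical Problem
    "TrAP",   # Treatment is Administered for Medical Problem
    "TrNAP",  # Treatment is Not Administered because of Medical Problem
    "TeRP",   # Test Reveals Medical Problem
    "TeCP",   # Test Conducted to Investigate Medical Problem
    "PIP",    # Medical Problem Indicates Medical Problem
}

def is_valid_prediction(label: str, dataset: str) -> bool:
    """Check that a predicted relation label belongs to the dataset's
    label set ('ddi' for SemEval 2013-DDI, anything else for i2b2)."""
    labels = SEMEVAL_DDI_LABELS if dataset == "ddi" else I2B2_RELATION_LABELS
    return label in labels

print(is_valid_prediction("TrAP", "i2b2"))  # → True
```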
Table 2: Overview of biomedical/clinical language understanding tasks and datasets.
| Task | Dataset | Output | Metric |
|---|---|---|---|
| Named Entity Recognition | NCBI-Disease, BC5CDR-Chemical | BIO tagging for diseases and chemicals | Micro F1 |
| Relation Extraction | i2b2 2010-Relation, SemEval 2013-DDI | relations between entities | Micro F1, Macro F1 |
| Semantic Textual Similarity | BIOSSES | similarity scores from 0 (different) to 4 (identical) | Pearson Correlation |
| Natural Language Inference | MedNLI | entailment, neutral, contradiction | Accuracy |
| Document Classification | i2b2 2006-Smoking | current smoker, past smoker, smoker, non-smoker, unknown | Micro F1 |
| Question-Answering | bioASQ 10b-Factoid | factoid answers | Mean Reciprocal Rank, Lenient Accuracy |
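Of the metrics in Table 2, mean reciprocal rank and lenient accuracy are the least standard, so a minimal sketch of both may help. One assumption here: each QA prediction is a ranked list of candidate answers, and lenient accuracy counts a question as correct if any returned candidate matches a gold answer.

```python
def mean_reciprocal_rank(ranked_predictions, gold_answers):
    """ranked_predictions: one ranked candidate list per question.
    gold_answers: one set of acceptable answer strings per question."""
    total = 0.0
    for candidates, gold in zip(ranked_predictions, gold_answers):
        gold_lower = {g.lower() for g in gold}
        for rank, cand in enumerate(candidates, start=1):
            if cand.lower() in gold_lower:
                total += 1.0 / rank   # reciprocal rank of first correct hit
                break
    return total / len(gold_answers)

def lenient_accuracy(ranked_predictions, gold_answers):
    """A question counts as correct if any candidate matches a gold answer."""
    hits = sum(
        any(c.lower() in {g.lower() for g in gold} for c in candidates)
        for candidates, gold in zip(ranked_predictions, gold_answers)
    )
    return hits / len(gold_answers)

preds = [["aspirin", "ibuprofen"], ["insulin"]]
gold = [{"ibuprofen"}, {"metformin"}]
print(mean_reciprocal_rank(preds, gold))  # → 0.25
print(lenient_accuracy(preds, gold))      # → 0.5
```

In the example, the first question's correct answer appears at rank 2 (reciprocal rank 0.5) and the second is missed, giving MRR 0.25 and lenient accuracy 0.5.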
5. Experiments
In this section, we outline the experimental setup and evaluation procedure used to assess the performance of various LLMs on tasks related to biomedical and clinical text comprehension and analysis.
5.1. Experimental Setup
We investigate various prompting strategies for state-of-the-art LLMs, employing N-shot learning techniques on diverse clinical language understanding tasks.
Large Language Models. We assess the performance of three state-of-the-art LLMs, each offering unique capabilities and strengths. First, we examine GPT-3.5, an advanced model developed by OpenAI, known for its remarkable language understanding and generation capabilities. Next, we investigate GPT-4, an even more powerful successor to GPT-3.5, designed to push the boundaries of natural language processing further. Finally, we explore Bard, an innovative language model launched by Google AI. We experiment with these models through their web versions. By comparing these models, we aim to gain insights into their performance on clinical language understanding tasks.
Prompting Strategies. We employ three prompting strategies to optimize the performance of LLMs on each task: standard prompting, chain-of-thought prompting, and our proposed self-questioning prompting. Standard prompting serves as the baseline, while chain-of-thought and self-questioning prompting techniques are investigated to assess their potential impact on model performance. The full set of prompting templates used for each task are given in Appendix A.
N-Shot Learning. We explore N-shot learning for LLMs, focusing on zero-shot and 5-shot learning scenarios. Zero-shot learning refers to the situation where the model has not been exposed to any labeled examples during training and is expected to generalize to the task without prior knowledge. In contrast, 5-shot learning involves the model receiving a small amount of labeled data, consisting of five few-shot exemplars from the training set, to facilitate its adaptation to the task. We evaluate the model’s performance in both zero-shot and 5-shot learning settings to understand its ability to generalize and adapt to different tasks in biomedical and clinical domains.
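The zero-shot versus 5-shot setups can be sketched as a prompt-construction function: the zero-shot prompt carries only the task instruction and the query, while the 5-shot prompt prepends five labeled exemplars. The `Input:`/`Label:` formatting below is an illustrative assumption, not the paper's exact template:

```python
import random

def build_prompt(task_instruction, query, exemplars=None):
    """Zero-shot: instruction + query only.
    N-shot: prepend N labeled (input, label) exemplars from the training set."""
    parts = [task_instruction]
    for text, label in (exemplars or []):
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")   # model completes the label
    return "\n\n".join(parts)

# Toy training pool; real exemplars would be drawn from the task's train set.
train = [("sentence a", "entailment"), ("sentence b", "neutral"),
         ("sentence c", "contradiction"), ("sentence d", "entailment"),
         ("sentence e", "neutral"), ("sentence f", "contradiction")]

rng = random.Random(0)
five_shot = rng.sample(train, 5)   # five few-shot exemplars
zero_shot_prompt = build_prompt("Classify the sentence pair.", "query text")
five_shot_prompt = build_prompt("Classify the sentence pair.", "query text", five_shot)
```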
5.2. Evaluation Procedure
To assess the performance for each task, given the constraints of model release timings and web version utilization, we form an evaluation set by randomly selecting 50% of instances from the original test set. In the case of zero-shot learning, we directly evaluate the model’s performance on this evaluation set. For 5-shot learning, we enhance the model with five few-shot exemplars, which are randomly chosen from the training set. The model’s performance is then assessed using the same evaluation set as in the zero-shot learning scenario.
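The evaluation split described above (a random 50% of the test set, plus five training exemplars for the 5-shot setting) can be sketched as follows; the fixed seed is an assumption added for reproducibility:

```python
import random

def make_eval_setup(test_set, train_set, seed=42):
    rng = random.Random(seed)
    # Evaluation set: a random 50% of the original test instances.
    eval_set = rng.sample(test_set, len(test_set) // 2)
    # 5-shot exemplars: five instances drawn from the training set.
    exemplars = rng.sample(train_set, 5)
    return eval_set, exemplars

test_set = list(range(100))        # stand-ins for test instances
train_set = list(range(100, 200))  # stand-ins for training instances
eval_set, exemplars = make_eval_setup(test_set, train_set)
print(len(eval_set), len(exemplars))  # → 50 5
```

The same `eval_set` is reused for both the zero-shot and 5-shot runs, so the two settings are compared on identical instances.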
6. Results
In this section, we present a comprehensive analysis of the performance of the LLMs (i.e., Bard, GPT-3.5, and GPT-4) on clinical language understanding tasks. We begin by comparing the overall performance of these models, followed by an examination of the effectiveness of various prompting strategies. Next, we delve into a detailed task-by-task analysis, providing insights into the models’ strengths and weaknesses across different tasks. Finally, we conduct a case study on error analysis, investigating common error types and the potential improvements brought about by advanced prompting techniques.
6.1. Overall Performance Comparison
In our study, we evaluate the performance of Bard, GPT-3.5, and GPT-4 on various clinical benchmark datasets spanning multiple tasks. We employ different prompting strategies, including standard, chain-of-thought, and self-questioning, as well as N-shot learning with N equal to 0 and 5. Table 3 summarizes the experimental results.
We observe that GPT-4 generally outperforms Bard and GPT-3.5 in tasks involving the identification and classification of specific information within text, such as NLI (MedNLI), NER (NCBI-Disease, BC5CDR-Chemical), and STS (BIOSSES). In the realm of document classification, a task that involves assigning predefined categories to entire documents, GPT-4 also surpasses GPT-3.5 and Bard on the i2b2 2006-Smoking dataset. In relation extraction, GPT-4 outperforms both Bard and GPT-3.5 on the SemEval 2013-DDI dataset, while Bard
Table 3: Performance comparison of Bard, GPT-3.5, and GPT-4 with different prompting strategies (standard, chain-of-thought, and self-questioning) and N-shot learning (N = 0, 5) on clinical benchmark datasets, using randomly sampled evaluation data from the test set. Our results show that GPT-4 outperforms Bard and GPT-3.5 in tasks that involve identification and classification of specific information within text, while Bard achieves higher accuracy than GPT-3.5 and GPT-4 on tasks that require a more factual understanding of the text. Additionally, self-questioning prompting consistently achieves the best performance on the majority of tasks. The best results for each dataset are highlighted in bold.
| Model | NCBI-Disease (Micro F1) | BC5CDR-Chemical (Micro F1) | i2b2 2010-Relation (Micro F1) | SemEval 2013-DDI (Macro F1) | BIOSSES (Pear.) | MedNLI (Acc.) | i2b2 2006-Smoking (Micro F1) | BioASQ 10b-Factoid (MRR) | BioASQ 10b-Factoid (Len. Acc.) |
|---|---|---|---|---|---|---|---|---|---|
| **Bard** | | | | | | | | | |
| w/ zero-shot StP | 0.911 | 0.947 | 0.720 | 0.490 | 0.401 | 0.580 | 0.780 | 0.800 | 0.820 |
| w/ 5-shot StP | 0.933 | 0.972 | 0.900 | 0.528 | 0.449 | 0.640 | 0.820 | 0.845 | 0.880 |
| w/ zero-shot CoTP | 0.946 | 0.972 | 0.660 | 0.525 | 0.565 | 0.580 | 0.760 | **0.887** | **0.920** |
| w/ 5-shot CoTP | 0.955 | 0.977 | 0.900 | 0.709 | 0.602 | 0.720 | 0.800 | 0.880 | 0.900 |
| w/ zero-shot SQP | 0.956 | 0.977 | 0.760 | 0.566 | 0.576 | 0.760 | 0.760 | 0.850 | 0.860 |
| w/ 5-shot SQP | 0.960 | 0.983 | **0.940** | 0.772 | 0.601 | 0.760 | 0.820 | 0.860 | 0.860 |
| **GPT-3.5** | | | | | | | | | |
| w/ zero-shot StP | 0.918 | 0.939 | 0.780 | 0.360 | 0.805 | 0.700 | 0.680 | 0.707 | 0.720 |
| w/ 5-shot StP | 0.947 | 0.967 | 0.840 | 0.531 | 0.828 | 0.780 | 0.780 | 0.710 | 0.740 |
| w/ zero-shot CoTP | 0.955 | 0.977 | 0.680 | 0.404 | 0.875 | 0.740 | 0.680 | 0.743 | 0.800 |
| w/ 5-shot CoTP | 0.967 | 0.977 | 0.840 | 0.548 | 0.873 | 0.740 | 0.740 | 0.761 | 0.820 |
| w/ zero-shot SQP | 0.963 | 0.974 | 0.860 | 0.529 | 0.873 | 0.760 | 0.720 | 0.720 | 0.740 |
| w/ 5-shot SQP | 0.970 | 0.983 | 0.860 | 0.620 | 0.892 | 0.820 | 0.820 | 0.747 | 0.780 |
| **GPT-4** | | | | | | | | | |
| w/ zero-shot StP | 0.968 | 0.976 | 0.860 | 0.428 | 0.820 | 0.800 | **0.900** | 0.795 | 0.820 |
| w/ 5-shot StP | 0.975 | 0.989 | 0.860 | 0.502 | 0.848 | 0.840 | 0.880 | 0.815 | 0.840 |
| w/ zero-shot CoTP | 0.981 | 0.994 | 0.860 | 0.509 | 0.875 | 0.840 | 0.860 | 0.805 | 0.840 |
| w/ 5-shot CoTP | 0.984 | 0.994 | 0.880 | 0.544 | 0.897 | 0.800 | 0.860 | 0.852 | 0.880 |
| w/ zero-shot SQP | **0.985** | 0.992 | 0.920 | 0.595 | 0.889 | **0.860** | **0.900** | 0.844 | 0.900 |
| w/ 5-shot SQP | 0.984 | **0.995** | 0.920 | **0.798** | **0.916** | **0.860** | 0.860 | 0.873 | 0.900 |
Note: Acc. = Accuracy; CoTP = Chain-of-Thought Prompting; Len. Acc. = Lenient Accuracy; MRR = Mean Reciprocal Rank; Pear. = Pearson Correlation; SQP = Self-Questioning Prompting; StP = Standard Prompting.
demonstrates superior performance in the i2b2 2010-Relation dataset. Additionally, Bard excels in tasks that require a more factual understanding of the text, such as QA (BioASQ 10b-Factoid).
Regarding prompting strategies, self-questioning consistently outperforms standard prompting and exhibits competitive performance when compared to chain-of-thought prompting across all settings. Our findings suggest that self-questioning is a promising approach for enhancing the performance of LLMs, achieving the best performance for the majority of tasks, except for QA (BioASQ 10b-Factoid).
Furthermore, our study demonstrates that 5-shot learning generally leads to improved performance across all tasks when compared to zero-shot learning, although not universally. This finding indicates that incorporating even a modest amount of task-specific training data can substantially enhance the effectiveness of pre-trained LLMs.
6.2. Prompting Strategies Comparison
We evaluate the performance of different prompting strategies, specifically standard prompting, self-questioning prompting (SQP), and chain-of-thought prompting (CoTP), in both zero-shot and 5-shot learning settings across various models and datasets. Figure 3 presents the averaged performance comparison over all datasets, under the assumption that datasets and evaluation metrics are equally important and directly comparable. We observe that self-questioning prompting consistently yields the best performance compared to standard and chain-of-thought prompting. In addition, GPT-4 excels among the models, demonstrating the highest overall performance.

Figure 3: Average performance comparison of three prompting methods in zero-shot and 5-shot learning settings across Bard, GPT-3.5, and GPT-4 models. Performance values are averaged across all datasets, assuming equal importance for datasets and evaluation metrics, as well as direct comparability. The self-questioning prompting method consistently outperforms standard and chain-of-thought prompting, and GPT-4 excels among the models.
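To make the three strategies concrete, they can be sketched as prompt-template builders. These templates are illustrative assumptions, not the paper's exact wording; the instruction and input strings are placeholders.

```python
def standard_prompt(instruction: str, text: str) -> str:
    """Standard prompting (StP): task instruction plus input, answered directly."""
    return f"{instruction}\nInput: {text}\nAnswer:"

def cot_prompt(instruction: str, text: str) -> str:
    """Chain-of-thought prompting (CoTP): elicit step-by-step reasoning first."""
    return f"{instruction}\nInput: {text}\nLet's think step by step.\nAnswer:"

def sqp_prompt(instruction: str, text: str) -> str:
    """Self-questioning prompting (SQP): have the model pose and answer
    clinically relevant questions about the input before responding.
    The exact SQP phrasing here is a hypothetical sketch."""
    return (
        f"{instruction}\nInput: {text}\n"
        "First, raise informative questions about the clinical scenario above "
        "and answer them. Then use those answers to give the final response.\n"
        "Answer:"
    )
```

The key difference is where the model spends its tokens before committing to an answer: CoTP asks for free-form reasoning, while SQP structures that reasoning as question-answer pairs grounded in the clinical scenario.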
Table 4 and Table 5 demonstrate performance improvements of prompting strategies over multiple datasets and models under zero-shot and 5-shot settings, respectively, using standard prompting as a baseline. In the zero-shot learning setting (Table 4), self-questioning prompting achieves the highest improvement in the majority of tasks, with improvements ranging from 4.9% to 46.9% across different datasets.
In the 5-shot learning setting (Table 5), self-questioning prompting leads to the highest improvement in most tasks, with improvements ranging from 2.9% to 59.0%. In both settings, we also observe some instances where chain-of-thought or self-questioning prompting yields negative values, such as relation extraction (i2b2 2010-Relation) and document classification (i2b2 2006-Smoking), indicating inferior performance compared to standard prompting. This could be due to the specific nature of certain tasks, where the additional context or complexity introduced by the alternative prompting strategies might not contribute to better understanding or performance. It might also be possible that the model's capacity is insufficient to take advantage of the additional information provided by the alternative prompting strategies in some cases.
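The improvement figures in Tables 4 and 5 follow directly from Table 3 as relative change over the standard-prompting baseline. A minimal sketch, using the GPT-3.5 zero-shot macro F1 scores on SemEval 2013-DDI (StP 0.360, SQP 0.529) as the worked example:

```python
def pct_improvement(score: float, baseline: float) -> float:
    """Relative improvement (%) of a prompting strategy over its StP baseline."""
    return round(100 * (score - baseline) / baseline, 1)

# GPT-3.5 zero-shot on SemEval 2013-DDI: StP 0.360 -> SQP 0.529
print(pct_improvement(0.529, 0.360))  # 46.9, matching Table 4
```

The same formula reproduces the negative entries: a 5-shot CoTP score below the 5-shot StP score yields a negative percentage.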
Overall, self-questioning prompting generally outperforms other prompting strategies across different models and datasets in both zero-shot and 5-shot learning settings, despite occasional inferior performance in specific tasks. This suggests that self-questioning prompting can be a promising technique for improving performance in the domain of clinical language understanding. Furthermore, GPT-4 emerges as the top-performing model, emphasizing the potential for various applications in the clinical domain.
Table 4: Comparison of zero-shot learning performance improvements (in %) for different models and prompting techniques on multiple datasets, with standard prompting as the baseline. Bold values indicate the highest improvement for each dataset across models and prompting strategies, while negative values signify inferior performance. Self-questioning prompting leads to the largest improvement in the majority of tasks.
| Dataset | Metric | Bard CoTP | Bard SQP | GPT-3.5 CoTP | GPT-3.5 SQP | GPT-4 CoTP | GPT-4 SQP |
|---|---|---|---|---|---|---|---|
| NCBI-Disease | Micro F1 | 3.8 | **4.9** | 4.0 | **4.9** | 1.3 | 1.8 |
| BC5CDR-Chemical | Micro F1 | 2.6 | 3.2 | **4.0** | 3.7 | 1.8 | 1.6 |
| i2b2 2010-Relation | Micro F1 | -8.3 | 5.6 | -12.8 | **10.3** | 0.0 | 7.0 |
| SemEval 2013-DDI | Macro F1 | 7.1 | 15.5 | 12.2 | **46.9** | 18.9 | 39.0 |
| BIOSSES | Pear. | 40.9 | **43.6** | 8.7 | 8.4 | 6.7 | 8.4 |
| MedNLI | Acc. | 0.0 | **31.0** | 5.7 | 8.6 | 5.0 | 7.5 |
| i2b2 2006-Smoking | Micro F1 | -2.6 | -2.6 | 0.0 | **5.9** | -4.4 | 0.0 |
| BioASQ 10b-Factoid | MRR | **10.9** | 6.3 | 5.1 | 1.8 | 1.3 | 6.2 |
| BioASQ 10b-Factoid | Len. Acc. | **12.2** | 4.9 | 11.1 | 2.8 | 2.4 | 9.8 |
6.3. Task-by-Task Analysis
To delve deeper into the specific characteristics and challenges associated with each task (i.e., NER, relation extraction, STS, NLI, document classification, and QA), we individually analyze the results, aiming to better understand the underlying factors that contribute to model performance and identify areas for potential improvement or further investigation.
Named Entity Recognition Task. In the NER task, we focus on two datasets: NCBI-Disease and BC5CDR-Chemical. Employing the BIO tagging scheme, we evaluate model performance using the micro F1 metric. NER tasks in the biomedical domain pose unique challenges due to specialized terminology, complex entity names, and frequent use of abbreviations. Our results indicate that, compared to standard prompting, self-questioning prompting leads to average improvements of 3.9% and 2.8% in zero-shot learning for NCBI-Disease and BC5CDR-Chemical, respectively. In the 5-shot setting, the average improvements are 2.1% and 1.1%, respectively. Moreover, GPT-4 demonstrates the most significant performance boost compared to Bard and GPT-3.5.
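For concreteness, strict entity-level micro F1 under the BIO scheme can be computed as sketched below. This is a minimal sketch, assuming exact span-and-type matching; the paper's actual evaluation script may differ in details such as how malformed I- tags are handled.

```python
def entity_spans(tags):
    """Extract (start, end, type) entity spans from one BIO tag sequence.
    A stray or type-mismatched I- tag simply closes any open span."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.add((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # entity continues
        else:  # "O" or inconsistent I- tag
            if start is not None:
                spans.add((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.add((start, len(tags), etype))
    return spans

def micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exact entity-span matches, pooled across sentences."""
    tp = fp = fn = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gs, ps = entity_spans(g), entity_spans(p)
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Under this metric, tagging a gold-"O" token such as "aromatic" as "B-Chemical" creates a spurious predicted span, a false positive that lowers precision and hence micro F1.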
We also conduct a qualitative analysis by examining specific examples from the datasets, such as the term “aromatic ring” in the BC5CDR-Chemical dataset, which is often incorrectly predicted as “B-Chemical” (beginning of a chemical entity) instead of “O” (outside of any entity) by the models. This error might occur because the term “aromatic ring” refers to a structural feature commonly found in chemical compounds, leading models to associate it with chemical entities and misclassify it. This example highlights the challenges faced by the models in accurately recognizing entities, particularly when dealing with terms that have strong associations with specific entity types. It also demonstrates the potential limitations of prompting strategies in addressing these challenges, as models may still struggle to disambiguate such terms, despite employing different prompting techniques.

Table 5: Comparison of 5-shot learning performance improvements (in %) for different models and prompting techniques on multiple datasets, with standard prompting as the baseline. Bold values indicate the highest improvement for each dataset across models and prompting strategies, while negative values signify inferior performance. Self-questioning prompting leads to the highest improvement in 6 out of 8 tasks, followed by chain-of-thought prompting with the 2 remaining largest improvements.

| Dataset | Metric | Bard CoTP | Bard SQP | GPT-3.5 CoTP | GPT-3.5 SQP | GPT-4 CoTP | GPT-4 SQP |
|---|---|---|---|---|---|---|---|
| NCBI-Disease | Micro F1 | 2.4 | **2.9** | 2.1 | 2.4 | 0.9 | 0.9 |
| BC5CDR-Chemical | Micro F1 | 0.5 | 1.1 | 1.0 | **1.7** | 0.5 | 0.6 |
| i2b2 2010-Relation | Micro F1 | 0.0 | 4.4 | 0.0 | 2.4 | 2.3 | **7.0** |
| SemEval 2013-DDI | Macro F1 | 34.3 | 46.2 | 3.2 | 16.8 | 8.4 | **59.0** |
| BIOSSES | Pear. | **34.1** | 33.9 | 5.4 | 7.7 | 5.8 | 8.0 |
| MedNLI | Acc. | 12.5 | **18.8** | -5.1 | 5.1 | -4.8 | 2.4 |
| i2b2 2006-Smoking | Micro F1 | -2.4 | 0.0 | -5.1 | **5.1** | -2.3 | -2.3 |
| BioASQ 10b-Factoid | MRR | 4.1 | 1.8 | **7.2** | 5.2 | 4.5 | 7.1 |
| BioASQ 10b-Factoid | Len. Acc. | 2.3 | -2.3 | **10.8** | 5.4 | 4.8 | 7.1 |
Relation Extraction Task. In the relation extraction task involving the i2b2 2010-Relation and SemEval 2013-DDI datasets, we evaluate our model's performance using micro F1 and macro F1 scores, respectively. Our study reveals that self-questioning prompting leads to average improvements of 7.6% and 33.8% in zero-shot learning for the i2b2 2010-Relation and SemEval 2013-DDI datasets, respectively. In the 5-shot setting, the average improvements are 4.6% and 40.7%, respectively. GPT-4 demonstrates more significant performance improvement compared to Bard and GPT-3.5.
For our qualitative analysis, we examine a challenging example from the i2b2 2010-Relation dataset, where the models struggle to identify the correct relationship between “Elavil” and “stabbing left-sided chest pain”. The gold label indicates “TrWP” (Treatment Worsens Medical Problem), but all models incorrectly predict it as “TrAP” (Treatment is Administered for Medical Problem). This misclassification may arise from the models' inability to recognize that the patient still experiences severe pain despite taking Elavil. This example highlights the difficulties encountered by the models in accurately identifying nuanced relationships in complex biomedical texts. Incorporating domain-specific knowledge could help to better capture the subtleties of such relationships.
Semantic Textual Similarity Task. In the STS task, we focus on the BIOSSES dataset and evaluate our model's performance using Pearson correlation. Our study reveals that self-questioning prompting leads to average improvements of 20.1% and 16.5% in zero-shot and 5-shot settings, respectively. GPT-4 outperforms Bard and GPT-3.5 across all settings.
Taking a closer look, we examine a pair of sentences with a gold label similarity score of 0.2 (on the dataset's 0-4 scale), indicating high dissimilarity. The first sentence discusses the specific effect of mutant K-Ras on tumor progression, while the second sentence refers to an important advance in lung cancer research without mentioning any specific details. However, the average score predicted by models, regardless of the setting, is 2.0. This discrepancy may arise from the models' difficulty in grasping the distinct contexts in which the sentences are written. The models might be misled by the presence of related keywords such as “tumor” and “cancer”, leading to an overestimation of the similarity score. This example demonstrates the challenge faced by the models in accurately gauging the semantic similarity of sentences when the underlying context or focus differs, despite the presence of shared terminology.
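The Pearson correlation used for BIOSSES measures how well the ranking and spread of predicted scores track the gold annotations, not whether the scores match in absolute value. A minimal sketch of the computation:

```python
import math

def pearson(preds, golds):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(preds)
    mp, mg = sum(preds) / n, sum(golds) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(preds, golds))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sg = math.sqrt(sum((g - mg) ** 2 for g in golds))
    return cov / (sp * sg)
```

Because the metric is invariant to linear shifts, a model that systematically overestimates dissimilar pairs (e.g., predicting 2.0 where the gold score is 0.2) is penalized only insofar as the overestimation distorts the relative ordering of sentence pairs.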
Natural Language Inference Task. In the NLI task, we focus on the MedNLI dataset and evaluate our model’s performance using accuracy. On average, self-questioning prompting improves the model performance by 15.7% and 8.8% for zero-shot and 5-shot settings, respectively, with GPT-4 consistently outperforming Bard and GPT-3.5 across all settings.
We further investigate a pair of sentences where the gold label is “contradiction”. The first sentence states that the patient was transferred to the Neonatal Intensive Care Unit for observation, while the second sentence claims that the patient had an uneventful course. Despite the gold label, none of the models ever predict the true label, opting for “neutral” or “entailment” instead. The models may focus on the absence of explicit negations or conflicting keywords, leading them to overlook the more subtle contradiction. These findings highlight the need to enhance model capabilities to better understand implicit and nuanced relationships between sentences, thereby enabling more accurate predictions in complex real-world clinical scenarios.
Document Classification Task. In the document classification task, we focus on the i2b2 2006-Smoking dataset and evaluate our model's performance using micro F1. Our analysis reveals that self-questioning prompting leads to average improvements of 1.1% and 0.9% for zero-shot and 5-shot settings, respectively. GPT-4 consistently delivers superior performance to Bard and GPT-3.5 in all experimental settings.
During our qualitative assessment, we investigate a patient record containing the sentence “He is a heavy smoker and drinks 2-3 shots per day at times”. All models classify the patient as a “CURRENT SMOKER”, while the patient is, in fact, a past smoker, as indicated by the subsequent descriptions of medications and the patient's improved condition. This misclassification may occur because the models focus on the explicit mention of smoking habits in the sentence, neglecting the broader context provided by the entire document. This instance highlights the need for models to take a more comprehensive approach in interpreting clinical documents by considering the overall context, rather than relying solely on individual textual cues.
Question-Answering Task. In the QA task using the BioASQ 10b-Factoid dataset, we evaluate our model with MRR and lenient accuracy. For MRR, self-questioning prompting leads to average improvements of 4.8% and 4.7% for zero-shot and 5-shot settings, respectively. For lenient accuracy, the improvements are 5.8% and 3.4%, respectively. GPT-4 consistently outperforms Bard and GPT-3.5 across all settings.
During our qualitative exploration, we analyze an example question: “What is the major sequence determinant for nucleosome positioning?” The correct answer is “G+C content”; however, the top answer from the models is “DNA sequence”. This error might occur because the models capture the broader context related to nucleosome positioning but fail to recognize the specific determinant, namely G+C content. The models may rely on more general associations between DNA sequences and nucleosome positioning, resulting in a less precise answer. This example underscores the necessity for models to identify fine-grained details in biomedical questions and deliver more accurate and specific responses.
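The two QA metrics can be sketched as follows. This is a minimal sketch: the lowercase exact-match normalization against the gold answer list is an assumption, and the official BioASQ scorer may normalize differently.

```python
def mrr(ranked_answers, gold):
    """Mean reciprocal rank of the first correct answer per question.
    ranked_answers: one ranked candidate list per question;
    gold: the accepted gold answer strings per question."""
    total = 0.0
    for preds, golds in zip(ranked_answers, gold):
        accepted = {g.lower() for g in golds}
        for rank, ans in enumerate(preds, start=1):
            if ans.lower() in accepted:
                total += 1.0 / rank
                break  # no correct answer found contributes 0
    return total / len(ranked_answers)

def lenient_accuracy(ranked_answers, gold):
    """Fraction of questions whose ranked list contains any correct answer."""
    hits = 0
    for preds, golds in zip(ranked_answers, gold):
        accepted = {g.lower() for g in golds}
        if any(ans.lower() in accepted for ans in preds):
            hits += 1
    return hits / len(ranked_answers)
```

On the nucleosome example above, a model ranking “DNA sequence” first and “G+C content” second would score a reciprocal rank of 0.5 but still count as a hit under lenient accuracy, which is why the two metrics can diverge.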
6.4. Case Study: Error Analysis
We conduct a comprehensive error analysis on relation extraction (SemEval 2013-DDI), the most challenging task shared by all LLMs, identified by computing the median performance across all settings for a robust comparison. Our process for identifying errors in relation extraction has two main stages. First, we find errors by comparing the model's outputs with the correct labels; any differences are marked as errors. Next, we ask the model to explain its predictions, and we manually review these explanations to spot errors, understand why they happened, and group them into specific error types. We investigate common error types and provide illustrative examples, examining the influence of prompting strategies and N-shot learning on the models' performance. This analysis highlights each model's strengths, limitations, and the role of experimental settings in improving clinical language understanding tasks.
Table 6: Average error type distribution for SemEval 2013-DDI across Bard, GPT-3.5, and GPT-4. Wording Ambiguity is the most common error for Bard, Lack of Context for GPT-3.5, and Negation and Qualification for GPT-4.
| Error Type | Description | Bard (%) | GPT-3.5 (%) | GPT-4 (%) |
|---|---|---|---|---|
| Wording Ambiguity | Unclear wording | 32 | 23 | 24 |
| Lack of Context | Incomplete context usage | 25 | 31 | 19 |
| Complex Interactions | Multiple drug interactions | 19 | 12 | 14 |
| Negation and Qualification | Misinterpreting negation/qualification | 8 | 27 | 25 |
| Co-reference Resolution | Misidentifying co-references | 16 | 7 | 18 |
Table 6 presents the average error type distribution for the SemEval 2013-DDI task across Bard, GPT-3.5, and GPT-4. The average proportions are calculated by aggregating error frequencies for each error type across all settings and then dividing by the total number of errors for each model. The most common error type for Bard is Wording Ambiguity, accounting for 32% of its errors, which may stem from the inherent complexity of clinical language or insufficient training data for specific drug relations. In contrast, GPT-3.5 struggles the most with Lack of Context, comprising 31% of its errors, suggesting the model's difficulty in grasping the broader context of the input text. GPT-4's top error is Negation and Qualification, making up 25% of its errors, possibly due to the model's limitations in understanding and processing negations and qualifications within the clinical domain. This analysis highlights the unique challenges each model faces in the relation extraction task, emphasizing the need for targeted interventions and tailored strategies to address these specific areas for improvement.
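The aggregation behind Table 6 amounts to a frequency tally over the manually assigned error labels; the label strings below are illustrative.

```python
from collections import Counter

def error_distribution(error_labels):
    """Proportion (%) of each manually assigned error type,
    rounded to whole percentage points as in Table 6."""
    counts = Counter(error_labels)
    total = len(error_labels)
    return {etype: round(100 * n / total) for etype, n in counts.items()}
```

For example, a model whose reviewed errors were tagged three times as “Wording Ambiguity” and once as “Lack of Context” would be reported as 75% and 25%, respectively.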

Figure 4: Error correction examples using self-questioning prompting (SQP) for Bard, GPT-3.5, and GPT-4 in the SemEval 2013-DDI dataset, compared to standard prompting (StP). Each example showcases the top error for each model and how SQP addresses these challenges. As this paper primarily focuses on the effectiveness of SQP, chain-of-thought prompting is not presented in these examples.
Specific examples presented in Figure 4 illustrate the challenges faced by each model and how self-questioning prompting (SQP) can effectively improve their performance. SQP demonstrates its flexibility and adaptability across various model architectures by mitigating distinct error types and refining predictions. Bard sees improvements in addressing Wording Ambiguity, GPT-3.5 benefits from enhanced context utilization, and GPT-4's understanding of negation is strengthened. These examples emphasize the significance of harnessing advanced prompting techniques like SQP to bolster model performance and reveal the multifaceted challenges faced by LLMs in relation extraction tasks, particularly within the clinical domain.
Our findings highlight the potential of advanced prompting techniques, such as self-questioning prompting, in addressing model-specific errors and enhancing overall performance. These insights can be extended to various clinical language understanding tasks, guiding future research to develop more robust, accurate, and reliable models capable of processing complex clinical information and improving patient care.
7. Discussion
In this study, we have conducted a comprehensive evaluation of state-of-the-art large language models in the healthcare domain, including GPT-3.5, GPT-4, and Bard. We have examined the capabilities and limitations of these leading large language models across various clinical language understanding tasks such as NER, relation extraction, and QA. Our findings suggest that while LLMs have made substantial progress in understanding clinical language and achieving competitive performance across these tasks, they still exhibit notable limitations and challenges. Some of these challenges include the varying confidence levels of their responses and the difficulty in determining the trustworthiness of their generated information without human validation. Consequently, our study emphasizes the importance of using LLMs with caution as a supplement to existing workflows rather than as a replacement for human expertise. To effectively implement LLMs, clinical practitioners should employ task-specific learning strategies and prompting techniques, such as SQP, carefully designing and selecting prompts that guide the model towards better understanding and generation of relevant responses. Collaboration with experts during the development and fine-tuning of LLMs is essential to ensure accurate capture of domain-specific knowledge and sensitivity to clinical language nuances. Additionally, clinicians should be aware of the limitations and potential biases in LLMs and ensure that a human expert verifies the information they produce. By adopting a cautious approach, healthcare professionals can harness the potential of LLMs responsibly and effectively, ultimately contributing to improved patient care.
Limitations. While this study presents meaningful observations and sheds light on the role of large language models in the healthcare domain, there are some limitations to our work. Our study focuses on a select group of state-of-the-art LLMs, which may limit the generalizability of our findings to other models or future iterations. The performance of the proposed SQP strategy may vary depending on the tasks, prompting setup, and input-output exemplars used, suggesting that further research into alternative prompting strategies or other techniques is warranted. Our evaluation is based on a set of clinical language understanding tasks and may not cover all possible use cases in the healthcare domain, necessitating further investigation into other tasks or subdomains. Lastly, ethical and legal considerations, such as patient privacy, data security, and potential biases, are not explicitly addressed in this study. Future work should explore these aspects to ensure the responsible and effective application of LLMs in healthcare settings.
References
Majid Afshar, Andrew Phillips, Niranjan Karnik, Jeanne Mueller, Daniel To, Richard Gonzalez, Ron Price, Richard Cooper, Cara Joyce, and Dmitriy Dligach. Natural language processing and machine learning to identify alcohol misuse from the electronic health record in trauma patients: development and internal validation. Journal of the American Medical Informatics Association, 26(3):254–261, 2019.
Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323, 2019.
Emily Alsentzer、John R Murphy、Willie Boag、Wei-Hung Weng、Di Jin、Tristan Naumann 和 Matthew McDermott。公开可用的临床 BERT 嵌入。arXiv 预印本 arXiv:1904.03323,2019。
Som Biswas. Chatgpt and the future of medical writing, 2023.
Som Biswas. ChatGPT与医学写作的未来,2023.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neel a kant an, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, 等. 大语言模型是少样本学习者. Advances in neural information processing systems, 33:1877–1901, 2020.
Lee Christensen, Peter Haug, and Marcelo Fiszman. Mplus: a probabilistic medical language understanding system. In Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain, pages 29–36, 2002.
Lee Christensen、Peter Haug 和 Marcelo Fiszman。Mplus:一种概率医学语言理解系统。载于《ACL-02生物医学领域自然语言处理研讨会论文集》,第29-36页,2002年。
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Jacob Devlin、Ming-Wei Chang、Kenton Lee 和 Kristina Toutanova。BERT:面向语言理解的深度双向Transformer预训练。arXiv预印本 arXiv:1810.04805,2018。
William Digan, Aurélie Névéol, Antoine Neuraz, Maxime Wack, David Baudoin, Anita Burgun, and Bastien Rance. Can reproducibility be improved in clinical natural language processing? a study of 7 clinical nlp suites. Journal of the American Medical Informatics Association, 28(3):504–515, 2021.
William Digan、Aurélie Névéol、Antoine Neuraz、Maxime Wack、David Baudoin、Anita Burgun 和 Bastien Rance。临床自然语言处理的复现性能提升吗?对7套临床NLP系统的研究。《美国医学信息学会杂志》,28(3):504–515,2021年。
Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong Lu. Ncbi disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47:1–10, 2014.
Rezarta Islamaj Dogan、Robert Leaman 和 Zhiyong Lu。NCBI疾病语料库:疾病名称识别与概念标准化的资源。《生物医学信息学杂志》,47:1-10,2014。
Jennifer Elias. Google is asking employees to test potential chatgpt competitors, including a chatbot called ‘apprentice bard’. CNBC. Archived from the original on February, 2: 2023, 2023.
Jennifer Elias. Google要求员工测试潜在的ChatGPT竞争对手,包括名为"学徒诗人"的聊天机器人. CNBC. 原始内容存档于2023年2月2日, 2023.
Sebastian Gehrmann, Franck Dern on court, Yeran Li, Eric T Carlson, Joy T Wu, Jonathan Welt, John Foote Jr, Edward T Moseley, David W Grant, Patrick D Tyler, et al. Comparing deep learning and concept extraction based methods for patient phe no typing from clinical narratives. PloS one, 13(2):e0192360, 2018.
Sebastian Gehrmann、Franck Dernon court、Yeran Li、Eric T Carlson、Joy T Wu、Jonathan Welt、John Foote Jr、Edward T Moseley、David W Grant、Patrick D Tyler等。基于深度学习与概念提取的临床叙事患者表型分析方法比较。《公共科学图书馆·综合》,13(2):e0192360,2018年。
Natalia Grabar, Cyril Grouin, et al. Year 2020 (with covid): Observation of scientific literature on clinical natural language processing. Yearbook of Medical Informatics, 30 (01):257–263, 2021.
Natalia Grabar, Cyril Grouin等. 2020年(含新冠疫情): 临床自然语言处理科学文献观察. 医学信息学年鉴, 30(01):257–263, 2021.
Hamed Has sanz a deh, Anthony Nguyen, Sarvnaz Karimi, and Kevin Chu. Transfer ability of artificial neural networks for clinical document classification across hospitals: a case study on abnormality detection from radiology reports. Journal of biomedical informatics, 85: 68–79, 2018.
Hamed Hassanzadeh, Anthony Nguyen, Sarvnaz Karimi, and Kevin Chu. 跨医院临床文档分类的人工神经网络迁移能力研究:基于放射学报告异常检测的案例分析. Journal of biomedical informatics, 85: 68–79, 2018.
Appendix A. Prompt Templates
The templates provided here are specific to each dataset used in our experiments. All models adhere uniformly to these templates.
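As a minimal sketch of how such templates can be instantiated programmatically (the `PROMPTS` dictionary and `build_prompt` helper are our own illustration, not part of the released code), each template is a plain string whose named placeholders are filled from one dataset example:

```python
# Hypothetical helper for filling the prompt templates in this appendix.
# Each template is a plain string with named placeholders such as {query},
# {text}, or {context} that are filled from one dataset example.

PROMPTS = {
    # Abbreviated from the A.3 standard-prompting template.
    "bioasq_standard": (
        "Answer the following factoid question and provide up to 5 "
        "candidates, ordered by decreasing confidence. "
        "Question: {query}. Candidate answers:"
    ),
}

def build_prompt(task: str, **fields: str) -> str:
    """Fill the named placeholders of the selected task template."""
    return PROMPTS[task].format(**fields)

prompt = build_prompt("bioasq_standard",
                      query="Which gene is mutated in cystic fibrosis?")
print(prompt)
```

The same helper extends to every template below by adding one entry per dataset and strategy.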
A.1. Natural Language Inference - MedNLI
A.2. Semantic Textual Similarity - BIOSSES
A.3. Factoid Question Answering – BioASQ 10b
• Standard prompting: Answer the following factoid question and provide up to 5 candidates, ordered by decreasing confidence. Question: {query}. Candidate answers:
• Chain-of-thought prompting: Given the question {query}, identify keywords, search for relevant information, list up to 5 candidate answers, rank them by confidence, and present an ordered list.
• Self-questioning prompting: Given the question {query}, generate a related question that will help answer the question more accurately. Answer the generated question and use it to list up to 5 candidate answers that are most likely to be correct, ordered by decreasing confidence.
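The three strategies above differ only in the instruction text wrapped around the same question slot. A sketch (dictionary keys are ours) that instantiates all three for one query:

```python
# The three A.3 prompting strategies share a single {query} placeholder;
# only the surrounding instruction text changes.

BIOASQ_TEMPLATES = {
    "standard": (
        "Answer the following factoid question and provide up to 5 "
        "candidates, ordered by decreasing confidence. "
        "Question: {query}. Candidate answers:"
    ),
    "chain_of_thought": (
        "Given the question {query}, identify keywords, search for "
        "relevant information, list up to 5 candidate answers, rank "
        "them by confidence, and present an ordered list."
    ),
    "self_questioning": (
        "Given the question {query}, generate a related question that "
        "will help answer the question more accurately. Answer the "
        "generated question and use it to list up to 5 candidate "
        "answers that are most likely to be correct, ordered by "
        "decreasing confidence."
    ),
}

def bioasq_prompts(query: str) -> dict:
    """Instantiate all three strategies for one BioASQ question."""
    return {name: t.format(query=query)
            for name, t in BIOASQ_TEMPLATES.items()}
```

Running all three variants on the same question is how the per-strategy comparisons in the main text are obtained.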
A.4. Named Entity Recognition – NCBI (Disease)
• Standard prompting: Perform clinical Named Entity Recognition on the provided PubMed text {text} for disease name recognition and concept normalization. Your output should be in two columns, with the token in the first column and the category in the second column, separated by empty space. Tokenize each word, phrase, symbol, and punctuation mark as a separate token. Categorize each token as “B-Disease”, “I-Disease”, or “O”. Note that hyphenated words or phrases should be treated as separate tokens. The order of output should follow the original text order.
• Chain-of-thought prompting: Read and understand the following PubMed text: {text}. Identify all disease names mentioned in the text. Normalize the identified disease names to their corresponding standardized concepts. Your output should be in two columns, with the token in the first column and the category in the second column, separated by empty space. Tokenize each word, phrase, symbol, and punctuation mark as a separate token. Categorize each token as “B-Disease”, “I-Disease”, or “O”. Note that hyphenated words or phrases should be treated as separate tokens. The order of output should follow the original text order.
• Self-questioning prompting: Given the provided PubMed text {text}, identify disease names and concepts by asking questions such as “What diseases are mentioned in the text?” and “Which medical terms refer to diseases?” Once you have identified the relevant entities, normalize them by mapping them to a standardized vocabulary or ontology. Your output should be in two columns, with the token in the first column and the category in the second column, separated by empty space. Tokenize each word, phrase, symbol, and punctuation mark as a separate token. Categorize each token as “B-Disease”, “I-Disease”, or “O”. Note that hyphenated words or phrases should be treated as separate tokens. The order of output should follow the original text order.
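Since all three NER templates request the same two-column output, the model's completion must be parsed back into (token, tag) pairs before scoring. A minimal sketch (the parser and its skip-malformed-lines policy are ours):

```python
# Parse the two-column "token TAG" output requested by the NER templates.
# Lines that do not split into exactly two fields, or that carry an
# unexpected tag, are skipped rather than raising an error, since LLM
# output is not guaranteed to follow the format exactly.

VALID_TAGS = {"B-Disease", "I-Disease", "O"}

def parse_two_column(output: str) -> list[tuple[str, str]]:
    pairs = []
    for line in output.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # malformed line: wrong number of columns
        token, tag = parts
        if tag in VALID_TAGS:
            pairs.append((token, tag))
    return pairs

example = "Ataxia B-Disease\n- I-Disease\ntelangiectasia I-Disease\nis O"
print(parse_two_column(example))
# [('Ataxia', 'B-Disease'), ('-', 'I-Disease'),
#  ('telangiectasia', 'I-Disease'), ('is', 'O')]
```

Note how the hyphen in “Ataxia-telangiectasia” appears as its own token, matching the template's instruction that hyphenated words be split.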
A.5. Named Entity Recognition – BC5CDR (Chemical)
• Standard prompting: Perform Named Entity Recognition on the provided PubMed text {text} for chemical entity recognition. Your output should be in two columns, with the token in the first column and the category in the second column, separated by empty space. Tokenize each word, phrase, symbol, and punctuation mark as a separate token. Categorize each token as “B-Chemical”, “I-Chemical”, or “O”. Note that hyphenated words or phrases should be treated as separate tokens. The order of output should follow the original text order.
• Chain-of-thought prompting: Read and understand the following PubMed text: {text}. Identify all chemical entities. Your output should be in two columns, with the token in the first column and the category in the second column, separated by empty space. Tokenize each word, phrase, symbol, and punctuation mark as a separate token. Categorize each token as “B-Chemical”, “I-Chemical”, or “O”. Note that hyphenated words or phrases should be treated as separate tokens. The order of output should follow the original text order.
• Self-questioning prompting: Given the provided PubMed text {text}, identify chemical entities by asking questions such as, “What chemicals are mentioned in the text?” Your output should be in two columns, with the token in the first column and the category in the second column, separated by empty space. Tokenize each word, phrase, symbol, and punctuation mark as a separate token. Categorize each token as “B-Chemical”, “I-Chemical”, or “O”. Note that hyphenated words or phrases should be treated as separate tokens. The order of the output should follow the original text order.
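To recover chemical mentions from the predicted tags, contiguous B-Chemical/I-Chemical runs can be merged into spans. A sketch under the assumption that (token, tag) pairs have already been parsed from the two-column output:

```python
# Merge B-Chemical/I-Chemical runs into chemical mention strings.
# An I-Chemical tag with no open mention is treated as O (a common
# repair for slightly malformed BIO sequences from an LLM).

def bio_to_mentions(pairs):
    mentions, current = [], []
    for token, tag in pairs:
        if tag == "B-Chemical":
            if current:                      # close the previous mention
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-Chemical" and current:
            current.append(token)            # continue the open mention
        else:
            if current:                      # O tag (or stray I-) closes it
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

pairs = [("Low", "O"), ("-", "O"), ("dose", "O"),
         ("methotrexate", "B-Chemical"), ("and", "O"),
         ("folic", "B-Chemical"), ("acid", "I-Chemical")]
print(bio_to_mentions(pairs))  # ['methotrexate', 'folic acid']
```

Entity-level precision, recall, and F1 can then be computed by comparing these spans against the gold BC5CDR annotations.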
A.6. Relation Extraction – i2b2 2010
• Standard prompting: Given the context sentence {context}, identify the relationship between {concept_1} and {concept_2} within the sentence, and specify which category it falls under: Treatment Improves Medical Problem (TrIP), Treatment Worsens Medical Problem (TrWP), Treatment Causes Medical Problem (TrCP), Treatment is Administered for Medical Problem (TrAP), Treatment is Not Administered Because of Medical Problem (TrNAP), Test Reveals Medical Problem (TeRP), Test Conducted to Investigate Medical Problem (TeCP), or Medical Problem Indicates Medical Problem (PIP). The relationship between {concept_1} and {concept_2} is {}.
• Chain-of-thought prompting: Given the context sentence {context}, identify the relationship between {concept_1} and {concept_2} within the sentence. Determine if the sentence discusses a treatment, test, or medical problem. For treatments, categorize the relationship as TrIP (improving), TrWP (worsening), TrCP (causing), TrAP (administering), or TrNAP (not administering) based on its impact on the medical problem. For tests, categorize as TeRP (revealing) or TeCP (investigating) based on the test’s purpose. For medical problems, if one problem indicates another, categorize it as PIP. The relationship between {concept_1} and {concept_2} is {}.
• Self-questioning prompting: Given the context sentence {context}, identify the relationship between {concept_1} and {concept_2} within the sentence. Generate questions to explore the nature of their relationship, such as whether it involves treatments improving (TrIP), worsening (TrWP), causing (TrCP), being administered for (TrAP), or not being administered due to (TrNAP) a medical problem; tests revealing (TeRP) or investigating (TeCP) a medical problem; or one medical problem indicating another (PIP). Answer the questions and use the insights to categorize the relationship between {concept_1} and {concept_2} as {}.
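Because the templates end with an open slot, the model answers in free text; one hedged way to map a completion back to a label is to scan it for the first of the eight category codes (this normalization heuristic is ours, not from the paper):

```python
import re

# The eight i2b2 2010 relation codes, as defined in the templates above.
I2B2_LABELS = ["TrIP", "TrWP", "TrCP", "TrAP",
               "TrNAP", "TeRP", "TeCP", "PIP"]

def extract_i2b2_label(completion):
    """Return the first i2b2 relation code mentioned in the completion,
    or None if the completion names no code."""
    for label in I2B2_LABELS:
        # Word boundaries keep e.g. "TrAP" from matching inside other words.
        if re.search(rf"\b{label}\b", completion):
            return label
    return None

print(extract_i2b2_label(
    "The relationship is TeRP, since the test reveals the problem."
))  # TeRP
```

Completions that paraphrase a category without naming its code would still need manual adjudication under this heuristic.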
A.7. Relation Extraction – DDI
• Standard prompting: Given the context sentence {context}, identify the relationship between the pharmacological substances {e_1} and {e_2} within the sentence. Specify which category the relationship falls under: Advice, Effect, Mechanism, or Int. The relationship between {e_1} and {e_2} is categorized as {}.
• Chain-of-thought prompting: Given the context sentence {context}, identify the relationship between the pharmacological substances {e_1} and {e_2} within the sentence. Analyze their interaction and classify the relationship under one of these categories: Advice, Effect, Mechanism, or Int. Consider whether the sentence advises on their use, describes their combined or counteracting effects, explains the interaction mechanism, or indicates an interaction with insufficient details. The relationship between {e_1} and {e_2} is categorized as {}.
• Self-questioning prompting: Given the context sentence {context}, generate questions about the relationship between the pharmacological substances {e_1} and {e_2} within the sentence, covering their usage advice, combined or counteracting effects, and interaction mechanisms. Answer each question and then categorize the relationship between {e_1} and {e_2} as Advice, Effect, Mechanism, or Int, based on the insights gained from the questions. The relationship between {e_1} and {e_2} is categorized as {}.
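DDI completions can be post-processed similarly. A sketch (the heuristic is ours) that first looks for the category following “categorized as”, matching the template's trailing slot, and falls back to a whole-word scan so that “Int” does not fire inside words like “interaction”:

```python
import re

# Canonical casing for the four DDI categories used in the templates.
DDI_LABELS = {"advice": "Advice", "effect": "Effect",
              "mechanism": "Mechanism", "int": "Int"}

def extract_ddi_label(completion):
    """Map a free-text DDI completion to one of the four categories,
    or None if no category is named."""
    # Prefer the word that fills the template's "is categorized as {}" slot.
    m = re.search(r"categorized as\W*(advice|effect|mechanism|int)\b",
                  completion, flags=re.IGNORECASE)
    if m:
        return DDI_LABELS[m.group(1).lower()]
    # Fallback: first category named anywhere, on word boundaries.
    for key, label in DDI_LABELS.items():
        if re.search(rf"\b{key}\b", completion, flags=re.IGNORECASE):
            return label
    return None

print(extract_ddi_label(
    "The relationship between e_1 and e_2 is categorized as Mechanism."
))  # Mechanism
```

Anchoring on the template's own closing phrase makes the heuristic less sensitive to category words mentioned earlier in the model's reasoning.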
