Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding
Yuqing Wang Computer Science Department University of California, Santa Barbara
Yun Zhao Meta Platforms, Inc.
Linda Petzold Computer Science Department University of California, Santa Barbara
Abstract
Large language models (LLMs) have made significant progress in various domains, including healthcare. However, the specialized nature of clinical language understanding tasks presents unique challenges and limitations that warrant further investigation. In this study, we conduct a comprehensive evaluation of state-of-the-art LLMs, namely GPT-3.5, GPT-4, and Bard, within the realm of clinical language understanding tasks. These tasks span a diverse range, including named entity recognition, relation extraction, natural language inference, semantic textual similarity, document classification, and question-answering. We also introduce a novel prompting strategy, self-questioning prompting (SQP), tailored to enhance the performance of LLMs by eliciting informative questions and answers pertinent to the clinical scenarios at hand. Our evaluation highlights the importance of employing task-specific learning strategies and prompting techniques, such as SQP, to maximize the effectiveness of LLMs in healthcare-related tasks. Our study emphasizes the need for cautious implementation of LLMs in healthcare settings, ensuring a collaborative approach with domain experts and continuous verification by human experts to achieve responsible and effective use, ultimately contributing to improved patient care. Our code is available at https://github.com/EternityYW/LLM_healthcare.
1. Introduction
Recent advancements in clinical language understanding hold the potential to revolutionize healthcare by facilitating the development of intelligent systems that support decision-making (Lederman et al., 2022; Zuheros et al., 2021), expedite diagnostics (Wang and Lin, 2022; Wang et al., 2022b), and improve patient care (Christensen et al., 2002). Such systems could assist healthcare professionals in managing the ever-growing body of medical literature, interpreting complex patient records, and developing personalized treatment plans (Pivovarov and Elhadad, 2015; Zeng et al., 2021). State-of-the-art large language models (LLMs) like OpenAI’s GPT-3.5 and GPT-4 (OpenAI, 2023), and Google
AI’s Bard (Elias, 2023), have gained significant attention for their remarkable performance across diverse natural language understanding tasks, such as sentiment analysis, machine translation, text summarization, and question-answering (Zhong et al., 2023; Jiao et al., 2023; Wang et al., 2023). However, a comprehensive evaluation of their effectiveness in the specialized healthcare domain, with its unique challenges and complexities, remains necessary.
The healthcare domain presents distinct challenges, including handling specialized medical terminology, managing the ambiguity and variability of clinical language, and meeting the high demands for reliability and accuracy in critical tasks. Although existing research has explored the application of LLMs in healthcare, the focus has typically been on a limited set of tasks or learning strategies. For example, studies have investigated tasks like medical concept extraction, patient cohort identification, and drug-drug interaction prediction, primarily relying on supervised learning approaches (Vilar et al., 2018; Gehrmann et al., 2018; Afshar et al., 2019). In this study, we broaden this scope by evaluating LLMs on various clinical language understanding tasks, including natural language inference (NLI), document classification, semantic textual similarity (STS), question-answering (QA), named entity recognition (NER), and relation extraction.
Furthermore, the exploration of learning strategies such as few-shot learning, transfer learning, and unsupervised learning in the healthcare domain has been relatively limited. Similarly, the impact of diverse prompting techniques on improving model performance in clinical tasks has not been extensively examined, leaving room for a comprehensive comparative study.
In this study, we aim to bridge this gap by evaluating the performance of state-of-the-art LLMs on a range of clinical language understanding tasks. LLMs offer the exciting prospect of in-context few-shot learning via prompting, enabling task completion without fine-tuning separate language model checkpoints for each new challenge. In this context, we propose a novel prompting strategy called self-questioning prompting (SQP) to enhance these models’ effectiveness across various tasks. Our empirical evaluations demonstrate the potential of SQP as a promising technique for improving LLMs in the healthcare domain. Furthermore, by pinpointing tasks where the models excel and those where they struggle, we highlight the need for addressing specific challenges such as wording ambiguity, lack of context, and negation handling, while emphasizing the importance of responsible LLM implementation and collaboration with domain experts in healthcare settings.
In summary, our contributions are threefold:
- A comprehensive evaluation of three state-of-the-art LLMs (GPT-3.5, GPT-4, and Bard) across six clinical language understanding tasks, comparing standard, chain-of-thought, and self-questioning prompting under zero-shot and 5-shot settings.
- A novel prompting strategy, self-questioning prompting (SQP), tailored to clinical scenarios, which achieves the best performance on the majority of tasks.
- An in-depth error analysis of the most challenging shared task, uncovering model-specific difficulties such as wording ambiguity, lack of context, and negation, emphasizing the need for a cautious approach when employing LLMs in healthcare as a supplement to human expertise.
Generalizable Insights about Machine Learning in the Context of Healthcare
Our study presents a comprehensive evaluation of state-of-the-art LLMs in the healthcare domain, examining their capabilities and limitations across a variety of clinical language understanding tasks. We develop and demonstrate the efficacy of our self-questioning prompting (SQP) strategy, which involves generating context-specific questions and answers to guide the model towards a better understanding of clinical scenarios. This tailored learning approach significantly enhances LLM performance in healthcare-focused tasks. Our in-depth error analysis on the most challenging task shared by all models uncovers unique difficulties encountered by each model, such as wording ambiguity, lack of context, and negation issues. These findings emphasize the need for a cautious approach when implementing LLMs in healthcare as a complement to human expertise. We underscore the importance of integrating domain-specific knowledge, fostering collaborations among researchers, practitioners, and domain experts, and employing task-oriented prompting techniques like SQP. By addressing these challenges and harnessing the potential benefits of LLMs, we can contribute to improved patient care and clinical decision-making in healthcare settings.
2. Related Work
In this section, we review the relevant literature on large language models applied to clinical language understanding tasks in healthcare, as well as existing prompting strategies.
2.1. Large Language Models in Healthcare
The advent of the Transformer architecture (Vaswani et al., 2017) revolutionized the field of natural language processing, paving the way for the development of large-scale pre-trained language models such as base BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019). In the healthcare domain, domain-specific adaptations of BERT, such as BioBERT (Lee et al., 2020) and ClinicalBERT (Alsentzer et al., 2019), have been introduced to tackle various clinical language understanding tasks. More recently, GPT-3.5 and its successor GPT-4, launched by OpenAI (OpenAI, 2023), as well as Bard, developed by Google AI (Elias, 2023), have emerged as state-of-the-art LLMs, showcasing impressive capabilities in a wide range of applications, including healthcare (Biswas, 2023; Kung et al., 2023; Patel and Lam, 2023; Singhal et al., 2023).
Clinical language understanding is a critical aspect of healthcare informatics, focused on extracting meaningful information from diverse sources, such as electronic health records (Juhn and Liu, 2020), scientific articles (Grabar et al., 2021), and patient-authored text data (Mukhiya et al., 2020). This domain encompasses various tasks, including NER (Nayel and Shashirekha, 2017), relation extraction (Lv et al., 2016), NLI (Romanov and Shivade, 2018), STS (Wang et al., 2020), document classification (Hassanzadeh et al., 2018), and QA (Soni and Roberts, 2020). Prior work has demonstrated the effectiveness of domain-specific models in achieving improved performance on these tasks compared to general-purpose counterparts (Peng et al., 2019; Mascio et al., 2020; Digan et al., 2021). However, challenges posed by complex medical terminologies, the need for precise inference, and the reliance on domain-specific knowledge can limit their effectiveness (Shen et al., 2023). In this work, we address some of these limitations by conducting a comprehensive evaluation of state-of-the-art LLMs on a diverse set of clinical language understanding tasks, focusing on their performance and applicability within healthcare settings.
2.2. Prompting Strategies
Prompting strategies, often used in conjunction with few-shot or zero-shot learning (Brown et al., 2020; Kojima et al., 2022), guide and refine the behavior of LLMs to improve performance on various tasks. In these learning paradigms, LLMs are conditioned on a limited number of examples in the form of prompts, enabling them to generalize and perform well on the target task. Standard prompting techniques (Brown et al., 2020) involve providing an LLM with a clear and concise prompt, often in the form of a question or statement, which directs the model towards the desired output. Another approach, known as chain-of-thought prompting (Wei et al., 2022; Kojima et al., 2022), leverages a series of interconnected prompts to generate complex reasoning or multi-step outputs. While these existing prompting strategies have shown considerable success, their effectiveness can be limited by the quality and informativeness of the prompts (Wang et al., 2022a), which may not always capture the intricate nuances of specialized domains like healthcare. Motivated by these limitations, we propose a novel prompting strategy called self-questioning prompting (SQP). SQP aims to enhance the performance of LLMs by generating informative questions and answers related to the given clinical scenarios, thus addressing the unique challenges of the healthcare domain and contributing to improved task-specific performance.
3. Self-Questioning Prompting
Complex problems can be daunting, but they can often be solved by breaking them down into smaller parts and asking questions to clarify understanding and explore different aspects. Inspired by this human-like reasoning process, we introduce a novel method called self-questioning prompting (SQP) for LLMs. SQP aims to enhance model performance by encouraging models to be more aware of their own thinking processes, enabling them to better understand relevant concepts and develop deeper comprehension. This is achieved through the generation of targeted questions and answers that provide additional context and clarification, ultimately leading to improved performance on various tasks. The general construction process of SQP for a task, as shown in Figure 1, involves identifying key information in the input text, generating targeted questions to clarify understanding, using the questions and answers to enrich the context of the task prompt, and tailoring the strategy to meet the unique output requirements of each task. For a better understanding of the general construction procedure, consider the following example prompt for the NLI task (a minimal code sketch of this assembly is given after the list):
- Key Information: With two clinical sentences, {sentence_1} and {sentence_2}, the model is asked to “Generate questions about the medical situations described”. This prompt guides the model to identify important elements.
- Question Generation: Following the first prompt, the model creates questions about the identified details, solidifying its grasp on the context.
- Enriching Context: The model is then prompted to “Answer these questions using basic medical knowledge and use the insights to evaluate their relationship”. This step deepens the model’s understanding.
- Task-Specific Strategy: Lastly, the model follows the prompt to “Categorize the relationship between {sentence_1} and {sentence_2} as entailment if {sentence_2} logically follows {sentence_1}, contradiction if they oppose each other, or neutrality if unrelated”. This directly links the task requirements with the model’s understanding.
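To make the construction procedure concrete, the short Python sketch below assembles the four steps into a single SQP-style prompt for NLI. The wording paraphrases the steps above rather than copying the exact SQP templates in Figure 2 and Appendix A, and the two example sentences are hypothetical, so treat this as an illustrative approximation.

```python
# Illustrative assembly of a self-questioning prompt (SQP) for NLI, following the
# four construction steps above. The wording paraphrases those steps; the exact
# templates used in the study are shown in Figure 2 and Appendix A.

def build_sqp_nli_prompt(sentence_1: str, sentence_2: str) -> str:
    steps = [
        # 1. Key information: present the two clinical sentences.
        f"Sentence 1: {sentence_1}\nSentence 2: {sentence_2}",
        # 2. Question generation: have the model interrogate the scenario.
        "Generate questions about the medical situations described in the two sentences.",
        # 3. Enriching context: answer the questions to deepen understanding.
        "Answer these questions using basic medical knowledge and use the insights "
        "to evaluate the relationship between the two sentences.",
        # 4. Task-specific output: map the enriched understanding onto NLI labels.
        "Categorize the relationship between Sentence 1 and Sentence 2 as entailment "
        "if Sentence 2 logically follows from Sentence 1, contradiction if they oppose "
        "each other, or neutral if they are unrelated.",
    ]
    return "\n\n".join(steps)


# Hypothetical example sentences, used only to show the assembled prompt.
print(build_sqp_nli_prompt(
    "The patient denies any chest pain.",
    "The patient is experiencing chest discomfort.",
))
```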
Figure 1: Construction process of self-questioning prompting (SQP).
In Table 1, we compare the proposed SQP with existing prompting methods, including standard prompting and chain-of-thought prompting, highlighting the differences in guidelines and purposes for each strategy. Subsequently, we present the SQP templates for six clinical language understanding tasks. The core self-questioning process is highlighted in each template, as shown in Figure 2. The SQP templates were developed through a combination of consultations with healthcare professionals and iterative testing. We evaluated multiple prompt candidates, with the best-performing templates chosen for use in the study. In the case of few-shot examples, the SQP question-answer pairs were annotated by healthcare professionals for model input. The underscored and bolded parts in Figure 2 illustrate how SQP generates targeted questions and answers related to the tasks, which guide the model’s reasoning, leading to improved task performance. By incorporating this self-questioning process into the prompts, SQP enables the model to utilize its knowledge more effectively and adapt to a wide range of clinical tasks.
4. Datasets
We utilize a wide range of biomedical and clinical language understanding datasets for our experiments. These datasets encompass various tasks, including NER (NCBI-Disease (Dogan et al., 2014) and BC5CDR-Chemical (Li et al., 2016)), relation extraction (i2b2 2010-Relation (Uzuner et al., 2011) and SemEval 2013-DDI (Segura-Bedmar et al., 2013)), STS (BIOSSES (Sogancioglu et al., 2017)), NLI (MedNLI (Romanov and Shivade, 2018)), document classification (i2b2 2006-Smoking (Uzuner et al., 2006)), and QA (bioASQ 10b-Factoid (Tsatsaronis et al., 2015)). Among these tasks, STS (BIOSSES) is a regression task, while the rest are classification tasks. Table 2 offers a comprehensive overview of the tasks and datasets. For NER tasks, we adopt the BIO tagging scheme, where ‘B’ represents the beginning of an entity, ‘I’ signifies the continuation of an entity, and ‘O’ denotes the absence of an entity.
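As a concrete illustration of the BIO scheme, the sketch below tags a short, invented sentence in the style of the chemical NER task; the tokens and labels are hypothetical and not drawn from NCBI-Disease or BC5CDR.

```python
# Minimal illustration of the BIO tagging scheme used for the NER tasks.
# The tokens and labels below are invented, not taken from NCBI-Disease or BC5CDR.
tokens = ["Patients", "received", "acetylsalicylic", "acid", "daily", "."]
labels = ["O", "O", "B-Chemical", "I-Chemical", "O", "O"]

for token, label in zip(tokens, labels):
    # 'B-' marks the first token of an entity, 'I-' a continuation, 'O' a non-entity token.
    print(f"{token:<16}{label}")
```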
Figure 2: Self-questioning prompting (SQP) templates for six clinical language understanding tasks, with the core self-questioning process underscored and bolded. These components represent the generation of targeted questions and answers, guiding the model’s reasoning and enhancing task performance.
Table 1: Comparison among standard prompting, chain-of-thought prompting, and self-questioning prompting.

Prompting Strategy | Guideline | Purpose |
---|---|---|
Standard | Use a direct, concise prompt for the desired task. | To obtain a direct response from the model. |
Chain-of-Thought | Create interconnected prompts guiding the model through logical reasoning. | To engage the model's reasoning by breaking down complex tasks. |
Self-Questioning | Generate targeted questions and use answers to guide the task response. | To deepen the model's understanding and enhance performance. |
The Output column in Table 2 presents the specific classes, scores, or tagging schemes associated with each task.
For relation extraction, SemEval 2013-DDI requires identifying one of the following labels: Advice, Effect, Mechanism, or Int. In the case of i2b2 2010-Relation, it necessitates predicting relationships such as Treatment Improves Medical Problem (TrIP), Treatment Worsens Medical Problem (TrWP), Treatment Causes Medical Problem (TrCP), Treatment is Administered for Medical Problem (TrAP), Treatment is Not Administered because of Medical Problem (TrNAP), Test Reveals Medical Problem (TeRP), Test Conducted to Investigate Medical Problem (TeCP), or Medical Problem Indicates Medical Problem (PIP).
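For reference, the label inventories described above can be kept in a simple lookup structure; the sketch below merely restates the abbreviations and their meanings as given in the text.

```python
# Convenience lookup of the relation labels described above (abbreviation -> meaning),
# restated directly from the text for reference.
I2B2_2010_RELATIONS = {
    "TrIP": "Treatment Improves Medical Problem",
    "TrWP": "Treatment Worsens Medical Problem",
    "TrCP": "Treatment Causes Medical Problem",
    "TrAP": "Treatment is Administered for Medical Problem",
    "TrNAP": "Treatment is Not Administered because of Medical Problem",
    "TeRP": "Test Reveals Medical Problem",
    "TeCP": "Test Conducted to Investigate Medical Problem",
    "PIP": "Medical Problem Indicates Medical Problem",
}

# SemEval 2013-DDI uses four interaction labels.
SEMEVAL_2013_DDI_LABELS = ["Advice", "Effect", "Mechanism", "Int"]
```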
Table 2: Overview of biomedical/clinical language understanding tasks and datasets.
Task | Dataset | Output | Metric |
---|---|---|---|
Named Entity Recognition | NCBI-Disease, BC5CDR-Chemical | BIO tagging for diseases and chemicals | Micro F1 |
Relation Extraction | i2b2 2010-Relation, SemEval 2013-DDI | relations between entities | Micro F1, Macro F1 |
Semantic Textual Similarity | BIOSSES | similarity scores from 0 (different) to 4 (identical) | Pearson Correlation |
Natural Language Inference | MedNLI | entailment, neutral, contradiction | Accuracy |
Document Classification | i2b2 2006-Smoking | current smoker, past smoker, smoker, non-smoker, unknown | Micro F1 |
Question-Answering | bioASQ 10b-Factoid | factoid answers | Mean Reciprocal Rank, Lenient Accuracy |
5. Experiments
In this section, we outline the experimental setup and evaluation procedure used to assess the performance of various LLMs on tasks related to biomedical and clinical text comprehension and analysis.
5.1. Experimental Setup
We investigate various prompting strategies for state-of-the-art LLMs, employing N-shot learning techniques on diverse clinical language understanding tasks.
Large Language Models. We assess the performance of three state-of-the-art LLMs, each offering unique capabilities and strengths. First, we examine GPT-3.5, an advanced model developed by OpenAI, known for its remarkable language understanding and generation capabilities. Next, we investigate GPT-4, an even more powerful successor to GPT-3.5, designed to push the boundaries of natural language processing further. Finally, we explore Bard, an innovative language model launched by Google AI. We experiment with these models through their web versions. By comparing these models, we aim to gain insights into their performance on clinical language understanding tasks.
Prompting Strategies. We employ three prompting strategies to optimize the performance of LLMs on each task: standard prompting, chain-of-thought prompting, and our proposed self-questioning prompting. Standard prompting serves as the baseline, while chain-of-thought and self-questioning prompting techniques are investigated to assess their potential impact on model performance. The full set of prompting templates used for each task is given in Appendix A.
N-Shot Learning. We explore N-shot learning for LLMs, focusing on zero-shot and 5-shot learning scenarios. Zero-shot learning refers to the situation where the model has not been exposed to any labeled examples during training and is expected to generalize to the task without prior knowledge. In contrast, 5-shot learning involves the model receiving a small amount of labeled data, consisting of five few-shot exemplars from the training set, to facilitate its adaptation to the task. We evaluate the model’s performance in both zero-shot and 5-shot learning settings to understand its ability to generalize and adapt to different tasks in biomedical and clinical domains.
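To illustrate the difference between the two settings, the sketch below shows one generic way of assembling a zero-shot prompt versus a 5-shot prompt by prepending labeled exemplars to the task instruction. The "Input:/Answer:" formatting and the smoking-status example are assumptions for illustration only; the study's actual templates are listed in Appendix A.

```python
# Generic sketch of zero-shot vs. 5-shot prompt assembly: few-shot exemplars are
# prepended to the task instruction before the query. The formatting here is an
# assumption for illustration, not the study's exact template.
from typing import Sequence, Tuple

def build_prompt(task_instruction: str,
                 query: str,
                 exemplars: Sequence[Tuple[str, str]] = ()) -> str:
    parts = [task_instruction]
    for text, label in exemplars:  # empty for zero-shot, five pairs for 5-shot
        parts.append(f"Input: {text}\nAnswer: {label}")
    parts.append(f"Input: {query}\nAnswer:")
    return "\n\n".join(parts)

instruction = "Classify the smoking status of the patient record."
record = "The patient quit smoking ten years ago."

zero_shot_prompt = build_prompt(instruction, record)
five_shot_prompt = build_prompt(
    instruction, record,
    exemplars=[("Smokes one pack per day.", "current smoker")],  # one of five exemplars shown
)
```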
5.2. Evaluation Procedure
To assess the performance for each task, given the constraints of model release timings and web version utilization, we form an evaluation set by randomly selecting 50% of instances from the original test set. In the case of zero-shot learning, we directly evaluate the model’s performance on this evaluation set. For 5-shot learning, we provide the model with five few-shot exemplars, which are randomly chosen from the training set. The model’s performance is then assessed using the same evaluation set as in the zero-shot learning scenario.
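A minimal sketch of this sampling procedure is given below, assuming the test and training splits are available as Python lists; the fixed seed is included only to make the illustration reproducible and is not described in the paper.

```python
# Sketch of the sampling procedure described above. Assumes `test_set` and `train_set`
# are lists of task instances; the seed is for reproducibility of this illustration only.
import random

def sample_evaluation_data(test_set, train_set, seed=0):
    rng = random.Random(seed)
    # Randomly select 50% of the original test instances as the evaluation set.
    eval_set = rng.sample(test_set, k=len(test_set) // 2)
    # For 5-shot learning, draw five exemplars from the training set.
    exemplars = rng.sample(train_set, k=5)
    return eval_set, exemplars
```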
6. Results
In this section, we present a comprehensive analysis of the performance of the LLMs (i.e., Bard, GPT-3.5, and GPT-4) on clinical language understanding tasks. We begin by comparing the overall performance of these models, followed by an examination of the effectiveness of various prompting strategies. Next, we delve into a detailed task-by-task analysis, providing insights into the models’ strengths and weaknesses across different tasks. Finally, we conduct a case study on error analysis, investigating common error types and the potential improvements brought about by advanced prompting techniques.
6.1. Overall Performance Comparison
In our study, we evaluate the performance of Bard, GPT-3.5, and GPT-4 on various clinical benchmark datasets spanning multiple tasks. We employ different prompting strategies, including standard, chain-of-thought, and self-questioning, as well as N-shot learning with N equal to 0 and 5. Table 3 summarizes the experimental results.
We observe that GPT-4 generally outperforms Bard and GPT-3.5 in tasks involving the identification and classification of specific information within text, such as NLI (MedNLI), NER (NCBI-Disease, BC5CDR-Chemical), and STS (BIOSSES). In the realm of document classification, a task that involves assigning predefined categories to entire documents, GPT-4 also surpasses GPT-3.5 and Bard on the i2b2 2006-Smoking dataset. In relation extraction, GPT-4 outperforms both Bard and GPT-3.5 on the SemEval 2013-DDI dataset, while Bard
Table 3: Performance comparison of Bard, GPT-3.5, and GPT-4 with different prompting strategies (standard, chain-of-thought, and self-questioning) and N-shot learning (N = 0, 5) on clinical benchmark datasets, using randomly sampled evaluation data from the test set. Our results show that GPT-4 outperforms Bard and GPT-3.5 in tasks that involve identification and classification of specific information within text, while Bard achieves higher accuracy than GPT-3.5 and GPT-4 on tasks that require a more factual understanding of the text. Additionally, self-questioning prompting consistently achieves the best performance on the majority of tasks. The best results for each dataset are highlighted in bold.
Model | NCBI-Disease (Micro F1) | BC5CDR-Chemical (Micro F1) | i2b2 2010-Relation (Micro F1) | SemEval 2013-DDI (Macro F1) | BIOSSES (Pear.) | MedNLI (Acc.) | i2b2 2006-Smoking (Micro F1) | BioASQ 10b-Factoid (MRR) | BioASQ 10b-Factoid (Len. Acc.) |
---|---|---|---|---|---|---|---|---|---|
**Bard** | | | | | | | | | |
w/ zero-shot StP | 0.911 | 0.947 | 0.720 | 0.490 | 0.401 | 0.580 | 0.780 | 0.800 | 0.820 |
w/ 5-shot StP | 0.933 | 0.972 | 0.900 | 0.528 | 0.449 | 0.640 | 0.820 | 0.845 | 0.880 |
w/ zero-shot CoTP | 0.946 | 0.972 | 0.660 | 0.525 | 0.565 | 0.580 | 0.760 | **0.887** | **0.920** |
w/ 5-shot CoTP | 0.955 | 0.977 | 0.900 | 0.709 | 0.602 | 0.720 | 0.800 | 0.880 | 0.900 |
w/ zero-shot SQP | 0.956 | 0.977 | 0.760 | 0.566 | 0.576 | 0.760 | 0.760 | 0.850 | 0.860 |
w/ 5-shot SQP | 0.960 | 0.983 | **0.940** | 0.772 | 0.601 | 0.760 | 0.820 | 0.860 | 0.860 |
**GPT-3.5** | | | | | | | | | |
w/ zero-shot StP | 0.918 | 0.939 | 0.780 | 0.360 | 0.805 | 0.700 | 0.680 | 0.707 | 0.720 |
w/ 5-shot StP | 0.947 | 0.967 | 0.840 | 0.531 | 0.828 | 0.780 | 0.780 | 0.710 | 0.740 |
w/ zero-shot CoTP | 0.955 | 0.977 | 0.680 | 0.404 | 0.875 | 0.740 | 0.680 | 0.743 | 0.800 |
w/ 5-shot CoTP | 0.967 | 0.977 | 0.840 | 0.548 | 0.873 | 0.740 | 0.740 | 0.761 | 0.820 |
w/ zero-shot SQP | 0.963 | 0.974 | 0.860 | 0.529 | 0.873 | 0.760 | 0.720 | 0.720 | 0.740 |
w/ 5-shot SQP | 0.970 | 0.983 | 0.860 | 0.620 | 0.892 | 0.820 | 0.820 | 0.747 | 0.780 |
**GPT-4** | | | | | | | | | |
w/ zero-shot StP | 0.968 | 0.976 | 0.860 | 0.428 | 0.820 | 0.800 | **0.900** | 0.795 | 0.820 |
w/ 5-shot StP | 0.975 | 0.989 | 0.860 | 0.502 | 0.848 | 0.840 | 0.880 | 0.815 | 0.840 |
w/ zero-shot CoTP | 0.981 | 0.994 | 0.860 | 0.509 | 0.875 | 0.840 | 0.860 | 0.805 | 0.840 |
w/ 5-shot CoTP | 0.984 | 0.994 | 0.880 | 0.544 | 0.897 | 0.800 | 0.860 | 0.852 | 0.880 |
w/ zero-shot SQP | **0.985** | 0.992 | 0.920 | 0.595 | 0.889 | **0.860** | **0.900** | 0.844 | 0.900 |
w/ 5-shot SQP | 0.984 | **0.995** | 0.920 | **0.798** | **0.916** | **0.860** | 0.860 | 0.873 | 0.900 |

Note: Acc. = Accuracy; CoTP = Chain-of-Thought Prompting; Len. Acc. = Lenient Accuracy; MRR = Mean Reciprocal Rank; Pear. = Pearson Correlation; StP = Standard Prompting.
demonstrates superior performance in the i2b2 2010-Relation dataset. Additionally, Bard excels in tasks that require a more factual understanding of the text, such as QA (BioASQ 10b-Factoid).
Regarding prompting strategies, self-questioning consistently outperforms standard prompting and exhibits competitive performance when compared to chain-of-thought prompting across all settings. Our findings suggest that self-questioning is a promising approach for enhancing the performance of LLMs, achieving the best performance for the majority of tasks, except for QA (BioASQ 10b-Factoid).
Furthermore, our study demonstrates that 5-shot learning generally leads to improved performance across all tasks when compared to zero-shot learning, although not universally. This finding indicates that incorporating even a modest amount of task-specific training data can substantially enhance the effectiveness of pre-trained LLMs.
6.2. Prompting Strategies Comparison
We evaluate the performance of different prompting strategies, specifically standard prompting, self-questioning prompting (SQP), and chain-of-thought prompting (CoTP) on both zero-shot and 5-shot learning settings across various models and datasets. Figure 3 presents the averaged performance comparison over all datasets, under the assumption that datasets and evaluation metrics are equally important and directly comparable. We observe that self-questioning prompting consistently yields the best performance compared to standard and chain-of-thought prompting. In addition, GPT-4 excels among the models, demonstrating the highest overall performance.
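Under the stated assumption that datasets and metrics are equally weighted and directly comparable, the averages in Figure 3 reduce to a simple mean of the per-dataset scores for each model and prompting strategy; a minimal sketch follows, using one row of Table 3 as input. Whether Figure 3 includes both BioASQ metrics in the mean is not detailed here, so the resulting value is illustrative only.

```python
# Equal-weight average across datasets for one model and prompting strategy, mirroring
# the comparability assumption stated for Figure 3. The scores are the Bard w/ 5-shot
# SQP row of Table 3; the exact aggregation used for Figure 3 may differ.
bard_5shot_sqp = {
    "NCBI-Disease": 0.960,
    "BC5CDR-Chemical": 0.983,
    "i2b2 2010-Relation": 0.940,
    "SemEval 2013-DDI": 0.772,
    "BIOSSES": 0.601,
    "MedNLI": 0.760,
    "i2b2 2006-Smoking": 0.820,
    "BioASQ 10b-Factoid (MRR)": 0.860,
    "BioASQ 10b-Factoid (Len. Acc.)": 0.860,
}
average_score = sum(bard_5shot_sqp.values()) / len(bard_5shot_sqp)  # approx. 0.84
```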
Figure 3: Average performance comparison of three prompting methods in zero-shot and 5-shot learning settings across Bard, GPT-3.5, and GPT-4 models. Performance values are averaged across all datasets, assuming equal importance for datasets and evaluation metrics, as well as direct comparability. The self-questioning prompting method consistently outperforms standard and chain-of-thought prompting, and GPT-4 excels among the models.
Table 4 and Table 5 demonstrate performance improvements of prompting strategies over multiple datasets and models under zero-shot and 5-shot settings, respectively, using standard prompting as a baseline. In the zero-shot learning setting (Table 4), self-questioning prompting achieves the highest improvement in the majority of tasks, with improvements ranging from 4.9% to 46.9% across different datasets.
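The reported percentages are consistent with a simple relative gain over the standard-prompting score for the same model and shot setting; the sketch below reproduces two Table 4 entries from the corresponding Table 3 values as a sanity check.

```python
# Relative improvement over standard prompting (StP), in percent, which is consistent
# with how the values in Tables 4 and 5 relate to Table 3.
def relative_improvement(score: float, baseline: float) -> float:
    return (score - baseline) / baseline * 100.0

# Worked example: GPT-4, zero-shot, SemEval 2013-DDI (Macro F1) from Table 3.
cotp_gain = relative_improvement(0.509, 0.428)  # ~18.9, matching Table 4
sqp_gain = relative_improvement(0.595, 0.428)   # ~39.0, matching Table 4
```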
In the 5-shot learning setting (Table 5), self-questioning prompting leads to the highest improvement in most tasks, with improvements ranging from 2.9% to 59.0%. In both settings, we also observe some instances where chain-of-thought or self-questioning prompting yields negative values, such as relation extraction (i2b2 2010-Relation) and document classification (i2b2 2006-Smoking), indicating inferior performance compared to standard prompting. This could be due to the specific nature of certain tasks, where the additional context or complexity introduced by the alternative prompting strategies might not contribute to better understanding or performance. It might also be possible that the model’s capacity is insufficient to take advantage of the additional information provided by the alternative prompting strategies in some cases.
Overall, self-questioning prompting generally outperforms other prompting strategies across different models and datasets in both zero-shot and 5-shot learning settings, despite occasional inferior performance in specific tasks. This suggests that self-questioning prompting can be a promising technique for improving performance in the domain of clinical language understanding. Furthermore, GPT-4 emerges as the top-performing model, emphasizing the potential for various applications in the clinical domain.
Table 4: Comparison of zero-shot learning performance improvements (in %) for different models and prompting techniques on multiple datasets, with standard prompting as the baseline. Bold values indicate the highest improvement for each dataset across models and prompting strategies, while negative values signify inferior performance. Self-questioning prompting leads to the largest improvement in the majority of tasks.
Dataset | Metric | Bard CoTP | Bard SQP | GPT-3.5 CoTP | GPT-3.5 SQP | GPT-4 CoTP | GPT-4 SQP |
---|---|---|---|---|---|---|---|
NCBI-Disease | Micro F1 | 3.8 | **4.9** | 4.0 | **4.9** | 1.3 | 1.8 |
BC5CDR-Chemical | Micro F1 | 2.6 | 3.2 | **4.0** | 3.7 | 1.8 | 1.6 |
i2b2 2010-Relation | Micro F1 | -8.3 | 5.6 | -12.8 | **10.3** | 0.0 | 7.0 |
SemEval 2013-DDI | Macro F1 | 7.1 | 15.5 | 12.2 | **46.9** | 18.9 | 39.0 |
BIOSSES | Pear. | 40.9 | **43.6** | 8.7 | 8.4 | 6.7 | 8.4 |
MedNLI | Acc. | 0.0 | **31.0** | 5.7 | 8.6 | 5.0 | 7.5 |
i2b2 2006-Smoking | Micro F1 | -2.6 | -2.6 | 0.0 | **5.9** | -4.4 | 0.0 |
BioASQ 10b-Factoid | MRR | **10.9** | 6.3 | 5.1 | 1.8 | 1.3 | 6.2 |
BioASQ 10b-Factoid | Len. Acc. | **12.2** | 4.9 | 11.1 | 2.8 | 2.4 | 9.8 |
6.3. Task-by-Task Analysis
To delve deeper into the specific characteristics and challenges associated with each task (i.e., NER, relation extraction, STS, NLI, document classification, and QA), we individually analyze the results, aiming to better understand the underlying factors that contribute to model performance and identify areas for potential improvement or further investigation.
Named Entity Recognition Task. In the NER task, we focus on two datasets: NCBI-Disease and BC5CDR-Chemical. Employing the BIO tagging scheme, we evaluate model performance using the micro F1 metric. NER tasks in the biomedical domain pose unique challenges due to specialized terminology, complex entity names, and frequent use of abbreviations. Our results indicate that, compared to standard prompting, self-questioning prompting leads to average improvements of 3.9% and 2.8% in zero-shot learning for NCBI-Disease and BC5CDR-Chemical, respectively. In the 5-shot setting, the average improvements are 2.1% and 1.1%, respectively. Moreover, GPT-4 demonstrates the most significant performance boost compared to Bard and GPT-3.5.
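As one plausible way to compute the Micro F1 metric named in Table 2, the sketch below scores token-level BIO predictions with scikit-learn, restricting the average to the entity labels so that 'O' tokens are excluded; whether the study scores tokens or entity spans is not specified here, so this is only an illustration on toy data.

```python
# One plausible computation of micro F1 for BIO-tagged NER output: token-level scoring
# with scikit-learn, restricted to the entity labels so that 'O' tokens are excluded.
# The gold/predicted sequences below are toy data, not drawn from the datasets.
from sklearn.metrics import f1_score

gold = ["O", "B-Chemical", "I-Chemical", "O", "B-Chemical", "O"]
pred = ["O", "B-Chemical", "O", "O", "B-Chemical", "O"]

micro_f1 = f1_score(gold, pred, average="micro",
                    labels=["B-Chemical", "I-Chemical"])  # 0.8 on this toy example
```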
We also conduct a qualitative analysis by examining specific examples from the datasets.
Table 5: Comparison of 5-shot learning performance improvements (in %) for different models and prompting techniques on multiple datasets, with standard prompting as the baseline. Bold values indicate the highest improvement for each dataset across models and prompting strategies, while negative values signify inferior performance. Self-questioning prompting leads to the highest improvement in 6 out of 8 tasks, followed by chain-of-thought prompting, which achieves the largest improvement on the remaining 2.
Dataset | Metric | Bard CoTP | Bard SQP | GPT-3.5 CoTP | GPT-3.5 SQP | GPT-4 CoTP | GPT-4 SQP |
---|---|---|---|---|---|---|---|
NCBI-Disease | Micro F1 | 2.4 | **2.9** | 2.1 | 2.4 | 0.9 | 0.9 |
BC5CDR-Chemical | Micro F1 | 0.5 | 1.1 | 1.0 | **1.7** | 0.5 | 0.6 |
i2b2 2010-Relation | Micro F1 | 0.0 | 4.4 | 0.0 | 2.4 | 2.3 | **7.0** |
SemEval 2013-DDI | Macro F1 | 34.3 | 46.2 | 3.2 | 16.8 | 8.4 | **59.0** |
BIOSSES | Pear. | **34.1** | 33.9 | 5.4 | 7.7 | 5.8 | 8.0 |
MedNLI | Acc. | 12.5 | **18.8** | -5.1 | 5.1 | -4.8 | 2.4 |
i2b2 2006-Smoking | Micro F1 | -2.4 | 0.0 | -5.1 | **5.1** | -2.3 | -2.3 |
BioASQ 10b-Factoid | MRR | 4.1 | 1.8 | **7.2** | 5.2 | 4.5 | 7.1 |
BioASQ 10b-Factoid | Len. Acc. | 2.3 | -2.3 | **10.8** | 5.4 | 4.8 | 7.1 |
For instance, the term “aromatic ring” in the BC5CDR-Chemical dataset is often incorrectly predicted as “B-Chemical” (beginning of a chemical entity) instead of “O” (outside of any entity) by the models. This error might occur because the term “aromatic ring” refers to a structural feature commonly found in chemical compounds, leading models to associate it with chemical entities and misclassify it. This example highlights the challenges faced by the models in accurately recognizing entities, particularly when dealing with terms that have strong associations with specific entity types. It also demonstrates the potential limitations of prompting strategies in addressing these challenges, as models may still struggle to disambiguate such terms, despite employing different prompting techniques.
Relation Extraction Task. In the relation extraction task involving the i2b2 2010-Relation and SemEval 2013-DDI datasets, we evaluate our model’s performance using micro F1 and macro F1 scores, respectively. Our study reveals that self-questioning prompting leads to average improvements of 7.6% and