[论文翻译]PIKE-RAG:专业知识与推理增强生成


原文地址:https://arxiv.org/pdf/2501.11551v2


PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

PIKE-RAG:专业知识与推理增强生成

Abstract

摘要

Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge or performing logical reasoning over specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems’ problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on the original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks. The code is publicly available at https://github.com/microsoft/PIKE-RAG.

尽管检索增强生成 (RAG) 系统通过外部检索扩展了大语言模型 (LLM) 的能力,并取得了显著的进展,但这些系统往往难以满足现实工业应用中复杂多样的需求。仅依赖检索在从专业语料库中提取深度领域知识并进行逻辑推理方面显得不足。为了解决这一问题,我们引入了专业化知识与推理增强生成 (PIKE-RAG),专注于提取、理解和应用专业化知识,同时构建连贯的推理,逐步引导大语言模型生成准确的响应。认识到工业任务的多样化挑战,我们引入了一种新范式,根据任务在知识提取和应用中的复杂性进行分类,从而系统评估 RAG 系统的解决问题能力。这一策略为 RAG 系统的阶段性开发和增强提供了路线图,以满足工业应用不断变化的需求。此外,我们提出了知识原子化和知识感知任务分解,以有效从数据块中提取多方面的知识,并分别基于原始查询和累积知识迭代构建推理,在各种基准测试中展示了卓越的性能。代码公开在 https://github.com/microsoft/PIKE-RAG

1 Introduction

1 引言

Large Language Models (LLMs) have revolutionized the field of natural language processing by demonstrating the capability to generate coherent and contextually relevant text. These advanced models are trained on expansive corpora, equipping them with the versatility to execute a diverse spectrum of linguistic tasks, ranging from text completion to translation and summarization [5, 9, 50, 6]. Despite their broad capabilities, LLMs exhibit pronounced limitations when tasked with specialized queries in professional domains [38, 54], a demand that is particularly acute in industrial applications. This primarily stems from the scarcity of domain-specific training material and a limited grasp of specialized knowledge and rationale within these domains. As a result, LLMs may produce responses that are not only potentially erroneous but also lack the detail and precision required for expert-level engagement [11]. Besides the limitations in domain-specific tasks, another striking issue with LLMs is the phenomenon known as "hallucination", where the model generates information that is not grounded in reality or factual data [10, 57]. Moreover, the knowledge base of LLMs, being static and crystallized at the point of their last update, introduces temporal stasis [13]. Further compounding these challenges is the issue of long-context comprehension [37]. Existing LLMs struggle to maintain an understanding of task definitions across long contexts, and their performance tends to deteriorate significantly when confronted with more complex and demanding tasks.

大语言模型 (LLMs) 通过展示生成连贯且上下文相关文本的能力,彻底改变了自然语言处理领域。这些先进的模型在广泛的语料库上进行训练,使其具备执行多种语言任务的多功能性,从文本补全到翻译和摘要 [5, 9, 50, 6]。尽管它们具备广泛的能力,但在处理专业领域的特定查询时,LLMs 表现出明显的局限性 [38, 54],这种需求在工业应用中尤为迫切。这主要源于领域特定训练材料的稀缺以及对这些领域专业知识和原理的有限理解。因此,LLMs 可能会生成不仅可能错误,而且缺乏专家级参与所需的细节和精度的响应 [11]。除了在领域特定任务中的局限性外,LLMs 另一个显著问题是所谓的“幻觉”现象,即模型生成的信息不基于现实或事实数据 [10, 57]。此外,LLMs 的知识库在其最后一次更新时是静态和固化的,这引入了时间停滞 [13]。进一步加剧这些挑战的是长上下文理解问题 [37]。现有的 LLMs 在长上下文中难以保持对任务定义的理解,当面对更复杂和要求更高的任务时,其性能往往会显著下降。

To address the inherent limitations of LLMs, Retrieval-Augmented Generation (RAG) [35] has been proposed, which merges the generative capabilities of LLMs with a retrieval mechanism, allowing the incorporation of relevant external information to anchor the generated text in factual data. This integrated strategy improves both the accuracy and reliability of the generated content, providing a promising pathway for the practical deployment of LLMs in industrial applications. However, current RAG methods remain heavily reliant on text retrieval and the comprehension capabilities of LLMs, with little attention paid to extracting, understanding, and utilizing knowledge from diverse source data. In industrial applications requiring expertise, such as specialized knowledge and problem-solving rationale, existing RAG approaches, primarily designed for research benchmarks, demonstrate significant limitations. There is a lack of clarity regarding the challenges that RAG encounters in industrial applications. Gaining a comprehensive insight into these challenges is crucial for the development of RAG algorithms. Therefore, we summarize the main challenges as follows.

为了解决大语言模型(LLM)的固有局限性,检索增强生成(Retrieval-Augmented Generation,RAG)[35] 被提出,它将大语言模型的生成能力与检索机制相结合,允许引入相关的外部信息,使生成的文本基于事实数据。这种综合策略提高了生成内容的准确性和可靠性,为大语言模型在工业应用中的实际部署提供了一条有前景的路径。然而,当前的 RAG 方法仍然严重依赖文本检索和大语言模型的理解能力,缺乏对从多样化的源数据中提取、理解和利用知识的关注。在需要专业知识的工业应用中,如专业知识和问题解决逻辑,现有的主要为研究基准设计的 RAG 方法表现出明显的局限性。对于 RAG 在工业应用中遇到的挑战,目前缺乏清晰的认识。全面了解这些挑战对于 RAG 算法的发展至关重要。因此,我们将主要挑战总结如下。

• Knowledge source diversity: RAG systems are constructed upon a diverse corpus of source documents collected over many years from various domains, encompassing a wide range of file formats like scanned images, digital text files, and web data, sometimes accompanied by specialized databases. In contrast, widely-used datasets [28, 60, 51] typically feature pre-segmented, simplified corpora that do not capture the complexity of real-world data. Existing methods designed for these benchmarks struggle to efficiently extract specialized knowledge and uncover underlying rationales from diverse sources, particularly in industrial applications. For example, an LED product datasheet typically comprises specifications such as performance characteristics presented in complex tables, electrical properties depicted in charts, and installation instructions illustrated with figures. Addressing queries related to such non-textual knowledge presents significant challenges for existing RAG approaches.

• 知识来源多样性:RAG系统构建于多年收集的多样化源文档语料库之上,这些文档来自各个领域,涵盖扫描图像、数字文本文件和网络数据等多种文件格式,有时还伴随专门的数据库。相比之下,广泛使用的数据集 [28, 60, 51] 通常以预先分割、简化的语料库为特征,无法捕捉现实世界数据的复杂性。为这些基准设计的方法难以从多样化来源中高效提取专门知识并揭示潜在原理,尤其是在工业应用中。例如,LED产品数据表通常包含性能特征等规格,以复杂表格呈现,电气特性以图表展示,安装说明则通过图示说明。处理与非文本知识相关的查询对现有RAG方法提出了重大挑战。

• Domain specialization deficit: In industrial applications, RAG systems are expected to leverage specialized knowledge and rationale in professional fields. However, this specialized knowledge is characterized by domain-specific terminologies, expertise, and distinctive logical frameworks that are integral to its functioning. RAG approaches built on common knowledge-centric datasets demonstrate unsatisfactory performance when applied to professional fields, as LLMs exhibit deficiencies in extracting, understanding, and organizing domain-specific knowledge and rationale [38]. For example, in the field of semiconductor design, research relies heavily on a deep understanding of underlying physical properties. When LLMs are utilized to extract and organize specialized knowledge and rationale from research documents, they often fail to properly capture essential physical principles and achieve a comprehensive understanding due to their inherent limitations. Consequently, RAG systems frequently produce incomplete or inaccurate interpretations of critical problem elements and generate responses that lack proper rationale grounded in physical principles. Moreover, assessing the quality of professional content generation poses a significant challenge. This issue not only impedes the development and optimization of RAG algorithms but also complicates their practical deployment across various industrial applications.

• 领域专业化不足:在工业应用中,RAG 被期望利用专业领域的专业知识和原理。然而,这些专业知识具有领域特定的术语、专业知识和独特的逻辑框架,这些是其功能不可或缺的组成部分。基于以常识为中心的数据集构建的 RAG 方法在应用于专业领域时表现出不令人满意的性能,因为大语言模型在提取、理解和组织领域特定知识和原理方面存在不足 [38]。例如,在半导体设计领域,研究严重依赖于对基础物理特性的深入理解。当利用大语言模型从研究文档中提取和组织专业知识和原理时,由于其固有的局限性,它们往往无法正确捕捉基本的物理原理并实现全面的理解。因此,RAG 系统经常对关键问题要素产生不完整或不准确的解释,并生成缺乏基于物理原理的适当推理的响应。此外,评估专业内容生成的质量也构成了重大挑战。这个问题不仅阻碍了 RAG 算法的开发和优化,还使其在各种工业应用中的实际部署变得复杂。

• One-size-fits-all: Various RAG application scenarios, although based on a similar framework, present different challenges that require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale. The complexity and focus of questions vary across these scenarios, and within a single scenario, the difficulty can also differ. For example, in rule-based query scenarios, such as determining the legal conditions for mailing items, RAG systems primarily focus on retrieving relevant factual rules by bridging the semantic gap between the query and the rules. In multi-hop query scenarios, such as comparing products across multiple aspects, RAG systems emphasize extracting information from diverse sources and performing multi-hop reasoning to arrive at accurate answers. Most existing RAG approaches [62] adopt a one-size-fits-all strategy, failing to account for the varying complexities and specific demands both within and across scenarios. This results in solutions that do not meet the comprehensive accuracy standards required for practical applications, thereby limiting the development and integration of RAG systems in real-world environments.

• 通用解决方案:尽管基于相似的框架,各种 RAG 应用场景面临不同的挑战,需要多样化的能力,特别是在提取、理解和组织领域特定知识和逻辑方面。这些场景中的问题复杂性和重点各不相同,甚至在单个场景中,难度也可能有所差异。例如,在基于规则的查询场景中,如确定邮寄物品的法律条件,RAG 系统主要通过弥合查询与规则之间的语义差距来检索相关的事实规则。在多跳查询场景中,如从多个方面比较产品,RAG 系统则强调从不同来源提取信息并进行多跳推理,以得出准确的答案。大多数现有的 RAG 方法 [62] 采用了一种通用解决方案,未能考虑到场景内外的不同复杂性和特定需求。这导致解决方案无法满足实际应用所需的全面准确性标准,从而限制了 RAG 系统在现实环境中的发展和集成。

We believe that the key to addressing these challenges lies in advancing beyond traditional retrieval augmentation, by effectively extracting, understanding, and applying specialized knowledge, and developing appropriate reasoning logic tailored to the specific tasks and the knowledge involved. We refer to this approach as sPecIalized Knowledge and Rationale Augmentation. Given that various tasks require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale, we summarize and categorize the questions commonly encountered into four types with respect to their difficulty: factual questions, linkable-reasoning questions, predictive questions, and creative questions. Accordingly, we propose a classification of RAG system capability levels, aligned with the system’s ability to solve these different types of problems. This classification serves as a guideline for systematically advancing the system’s capabilities in a controllable and measurable manner.

我们相信,解决这些挑战的关键在于超越传统的检索增强,通过有效提取、理解和应用专业知识,并开发适合特定任务和相关知识的推理逻辑。我们将这种方法称为专业知识与推理增强 (sPecIalized Knowledge and Rationale Augmentation) 。鉴于各种任务需要不同的能力,特别是在提取、理解和组织领域特定知识和推理方面,我们总结并分类了常见问题,根据其难度分为四种类型:事实性问题、可链接推理问题、预测性问题和创造性问题。相应地,我们提出了RAG系统能力水平的分类,与系统解决这些不同类型问题的能力相匹配。该分类为系统在可控和可衡量的方式下系统地提升能力提供了指导。

Furthermore, we propose the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, which not only supports phased system development and deployment, demonstrating excellent versatility, but also enhances capabilities by effectively leveraging specialized knowledge and rationale. Within this framework, knowledge extraction components are employed to extract specialized knowledge from diverse source data, laying a robust foundation for knowledge-based retrieval and reasoning. Additionally, a task decomposer is utilized to dynamically manage the routing of retrieval and reasoning operations, creating specialized rationale based on available knowledge. PIKE-RAG enables a phased exploration of RAG capabilities, which facilitates the progressive refinement of RAG algorithms and the staged implementation of RAG applications. For each development phase, the RAG framework and its modules are tailored to address specific challenges. For example, in the knowledge base construction phase, a multi-layer heterogeneous graph is employed to effectively represent relationships between various components of the data, enhancing knowledge organization and integration. The RAG system designed for factual questions introduces multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph to improve factual retrieval accuracy. In the advanced RAG system, aimed at addressing complex queries, knowledge atomizing is introduced to fully explore the intrinsic knowledge within data chunks, while knowledge-aware task decomposition manages the retrieval and organization of multiple pieces of atomic knowledge to construct a coherent rationale.

此外,我们提出了一种专门化知识与推理增强生成(PIKE-RAG)框架,该框架不仅支持分阶段的系统开发与部署,展现出极佳的通用性,还能通过有效利用专门化知识和推理来增强能力。在该框架中,知识提取组件用于从多样化的源数据中提取专门化知识,为基于知识的检索和推理奠定了坚实的基础。此外,任务分解器被用来动态管理检索与推理操作的路由,基于可用知识生成专门的推理。PIKE-RAG 支持对 RAG 能力的分阶段探索,从而促进 RAG 算法的逐步优化和 RAG 应用的分阶段实施。在每个发展阶段,RAG 框架及其模块都会根据特定挑战进行调整。例如,在知识库构建阶段,采用多层异构图来有效表示数据各组件之间的关系,增强知识的组织与整合。针对事实性问题设计的 RAG 系统引入了多粒度检索,允许在异质知识图谱上进行多层、多粒度的检索,以提高事实检索的准确性。在高级 RAG 系统中,为了解决复杂查询,引入了知识原子化,以充分挖掘数据块中的内在知识,而知识感知的任务分解则管理多个原子知识的检索与组织,以构建连贯的推理。
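The interplay between knowledge atomizing and knowledge-aware task decomposition described above can be sketched in a few lines. This is an illustrative control-flow stub only: in PIKE-RAG both steps are LLM-driven, whereas here atomization is a naive sentence split, sub-query selection is a word-overlap heuristic, and the drug facts are made up for the example.

```python
def atomize(chunk):
    # Knowledge atomizing: break a chunk into self-contained atomic
    # statements. The paper does this with an LLM; a sentence split
    # stands in here.
    return [s.strip() for s in chunk.split(".") if s.strip()]

def build_rationale(subqueries, atoms):
    # Knowledge-aware task decomposition, reduced to its control flow:
    # for each sub-query, retrieve the atomic fact that shares the most
    # words with it, and accumulate the facts as the rationale.
    rationale = []
    for sq in subqueries:
        best = max(atoms, key=lambda a: len(set(sq.lower().split())
                                             & set(a.lower().split())))
        rationale.append(best)
    return rationale

chunk = ("Drug A was approved in 2019. Drug A treats hypertension. "
         "Drug B was approved in 2021")
atoms = atomize(chunk)
rationale = build_rationale(
    ["when was Drug A approved", "Drug A treats what condition"], atoms)
```

The point of the sketch is the separation of concerns: atomization exposes multiple facets of one chunk, while decomposition chooses which atomic fact answers each step of the query.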

Extensive experiments are conducted to evaluate the performance of the proposed PIKE-RAG framework on both open-domain and legal benchmarks, and experimental results demonstrate the effectiveness of PIKE-RAG. Our framework and staged development strategy could further advance the current research and application of RAG in industrial contexts. In summary, the contributions of this work are as follows:

我们进行了大量实验,以评估所提出的 PIKE-RAG 框架在开放领域和法律基准上的性能,实验结果证明了 PIKE-RAG 的有效性。我们的框架和分阶段开发策略可以进一步推动 RAG 在工业环境中的研究和应用。综上所述,本工作的贡献如下:

2 Related work

2 相关工作

2.1 RAG

2.1 RAG

Retrieval-Augmented Generation (RAG) has emerged as a promising solution that effectively incorporates external knowledge to enhance response generation. Initially, retrieval-augmented techniques were introduced to improve the performance of pre-trained language models on knowledge-intensive tasks [35, 29, 12]. With the booming of Large Language Models [5, 9, 50, 6], most research in the RAG paradigm has shifted towards a framework that initially retrieves pertinent information from external data sources and subsequently integrates it into the context of the query prompt as supplementary knowledge for contextually relevant generation [46]. Following this framework, the naive RAG research paradigm [25] converts raw data into uniform plain text and segments it into smaller chunks, which are encoded into vector space for query-based retrieval. The top-k relevant chunks are used to expand the context of the prompt for generation. To enhance the retrieval quality of naive RAG, advanced RAG approaches implement specific enhancements across the pre-retrieval, retrieval, and post-retrieval processes, including query optimization [39, 63], multi-granularity chunking [16, 65], mixed retrieval, and chunk re-ranking.

检索增强生成 (Retrieval-Augmented Generation, RAG) 作为一种有前景的解决方案,能够有效整合外部知识以增强响应生成。最初,检索增强技术被引入以提高预训练语言模型在知识密集型任务上的表现 [35, 29, 12]。随着大语言模型 (Large Language Models) 的蓬勃发展 [5, 9, 50, 6],RAG 范式中的大多数研究转向了一种框架,该框架首先从外部数据源检索相关信息,随后将其整合到查询提示的上下文中,作为上下文相关生成的补充知识 [46]。遵循这一框架,朴素的 RAG 研究范式 [25] 将原始数据转换为统一的纯文本并将其分割成较小的块,这些块被编码到向量空间中以进行基于查询的检索。前 k 个相关块用于扩展提示的上下文以进行生成。为了提高朴素 RAG 的检索质量,先进的 RAG 方法在预检索、检索和后检索过程中实施了特定的增强,包括查询优化 [39, 63]、多粒度分块 [16, 65]、混合检索和块重排序。
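The naive pipeline described above (plain-text conversion, chunking, vector encoding, top-k retrieval, prompt expansion) can be sketched as follows. The bag-of-words "embedding" and the toy corpus are stand-ins for a real neural encoder and document collection.

```python
from collections import Counter
import math

def chunk(text, size=40):
    # Split the corpus into fixed-size character chunks; real systems
    # use token- or structure-aware splitting instead.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    # Toy bag-of-words "embedding" standing in for a neural encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by similarity to the query and keep the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

corpus = ("PIKE-RAG extracts specialized knowledge. "
          "Retrieval anchors generation in facts. "
          "LLMs may hallucinate without grounding.")
chunks = chunk(corpus)
context = retrieve("what anchors generation", chunks, k=1)
# The retrieved chunks expand the prompt context for generation.
prompt = f"Context: {' '.join(context)}\nQuestion: what anchors generation?"
```

The advanced enhancements mentioned above plug into exactly these stages: query optimization before `retrieve`, multi-granularity variants of `chunk`, and re-ranking of the returned chunks before building the prompt.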

Beyond the aforementioned RAG paradigms, numerous sophisticated enhancements in RAG pipelines and system modules are introduced within modular RAG systems [26], aiming to improve system capability and versatility. These advancements have enabled the processing of a wider variety of source data, facilitating the transformation of raw information into structured data and, ultimately, into valuable knowledge [56, 20]. Furthermore, the indexing and retrieval modules have been refined with multi-granularity and multi-architecture approaches [58, 65]. Various pre-retrieval [24, 64] and post-retrieval [18, 30] functions are proposed to enhance both the retrieval effectiveness and the quality of subsequent generation. It has been recognized that naïve RAG systems are insufficient to tackle complex tasks such as summarization [27] and multi-hop reasoning [51, 28]. Consequently, most recent research focuses on developing advanced coordination schemes that leverage existing modules to collaboratively address these challenges. ITER-RETGEN [48] and DSP [33] employ retrieve-read iteration to leverage the generated response as the context for the next round of retrieval. FLARE [31] proposes a confidence-based active retrieval mechanism that dynamically adjusts the query with respect to low-confidence tokens in the regenerated sentences. These loop-based RAG pipelines progressively converge towards the correct answer and provide enhanced flexibility to RAG systems in addressing diverse requirements.

除了上述的 RAG 范式外,模块化 RAG 系统中引入了许多复杂的 RAG 管道和系统模块增强 [26],旨在提高系统能力和多功能性。这些进步使得处理更多种类的源数据成为可能,促进了原始信息向结构化数据以及最终向有价值知识的转化 [56, 20]。此外,索引和检索模块通过多粒度和多架构方法进行了优化 [58, 65]。提出了各种检索前 [24, 64] 和检索后 [18, 30] 功能,以增强检索效果和后续生成的质量。人们认识到,简单的 RAG 系统不足以应对诸如摘要 [27] 和多跳推理 [51, 28] 等复杂任务。因此,最近的研究主要集中在开发高级协调方案,利用现有模块协同应对这些挑战。ITER-RETGEN [48] 和 DSP [33] 采用检索-读取迭代,利用生成响应作为下一轮检索的上下文。FLARE [31] 提出了一种基于置信度的主动检索机制,根据再生句子中低置信度的 Token 动态调整查询。这些基于循环的 RAG 管道逐步收敛到正确答案,并为 RAG 系统提供了应对多样化需求的增强灵活性。
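A retrieve-read iteration loop in the spirit of ITER-RETGEN can be sketched as below: each round's draft answer enriches the next retrieval query. The keyword-matching retriever, the evidence-accumulating "generator", and the two-fact knowledge base are all illustrative stand-ins for the LLM calls a real system would make.

```python
KB = {
    "capital of France": "Paris is the capital of France.",
    "Paris": "Paris hosted the 2024 Olympics.",
}

def retrieve(query, seen):
    # Return the first fact whose key appears in the query and which
    # has not been used in an earlier round.
    for key, fact in KB.items():
        if key.lower() in query.lower() and fact not in seen:
            return fact
    return None

def iter_retgen(question, rounds=3):
    draft, seen = "", set()
    for _ in range(rounds):
        # The previous draft enriches the retrieval query, letting the
        # loop reach facts the original question could not surface.
        fact = retrieve(question + " " + draft, seen)
        if fact is None:
            break
        seen.add(fact)
        # Stub "generation": fold new evidence into the draft answer;
        # a real system would call an LLM conditioned on the context.
        draft = (draft + " " + fact).strip()
    return draft

answer = iter_retgen("Which city is the capital of France, and what did it host?")
```

Note how the second fact is only reachable once the first round has introduced "Paris" into the draft, which is the essence of the loop-based convergence described above.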

2.2 Knowledge bases for RAG

2.2 用于RAG的知识库

In naïve RAG approaches, source data is converted to plain text and chunked for retrieval. However, as RAG applications expand and demand for diversity grows, plain text-based retrieval becomes insufficient for several reasons: (1) textual information is generally redundant and noisy, leading to decreased retrieval quality; (2) complex problems require the integration of multiple data sources, and plain text alone cannot adequately represent the intricate relationships between objects. As a result, researchers are exploring diverse data sources to enrich the corpus, incorporating search engines [59, 53], databases [55, 41, 47], knowledge graphs [49, 56], and multimodal corpora [17, 15]. Concurrently, there is an emphasis on developing efficient knowledge representations for the corpus to enhance knowledge retrieval. A graph is regarded as a powerful knowledge representation because of its capacity to intuitively model complex relationships. GraphRAG [20] combines knowledge graph generation and query-focused summarization with RAG to address both local and global questions. HOLMES [42] constructs hyper-relational KGs and prunes them into distilled graphs, which serve as input to LLMs for multi-hop question answering. However, the construction of knowledge graphs is extremely resource-intensive, and the associated costs scale up with the size of the corpus.

在简单的 RAG 方法中,源数据被转换为纯文本并分块以便检索。然而,随着 RAG 应用的扩展和对多样性的需求增加,基于纯文本的检索在多个方面变得不足:(1) 文本信息通常冗余且嘈杂,导致检索质量下降;(2) 复杂问题需要整合多个数据源,而纯文本无法充分表示对象之间的复杂关系。因此,研究人员正在探索多样化的数据源以丰富语料库,包括搜索引擎 [59, 53]、数据库 [55, 41, 47]、知识图谱 [49, 56] 和多模态语料库 [17, 15]。同时,重点在于开发高效的语料库知识表示以增强知识检索。图因其能够直观地建模复杂关系而被视为一种强大的知识表示。GraphRAG [20] 将知识图谱生成和以查询为中心的摘要与 RAG 结合,以解决局部和全局问题。HOLMES [42] 构建超关系知识图谱并将其修剪为精简图,作为大语言模型的多跳问答输入。然而,知识图谱的构建极其耗费资源,且相关成本随着语料库的规模而增加。

2.3 Multi-hop QA

2.3 多跳问答

Multi-hop Question Answering (MHQA) [60] involves answering questions that require reasoning over multiple pieces of information, often scattered across different documents or paragraphs. This task presents unique challenges as it necessitates not only retrieving relevant information but also effectively combining and reasoning over the retrieved pieces to arrive at a correct answer. Traditional graph-based methods in MHQA solve the problem by building graphs and performing inference with graph neural networks (GNNs) to predict answers [44, 21]. With the advent of LLMs, recent graph-based methods [36, 42] have evolved to construct knowledge graphs for retrieval and generate responses through LLMs. Another branch of methods dynamically converts multi-hop questions into a series of sub-queries by generating subsequent questions based on the answers to previous ones [52, 33, 23]. The sub-queries guide the sequential retrieval, and the retrieved results are in turn used to improve reasoning. Treating MHQA as a supervised problem, Self-RAG [61] trains an LM to learn to retrieve, generate, and critique text passages, and beam-retrieval [7] models the multi-hop retrieval process in an end-to-end manner by jointly optimizing an encoder and classification heads across all hops. Self-Ask [43] improves CoT by explicitly asking itself follow-up questions before answering the initial question. This method enables the automatic decomposition of questions and can be seamlessly integrated with retrieval mechanisms to tackle multi-hop question answering.

多跳问答 (MHQA) [60] 涉及回答需要跨多个信息片段进行推理的问题,这些信息通常分散在不同的文档或段落中。该任务提出了独特的挑战,因为它不仅需要检索相关信息,还需要有效地组合并推理检索到的信息片段,以得出正确答案。传统的基于图的方法在 MHQA 中通过构建图并在图神经网络 (GNN) 上进行推理来预测答案 [44, 21]。随着大语言模型的出现,最近的基于图的方法 [36, 42] 已经演变为构建知识图谱进行检索,并通过大语言模型生成响应。另一类方法通过根据先前问题的答案生成后续问题,将多跳问题动态转换为一系列子查询 [52, 33, 23]。子查询指导顺序检索,检索到的结果反过来用于改进推理。将 MHQA 视为监督问题,Self-RAG [61] 训练一个语言模型来学习检索、生成和评论文本段落,而 beam-retrieval [7] 通过在所有跳数上联合优化编码器和分类头,以端到端的方式对多跳检索过程进行建模。Self-Ask [43] 通过在回答初始问题之前明确询问自己后续问题来改进思维链 (CoT)。这种方法能够自动分解问题,并可以无缝集成检索机制以解决多跳问答问题。
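A Self-Ask-style decomposition of a two-hop question might look like the sketch below. The follow-up questions that an LLM would generate on the fly are supplied here as fixed templates, and answers come from a toy fact table rather than a retriever.

```python
FACTS = {
    "who directed jaws": "Steven Spielberg",
    "when was steven spielberg born": "1946",
}

def retrieve_answer(sub_question):
    # Stand-in retriever: look the sub-question up in a toy fact table.
    return FACTS.get(sub_question.lower(), "unknown")

def self_ask(followups):
    # `followups` stands in for the LLM-generated follow-up questions;
    # each intermediate answer is substituted into the next template
    # before retrieval, chaining the hops together.
    answer = ""
    for template in followups:
        sub_q = template.format(prev=answer)
        answer = retrieve_answer(sub_q)
    return answer

# "When was the director of Jaws born?" decomposed into two hops:
final = self_ask(["who directed Jaws", "when was {prev} born"])
```

The second template only becomes answerable after the first hop has resolved `{prev}` to "Steven Spielberg", which is exactly the bridging behavior the sub-query methods above rely on.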

3 Problem formulation

3 问题表述

Existing research mainly concentrates on algorithmic enhancements to improve the performance of RAG systems. However, there is limited effort in providing a comprehensive and systematic discussion of the RAG framework. In this work, we conceptualize the RAG framework from three key perspectives: knowledge base, task classification, and system development. We assert that the knowledge base serves as the fundamental cornerstone of RAG, underpinning all retrieval and generation processes. Furthermore, we recognize that RAG tasks can vary significantly in complexity and difficulty, depending on the required generation capabilities and the availability of supporting corpora. By categorizing tasks according to their difficulty levels, we classify RAG systems into distinct levels based on their problem-solving capabilities across the different types of questions.

现有研究主要集中在算法增强上,以提高 RAG 系统的性能。然而,关于 RAG 框架的全面系统性讨论却较为有限。在本研究中,我们从三个关键视角对 RAG 框架进行了概念化:知识库、任务分类和系统开发。我们主张知识库是 RAG 的基石,支撑着所有的检索与生成过程。此外,我们认识到 RAG 任务的复杂度和难度可以根据所需的生成能力及辅助语料的可用性而有显著不同。通过按难度等级划分任务,我们依据 RAG 系统在解决不同类型问题上的能力,将其划分为不同的级别。

3.1 Knowledge base

3.1 知识库

In industrial applications, specialized knowledge primarily originates from years of accumulated data within specific fields such as manufacturing, energy, and logistics. For example, in the pharmaceutical industry, data sources include extensive research and development documentation, as well as drug application files amassed over many years. These sources are not only diverse in file formats, but also encompass a significant amount of multi-modal content such as tables, charts, and figures, which are also crucial for problem-solving. Furthermore, there are often functional connections between files within a specialized domain, such as hyperlinks, references, and relational database links, which explicitly or implicitly reflect the logical organization of knowledge within the professional field. Currently, existing datasets provide pre-segmented corpora and do not account for the complexities encountered in real-world applications, such as the integration of multi-format data and the maintenance of referential relationships between documents. Therefore, the construction of a comprehensive knowledge base is foundational for Retrieval-Augmented Generation (RAG) in the industrial field. As the architecture and quality of the knowledge base directly influence the retrieval methods and their performance, we propose structuring the knowledge base as a multi-layer heterogeneous graph, denoted as $G$, with corresponding nodes and edges represented by $(V, E)$. The graph nodes can include documents, sections, chunks, figures, tables, and customized nodes from distilled knowledge. The edges signify the relationships among these nodes, encapsulating the interconnections and dependencies within the graph. This multi-layer heterogeneous graph encompasses three distinct layers: the information resource layer $G_{i}$, the corpus layer $G_{c}$, and the distilled knowledge layer $G_{dk}$. Each layer corresponds to a different stage of information processing, representing varying levels of granularity and abstraction in knowledge.

在工业应用中,专业知识主要来源于制造业、能源和物流等特定领域多年积累的数据。例如,在制药行业,数据来源包括广泛的研发文档以及多年积累的药物申请文件。这些来源不仅在文件格式上多样化,还包含大量多模态内容,如表格、图表和图形,这些内容对于解决问题也至关重要。此外,专业领域内的文件之间通常存在功能性联系,如超链接、引用和关系数据库链接,这些联系明确或隐含地反映了专业领域内知识的逻辑组织。目前,现有数据集提供了预分段的语料库,但并未考虑实际应用中遇到的复杂性,例如多格式数据的整合和文档之间引用关系的维护。因此,构建一个全面的知识库是工业领域检索增强生成(RAG)的基础。由于知识库的结构和质量直接影响检索方法及其性能,我们建议将知识库构建为多层异构图,记为 $G$,相应的节点和边表示为 $(V, E)$。图节点可以包括文档、章节、块、图形、表格以及从提炼知识中定制的节点。边表示这些节点之间的关系,封装了图中的相互联系和依赖关系。这种多层异构图包含三个不同的层次:信息资源层 $G_{i}$、语料层 $G_{c}$ 和提炼知识层 $G_{dk}$。每一层对应信息处理的不同阶段,代表知识的不同粒度和抽象层次。
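A minimal sketch of such a multi-layer heterogeneous graph $G = (V, E)$ follows: each node carries a layer tag ($G_i$, $G_c$, or $G_{dk}$) and a type, and typed edges record relations such as containment. The node identifiers and relation names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)   # id -> {"layer": ..., "type": ...}
    edges: list = field(default_factory=list)   # (src, relation, dst) triples

    def add_node(self, node_id, layer, node_type):
        self.nodes[node_id] = {"layer": layer, "type": node_type}

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def layer(self, name):
        # All nodes belonging to one layer of the heterogeneous graph.
        return [n for n, attrs in self.nodes.items() if attrs["layer"] == name]

g = KnowledgeGraph()
g.add_node("doc1", "G_i", "document")                  # information resource layer
g.add_node("doc1/sec1", "G_c", "section")              # corpus layer
g.add_node("doc1/sec1/chunk1", "G_c", "chunk")
g.add_node("fact:approval-date", "G_dk", "distilled")  # distilled knowledge layer
g.add_edge("doc1", "contains", "doc1/sec1")
g.add_edge("doc1/sec1", "contains", "doc1/sec1/chunk1")
g.add_edge("doc1/sec1/chunk1", "distills_to", "fact:approval-date")
```

In the framework itself these layers would be populated by document parsing and LLM-based knowledge distillation; here they are filled in by hand only to show the shape of the structure that retrieval operates over.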

3.2 Task classification

3.2 任务分类

Contemporary RAG frameworks frequently overlook the intricate difficulty and logical demands inherent to diverse tasks, typically employing a one-size-fits-all methodology. However, even with comprehensive knowledge retrieval, current RAG systems are insufficient to handle tasks of varying difficulty with equal effectiveness. Therefore, it is essential to categorize tasks and analyze the typical strategies for overcoming the challenges inherent to each category. The difficulty of a task is closely associated with several critical factors.

当代 RAG 框架常常忽视不同任务固有的复杂难度和逻辑需求,通常采用一刀切的方法。然而,即使有全面的知识检索,当前的 RAG 系统也无法同样有效地处理不同难度的任务。因此,必须对任务进行分类,并分析克服每类任务固有挑战的典型策略。任务的难度与几个关键因素密切相关。


Figure 1: Illustrative examples of distinct question types

图 1: 不同问题类型的示例

• Effectiveness of Knowledge Utilization: The sophistication involved in applying the extracted knowledge to formulate responses, including synthesizing, organizing, and generating insights or predictions.

• 知识利用的有效性:应用提取的知识来制定回应的复杂性,包括综合、组织以及生成见解或预测。

In categorizing real-world RAG tasks within industries, we focus on the processes of knowledge extraction, understanding, organization, and utilization to provide structured and insightful responses. Taking the aforementioned factors into account, we identify four distinct classes of questions that address a broad spectrum of demands. The first type, Factual Questions, involves extracting specific, explicit information directly from the corpus, relying on retrieval mechanisms to identify the relevant facts. Linkable-Reasoning Questions demand a deeper level of knowledge integration, often requiring multi-step reasoning and linking across multiple sources. Predictive Questions extend beyond the available data, requiring inductive reasoning and structuring of retrieved facts into analyzable forms, such as time series, for future-oriented predictions. Finally, Creative Questions engage domain-specific logic and creative problem-solving, encouraging the generation of innovative solutions by synthesizing knowledge and identifying patterns or influencing factors. This categorization, driven by varying levels of reasoning and knowledge management, ensures a comprehensive approach to addressing industry-specific queries.

在对行业中的现实世界 RAG 任务进行分类时,我们重点关注知识提取、理解、组织和利用的过程,以提供结构化和有洞察力的响应。考虑到上述因素,我们确定了四类不同的问题,以满足广泛的需求。第一类,事实性问题 (Factual Questions) ,涉及直接从语料库中提取特定的、明确的信息,依靠检索机制来识别相关事实。可链接推理问题 (Linkable-Reasoning Questions) 需要更深层次的知识整合,通常需要多步推理和跨多个来源的链接。预测性问题 (Predictive Questions) 超出了现有数据的范围,需要归纳推理并将检索到的事实构建为可分析的形式,例如时间序列,以进行面向未来的预测。最后,创造性问题 (Creative Questions) 涉及特定领域的逻辑和创造性问题解决,通过综合知识并识别模式或影响因素,鼓励生成创新解决方案。这种由不同层次的推理和知识管理驱动的分类,确保了解决行业特定问题的全面方法。

The criteria defining each category are elaborated in the following sections, with representative examples for each provided in Figure 1. For each question type, we also present the associated support data and the expected reasoning processes to illustrate the differences between these categories. These inquiries are formulated by experts in pharmaceutical applications, based on the data released by the FDA.

定义每个类别的标准将在以下章节中详细阐述,并在图 1 中提供了每个类别的代表性示例。对于每种问题类型,我们还提供了相关的支持数据和预期的推理过程,以说明这些类别之间的差异。这些查询是由药物应用专家根据 FDA 发布的数据制定的。

Factual Questions These questions seek specific, concrete pieces of information explicitly presented in the original corpus. The referenced text can be processed within the context of a conversation in LLMs. As shown in Figure 1, this class of questions can be effectively answered if the relevant fact is successfully retrieved.

事实性问题:这类问题寻求原始语料库中明确呈现的具体、确切的信息。所引用的文本可以在大语言模型的对话上下文中进行处理。如图 1 所示,只要成功检索到相关事实,这类问题就能得到有效回答。

Linkable-Reasoning Questions Answering these questions necessitates gathering pertinent information from diverse sources and/or executing multi-step reasoning. The answers may be implicitly distributed across multiple texts. Due to variations in the linking and reasoning processes, we further divide this category into four subcategories: bridging questions, comparative questions, quantitative questions, and summarizing questions. Examples of each subcategory are illustrated in Figure 1. Specifically, bridging questions involve sequentially bridging multiple entities to derive the answer. Quantitative questions require statistical analysis based on the retrieved data. Comparative questions focus on comparing specified attributes of two entities. Summarizing questions require condensing or synthesizing information from multiple sources or large volumes of text into a concise, coherent summary, and they often involve integrating key points, identifying main themes, or drawing conclusions based on the aggregated content. Summarizing questions may combine elements of other question types, such as bridging, comparative, or quantitative questions, as they frequently require the extraction and integration of diverse pieces of information to generate a comprehensive and meaningful summary. Given these questions require multi-step retrieval and reasoning, it is crucial to establish a reasonable operation route for answer-seeking in interaction with the knowledge base.

可链接推理问题:回答这些问题需要从不同来源收集相关信息或执行多步推理。答案可能隐含分布在多个文本中。由于链接和推理过程的差异,我们进一步将此类问题分为四个子类别:桥接问题、比较问题、定量问题和总结问题。每个子类别的示例如图 1 所示。具体来说,桥接问题涉及依次桥接多个实体以得出答案。定量问题需要基于检索到的数据进行统计分析。比较问题侧重于比较两个实体的指定属性。总结问题需要将多个来源或大量文本中的信息浓缩或综合为一个简洁、连贯的摘要,并且通常涉及整合关键点、识别主题或基于聚合内容得出结论。总结问题可能结合了其他问题类型的元素,例如桥接、比较或定量问题,因为它们经常需要提取和整合不同的信息片段以生成全面且有意义的摘要。鉴于这些问题需要多步检索和推理,在与知识库的交互中建立合理的答案寻找操作路径至关重要。

Predictive Questions For this type of questions, the answers are not directly available in the original text and may not be purely factual, necessitating inductive reasoning and prediction based on existing facts. To harness the predictive capabilities of LLMs or other external prediction tools, it is essential to gather and organize relevant knowledge to generate structured data for further analysis. For instance, as illustrated in Figure 1, all biosimilar products with the approval dates are retrieved, and the total number of approvals for each year is calculated and organized to year-indexed time series data for prediction purposes. Furthermore, it is important to note that the correct answer to predictive questions may not be unique, reflecting the inherent uncertainty and variability in predictive tasks.

预测性问题:对于这类问题,答案并不直接存在于原文中,且可能并非纯粹的事实,需要基于现有事实进行归纳推理和预测。为了利用大语言模型或其他外部预测工具的预测能力,必须收集并组织相关知识,生成结构化数据以供进一步分析。例如,如图 1 所示,检索所有生物类似物产品及其批准日期,计算每年的批准总数,并整理为以年份为索引的时间序列数据用于预测。此外,值得注意的是,预测性问题的正确答案可能并不唯一,这反映了预测任务固有的不确定性和可变性。
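The organization step described above, turning retrieved approval records into a year-indexed time series, might look like this sketch. The records and the mean-based "forecast" are fabricated placeholders for real retrieved data and a real prediction tool.

```python
from collections import Counter

# Hypothetical retrieved approval records (made up for illustration).
records = [
    {"product": "Biosim-A", "approved": "2019-03-14"},
    {"product": "Biosim-B", "approved": "2019-11-02"},
    {"product": "Biosim-C", "approved": "2020-06-21"},
    {"product": "Biosim-D", "approved": "2021-01-30"},
]

def to_time_series(records):
    # Count approvals per year and return a sorted year -> count mapping,
    # i.e. the year-indexed time series described in the text.
    counts = Counter(int(r["approved"][:4]) for r in records)
    return dict(sorted(counts.items()))

series = to_time_series(records)
# Trivial stand-in forecast: next year's count as the rounded mean;
# a real system would hand `series` to a proper forecasting tool.
forecast = round(sum(series.values()) / len(series))
```

The key point is the structuring itself: once the scattered facts are organized into an analyzable series, any downstream predictor can be applied, and different predictors may legitimately give different answers.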

Creative Questions One significant demand of RAG is to mine valuable domain-specific logic from professional knowledge bases and introduce novel perspectives that can innovate and advance existing solutions. Addressing creative questions necessitates creative thinking based on the availability of factual information and an understanding of the underlying principles and rules. As illustrated in the example, it is essential to organize the extracted information to highlight key stages and their duration, and then identify common patterns and influential factors. Subsequently, solutions are developed with the objective of evaluating potential outcomes and stimulating fresh ideas. The goal of these responses is to inspire experts to generate innovative ideas, rather than to provide ready-to-implement solutions.

创造性问题
RAG(Retrieval-Augmented Generation)的一个重要需求是从专业知识库中挖掘有价值的领域特定逻辑,并引入能够创新和推进现有解决方案的新颖视角。解决创造性问题需要基于事实信息的可用性以及对基本原理和规则的理解进行创造性思考。如示例所示,必须组织提取的信息以突出关键阶段及其持续时间,然后识别常见模式和有影响力的因素。随后,开发的解决方案旨在评估潜在结果并激发新想法。这些回答的目标是激发专家产生创新想法,而不是提供现成的解决方案。

It is crucial to recognize that the classification of a question may shift with changes in the knowledge base. Questions Q1, Q2, and Q3 in Figure 1, although seemingly similar, are categorized differently depending on the availability of information and the logical steps required to derive an answer. For instance, Q1 is classified as a factual question because it can be directly answered using a table that concisely lists all biosimilar products along with their respective approval dates, providing sufficient explicit information. In contrast, Q2, which inquires about the total count of interchangeable biosimilar products, cannot be resolved by directly referencing a single explicit source. To answer Q2, one must identify all the products meeting the specified criteria and subsequently calculate the total, necessitating an additional step of statistical aggregation. Therefore, Q2 is categorized as a linkable-reasoning question due to the need for an intermediate processing step. Finally, Q3 poses a challenge because the answer does not explicitly exist within the knowledge base. Addressing

关键是要认识到,问题的分类可能会随着知识库的变化而改变。图1中的问题Q1、Q2和Q3,虽然看似相似,但根据信息的可用性和推导答案所需的逻辑步骤,它们被归类为不同类型。例如,Q1被归类为事实性问题,因为它可以通过直接参考一张表格来回答,该表格简明地列出了所有生物类似物产品及其各自的批准日期,提供了足够的显性信息。相比之下,Q2询问的是可互换生物类似物产品的总数,无法通过直接引用单一的显性来源来解决。要回答Q2,必须识别所有符合指定标准的产品,随后计算总数,这需要额外的统计聚合步骤。因此,Q2被归类为可链接推理问题,因为它需要中间处理。最后,Q3提出了一个挑战,因为答案并未在知识库中明确存在。解决

Table 1: Level definition based on RAG system’s capability

表 1: 基于 RAG 系统能力的等级定义

| 等级 | 系统能力描述 |
| --- | --- |
| L1 | L1 系统旨在为事实性问题提供准确可靠的答案,确保基础信息检索的坚实基础。 |
| L2 | L2 系统扩展其功能,包括对事实性问题和可链接推理问题的准确可靠响应,支持更复杂的多步检索和推理任务。 |
| L3 | L3 系统进一步增强其能力,通过加入对预测性问题提供合理预测的能力,同时在回答事实性问题和可链接推理问题时保持准确性和可靠性。 |
| L4 | L4 系统能够为创造性问题提出合理规划或解决方案。此外,它保留了对预测性问题提供合理预测的能力,同时还能为事实性问题和可链接推理问题提供准确可靠的答案。 |

Q3 requires gathering relevant data, organizing it to infer hidden patterns, and making predictions based on these inferred rules. As a result, Q3 is categorized as a predictive question, indicating the requirement to extrapolate beyond the existing data to forecast potential outcomes or trends.

Q3需要收集相关数据,组织数据以推断隐藏的模式,并根据这些推断的规则进行预测。因此,Q3被归类为预测性问题,表明需要超越现有数据进行外推,以预测潜在的结果或趋势。

3.3 RAG system level

3.3 RAG系统层级

In industrial RAG systems, inquiries encompass a broad spectrum of difficulties and are approached from diverse perspectives. Although RAG systems can leverage the general question-answering (QA) abilities of LLMs, their limited comprehension of expert-level knowledge often leads to inconsistent response quality across questions of varying complexities. In response to this status quo, we propose categorizing RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions outlined in the previous subsection. This stratified approach facilitates the phased development of RAG systems, allowing capabilities to be incrementally enhanced through iterative module refinement and algorithmic optimization. Our framework is strategically designed to provide a standardized, objective methodology for developing RAG systems that effectively meet the specialized needs of various industry scenarios. The definition of RAG systems at different levels is presented in Table 1. It highlights the systems' capabilities to handle increasingly complex queries, demonstrating the evolution from simple information retrieval to advanced predictive and creative problem-solving. Each level represents a step towards more sophisticated interactions with knowledge bases, requiring the RAG systems to demonstrate higher levels of understanding, reasoning, and innovation.

在工业 RAG 系统中,查询涵盖了广泛的难度范围,并从不同的角度进行处理。尽管 RAG 系统可以利用大语言模型的通用问答(QA)能力,但它们对专家级知识的有限理解往往导致对不同复杂度问题的回答质量不一致。针对这一现状,我们提议根据 RAG 系统在前一小节中列出的四类问题上的解决能力,将其分为四个不同的层次。这种分层方法有助于 RAG 系统的分阶段开发,通过迭代模块优化和算法改进逐步增强能力。我们的框架旨在为开发 RAG 系统提供一种标准化、客观的方法论,以有效满足各种行业场景的专业需求。不同层次 RAG 系统的定义如表 1 所示。它展示了系统处理日益复杂查询的能力,从简单的信息检索演变为高级的预测性和创造性问题解决。每个层次都代表了与知识库更复杂的交互,要求 RAG 系统展现出更高层次的理解、推理和创新能力。

More specifically, at the foundational level, RAG systems respond to factual questions with answers that are directly extractable from provided texts. Advancing to the second level, RAG systems are equipped to handle complex questions involving linkage and reasoning. These queries necessitate the synthesis of information from disparate sources or multi-step reasoning processes. The RAG system can address a variety of composite questions, including bridging questions that necessitate a sequence of logical reasoning, comparative questions demanding parallel analysis, and summarizing questions that involve condensing information into comprehensive responses. At the third level, the systems are intricately designed to tackle predictive questions where answers are not immediately discernible from the original text. Finally, RAG systems at the fourth level demonstrate the capacity for creative problem-solving, utilizing a solid factual base to foster novel concepts or strategies. While these systems may not offer ready-to-implement solutions, they play a crucial role in stimulating expert creativity to advance fields such as analytics or treatment design.

更具体地说,在基础层面上,RAG 系统能够从提供的文本中直接提取答案来回应事实性问题。进阶到第二层面,RAG 系统具备处理涉及链接和推理的复杂问题的能力。这些查询需要从不同来源综合信息或多步骤的推理过程。RAG 能够应对多种复合问题,包括需要一系列逻辑推理的桥梁问题、要求并行分析的比较问题,以及涉及将信息浓缩为全面回答的总结性问题。在第三层面,这些系统被精心设计以应对预测性问题,这些问题的答案无法直接从原始文本中识别。最后,处于第四层面的 RAG 系统展现了创造性解决问题的能力,利用坚实的事实基础来孕育新的概念或策略。虽然这些系统可能不提供即插即用的解决方案,但它们在激发专家创造力以推动分析或治疗设计等领域的发展中扮演着至关重要的角色。

4 Methodology

4 方法论

4.1 Framework

4.1 框架

Based on the formulation of RAG systems in terms of knowledge base, task classification, and system-level division, we propose a versatile and expandable RAG framework. Within this framework, the progression in levels of RAG systems can be achieved by adjusting submodules within the main modules. The overview of our framework is depicted in Figure 2. The framework primarily consists of several fundamental modules, including file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. In this framework, domain-specific documents of diverse formats are processed by the file parsing module to convert the files to machine-readable formats, and file units are generated to build up the graph in the information source layer. The knowledge extraction module chunks the text and generates corpus and knowledge units to construct the graphs in the corpus layer and distilled knowledge layer. The heterogeneous graph established is utilized as the knowledge base for retrieval. Extracted knowledge is stored in multiple structured formats, and the knowledge retrieval module employs a hybrid retrieval strategy to access relevant information. Note that the knowledge base not only serves as the source of knowledge gathering but also benefits from a feedback loop, where the organized and verified knowledge is regarded as feedback to refine and improve the knowledge base.

基于知识库、任务分类和系统层级划分的 RAG 系统表述,我们提出了一个多功能且可扩展的 RAG 框架。在该框架中,RAG 系统的层级演进可以通过调整主模块中的子模块来实现。图 2 展示了我们框架的概览。该框架主要由几个基础模块组成,包括文件解析、知识提取、知识存储、知识检索、知识组织、以知识为中心的推理以及任务分解与协调。在该框架中,不同格式的领域特定文档通过文件解析模块处理,将文件转换为机器可读格式,并生成文件单元以构建信息源层的图。知识提取模块对文本进行分块并生成语料库和知识单元,以构建语料库层和精炼知识层的图。建立的异构图被用作检索的知识库。提取的知识以多种结构化格式存储,知识检索模块采用混合检索策略来访问相关信息。需要注意的是,知识库不仅是知识收集的来源,还受益于反馈循环,其中组织和验证的知识被视为反馈,以改进和完善知识库。


Figure 2: Overview of the PIKE-RAG framework, comprising several key components: file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, task decomposition and coordination, and knowledge-centric reasoning. Each component can be tailored to meet the evolving demands of system capability.


图 2: PIKE-RAG 框架概览,包含几个关键组件:文件解析、知识提取、知识存储、知识检索、知识组织、任务分解与协调,以及以知识为中心的推理。每个组件都可以根据系统能力的演进需求进行定制。

As highlighted in the task classification examples, questions of different classes require distinct rationale routing for answer-seeking, influenced by multiple factors such as the availability of relevant information, the complexity of knowledge extraction, and the sophistication of reasoning. It is challenging to address these questions in a single retrieval and generation pass. To tackle this, we propose an iterative retrieval-generation mechanism supervised by task decomposition and coordination. This iterative mechanism enables the gradual collection of relevant information and progressive reasoning over incremental context, ensuring a more accurate and comprehensive response. More specifically, the questions in industrial applications are fed into the task decomposition module to produce a preliminary decomposition scheme. This scheme outlines the retrieval steps, reasoning steps, and other necessary operations. Following these instructions, the knowledge retrieval module retrieves relevant information, which is then passed to the knowledge organization module for processing and organization. The organized knowledge is used to perform knowledge-centric reasoning, yielding an intermediate answer. With the updated relevant information and intermediate answer, the task decomposition module regenerates an updated scheme for the next iteration. This design boasts excellent adaptability, allowing us to tackle problems of varying difficulties and perspectives by adjusting the modules and iterative mechanisms.

如任务分类示例所示,不同类别的问题需要不同的推理路径来寻找答案,这受到多种因素的影响,例如相关信息的可用性、知识提取的复杂性以及推理的复杂度。在单次检索和生成过程中解决这些问题具有挑战性。为此,我们提出了一种由任务分解和协调监督的迭代检索-生成机制。这种迭代机制能够逐步收集相关信息,并在增量上下文的基础上进行渐进式推理,从而确保回答更加准确和全面。具体而言,工业应用中的问题被输入到任务分解模块中,生成初步的分解方案。该方案概述了检索步骤、推理步骤以及其他必要操作。遵循这些指令,知识检索模块检索相关信息,然后将其传递给知识组织模块进行处理和组织。组织好的知识用于执行以知识为中心的推理,生成中间答案。随着更新的相关信息和中间答案,任务分解模块重新生成更新的方案以进行下一次迭代。这种设计具有出色的适应性,允许我们通过调整模块和迭代机制来解决不同难度和视角的问题。
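上述迭代检索-生成机制可以用如下 Python 草图表示(retriever、llm 等接口均为假设的可调用对象,仅用于示意控制流,并非论文的正式实现):

```python
def answer_with_iterations(question, retriever, llm, max_iters=4):
    """任务分解与协调监督下的迭代检索-生成机制的极简草图。
    retriever(sub_query) 返回相关文本块列表;
    llm(mode, ...) 为假设的大语言模型接口,按 mode 执行分解/推理/作答。"""
    context, intermediate = [], None
    for _ in range(max_iters):
        # 1) 依据原始问题、已累积上下文与中间答案,更新分解方案(下一个子查询)
        sub_query = llm("decompose", question, context, intermediate)
        if sub_query is None:  # 方案判定信息已足够,停止迭代
            break
        # 2) 检索相关信息并累积(知识组织在此简化为直接追加)
        context.extend(retriever(sub_query))
        # 3) 以知识为中心的推理,得到中间答案
        intermediate = llm("reason", question, context, None)
    # 基于全部累积知识生成最终回答
    return llm("answer", question, context, intermediate)
```

每轮迭代都把更新后的上下文与中间答案交还给任务分解模块,对应图 2 中的反馈回路。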

Table 2: Proposed frameworks for different system levels. To address the challenges facing at each level, we propose customized frameworks based on the framework illustrated in Figure 2. The following abbreviations are used: "PA" for file parsing, "KE" for knowledge extraction, "RT" for knowledge retrieval, "KO" for knowledge organization, and "KR" for knowledge-centric reasoning.

表 2: 针对不同系统层次的定制框架。为了解决每个层次面临的挑战,我们基于图 2 中展示的框架提出了定制化的框架。以下缩写用于表示:“PA”表示文件解析、“KE”表示知识提取、“RT”表示知识检索、“KO”表示知识组织、“KR”表示以知识为中心的推理。

| 等级 | 挑战 | 提出的框架 |
| --- | --- | --- |
| L0 | 由于源文档格式多样,知识提取面临挑战,需要复杂的文件解析技术。· 从原始异构数据构建高质量知识库在知识组织和集成方面引入了显著的复杂性。 | PA KE |
| L1 | 由于不恰当的分块破坏了语义连贯性,阻碍了知识的理解和提取,使准确检索复杂化。· 知识检索受到嵌入模型在对齐专业术语和别名方面的限制,降低了系统的精确度。 | PA KE RT KO KR |
| L2 | 有效的知识提取和利用至关重要,因为分块后的文本通常包含相关和不相关信息,确保检索高质量数据对于准确生成至关重要。· 任务的理解和分解及其背后的逻辑往往忽略了支持数据的可用性,严重依赖大语言模型能力。 | PA KR Task Decomp. & Coord. |
| L3 | 这一级别的挑战集中在知识收集和组织上,这对于支持预测推理至关重要。· 大语言模型在应用专业推理逻辑方面存在局限性,限制了其在预测任务中的有效性。 | PA Task Decomp. & Coord. |
| L4 | 困难在于从复杂的知识库中提取连贯的逻辑推理,其中多个因素之间的相互依赖性可能导致非唯一解。· 创造性问题的开放性使得对推理和知识合成过程的评估复杂化,难以定量评估答案质量。 | PA RT Multi-agent Plan. Task Decomp. & Coord. |

4.2 Phased system development

4.2 分阶段系统开发

We have categorized RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions, as outlined in Table 1. Recognizing the pivotal role of knowledge base generation in RAG systems, we designate the construction of the knowledge base as the L0 stage of system development. The challenges faced by RAG systems vary across different levels. We analyze these challenges for each level and propose corresponding frameworks in Table 2. This stratified approach facilitates the phased development of RAG systems, enabling incremental enhancement of capabilities through iterative module refinement and algorithmic optimization.

我们根据RAG系统在四类问题中的解决能力,将其分为四个不同的层次,如表1所示。鉴于知识库生成在RAG系统中的关键作用,我们将知识库的构建定义为系统开发的L0阶段。RAG系统在不同层次面临的挑战各不相同。我们在表2中分析了每个层次的挑战,并提出了相应的框架。这种分层方法有助于RAG系统的分阶段开发,通过迭代模块优化和算法改进,逐步提升系统能力。

We observe that from L0 to L4, higher-level systems can inherit modules from lower levels and add new modules to enhance system capabilities. For instance, compared to an L1 system, an L2 system not only introduces a task decomposition and coordination module to leverage iterative retrieval-generation routing but also incorporates more advanced knowledge extraction modules, such as distilled knowledge generation, indicated in dark green in Figure 2. In the L3 system, the growing emphasis on predictive questioning necessitates enhanced requirements for knowledge organization and reasoning. Consequently, the knowledge organization module introduces additional submodules for knowledge structuring and knowledge induction, indicated in dark orange. Similarly, the knowledge-centric reasoning module has been expanded to include a forecasting submodule, highlighted in dark purple. In the L4 system, extracting complex rationale from an established knowledge base is highly challenging. To address this, we introduce multi-agent planning module to activate reasoning from diverse perspectives.

我们观察到,从 L0 到 L4,更高级的系统可以从低层级继承模块,并添加新模块以增强系统能力。例如,与 L1 系统相比,L2 系统不仅引入了任务分解和协调模块以利用迭代检索-生成路由,还引入了更高级的知识提取模块,如蒸馏知识生成,如图 2 中的深绿色所示。在 L3 系统中,对预测性问题的日益重视对知识组织和推理提出了更高的要求。因此,知识组织模块引入了额外的子模块,用于知识结构化和知识归纳,如深橙色所示。同样,以知识为中心的推理模块也扩展了预测子模块,如深紫色所示。在 L4 系统中,从已建立的知识库中提取复杂原理具有高度挑战性。为了解决这个问题,我们引入了多智能体规划模块,以激活从不同角度进行的推理。
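上述"高层级继承低层级模块并叠加新模块"的关系可以归纳为如下配置草图(缩写沿用表 2;具体组合是对正文描述的假设性归纳,模块名仅作示意):

```python
# 各层级系统的模块组合:高层级在低层级基础上叠加新模块
LEVEL_MODULES = {
    "L1": ["PA", "KE", "RT", "KO", "KR"],
}
# L2:引入任务分解与协调,利用迭代检索-生成路由
LEVEL_MODULES["L2"] = LEVEL_MODULES["L1"] + ["TaskDecomp&Coord"]
# L3:知识组织新增知识结构化与知识归纳子模块,推理新增预测子模块
LEVEL_MODULES["L3"] = LEVEL_MODULES["L2"] + [
    "KnowledgeStructuring", "KnowledgeInduction", "Forecasting",
]
# L4:引入多智能体规划,从不同角度激活推理
LEVEL_MODULES["L4"] = LEVEL_MODULES["L3"] + ["MultiAgentPlanning"]
```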


Figure 3: Multi-layer heterogeneous graph as the knowledge base. The graph comprises three distinct layers: information resource layer, corpus layer and distilled knowledge layer.

图 3: 多层异构图作为知识库。该图包含三个不同的层次:信息资源层、语料库层和提炼知识层。

5 Detailed Implementation

5 详细实现

In this section, we delve into the implementation specifics of each module within our proposed versatile and expandable RAG framework. By elucidating the details at each level, we aim to provide a comprehensive understanding of how the framework operates and how its modularity and expandability are achieved. The subsections that follow will cover the file parsing, knowledge extraction, knowledge storage, knowledge-centric reasoning, and task decomposition and coordination modules, providing insights into their individual functionalities and interactions.

在本节中,我们将深入探讨我们提出的多功能且可扩展的 RAG 框架中每个模块的实现细节。通过阐明每个层次的细节,我们的目标是提供对该框架如何运作以及如何实现其模块化和可扩展性的全面理解。接下来的小节将涵盖文件解析、知识提取、知识存储、以知识为中心的推理以及任务分解和协调模块,深入探讨它们各自的功能和交互。

5.1 Level-0: Knowledge Base Construction

5.1 Level-0: 知识库构建

The foundational stage of the proposed RAG systems, designated as the L0 system, focuses on the construction of a robust and comprehensive knowledge base. This stage is critical for enabling effective knowledge retrieval in subsequent levels. The primary objective of the L0 system is to process and structure domain-specific documents, transforming them into a machine-readable format and organizing the extracted knowledge into a heterogeneous graph. This graph serves as the backbone for all higher-level reasoning and retrieval tasks. The L0 system encompasses several key modules: file parsing, knowledge extraction, and knowledge storage. Each of these modules plays a crucial role in ensuring that the knowledge base is both extensive and accurately reflects the underlying information contained within the source documents.

所提出的 RAG 系统的基础阶段被指定为 L0 系统,专注于构建一个健壮且全面的知识库。这一阶段对于在后续层级实现有效的知识检索至关重要。L0 系统的主要目标是处理和结构化特定领域的文档,将其转换为机器可读的格式,并将提取的知识组织成异构图。该图作为所有更高级别推理和检索任务的基础。L0 系统包含几个关键模块:文件解析、知识提取和知识存储。这些模块中的每一个都在确保知识库既广泛又准确反映源文档中包含的底层信息方面发挥着关键作用。

5.1.1 File parsing

5.1.1 文件解析

The ability to effectively parse and read various types of files is a critical component in the development of RAG systems that rely on diverse data sources. Frameworks such as LangChain provide a comprehensive suite of tools for natural language processing (NLP), including modules for parsing and extracting information from unstructured text documents. Their file reader capabilities are designed to handle a wide range of file formats, ensuring that data from heterogeneous sources can be seamlessly integrated into the system. Additionally, several deep learning-based tools [2, 3] and commercial cloud APIs [1, 4] have been developed to conduct robust Optical Character Recognition (OCR) and accurate table extraction, enabling the conversion of scanned documents and images into structured, machine-readable text. Given that domain-specific files often encompass sophisticated tables, charts, and figures, text-based conversion may lead to information loss and disrupt the inherent logical structure. Therefore, we propose conducting layout analysis for these files and preserving multi-modal elements such as charts and figures. The layout information can aid the chunking operation, maintaining the completeness of chunked text, while figures and charts can be described

有效解析和读取各类文件的能力是开发依赖多样化数据源的 RAG 系统的关键组成部分。诸如 LangChain 等框架提供了一套全面的自然语言处理 (NLP) 工具,包括用于从非结构化文本文档中解析和提取信息的模块。其文件读取功能设计用于处理广泛的文件格式,确保来自异构源的数据能够无缝集成到系统中。此外,还开发了几种基于深度学习的工具 [2, 3] 和商用云 API [1, 4],以进行稳健的光学字符识别 (OCR) 和准确的表格提取,从而将扫描文档和图像转换为结构化的、机器可读的文本。鉴于特定领域的文件通常包含复杂的表格、图表和图形,基于文本的转换可能导致信息丢失并破坏固有的逻辑结构。因此,我们建议对这些文件进行布局分析,并保留图表和图形等多模态元素。布局信息可以帮助分块操作,保持分块文本的完整性,同时图表和图形可以被描述

Figure 4: The process of distilling knowledge from corpus text. The corpus text are processed to extract knowledge units following customized extraction patterns. These knowledge units are then organized to structured knowledge in the distilled knowledge layer, which may take the form of knowledge graphs, atomic knowledge, tabular knowledge, and other induced knowledge.

图 4: 从语料文本中提炼知识的过程。语料文本经过处理,按照定制的提取模式抽取知识单元。这些知识单元随后被组织成结构化知识,存储在提炼的知识层中,其形式可能包括知识图谱、原子知识、表格知识以及其他归纳知识。

by Vision-Language Models (VLMs) to assist in knowledge retrieval. This approach ensures that the integrity and richness of the original documents are retained, enhancing the efficacy of RAG systems.

通过视觉语言模型(VLMs)辅助知识检索。这种方法确保了原始文档的完整性和丰富性,提升了RAG系统的效率。

5.1.2 Knowledge Organization

5.1.2 知识组织

The proposed knowledge base is structured as a multi-layer heterogeneous graph, representing different levels of information granularity and abstraction. The graph captures relationships between various components of the data (e.g., documents, sections, chunks, figures, and tables) and organizes them into nodes and edges, reflecting their interconnections and dependencies. As depicted in Figure 3, this multi-layer structure, encompassing the information resource layer, corpus layer, and distilled knowledge layer, enables both semantic understanding and rationale-based retrieval for downstream tasks.

所提出的知识库被构建为一个多层异构图,表示不同层次的信息粒度和抽象。该图捕获了数据各个组成部分之间的关系(例如文档、章节、片段、图表和表格),并将它们组织成节点和边,反映它们的相互联系和依赖关系。如图 3 所示,这种多层结构包括信息资源层、语料库层和精炼知识层,能够为下游任务实现语义理解和基于推理的检索。

Information Resource Layer: This layer captures the diverse information sources, treating them as source nodes with edges that denote referential relationships among them. This structure aids in cross-referencing and contextualizing the knowledge, establishing a foundation for reasoning that depends on multiple sources.

信息资源层:该层捕捉多样化的信息源,将其视为源节点,并通过边表示它们之间的引用关系。这种结构有助于交叉引用和知识的情境化,为依赖多源的推理奠定了基础。

Corpus Layer: This layer organizes the parsed information into sections and chunks while preserving the document’s original hierarchical structure. Multi-modal content such as tables and figures is summarized by LLMs and integrated as chunk nodes, ensuring that multi-modal knowledge is available for retrieval. This layer enables knowledge extraction with varying levels of granularity, allowing for accurate semantic chunking and retrieval across diverse content types.

语料库层:该层将解析后的信息组织成章节和块,同时保留文档的原始层次结构。多模态内容(如表格和图表)由大语言模型进行总结,并作为块节点集成,确保多模态知识可用于检索。该层支持不同粒度的知识提取,允许跨多种内容类型进行准确的语义分块和检索。

Distilled Knowledge Layer: The corpus is further distilled into structured forms of knowledge (e.g., knowledge graphs, atomic knowledge, and tabular knowledge). This process, driven by techniques like Named Entity Recognition (NER) [19] and relationship extraction [40], ensures that the distilled knowledge captures key logical relationships and entities, supporting advanced reasoning processes. By organizing this structured knowledge in a distilled layer, we enhance the system’s ability to reason and synthesize based on deeper domain-specific knowledge. The knowledge distillation process is depicted in Figure 4. Below are the detailed distillation processes for typical knowledge forms.

蒸馏知识层:语料库进一步被提炼为结构化的知识形式(如知识图谱、原子知识和表格知识)。这一过程由命名实体识别(NER)[19] 和关系抽取 [40] 等技术驱动,确保蒸馏后的知识捕捉到关键的逻辑关系和实体,支持高级推理过程。通过将这些结构化知识组织在蒸馏层中,我们增强了系统基于更深层次的领域知识进行推理和综合的能力。知识蒸馏过程如图 4 所示。以下是典型知识形式的详细蒸馏过程。
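上述三层结构可以用如下极简的 Python 草图表示(节点与关系名称均为示意性假设,实际系统可采用图数据库等更成熟的存储方案):

```python
# 多层异构图知识库的极简草图
class HeteroKG:
    def __init__(self):
        self.nodes = {}   # node_id -> {"layer": 层名, "type": 节点类型, "payload": 内容}
        self.edges = []   # (源节点, 关系, 目标节点)

    def add_node(self, nid, layer, ntype, payload=None):
        self.nodes[nid] = {"layer": layer, "type": ntype, "payload": payload}

    def add_edge(self, src, rel, dst):
        self.edges.append((src, rel, dst))

    def layer_nodes(self, layer):
        """按层检索节点,对应不同粒度的知识访问。"""
        return [n for n, a in self.nodes.items() if a["layer"] == layer]

kg = HeteroKG()
kg.add_node("doc1", "source", "document")                          # 信息资源层
kg.add_node("doc1#chunk0", "corpus", "chunk", "……分块文本……")      # 语料库层
kg.add_node("atom0", "distilled", "atomic_question", "哪一年批准了产品 A?")  # 蒸馏知识层
kg.add_edge("doc1", "contains", "doc1#chunk0")       # 信息资源层 -> 语料库层
kg.add_edge("doc1#chunk0", "distills_to", "atom0")   # 语料库层 -> 蒸馏知识层
```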


Figure 5: Illustration of enhanced chunking with recurrent text splitting.

图 5: 使用循环文本分割进行增强分块的示意图。

5.2 Level-1: Factual Question focused RAG System

5.2 第一层:基于事实问题的 RAG 系统

Building upon the L0 system, the L1 system introduces knowledge retrieval and knowledge organization to realize its retrieval and generation capabilities. The primary challenges at this level are semantic alignment and chunking. The abundance of professional terminology and aliases can affect the accuracy of chunk retrieval, and unreasonable chunking can disrupt semantic coherence and introduce noise interference. To mitigate these issues, the L1 system incorporates more sophisticated query analysis techniques and basic knowledge extraction modules. The architecture is expanded to include components that facilitate task decomposition, coordination, and initial stages of knowledge organization (KO), ensuring that the system can manage more complex queries effectively.

在 L0 系统的基础上,L1 系统引入了知识检索和知识组织,以实现其检索和生成能力。这一层级的主要挑战是语义对齐和分块。专业术语和别名的丰富性可能影响分块检索的准确性,而不合理的分块则会破坏语义连贯性并引入噪声干扰。为了解决这些问题,L1 系统采用了更复杂的查询分析技术和基本知识提取模块。架构扩展了任务分解、协调和知识组织 (KO) 初期的组件,确保系统能够有效处理更复杂的查询。


Figure 6: Overview of the L1 RAG framework. The colored squares indicate the enhanced chunking and auto-tagging sub-modules in the knowledge extraction module.

图 6: L1 RAG 框架概览。彩色方框表示知识提取模块中的增强分块和自动标记子模块。

5.2.1 Enhanced chunking

5.2.1 增强分块

Chunking involves breaking down a large corpus of text into smaller, more manageable segments. The primary chunking strategies commonly utilized in RAG systems include fixed-size chunking, semantic chunking, and hybrid chunking. Chunking is essential for improving both the efficiency and accuracy of the retrieval process, which consequently affects the overall performance of RAG models in multiple dimensions. In our system, each chunk serves dual purposes: (i) it becomes a unit of information that is vectorized and stored in a database for retrieval, and (ii) it acts as a source for further knowledge extraction and information summarization. Improper chunking not only fails to ensure that text vectors encapsulate the necessary semantic information, but also hinders knowledge extraction based on complete context. For instance, in the context of laws and regulations, a fixed-size chunking approach is prone to destroying text semantics and omitting key conditions, thereby affecting the quality and accuracy of subsequent knowledge extraction.

分块 (Chunking) 涉及将大量文本分解为更小、更易管理的片段。RAG 系统中常用的主要分块策略包括固定大小分块、语义分块和混合分块。分块对于提高检索过程的效率和准确性至关重要,这进而影响 RAG 模型在多方面的整体性能。在我们的系统中,每个分块都承担双重作用:(i) 它成为向量化并存储在数据库中以便检索的信息单元,(ii) 它充当进一步知识提取和信息汇总的源。不恰当的分块不仅无法确保文本向量包含必要的语义信息,还会阻碍基于完整上下文的知识提取。例如,在法律法规的背景下,固定大小分块方法容易破坏文本语义并遗漏关键条件,从而影响后续知识提取的质量和准确性。

We propose a text split algorithm to enhance existing chunking methods by breaking down large text documents into smaller, manageable chunks while preserving context and enabling effective summary generation for each chunk. The chunking process is illustrated in Figure 5. Given a source text, the algorithm iteratively splits the text into chunks. During the first iteration, it generates a forward summary of the initial chunk, providing context for generating summaries of subsequent chunks and maintaining a coherent narrative across splits. Each chunk is summarized using a predefined prompt template that incorporates both the forward summary and the current chunk. This summary is then stored alongside the chunk. The algorithm adjusts the text by removing the processed chunk and updating the forward summary with the summary of the current chunk, preparing for the next iteration. This process continues until the entire text is split and summarized. Additionally, the algorithm can dynamically adjust chunk sizes based on the content and structure of the text.

我们提出了一种文本分割算法,旨在增强现有的分块方法,通过将大型文本文档分解为更小、可管理的块,同时保留上下文并为每个块生成有效的摘要。分块过程如图 5 所示。给定源文本,该算法会迭代地将文本分割成块。在第一次迭代中,它生成初始块的前向摘要,为后续块的摘要生成提供上下文,并在分割中保持连贯的叙述。每个块都使用预定义的提示模板进行摘要,该模板结合了前向摘要和当前块的内容。然后,该摘要与块一起存储。算法通过移除已处理的块并使用当前块的摘要更新前向摘要来调整文本,为下一次迭代做好准备。此过程持续进行,直到整个文本被分割并摘要完毕。此外,该算法可以根据文本的内容和结构动态调整块的大小。
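上述带前向摘要的循环分割过程可以概括为如下 Python 草图(summarize 为假设的 LLM 摘要接口;定长切分仅作简化,实际实现可按段落或语义边界动态调整切分点):

```python
def split_with_forward_summary(text, summarize, chunk_size=800):
    """循环文本分割的极简草图。
    summarize(forward_summary, chunk) -> 当前块的摘要(假设由 LLM 实现);
    每次分割都携带前向摘要,以保持跨块叙述的连贯性。"""
    chunks, forward_summary = [], ""
    while text:
        # 取出当前块(此处简化为定长切分)
        chunk, text = text[:chunk_size], text[chunk_size:]
        # 结合前向摘要与当前块生成摘要,并与块一起存储
        summary = summarize(forward_summary, chunk)
        chunks.append({"text": chunk, "summary": summary})
        # 用当前块摘要更新前向摘要,准备下一次迭代
        forward_summary = summary
    return chunks
```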


Figure 7: Illustration of the auto-tagging module.

图 7: 自动标注模块的示意图。

5.2.2 Auto-tagging

5.2.2 自动标签

In domain-specific RAG scenarios, the corpus is typically characterized by formal, professional, and rigorously expressed content, whereas the questions posed are often articulated in plain, easily understandable colloquial language. For instance, in medical question-answering (medQA) tasks [32], symptoms of diseases described in the questions are generally phrased in simple, conversational terms. In contrast, the corresponding medical knowledge within the corpus is often expressed using specialized professional terminology. This discrepancy introduces a domain gap that adversely affects the accuracy of chunk retrieval, especially given the limitations of the embedding models employed for this purpose.

在特定领域的 RAG 场景中,语料库通常以正式、专业且严谨表达的内容为特征,而提出的问题则往往以简单易懂的口语化语言表达。例如,在医疗问答 (medQA) 任务 [32] 中,问题中描述的疾病症状通常以简单的对话术语表达。相比之下,语料库中的相应医学知识则往往使用专业的术语表达。这种差异引入了领域差距,对块检索的准确性产生了不利影响,特别是在用于此目的的嵌入模型存在限制的情况下。

To address the domain gap issue, we propose an auto-tagging module designed to minimize the disparity between the source documents and the queries. This module preprocesses the corpus to extract a comprehensive collection of domain-specific tags or to establish tag mapping rules. Prior to the retrieval process, tags are extracted from the query and then mapped to the corpus domain using the preprocessed tag collection or tag pair collection. This tag-based domain adaptation can be employed for query rewriting or keyword retrieval within sequential information retrieval frameworks, thereby enhancing both the recall and precision of the retrieval process.

为解决领域差距问题,我们提出了一个自动标注模块,旨在最小化源文档与查询之间的差异。该模块对语料库进行预处理,以提取全面的领域特定标签或建立标签映射规则。在检索过程之前,从查询中提取标签,然后使用预处理的标签集合或标签对集合将其映射到语料库领域。这种基于标签的领域适应可以用于序列信息检索框架中的查询重写或关键词检索,从而提高检索过程的召回率和精度。

Specifically, we leverage the capabilities of the LLMs to identify key factors within the corpus chunks, summarize these factors, and generalize them into category names, which we refer to as "tag classes." We generate semantic tag extraction prompts based on these tag classes to facilitate accurate tag extraction. In scenarios where only the corpus is available, LLMs are employed with meticulously designed prompts to extract semantic tags from the corpus, thereby forming a comprehensive corpus tag collection. When practical QA samples are available, semantic tag extraction is performed on both the queries and the corresponding retrieved answer chunks. Using the tag sets extracted from the chunks and queries, LLMs are utilized to map cross-domain semantic tags and generate a tag pair collection. After establishing both the corpus tag collection and the tag pair collection, tags can be extracted from the query, and the corresponding mapped tags can be identified within the collections. These mapped tags are then used to enhance subsequent information retrieval processes, improving both recall and precision. This workflow leverages the advanced understanding and contextual capabilities of LLMs for domain adaptation.

具体而言,我们利用大语言模型的能力,在语料块中识别关键因素,总结这些因素,并将其泛化为类别名称,我们称之为“标签类”。基于这些标签类,我们生成语义标签提取提示,以促进准确的标签提取。在仅有语料可用的情况下,大语言模型通过精心设计的提示从语料中提取语义标签,从而形成一个全面的语料标签集合。当有实际的问答样本时,对查询和相应的检索答案块都进行语义标签提取。利用从语料块和查询中提取的标签集,大语言模型被用于映射跨域语义标签,并生成标签对集合。在建立语料标签集合和标签对集合后,可以从查询中提取标签,并在集合中识别相应的映射标签。这些映射标签随后用于增强后续的信息检索过程,提高召回率和精确率。该工作流程利用了大语言模型的先进理解和上下文能力,进行领域适应。
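上述标签映射步骤可以用如下 Python 草图示意(标签对与查询标签均为假设数据;实际系统中标签对集合由大语言模型离线构建,查询标签由大语言模型在线抽取):

```python
def map_query_tags(query_tags, tag_pairs):
    """利用预处理得到的标签对集合,将口语化的查询标签映射为语料域术语。
    tag_pairs 形如 {查询域标签: 语料域标签};未命中的标签原样保留。"""
    return [tag_pairs.get(t, t) for t in query_tags]

# 假设的跨域标签对集合(如医疗问答中口语 -> 专业术语)
tag_pairs = {"拉肚子": "腹泻", "头晕": "眩晕"}
# 假设由 LLM 从用户查询中抽取到的标签
extracted = ["拉肚子", "发热"]
print(map_query_tags(extracted, tag_pairs))
```

映射后的标签可用于查询改写或关键词检索,从而缩小查询与语料之间的领域差距。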


Figure 8: Overview of multi-layer, multi-granularity retrieval over heterogeneous graph

图 8: 异质图多层多粒度检索概述

5.2.3 Multi-Granularity Retrieval

5.2.3 多粒度检索

The L1 system is designed to enable multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph, which was constructed in the L0 system. Each layer of the graph (e.g., information source layer, corpus layer, distilled knowledge layer) represents knowledge at different levels of abstraction and granularity, allowing the system to explore and retrieve relevant information at various scales. For example, queries can be mapped to entire documents (information source layer) or specific chunks of text (corpus layer), ensuring that knowledge can be retrieved at the appropriate level for a given task. To support this, similarity scores between queries and graph nodes are computed to measure the alignment between the query and the retrieved knowledge. These scores are then propagated through the layers of the graph, allowing the system to aggregate information from multiple levels. This multi-layer propagation ensures that retrieval can be fine-tuned based on both the broader context (e.g., entire documents) and finer details (e.g., specific chunks or distilled knowledge). The final similarity score is generated through a combination of aggregation and propagation, ensuring that knowledge extraction and utilization are optimized for both precision and efficiency in factual question answering. The retrieval process can be iterative, refining the results based on sub-queries generated through task decomposition, further enhancing the system's ability to generate accurate and contextually relevant answers.

L1 系统旨在实现跨异构知识图谱的多层次、多粒度检索,该图谱在 L0 系统中构建。图谱的每一层(例如信息源层、语料层、提炼知识层)代表了不同抽象层次和粒度的知识,使系统能够在不同尺度上探索和检索相关信息。例如,查询可以映射到整个文档(信息源层)或特定的文本块(语料层),确保知识可以根据给定任务在适当的层次上被检索。为了支持这一点,系统计算查询与图谱节点之间的相似度分数,以衡量查询与检索知识之间的匹配程度。这些分数随后通过图谱的各层传播,使系统能够从多个层次聚合信息。这种多层传播确保检索可以根据更广泛的上下文(例如整个文档)和更精细的细节(例如特定块或提炼知识)进行微调。最终相似度分数通过聚合和传播的组合生成,确保知识提取和利用在事实性问答中的精度和效率都得到优化。检索过程可以是迭代的,基于任务分解生成的子查询优化结果,进一步增强系统生成准确且上下文相关答案的能力。

The overview of multi-layer, multi-granularity retrieval is depicted in Figure 8. For each layer of the graph, both the query $Q$ and the graph nodes are transformed into high-dimensional vector embeddings for similarity evaluation. We denote the similarity evaluation operation as $g()$. Here, $I$, $C$, and $D$ indicate the node sets in the information source layer, corpus layer, and distilled knowledge layer, respectively. The propagation and aggregation operations are represented by the function $f()$. The final chunk similarity score $S$ is obtained by aggregating the scores from other layers and nodes.

图 8 展示了多层多粒度检索的概览。对于图的每一层,查询 $Q$ 和图节点都被转换为高维向量嵌入以进行相似性评估。我们将相似性评估操作表示为 $g()$ 。这里,$I,C$ 和 $D$ 分别表示信息源层、语料层和蒸馏知识层中的节点集。传播和聚合操作由函数 $f()$ 表示。最终的块相似性得分 $S$ 是通过聚合其他层和节点的得分得到的。
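作为示意,下面给出相似度评估 $g()$ 与传播聚合 $f()$ 的一种假设性实现草图(以余弦相似度和固定权重加权和为例;权重与聚合形式均为假设,具体以实际系统为准):

```python
import math

def g(q_vec, n_vec):
    """相似度评估 g():此处以余弦相似度作为示意。"""
    dot = sum(a * b for a, b in zip(q_vec, n_vec))
    norm = (math.sqrt(sum(a * a for a in q_vec))
            * math.sqrt(sum(b * b for b in n_vec)))
    return dot / norm if norm else 0.0

def chunk_score(q, chunk, parent_doc, atoms, w=(0.6, 0.2, 0.2)):
    """传播与聚合 f() 的一种假设实现:
    块自身得分(语料库层)、所属文档得分(信息源层)与
    关联原子问题的最高得分(蒸馏知识层)按固定权重加权求和。"""
    s_chunk = g(q, chunk)
    s_doc = g(q, parent_doc)
    s_atom = max((g(q, a) for a in atoms), default=0.0)
    return w[0] * s_chunk + w[1] * s_doc + w[2] * s_atom
```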

5.3 Level-2: Linkable and Reasoning Question focused RAG System

5.3 第二层级:可链接且专注于推理问题的 RAG 系统

The core functionality of the L2 system lies in its ability to efficiently retrieve multiple sources of relevant information and perform complex reasoning based on it. To facilitate this, the L2 system integrates an advanced knowledge extraction module that comprehensively identifies and extracts pertinent information. Furthermore, a task decomposition and coordination module is implemented to break down intricate tasks into smaller, manageable sub-tasks, thereby enhancing the system’s efficiency in handling them. The proposed framework of L2 RAG system is illustrated in Figure 9.

L2 系统的核心功能在于其能够高效检索多个相关信息来源并基于此进行复杂推理。为了实现这一点,L2 系统集成了一个高级的知识提取模块,全面识别并提取相关信息。此外,系统还实现了一个任务分解与协调模块,将复杂任务拆分为更小、更易管理的子任务,从而提升系统处理这些任务的效率。L2 RAG 系统的框架如图 9 所示。

Chunked text contains multifaceted information, increasing the complexity of retrieval. Recent studies have focused on extracting triple knowledge units from chunked text and constructing knowledge graphs to facilitate efficient information retrieval [20, 42]. However, the construction of knowledge graphs is costly, and the inherent knowledge may not always be fully explored. To better present the knowledge embedded in the documents, we propose atomizing the original documents in the Knowledge Extraction phase, a process we refer to as Knowledge Atomizing. Besides, industrial tasks often necessitate multiple pieces of knowledge, implicitly requiring the capability to decompose the original question into several sequential or parallel atomic questions. We refer to this operation as

分块文本包含多方面的信息,增加了检索的复杂性。最近的研究集中在从分块文本中提取三元组知识单元并构建知识图谱,以促进高效的信息检索 [20, 42]。然而,知识图谱的构建成本较高,且内在知识可能并不总是被完全挖掘。为了更好地呈现文档中嵌入的知识,我们提出在知识提取阶段对原始文档进行原子化处理,这一过程我们称之为知识原子化。此外,工业任务通常需要多条知识,这隐含地要求具备将原始问题分解为多个顺序或并行的原子问题的能力。我们将此操作称为

Task Decomposition. By combining the extracted atomic knowledge with the original chunks, we construct an atomic hierarchical knowledge base. Each time we decompose a task, the hierarchical knowledge base provides insights into the available knowledge, enabling knowledge-aware task decomposition.

任务分解。通过将提取的原子知识与原始块结合,我们构建了一个原子层次知识库。每次分解任务时,层次知识库都能提供可用知识的洞察,从而实现知识感知的任务分解。

5.3.1 Knowledge Atomizing

5.3.1 知识原子化

We believe that a single document chunk often encompasses multiple pieces of knowledge. Typically, the information necessary to address a specific task represents only a subset of the entire knowledge. Therefore, consolidating these pieces within a single chunk, as traditionally done in information retrieval, may not facilitate the efficient retrieval of the precise information required. To align the granularity of knowledge with the queries generated during task solving, we propose a method called knowledge atomizing. This approach leverages the context understanding and content generation capabilities of LLMs to automatically tag atomic knowledge pieces within each document chunk. Note that these chunks could be segments of an original reference document; description chunks generated for tables, images, or videos; or summary chunks of entire sections, chapters, or even documents.

我们相信单个文档块通常包含多个知识片段。通常,解决特定任务所需的信息只是整个知识的一部分。因此,像传统信息检索那样将这些片段整合在一个块中,可能无法有效检索到所需的精确信息。为了使知识的粒度与任务解决过程中生成的查询相匹配,我们提出了一种称为知识原子化的方法。该方法利用大语言模型的上下文理解和内容生成能力,自动标记每个文档块中的原子知识片段。请注意,这些块可以是原始参考文档的片段、为表格、图像、视频生成的描述块,甚至是整个章节或文档的摘要块。

The presentation of atomic knowledge can take various forms. Instead of utilizing declarative sentences or subject-relationship-object tuples, we propose using questions as knowledge indexes to further bridge the gap between stored knowledge and queries. Unlike the semantic tagging process, in the knowledge atomizing process we input the document chunk to the LLM as context and ask it to generate as many relevant questions as possible that can be answered by the given chunk. These generated atomic questions are saved as atomic question tags together with the given chunk. An example of knowledge atomizing is demonstrated in Figure 10(c), where the atomic questions encapsulate various aspects of the knowledge contained within the chunk. A hierarchical knowledge base can accommodate queries of varying granularity. Figure 11 illustrates the retrieval process from an atomic knowledge base comprising chunks and atomic questions. Queries can directly retrieve reference chunks as usual. Additionally, since each chunk is tagged with multiple atomic questions, an atomic query can be used to locate relevant atomic questions, which then lead to the associated reference chunks.

原子知识的呈现方式可以多种多样。我们提出使用问题作为知识索引,而不是利用陈述句或主语-关系-宾语元组,以进一步缩小存储知识与查询之间的差距。与语义标记过程不同,在知识原子化过程中,我们将文档块输入到大语言模型 (LLM) 作为上下文,要求它生成尽可能多的可以由给定块回答的相关问题。这些生成的原子问题与给定块一起保存为原子问题标签。图 10(c) 展示了知识原子化的一个例子,其中原子问题涵盖了块内包含知识的各个方面。分层知识库可以适应不同粒度的查询。图 11 展示了从包含块和原子问题的原子知识库中检索的过程。查询可以像往常一样直接检索参考块。此外,由于每个块都标记了多个原子问题,因此可以使用原子查询来定位相关的原子问题,然后找到相关的参考块。
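The atomizing and atomic-retrieval steps can be sketched as follows. `llm` and `embed` are placeholders for a chat model and an embedding model, and the prompt wording is illustrative rather than the paper's exact prompt.

```python
import numpy as np

# Hypothetical atomizing prompt: ask the LLM to enumerate questions the
# chunk can answer, one per line.
ATOMIZE_PROMPT = (
    "Read the passage below and list, one per line, as many questions "
    "as possible that can be answered using only this passage.\n\n{chunk}"
)

def atomize_chunk(llm, chunk_text):
    # Knowledge atomizing: tag a chunk with the atomic questions it answers.
    reply = llm(ATOMIZE_PROMPT.format(chunk=chunk_text))
    return [q.strip() for q in reply.splitlines() if q.strip()]

def build_atomic_kb(llm, embed, chunks):
    kb = []  # (atomic_question, question_embedding, source_chunk)
    for chunk in chunks:
        for question in atomize_chunk(llm, chunk):
            kb.append((question, embed(question), chunk))
    return kb

def retrieve_by_atoms(embed, kb, query, top_k=4, threshold=0.5):
    # Path (b) of Figure 11: match the query against atomic questions,
    # then return the chunks those questions are tagged on.
    q = embed(query)
    scored = []
    for question, q_emb, chunk in kb:
        sim = float(np.dot(q, q_emb) /
                    (np.linalg.norm(q) * np.linalg.norm(q_emb)))
        if sim >= threshold:
            scored.append((sim, question, chunk))
    scored.sort(reverse=True)
    return scored[:top_k]
```

Because the stored index entries and the incoming queries are both phrased as questions, the embedding-space gap between them is smaller than between a question and a raw declarative passage.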

5.3.2 Knowledge-Aware Task Decomposition

5.3.2 知识感知的任务分解

For a specific task, multiple decomposition strategies might be applicable. Consider Q2 in Figure 1 as an example. The two-step analytical reasoning process depicted may be effective if an interchangeable biosimilar products list is available. However, if only a general list of biosimilar products exists, with attributes dispersed throughout multiple documents, a different decomposition strategy may be necessary: (1) Retrieve the biosimilar product list; (2) Determine whether each product is interchangeable; (3) Count the total number of interchangeable products. The critical factor in selecting the most effective decomposition approach lies in understanding the contents of the specialized knowledge base. Motivated by this, we design the Knowledge-Aware Task Decomposition workflow, which is illustrated in Figure 10(a). The complete algorithm for task solving using Knowledge-Aware Task Decomposition is presented in Algorithm 1.

对于特定任务,可能会有多种分解策略适用。以图 1 中的 Q2 为例,如果存在可互换的生物类似物产品列表,则所展示的两步分析推理过程可能有效。然而,如果仅存在一个通用的生物类似物产品列表,且属性分散在多个文档中,则可能需要采用不同的分解策略:(1) 检索生物类似物产品列表;(2) 确定每个产品是否可互换;(3) 计算可互换产品的总数。选择最有效分解方法的关键在于理解专业知识库的内容。基于此,我们设计了知识感知任务分解工作流程,如图 10(a) 所示。使用知识感知任务分解进行任务求解的完整算法如算法 1 所示。

The reference context $\mathcal{C}_{t}$ is initialized as an empty set, and the original question is denoted by $q$. As illustrated in the for-loop starting at line 2 of the algorithm, in the $t$-th iteration, we use an LLM, denoted by $\mathcal{LLM}$, to generate query proposals potentially useful for task completion, denoted as $\hat{q}_{i}^{t}$.

参考上下文 $\mathcal{C}_{t}$ 初始化为空集,原始问题记为 $q$。如算法第2行开始的for循环所示,在第 $t$ 次迭代中,我们使用一个表示为 $\mathcal{LLM}$ 的大语言模型来生成可能对任务完成有用的查询提议,记为 $\hat{q}_{i}^{t}$。


Figure 10: The illustration of knowledge atomizing and knowledge-aware task decomposition: (a) Workflow of task solving with knowledge-aware task decomposition, (b) Workflow of knowledge atomizing, (c) Example of knowledge atomizing, (d) RAG case with knowledge atomizing and knowledge-aware task decomposition.

图 10: 知识原子化和知识感知任务分解的示意图:(a) 使用知识感知任务分解的任务解决流程,(b) 知识原子化的流程,(c) 知识原子化的示例,(d) 使用知识原子化和知识感知任务分解的 RAG 案例。

In this step, the chosen reference chunks $\mathcal{C}_{t}$ are provided as context to avoid generating proposals linked to already known knowledge. These proposals are then utilized as atomic queries to determine if relevant knowledge exists within the knowledge base. For each atomic question proposal, we retrieve its relevant atomic question candidates along with their source chunks $\{(q_{ij}^{t}, c_{ij}^{t})\}$ from the knowledge base, denoted as $\mathcal{KB}$. We can use any score metric $\mathrm{sim}$ to retrieve atomic questions. In our experiment, we use the cosine similarity of their corresponding embeddings to retrieve the top $K$ atomic questions, provided their similarity to a proposed atomic question is greater than or equal to a given threshold $\delta$. With the original question $q$, the accumulated context $\mathcal{C}_{t}$, and the list of retrieved atomic questions $q_{ij}^{t}$, $\mathcal{LLM}$ selects the most useful atomic question $q^{t}$ from $q_{ij}^{t}$ and retrieves the relevant chunk $c^{t}$. This retrieved chunk is aggregated into the reference context $\mathcal{C}_{t}$ for the next round of decomposition. Knowledge-aware decomposition can iterate up to $N$ times, where $N$ is a hyperparameter set to control computational cost. The iteration process terminates early if there are no high-quality question proposals, no highly relevant atomic candidates retrieved, no suitable atomic knowledge selections, or if the $\mathcal{LLM}$ determines that the acquired knowledge is sufficient to complete the task. Finally, the accumulated context $\mathcal{C}_{t}$ is utilized to generate the answer $\hat{a}$ for the given question $q$ in line 14.

在这一步中,选定的参考块 $\mathcal{C}_{t}$ 被作为上下文提供,以避免生成与已知知识相关的提案。这些提案随后被用作原子查询,以确定知识库中是否存在相关知识。对于每个原子问题提案,我们从知识库(记为 $\mathcal{KB}$)中检索其相关的原子问题候选及其源块 $\{(q_{ij}^{t}, c_{ij}^{t})\}$。我们可以使用任何评分指标 sim 来检索原子问题。在我们的实验中,我们使用其对应嵌入的余弦相似度来检索所有前 $K$ 个原子问题,前提是它们与提案的原子问题的相似度大于或等于给定阈值 $\delta$。通过原始问题 $q$、累积上下文 $\mathcal{C}_{t}$ 以及检索到的原子问题列表 $q_{ij}^{t}$,$\mathcal{LLM}$ 从 $q_{ij}^{t}$ 中选择最有用的原子问题 $q^{t}$ 并检索相关块 $c^{t}$。这个检索到的块被聚合到参考上下文 $\mathcal{C}_{t}$ 中,用于下一轮分解。知识感知分解最多可以迭代 $N$ 次,其中 $N$ 是一个用于控制计算成本的超参数。如果没有高质量的问题提案、没有检索到高度相关的原子候选、没有合适的原子知识选择,或者 $\mathcal{LLM}$ 确定获得的知识足以完成任务,迭代过程将提前终止。最后,累积上下文 $\mathcal{C}_{t}$ 被用于在第 14 行为给定问题 $q$ 生成答案 $\hat{a}$。


Figure 11: Retrieval process from an atomic knowledge base. It supports two retrieval paths: (a) using queries to directly retrieve chunks as usual; (b) locating atomic nodes first then achieving the associated chunks.

图 11: 从原子知识库中检索的过程。它支持两种检索路径:(a) 使用查询直接检索块;(b) 先定位原子节点,然后获取相关的块。

Algorithm 1 Task Solving with Knowledge-Aware Decomposition

算法 1: 基于知识感知分解的任务求解

5.3.3 Knowledge-Aware Task Decomposer Training

5.3.3 知识感知任务分解器训练

It is worth mentioning that the knowledge-aware decomposer can be implemented as a learnable component. Once trained, this proposer can directly suggest atomic queries $q^{t}$ during inference, meaning lines 3 to 5 in Algorithm 1 can be replaced by a single call to the learned proposer, thereby reducing both inference time and computational cost. To train the knowledge-aware decomposer, we collect data about the rationale behind each step by sampling context and creating diverse interaction trajectories. With this data collected, we train a decomposer that can incorporate domain-specific rationale into the task decomposition and result-seeking process.

值得一提的是,知识感知分解可以成为一个可学习的组件。经过训练后,这个提议器可以直接在推理过程中建议原子查询 $q^{t}$,这意味着算法 1 中的第 3 至 5 行可以被对该学习到的提议器的单次调用所取代,从而减少推理时间和计算成本。为了训练知识感知分解器,我们通过采样上下文并创建多样化的交互轨迹来收集每个步骤背后的原理数据。收集到这些数据后,我们训练一个分解器,使其能够将特定领域的原理融入任务分解和结果寻找的过程中。

The data collection process, as depicted in Figure 12 and Algo. 2, implements a sophisticated dual-dictionary system for managing and tracking information. Our system utilizes two primary data structures: dictionary $\boldsymbol{S}$ for maintaining comprehensive score records, and dictionary $\mathcal{V}$ for systematically tracking visit frequencies of candidate chunks. During the initialization phase, we establish baseline values by setting all scores to zero and initializing visit counters to one, creating a foundation for dynamic updates throughout the subsequent processing stages.

如图 12 和 Algo. 2 所示,数据收集过程实现了一个复杂的双字典系统,用于管理和跟踪信息。我们的系统利用了两种主要的数据结构:字典 $\boldsymbol{S}$ 用于维护全面的评分记录,字典 $\mathcal{V}$ 用于系统性地跟踪候选块的访问频率。在初始化阶段,我们通过将所有评分设置为零并将访问计数器初始化为一来建立基线值,为后续处理阶段的动态更新奠定基础。

In each iteration of our decomposition process, the system retrieves the top $K^{\prime}$ chunks demonstrating maximum relevance to the current atomic question. These chunks must satisfy our similarity threshold criterion (specifically, similarity exceeding $\delta^{\prime}$, where $\delta^{\prime} < \delta$), with $K^{\prime}$ intentionally configured to be larger than $K$ to ensure comprehensive coverage. Following this initial retrieval, we select the data chunks corresponding to the top $K$ most relevant retrieved atomic pairs and integrate them into the context. For those retrieved chunks that do not make it into the top $K$ selection, we incorporate them into $\boldsymbol{S}$ and update their scores based on the calculated relevance metrics.

在我们的分解过程的每次迭代中,系统会针对与当前原子问题相关性最高的前 $K^{\prime}$ 个块执行检索操作。这些块必须满足我们的相似性阈值标准(具体来说,相似性超过 $\delta^{\prime}$,其中 $\delta^{\prime}<\delta$),并且 $K^{\prime}$ 有意配置为大于 $K$ 以确保全面覆盖。在初始检索之后,我们选择对应于前 $K$ 个最相关原子检索对的数据块并将其集成到上下文中。对于那些未被选入前 $K$ 的检索块,我们将其纳入 $\boldsymbol{S}$ 中,并根据计算的相关性指标更新它们的分数。


Figure 12: Data collection process for decomposer training, comprising four main components: a) sampling data chunks from the context sampling pool to serve as the reference context for question decomposition, b) saving the generated atomic query proposals, c) after retrieval and selection, saving the chosen atomic query proposals as part of the reasoning trajectories, d) evaluating the answer to generate a score.

图 12: 分解器训练的数据收集过程,包含四个主要组成部分:a) 从上下文采样池中抽取数据块作为问题分解的参考上下文,b) 保存生成的原子查询建议,c) 在检索和选择后,将选定的原子查询建议保存为推理轨迹的一部分,d) 评估答案以生成分数。


Figure 13: An example of context sampling and an illustration of decomposer training with collected data.

图 13: 上下文采样的示例以及使用收集数据进行的分解器训练示意图。

To ensure comprehensive exploration of the solution space, we have implemented an advanced sampling mechanism that intelligently selects additional chunks from $\boldsymbol{S}$ when available, incorporating them seamlessly into the reference context. Our implementation leverages the Upper Confidence Bound [8] (UCB) algorithm for context sampling, establishing a balanced approach between exploitation and exploration. The exploitation component manifests through the retriever-selected chunks, focusing on options with currently highest estimated rewards to optimize immediate performance gains. Conversely, the exploration aspect is fulfilled through context sampling from $\boldsymbol{S}$ , enabling the systematic investigation of less-certain options to accumulate valuable data and potentially uncover superior long-term alternatives.

为确保全面探索解决方案空间,我们实施了一种先进的采样机制,当$\boldsymbol{S}$中有可用数据时,智能地选择额外数据块,并将其无缝整合到参考上下文中。我们的实现利用了置信区间上界 [8] (UCB) 算法进行上下文采样,在利用和探索之间建立了一种平衡的方法。利用部分通过检索器选择的数据块体现,专注于当前估计奖励最高的选项,以优化即时性能提升。相反,探索部分通过从 $\boldsymbol{S}$ 中进行上下文采样来实现,使得能够系统地研究不确定性较高的选项,以积累有价值的数据,并可能发现更优的长期替代方案。
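A minimal, hypothetical sketch of this UCB-based context sampling over the score dictionary $S$ and visit dictionary $V$ is shown below. The exploration constant `c` and the reward bookkeeping are illustrative choices, not values from the paper.

```python
import math

def ucb_sample(S, V, n_samples=2, c=1.4):
    # Pick chunks from S balancing exploitation (high mean score) and
    # exploration (rarely visited chunks get a larger bonus).
    total_visits = sum(V.values())
    def ucb(chunk_id):
        mean = S[chunk_id] / V[chunk_id]                      # exploitation
        bonus = c * math.sqrt(math.log(total_visits) / V[chunk_id])  # exploration
        return mean + bonus
    picked = sorted(S, key=ucb, reverse=True)[:n_samples]
    for cid in picked:
        V[cid] += 1  # record the visit for future iterations
    return picked

def update_scores(S, V, rewarded, reward):
    # After the answer is evaluated, propagate the answer score back to
    # every chunk that appeared in the sampled context.
    for cid in rewarded:
        S[cid] = S.get(cid, 0.0) + reward
        V.setdefault(cid, 1)
```

With scores initialized to zero and visit counters to one, as described above, unvisited chunks start with the maximum exploration bonus and are gradually re-ranked as answer scores accumulate.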

This meticulously crafted strategy serves a dual purpose: it not only facilitates the generation of diverse and comprehensive atomic query proposals but also enables systematic exploration of multiple potential reasoning pathways. Through this sophisticated approach, we progressively work toward deriving optimal final answers while maintaining a balance between immediate performance optimization and long-term discovery of potentially superior solutions.

这一精心设计的策略具有双重目的:它不仅有助于生成多样且全面的原子查询提案,还能够系统性地探索多种潜在的推理路径。通过这种复杂的方法,我们逐步朝着得出最佳最终答案的方向努力,同时在即时性能优化与长期发现潜在更优解决方案之间保持平衡。

We record atomic proposals (AP), interactive trajectories, and answer scores to support decomposer training. For each specialized domain, interactive trajectories featuring distinct reasoning paths are gathered for decomposer training. This allows us to use the answer score as a supervised signal to train the decomposer. The decomposer training process is depicted in Figure 13. By incorporating

我们记录原子提案(AP)、交互轨迹和答案分数以支持分解器训练。对于每个专门领域,收集具有不同推理路径的交互轨迹用于分解器训练。这使得我们可以使用答案分数作为监督信号来训练分解器。分解器训练过程如图 13 所示。通过结合

preferences in the form of answer scores, the decomposer training can capture domain-specific decomposition rules, thereby adapting the decomposer to meet domain requirements.

以答案分数的形式表达偏好,分解器训练可以捕捉特定领域的分解规则,从而适应分解器以满足领域需求。

Looking ahead, there are several promising avenues for implementing and enhancing our proposed decomposer. We could leverage well-established algorithms such as supervised fine-tuning (SFT) and direct preference optimization (DPO) [45] to train an effective decomposer based on existing LLMs. The practical implementation and performance evaluation of this comprehensive procedure, including detailed empirical analysis and comparative studies, will be addressed in future research work to thoroughly demonstrate its effectiveness and potential applications.

展望未来,实施和增强我们提出的分解器有几种有前景的途径。我们可以利用监督微调(SFT)和直接偏好优化(DPO)[45]等成熟算法,基于现有的大语言模型训练一个有效的分解器。这一综合过程的实际实施和性能评估,包括详细的实证分析和比较研究,将在未来的研究工作中进行,以充分展示其有效性和潜在应用。

5.4 Level-3: Predictive Question focused RAG System

5.4 Level-3: 预测性问题导向的 RAG (Retrieval-Augmented Generation) 系统

In the L3 system, there is an increased emphasis on knowledge-based prediction capability, which necessitates effective knowledge collection, organization, and the construction of forecasting rationale. To address this, we leverage the task decomposition and coordination module to build forecasting rationale based on the organized knowledge, which is collected and organized from the retrieved knowledge. The framework of the L3 system is illustrated in Figure 14. To ensure the retrieved knowledge is well-prepared for advanced analysis and forecasting, the knowledge organization module is enhanced with specialized submodules dedicated to the structuring and organization of knowledge. These submodules streamline the process of transforming raw retrieved knowledge into a structured, coherent format, optimizing it for subsequent reasoning and predictive tasks. For example, in the FDA scenario referred to in Figure 1, data from multiple sources—such as medicine labels, clinical trials, and application forms—are integrated into the multi-layer knowledge base. The knowledge structuring submodule follows the instruction from the task decomposition module to collect and organize the relevant knowledge (e.g., medicine names with their approval dates) retrieved from the knowledge base. The knowledge induction submodule further categorizes this structured knowledge, such as by approval date, to facilitate further statistical analysis and prediction.

在 L3系统中,我们更加重视基于知识的预测能力,这需要有效地进行知识收集、组织以及构建预测依据。为此,我们利用任务分解和协调模块,基于从检索到的知识中进行收集和组织的知识,构建预测依据。L3系统的框架如图14所示。为了确保检索到的知识能够为高级分析和预测做好准备,知识组织模块通过专门用于知识结构化和组织的子模块得到了增强。这些子模块简化了将原始检索到的知识转化为结构化、连贯格式的过程,优化了后续推理和预测任务。例如,在图1提到的FDA场景中,来自多个来源的数据(如药品标签、临床试验和申请表)被整合到多层知识库中。知识结构化子模块根据任务分解模块的指令,从知识库中收集并组织相关知识(例如药品名称及其批准日期)。知识归纳子模块进一步对这些结构化知识进行分类,例如按批准日期分类,以便进行进一步的统计分析和预测。


Figure 14: Overview of the L3-RAG framework. Colored squares indicate the knowledge structuring and knowledge induction submodules in the knowledge organization module, as well as the forecasting submodule in the knowledge-centric reasoning module.

图 14: L3-RAG 框架概述。彩色方块表示知识组织模块中的知识结构化和知识归纳子模块,以及以知识为中心的推理模块中的预测子模块。

Given the limitations of LLMs in applying specialized reasoning logic, their effectiveness in predictive tasks can be restricted. To overcome this, the knowledge-centric reasoning module is enhanced with a forecasting submodule, enabling the system to infer outcomes based on the input queries and the organized knowledge (e.g. total numbers of medicines approved per year). This forecasting submodule allows the system to not only generate answers based on historical knowledge, but also make projections, providing a more robust and dynamic response to complex queries. By integrating advanced knowledge structuring and prediction capabilities, the L3 system can manage and utilize a more complex and dynamic knowledge base effectively.

鉴于大语言模型在应用专业推理逻辑方面的局限性,它们在预测任务中的有效性可能受到限制。为了克服这一问题,知识中心推理模块通过增加预测子模块得到增强,使系统能够根据输入查询和组织化的知识(例如每年批准的药物总数)推断结果。该预测子模块不仅允许系统基于历史知识生成答案,还可以进行预测,从而为复杂查询提供更强大和动态的响应。通过整合先进的知识结构化与预测能力,L3系统能够更有效地管理和利用复杂且动态的知识库。
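As a toy illustration of this pipeline on the FDA example, the snippet below induces hypothetical structured records into per-year approval counts and fits a simple least-squares linear trend as the forecasting step; the drug names and counts are fabricated for illustration, and a real system would plug in domain-appropriate forecasting logic.

```python
from collections import Counter

def induce_counts(records):
    # knowledge induction: categorize structured (name, year) records by year
    return dict(Counter(year for _, year in records))

def forecast_next_year(counts):
    # forecasting submodule: least-squares linear trend over (year, count)
    years = sorted(counts)
    n = len(years)
    xbar = sum(years) / n
    ybar = sum(counts[y] for y in years) / n
    slope = (sum((y - xbar) * (counts[y] - ybar) for y in years)
             / sum((y - xbar) ** 2 for y in years))
    intercept = ybar - slope * xbar
    return slope * (years[-1] + 1) + intercept

# Hypothetical structured records produced by the knowledge structuring
# submodule (medicine name, approval year).
records = [("drug_a", 2021), ("drug_b", 2021), ("drug_c", 2022),
           ("drug_d", 2022), ("drug_e", 2022), ("drug_f", 2023),
           ("drug_g", 2023), ("drug_h", 2023), ("drug_i", 2023)]
```

Here the induced counts (2, 3, 4 approvals across three years) yield a linear projection of 5 approvals for the following year.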

5.5 Level-4: Creative Question focused RAG System

5.5 Level-4: 以创造性问题为中心的RAG系统

The L4 system implementation is characterized by the integration of multi-agent systems to facilitate multi-perspective thinking. Addressing creative questions requires creative thinking that draws on factual information and an understanding of underlying principles and rules. At this advanced level, the primary challenges include extracting coherent logical rationales from retrieved knowledge, navigating complex reasoning processes with numerous influencing factors, and assessing the quality of responses to creative, open-ended questions. To tackle these challenges, the system coordinates multiple agents, each contributing unique insights and reasoning strategies, as illustrated in Figure 15. These agents operate in parallel, synthesizing various thought processes to generate comprehensive and coherent solutions. This multi-agent architecture supports the parallel processing and integration of diverse reasoning paths, ensuring effective management and response to intricate queries. By simulating diverse viewpoints, the L4 system enhances its ability to tackle creative questions, generating innovative ideas rather than predefined solutions. The coordinated outputs from multiple agents not only enrich the reasoning process but also provide users with comprehensive perspectives, fostering creative thinking and inspiring novel solutions to complex problems.

L4 系统的实现特点在于集成多智能体系统以促进多角度思考。处理创造性问题需要基于事实信息和对基本原理及规则的理解的创造性思维。在这个高级阶段,主要挑战包括从检索到的知识中提取连贯的逻辑推理、在众多影响因素中导航复杂的推理过程,以及评估对创造性开放式问题的回答质量。为了解决这些挑战,系统协调多个智能体,每个智能体贡献独特的见解和推理策略,如图15所示。这些智能体并行运作,综合各种思维过程以生成全面且连贯的解决方案。这种多智能体架构支持并行处理和整合多种推理路径,确保有效管理和应对复杂查询。通过模拟多样化的观点,L4 系统增强了处理创造性问题的能力,生成创新想法而非预定义的解决方案。多个智能体的协调输出不仅丰富了推理过程,还为用户提供了全面的视角,促进创造性思维并激发复杂问题的新颖解决方案。


Figure 15: Overview of L4-RAG framework. The multi-agent planning module is introduced to enable multi-perspective thinking.

图 15: L4-RAG 框架概览。引入多智能体规划模块以实现多视角思考。

6 Evaluation and Metrics

6 评估与指标

To validate the effectiveness of our proposed method, we conduct experiments on both open-domain benchmarks and domain-specific benchmarks. We delineate the evaluation metrics and methods employed to assess the performance of the proposed knowledge-aware task decomposition method in

为了验证我们提出方法的有效性,我们在开放域基准测试和特定领域基准测试上进行了实验。我们详细描述了用于评估所提出的知识感知任务分解方法性能的评估指标和方法。

Section 6.1. The evaluation results on three open-domain benchmarks are presented in Section 6.2, while the results on two legal domain-specific benchmarks appear in Section 6.3. Furthermore, we present in-depth analysis through three real case studies in Section 6.4, which highlight the superiority of our method compared to existing decomposition approaches.

第 6.1 节。第 6.2 节展示了在三个开放领域基准测试上的评估结果,而第 6.3 节展示了在两个法律领域特定基准测试上的结果。此外,我们在第 6.4 节中通过三个实际案例研究进行了深入分析,这些案例研究突出了我们的方法相比于现有分解方法的优越性。

6.1 Experimental Setup

6.1 实验设置

Methods To thoroughly evaluate the performance of our proposed knowledge-aware decomposition approach (described in Section 5.3), we have selected a variety of baseline methods that represent different strategies for task-solving with LLMs. We include Zero-Shot CoT [34] to assess the inherent reasoning capabilities and embedded knowledge of the underlying LLM without additional context. Naive RAG [35], which introduces external knowledge through retrieval, serves as a benchmark for evaluating the incremental benefits of augmented knowledge. The Self-Ask framework [43] is employed to investigate the impact of an iterative question decomposition and answering strategy on task performance. Additionally, GraphRAG [20] is evaluated in both local and global modes to assess the impact of knowledge graph-based methods on multi-hop reasoning tasks.

方法

为了全面评估我们提出的知识感知分解方法(见第 5.3 节)的性能,我们选择了多种基准方法,这些方法代表了使用大语言模型(LLM)解决任务的不同策略。我们引入了零样本思维链(Zero-Shot CoT)[34],以评估底层大语言模型的固有推理能力和嵌入知识,而无需额外的上下文。朴素检索增强生成(Naive RAG)[35]通过检索引入外部知识,作为评估增强知识增量效益的基准。我们采用了自我提问框架(Self-Ask)[43]来研究迭代问题分解和回答策略对任务性能的影响。此外,我们在局部和全局模式下评估了 GraphRAG [20],以评估基于知识图谱的方法对多跳推理任务的影响。

To ensure a fair comparison and to highlight the influence of hierarchical knowledge structures, we have extended Naive RAG and Self-Ask to utilize both a general flat knowledge base, denoted as $R$, and a hierarchical retriever, denoted as $H\text{-}R$, as introduced in Figure 11. The hierarchical retriever ($H\text{-}R$) utilizes the questions or follow-up questions to retrieve chunks through both path (a) and path (b) simultaneously. The retrieved chunks from both paths are then aggregated to form a comprehensive reference context for the LLM to answer each question, potentially enhancing the relevance of the provided context.

为了确保公平比较并突出分层知识结构的影响,我们扩展了Naive RAG和Self-Ask,使其能够同时利用一个通用的扁平知识库(表示为$R$)和一个分层检索器(表示为$H{-}R$),如图11所示。分层检索器$\left(H!-!R\right)$利用问题或后续问题同时通过路径(a)和路径(b)来检索知识块。从两条路径检索到的知识块随后被聚合,形成一个综合的参考上下文供大语言模型回答每个问题,从而可能提高所提供上下文的相关性。
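Under these assumptions, the dual-path aggregation of $H\text{-}R$ can be sketched as a simple union with deduplication; `retrieve_chunks` and `retrieve_via_atoms` are hypothetical stand-ins for path (a) and path (b) of Figure 11.

```python
def hierarchical_retrieve(query, retrieve_chunks, retrieve_via_atoms):
    direct = retrieve_chunks(query)        # path (a): query -> chunks
    via_atoms = retrieve_via_atoms(query)  # path (b): query -> atomic Qs -> chunks
    merged, seen = [], set()
    for chunk in direct + via_atoms:       # aggregate, dropping duplicates
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    return merged
```

Keeping the direct path first preserves the ranking of chunks that matched the query itself, while atomic-path chunks extend the context with knowledge reachable only through question-level matching.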

The experimental methods are summarized as follows:

实验方法总结如下:

• Zero-Shot CoT: Questions are addressed using solely the Chain-Of-Thought (CoT) technique, which prompts the LLM to articulate its reasoning process step-by-step without the aid of example demonstrations or supplemental context. This method assesses the LLM's intrinsic knowledge and reasoning capabilities in a zero-shot setting.

零样本思维链 (Zero-Shot CoT):问题仅通过思维链 (Chain-Of-Thought, CoT) 技术解决,该技术提示大语言模型逐步阐述其推理过程,无需示例演示或补充上下文。此方法评估大语言模型在零样本设置中的内在知识和推理能力。

• Naive RAG w/ R: This approach employs dense retrieval from a flat knowledge base to procure relevant information for each question. The knowledge base consists of pre-embedded chunks that are matched to the original question based on semantic similarity. The retrieval process is direct, without any intermediate task decomposition.

• Naive RAG w/ R: 该方法通过从扁平知识库中进行密集检索来为每个问题获取相关信息。知识库由预嵌入的块组成,这些块基于语义相似性与原始问题进行匹配。检索过程是直接的,没有任何中间任务分解。

• Naive RAG w/ H-R: This method extends the Naive RAG framework by incorporating a hierarchical retrieval process ($H\text{-}R$) that operates through two concurrent paths. Path (a) performs a direct retrieval of knowledge chunks in response to the original question, similar to the flat retrieval approach. Path (b), on the other hand, uses the original question again to find the relevant atomic questions and obtain the corresponding chunks. The combined output from both paths is then aggregated, creating a rich reference context.

Naive RAG w/ H-R:该方法通过引入层次化检索过程(H-R)扩展了Naive RAG框架,该过程通过两条并行路径操作。路径(a)直接检索知识块以响应原始问题,类似于扁平化检索方法。另一方面,路径(b)再次使用原始问题来查找相关的原子问题并获取相应的知识块。然后,两条路径的输出被聚合,形成一个丰富的参考上下文。

• Self-Ask: This method employs a task decomposition strategy wherein the LLM is prompted to iteratively generate and answer follow-up questions, thereby breaking down complex problems into more manageable sub-tasks. General demonstrations illustrating the logic and methodology of task decomposition are provided for all benchmarks to guide the LLM's reasoning process. As detailed in the original paper [43], the framework encourages the LLM to engage in a recursive dialogue with itself, generating intermediate answers that progressively build towards the final answer. In this setting, the LLM relies solely on its inherent knowledge base, as no external contexts are introduced to aid in answering the follow-up questions.

• Self-Ask: 该方法采用任务分解策略,通过提示大语言模型迭代生成并回答后续问题,从而将复杂问题分解为更易管理的子任务。为了指导大语言模型的推理过程,所有基准测试中都提供了展示任务分解逻辑和方法论的通用示例。如原始论文 [43] 中所述,该框架鼓励大语言模型与其自身进行递归对话,生成逐步构建最终答案的中间答案。在这种设置下,大语言模型仅依赖其固有的知识库,因为未引入外部上下文来协助回答后续问题。

• Self-Ask w/ R: Building upon the Self-Ask method, this setting introduces an additional retrieval component: for each follow-up question generated by the LLM, relevant chunks are retrieved from a flat knowledge base to provide a reference context. The retrieval process uses the follow-up question as the query. This approach seeks to combine the benefits of iterative task decomposition with rich external knowledge from retrieval, potentially improving the LLM's performance on complex reasoning tasks.

• Self-Ask w/ R:在 Self-Ask 方法的基础上,此设置引入了额外的检索组件。对于大语言模型生成的每个后续问题,从平面知识库中检索相关块以提供参考上下文。检索过程使用后续问题作为查询。此方法旨在将迭代任务分解的优势与检索提供的丰富外部知识相结合,从而可能提高大语言模型在复杂推理任务中的表现。

• Self-Ask w/ H-R: This variant of the Self-Ask method enhances the retrieval process by utilizing a hierarchical knowledge base, as opposed to the flat one used in Self-Ask w/ R.

• Self-Ask w/ H-R:该方法是对 Self-Ask w/ R 的改进,通过使用分层知识库(hierarchical knowledge base)而非扁平知识库来增强检索过程。

When the LLM generates follow-up questions, these are employed as queries in a dual-path retrieval system, specifically paths (a) and (b) in Figure 11. The outputs from both retrieval paths are then aggregated to form a richer reference context.

当大语言模型生成后续问题时,这些问题将作为查询用于双路径检索系统,具体为图 11 中的路径 (a) 和路径 (b)。然后,两个检索路径的输出将被聚合,以形成更丰富的参考上下文。

• GraphRAG Local: In this approach, the flat knowledge base is pre-processed to construct a knowledge graph in accordance with the public guidance. The inference is run in local mode.

• GraphRAG Local:在这种方法中,扁平化的知识库会按照公共指南进行预处理,以构建知识图谱。推理以本地模式运行。

• GraphRAG Global: The inference is run in global mode in this setting.

• GraphRAG Global: 在此设置下,推理以全局模式运行。

• Ours: The proposed knowledge-aware decomposition method iteratively decomposes complex questions into sub-questions and retrieves relevant knowledge for up to a maximum of five iterations. This process limits the context for the final answer to the five most useful knowledge chunks.

• 我们的方法:提出的知识感知分解方法迭代地将复杂问题分解为子问题,并检索相关知识,最多进行五次迭代。该过程将最终答案的上下文限制在五个最有用的知识块内。

Metrics To maintain consistency with established benchmarks, two conventional metrics are adopted in our experimental evaluation: Exact Match (EM), which assesses whether the response is identical to a predefined correct answer, and the F1 score, which is the harmonic mean of precision and recall at the token level. During evaluation, we noticed that the LLM sometimes produced responses more verbose than expected, even when the QA prompt aimed to limit output style. To more accurately gauge the responses' alignment with the intended answers—beyond mere lexical matching—we introduced a novel evaluation metric employing GPT-4. In this process, GPT-4 acts as an evaluator, assessing the correctness of a response in relation to the question and the correct answer labels. We refer to this metric as Accuracy (Acc). Upon manual inspection of a sample set, the judgments rendered by GPT-4 demonstrate complete agreement with human evaluators, affirming the reliability of this metric.

指标 为了与既有基准保持一致,我们在实验评估中采用了两个常规指标:精确匹配 (Exact Match, EM),用于评估回答是否与预定义的正确答案完全一致;以及 F1 分数,即 Token 级别精确率与召回率的调和平均数。在评估过程中,我们注意到即使问答提示旨在限制输出风格,大语言模型有时仍会生成比预期更冗长的回答。为了更准确地衡量回答与预期答案的契合程度(而不仅仅是词面匹配),我们引入了一种使用 GPT-4 的新评估指标。在此过程中,GPT-4 作为评估者,根据问题和正确答案标签评估回答的正确性。我们将该指标称为准确率 (Accuracy, Acc)。通过对样本集的人工检查,GPT-4 的判断与人工评估者完全一致,证实了该指标的可靠性。

Furthermore, we encountered situations where a method achieves high accuracy (Acc) scores yet registers low F1 scores. To elucidate the underlying factors of such discrepancies, we also report on the Recall and Precision of the generated responses. Recall measures the proportion of relevant tokens from the answer labels that are captured in the response, while precision evaluates the relevance of the tokens in the generated answer with respect to the correct labels. Specifically, in cases where multiple correct answer labels are available, we employ a conservative scoring approach for EM, F1, Precision, and Recall by retaining the highest score achieved. This approach is designed to equitably consider the range of correct answers that the LLM may generate. It should be noted that, in the context of computing Accuracy (Acc), all admissible answer labels are furnished concurrently to the evaluation process, resulting in a singular Accuracy score.

此外,我们遇到了一些情况,即方法在准确率 (Acc) 得分较高的情况下,F1 得分却较低。为了阐明这种差异的根本原因,我们还报告了生成响应的召回率和精确率。召回率衡量了从答案标签中捕获的相关 Token 的比例,而精确率则评估了生成答案中 Token 相对于正确标签的相关性。具体而言,在有多个正确答案标签的情况下,我们对 EM、F1、精确率和召回率采用保守的评分方法,保留最高得分。这种方法旨在公平地考虑大语言模型可能生成的各种正确答案范围。需要注意的是,在计算准确率 (Acc) 时,所有可接受的答案标签都会同时提供给评估过程,从而得出一个单一的准确率得分。
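For reference, a minimal implementation of these token-level metrics with the conservative max-over-labels convention might look as follows; the normalization step is simplified relative to the standard benchmark evaluation scripts.

```python
import re
from collections import Counter

def normalize(text):
    # simplified normalization: lowercase, strip punctuation, tokenize
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

def f1_score(pred, label):
    p, l = normalize(pred), normalize(label)
    common = sum((Counter(p) & Counter(l)).values())  # shared tokens
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(p)  # relevance of generated tokens
    recall = common / len(l)     # coverage of label tokens
    return 2 * precision * recall / (precision + recall), precision, recall

def evaluate(pred, labels):
    # keep the highest score across all admissible answer labels
    em = max(float(normalize(pred) == normalize(l)) for l in labels)
    f1, prec, rec = max((f1_score(pred, l) for l in labels),
                        key=lambda t: t[0])
    return {"EM": em, "F1": f1, "Precision": prec, "Recall": rec}
```

This also makes the high-Acc, low-F1 discrepancy concrete: a verbose but correct answer keeps recall high while extra tokens drag precision, and hence F1, down.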

The metrics employed in this evaluation — Exact Match (EM), F1, Precision, Recall, and Accuracy (Acc) — are primarily suited for questions categorized as L1 and L2, which are characterized by their association with ground truth answers that are factual and definitive. However, the utility of these metrics diminishes for predictive and creative questions, namely the L3 and L4 questions, where answers are inherently uncertain or subjective, and no single correct response exists. For L3 questions, alternative assessment methods such as trend judgment and qualitative analysis become more appropriate to capture the predictive validity of the responses. Furthermore, for L4 questions, which demand a higher degree of insight or innovation, it is essential to evaluate answers through a multi-faceted lens, considering criteria such as relevance, diversity, comprehensiveness, uniqueness, and inspiration to fully appreciate the depth and originality of the approaches’ responses.

本次评估采用的指标——精确匹配 (Exact Match, EM)、F1、精确率 (Precision)、召回率 (Recall) 和准确率 (Accuracy, Acc)——主要适用于被归类为 L1 和 L2 的问题,这些问题的特点是与事实性和确定性答案相关联。然而,这些指标在预测性和创造性问题(即 L3 和 L4 问题)上的效用减弱,因为这类问题的答案本质上是不确定的或主观的,且不存在单一的正确答案。对于 L3 问题,趋势判断和定性分析等替代评估方法更适合捕捉回答的预测有效性。此外,对于 L4 问题,由于其需要更高程度的洞察力或创新性,必须通过多方面的视角来评估答案,考虑相关性、多样性、全面性、独特性和启发性等标准,以充分理解回答的深度和原创性。

LLM and Hyper-parameters In our experiments, we employ GPT-4 (1106-Preview version) across all the methods outlined previously. For the knowledge extraction phase, we utilize a temperature setting of 0.7 specifically for the Knowledge Atomizing process, promoting a balance between diversity and determinism in the generated atomic knowledge. Conversely, during all question-answering (QA) steps in each method, we implement a temperature of 0, ensuring consistent responses from the model. Regarding the retrieval component, we engage the text-embedding-ada-002 (version 2) as our embedding model for both the general flat knowledge bases and the hierarchical knowledge bases. For the general flat knowledge bases, the retriever is configured to fetch up to 16 knowledge chunks, applying a retrieval score threshold of 0.2. In the case of hierarchical knowledge bases, the retriever is initially set to retrieve a maximum of 8 chunks with a more stringent threshold of 0.5. Subsequently, an additional 4 chunks can be retrieved via each atomic query posed.

大语言模型与超参数
在我们的实验中,前述所有方法均采用 GPT-4(1106-Preview 版本)。在知识提取阶段,我们专门为知识原子化(Knowledge Atomizing)过程设置温度为 0.7,以在生成的原子知识中兼顾多样性与确定性。相反,在各方法的所有问答(QA)步骤中,我们将温度设为 0,以确保模型输出的一致性。在检索部分,我们使用 text-embedding-ada-002(version 2)作为通用扁平知识库和层次化知识库的嵌入模型。对于通用扁平知识库,检索器最多获取 16 个知识块,检索得分阈值为 0.2。对于层次化知识库,检索器初始最多检索 8 个知识块,并采用更严格的阈值 0.5;随后,每个提出的原子查询还可额外检索 4 个知识块。
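为便于理解上述检索配置,下面用一小段示意代码描述两类知识库的检索参数与阈值过滤逻辑。其中的类名与字段均为示意性假设,并非论文的实际实现;层次化检索中每个原子查询所用的阈值论文未明确给出,此处沿用 0.5 仅作占位。

```python
from dataclasses import dataclass

@dataclass
class RetrieverConfig:
    top_k: int              # 最多返回的知识块数
    score_threshold: float  # 检索得分阈值

# 与正文描述对应的三组参数
FLAT_KB = RetrieverConfig(top_k=16, score_threshold=0.2)             # 通用扁平知识库
HIER_KB_INIT = RetrieverConfig(top_k=8, score_threshold=0.5)         # 层次化知识库:初始检索
HIER_KB_PER_ATOMIC = RetrieverConfig(top_k=4, score_threshold=0.5)   # 每个原子查询的追加检索(阈值为假设)

def filter_hits(hits, cfg: RetrieverConfig):
    """hits 为 (chunk_id, score) 列表:先按阈值过滤,再按得分降序截断到 top_k。"""
    kept = [h for h in hits if h[1] >= cfg.score_threshold]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:cfg.top_k]
```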

6.2 Evaluation on Open-Domain Benchmarks

6.2 开放域基准测试评估

In this subsection, we demonstrate the performance of our method across three open-domain benchmarks. To ensure a fair and objective evaluation, particularly in the context of real-world industrial applications, we have selected three widely-recognized multi-hop datasets: HotpotQA [60], 2WikiMultiHopQA [28], and MuSiQue [51]. Below, we provide a brief overview of these datasets, noting that our method does not leverage the question type information nor the number of hops information during the solving process, as our approach is designed to be agnostic to such classifications.

在本小节中,我们展示了我们的方法在三个开放域基准测试中的表现。为了确保公平和客观的评估,特别是在现实世界工业应用背景下,我们选择了三个广受认可的多跳数据集:HotpotQA [60]、2WikiMultiHopQA [28] 和 MuSiQue [51]。下面,我们简要概述这些数据集,并指出我们的方法在求解过程中既不利用问题类型信息,也不利用跳数信息,因为我们的方法设计为对这些分类不可知。

HotpotQA The HotpotQA dataset is a well-known multi-hop QA benchmark primarily consisting of 2-hop questions, each associated with 10 Wikipedia paragraphs. Among these, some paragraphs contain supporting facts essential to answering the question, while the rest serve as distractors. The dataset also includes a question type field, which delineates the logical reasoning required—comparison questions involve contrasting two entities, and bridge questions require inferring the bridge entity, or inferring the property of an entity through an intermediary entity, or locating the answer entity [60]. The comparison questions in HotpotQA align with the comparative questions defined in Section 3.2. Similarly, bridge questions correspond to either bridging questions or summarizing questions, depending on the complexity of the rationale required. Although our method operates independently of these types, their description here exemplifies the nature of questions within the dataset and contextualizes the expected performance variance across different benchmarks.

HotpotQA数据集是一个著名的多跳问答基准,主要由2跳问题组成,每个问题关联10个维基百科段落。其中,某些段落包含回答问题所需的关键支持事实,而其他段落则作为干扰项。该数据集还包括一个问题类型字段,用于划分所需的逻辑推理——比较问题涉及对两个实体的对比,而桥接问题则需要推断桥接实体,或通过中介实体推断实体的属性,或定位答案实体[60]。HotpotQA中的比较问题与第3.2节中定义的比较问题一致。同样,桥接问题根据所需推理的复杂性,对应桥接问题或总结问题。尽管我们的方法独立于这些类型,但这里的描述展示了数据集中问题的性质,并为不同基准上的预期性能差异提供了背景。

2WikiMultiHopQA Inspired by HotpotQA, 2WikiMultiHopQA expands the diversity of question types. It retains the comparison type from HotpotQA and introduces inference and compositional questions that evolve from the bridge type by focusing on entity attribute deduction and entity location, respectively. Additionally, the bridge comparison type is a novel category that requires a synthesis of bridge and comparison reasoning. In this dataset, the comparison questions correspond to the comparative questions defined in Section 3.2, akin to those in HotpotQA. The inference questions are analogous to bridging questions, and the compositional questions are similar to summarizing questions as described in the same section. The bridge comparison questions, due to their hybrid nature and increased complexity, also fall under the summarizing questions category. This dataset typically presents 2-hop to 4-hop questions, each accompanied by 10 Wikipedia paragraphs containing supporting facts and distractors. While these types inform the dataset's structure, they are not utilized by our method, which treats all questions uniformly regardless of their categorization.

2WikiMultiHopQA
受 HotpotQA 启发,2WikiMultiHopQA 扩展了问题类型的多样性。它保留了 HotpotQA 中的比较类型,并引入了推理和组合问题,分别通过关注实体属性推断和实体位置从桥接类型演变而来。此外,桥接比较类型是一种新的类别,需要综合桥接和比较推理。在该数据集中,比较问题对应于第 3.2 节中定义的比较问题,类似于 HotpotQA 中的问题。推理问题类似于桥接问题,组合问题类似于同一节中描述的总结问题。由于桥接比较问题的混合性质和增加的复杂性,它们也属于总结问题类别。该数据集通常提供 2 跳至 4 跳的问题,每个问题都附有 10 个包含支持事实和干扰信息的 Wikipedia 段落。虽然这些类型影响了数据集的结构,但我们的方法并未利用它们,而是对所有问题一视同仁,无论其分类如何。

MuSiQue Addressing the issue that many multi-hop questions can be solved via shortcuts—arriving at correct answers without proper reasoning—MuSiQue implements stringent filters and additional mechanisms specifically designed to encourage connected reasoning, as reported by Trivedi et al. [51]. Unlike the other datasets, MuSiQue does not categorize questions by type, but it does provide explicit information on the number of hops required for each question, ranging from 2 to 4 hops. Each question is associated with 20 context paragraphs, which introduce a mix of relevant and irrelevant information, further complicating the task of discerning the correct reasoning path. This explicit hop information, while not used by our method, underscores the complexity of the dataset and the robustness required by models to handle such challenges effectively.

MuSiQue
为了解决许多多跳问题可以通过捷径解决(即无需正确推理即可得出正确答案)的问题,MuSiQue 实施了严格的过滤器和额外的机制,专门设计用于鼓励连贯推理,正如 Trivedi 等人 [51] 所报告的那样。与其他数据集不同,MuSiQue 并不按类型对问题进行分类,但它确实提供了每个问题所需跳数的明确信息,范围从 2 跳到 4 跳。每个问题都与 20 个上下文段落相关联,这些段落引入了相关和不相关信息的混合,进一步增加了识别正确推理路径的难度。这种明确的跳数信息虽然未被我们的方法使用,但突显了数据集的复杂性以及模型有效处理此类挑战所需的鲁棒性。

In our experiments, we randomly sample 500 QA data from the dev set of each dataset, without consideration of question type or number of hops, to ensure randomness. We compile the context paragraphs from all sampled QA data into a single knowledge base for each benchmark, creating a more complex retrieval scenario. This design choice is aimed at rigorously assessing our model's question decomposition and relevant context retrieval abilities. Table 3 outlines the distribution of question types within our sampled sets, offering insight into the variety of reasoning challenges presented in our evaluation, though this does not directly impact our method.

在我们的实验中,我们从每个数据集的开发集中随机抽取 500 条 QA 数据,不考虑问题类型或跳数,以确保随机性。我们将所有抽取的 QA 数据的上下文段落编译成每个基准的单一知识库,从而创建一个更复杂的检索场景。这一设计选择旨在严格评估我们模型的问题分解和相关上下文检索能力。表 3 概述了我们抽取集中的问题类型分布,提供了我们评估中呈现的各种推理挑战的洞察,虽然这并不直接影响我们的方法。
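上述"抽样并把所有上下文段落合并为单一知识库"的流程可以概括为如下草图(其中字段名 "paragraphs" 为示意性假设,并非数据集的原始字段名):

```python
import random

def build_benchmark_kb(dev_set, n=500, seed=0):
    """从 dev 集随机抽取 n 条 QA;将所有抽中样本的上下文段落
    去重合并为该基准的单一知识库,从而构造更复杂的检索场景。"""
    rng = random.Random(seed)
    sampled = rng.sample(dev_set, n)
    kb, seen = [], set()
    for qa in sampled:
        for para in qa["paragraphs"]:
            if para not in seen:
                seen.add(para)
                kb.append(para)
    return sampled, kb
```

由于不同问题可能共享段落,去重能避免知识库中出现重复条目;而把全部样本的段落混在一起,使每个问题都要面对远多于自身附带段落的干扰项。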

Overall Performance The evaluation results across HotpotQA, 2WikiMultiHopQA, and MuSiQue are presented in Table 4, Table 5, and Table 6, respectively. If we hypothesize that the highest achievable performance on each benchmark may reflect its relative difficulty, a tentative ranking from easiest to most challenging would be: HotpotQA, 2WikiMultiHopQA, and MuSiQue. Our observations suggest that for HotpotQA, considered the least challenging, the GraphRAG in local mode and our method are closely competitive, with minor performance disparities. However, as the difficulty increases for 2WikiMultiHopQA and MuSiQue, our method outperforms others.

整体性能
HotpotQA、2WikiMultiHopQA 和 MuSiQue 的评估结果分别呈现在表 4、表 5 和表 6 中。如果我们假设每个基准上可达到的最高性能可能反映其相对难度,那么从最简单到最具挑战性的初步排名将是:HotpotQA、2WikiMultiHopQA 和 MuSiQue。我们的观察表明,对于被认为最不具挑战性的 HotpotQA,本地模式下的 GraphRAG 和我们的方法竞争激烈,性能差异较小。然而,随着 2WikiMultiHopQA 和 MuSiQue 难度的增加,我们的方法表现优于其他方法。

Table 3: Distribution of question types across three multi-hop QA datasets.

表 3: 三个多跳问答数据集中问题类型的分布

(a) HotpotQA

| Type | Count | Ratio |
| --- | --- | --- |
| comparison | 107 | 21.4% |
| bridge | 393 | 78.6% |

(b) 2WikiMultiHopQA

| Type | Count | Ratio |
| --- | --- | --- |
| comparison | 132 | 26.4% |
| inference | 64 | 12.8% |
| compositional | 196 | 39.2% |
| bridge_comparison | 108 | 21.6% |

(c) MuSiQue

| #Hops | Count | Ratio |
| --- | --- | --- |
| 2 | 263 | 52.6% |
| 3 | 169 | 33.8% |
| 4 | 68 | 13.6% |


Table 4: Performance comparison on HotpotQA. Best in bold, second-best underlined.

表 4: HotpotQA 上的性能对比。最佳值加粗,次佳值加下划线。

| 方法 | EM | F1 | Acc | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| Zero-Shot CoT | 32.60 | 43.94 | 53.60 | 46.56 | 43.97 |
| NaiveRAG w/R | <u>56.80</u> | <u>72.67</u> | 82.60 | <u>74.52</u> | 74.86 |
| NaiveRAG w/H-R | 54.80 | 70.25 | 81.60 | 72.56 | 72.24 |
| Self-Ask | 28.80 | 43.61 | 59.60 | 43.49 | 56.21 |
| Self-Ask w/R | 44.80 | 63.08 | 81.00 | 63.23 | 74.57 |
| Self-Ask w/H-R | 47.20 | 64.24 | 82.20 | 64.27 | 75.95 |
| GraphRAG Local | 0.00 | 10.66 | **89.00** | 5.90 | **83.07** |
| GraphRAG Global | 0.00 | 7.42 | 64.80 | 4.08 | 63.16 |
| Ours | **61.20** | **76.26** | <u>87.60</u> | **78.10** | <u>78.95</u> |

The inclusion of retrieved context significantly enhances accuracy, with gains ranging from approximately 10% (comparing Zero-Shot CoT and Naive RAG on MuSiQue) to around 29% (on HotpotQA). This indicates that for simpler benchmarks, RAG equipped with naive knowledge retrieval could address simple multihop questions, leading to a significant accuracy boost. However, for more challenging benchmarks involving complex multihop questions, the accuracy improvement from naive knowledge retrieval is limited, underscoring the constrained reasoning capabilities of the LLMs. By incorporating decomposition mechanisms, Self-Ask significantly enhances accuracy, especially on more challenging benchmarks. The combination of knowledge retrieval and Self-Ask decomposition yields superior results on 2WikiMultiHopQA and MuSiQue, compared to using a single mechanism. However, in the case of HotpotQA, all methods employing retrieval (except for GraphRAG in Global mode, which will be discussed later) attain accuracies above 80%, with negligible differences between them.

引入检索到的上下文显著提高了准确性,提升幅度从大约 10%(在 MuSiQue 上比较 Zero-Shot CoT 和 Naive RAG)到大约 29%(在 HotpotQA 上)。这表明,对于较简单的基准测试,配备简单知识检索的 RAG 能够解决简单的多跳问题,从而显著提高准确性。然而,对于涉及复杂多跳问题的更具挑战性的基准测试,简单知识检索带来的准确性提升有限,突显了大语言模型的推理能力受限。通过引入分解机制,Self-Ask 显著提高了准确性,尤其是在更具挑战性的基准测试上。知识检索与 Self-Ask 分解的结合在 2WikiMultiHopQA 和 MuSiQue 上取得了优于单一机制的结果。然而,在 HotpotQA 上,所有采用检索的方法(除了 Global 模式下的 GraphRAG,稍后将讨论)的准确率均超过 80%,且它们之间的差异可以忽略不计。

Interestingly, the application of a hierarchical atomic knowledge base does not significantly impact Naive RAG’s performance compared to Naive RAG with general flat knowledge base, potentially due to the embedding distance between the original multi-hop questions and the atomic questions of relevant contexts. Nonetheless, when combined with task decomposition, a hierarchical knowledge base shows more promise, as evidenced by the performance boost observed in Self-Ask with Hierarchical Retrieval (Self-Ask w/ H-R) compared to Self-Ask with Retrieval (Self-Ask w/ R), particularly on MuSiQue, which requires more complex reasoning. This improvement underscores the potential of hierarchical knowledge bases in enhancing the effectiveness of decomposition mechanisms in complex reasoning tasks.

有趣的是,与使用一般扁平知识库的Naive RAG相比,层次化原子知识库的应用并未显著影响Naive RAG的性能,这可能是由于原始多跳问题与相关上下文的原子问题之间的嵌入距离所致。然而,当与任务分解结合时,层次化知识库显示出更大的潜力,这在Self-Ask with Hierarchical Retrieval (Self-Ask w/ H-R) 与 Self-Ask with Retrieval (Self-Ask w/ R) 的性能提升中得到了体现,尤其是在需要更复杂推理的MuSiQue数据集上。这一改进突显了层次化知识库在增强复杂推理任务中分解机制有效性方面的潜力。
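层次化原子知识库的核心思想可以用下面的草图说明:每个知识块被原子化为若干原子问题,检索在"原子问题"层面进行,命中后返回其所属的完整知识块。其中 atomize 与 sim 均为示意性占位,实际系统中分别由大语言模型与嵌入相似度实现。

```python
def build_atomic_index(chunks, atomize):
    """chunks: {chunk_id: 原文文本};atomize: 文本 -> 原子问题列表。
    返回 (原子问题, 所属块 id) 的倒排条目列表。"""
    return [(q, cid) for cid, text in chunks.items() for q in atomize(text)]

def retrieve_chunks(query, index, chunks, sim, top_k=4):
    """按 query 与各原子问题的相似度降序排序,
    对所属块去重后返回完整知识块列表。"""
    scored = sorted(index, key=lambda e: sim(query, e[0]), reverse=True)
    out, seen = [], set()
    for q, cid in scored:
        if cid not in seen:
            seen.add(cid)
            out.append(chunks[cid])
        if len(out) >= top_k:
            break
    return out
```

正如正文所述,单独使用该索引时,原始多跳问题与原子问题之间的嵌入距离可能抵消其收益;它的优势主要在与任务分解结合、由分解产生的原子查询来检索时体现。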

Our proposed method focuses on knowledge-aware task decomposition, which performs decomposition with an awareness of available knowledge, effectively leveraging the atomic information provided by the hierarchical knowledge base. Experimental results demonstrate that our approach consistently outperforms other methods, validating its effectiveness in complex reasoning scenarios.

我们提出的方法专注于知识感知的任务分解,该方法在执行分解时意识到可用知识,从而有效利用层次化知识库提供的原子信息。实验结果表明,我们的方法始终优于其他方法,验证了其在复杂推理场景中的有效性。

Regarding GraphRAG, originally designed for the query-focused summarization (QFS) task as outlined by [20], we observe its suboptimal performance in both local and global modes compared to our method. Notably, GraphRAG exhibits a curious trend: it achieves higher accuracy and recall scores while performing lower on EM, F1, and Precision metrics. A closer analysis of GraphRAG's outputs reveals a tendency to echo the query and include meta-information about the answer within its graph structure. Despite attempts to refine its QA prompt, this behavior persists.

关于 GraphRAG,最初设计用于 [20] 中所述的查询聚焦摘要 (QFS) 任务,我们观察到其在局部和全局模式下的表现均不如我们的方法。值得注意的是,GraphRAG 展现了一个有趣趋势:它在准确率和召回率上得分较高,但在 EM、F1 和 Precision 指标上表现较低。对 GraphRAG 输出的进一步分析表明,它倾向于重复查询并在其图结构中包含关于答案的元信息。尽管尝试改进其 QA 提示,这种行为仍然存在。

Table 5: Performance comparison on 2WikiMultiHopQA. Best in bold, second-best underlined.

表 5: 2WikiMultiHopQA 的性能对比。最佳结果加粗,次佳结果加下划线。

| 方法 | EM | F1 | 准确率 | 精确率 | 召回率 |
| --- | --- | --- | --- | --- | --- |
| Zero-Shot CoT | 35.67 | 41.40 | 43.87 | 41.43 | 43.11 |
| NaiveRAG w/R | 51.20 | 62.80 | 59.74 | 59.06 | 62.30 |
| NaiveRAG w/H-R | <u>51.40</u> | 63.00 | 59.73 | 59.36 | 62.43 |
| Self-Ask | 23.80 | 37.49 | 51.60 | 34.56 | 60.72 |
| Self-Ask w/R | 46.80 | <u>64.17</u> | 79.80 | 61.17 | **80.21** |
| Self-Ask w/H-R | 48.00 | 63.99 | <u>80.00</u> | <u>61.30</u> | <u>79.56</u> |
| GraphRAG Local | 0.00 | 11.83 | 71.20 | 6.74 | 75.17 |
| GraphRAG Global | 0.00 | 7.35 | 45.00 | 4.09 | 55.43 |
| Ours | **66.80** | **75.19** | **82.00** | **74.04** | 78.87 |

Table 6: Performance comparison on MuSiQue. Best in bold, second-best underlined.

表 6: MuSiQue 上的性能对比。最佳值加粗,次佳值加下划线。

| Method | EM | F1 | Acc | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| Zero-Shot CoT | 12.93 | 22.90 | 23.47 | 24.40 | 24.10 |
| NaiveRAG w/R | <u>32.00</u> | 43.31 | 44.40 | <u>44.42</u> | 47.29 |
| NaiveRAG w/H-R | 30.40 | 41.30 | 43.40 | 42.06 | 44.53 |
| Self-Ask | 16.40 | 27.27 | 35.40 | 26.33 | 37.65 |
| Self-Ask w/R | 28.40 | 42.54 | 49.80 | 42.47 | <u>55.89</u> |
| Self-Ask w/H-R | 29.80 | <u>44.05</u> | <u>54.00</u> | 41.13 | 53.37 |
| GraphRAG Local | 0.60 | 9.62 | 49.80 | 5.73 | 55.82 |
| GraphRAG Global | 0.00 | 5.16 | 44.60 | 2.82 | 52.19 |
| Ours | **46.40** | **56.62** | **59.60** | **57.45** | **59.53** |

An illustrative example is presented in Table 7, which shows GraphRAG Local's response to a question from HotpotQA.

表 7 中展示了一个示例,显示了 GraphRAG Local 对 HotpotQA 问题的响应。

6.3 Evaluation on Legal Benchmarks

6.3 法律基准评估

In this subsection, we present the performance of our approach on two legal benchmarks: LawBench [22] and Open Australian Legal QA [14]. Before doing so, we provide a brief description of each benchmark.

在本小节中,我们展示了我们的方法在两个法律基准上的表现:LawBench [22] 和 Open Australian Legal QA [14]。在此之前,我们简要介绍了每个基准。

LawBench LawBench is a comprehensive legal benchmark for Chinese laws. It comprises 20 meticulously designed tasks aimed at accurately assessing the legal capabilities of LLMs. Unlike some existing benchmarks that rely solely on multiple-choice questions, LawBench includes a variety of task types that are closely related to real-world applications. These tasks encompass legal entity recognition, reading comprehension, crime amount calculation, and legal consulting, among others. Since not all tasks are RAG-oriented (e.g., reading comprehension), we have selected 6 specific tasks, which are detailed in Table 8. Each task contains 500 questions.

LawBench
LawBench 是一个面向中国法律的综合法律基准。它包含 20 个精心设计的任务,旨在准确评估大语言模型的法律能力。与一些仅依赖多项选择题的现有基准不同,LawBench 包含了多种与真实应用密切相关的任务类型。这些任务涵盖了法律实体识别、阅读理解、犯罪金额计算和法律咨询等。由于并非所有任务都是面向 RAG 的(例如阅读理解),我们选择了 6 个具体任务,详细信息见表 8。每个任务的问题数量为 500。

We also provide example questions of these tasks for the reader's reference.

我们还提供了这些任务的示例问题,供读者参考。

Table 7: An Example of GraphRAG Local output on a HotpotQA question. The table showcases the tendency to repeat the question and include meta-information in its response.

表 7: GraphRAG 在 HotpotQA 问题上的本地输出示例。该表展示了在回答中重复问题并包含元信息的倾向。

问题 Alsa Mall 和 Spencer Plaza 位于哪个国家?
答案标签 印度
GraphRAG 的答案 Alsa Mall 和 Spencer Plaza 都位于印度钦奈 [数据: 印度和钦奈社区 (2391); 实体 (4901, 4904); 关系 (9479,1687,5215,5217)]。

Table 8: Overview of LawBench tasks

表 8: LawBench 任务概览

| 任务编号 | 任务 | 类型 | 指标 |
| --- | --- | --- | --- |
| 1-1 | Statute Recitation | 生成 | F1 |
| 1-2 | 法律知识问答 | 单选 | EM |
| 3-1 | 法规预测(基于事实) | 多选 | EM |
| 3-2 | 法规预测(基于场景) | 生成 | F1 |
| 3-6 | 案例分析 | 单选 | EM |
| 3-8 | 咨询 | 生成 | F1 |

… member management rules, and other relevant rules in accordance with securities laws and administrative regulations, and reports to the securities regulatory authority under the State Council for record.

……会员管理规则及其他相关规则,并按照证券法律和行政法规的规定,向国务院证券监督管理机构报备。

3-1: Based on the following facts and charges, provide the relevant articles of the Criminal Law. Facts: The Yushu City, Jilin Province, accused that on November 15, 2015, the defendant He signed a car rental agreement with Guo, the owner of a taxi with license plate number xxx. The agreement stipulated a monthly rent of RMB 3,900.00, payable monthly. On January 19, 2016, without the knowledge of Guo, the defendant He concealed the truth and falsely claimed to be the owner of the taxi. He signed a car rental agreement with the victim Ma, with a monthly rent of RMB 3,800.00 and a rental period of one year, collecting a total of RMB 50,600.00 from Ma for one year's rent and vehicle deposit. On February 26, 2016, the taxi was retrieved by its owner Guo from the victim Ma. The victim Ma repeatedly asked the defendant He to return the rent and deposit, but the defendant He refused to return them. The prosecution provided evidence including the defendant's confession, the victim's statement, witness testimonies, and documentary evidence, and believed that the defendant He, with the purpose of illegal possession, defrauded others of their property by fabricating facts and concealing the truth during the signing and performance of the contract. The amount was relatively large, and his actions violated the provisions of Article xx of the Criminal Law of the People's Republic of China, and he should be held criminally responsible for xx. Charge: Contract Fraud.

3-1: 根据以下事实和指控,提供《中华人民共和国刑法》的相关条款。事实:吉林省榆树市指控,2015年11月15日,被告人何与车牌号为xxx的出租车车主郭签订了租车协议,协议约定月租金为人民币3900.00元,按月支付。2016年1月19日,被告人何在郭不知情的情况下,隐瞒真相,谎称自己是出租车的车主,与被害人马签订了租车协议,月租金为人民币3800.00元,租期为一年,共收取马某一年的租金及车辆押金人民币50600.00元。2016年2月26日,出租车被车主郭从被害人马处取回。被害人马多次要求被告人何退还租金及押金,但被告人何拒绝退还。检察机关提供了包括被告人供述、被害人陈述、证人证言、书证等证据,并认为被告人何以非法占有为目的,在签订、履行合同过程中,虚构事实、隐瞒真相,骗取他人财物,数额较大,其行为触犯了《中华人民共和国刑法》第xx条的规定,应以xx罪追究其刑事责任。指控:合同诈骗罪。

3-2: Please provide the legal basis according to the specific scenario and question; only the content of the specific legal provision is needed, and each scenario involves only one legal provision. Scenario: A cargo ship arrives at the port of discharge, but the consignee fails to arrive in time to collect the goods. Under which legal provision can the captain unload the goods at another appropriate place?

3-2: 请根据具体场景和问题提供法律依据,仅需提供具体法律条文的内容,每个场景仅涉及一条法律条文。场景:一艘货船到达卸货港,但收货人未能及时到达提取货物。船长可以依据哪条法律条文将货物卸载到另一个适当的地点?

3-6: One year after the bar opened, the business environment changed drastically, and all partners held a meeting to discuss countermeasures. According to the 'Partnership Enterprise Law,' the following voting matters are considered valid votes: A: Zhang believes that the name 'Tongcheng' is not attractive and proposes to change it to 'Tongsheng Bar.' Wang and Zhao agree, but Li opposes; B: In view of the sluggish business, Wang proposes to suspend operations for one month for renovation and reorganization. Zhang and Zhao agree, but Li opposes; C: Due to the urgent needs of the bar, Zhao proposes to sell a batch of coffee machines to the bar. Zhang and Wang agree, but Li opposes; D: Given the four partners' lack of experience in bar management,

3-6: 酒吧开业一年后,经营环境发生了巨大变化,所有合伙人召开会议商讨对策。根据《合伙企业法》,以下投票事项被视为有效投票:A: 张认为“同城”这个名字不够吸引人,提议改为“同升酒吧”。王和赵同意,但李反对;B: 鉴于生意低迷,王提议暂停营业一个月进行装修和整顿。张和赵同意,但李反对;C: 由于酒吧的紧急需求,赵提议向酒吧出售一批咖啡机。张和王同意,但李反对;D: 鉴于四位合伙人缺乏酒吧管理经验,

Table 9: Evaluation Results on Legal Benchmarks (Metric is F1 / EM as indicated in Table 8)

表 9: 法律基准评估结果 (指标为 F1 / EM,如表 8 所示)

| Task | Zero-Shot CoT | GraphRAG Local | Ours (N=5) |
| --- | --- | --- | --- |
| LawBench 1-1 | 21.31 | 23.27 | 78.58 |
| LawBench 1-2 | 54.24 | 62.60 | 70.60 |
| LawBench 3-1 | 53.32 | 74.60 | 83.16 |
| LawBench 3-2 | 25.10 | 34.35 | 63.34 |
| LawBench 3-6 | 51.16 | 47.64 | 61.91 |
| LawBench 3-8 | 27.51 | 25.98 | 46.05 |
| Open Australian Legal QA | 17.44 | 18.43 | 23.58 |

Table 10: Evaluation Results on Legal Benchmarks (Metric is Acc)

表 10: 法律基准评估结果(指标为准确率)

| 任务 | Zero-Shot CoT | GraphRAG Local | Ours (N=5) |
| --- | --- | --- | --- |
| LawBench 1-1 | 1.23 | 16.60 | 90.12 |
| LawBench 1-2 | 54.00 | 63.40 | 70.60 |
| LawBench 3-1 | 49.90 | 75.40 | 88.82 |
| LawBench 3-2 | 15.83 | 27.60 | 67.54 |
| LawBench 3-6 | 51.12 | 57.00 | 62.73 |
| LawBench 3-8 | 49.70 | 58.80 | 61.72 |
| Open Australian Legal QA | 16.48 | 88.27 | 98.59 |

Li proposes to appoint his friend Wang as the managing partner. Zhang and Wang agree, but Zhao opposes.

3-8: Resident A rented out the house to B. With A's consent, B renovated the rented house and sublet it to C. C unilaterally altered the load-bearing structure of the house. Why can A request B to bear liability for breach of contract?

李提议任命他的朋友王为管理合伙人。张和王同意,但赵反对。

3-8:住户 A 将房子租给 B。在 A 的同意下,B 对租来的房子进行了装修并转租给 C。C 单方面改变了房屋的承重结构。为什么 A 可以要求 B 承担违约责任?

Open Australian Legal QA The benchmark consists of 2,124 questions and answers synthesized by GPT-4 from the Australian legal corpus. All questions are of the generation type. One example is: "What is the landlord’s general obligation under section 63 of the Act in the case of Anderson v Armitage [2014] NSWCATCD 157 in New South Wales?"

Open Australian Legal QA
该基准由 GPT-4 基于澳大利亚法律语料库合成的 2,124 个问题和答案组成。所有问题均为生成类型。一个例子是:"在新南威尔士州 Anderson v Armitage [2014] NSWCATCD 157 案中,根据该法第 63 条,房东的一般义务是什么?"

Evaluation results are listed in Table 9, where we only compare to "GraphRAG Local", as it generally performs better than "GraphRAG Global" on these tasks.

评估结果列在表 9 中,我们仅与 "GraphRAG Local" 进行比较,因为在这些任务上,它通常比 "GraphRAG Global" 表现更好。

For the aforementioned reasons, we also use GPT-4 to evaluate all experimental results, reporting the accuracy (Acc) in Table 10. When comparing the results in Table 9 and Table 10, we observe that the order of the results is preserved, even though some metrics change significantly. In the following section, we aim to identify the reasons behind these changes, which may provide valuable insights for designing better metrics to evaluate RAG frameworks in the future.

基于上述原因,我们也使用 GPT-4 来评估所有实验结果,并在表 10 中报告准确率 (Acc)。在比较表 9 和表 10 的结果时,我们观察到结果的顺序保持不变,尽管某些指标发生了显著变化。在接下来的部分中,我们旨在找出这些变化背后的原因,这可能为未来设计更好的评估 RAG 框架的指标提供有价值的见解。
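用 GPT-4 这类大语言模型作为评审来计算 Acc 的做法,可以抽象为如下骨架。其中 judge 表示一个"提示词进、yes/no 出"的假设性调用接口,提示词内容亦为示意,并非论文实际使用的原文。

```python
JUDGE_PROMPT = (
    "问题:{q}\n参考答案:{gold}\n待评回答:{pred}\n"
    "如果待评回答与参考答案表达的事实一致,请输出 yes,否则输出 no。"
)

def llm_accuracy(samples, judge):
    """samples: [(问题, 参考答案, 模型回答)] 列表;
    judge: 接收完整提示词、返回 'yes' 或 'no' 的调用接口。
    返回被评审判定为正确的比例,对应表 10 报告的 Acc。"""
    hits = sum(
        judge(JUDGE_PROMPT.format(q=q, gold=g, pred=p)) == "yes"
        for q, g, p in samples
    )
    return hits / len(samples)
```

与词级 F1 / EM 相比,这类语义层面的判定对"答案正确但措辞不同"的回答更宽容,这也是表 9 与表 10 中部分指标差异显著、但方法排序保持一致的原因之一。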

The questions and golden answers of the consulting task (3-8) are collected from a consulting website, resulting in varying quality. For example, one question asks "Do the children from the original marriage have an obligation to support the father?" However, the provided golden answer includes an irrelevant article, "Article 1067," which pertains to parents' obligations to support minor children.

咨询任务(3-8)的问题与标准答案来自一个咨询网站,质量参差不齐。例如,一个问题问"原婚姻中的子女是否有义务赡养父亲?"然而,提供的标准答案中却包含了一个不相关的条款,"第1067条",该条款涉及父母对未成年子女的抚养义务。

……婚姻是否有义务赡养父亲?参考答案:在我国,亲生子女有义务赡养父母。我们的回答:是的,有赡养义务。法律依据:根据……

……各自组建新家庭并生育新子女,根据法院的判决,父亲需要按月支付抚养费给母亲,直到孩子年满18岁。原家庭的子女与父母的关系并不会因为父母的离婚而解除。如果子女不赡养老人,父母可以直接向人民法院提起诉讼,要求法院判令子女支付赡养费。法律依据:《中华人民共和国民法典》第1067条规定,如果父母不履行抚养义务,未成年子女或不能独立生活的成年子女有权要求父母支付抚养费。如果成年子女不履行赡养义务,丧失劳动能力或生活困难的父母有权要求成年子女支付赡养费。第1084条规定,父母与子女的关系不因父母离婚而解除。离婚后,无论子女由父亲或母亲直接抚养,仍是双方的子女。律师解释:父母离婚后,成年子女仍是父母双方的子女。如果成年子女不履行赡养义务,丧失劳动能力或生活困难的父母有权要求成年子女支付赡养费。根据《中华人民共和国民法典》第1069条,父母与子女的关系不因父母婚姻关系的变化而终止。因此,即使父母离婚并再婚生育新子女,原家庭的子女仍有赡养义务。

  1. The accuracy of all methods on choice tasks 1-2, 3-1, and 3-6 almost coincides with the F1 score, as expected. An exception is task 3-1, where the difference is mainly due to GPT-4’s capacity to understand Chinese, particularly in distinguishing numbers in Arabic and Chinese. In Chinese law, all numbers are written in Chinese, while in the golden answers, all numbers are given in Arabic.

所有方法在任务 1-2、3-1 和 3-6 上的准确率几乎与 F1 分数一致,正如预期的那样。例外是任务 3-1,其中的差异主要归因于 GPT-4 理解中文的能力,特别是在区分阿拉伯数字和中文数字方面。在中国法律中,所有数字都用中文书写,而在标准答案中,所有数字都用阿拉伯数字给出。
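针对这一中文数字与阿拉伯数字不一致的问题,评测前可以先对条款号做一步归一化。下面是一个只覆盖四位以内常见条款号的示意实现,并非 LawBench 的官方评测逻辑:

```python
import re

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

def cn_num_to_int(s: str) -> int:
    """把四位以内的中文数字(如"二百六十六")转成整数。"""
    result, current = 0, 0
    for ch in s:
        if ch in CN_DIGITS:
            current = CN_DIGITS[ch]
        elif ch == "十":
            result += (current or 1) * 10  # "十"前无数字时按"一十"处理
            current = 0
        elif ch == "百":
            result += current * 100
            current = 0
        elif ch == "千":
            result += current * 1000
            current = 0
    return result + current

def normalize_article(text: str) -> str:
    """把"第二百六十六条"之类的条款号归一化为"第266条"。"""
    return re.sub(r"第([零一二三四五六七八九十百千]+)条",
                  lambda m: f"第{cn_num_to_int(m.group(1))}条", text)
```

归一化后再做 EM 比较,即可消除"法条原文用中文数字、标准答案用阿拉伯数字"带来的虚假不匹配。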

6.4 Real Case Studies

6.4 真实案例研究

This section presents three case studies from our evaluation benchmark to illustrate the underlying principles of our proposed decomposition pipeline, as detailed in Algorithm 1. Through these realworld examples, we aim to highlight the benefits of our systematic approach. These cases will shed light on how each step of the pipeline contributes to improved performance and the insights gained from their implementation.

本节展示了来自我们评估基准的三个案例研究,以说明我们提出的分解流水线的基本原理,详见算法 1。通过这些真实世界的例子,我们旨在突出这一系统化方法的优势。这些案例将揭示流水线中的每一步如何提升性能,以及从实施中获得的洞察。

Our task decomposition strategy involves generating multiple atomic queries rather than producing a single deterministic follow-up question, as demonstrated in the Self-Ask approach. Contemporary decomposition methods typically employ a generative model to formulate a singular follow-up question. However, this approach carries an intrinsic risk of generating erroneous questions, potentially leading to an incorrect decomposition pathway and, ultimately, an erroneous answer. Consider the Case (a) depicted in Figure 16, where the original question pertains to a film titled "What Women Love." Due to the existence of a more prominent film, "What Women Want," the employed language model tends to ‘correct’ the original question. Consequently, methods like Self-Ask (as shown on the left side of Figure 16) generate only one follow-up question related to this erroneously assumed object. In the illustrated instance, although the target chunk has been retrieved due to the similarity in embeddings, a ‘false’ intermediate answer is produced for the ‘false’ follow-up question, culminating in an incorrect final response. In contrast, our methodology posits atomic queries concerning both "What Women Love" and "What Women Want," thereby seeking to clarify the true intent of the initial question. With both films existing and relevant atomic questions being retrieved, our approach subsequently gains the advantage of verifying the question’s intent and selecting the correct and most pertinent chunk during the atomic selection phase.

我们的任务分解策略涉及生成多个原子查询,而不是像Self-Ask方法那样生成单一的确定性后续问题。当前的分解方法通常采用生成模型来构建单一的后续问题。然而,这种方法存在生成错误问题的内在风险,可能导致错误的分解路径,并最终产生错误的答案。考虑图16中的案例(a),原始问题涉及一部名为《What Women Love》的电影。由于存在另一部更著名的电影《What Women Want》,所使用的语言模型倾向于“纠正”原始问题。因此,像Self-Ask这样的方法(如图16左侧所示)仅生成一个与这个错误假设对象相关的后续问题。在所示实例中,尽管由于嵌入的相似性检索到了目标块,但为“错误”的后续问题生成了“错误”的中间答案,最终导致错误的最终响应。相比之下,我们的方法提出了关于《What Women Love》和《What Women Want》的原子查询,从而试图澄清初始问题的真实意图。由于两部电影都存在且相关的原子问题被检索到,我们的方法随后在原子选择阶段获得了验证问题意图并选择正确且最相关块的优势。
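上述"多原子查询 + 原子选择"的一轮分解可以抽象为如下骨架。其中 propose、retrieve、select 分别对应原子查询生成、原子问题检索与原子选择三个环节,均以可替换的函数占位表示,并非论文的完整实现:

```python
def decompose_round(question, context, propose, retrieve, select):
    """单轮知识感知任务分解:
    propose : (原问题, 已积累上下文) -> 多个候选原子查询(而非单一后续问题)
    retrieve: 原子查询 -> [(命中的原子问题, 其所属完整知识块)] 列表
    select  : (原问题, 候选列表) -> 选中的 (原子问题, 知识块);None 表示可直接作答
    """
    candidates = []
    for q in propose(question, context):
        candidates.extend(retrieve(q))
    return select(question, candidates)
```

与单一确定性后续问题不同,这里把"理解问题真实意图"的决策推迟到原子选择阶段:即使 propose 产生了一个被模型"纠正"过的错误查询,只要候选列表中仍包含与原问题匹配的原子问题,select 就有机会选回正确路径。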


Figure 16: Case (a): Given the lesser-known film "What Women Love" as opposed to the more popular "What Women Want," single-path methods like Self-Ask on the left are predisposed to generating follow-up questions about the latter, leading to an incorrect final answer. Conversely, PIKE-RAG can effectively discern the intended meaning of the original question by positing several atomic queries and postpone the task understanding to atomic selection phase with relevant atomic questions provided, and subsequently arriving at an accurate conclusion.

图 16: 案例 (a): 给定相对不知名的电影《What Women Love》而非更受欢迎的《What Women Want》,左侧的 Self-Ask 等单路径方法倾向于生成关于后者的后续问题,从而导致错误的最终答案。相反,PIKE-RAG 可以通过提出多个原子查询有效辨别原始问题的意图,并将任务理解推迟到原子选择阶段,提供相关的原子问题,从而得出准确的结论。

Furthermore, the discrepancy between the formulation of the corpus and the query, is another critical factor advocating for a multi-query approach over a singular deterministic one. The presentation gap can impede the retrieval process even when the generated follow-up question is semantically accurate. For instance, as illustrated in Case (b) in Figure 17, a single-path method such as Self-Ask on the left side might directly inquire 'Who is the mother of Oskar Roehler?' However, the knowledge base articulates familial relationships using a different schema, 'A is the son of B and C' in this case, thus the retrieval process falters despite the correctness of the question. Even when we applied the hierarchical retrieval to Self-Ask, the Self-Ask with Hierarchical Retrieval did not succeed in bridging this gap. In contrast, our approach, which generates multiple atomic queries, encompasses a broader range of phrasings that correspond to the diverse representations in the knowledge base. In the depicted case, while the atomic query specifically asking for Oskar Roehler's mother encounters the same retrieval issue, an alternative query seeking information about his parents successfully retrieves the target chunk. This exemplifies how our method's flexibility in query generation enhances the likelihood of aligning with the knowledge base's structure and obtaining accurate information.

此外,语料库的构建与查询之间的不一致性,是另一个支持多查询方法而非单一确定性方法的关键因素。这种表达差距可能会阻碍检索过程,即使生成的后续问题在语义上是准确的。例如,如图 17 中的案例 (b) 所示,单一路径方法(如左侧的 Self-Ask)可能会直接询问"Oskar Roehler 的母亲是谁?"然而,知识库使用不同的模式来描述家庭关系,即"A 是 B 和 C 的儿子",在这种情况下,尽管问题本身正确,检索过程仍然失败。即使我们将分层检索应用于 Self-Ask,带有分层检索的 Self-Ask 也未能成功弥合这一差距。相比之下,我们的方法生成多个原子查询,涵盖了与知识库中多样化表示相对应的更广泛的表达方式。在所示案例中,虽然专门询问 Oskar Roehler 母亲的原子查询遇到了相同的检索问题,但另一个询问其父母信息的替代查询成功检索到了目标块。这展示了我们方法在查询生成中的灵活性如何提高了与知识库结构对齐并获取准确信息的可能性。


Figure 17: Case (b): By proposing multiple atomic queries, PIKE-RAG effectively retrieves the relevant knowledge chunk, whereas the single deterministic follow-up question approach employed by Self-Ask fails to align with the knowledge base’s schema, resulting in a retrieval failure.

图 17: 情况 (b): 通过提出多个原子查询,PIKE-RAG 有效检索到了相关的知识块,而 Self-Ask 使用的单一确定性后续问题方法未能与知识库的模式对齐,导致检索失败。

Our methodology emphasizes the retrieval of atomic questions rather than directly retrieving chunks. This design choice is exemplified in Case (b) depicted in Figure 17. The knowledge chunk in the corpus is structured using the pattern 'A ... as the son of B and C', which poses challenges for direct retrieval by queries such as 'Who is the mother of ...'. In our specialized knowledge base, such direct queries tend to retrieve chunks conforming to the patterns 'A is the mother of B' or 'A is the father of B'. By utilizing atomic questions as intermediaries for retrieval, our approach effectively narrows the gap between a single query and the multiple sentence structures found in the knowledge base. It facilitates bridging the expression pattern differences exemplified by 'the mother of' versus 'the son of' in this scenario.

我们的方法强调检索原子问题,而不是直接检索知识块。这一设计选择在图 17 的案例 (b) 中得到了体现。语料库中的知识块采用了"A ... as the son of B and C"的结构模式,这对于诸如"Who is the mother of ..."之类的查询直接检索来说存在挑战。在我们专门的知识库中,这类直接查询往往会检索出符合"A is the mother of B"或"A is the father of B"模式的知识块。通过利用原子问题作为检索中介,我们的方法有效缩小了单一查询与知识库中多种句子结构之间的差距。它有助于弥合如"the mother of"与"the son of"在此场景中所体现的表达模式差异。

In contrast to methods like Self-Ask, which only retains intermediate answers for subsequent processing, our method preserves the entire chunk as contextual information. During the atomic selection phase, we present a list of atomic questions as candidate summaries of the relevant content from the original chunk. This strategy significantly reduces token usage and simplifies the process of selecting the pertinent information. Case (c) in Figure 18 demonstrates the dual benefits of our approach: first, by selecting from a curated list of atomic questions, we streamline the identification of relevant information; second, by retaining the entire selected chunk rather than just the intermediate answer, we ensure a rich context is maintained for accurate and comprehensive subsequent processing. While the Self-Ask method on the left retrieves the target chunk, it fails to correctly identify the pertinent ’Ernie Watts’ due to the excessive contextual information. Since retrieved chunks in Self-Ask are discarded after generating an intermediate answer, the method potentially follows an incorrect pathway, leading to an inaccurate conclusion. In contrast, our approach can efficiently filter and select the appropriate atomic question from a concise list. Although the atomic question in this round pertains to the role of Ernie Watts, there is no need to inquire further about his birthplace, as this information is encapsulated within the selected chunk, which remains available for context in subsequent rounds.

与仅保留中间答案以供后续处理的 Self-Ask 等方法不同,我们的方法保留了整个块作为上下文信息。在原子选择阶段,我们将一系列原子问题作为原始块相关内容的候选摘要呈现。这一策略显著减少了 Token 的使用,并简化了选择相关信息的过程。图 18 中的案例 (c) 展示了我们方法的双重优势:首先,通过从精心挑选的原子问题列表中进行选择,我们简化了相关信息的识别过程;其次,通过保留整个选定的块而不仅仅是中间答案,我们确保了丰富的上下文信息,以便进行准确而全面的后续处理。虽然左侧的 Self-Ask 方法检索到了目标块,但由于上下文信息过多,它未能正确识别出相关的 "Ernie Watts"。由于 Self-Ask 中检索到的块在生成中间答案后会被丢弃,该方法可能会遵循错误的路径,导致不准确的结论。相比之下,我们的方法能够有效地从简洁的列表中筛选并选择适当的原子问题。尽管本轮原子问题涉及 Ernie Watts 的角色,但无需进一步询问他的出生地,因为这些信息已封装在选定的块中,并在后续轮次中仍可作为上下文使用。
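"保留选中的原子问题及其完整知识块、而非仅保留中间答案"的迭代过程,可以概括为如下骨架。其中 decompose_round 与 answer 为占位函数,max_rounds 为假设的迭代上限,并非论文给出的具体取值:

```python
def iterative_qa(question, decompose_round, answer, max_rounds=5):
    """每轮选中一个 (原子问题, 完整知识块) 并加入上下文;
    当本轮选择返回 None(信息已足够)或达到轮数上限时,
    基于积累的完整上下文作答。"""
    context = []  # 积累 (原子问题, 完整知识块) 对,而非仅中间答案
    for _ in range(max_rounds):
        picked = decompose_round(question, context)
        if picked is None:
            break
        context.append(picked)
    return answer(question, context)
```

由于每轮选中的知识块整体保留在 context 中,后续轮次无需为块内已包含的事实(如案例 (c) 中 Ernie Watts 的出生地)再次发起查询。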


Figure 18: Case (c): PIKE-RAG’s benefits from leveraging a concise list of atomic questions for targeted selection and retaining full chunks for rich contextual support. Conversely, Self-Ask’s approach, although successful in retrieving relevant chunks, is compromised by its dependency on intermediate answers for context, which ultimately results in the generation of incorrect final answers.

图 18: 案例 (c): PIKE-RAG 通过利用简洁的原子问题列表进行针对性选择,并保留完整块以提供丰富的上下文支持而受益。相反,Self-Ask 的方法尽管成功检索到了相关块,但由于其对中间答案的依赖而受到影响,最终导致生成错误的最终答案。

7 Conclusion

7 结论

To address the diverse challenges faced by RAG systems in industrial applications, we propose that the core foundation of RAG systems should extend beyond traditional retrieval mechanisms to the effective construction and utilization of specialized knowledge and rationale. Therefore, we introduce a new paradigm that classifies tasks based on their difficulty in knowledge extraction, comprehension, and utilization, providing a novel framework for system design and evaluation. Applying this paradigm allows for phased exploration of RAG capabilities, which facilitates the progressive refinement of RAG algorithms and the staged implementation of RAG applications. Moreover, we introduce the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, focusing on specialized knowledge extraction and rationale construction. PIKE-RAG effectively extracts, comprehends, and organizes specialized knowledge and constructs coherent rationales for accurate answers, offering customizable system capabilities to meet varying requirements. Additionally, we propose knowledge atomizing and knowledge-aware task decomposition to tackle complex questions, such as multi-hop queries, achieving significant performance improvements on various open-domain and legal benchmarks.

为了应对RAG系统在工业应用中面临的各种挑战，我们提出RAG系统的核心基础应超越传统的检索机制，延伸到专业知识和推理的有效构建与利用。因此，我们引入了一种新的范式，根据任务在知识提取、理解和利用方面的难度进行分类，为系统设计和评估提供了一个新颖的框架。应用这一范式可以分阶段探索RAG能力，从而促进RAG算法的逐步优化和RAG应用的分阶段实施。此外，我们引入了专注于专业知识提取和推理构建的 PIKE-RAG 框架。PIKE-RAG 有效地提取、理解并组织专业知识，构建连贯的推理以提供准确答案，提供可定制的系统能力以满足不同需求。同时，我们提出了知识原子化和知识感知的任务分解方法，以应对复杂问题，如多跳查询，在各种开放领域和法律基准测试中取得了显著的性能提升。

References

参考文献
