PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation
Abstract
Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. Reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning over specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems' problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from data chunks and iteratively construct a rationale based on the original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks. The code is publicly available at https://github.com/microsoft/PIKE-RAG.
1 Introduction
Large Language Models (LLMs) have revolutionized the field of natural language processing by demonstrating the capability to generate coherent and contextually relevant text. These advanced models are trained on expansive corpora, equipping them with the versatility to execute a diverse spectrum of linguistic tasks, ranging from text completion to translation and summarization [5, 9, 50, 6]. Despite their broad capabilities, LLMs exhibit pronounced limitations when tasked with specialized queries in professional domains [38, 54], a demand that is particularly acute in industrial applications. This primarily stems from the scarcity of domain-specific training material and a limited grasp of specialized knowledge and rationale within these domains. As a result, LLMs may produce responses that are not only potentially erroneous but also lack the detail and precision required for expert-level engagement [11]. Beyond these limitations on domain-specific tasks, another striking issue with LLMs is the phenomenon known as "hallucination", where the model generates information that is not grounded in reality or factual data [10, 57]. Moreover, the knowledge base of LLMs, being static and crystallized at the point of their last update, introduces temporal stasis [13]. Further compounding these challenges is the issue of long-context comprehension [37]. Existing LLMs struggle to maintain an understanding of task definitions across long contexts, and their performance tends to deteriorate significantly when confronted with more complex and demanding tasks.
To address the inherent limitations of LLMs, Retrieval-Augmented Generation (RAG) [35] has been proposed, which merges the generative capabilities of LLMs with a retrieval mechanism, allowing the incorporation of relevant external information to anchor the generated text in factual data. This integrated strategy improves both the accuracy and reliability of the generated content, providing a promising pathway for the practical deployment of LLMs in industrial applications. However, current RAG methods remain heavily reliant on text retrieval and the comprehension capabilities of LLMs, paying little attention to extracting, understanding, and utilizing knowledge from diverse source data. In industrial applications that require expertise, such as specialized knowledge and problem-solving rationale, existing RAG approaches, designed primarily for research benchmarks, demonstrate significant limitations. The challenges that RAG encounters in industrial applications remain poorly characterized, yet a comprehensive insight into these challenges is crucial for the development of RAG algorithms. Therefore, we summarize the main challenges as follows.
• Knowledge source diversity: RAG systems are constructed upon a diverse corpus of source documents collected over many years from various domains, encompassing a wide range of file formats like scanned images, digital text files, and web data, sometimes accompanied by specialized databases. In contrast, widely-used datasets [28, 60, 51] typically feature pre-segmented, simplified corpora that do not capture the complexity of real-world data. Existing methods designed for these benchmarks struggle to efficiently extract specialized knowledge and uncover underlying rationales from diverse sources, particularly in industrial applications. For example, an LED product datasheet typically comprises specifications such as performance characteristics presented in complex tables, electrical properties depicted in charts, and installation instructions illustrated with figures. Addressing queries related to the non-textual knowledge presents significant challenges for existing RAG approaches.
• Domain specialization deficit: In industrial applications, RAG systems are expected to leverage specialized knowledge and rationale in professional fields. However, this specialized knowledge is characterized by domain-specific terminologies, expertise, and distinctive logical frameworks that are integral to its use. RAG approaches built on common knowledge-centric datasets demonstrate unsatisfactory performance when applied to professional fields, as LLMs exhibit deficiencies in extracting, understanding, and organizing domain-specific knowledge and rationale [38]. For example, in the field of semiconductor design, research relies heavily on a deep understanding of underlying physical properties. When LLMs are utilized to extract and organize specialized knowledge and rationale from research documents, they often fail to properly capture essential physical principles and achieve a comprehensive understanding due to their inherent limitations. Consequently, RAG systems frequently produce incomplete or inaccurate interpretations of critical problem elements and generate responses that lack proper rationale grounded in physical principles. Moreover, assessing the quality of professional content generation poses a significant challenge. This issue not only impedes the development and optimization of RAG algorithms but also complicates their practical deployment across various industrial applications.
• One-size-fits-all: Various RAG application scenarios, although based on a similar framework, present different challenges that require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale. The complexity and focus of questions vary across these scenarios, and within a single scenario, the difficulty can also differ. For example, in rule-based query scenarios, such as determining the legal conditions for mailing items, RAG systems primarily focus on retrieving relevant factual rules by bridging the semantic gap between the query and the rules. In multi-hop query scenarios, such as comparing products across multiple aspects, RAG systems emphasize extracting information from diverse sources and performing multi-hop reasoning to arrive at accurate answers. Most existing RAG approaches [62] adopt a one-size-fits-all strategy, failing to account for the varying complexities and specific demands both within and across scenarios. This results in solutions that do not meet the comprehensive accuracy standards required for practical applications, thereby limiting the development and integration of RAG systems in real-world environments.
We believe that the key to addressing these challenges lies in advancing beyond traditional retrieval augmentation, by effectively extracting, understanding, and applying specialized knowledge, and developing appropriate reasoning logic tailored to the specific tasks and the knowledge involved. We refer to this approach as sPecIalized Knowledge and Rationale Augmentation. Given that various tasks require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale, we summarize and categorize the questions commonly encountered into four types with respect to their difficulty: factual questions, linkable-reasoning questions, predictive questions, and creative questions. Accordingly, we propose a classification of RAG system capability levels, aligned with the system’s ability to solve these different types of problems. This classification serves as a guideline for systematically advancing the system’s capabilities in a controllable and measurable manner.
Furthermore, we propose the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, which not only supports phased system development and deployment, demonstrating excellent versatility, but also enhances capabilities by effectively leveraging specialized knowledge and rationale. Within this framework, knowledge extraction components are employed to extract specialized knowledge from diverse source data, laying a robust foundation for knowledge-based retrieval and reasoning. Additionally, a task decomposer is utilized to dynamically manage the routing of retrieval and reasoning operations, creating specialized rationale based on available knowledge. PIKE-RAG enables a phased exploration of RAG capabilities, which facilitates the progressive refinement of RAG algorithms and the staged implementation of RAG applications. For each development phase, the RAG framework and its modules are tailored to address specific challenges. For example, in the knowledge base construction phase, a multi-layer heterogeneous graph is employed to effectively represent the relationships between various components of the data, enhancing knowledge organization and integration. The RAG system designed for factual questions introduces multi-granularity retrieval, allowing for multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph to improve factual retrieval accuracy. In the advanced RAG system, aimed at addressing complex queries, knowledge atomizing is introduced to fully explore the intrinsic knowledge within data chunks, while knowledge-aware task decomposition manages the retrieval and organization of multiple pieces of atomic knowledge to construct a coherent rationale.
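The interplay between knowledge atomizing and knowledge-aware task decomposition can be illustrated with a minimal control-flow sketch. Here `atomize`, `propose_subquery`, and `answer` are simple stand-ins for the LLM calls used in the actual framework, so this is only an illustration of the iterative retrieve-and-reason loop, not PIKE-RAG's implementation.

```python
def atomize(chunk: str) -> list[str]:
    """Stand-in for LLM-based knowledge atomizing: emit one atomic
    statement per sentence of the chunk."""
    return [s.strip() for s in chunk.split(".") if s.strip()]

def decompose_and_answer(query, chunks, propose_subquery, answer, max_hops=4):
    """Knowledge-aware task decomposition sketch: repeatedly propose the
    next sub-query given the accumulated atomic knowledge, retrieve
    matching atoms, and stop once the proposer signals completion."""
    atoms = [a for c in chunks for a in atomize(c)]
    accumulated = []                                  # the growing rationale
    for _ in range(max_hops):
        sub = propose_subquery(query, accumulated)    # an LLM call in practice
        if sub is None:                               # rationale is complete
            break
        hits = [a for a in atoms
                if any(w in a.lower() for w in sub.lower().split())]
        accumulated.extend(hits[:1])                  # keep the best atom per hop
    return answer(query, accumulated)                 # final generation step
```

A toy proposer that first asks for the employer, then for its location, walks this loop through a two-hop question.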
Extensive experiments are conducted to evaluate the performance of the proposed PIKE-RAG framework on both open-domain and legal benchmarks, and experimental results demonstrate the effectiveness of PIKE-RAG. Our framework and staged development strategy could further advance the current research and application of RAG in industrial contexts. In summary, the contributions of this work are as follows:
2 Related work
2.1 RAG
Retrieval-Augmented Generation (RAG) has emerged as a promising solution that effectively incorporates external knowledge to enhance response generation. Initially, retrieval-augmented techniques were introduced to improve the performance of pre-trained language models on knowledge-intensive tasks [35, 29, 12]. With the booming of Large Language Models [5, 9, 50, 6], most research in the RAG paradigm has shifted towards a framework that first retrieves pertinent information from external data sources and subsequently integrates it into the context of the query prompt as supplementary knowledge for contextually relevant generation [46]. Following this framework, the naive RAG research paradigm [25] converts raw data into uniform plain text and segments it into smaller chunks, which are encoded into a vector space for query-based retrieval. The top-k relevant chunks are used to expand the context of the prompt for generation. To enhance the retrieval quality of naive RAG, advanced RAG approaches implement specific enhancements across the pre-retrieval, retrieval, and post-retrieval processes, including query optimization [39, 63], multi-granularity chunking [16, 65], mixed retrieval, and chunk re-ranking.
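The chunk-embed-retrieve step of naive RAG can be sketched in a few lines. To keep the example self-contained we use a toy bag-of-words embedding with cosine similarity; a real system would substitute a dense encoder and a vector index, so treat this purely as an illustration of the pipeline shape.

```python
import math
from collections import Counter

def chunk_text(text: str, size: int = 40) -> list[str]:
    """Split raw text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words term counts (a dense
    encoder would be used in practice)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(corpus: str, query: str, k: int = 2, size: int = 40) -> list[str]:
    """Rank chunks by similarity to the query and keep the top k;
    these chunks would then be prepended to the LLM prompt."""
    chunks = chunk_text(corpus, size)
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]
```

The pre-/post-retrieval enhancements mentioned above (query optimization, re-ranking) slot in before `embed(query)` and after the `sorted` call, respectively.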
Beyond the aforementioned RAG paradigms, numerous sophisticated enhancements to RAG pipelines and system modules have been introduced within modular RAG systems [26], aiming to improve system capability and versatility. These advancements have enabled the processing of a wider variety of source data, facilitating the transformation of raw information into structured data and, ultimately, into valuable knowledge [56, 20]. Furthermore, the indexing and retrieval modules have been refined with multi-granularity and multi-architecture approaches [58, 65]. Various pre-retrieval [24, 64] and post-retrieval [18, 30] functions have been proposed to enhance both retrieval effectiveness and the quality of subsequent generation. It has been recognized that naive RAG systems are insufficient to tackle complex tasks such as summarization [27] and multi-hop reasoning [51, 28]. Consequently, most recent research focuses on developing advanced coordination schemes that leverage existing modules to collaboratively address these challenges. ITER-RETGEN [48] and DSP [33] employ retrieve-read iteration to leverage the generated response as the context for the next round of retrieval. FLARE [31] proposes a confidence-based active retrieval mechanism that dynamically adjusts the query with respect to low-confidence tokens in the regenerated sentences. These loop-based RAG pipelines progressively converge towards the correct answer and provide enhanced flexibility for RAG systems in addressing diverse requirements.
2.2 Knowledge bases for RAG
In naive RAG approaches, source data is converted to plain text and chunked for retrieval. However, as RAG applications expand and the demand for diversity grows, plain-text-based retrieval becomes insufficient for several reasons: (1) textual information is generally redundant and noisy, leading to decreased retrieval quality; (2) complex problems require the integration of multiple data sources, and plain text alone cannot adequately represent the intricate relationships between objects. As a result, researchers are exploring diverse data sources to enrich the corpus, incorporating search engines [59, 53], databases [55, 41, 47], knowledge graphs [49, 56], and multimodal corpora [17, 15]. Concurrently, there is an emphasis on developing efficient knowledge representations of the corpus to enhance knowledge retrieval. A graph is regarded as a powerful knowledge representation because of its capacity to intuitively model complex relationships. GraphRAG [20] combines knowledge graph generation and query-focused summarization with RAG to address both local and global questions. HOLMES [42] constructs hyper-relational KGs and prunes them into distilled graphs, which serve as input to LLMs for multi-hop question answering. However, the construction of knowledge graphs is extremely resource-intensive, and the associated costs scale with the size of the corpus.
2.3 Multi-hop QA
Multi-hop Question Answering (MHQA) [60] involves answering questions that require reasoning over multiple pieces of information, often scattered across different documents or paragraphs. This task presents unique challenges, as it necessitates not only retrieving relevant information but also effectively combining and reasoning over the retrieved pieces to arrive at a correct answer. Traditional graph-based methods in MHQA solve the problem by building graphs and inferring over graph neural networks (GNNs) to predict answers [44, 21]. With the advent of LLMs, recent graph-based methods [36, 42] have evolved to construct knowledge graphs for retrieval and generate responses through LLMs. Another branch of methods dynamically converts multi-hop questions into a series of sub-queries by generating subsequent questions based on the answers to previous ones [52, 33, 23]. The sub-queries guide the sequential retrieval, and the retrieved results in turn are used to improve reasoning. Treating MHQA as a supervised problem, Self-RAG [61] trains an LM to learn to retrieve, generate, and critique text passages, and beam-retrieval [7] models the multi-hop retrieval process in an end-to-end manner by jointly optimizing an encoder and classification heads across all hops. Self-Ask [43] improves Chain-of-Thought (CoT) prompting by explicitly asking itself follow-up questions before answering the initial question. This method enables the automatic decomposition of questions and can be seamlessly integrated with retrieval mechanisms to tackle multi-hop question answering.
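The Self-Ask pattern can be made concrete with a short sketch. The follow-up generator and the sub-question answerer below are stand-ins for the LLM and the retrieval backend, so only the control flow and the "Follow up:" / "Intermediate answer:" transcript convention are illustrated here, not the method's exact prompting.

```python
def self_ask(question, follow_up_generator, answer_sub, max_follow_ups=3):
    """Ask follow-up questions before the final answer, Self-Ask style.

    follow_up_generator(question, facts) -> next follow-up or None
    answer_sub(follow_up) -> intermediate answer (retrieval + LLM in practice)
    """
    transcript = [f"Question: {question}"]
    facts = []
    for _ in range(max_follow_ups):
        follow_up = follow_up_generator(question, facts)
        if follow_up is None:           # the model decides it can answer now
            break
        intermediate = answer_sub(follow_up)
        transcript.append(f"Follow up: {follow_up}")
        transcript.append(f"Intermediate answer: {intermediate}")
        facts.append(intermediate)
    final = facts[-1] if facts else None
    transcript.append(f"So the final answer is: {final}")
    return final, "\n".join(transcript)
```

A two-hop bridging question ("founder of X" then "birthplace of founder") exercises the loop with a toy generator and a dictionary in place of retrieval.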
3 Problem formulation
Existing research mainly concentrates on algorithmic enhancements to improve the performance of RAG systems. However, there is limited effort in providing a comprehensive and systematic discussion of the RAG framework. In this work, we conceptualize the RAG framework from three key perspectives: knowledge base, task classification, and system development. We assert that the knowledge base serves as the fundamental cornerstone of RAG, underpinning all retrieval and generation processes. Furthermore, we recognize that RAG tasks can vary significantly in complexity and difficulty, depending on the required generation capabilities and the availability of supporting corpora. By categorizing tasks according to their difficulty levels, we classify RAG systems into distinct levels based on their problem-solving capabilities across the different types of questions.
3.1 Knowledge base
In industrial applications, specialized knowledge primarily originates from years of accumulated data within specific fields such as manufacturing, energy, and logistics. For example, in the pharmaceutical industry, data sources include extensive research and development documentation, as well as drug application files amassed over many years. These sources are not only diverse in file formats, but also encompass a significant amount of multi-modal content such as tables, charts, and figures, which are also crucial for problem-solving. Furthermore, there are often functional connections between files within a specialized domain, such as hyperlinks, references, and relational database links, which explicitly or implicitly reflect the logical organization of knowledge within the professional field. Currently, existing datasets provide pre-segmented corpora and do not account for the complexities encountered in real-world applications, such as the integration of multi-format data and the maintenance of referential relationships between documents. Therefore, the construction of a comprehensive knowledge base is foundational for Retrieval-Augmented Generation (RAG) in the industrial field. As the architecture and quality of the knowledge base directly influence the retrieval methods and their performance, we propose structuring the knowledge base as a multi-layer heterogeneous graph, denoted as $G$, with nodes and edges represented by $(V, E)$. The graph nodes can include documents, sections, chunks, figures, tables, and customized nodes derived from distilled knowledge. The edges signify the relationships among these nodes, encapsulating the interconnections and dependencies within the graph. This multi-layer heterogeneous graph comprises three distinct layers: the information resource layer $G_{i}$, the corpus layer $G_{c}$, and the distilled knowledge layer $G_{dk}$. Each layer corresponds to a different stage of information processing, representing varying levels of granularity and abstraction in knowledge.
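The three-layer graph $G = (V, E)$ described above can be sketched with plain dictionaries. The node and edge type names below (`document`, `chunk`, `contains`, `distilled_to`, and the example payloads) are our own illustration of the idea, not the framework's exact schema.

```python
class KnowledgeGraph:
    """Minimal multi-layer heterogeneous graph: typed nodes assigned to
    one of three layers, plus typed directed edges between them."""

    LAYERS = ("information_resource", "corpus", "distilled_knowledge")

    def __init__(self):
        self.nodes = {}   # node_id -> {"layer": ..., "type": ..., "payload": ...}
        self.edges = []   # (src_id, relation, dst_id)

    def add_node(self, node_id, layer, node_type, payload):
        assert layer in self.LAYERS
        self.nodes[node_id] = {"layer": layer, "type": node_type, "payload": payload}

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        """Follow outgoing edges, optionally filtered by relation type."""
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]

# Example: a datasheet document, the chunk and table parsed from it,
# and a distilled-knowledge node derived from the chunk.
g = KnowledgeGraph()
g.add_node("doc1", "information_resource", "document", "LED datasheet")
g.add_node("chunk1", "corpus", "chunk", "Luminous flux: 300 lm at 350 mA")
g.add_node("table1", "corpus", "table", "electrical characteristics table")
g.add_node("fact1", "distilled_knowledge", "summary", "flux scales with drive current")
g.add_edge("doc1", "contains", "chunk1")
g.add_edge("doc1", "contains", "table1")
g.add_edge("chunk1", "distilled_to", "fact1")
```

Multi-granularity retrieval then amounts to entering the graph at one layer (e.g. a distilled-knowledge hit) and traversing edges to coarser or finer nodes.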
3.2 Task classification
Contemporary RAG frameworks frequently overlook the intricate difficulty and logical demands inherent to diverse tasks, typically employing a one-size-fits-all methodology. However, even with comprehensive knowledge retrieval, current RAG systems are insufficient to handle tasks of varying difficulty with equal effectiveness. Therefore, it is essential to categorize tasks and analyze the typical strategies for overcoming the challenges inherent to each category. The difficulty of a task is closely associated with several critical factors.
Figure 1: Illustrative examples of distinct question types (panels: Factual Questions; Linkable & Reasoning Questions).
• Effectiveness of Knowledge Utilization: The sophistication involved in applying the extracted knowledge to formulate responses, including synthesizing, organizing, and generating insights or predictions.
In categorizing real-world RAG tasks within industries, we focus on the processes of knowledge extraction, understanding, organization, and utilization to provide structured and insightful responses. Taking the aforementioned factors into account, we identify four distinct classes of questions that address a broad spectrum of demands. The first type, Factual Questions, involves extracting specific, explicit information directly from the corpus, relying on retrieval mechanisms to identify the relevant facts. Linkable-Reasoning Questions demand a deeper level of knowledge integration, often requiring multi-step reasoning and linking across multiple sources. Predictive Questions extend beyond the available data, requiring inductive reasoning and the structuring of retrieved facts into analyzable forms, such as time series, for future-oriented predictions. Finally, Creative Questions engage domain-specific logic and creative problem-solving, encouraging the generation of innovative solutions by synthesizing knowledge and identifying patterns or influencing factors. This categorization, driven by varying levels of reasoning and knowledge management, ensures a comprehensive approach to addressing industry-specific queries.
The criteria defining each category are elaborated in the following sections, with representative examples for each provided in Figure 1. For each question type, we also present the associated support data and the expected reasoning processes to illustrate the differences between these categories. These inquiries are formulated by experts in pharmaceutical applications, based on the data released by the FDA.
Factual Questions These questions seek specific, concrete pieces of information explicitly presented in the original corpus. The referenced text can be processed directly within an LLM's conversation context. As shown in Figure 1, this class of questions can be effectively answered if the relevant fact is successfully retrieved.
Linkable-Reasoning Questions Answering these questions necessitates gathering pertinent information from diverse sources and/or executing multi-step reasoning. The answers may be implicitly distributed across multiple texts. Due to variations in the linking and reasoning processes, we further divide this category into four subcategories: bridging questions, comparative questions, quantitative questions, and summarizing questions. Examples of each subcategory are illustrated in Figure 1. Specifically, bridging questions involve sequentially bridging multiple entities to derive the answer. Quantitative questions require statistical analysis based on the retrieved data. Comparative questions focus on comparing specified attributes of two entities. Summarizing questions require condensing or synthesizing information from multiple sources or large volumes of text into a concise, coherent summary, and they often involve integrating key points, identifying main themes, or drawing conclusions based on the aggregated content. Summarizing questions may combine elements of other question types, such as bridging, comparative, or quantitative questions, as they frequently require the extraction and integration of diverse pieces of information to generate a comprehensive and meaningful summary. Given these questions require multi-step retrieval and reasoning, it is crucial to establish a reasonable operation route for answer-seeking in interaction with the knowledge base.
Predictive Questions For this type of question, the answers are not directly available in the original text and may not be purely factual, necessitating inductive reasoning and prediction based on existing facts. To harness the predictive capabilities of LLMs or other external prediction tools, it is essential to gather and organize relevant knowledge to generate structured data for further analysis. For instance, as illustrated in Figure 1, all biosimilar products with their approval dates are retrieved, and the total number of approvals for each year is calculated and organized into year-indexed time-series data for prediction purposes. Furthermore, it is important to note that the correct answer to predictive questions may not be unique, reflecting the inherent uncertainty and variability of predictive tasks.
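The organizing step for the biosimilar example can be sketched as follows: retrieved approval dates are aggregated into a year-indexed series, to which a simple least-squares trend is fitted. The date values in the usage below are made up for illustration, and a real system would delegate forecasting to an LLM or a dedicated prediction tool rather than this toy linear fit.

```python
from collections import Counter

def to_yearly_series(approval_dates):
    """Aggregate ISO date strings into a year-indexed count series."""
    counts = Counter(d[:4] for d in approval_dates)
    years = sorted(counts)
    return years, [counts[y] for y in years]

def linear_forecast(years, counts):
    """Fit a least-squares line through the series and extrapolate
    one year beyond the last observed year."""
    xs = [int(y) for y in years]
    n = len(xs)
    mx, my = sum(xs) / n, sum(counts) / n
    slope = sum((x - mx) * (c - my) for x, c in zip(xs, counts)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope * (xs[-1] + 1) + intercept
```

Note that, as stated above, predictive answers need not be unique: a different model over the same series would yield a different, equally defensible forecast.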
Creative Questions One significant demand of RAG is to mine valuable domain-specific logic from professional knowledge bases and introduce novel perspectives that can innovate and advance existing solutions. Addressing creative questions necessitates creative thinking based on the availability of factual information and an understanding of the underlying principles and rules. As illustrated in the example, it is essential to organize the extracted information to highlight key stages and their duration, and then identify common patterns and influential factors. Subsequently, solutions are developed with the objective of evaluating potential outcomes and stimulating fresh ideas. The goal of these responses is to inspire experts to generate innovative ideas, rather than to provide ready-to-implement solutions.
It is crucial to recognize that the classification of a question may shift with changes in the knowledge base. Questions Q1, Q2, and Q3 in Figure 1, although seemingly similar, are categorized differently depending on the availability of information and the logical steps required to derive an answer. For instance, Q1 is classified as a factual question because it can be directly answered using a table that concisely lists all biosimilar products along with their respective approval dates, providing sufficient explicit information. In contrast, Q2, which inquires about the total count of interchangeable biosimilar products, cannot be resolved by directly referencing a single explicit source. To answer Q2, one must identify all the products meeting the specified criteria and subsequently calculate the total, necessitating an additional step of statistical aggregation. Therefore, Q2 is categorized as a linkable-reasoning question due to the need for intermediate processing. Finally, Q3 poses a challenge because the answer does not explicitly exist within the knowledge base. Addressing
Table 1: Level definition based on RAG system’s capability
| Level | System Capability |
|---|---|
| L1 | L1 systems aim to provide accurate and reliable answers to factual questions, ensuring a solid foundation of basic information retrieval. |
| L2 | L2 systems extend their functionality to include accurate and reliable responses to both factual and linkable-reasoning questions, supporting more complex multi-step retrieval and reasoning tasks. |
| L3 | L3 systems further enhance their capabilities with rational predictions for predictive questions, while maintaining accuracy and reliability on factual and linkable-reasoning questions. |
| L4 | L4 systems can propose well-reasoned plans or solutions for creative questions. In addition, they retain the ability to provide rational predictions for predictive questions, as well as accurate and reliable answers to factual and linkable-reasoning questions. |
Q3 requires gathering relevant data, organizing it to infer hidden patterns, and making predictions based on these inferred rules. As a result, Q3 is categorized as a predictive question, indicating the requirement to extrapolate beyond the existing data to forecast potential outcomes or trends.
3.3 RAG system levels
In industrial RAG systems, inquiries encompass a broad spectrum of difficulties and are approached from diverse perspectives. Although RAG systems can leverage the general question-answering (QA) abilities of LLMs, their limited comprehension of expert-level knowledge often leads to inconsistent response quality across questions of varying complexities. In response to this status quo, we propose categorizing RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions outlined in the previous subsection. This stratified approach facilitates the phased development of RAG systems, allowing capabilities to be incrementally enhanced through iterative module refinement and algorithmic optimization. Our framework is strategically designed to provide a standardized, objective methodology for developing RAG systems that effectively meet the specialized needs of various industry scenarios. The definitions of RAG systems at different levels are presented in Table 1. It highlights the systems’ capabilities to handle increasingly complex queries, demonstrating the evolution from simple information retrieval to advanced predictive and creative problem-solving. Each level represents a step towards more sophisticated interactions with knowledge bases, requiring the RAG systems to demonstrate higher levels of understanding, reasoning, and innovation.
More specifically, at the foundational level, RAG systems respond to factual questions with answers that are directly extractable from provided texts. Advancing to the second level, RAG systems are equipped to handle complex questions involving linkage and reasoning. These queries necessitate the synthesis of information from disparate sources or multi-step reasoning processes. The RAG system can address a variety of composite questions, including bridging questions that necessitate a sequence of logical reasoning, comparative questions demanding parallel analysis, quantitative questions requiring statistical aggregation, and summarizing questions that involve condensing information into comprehensive responses. At the third level, the systems are intricately designed to tackle predictive questions where answers are not immediately discernible from the original text. Finally, RAG systems at the fourth level demonstrate the capacity for creative problem-solving, utilizing a solid factual base to foster novel concepts or strategies. While these systems may not offer ready-to-implement solutions, they play a crucial role in stimulating expert creativity to advance fields such as analytics or treatment design.
4 Methodology
4.1 Framework
Based on the formulation of RAG systems in terms of knowledge base, task classification, and system-level division, we propose a versatile and expandable RAG framework. Within this framework, the progression in levels of RAG systems can be achieved by adjusting submodules within the main modules. The overview of our framework is depicted in Figure 2. The framework primarily consists of several fundamental modules, including file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. In this framework, domain-specific documents of diverse formats are processed by the file parsing module to convert the files into machine-readable formats, and file units are generated to build up the graph in the information source layer. The knowledge extraction module chunks the text and generates corpus and knowledge units to construct the graphs in the corpus layer and the distilled knowledge layer. The established heterogeneous graph is utilized as the knowledge base for retrieval. Extracted knowledge is stored in multiple structured formats, and the knowledge retrieval module employs a hybrid retrieval strategy to access relevant information. Note that the knowledge base not only serves as the source of knowledge gathering but also benefits from a feedback loop, where the organized and verified knowledge is regarded as feedback to refine and improve the knowledge base.

Figure 2: Overview of the PIKE-RAG framework, comprising several key components: file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, task decomposition and coordination, and knowledge-centric reasoning. Each component can be tailored to meet the evolving demands of system capability.
As highlighted in the task classification examples, questions of different classes require distinct rationale routing for answer-seeking, influenced by multiple factors such as the availability of relevant information, the complexity of knowledge extraction, and the sophistication of reasoning. It is challenging to address these questions in a single retrieval and generation pass. To tackle this, we propose an iterative retrieval-generation mechanism supervised by task decomposition and coordination. This iterative mechanism enables the gradual collection of relevant information and progressive reasoning over incremental context, ensuring a more accurate and comprehensive response. More specifically, questions in industrial applications are fed into the task decomposition module to produce a preliminary decomposition scheme. This scheme outlines the retrieval steps, reasoning steps, and other necessary operations. Following these instructions, the knowledge retrieval module retrieves relevant information, which is then passed to the knowledge organization module for processing and organization. The organized knowledge is used to perform knowledge-centric reasoning, yielding an intermediate answer. With the updated relevant information and intermediate answer, the task decomposition module regenerates an updated scheme for the next iteration. This design boasts excellent adaptability, allowing us to tackle problems of varying difficulties and perspectives by adjusting the modules and iterative mechanisms.
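The iterative loop described above can be sketched as follows. The knowledge base, the fixed decomposition plan, and the module stubs are invented stand-ins for the real retrieval, organization, and reasoning components, which would be backed by indexes and LLM calls.

```python
# Toy knowledge base and a fixed decomposition plan for one bridging
# question; all names here are invented stand-ins for real modules.
KB = {"Who founded X?": "Alice", "Where was Alice born?": "Berlin"}
PLAN = {"Where was the founder of X born?":
        ["Who founded X?", "Where was Alice born?"]}

def decompose(question, knowledge, partial):
    # Task decomposition & coordination: regenerate the scheme from the
    # original query plus accumulated knowledge; None means "done".
    pending = [q for q in PLAN[question] if q not in knowledge]
    return pending[0] if pending else None

def retrieve(sub_query):                        # knowledge retrieval
    return KB.get(sub_query, "")

def organize(knowledge):                        # knowledge organization
    return "; ".join(f"{q} -> {a}" for q, a in knowledge.items())

def reason(question, organized):                # knowledge-centric reasoning
    return f"Answer to '{question}' given [{organized}]"

def answer(question, max_iters=5):
    knowledge, partial = {}, None
    for _ in range(max_iters):
        sub_q = decompose(question, knowledge, partial)
        if sub_q is None:
            break
        knowledge[sub_q] = retrieve(sub_q)
        partial = reason(question, organize(knowledge))  # intermediate answer
    return partial

print(answer("Where was the founder of X born?"))
```

Each pass through the loop corresponds to one retrieval-generation iteration: the decomposition module sees the knowledge gathered so far and emits the next sub-query until the plan is exhausted.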
Table 2: Proposed frameworks for different system levels. To address the challenges faced at each level, we propose customized frameworks based on the framework illustrated in Figure 2. The following abbreviations are used: "PA" for file parsing, "KE" for knowledge extraction, "RT" for knowledge retrieval, "KO" for knowledge organization, and "KR" for knowledge-centric reasoning.
| Level | Challenges | Proposed Framework |
|---|---|---|
| L0 | • Knowledge extraction is challenged by the diverse formats of source documents, requiring sophisticated file parsing techniques. • Building a high-quality knowledge base from raw, heterogeneous data introduces significant complexity in knowledge organization and integration. | PA, KE |
| L1 | • Improper chunking hinders knowledge understanding and extraction, breaking semantic coherence and complicating accurate retrieval. • Knowledge retrieval suffers from the embedding models' limitations in aligning professional terminology and aliases, reducing system precision. | PA, KE, RT, KO, KR |
| L2 | • Effective knowledge extraction and utilization are critical, since chunked text often mixes relevant and irrelevant information; ensuring the retrieval of high-quality data is essential for accurate generation. • Task understanding and decomposition, and the rationale behind them, often ignore the availability of supporting data and over-rely on LLM capabilities. | PA, KR, Task Decomp. & Coord. |
| L3 | • Challenges at this level center on knowledge gathering and organization, which are critical to supporting predictive reasoning. • LLMs are limited in applying specialized reasoning logic, constraining their effectiveness in predictive tasks. | PA, Task Decomp. & Coord. |
| L4 | • The difficulty lies in extracting coherent logical rationale from a complex knowledge base, where interdependencies among multiple factors may lead to non-unique solutions. • The open-ended nature of creative questions complicates evaluation of the reasoning and knowledge-synthesis processes, making answer quality hard to assess quantitatively. | PA, RT, Multi-agent Plan., Task Decomp. & Coord. |
4.2 Phased system development
We have categorized RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions, as outlined in Table 1. Recognizing the pivotal role of knowledge base generation in RAG systems, we designate the construction of the knowledge base as the L0 stage of system development. The challenges faced by RAG systems vary across different levels. We analyze these challenges for each level and propose corresponding frameworks in Table 2. This stratified approach facilitates the phased development of RAG systems, enabling incremental enhancement of capabilities through iterative module refinement and algorithmic optimization.
We observe that from L0 to L4, higher-level systems can inherit modules from lower levels and add new modules to enhance system capabilities. For instance, compared to an L1 system, an L2 system not only introduces a task decomposition and coordination module to leverage iterative retrieval-generation routing but also incorporates more advanced knowledge extraction modules, such as distilled knowledge generation, indicated in dark green in Figure 2. In the L3 system, the growing emphasis on predictive questioning necessitates enhanced requirements for knowledge organization and reasoning. Consequently, the knowledge organization module introduces additional submodules for knowledge structuring and knowledge induction, indicated in dark orange. Similarly, the knowledge-centric reasoning module has been expanded to include a forecasting submodule, highlighted in dark purple. In the L4 system, extracting complex rationale from an established knowledge base is highly challenging. To address this, we introduce a multi-agent planning module to activate reasoning from diverse perspectives.

Figure 3: Multi-layer heterogeneous graph as the knowledge base. The graph comprises three distinct layers: information resource layer, corpus layer and distilled knowledge layer.
5 Detailed Implementation
In this section, we delve into the implementation specifics of each module within our proposed versatile and expandable RAG framework. By elucidating the details at each level, we aim to provide a comprehensive understanding of how the framework operates and how its modularity and expandability are achieved. The subsections that follow will cover the file parsing, knowledge extraction, knowledge storage, knowledge-centric reasoning, and task decomposition and coordination modules, providing insights into their individual functionalities and interactions.
5.1 Level-0: Knowledge Base Construction
The foundational stage of the proposed RAG systems is designated as the L0 system, which focuses on the construction of a robust and comprehensive knowledge base. This stage is critical for enabling effective knowledge retrieval in subsequent levels. The primary objective of the L0 system is to process and structure domain-specific documents, transforming them into a machine-readable format and organizing the extracted knowledge into a heterogeneous graph. This graph serves as the backbone for all higher-level reasoning and retrieval tasks. The L0 system encompasses several key modules: file parsing, knowledge extraction, and knowledge storage. Each of these modules plays a crucial role in ensuring that the knowledge base is both extensive and accurately reflects the underlying information contained within the source documents.
5.1.1 File parsing
The ability to effectively parse and read various types of files is a critical component in the development of RAG systems that rely on diverse data sources. Frameworks such as LangChain provide a comprehensive suite of tools for natural language processing (NLP), including modules for parsing and extracting information from unstructured text documents. Its file reader capabilities are designed to handle a wide range of file formats, ensuring that data from heterogeneous sources can be seamlessly integrated into the system. Additionally, several deep learning-based tools [2, 3] and commercial cloud APIs [1, 4] have been developed to conduct robust Optical Character Recognition (OCR) and accurate table extraction, enabling the conversion of scanned documents and images into structured, machine-readable text. Given that domain-specific files often encompass sophisticated tables, charts, and figures, text-based conversion may lead to information loss and disrupt the inherent logical structure. Therefore, we propose conducting layout analysis for these files and preserving multi-modal elements such as charts and figures. The layout information can aid the chunking operation, maintaining the completeness of chunked text, while figures and charts can be described


Figure 4: The process of distilling knowledge from corpus text. The corpus text is processed to extract knowledge units following customized extraction patterns. These knowledge units are then organized into structured knowledge in the distilled knowledge layer, which may take the form of knowledge graphs, atomic knowledge, tabular knowledge, and other induced knowledge.
by Vision-Language Models (VLMs) to assist in knowledge retrieval. This approach ensures that the integrity and richness of the original documents are retained, enhancing the efficacy of RAG systems.
5.1.2 Knowledge Organization
The proposed knowledge base is structured as a multi-layer heterogeneous graph, representing different levels of information granularity and abstraction. The graph captures relationships between various components of the data (e.g., documents, sections, chunks, figures, and tables) and organizes them into nodes and edges, reflecting their interconnections and dependencies. As depicted in Figure 3, this multi-layer structure, encompassing the information resource layer, corpus layer, and distilled knowledge layer, enables both semantic understanding and rationale-based retrieval for downstream tasks.
Information Resource Layer: This layer captures the diverse information sources, treating them as source nodes with edges that denote referential relationships among them. This structure aids in cross-referencing and contextualizing the knowledge, establishing a foundation for reasoning that depends on multiple sources.
Corpus Layer: This layer organizes the parsed information into sections and chunks while preserving the document’s original hierarchical structure. Multi-modal content such as tables and figures is summarized by LLMs and integrated as chunk nodes, ensuring that multi-modal knowledge is available for retrieval. This layer enables knowledge extraction with varying levels of granularity, allowing for accurate semantic chunking and retrieval across diverse content types.
Distilled Knowledge Layer: The corpus is further distilled into structured forms of knowledge (e.g., knowledge graphs, atomic knowledge, and tabular knowledge). This process, driven by techniques like Named Entity Recognition (NER) [19] and relationship extraction [40], ensures that the distilled knowledge captures key logical relationships and entities, supporting advanced reasoning processes. By organizing this structured knowledge in a distilled layer, we enhance the system’s ability to reason and synthesize based on deeper domain-specific knowledge. The knowledge distillation process is depicted in Figure 4. Below are the detailed distillation processes for typical knowledge forms.
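A minimal sketch of such a multi-layer graph, using plain dictionaries and invented node ids, shows how cross-layer edges let a distilled triple be traced back through its corpus chunk to the source document:

```python
# A toy three-layer heterogeneous graph with plain dicts. Node ids are
# prefixed by layer; the schema and ids are invented for illustration.
nodes = {
    "doc:guideline.pdf": {"layer": "information_source"},
    "chunk:sec2.1": {"layer": "corpus",
                     "text": "Product A was approved in 2019 ..."},
    "kn:(ProductA, approved_in, 2019)": {"layer": "distilled_knowledge"},
}
edges = [
    ("doc:guideline.pdf", "chunk:sec2.1", "contains"),
    ("chunk:sec2.1", "kn:(ProductA, approved_in, 2019)", "distilled_to"),
]

def neighbors(node, edge_type=None):
    """Follow edges in either direction, optionally filtered by type."""
    return [u if v == node else v
            for u, v, t in edges
            if node in (u, v) and (edge_type is None or t == edge_type)]

# Cross-layer traversal: from a distilled triple back to its source file.
chunk = neighbors("kn:(ProductA, approved_in, 2019)", "distilled_to")[0]
doc = neighbors(chunk, "contains")[0]
print(doc)   # doc:guideline.pdf
```

In a real deployment the same structure would be held in a graph store with embeddings attached to each node, but the layering and cross-layer edges are the essential idea.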

Figure 5: Illustration of enhanced chunking with recurrent text splitting.
5.2 Level-1: Factual Question focused RAG System
Building upon the L0 system, the L1 system introduces knowledge retrieval and knowledge organization to realize its retrieval and generation capabilities. The primary challenges at this level are semantic alignment and chunking. The abundance of professional terminology and aliases can affect the accuracy of chunk retrieval, and unreasonable chunking can disrupt semantic coherence and introduce noise interference. To mitigate these issues, the L1 system incorporates more sophisticated query analysis techniques and basic knowledge extraction modules. The architecture is expanded to include components that facilitate task decomposition, coordination, and initial stages of knowledge organization (KO), ensuring that the system can manage more complex queries effectively.

Figure 6: Overview of the L1 RAG framework. The highlighted squares indicate the enhanced chunking and auto-tagging sub-modules in the knowledge extraction module.
5.2.1 Enhanced chunking
Chunking involves breaking down a large corpus of text into smaller, more manageable segments. The primary chunking strategies commonly utilized in RAG systems include fixed-size chunking, semantic chunking, and hybrid chunking. Chunking is essential for improving both the efficiency and accuracy of the retrieval process, which consequently affects the overall performance of RAG models in multiple dimensions. In our system, each chunk serves dual purposes: (i) it becomes a unit of information that is vectorized and stored in a database for retrieval, and (ii) it acts as a source for further knowledge extraction and information summarization. Improper chunking not only fails to ensure that text vectors encapsulate the necessary semantic information, but also hinders knowledge extraction based on complete context. For instance, in the context of laws and regulations, a fixed-size chunking approach is prone to destroying text semantics and omitting key conditions, thereby affecting the quality and accuracy of subsequent knowledge extraction.
We propose a text split algorithm to enhance existing chunking methods by breaking down large text documents into smaller, manageable chunks while preserving context and enabling effective summary generation for each chunk. The chunking process is illustrated in Figure 5. Given a source text, the algorithm iteratively splits the text into chunks. During the first iteration, it generates a forward summary of the initial chunk, providing context for generating summaries of subsequent chunks and maintaining a coherent narrative across splits. Each chunk is summarized using a predefined prompt template that incorporates both the forward summary and the current chunk. This summary is then stored alongside the chunk. The algorithm adjusts the text by removing the processed chunk and updating the forward summary with the summary of the current chunk, preparing for the next iteration. This process continues until the entire text is split and summarized. Additionally, the algorithm can dynamically adjust chunk sizes based on the content and structure of the text.
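A simplified sketch of this recurrent splitting loop is shown below; the `summarize` function is a trivial stand-in for the LLM call with the prompt template described above, and the fixed chunk size replaces the dynamic adjustment.

```python
def summarize(forward_summary, chunk):
    # Stand-in for an LLM call using a prompt template that combines the
    # forward summary with the current chunk (here: first sentence only).
    first = chunk.split(".")[0].strip()
    return (forward_summary + " | " + first) if forward_summary else first

def split_with_forward_summaries(text, chunk_size=60):
    """Iteratively split text into chunks, carrying a forward summary."""
    chunks, forward = [], ""
    while text:
        chunk, text = text[:chunk_size], text[chunk_size:]  # remove processed chunk
        summary = summarize(forward, chunk)   # summary aware of prior context
        chunks.append({"text": chunk, "summary": summary})
        forward = summary                     # propagate context forward
    return chunks

doc = ("Article 1 defines the scope of the regulation. "
       "Article 2 lists the exemptions that apply to small entities.")
parts = split_with_forward_summaries(doc, chunk_size=60)
```

Because each stored summary embeds the forward summary of everything before it, a retrieved chunk carries a compressed narrative of its preceding context.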

Figure 7: Illustration of the auto-tagging module.
5.2.2 Auto-tagging
In domain-specific RAG scenarios, the corpus is typically characterized by formal, professional, and rigorously expressed content, whereas the questions posed are often articulated in plain, easily understandable colloquial language. For instance, in medical question-answering (medQA) tasks [32], symptoms of diseases described in the questions are generally phrased in simple, conversational terms. In contrast, the corresponding medical knowledge within the corpus is often expressed using specialized professional terminology. This discrepancy introduces a domain gap that adversely affects the accuracy of chunk retrieval, especially given the limitations of the embedding models employed for this purpose.
To address the domain gap issue, we propose an auto-tagging module designed to minimize the disparity between the source documents and the queries. This module preprocesses the corpus to extract a comprehensive collection of domain-specific tags or to establish tag mapping rules. Prior to the retrieval process, tags are extracted from the query and then mapped to the corpus domain using the preprocessed tag collection or tag-pair collection. This tag-based domain adaptation can be employed for query rewriting or keyword retrieval within sequential information retrieval frameworks, thereby enhancing both the recall and precision of the retrieval process.
Specifically, we leverage the capabilities of the LLMs to identify key factors within the corpus chunks, summarize these factors, and generalize them into category names, which we refer to as "tag classes." We generate semantic tag extraction prompts based on these tag classes to facilitate accurate tag extraction. In scenarios where only the corpus is available, LLMs are employed with meticulously designed prompts to extract semantic tags from the corpus, thereby forming a comprehensive corpus tag collection. When practical QA samples are available, semantic tag extraction is performed on both the queries and the corresponding retrieved answer chunks. Using the tag sets extracted from the chunks and queries, LLMs are utilized to map cross-domain semantic tags and generate a tag pair collection. After establishing both the corpus tag collection and the tag pair collection, tags can be extracted from the query, and the corresponding mapped tags can be identified within the collections. These mapped tags are then used to enhance subsequent information retrieval processes, improving both recall and precision. This workflow leverages the advanced understanding and contextual capabilities of LLMs for domain adaptation.
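The tag extraction and mapping steps can be sketched as follows; the tag-pair collection and the substring-matching extractor are toy stand-ins for the LLM-driven components described above.

```python
# Toy tag-pair collection mapping colloquial query tags to corpus
# terminology; in practice these pairs are mined by LLMs from the corpus
# and QA samples, so everything below is an invented stand-in.
TAG_PAIRS = {
    "belly pain": "abdominal pain",
    "high blood pressure": "hypertension",
}

def extract_tags(query, known_tags):
    # Stand-in for LLM-based semantic tag extraction.
    return [t for t in known_tags if t in query.lower()]

def map_to_corpus_domain(query):
    """Map query-side tags to corpus-side tags before retrieval."""
    return [TAG_PAIRS[t] for t in extract_tags(query, TAG_PAIRS)]

print(map_to_corpus_domain("I have belly pain after meals"))
# ['abdominal pain']
```

The mapped tags would then be injected into query rewriting or keyword retrieval, narrowing the gap between colloquial queries and the corpus's professional terminology.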

Figure 8: Overview of multi-layer, multi-granularity retrieval over heterogeneous graph
5.2.3 Multi-Granularity Retrieval
The L1 system is designed to enable multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph, which was constructed in the L0 system. Each layer of the graph (e.g., information source layer, corpus layer, distilled knowledge layer) represents knowledge at different levels of abstraction and granularity, allowing the system to explore and retrieve relevant information at various scales. For example, queries can be mapped to entire documents (information source layer) or specific chunks of text (corpus layer), ensuring that knowledge can be retrieved at the appropriate level for a given task. To support this, similarity scores between queries and graph nodes are computed to measure the alignment between the query and the retrieved knowledge. These scores are then propagated through the layers of the graph, allowing the system to aggregate information from multiple levels. This multi-layer propagation ensures that retrieval can be fine-tuned based on both the broader context (e.g., entire documents) and finer details (e.g., specific chunks or distilled knowledge). The final similarity score is generated through a combination of aggregation and propagation, ensuring that knowledge extraction and utilization are optimized for both precision and efficiency in factual question answering. The retrieval process can be iterative, refining the results based on sub-queries generated through task decomposition, further enhancing the system’s ability to generate accurate and contextually relevant answers.
The overview of multi-layer, multi-granularity retrieval is depicted in Figure 8. For each layer of the graph, both the query $Q$ and the graph nodes are transformed into high-dimensional vector embeddings for similarity evaluation. We denote the similarity evaluation operation as $g(\cdot)$. Here, $I$, $C$, and $D$ indicate the node sets in the information source layer, corpus layer, and distilled knowledge layer, respectively. The propagation and aggregation operations are represented by the function $f(\cdot)$. The final chunk similarity score $S$ is obtained by aggregating the scores from the other layers and nodes, i.e., $S = f\big(g(Q, I), g(Q, C), g(Q, D)\big)$.
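Under the notation above, a toy instantiation with cosine similarity as $g(\cdot)$ and a weighted sum as $f(\cdot)$ might look like this; the 2-d embeddings and the layer weights are arbitrary illustrative values, not the paper's actual choices.

```python
import math

def g(q_vec, n_vec):
    """Similarity evaluation g(.): cosine similarity between embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, n_vec))
    norm_q = math.sqrt(sum(a * a for a in q_vec))
    norm_n = math.sqrt(sum(b * b for b in n_vec))
    return dot / (norm_q * norm_n)

def f(layer_scores, weights):
    """Propagation/aggregation f(.): weighted sum of per-layer scores."""
    return sum(w * s for w, s in zip(weights, layer_scores))

# One chunk node (C) linked to a source node (I) and a distilled node (D);
# toy 2-d embeddings stand in for high-dimensional ones.
Q = [1.0, 0.0]
I_node, C_node, D_node = [1.0, 1.0], [1.0, 0.2], [0.9, 0.1]
S = f([g(Q, I_node), g(Q, C_node), g(Q, D_node)], weights=[0.2, 0.5, 0.3])
print(round(S, 3))   # 0.93
```

The chunk's own score dominates, while the scores propagated from its source document and distilled knowledge adjust the final ranking.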
5.3 Level-2: Linkable-Reasoning Question focused RAG System
The core functionality of the L2 system lies in its ability to efficiently retrieve multiple sources of relevant information and perform complex reasoning based on it. To facilitate this, the L2 system integrates an advanced knowledge extraction module that comprehensively identifies and extracts pertinent information. Furthermore, a task decomposition and coordination module is implemented to break down intricate tasks into smaller, manageable sub-tasks, thereby enhancing the system’s efficiency in handling them. The proposed framework of L2 RAG system is illustrated in Figure 9.
Chunked text contains multifaceted information, increasing the complexity of retrieval. Recent studies have focused on extracting triple knowledge units from chunked text and constructing knowledge graphs to facilitate efficient information retrieval [20, 42]. However, the construction of knowledge graphs is costly, and the inherent knowledge may not always be fully explored. To better present the knowledge embedded in the documents, we propose atomizing the original documents in the Knowledge Extraction phase, a process we refer to as Knowledge Atomizing. Besides, industrial tasks often necessitate multiple pieces of knowledge, implicitly requiring the capability to decompose the original question into several sequential or parallel atomic questions. We refer to this operation as Task Decomposition. By combining the extracted atomic knowledge with the original chunks, we construct an atomic hierarchical knowledge base. Each time we decompose a task, the hierarchical knowledge base provides insights into the available knowledge, enabling knowledge-aware task decomposition.
5.3.1 Knowledge Atomizing
We believe that a single document chunk often encompasses multiple pieces of knowledge. Typically, the information necessary to address a specific task represents only a subset of the entire knowledge. Therefore, consolidating these pieces within a single chunk, as traditionally done in information retrieval, may not facilitate the efficient retrieval of the precise information required. To align the granularity of knowledge with the queries generated during task solving, we propose a method called knowledge atomizing. This approach leverages the context understanding and content generation capabilities of LLMs to automatically tag atomic knowledge pieces within each document chunk. Note that these chunks could be segments of an original reference document, description chunks generated for tables, images, or videos, or summary chunks of entire sections, chapters, or even documents.
Atomic knowledge can be presented in various forms. Instead of utilizing declarative sentences or subject-relationship-object tuples, we propose using questions as knowledge indexes to further bridge the gap between stored knowledge and queries. Unlike the semantic tagging process, in the knowledge atomizing process we input the document chunk to the LLM as context and ask it to generate as many relevant questions as possible that can be answered by the given chunk. These generated atomic questions are saved as atomic question tags together with the given chunk. An example of knowledge atomizing is demonstrated in Figure 10(c), where the atomic questions encapsulate various aspects of the knowledge contained within the chunk. A hierarchical knowledge base can accommodate queries of varying granularity. Figure 11 illustrates the retrieval process from an atomic knowledge base comprising chunks and atomic questions. Queries can directly retrieve reference chunks as usual. Additionally, since each chunk is tagged with multiple atomic questions, an atomic query can be used to locate relevant atomic questions, which then lead to the associated reference chunks.
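As a sketch, the atomizing step might be implemented as a single prompt per chunk. Here `llm_complete` is a hypothetical callable wrapping whatever completion API is in use, and the line-based parsing assumes the model follows the one-question-per-line instruction.

```python
ATOMIZE_PROMPT = """Given the following document chunk, generate as many
questions as possible that can be answered solely from the chunk.
Return one question per line.

Chunk:
{chunk}
"""

def atomize_chunk(chunk, llm_complete):
    # `llm_complete` is any callable mapping a prompt string to a
    # completion string (hypothetical wrapper around a chat API).
    raw = llm_complete(ATOMIZE_PROMPT.format(chunk=chunk))
    # Strip list markers like "1. " or "- " from each generated line.
    questions = [line.strip(" -0123456789.") for line in raw.splitlines()
                 if line.strip()]
    # Each atomic question is stored as an index pointing back to its chunk.
    return [{"atomic_question": q, "source_chunk": chunk} for q in questions]
```

Storing each question alongside its source chunk yields exactly the two retrieval paths of Figure 11: queries can match chunks directly, or match atomic questions and follow them back to their chunks.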
5.3.2 Knowledge-Aware Task Decomposition
For a specific task, multiple decomposition strategies might be applicable. Consider Q2 in Figure 1 as an example. The two-step analytical reasoning process depicted may be effective if an interchangeable biosimilar products list is available. However, if only a general list of biosimilar products exists, with attributes dispersed throughout multiple documents, a different decomposition strategy may be necessary: (1) Retrieve the biosimilar product list; (2) Determine whether each product is interchangeable; (3) Count the total number of interchangeable products. The critical factor in selecting the most effective decomposition approach lies in understanding the contents of the specialized knowledge base. Motivated by this, we design the Knowledge-Aware Task Decomposition workflow, which is illustrated in Figure 10(a). The complete algorithm for task solving using Knowledge-Aware Task Decomposition is presented in Algorithm 1.
The reference context $\mathcal{C}_{0}$ is initialized as an empty set, and the original question is denoted by $q$. As illustrated in the for-loop starting at line 2 of the algorithm, in the $t$-th iteration, we use an LLM, denoted by $\mathcal{LLM}$, to generate query proposals potentially useful for task completion, denoted as $\hat{q}_{i}^{t}$.

Figure 10: The illustration of knowledge atomizing and knowledge-aware task decomposition: (a) Workflow of task solving with knowledge-aware task decomposition, (b) Workflow of knowledge atomizing, (c) Example of knowledge atomizing, (d) RAG case with knowledge atomizing and knowledge-aware task decomposition.
In this step, the chosen reference chunks $\mathcal{C}_{t}$ are provided as context to avoid generating proposals linked to already known knowledge. These proposals are then utilized as atomic queries to determine whether relevant knowledge exists within the knowledge base, denoted as $\mathcal{KB}$. For each atomic question proposal, we retrieve its relevant atomic question candidates along with their source chunks $\{(q_{ij}^{t}, c_{ij}^{t})\}$ from $\mathcal{KB}$. Any score metric $\mathrm{sim}$ can be used to retrieve atomic questions; in our experiments, we use the cosine similarity of their corresponding embeddings to retrieve the top $K$ atomic questions whose similarity to a proposed atomic question is greater than or equal to a given threshold $\delta$. Given the original question $q$, the accumulated context $\mathcal{C}_{t}$, and the list of retrieved atomic questions $q_{ij}^{t}$, the $\mathcal{LLM}$ selects the most useful atomic question $q^{t}$ from $q_{ij}^{t}$ and retrieves the relevant chunk $c^{t}$. This retrieved chunk is aggregated into the reference context $\mathcal{C}_{t}$ for the next round of decomposition. Knowledge-aware decomposition can iterate up to $N$ times, where $N$ is a hyperparameter set to control computational cost. The iteration process can be terminated early if there are no high-quality question proposals, no highly relevant atomic candidates retrieved, no suitable atomic knowledge selections, or if the $\mathcal{LLM}$ determines that the acquired knowledge is sufficient to complete the task. Finally, the accumulated context $\mathcal{C}_{t}$ is utilized to generate the answer $\hat{a}$ for the given question $q$ in line 14.
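A minimal sketch of this task-solving loop (Algorithm 1) is given below. The `propose`, `select`, `answer`, and `is_done` callables stand in for the LLM calls, `embed` maps text to a vector, and `kb` is an assumed flat list of (atomic question, chunk, embedding) triples; a real implementation would use a vector index.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve(q, kb, embed, propose, select, answer, is_done,
          K=5, delta=0.5, N=5):
    """Sketch of knowledge-aware task decomposition (Algorithm 1)."""
    context = []                                  # C_0 = empty set
    for _ in range(N):                            # at most N iterations
        proposals = propose(q, context)           # atomic query proposals
        if not proposals or is_done(q, context):
            break                                 # early termination
        candidates = []
        for p in proposals:
            p_emb = embed(p)
            scored = [(cos(p_emb, e), aq, c) for aq, c, e in kb]
            # keep top-K atomic questions with similarity >= delta
            scored = sorted((s for s in scored if s[0] >= delta),
                            reverse=True)[:K]
            candidates.extend(scored)
        if not candidates:
            break                                 # nothing relevant retrieved
        chosen_chunk = select(q, context, candidates)  # LLM picks q^t -> c^t
        if chosen_chunk is None or chosen_chunk in context:
            break
        context.append(chosen_chunk)              # aggregate into C_t
    return answer(q, context)                     # final answer a-hat
```

The early-exit conditions mirror those listed above: no proposals, no candidates above threshold, no useful selection, or the LLM deciding the context is sufficient.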

Figure 11: Retrieval process from an atomic knowledge base. It supports two retrieval paths: (a) using queries to directly retrieve chunks as usual; (b) locating atomic nodes first, then retrieving the associated chunks.
Algorithm 1 Task Solving with Knowledge-Aware Decomposition
5.3.3 Knowledge-Aware Task Decomposer Training
It is worth mentioning that the knowledge-aware decomposer can be a learnable component. Once trained, this proposer can directly suggest atomic queries $q^{t}$ during inference, which means lines 3 to 5 in Algorithm 1 can be replaced by a single call to the learned proposer, thereby reducing both inference time and computational cost. To train the knowledge-aware decomposer, we collect data about the rationale behind each step by sampling context and creating diverse interaction trajectories. With this data collected, we train a decomposer that can incorporate domain-specific rationale into the task decomposition and result-seeking process.
The data collection process, depicted in Figure 12 and Algorithm 2, uses a dual-dictionary system for managing and tracking information: dictionary $\boldsymbol{S}$ maintains comprehensive score records, and dictionary $\mathcal{V}$ tracks the visit frequencies of candidate chunks. During the initialization phase, all scores are set to zero and all visit counters to one, establishing a baseline for dynamic updates throughout the subsequent processing stages.
In each iteration of the decomposition process, the system retrieves the top $K^{\prime}$ chunks most relevant to the current atomic question. These chunks must satisfy the similarity threshold criterion (specifically, similarity exceeding $\delta^{\prime}$, where $\delta^{\prime}<\delta$), with $K^{\prime}$ intentionally configured to be larger than $K$ to ensure comprehensive coverage. Following this initial retrieval, the data chunks corresponding to the top $K$ most relevant retrieved atomic pairs are integrated into the context. The retrieved chunks that do not make it into the top-$K$ selection are incorporated into $\boldsymbol{S}$, and their scores are updated based on the computed relevance metrics.
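The per-iteration bookkeeping over $\boldsymbol{S}$ and $\mathcal{V}$ might look like the following sketch. The keep-the-best-score update rule is an illustrative stand-in for the paper's relevance update, and `retrieved` is assumed to already be filtered to similarity above $\delta^{\prime}$ and truncated to $K^{\prime}$ entries.

```python
def update_pools(retrieved, S, V, K):
    """One data-collection step.

    `retrieved` is a list of (score, chunk) pairs with similarity above
    delta' (delta' < delta), already truncated to the K' most relevant
    (K' > K). The top-K chunks go into the context; the remainder are
    folded into the score dictionary S, with V tracking how often each
    candidate chunk has been seen.
    """
    retrieved = sorted(retrieved, reverse=True)
    context_chunks = [c for _, c in retrieved[:K]]
    for score, chunk in retrieved[K:]:
        # illustrative update rule: keep the best relevance seen so far
        S[chunk] = max(S.get(chunk, 0.0), score)
        V[chunk] = V.get(chunk, 1) + 1
    return context_chunks
```

The chunks accumulated in $\boldsymbol{S}$ form the sampling pool that the UCB step below draws from.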

Figure 12: Data collection process for decomposer training, comprising four main components: a) sampling data chunks from the context sampling pool to serve as the reference context for question decomposition, b) saving the generated atomic query proposals, c) after retrieval and selection, saving the chosen atomic query proposals as part of the reasoning trajectories, d) evaluating the answer to generate a score.

Figure 13: An example of context sampling and an illustration of decomposer training with collected data.
To ensure comprehensive exploration of the solution space, we implement a sampling mechanism that selects additional chunks from $\boldsymbol{S}$ when available, incorporating them into the reference context. Our implementation leverages the Upper Confidence Bound (UCB) algorithm [8] for context sampling, establishing a balanced approach between exploitation and exploration. The exploitation component manifests through the retriever-selected chunks, focusing on options with the currently highest estimated rewards to optimize immediate performance gains. Conversely, the exploration component is fulfilled through context sampling from $\boldsymbol{S}$, enabling systematic investigation of less-certain options to accumulate valuable data and potentially uncover superior long-term alternatives.
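The UCB selection over the score pool can be sketched as follows; the exploration constant `c` and the use of stored scores as reward estimates are assumptions for illustration.

```python
import math

def ucb_sample(S, V, total_visits, c=1.4):
    """Pick an extra chunk from the score pool S via Upper Confidence
    Bound: exploit chunks with high stored scores, explore chunks that
    have rarely been visited (low counts in V)."""
    if not S:
        return None

    def ucb(chunk):
        # exploitation term (stored score) + exploration bonus
        return S[chunk] + c * math.sqrt(math.log(total_visits) / V[chunk])

    return max(S, key=ucb)
```

With equal visit counts the highest-scored chunk wins (pure exploitation); a rarely visited chunk can win despite a low score because its exploration bonus dominates.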
This meticulously crafted strategy serves a dual purpose: it not only facilitates the generation of diverse and comprehensive atomic query proposals but also enables systematic exploration of multiple potential reasoning pathways. Through this sophisticated approach, we progressively work toward deriving optimal final answers while maintaining a balance between immediate performance optimization and long-term discovery of potentially superior solutions.
We record atomic proposals (AP), interactive trajectories, and answer scores to support decomposer training. For each specialized domain, interactive trajectories featuring distinct reasoning paths are gathered, allowing us to use the answer score as a supervised signal to train the decomposer. The decomposer training process is depicted in Figure 13. By incorporating preferences in the form of answer scores, the decomposer training can capture domain-specific decomposition rules, thereby adapting the decomposer to meet domain requirements.
Looking ahead, there are several promising avenues for implementing and enhancing our proposed decomposer. We could leverage well-established algorithms such as supervised fine-tuning (SFT) and direct preference optimization (DPO) [45] to train an effective decomposer based on existing LLMs. The practical implementation and performance evaluation of this comprehensive procedure, including detailed empirical analysis and comparative studies, will be addressed in future research work to thoroughly demonstrate its effectiveness and potential applications.
5.4 Level-3: Predictive Question focused RAG System
In the L3 system, there is an increased emphasis on knowledge-based prediction capability, which necessitates effective knowledge collection, organization, and the construction of forecasting rationale. To address this, we leverage the task decomposition and coordination module to build forecasting rationale based on the organized knowledge, which is collected and organized from the retrieved knowledge. The framework of the L3 system is illustrated in Figure 14. To ensure the retrieved knowledge is well-prepared for advanced analysis and forecasting, the knowledge organization module is enhanced with specialized submodules dedicated to the structuring and organization of knowledge. These submodules streamline the process of transforming raw retrieved knowledge into a structured, coherent format, optimizing it for subsequent reasoning and predictive tasks. For example, in the FDA scenario referenced in Figure 1, data from multiple sources—such as medicine labels, clinical trials, and application forms—are integrated into the multi-layer knowledge base. The knowledge structuring submodule follows the instruction from the task decomposition module to collect and organize the relevant knowledge (e.g., medicine names with their approval dates) retrieved from the knowledge base. The knowledge induction submodule further categorizes this structured knowledge, such as by approval date, to facilitate further statistical analysis and prediction.
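For the FDA example, knowledge structuring and induction might look like the following sketch; the record schema (`medicine`, `approval_date`) is hypothetical and stands in for whatever the retrieval step returns.

```python
from collections import Counter

def organize_approvals(records):
    """Knowledge structuring + induction sketch for the FDA example.

    `records` are raw retrieved facts such as
    {"medicine": "...", "approval_date": "2021-05-17"}.
    Structuring normalizes them into (name, year) pairs; induction
    groups them by approval year for downstream statistics.
    """
    structured = [(r["medicine"], int(r["approval_date"][:4]))
                  for r in records]
    per_year = Counter(year for _, year in structured)
    return structured, dict(per_year)
```

The per-year counts produced here are exactly the kind of organized knowledge (e.g., total medicines approved per year) that the forecasting submodule consumes.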

Figure 14: Overview of the L3-RAG framework. The highlighted squares indicate the knowledge structuring and knowledge induction submodules in the knowledge organization module, and the forecasting submodule in the knowledge-centric reasoning module.
Given the limitations of LLMs in applying specialized reasoning logic, their effectiveness in predictive tasks can be restricted. To overcome this, the knowledge-centric reasoning module is enhanced with a forecasting submodule, enabling the system to infer outcomes based on the input queries and the organized knowledge (e.g. total numbers of medicines approved per year). This forecasting submodule allows the system to not only generate answers based on historical knowledge, but also make projections, providing a more robust and dynamic response to complex queries. By integrating advanced knowledge structuring and prediction capabilities, the L3 system can manage and utilize a more complex and dynamic knowledge base effectively.
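A toy version of such a forecasting submodule, assuming the yearly counts produced by the knowledge organization step, could fit a least-squares linear trend and project one year ahead; a real system could plug in any domain-specific model here.

```python
def forecast_next(per_year):
    """Fit a least-squares line to yearly counts (e.g., medicines
    approved per year) and project the count for the next year.
    Returns (next_year, predicted_count)."""
    years = sorted(per_year)
    n = len(years)
    xs = list(range(n))
    ys = [per_year[y] for y in years]
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return years[-1] + 1, intercept + slope * n   # extrapolate to x = n
```

This is the "projection beyond historical knowledge" step: the answer is computed from organized knowledge rather than retrieved verbatim.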
5.5 Level-4: Creative Question focused RAG System
The L4 system implementation is characterized by the integration of multi-agent systems to facilitate multi-perspective thinking. Addressing creative questions requires creative thinking that draws on factual information and an understanding of underlying principles and rules. At this advanced level, the primary challenges include extracting coherent logical rationales from retrieved knowledge, navigating complex reasoning processes with numerous influencing factors, and assessing the quality of responses to creative, open-ended questions. To tackle these challenges, the system coordinates multiple agents, each contributing unique insights and reasoning strategies, as illustrated in Figure 15. These agents operate in parallel, synthesizing various thought processes to generate comprehensive and coherent solutions. This multi-agent architecture supports the parallel processing and integration of diverse reasoning paths, ensuring effective management and response to intricate queries. By simulating diverse viewpoints, the L4 system enhances its ability to tackle creative questions, generating innovative ideas rather than predefined solutions. The coordinated outputs from multiple agents not only enrich the reasoning process but also provide users with comprehensive perspectives, fostering creative thinking and inspiring novel solutions to complex problems.

Figure 15: Overview of L4-RAG framework. The multi-agent planning module is introduced to enable multi-perspective thinking.
6 Evaluation and Metrics
To validate the effectiveness of our proposed method, we conduct experiments on both open-domain benchmarks and domain-specific benchmarks. We delineate the evaluation metrics and methods employed to assess the performance of the proposed knowledge-aware task decomposition method in Section 6.1. The evaluation results on three open-domain benchmarks are presented in Section 6.2, while the results on two legal domain-specific benchmarks appear in Section 6.3. Furthermore, we present in-depth analysis through three real case studies in Section 6.4, which highlight the superiority of our method compared to existing decomposition approaches.
6.1 Experimental Setup
Methods To thoroughly evaluate the performance of our proposed knowledge-aware decomposition approach (described in Section 5.3), we have selected a variety of baseline methods that represent different strategies for task-solving with LLMs. We include Zero-Shot CoT [34] to assess the inherent reasoning capabilities and embedded knowledge of the underlying LLM without additional context. Naive RAG [35], which introduces external knowledge through retrieval, serves as a benchmark for evaluating the incremental benefits of augmented knowledge. The Self-Ask framework [43] is employed to investigate the impact of an iterative question decomposition and answering strategy on task performance. Additionally, GraphRAG [20] is evaluated in both local and global modes to assess the impact of knowledge graph-based methods on multi-hop reasoning tasks.
To ensure a fair comparison and to highlight the influence of hierarchical knowledge structures, we have extended Naive RAG and Self-Ask to utilize both a general flat knowledge base, denoted as $R$, and a hierarchical retriever, denoted as $H\text{-}R$, as introduced in Figure 11. The hierarchical retriever ($H\text{-}R$) uses the questions or follow-up questions to retrieve chunks through both path (a) and path (b) simultaneously. The retrieved chunks from both paths are then aggregated to form a comprehensive reference context for the LLM to answer each question, potentially enhancing the relevance of the provided context.
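The dual-path aggregation of $H\text{-}R$ can be sketched as follows; `atomic_index`, mapping each atomic question to its source chunk and embedding, is an assumed data structure, and cosine similarity is used for both paths.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_retrieve(q_emb, chunk_embs, atomic_index, top_k=4):
    """H-R sketch: path (a) matches the query against chunk embeddings
    directly; path (b) matches it against atomic-question embeddings and
    follows each hit back to its source chunk. Results from both paths
    are unioned into one reference context."""
    path_a = sorted(chunk_embs,
                    key=lambda c: cos(q_emb, chunk_embs[c]),
                    reverse=True)[:top_k]
    path_b = sorted(atomic_index,
                    key=lambda aq: cos(q_emb, atomic_index[aq][1]),
                    reverse=True)[:top_k]
    chunks_b = [atomic_index[aq][0] for aq in path_b]
    seen, context = set(), []
    for c in path_a + chunks_b:          # aggregate, de-duplicated
        if c not in seen:
            seen.add(c)
            context.append(c)
    return context
```

Path (b) can surface a chunk whose own embedding is a poor match for the query but whose atomic-question tag matches well, which is the motivation for the hierarchical index.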
The experimental methods are summarized as follows:
• Zero-Shot CoT: Questions are addressed using solely the Chain-Of-Thought (CoT) technique, which prompts the LLM to articulate its reasoning process step-by-step without the aid of example demonstrations or supplemental context. This method assesses the LLM’s intrinsic knowledge and reasoning capabilities in a zero-shot setting.
• Naive RAG w/ R: This approach employs dense retrieval from a flat knowledge base to procure relevant information for each question. The knowledge base consists of pre-embedded chunks that are matched to the original question based on semantic similarity. The retrieval process is direct, without any intermediate task decomposition.
• Naive RAG w/ H-R: This method extends the Naive RAG framework by incorporating a hierarchical retrieval process ($H\text{-}R$) that operates through two concurrent paths. Path (a) performs a direct retrieval of knowledge chunks in response to the original question, similar to the flat retrieval approach. Path (b), on the other hand, uses the original question to find the relevant atomic questions and obtain the corresponding chunks. The combined output from both paths is then aggregated, creating a rich reference context.
• Self-Ask: This method employs a task decomposition strategy wherein the LLM is prompted to iteratively generate and answer follow-up questions, thereby breaking down complex problems into more manageable sub-tasks. General demonstrations illustrating the logic and methodology of task decomposition are provided for all benchmarks to guide the LLM’s reasoning process. As detailed in the original paper [43], the framework encourages the LLM to engage in a recursive dialogue with itself, generating intermediate answers that progressively build towards the final answer. In this setting, the LLM relies solely on its inherent knowledge base, as no external contexts are introduced to aid in answering the follow-up questions.
• Self-Ask w/ R: Building upon the Self-Ask method, this setting introduces an additional retrieval component: for each follow-up question generated by the LLM, relevant chunks are retrieved from a flat knowledge base to provide a reference context. The retrieval process uses the follow-up question as the query. This approach seeks to combine the benefits of iterative task decomposition with rich external knowledge from retrieval, potentially improving the LLM’s performance on complex reasoning tasks.
• Self-Ask w/ H-R: This variant of the Self-Ask method enhances the retrieval process by utilizing a hierarchical knowledge base, as opposed to the flat one used in Self-Ask w/ R.
When the LLM generates follow-up questions, these are employed as queries in a dual-path retrieval system, specifically paths (a) and (b) in Figure 11. The outputs from both retrieval paths are then aggregated to form a richer reference context.
• GraphRAG Local: In this approach, the flat knowledge base is pre-processed to construct a knowledge graph in accordance with the public guidance. The inference is run in local mode.
• GraphRAG Global: The inference is run in global mode in this setting.
• Ours: The proposed knowledge-aware decomposition method iteratively decomposes complex questions into sub-questions and retrieves relevant knowledge up to a maximum of five iterations. This process limits the context for the final answer to the five most useful knowledge chunks.
Metrics For maintaining consistency with established benchmarks, two conventional metrics are adopted in our experimental evaluation: Exact Match (EM), which assesses whether the response is identical to a predefined correct answer, and the F1 score, which is the harmonic mean of precision and recall at the token level. During evaluation, we noticed that the LLM sometimes produced responses more verbose than expected, even when the QA prompt aimed to limit output style. To more accurately gauge the responses’ alignment with the intended answers—beyond mere lexical matching—we introduced a novel evaluation metric employing GPT-4. In this process, GPT-4 acts as an evaluator, assessing the correctness of a response in relation to the question and the correct answer labels. We refer to this metric as Accuracy (Acc). Upon manual inspection of a sample set, the judgments rendered by GPT-4 demonstrate complete agreement with human evaluators, affirming the reliability of this metric.
Furthermore, we encountered situations where a method achieves high accuracy (Acc) scores yet registers low F1 scores. To elucidate the underlying factors of such discrepancies, we also report on the Recall and Precision of the generated responses. Recall measures the proportion of relevant tokens from the answer labels that are captured in the response, while precision evaluates the relevance of the tokens in the generated answer with respect to the correct labels. Specifically, in cases where multiple correct answer labels are available, we employ a conservative scoring approach for EM, F1, Precision, and Recall by retaining the highest score achieved. This approach is designed to equitably consider the range of correct answers that the LLM may generate. It should be noted that, in the context of computing Accuracy (Acc), all admissible answer labels are furnished concurrently to the evaluation process, resulting in a singular Accuracy score.
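The token-level metrics with the conservative multi-label scheme can be implemented as below; the SQuAD-style normalization (lowercasing, stripping punctuation and articles) is an assumption about preprocessing, not something the paper specifies.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style answer normalization (assumed preprocessing)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_scores(prediction, gold):
    """Token-level (F1, Precision, Recall) against one gold label."""
    pred, gold = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall), precision, recall

def evaluate(prediction, gold_answers):
    """EM/F1/Precision/Recall against multiple gold labels, keeping the
    highest score per metric (the conservative scheme described above)."""
    em = max(float(normalize(prediction) == normalize(g)) for g in gold_answers)
    f1, p, r = map(max, zip(*(token_scores(prediction, g) for g in gold_answers)))
    return {"EM": em, "F1": f1, "Precision": p, "Recall": r}
```

This also illustrates the Acc-vs-F1 discrepancy discussed next: a verbose but correct answer can score perfectly on Recall (and Acc, as judged by GPT-4) while its Precision and F1 drop.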
The metrics employed in this evaluation — Exact Match (EM), F1, Precision, Recall, and Accuracy (Acc) — are primarily suited for questions categorized as L1 and L2, which are characterized by their association with ground truth answers that are factual and definitive. However, the utility of these metrics diminishes for predictive and creative questions, namely the L3 and L4 questions, where answers are inherently uncertain or subjective, and no single correct response exists. For L3 questions, alternative assessment methods such as trend judgment and qualitative analysis become more appropriate to capture the predictive validity of the responses. Furthermore, for L4 questions, which demand a higher degree of insight or innovation, it is essential to evaluate answers through a multi-faceted lens, considering criteria such as relevance, diversity, comprehensiveness, uniqueness, and inspiration to fully appreciate the depth and originality of the approaches’ responses.
LLM and Hyper-parameters In our experiments, we employ GPT-4 (1106-Preview version) across all the methods outlined previously. For the knowledge extraction phase, we utilize a temperature setting of 0.7 specifically for the Knowledge Atomizing process, promoting a balance between diversity and determinism in the generated atomic knowledge. Conversely, during all question-answering (QA) steps in each method, we implement a temperature of 0, ensuring consistent responses from the model. Regarding the retrieval component, we engage the text-embedding-ada-002 (version 2) as our embedding model for both the general flat knowledge bases and the hierarchical knowledge bases. For the general flat knowledge bases, the retriever is configured to fetch up to 16 knowledge chunks, applying a retrieval score threshold of 0.2. In the case of hierarchical knowledge bases, the retriever is initially set to retrieve a maximum of 8 chunks with a more stringent threshold of 0.5. Subsequently, an additional 4 chunks can be retrieved via each atomic query posed.
大语言模型与超参数
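For reference, the settings above can be collected into a configuration sketch. The key names are hypothetical; only the values restate what is described in the text:

```python
# Key names are hypothetical; values restate the experimental settings above.
PIKE_RAG_CONFIG = {
    "llm": {
        "model": "gpt-4-1106-preview",
        "temperature": {
            "knowledge_atomizing": 0.7,   # diversity in generated atomic knowledge
            "question_answering": 0.0,    # deterministic QA responses
        },
    },
    "embedding_model": "text-embedding-ada-002",  # version 2, for both KB types
    "retriever": {
        "flat_kb": {"max_chunks": 16, "score_threshold": 0.2},
        "hierarchical_kb": {
            "max_chunks": 8,
            "score_threshold": 0.5,
            "extra_chunks_per_atomic_query": 4,
        },
    },
}
```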
6.2 Evaluation on Open-Domain Benchmarks
6.2 开放领域基准测试评估
In this subsection, we demonstrate the performance of our method across three open-domain benchmarks. To ensure a fair and objective evaluation, particularly in the context of real-world industrial applications, we have selected three widely recognized multi-hop datasets: HotpotQA [60], 2WikiMultiHopQA [28], and MuSiQue [51]. Below, we provide a brief overview of these datasets, noting that our method leverages neither question-type nor hop-count information during the solving process, as our approach is designed to be agnostic to such classifications.
在本小节中,我们在三个开放域基准上展示了我们方法的性能。为了确保公平和客观的评估,特别是在现实世界工业应用的背景下,我们选择了三个广泛认可的多跳数据集:HotpotQA [60]、2WikiMultiHopQA [28] 和 MuSiQue [51]。下面,我们简要概述了这些数据集,并指出我们的方法在解决过程中没有利用问题类型信息或跳数信息,因为我们的方法旨在与这些分类无关。
HotpotQA The HotpotQA dataset is a well-known multi-hop QA benchmark primarily consisting of 2-hop questions, each associated with 10 Wikipedia paragraphs. Among these, some paragraphs contain supporting facts essential to answering the question, while the rest serve as distractors. The dataset also includes a question type field, which delineates the logical reasoning required: comparison questions involve contrasting two entities, while bridge questions require inferring the bridge entity, inferring the property of an entity through an intermediary entity, or locating the answer entity [60]. The comparison questions in HotpotQA align with the comparative questions defined in Section 3.2. Similarly, bridge questions correspond to either bridging questions or summarizing questions, depending on the complexity of the rationale required. Although our method operates independently of these types, their description here exemplifies the nature of questions within the dataset and contextualizes the expected performance variance across different benchmarks.
HotpotQA
2WikiMultiHopQA Inspired by HotpotQA, 2WikiMultiHopQA expands the diversity of question types. It retains the comparison type from HotpotQA and introduces inference and compositional questions that evolve from the bridge type by focusing on entity attribute deduction and entity location, respectively. Additionally, the bridge comparison type is a novel category that requires a synthesis of bridge and comparison reasoning. In this dataset, the comparison questions correspond to the comparative questions defined in Section 3.2, akin to those in HotpotQA. The inference questions are analogous to bridging questions, and the compositional questions are similar to summarizing questions as described in the same section. The bridge comparison questions, due to their hybrid nature and increased complexity, also fall under the summarizing questions category. This dataset typically presents 2-hop to 4-hop questions, each accompanied by 10 Wikipedia paragraphs containing supporting facts and distractors. While these types inform the dataset's structure, they are not utilized by our method, which treats all questions uniformly regardless of their categorization.
2WikiMultiHopQA 受 HotpotQA 启发,扩展了问题类型的多样性。它保留了 HotpotQA 中的比较类型,并引入了推理和组合问题,这些问题分别从桥接类型演变而来,专注于实体属性推导和实体位置。此外,桥接比较类型是一个新颖的类别,需要综合桥接和比较推理。在该数据集中,比较问题对应于第 3.2 节定义的比较问题,类似于 HotpotQA 中的问题。推理问题类似于桥接问题,组合问题则类似于同一节中描述的总结问题。桥接比较问题由于其混合性质和增加的复杂性,也被归类为总结问题。该数据集通常提供 2 跳至 4 跳的问题,每个问题都附带 10 个包含支持事实和干扰信息的 Wikipedia 段落。虽然这些类型影响了数据集的构建,但我们的方法并未使用它们,而是将所有问题视为相同,无论其分类如何。
MuSiQue Addressing the issue that many multi-hop questions can be solved via shortcuts—arriving at correct answers without proper reasoning—MuSiQue implements stringent filters and additional mechanisms specifically designed to encourage connected reasoning, as reported by Trivedi et al. [51]. Unlike the other datasets, MuSiQue does not categorize questions by type, but it does provide explicit information on the number of hops required for each question, ranging from 2 to 4 hops. Each question is associated with 20 context paragraphs, which introduce a mix of relevant and irrelevant information, further complicating the task of discerning the correct reasoning path. This explicit hop information, while not used by our method, underscores the complexity of the dataset and the robustness required by models to handle such challenges effectively.
MuSiQue
为解决许多多跳问题可通过捷径解决(即无需适当推理即可得出正确答案)的问题,MuSiQue 实施了严格的过滤器和额外机制,专门设计用于鼓励连接推理,如 Trivedi 等人 [51] 所报告。与其他数据集不同,MuSiQue 不按类型分类问题,但它确实提供了每个问题所需的跳数的明确信息,范围从 2 跳到 4 跳。每个问题都与 20 个上下文段落相关联,这些段落引入了相关和无关信息的混合,进一步增加了辨别正确推理路径的难度。尽管我们的方法未使用这些明确的跳数信息,但它突显了数据集的复杂性以及模型有效应对此类挑战所需的鲁棒性。
In our experiments, we randomly sample 500 QA data from the dev set of each dataset, without regard to question type or number of hops, to ensure randomness. We compile the context paragraphs from all sampled QA data into a single knowledge base for each benchmark, creating a more complex retrieval scenario. This design choice is aimed at rigorously assessing our model's question decomposition and relevant context retrieval abilities. Table 3 outlines the distribution of question types within our sampled sets, offering insight into the variety of reasoning challenges presented in our evaluation, though this does not directly impact our method.
在我们的实验中,我们从每个数据集的开发集中随机抽取 500 个问答数据,不考虑问题类型和跳数,以确保随机性。我们将所有抽取的问答数据的上下文段落编译成每个基准的单一知识库,从而创建了一个更复杂的检索场景。这一设计选择旨在严格评估我们模型的问题分解和相关上下文检索能力。表 3 概述了我们抽取集中的问题类型分布,展示了评估中呈现的多种推理挑战,尽管这并不直接影响我们的方法。
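This setup can be sketched as follows; the `context` field name and the exact de-duplication strategy are assumptions about the dataset schema rather than the exact preprocessing used:

```python
import random

def build_benchmark(dev_set, n=500, seed=0):
    """Sample n QA items at random and pool every sampled item's context
    paragraphs into one shared knowledge base, which makes retrieval harder
    than answering against each question's own 10-20 paragraphs."""
    rng = random.Random(seed)
    sampled = rng.sample(dev_set, n)
    knowledge_base, seen = [], set()
    for item in sampled:
        for paragraph in item["context"]:  # field name is an assumption
            if paragraph not in seen:      # de-duplicate repeated paragraphs
                seen.add(paragraph)
                knowledge_base.append(paragraph)
    return sampled, knowledge_base
```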
Overall Performance The evaluation results on HotpotQA, 2WikiMultiHopQA, and MuSiQue are presented in Table 4, Table 5, and Table 6, respectively. If we hypothesize that the highest achievable performance on each benchmark reflects its relative difficulty, a tentative ranking from easiest to most challenging would be: HotpotQA, 2WikiMultiHopQA, and MuSiQue. Our observations suggest that for HotpotQA, considered the least challenging, GraphRAG in local mode and our method are closely competitive, with minor performance disparities. However, as the difficulty increases on 2WikiMultiHopQA and MuSiQue, our method outperforms the others.
总体性能
HotpotQA、2WikiMultiHopQA 和 MuSiQue 的评估结果分别在表 4、表 5 和表 6 中呈现。如果我们假设每个基准测试的最高可达到性能可能反映其相对难度,那么从最容易到最具挑战性的初步排名将是:HotpotQA、2WikiMultiHopQA 和 MuSiQue。我们的观察表明,对于被认为最不具挑战性的 HotpotQA,本地模式下的 GraphRAG 和我们的方法表现接近,性能差异较小。然而,随着 2WikiMultiHopQA 和 MuSiQue 难度的增加,我们的方法优于其他方法。
Table 3: Distribution of question types across three multi-hop QA datasets.
表 3: 三个多跳问答数据集中问题类型的分布。

(a) HotpotQA

| Type | Count | Ratio |
|---|---|---|
| comparison | 107 | 21.4% |
| bridge | 393 | 78.6% |

(b) 2WikiMultiHopQA

| Type | Count | Ratio |
|---|---|---|
| comparison | 132 | 26.4% |
| inference | 64 | 12.8% |
| compositional | 196 | 39.2% |
| bridge_comparison | 108 | 21.6% |

(c) MuSiQue

| #Hops | Count | Ratio |
|---|---|---|
| 2 | 263 | 52.6% |
| 3 | 169 | 33.8% |
| 4 | 68 | 13.6% |
Table 4: Performance comparison on HotpotQA. Best in bold, second-best underlined.
表 4: HotpotQA 上的性能对比。最佳结果加粗,次佳结果加下划线。
| Method | EM | F1 | Acc | Precision | Recall |
|---|---|---|---|---|---|
| Zero-Shot CoT | 32.60 | 43.94 | 53.60 | 46.56 | 43.97 |
| NaiveRAG w/ R | 56.80 | 72.67 | 82.60 | 74.52 | 74.86 |
| NaiveRAG w/ H-R | 54.80 | 70.25 | 81.60 | 72.56 | 72.24 |
| Self-Ask | 28.80 | 43.61 | 59.60 | 43.49 | 56.21 |
| Self-Ask w/ R | 44.80 | 63.08 | 81.00 | 63.23 | 74.57 |
| Self-Ask w/ H-R | 47.20 | 64.24 | 82.20 | 64.27 | 75.95 |
| GraphRAG Local | 0.00 | 10.66 | 89.00 | 5.90 | 83.07 |
| GraphRAG Global | 0.00 | 7.42 | 64.80 | 4.08 | 63.16 |
| Ours | 61.20 | 76.26 | 87.60 | 78.10 | 78.95 |
The inclusion of retrieved context significantly enhances accuracy, with gains ranging from approximately 10% (comparing Zero-Shot CoT and Naive RAG on MuSiQue) to around 29% (on HotpotQA). This indicates that for simpler benchmarks, RAG equipped with naive knowledge retrieval can address simple multi-hop questions, leading to a significant accuracy boost. However, for more challenging benchmarks involving complex multi-hop questions, the accuracy improvement from naive knowledge retrieval is limited, underscoring the constrained reasoning capabilities of the LLMs. By incorporating decomposition mechanisms, Self-Ask significantly enhances accuracy, especially on more challenging benchmarks. The combination of knowledge retrieval and Self-Ask decomposition yields superior results on 2WikiMultiHopQA and MuSiQue, compared to using a single mechanism. However, in the case of HotpotQA, all methods employing retrieval (except for GraphRAG in Global mode, which will be discussed later) attain accuracies above 80%, with negligible differences between them.
引入检索到的上下文显著提高了准确性,提升幅度从约 10%(比较 MuSiQue 上的零样本 CoT 和 Naive RAG)到约 29%(在 HotpotQA 上)。这表明,对于更简单的基准测试,配备简单知识检索的 RAG 可以解决简单的多跳问题,从而显著提高准确性。然而,对于涉及复杂多跳问题的更具挑战性的基准测试,简单知识检索带来的准确性提升有限,突显了大语言模型的推理能力受限。通过引入分解机制,Self-Ask 显著提高了准确性,尤其是在更具挑战性的基准测试上。知识检索与 Self-Ask 分解的结合在 2WikiMultiHopQA 和 MuSiQue 上优于单一机制的使用。然而,在 HotpotQA 的情况下,所有采用检索的方法(除了稍后将讨论的 GraphRAG 的全局模式)的准确性都超过了 80%,且它们之间的差异可以忽略不计。
Interestingly, the application of a hierarchical atomic knowledge base does not significantly impact Naive RAG’s performance compared to Naive RAG with general flat knowledge base, potentially due to the embedding distance between the original multi-hop questions and the atomic questions of relevant contexts. Nonetheless, when combined with task decomposition, a hierarchical knowledge base shows more promise, as evidenced by the performance boost observed in Self-Ask with Hierarchical Retrieval (Self-Ask w/ H-R) compared to Self-Ask with Retrieval (Self-Ask w/ R), particularly on MuSiQue, which requires more complex reasoning. This improvement underscores the potential of hierarchical knowledge bases in enhancing the effectiveness of decomposition mechanisms in complex reasoning tasks.
有趣的是,与使用普通平面知识库的Naive RAG相比,使用分层原子知识库并没有显著影响Naive RAG的性能,这可能是由于原始多跳问题与相关上下文的原子问题之间的嵌入距离所致。然而,当与任务分解结合时,分层知识库显示出更大的潜力,这一点在Self-Ask with Hierarchical Retrieval (Self-Ask w/ H-R)与Self-Ask with Retrieval (Self-Ask w/ R)相比的性能提升中得到了体现,尤其是在需要更复杂推理的MuSiQue上。这一改进凸显了分层知识库在增强复杂推理任务中分解机制有效性方面的潜力。
Our proposed method focuses on knowledge-aware task decomposition, which performs decomposition with an awareness of available knowledge, effectively leveraging the atomic information provided by the hierarchical knowledge base. Experimental results demonstrate that our approach consistently outperforms other methods, validating its effectiveness in complex reasoning scenarios.
我们提出的方法专注于知识感知的任务分解,该方法在分解时考虑到可用知识,有效利用了层次知识库提供的原子信息。实验结果表明,我们的方法始终优于其他方法,验证了其在复杂推理场景中的有效性。
Regarding GraphRAG, originally designed for the query-focused summarization (QFS) task as outlined by [20], we observe its suboptimal performance in both local and global modes compared to our method. Notably, GraphRAG exhibits a curious trend: it achieves higher accuracy and recall scores while performing lower on EM, F1, and Precision metrics. A closer analysis of GraphRAG's outputs reveals a tendency to echo the query and include meta-information about the answer within its graph structure. Despite attempts to refine its QA prompt, this behavior persists. An illustrative
关于 GraphRAG,最初设计用于查询聚焦摘要 (QFS) 任务,如 [20] 所述,我们观察到其在局部和全局模式下的表现均不如我们的方法。值得注意的是,GraphRAG 呈现出一个有趣的现象:它在准确率和召回率上得分较高,但在 EM、F1 和 Precision 指标上表现较低。对 GraphRAG 输出的深入分析表明,它倾向于重复查询并在其图结构中包含有关答案的元信息。尽管尝试改进其 QA 提示,这种行为仍然存在。
Table 5: Performance comparison on 2WikiMultiHopQA. Best in bold, second-best underlined.
表 5: 2WikiMultiHopQA 上的性能对比。最佳结果加粗,次佳结果加下划线。
| Method | EM | F1 | Acc | Precision | Recall |
|---|---|---|---|---|---|
| Zero-Shot CoT | 35.67 | 41.40 | 43.87 | 41.43 | 43.11 |
| NaiveRAG w/ R | 51.20 | 62.80 | 59.74 | 59.06 | 62.30 |
| NaiveRAG w/ H-R | 51.40 | 63.00 | 59.73 | 59.36 | 62.43 |
| Self-Ask | 23.80 | 37.49 | 51.60 | 34.56 | 60.72 |
| Self-Ask w/ R | 46.80 | 64.17 | 79.80 | 61.17 | 80.21 |
| Self-Ask w/ H-R | 48.00 | 63.99 | 80.00 | 61.30 | 79.56 |
| GraphRAG Local | 0.00 | 11.83 | 71.20 | 6.74 | 75.17 |
| GraphRAG Global | 0.00 | 7.35 | 45.00 | 4.09 | 55.43 |
| Ours | 66.80 | 75.19 | 82.00 | 74.04 | 78.87 |
Table 6: Performance comparison on MuSiQue. Best in bold, second-best underlined.
表 6: MuSiQue 上的性能对比。最佳结果加粗,次佳结果加下划线。
| Method | EM | F1 | Acc | Precision | Recall |
|---|---|---|---|---|---|
| Zero-Shot CoT | 12.93 | 22.90 | 23.47 | 24.40 | 24.10 |
| NaiveRAG w/ R | 32.00 | 43.31 | 44.40 | 44.42 | 47.29 |
| NaiveRAG w/ H-R | 30.40 | 41.30 | 43.40 | 42.06 | 44.53 |
| Self-Ask | 16.40 | 27.27 | 35.40 | 26.33 | 37.65 |
| Self-Ask w/ R | 28.40 | 42.54 | 49.80 | 42.47 | 55.89 |
| Self-Ask w/ H-R | 29.80 | 44.05 | 54.00 | 41.13 | 53.37 |
| GraphRAG Local | 0.60 | 9.62 | 49.80 | 5.73 | 55.82 |
| GraphRAG Global | 0.00 | 5.16 | 44.60 | 2.82 | 52.19 |
| Ours | 46.40 | 56.62 | 59.60 | 57.45 | 59.53 |
example is presented in Table 7, which shows GraphRAG Local’s response to a question from HotpotQA.
表 7 中展示了一个示例,展示了 GraphRAG Local 对 HotpotQA 问题的回答。
6.3 Evaluation on Legal Benchmarks
6.3 法律基准评测
In this subsection, we present the performance of our approach on two legal benchmarks: LawBench [22] and Open Australian Legal QA [14]. Before doing so, we provide a brief description of each benchmark.
在本小节中,我们展示了我们的方法在两个法律基准上的性能:LawBench [22] 和 Open Australian Legal QA [14]。在此之前,我们简要描述了每个基准。
LawBench LawBench is a comprehensive legal benchmark for Chinese laws. It comprises 20 meticulously designed tasks aimed at accurately assessing the legal capabilities of LLMs. Unlike some existing benchmarks that rely solely on multiple-choice questions, LawBench includes a variety of task types that are closely related to real-world applications. These tasks encompass legal entity recognition, reading comprehension, crime amount calculation, and legal consulting, among others. Since not all tasks are RAG-oriented (e.g., reading comprehension), we have selected 6 specific tasks, detailed in Table 8. Each task contains 500 questions.
LawBench
We also provide example questions for these tasks for the reader's reference.
我们还提供了这些任务的示例问题,供读者参考。
Table 7: An Example of GraphRAG Local output on a HotpotQA question. The table showcases the tendency to repeat the question and include meta-information in its response.
表 7: 一个关于 HotpotQA 问题的 GraphRAG Local 输出示例。该表展示了在回答中重复问题并包含元信息的倾向。
| Question | In which country are Alsa Mall and Spencer Plaza located? |
|---|---|
| Answer label | India |
| GraphRAG's answer | Both Alsa Mall and Spencer Plaza are located in Chennai, India [Data: India and Chennai communities (2391); Entities (4901, 4904); Relationships (9479, 1687, 5215, 5217)]. |
Table 8: Overview of LawBench tasks
表 8: LawBench任务概览
| Task No. | Task | Type | Metric |
|---|---|---|---|
| 1-1 | Statute Recitation | Generation | F1 |
| 1-2 | Legal Knowledge Q&A | Single-choice | EM |
| 3-1 | Statute Prediction (Fact-based) | Multiple-choice | EM |
| 3-2 | Statute Prediction (Scenario-based) | Generation | F1 |
| 3-6 | Case Analysis | Single-choice | EM |
| 3-8 | Consultation | Generation | F1 |
member management rules, and other relevant rules in accordance with securities laws and administrative regulations, and reports to the securities regulatory authority under the State Council for record.
根据证券法律和行政法规制定成员管理规则及其他相关规则,并向国务院证券监管机构报告备案。
3-1: Based on the following facts and charges, provide the relevant articles of the Criminal Law. Facts: The Yushu City, Jilin Province, accused that on November 15, 2015, the defendant He signed a car rental agreement with Guo, the owner of a taxi with license plate number xxx. The agreement stipulated a monthly rent of RMB 3,900.00, payable monthly. On January 19, 2016, without the knowledge of Guo, the defendant He concealed the truth and falsely claimed to be the owner of the taxi. He signed a car rental agreement with the victim Ma, with a monthly rent of RMB 3,800.00 and a rental period of one year, collecting a total of RMB 50,600.00 from Ma for one year's rent and vehicle deposit. On February 26, 2016, the taxi was retrieved by its owner Guo from the victim Ma. The victim Ma repeatedly asked the defendant He to return the rent and deposit, but the defendant He refused to return them. The prosecution provided evidence including the defendant's confession, the victim's statement, witness testimonies, and documentary evidence, and believed that the defendant He, with the purpose of illegal possession, defrauded others of their property by fabricating facts and concealing the truth during the signing and performance of the contract. The amount was relatively large, and his actions violated the provisions of Article xx of the Criminal Law of the People's Republic of China, and he should be held criminally responsible for xx. Charge: Contract Fraud.
3-1: 根据以下事实和指控,提供《中华人民共和国刑法》的相关条文。事实:吉林省榆树市指控,2015年11月15日,被告人何与车牌号为xxx的出租车车主郭签订了一份汽车租赁协议,协议规定月租金为人民币3,900.00元,按月支付。2016年1月19日,被告人何在郭不知情的情况下隐瞒真相,谎称自己是该出租车的车主,与被害人马签订了一份汽车租赁协议,月租金为人民币3,800.00元,租期为一年,向马收取了一年的租金及车辆押金共计人民币50,600.00元。2016年2月26日,该出租车被车主郭从被害人马处取回。被害人马多次要求被告人何退还租金及押金,但被告人何拒绝退还。公诉机关提供了包括被告人供述、被害人陈述、证人证言及书证等证据,并认为被告人何以非法占有为目的,在签订、履行合同过程中,虚构事实、隐瞒真相,骗取他人财物,数额较大,其行为违反了《中华人民共和国刑法》第xx条的规定,应以xx罪追究其刑事责任。指控:合同诈骗罪。
3-2: Please provide the legal basis according to the specific scenario and question; only the content of the specific legal provision is needed, and each scenario involves only one legal provision. Scenario: A cargo ship arrives at the port of discharge, but the consignee fails to arrive in time to collect the goods. Under which legal provision can the captain unload the goods at another appropriate place?
3-2: 请根据具体场景和问题提供法律依据,只需具体法律条款的内容,每个场景仅涉及一条法律条款。场景:一艘货船到达卸货港,但收货人未能及时到达以提取货物。船长可以根据哪条法律条款在其他适当的地方卸货?
3-6: One year after the bar opened, the business environment changed drastically, and all partners held a meeting to discuss countermeasures. According to the 'Partnership Enterprise Law,' the following voting matters are considered valid votes: A: Zhang believes that the name 'Tongcheng' is not attractive and proposes to change it to 'Tongsheng Bar.' Wang and Zhao agree, but Li opposes; B: In view of the sluggish business, Wang proposes to suspend operations for one month for renovation and reorganization. Zhang and Zhao agree, but Li opposes; C: Due to the urgent needs of the bar, Zhao proposes to sell a batch of coffee machines to the bar. Zhang and Wang agree, but Li opposes; D: Given the four partners' lack of experience in bar management,
3-6: 酒吧开业一年后,经营环境发生巨变,所有合伙人召开会议商讨对策。根据《合伙企业法》,以下投票事项被视为有效表决:A: 张认为“同城”这个名字缺乏吸引力,提议改为“同生酒吧”。王和赵同意,但李反对;B: 鉴于生意惨淡,王提议暂停营业一个月进行装修和重组。张和赵同意,但李反对;C: 由于酒吧急需,赵提议向酒吧出售一批咖啡机。张和王同意,但李反对;D: 鉴于四位合伙人缺乏酒吧管理经验,
Table 9: Evaluation Results on Legal Benchmarks (Metric is F1 / EM as indicated in Table 8)
表 9: 法律基准评估结果(指标为 F1 / EM,如表 8 所示)
| Task | Zero-ShotCoT | GraphRAG Local | Ours (N=5) |
|---|---|---|---|
| LawBench 1-1 | 21.31 | – | 23.27 |
| LawBench 1-2 | 54.24 | – | 62.60 |
| LawBench 3-1 | 53.32 | – | 74.60 |
| LawBench 3-2 | 27.51 | – | 25.98 |
| LawBench 3-6 | 51.16 | 47.64 | 61.91 |
| LawBench 3-8 | 17.44 | 18.43 | 23.58 |
| Open Australian Legal QA | 25.10 | 34.35 | 63.34 |
Table 10: Evaluation Results on Legal Benchmarks (Metric is Acc)
表 10: 法律基准评估结果(指标为 Acc)
| Task | Zero-ShotCoT | GraphRAG Local | Ours (N=5) |
|---|---|---|---|
| LawBench 1-1 | 1.23 | – | 16.60 |
| LawBench 1-2 | 54.00 | – | 63.40 |
| LawBench 3-1 | 49.90 | – | 75.40 |
| LawBench 3-2 | 15.83 | – | 27.60 |
| LawBench 3-6 | 51.12 | – | 57.00 |
| LawBench 3-8 | 49.70 | 58.80 | 61.72 |
| Open Australian Legal QA | 16.48 | 88.27 | 98.59 |
Li proposes to appoint his friend Wang as the managing partner. Zhang and Wang agree, but Zhao opposes. 3-8: Resident A rented out the house to B. With A's consent, B renovated the rented house and sublet it to C. C unilaterally altered the load-bearing structure of the house. Why can A request B to bear liability for breach of contract?
李提议任命他的朋友王为管理合伙人。张和王同意,但赵反对。3-8:居民 A 将房子租给 B。在 A 的同意下,B 对租来的房子进行了装修,并将其转租给 C。C 单方面改变了房屋的承重结构。为什么 A 可以要求 B 承担违约责任?
Open Australian Legal QA The benchmark consists of 2,124 questions and answers synthesized by GPT-4 from the Australian legal corpus. All questions are of the generation type. One example is: "What is the landlord’s general obligation under section 63 of the Act in the case of Anderson v Armitage [2014] NSWCATCD 157 in New South Wales?"
Open Australian Legal QA
该基准由GPT-4从澳大利亚法律语料库中生成的2,124个问答组成。所有问题均为生成式类型。例如:"在Anderson v Armitage [2014] NSWCATCD 157案中,根据新南威尔士州法律第63条,房东的一般义务是什么?"
Evaluation results are listed in Table 9, where we only compare to "GraphRAG Local", as it generally performs better than "GraphRAG Global" on these tasks.
评估结果列于表 9,其中我们仅与 "GraphRAG Local" 进行比较,因为在这些任务上它通常比 "GraphRAG Global" 表现更好。
For the aforementioned reasons, we also use GPT-4 to evaluate all experimental results, reporting the accuracy (Acc) in Table 10. When comparing the results in Table 9 and Table 10, we observe that the order of the results is preserved, even though some metrics change significantly. In the following section, we aim to identify the reasons behind these changes, which may provide valuable insights for designing better metrics to evaluate RAG frameworks in the future.
基于上述原因,我们也使用 GPT-4 来评估所有实验结果,并在表 10 中报告准确率 (Acc)。通过比较表 9 和表 10 的结果,我们观察到结果的顺序保持不变,尽管某些指标发生了显著变化。在接下来的部分中,我们旨在找出这些变化背后的原因,这可能为未来设计更好的指标来评估 RAG 框架提供有价值的见解。
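A sketch of the LLM-based accuracy check, assuming `llm` is any callable that returns the judge model's text completion; the prompt wording is illustrative, not the exact prompt used in these experiments:

```python
def judge_accuracy(llm, question, answer_labels, response):
    """Ask the judge model whether the response matches any admissible label;
    all labels are furnished concurrently, yielding a single Acc verdict."""
    prompt = (
        f"Question: {question}\n"
        f"Gold answers: {'; '.join(answer_labels)}\n"
        f"Model response: {response}\n"
        "Does the response convey the same answer as any gold answer? "
        "Reply with 'yes' or 'no' only."
    )
    return llm(prompt).strip().lower().startswith("yes")
```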
from a consulting website, resulting in varying quality. For example, one question asks "Do the children from the original marriage have an obligation to support the father?" However, the provided golden answer includes an irrelevant article, "Article 1067," which pertains to parents’ obligations to support minor children.
来自咨询网站,质量参差不齐。例如,有一个问题是“原婚姻中的子女是否有义务赡养父亲?”然而,提供的标准答案中包含了不相关的条款“第1067条”,该条款涉及父母对未成年子女的抚养义务。
| Question | Do the children from the original marriage have an obligation to support the father? |
|---|---|
| Reference answer | In our country, biological children have an obligation to support their parents. |
| Our answer | Yes, there is an obligation of support. Legal basis: the parents formed a family with new children and, per the court's judgment, the father must pay monthly child support to the mother until the child turns 18; the relationship between the original children and their parents is not dissolved by the parents' divorce. If children fail to support their elderly parents, the parents may file a lawsuit directly with the people's court requesting that the children pay support. Article 1067 of the Civil Code of the People's Republic of China provides that where parents fail to perform their duty of upbringing, minor children or adult children unable to live independently have the right to demand maintenance from their parents; where adult children fail to perform their duty of support, parents who lack the ability to work or are in financial difficulty have the right to demand support payments from their adult children. Article 1084 provides that the parent-child relationship is not dissolved by the parents' divorce; after divorce, children remain the children of both parents regardless of whether the father or the mother has direct custody. Lawyer's interpretation: after the parents' divorce, adult children are still the children of both parents; where adult children fail to perform their duty of support, parents who lack the ability to work or are in financial difficulty have the right to demand support. Under Article 1069 of the Civil Code of the People's Republic of China, the parent-child relationship does not terminate with changes in the parents' marital relationship; therefore, even if the parents divorce, remarry, and have new children, the original children still have an obligation of support. |
- The accuracy of all methods on choice tasks 1-2, 3-1, and 3-6 almost coincides with the F1 score, as expected. An exception is task 3-1, where the difference is mainly due to GPT-4’s capacity to understand Chinese, particularly in distinguishing numbers in Arabic and Chinese. In Chinese law, all numbers are written in Chinese, while in the golden answers, all numbers are given in Arabic.
所有方法在任务1-2、3-1和3-6上的准确率几乎与F1分数一致,正如预期的那样。例外是任务3-1,其中差异主要归因于GPT-4对中文的理解能力,尤其是在区分阿拉伯数字和中文数字方面。在中国法律中,所有数字都以中文书写,而在标准答案中,所有数字都以阿拉伯数字给出。
6.4 Real Case Studies
6.4 实际案例研究
This section presents three case studies from our evaluation benchmarks to illustrate the underlying principles of our proposed decomposition pipeline, as detailed in Algorithm 1. Through these real-world examples, we aim to highlight the benefits of our systematic approach. These cases shed light on how each step of the pipeline contributes to improved performance and the insights gained from their implementation.
本节通过三个案例研究来展示我们评估基准中的示例,以说明我们提出的分解管道的基本原理,如算法 1 中详述。通过这些真实世界的例子,我们希望突出我们系统化方法的优势。这些案例将揭示管道的每个步骤如何有助于提高性能,以及从实施中获得的见解。
Our task decomposition strategy involves generating multiple atomic queries rather than producing a single deterministic follow-up question, as demonstrated in the Self-Ask approach. Contemporary decomposition methods typically employ a generative model to formulate a singular follow-up question. However, this approach carries an intrinsic risk of generating erroneous questions, potentially leading to an incorrect decomposition pathway and, ultimately, an erroneous answer. Consider the Case (a) depicted in Figure 16, where the original question pertains to a film titled "What Women Love." Due to the existence of a more prominent film, "What Women Want," the employed language model tends to ‘correct’ the original question. Consequently, methods like Self-Ask (as shown on the left side of Figure 16) generate only one follow-up question related to this erroneously assumed object. In the illustrated instance, although the target chunk has been retrieved due to the similarity in embeddings, a ‘false’ intermediate answer is produced for the ‘false’ follow-up question, culminating in an incorrect final response. In contrast, our methodology posits atomic queries concerning both "What Women Love" and "What Women Want," thereby seeking to clarify the true intent of the initial question. With both films existing and relevant atomic questions being retrieved, our approach subsequently gains the advantage of verifying the question’s intent and selecting the correct and most pertinent chunk during the atomic selection phase.
我们的任务分解策略涉及生成多个原子查询,而不是像 Self-Ask 方法中那样生成单一的确定性后续问题。当代的分解方法通常使用生成模型来制定单一的后续问题。然而,这种方法存在生成错误问题的内在风险,可能导致错误的分解路径,并最终产生错误的答案。考虑图 16 中的案例 (a),原始问题涉及一部名为《What Women Love》的电影。由于存在一部更知名的电影《What Women Want》,所使用的语言模型往往会“纠正”原始问题。因此,像 Self-Ask 这样的方法(如图 16 左侧所示)只生成一个与错误假设对象相关的后续问题。在所示实例中,尽管由于嵌入的相似性已经检索到目标块,但针对“错误”的后续问题产生了“错误”的中间答案,最终导致错误的最终响应。相比之下,我们的方法提出了关于《What Women Love》和《What Women Want》的原子查询,从而试图澄清初始问题的真实意图。由于两部电影都存在且相关的原子问题被检索到,我们的方法随后具备了验证问题意图并在原子选择阶段选择正确且最相关块的优势。
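The contrast above can be sketched as one simplified round of knowledge-aware decomposition; `propose`, `search`, and `select` are placeholder callables standing in for the LLM proposal call, the atomic-question retriever, and the LLM selection call:

```python
def decomposition_round(propose, search, select, question, context):
    """One simplified round of knowledge-aware task decomposition."""
    # 1) Propose several atomic queries instead of one deterministic follow-up,
    #    so an ambiguous question ("What Women Love" vs. "What Women Want")
    #    is not prematurely 'corrected' down a single wrong path.
    atomic_queries = propose(question, context)
    # 2) Retrieve candidate atomic questions (paired with their source chunks).
    candidates = [hit for query in atomic_queries for hit in search(query)]
    # 3) Atomic selection: pick the pertinent atomic question and keep its
    #    whole source chunk as context, not just an intermediate answer.
    chosen_question, chunk = select(question, candidates)
    return chosen_question, chunk
```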
Figure 16: Case (a): Given the lesser-known film "What Women Love" as opposed to the more popular "What Women Want," single-path methods like Self-Ask on the left are predisposed to generating follow-up questions about the latter, leading to an incorrect final answer. Conversely, PIKE-RAG can effectively discern the intended meaning of the original question by positing several atomic queries and postpone the task understanding to atomic selection phase with relevant atomic questions provided, and subsequently arriving at an accurate conclusion.
图 16: 案例 (a): 相较于更知名的电影《What Women Want》,给定较为冷门的电影《What Women Love》时,左侧的 Self-Ask 等单路径方法倾向于生成关于后者的后续问题,从而导致最终答案错误。相反,PIKE-RAG 能够通过提出多个原子查询有效辨别原始问题的意图,并将任务理解推迟到原子选择阶段,提供相关的原子问题,从而得出准确的结论。
Furthermore, the discrepancy between the formulation of the corpus and the query is another critical factor advocating for a multi-query approach over a singular deterministic one. The presentation gap can impede the retrieval process even when the generated follow-up question is semantically accurate. For instance, as illustrated in Case (b) in Figure 17, a single-path method such as Self-Ask on the left side might directly inquire 'Who is the mother of Oskar Roehler?' However, the knowledge base articulates familial relationships using a different schema, 'A is the son of B and C' in this case, thus the retrieval process falters despite the correctness of the question. Even when we applied hierarchical retrieval to Self-Ask, the resulting Self-Ask with Hierarchical Retrieval did not succeed in bridging this gap. In contrast, our approach, which generates multiple atomic queries, encompasses a broader range of phrasings that correspond to the diverse representations in the knowledge base. In the depicted case, while the atomic query specifically asking for Oskar Roehler's mother encounters the same retrieval issue, an alternative query seeking information about his parents successfully retrieves the target chunk. This exemplifies how our method's flexibility in query generation enhances the likelihood of aligning with the knowledge base's structure and obtaining accurate information.
此外,语料库的构建与查询之间的差异是支持多查询方法而非单一确定性方法的另一个关键因素。即使生成的后续问题在语义上是准确的,表述差异仍可能阻碍检索过程。例如,如图17中的案例(b)所示,左侧的单一路径方法(如Self-Ask)可能会直接询问"Oskar Roehler的母亲是谁?",但知识库使用了不同的模式来表达家庭关系,即"A是B和C的儿子",在这种情况下,尽管问题正确,检索过程仍会失败。即使我们对Self-Ask应用了分层检索,Self-Ask with Hierarchical Retrieval也未能成功弥合这一差距。相比之下,我们的方法生成多个原子查询,涵盖了与知识库中多样表示相对应的更广泛表述。在所示案例中,虽然专门询问Oskar Roehler母亲的原子查询遇到了相同的检索问题,但另一个询问其父母信息的查询成功检索到了目标数据块。这展示了我们方法在查询生成中的灵活性如何提高与知识库结构对齐并获得准确信息的可能性。
Figure 17: Case (b): By proposing multiple atomic queries, PIKE-RAG effectively retrieves the relevant knowledge chunk, whereas the single deterministic follow-up question approach employed by Self-Ask fails to align with the knowledge base’s schema, resulting in a retrieval failure.
图 17: 案例 (b): 通过提出多个原子查询,PIKE-RAG 有效地检索到了相关的知识块,而 Self-Ask 使用的单一确定性后续问题方法未能与知识库的模式对齐,导致检索失败。
Our methodology emphasizes the retrieval of atomic questions rather than directly retrieving chunks. This design choice is exemplified in Case (b) depicted in Figure 17. The knowledge chunk in the corpus is structured using the pattern 'A ... as the son of B and C', which poses challenges for direct retrieval by queries such as 'Who is the mother of ...'. In our specialized knowledge base, such direct queries tend to retrieve chunks conforming to the patterns 'A is the mother of B' or 'A is the father of B'. By utilizing atomic questions as intermediaries for retrieval, our approach effectively narrows the gap between a single query and the multiple sentence structures found in the knowledge base. It facilitates bridging the expression pattern differences exemplified by 'the mother of' versus 'the son of' in this scenario.
我们的方法强调检索原子问题,而不是直接检索知识块。这一设计选择在图17中的案例(b)中得到了体现。语料库中的知识块使用模式"A ...作为B和C的儿子"进行结构化,这给诸如"...的母亲是谁"等查询的直接检索带来了挑战。在我们专门的知识库中,此类直接查询往往会检索符合模式"A是B的母亲"或"A是B的父亲"的块。通过利用原子问题作为检索的中间步骤,我们的方法有效地缩小了单个查询与知识库中多个句子结构之间的差距。它有助于在这种场景下弥合"母亲"与"儿子"之间表达模式的差异。
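This indexing choice can be sketched as follows, with `atomize` standing in for the LLM call that generates atomic questions from a chunk:

```python
def build_atomic_index(chunks, atomize):
    """Index atomic questions rather than raw chunks: each generated atomic
    question maps back to its source chunk, so a query phrased as 'Who is the
    mother of A?' can match an atomic question such as 'Who are the parents
    of A?' instead of the chunk's own 'A is the son of B and C' phrasing."""
    index = {}
    for chunk in chunks:
        for question in atomize(chunk):
            index.setdefault(question, chunk)
    return index
```

At query time, the retriever embeds the atomic query, searches over the question keys, and returns the chunk each matched question maps to.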
In contrast to methods like Self-Ask, which only retains intermediate answers for subsequent processing, our method preserves the entire chunk as contextual information. During the atomic selection phase, we present a list of atomic questions as candidate summaries of the relevant content from the original chunk. This strategy significantly reduces token usage and simplifies the process of selecting the pertinent information. Case (c) in Figure 18 demonstrates the dual benefits of our approach: first, by selecting from a curated list of atomic questions, we streamline the identification of relevant information; second, by retaining the entire selected chunk rather than just the intermediate answer, we ensure a rich context is maintained for accurate and comprehensive subsequent processing. While the Self-Ask method on the left retrieves the target chunk, it fails to correctly identify the pertinent ’Ernie Watts’ due to the excessive contextual information. Since retrieved chunks in Self-Ask are discarded after generating an intermediate answer, the method potentially follows an incorrect pathway, leading to an inaccurate conclusion. In contrast, our approach can efficiently filter and select the appropriate atomic question from a concise list. Although the atomic question in this round pertains to the role of Ernie Watts, there is no need to inquire further about his birthplace, as this information is encapsulated within the selected chunk, which remains available for context in subsequent rounds.
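The selection loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: `select_atomic_question` is a word-overlap heuristic standing in for the LLM that PIKE-RAG would actually prompt with the candidate list, and the index entries are invented examples echoing the Ernie Watts case.

```python
def select_atomic_question(query, candidates):
    # Stand-in for an LLM selector: pick the candidate atomic question
    # sharing the most words with the query. PIKE-RAG would instead
    # prompt an LLM with this short candidate list.
    def overlap(c):
        return len(set(query.lower().split()) & set(c.lower().split()))
    return max(candidates, key=overlap)

def gather_context(query, atomic_index, max_rounds=2):
    # Each round selects one atomic question from a concise list, then
    # keeps the WHOLE source chunk (not just an intermediate answer)
    # as context for later rounds.
    context = []
    candidates = list(atomic_index)
    for _ in range(max_rounds):
        if not candidates:
            break
        chosen = select_atomic_question(query, candidates)
        context.append(atomic_index[chosen])  # full chunk retained
        candidates.remove(chosen)
    return context

# Hypothetical index for illustration only.
index = {
    "What role did Ernie Watts play?":
        "Ernie Watts, a saxophonist born in Norfolk, played on the album.",
    "Who produced the album?":
        "The album was produced in 1985 by a well-known studio team.",
}
ctx = gather_context("What role did Ernie Watts play?", index, max_rounds=1)
```

Because the full chunk about Ernie Watts is kept in `ctx`, a later question about his birthplace can be answered from the accumulated context without issuing another retrieval round.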

Figure 18: Case (c): PIKE-RAG’s benefits from leveraging a concise list of atomic questions for targeted selection and retaining full chunks for rich contextual support. Conversely, Self-Ask’s approach, although successful in retrieving relevant chunks, is compromised by its dependency on intermediate answers for context, which ultimately results in the generation of incorrect final answers.
7 Conclusion
To address the diverse challenges faced by RAG systems in industrial applications, we propose that the core foundation of RAG systems should extend beyond traditional retrieval mechanisms to the effective construction and utilization of specialized knowledge and rationale. We therefore introduce a new paradigm that classifies tasks by their difficulty in knowledge extraction, comprehension, and utilization, providing a novel framework for system design and evaluation. Applying this paradigm allows for phased exploration of RAG capabilities, facilitating the progressive refinement of RAG algorithms and the staged implementation of RAG applications. Moreover, we introduce the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, focusing on specialized knowledge extraction and rationale construction. PIKE-RAG effectively extracts, comprehends, and organizes specialized knowledge and constructs coherent rationale for accurate answers, offering customizable system capabilities to meet varying requirements. Additionally, we propose knowledge atomizing and knowledge-aware task decomposition to tackle complex questions, such as multi-hop queries, achieving significant performance improvements on various open-domain and legal benchmarks.
References
