PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation
Abstract
Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. Reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning over specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems' problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from data chunks and iteratively construct the rationale based on the original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks. The code is publicly available at https://github.com/microsoft/PIKE-RAG.
1 Introduction
Large Language Models (LLMs) have revolutionized the field of natural language processing by demonstrating the capability to generate coherent and contextually relevant text. These advanced models are trained on expansive corpora, equipping them with the versatility to execute a diverse spectrum of linguistic tasks, ranging from text completion to translation and summarization [5, 9, 50, 6]. Despite their broad capabilities, LLMs exhibit pronounced limitations when tasked with specialized queries in professional domains [38, 54], a demand that is particularly acute in industrial applications. This primarily stems from the scarcity of domain-specific training material and a limited grasp of specialized knowledge and rationale within these domains. As a result, LLMs may produce responses that are not only potentially erroneous but also lack the detail and precision required for expert-level engagement [11]. Beyond the limitations in domain-specific tasks, another striking issue with LLMs is the phenomenon known as "hallucination", where the model generates information that is not grounded in reality or factual data [10, 57]. Moreover, the knowledge base of LLMs, being static and crystallized at the point of their last update, introduces temporal stasis [13]. Further compounding these challenges is the issue of long-context comprehension [37]. Existing LLMs struggle to maintain an understanding of task definitions across long contexts, and their performance tends to deteriorate significantly when confronted with more complex and demanding tasks.
To address the inherent limitations of LLMs, Retrieval-Augmented Generation (RAG) [35] has been proposed, which merges the generative capabilities of LLMs with a retrieval mechanism, allowing the incorporation of relevant external information to anchor the generated text in factual data. This integrated strategy improves both the accuracy and reliability of the generated content, providing a promising pathway for the practical deployment of LLMs in industrial applications. However, current RAG methods remain heavily reliant on text retrieval and the comprehension capabilities of LLMs, with little attention paid to extracting, understanding, and utilizing knowledge from diverse source data. In industrial applications requiring expertise, such as specialized knowledge and problem-solving rationale, existing RAG approaches, primarily designed for research benchmarks, demonstrate significant limitations. There is a lack of clarity regarding the challenges that RAG encounters in industrial applications. Gaining a comprehensive insight into these challenges is crucial for the development of RAG algorithms. Therefore, we summarize the main challenges as follows.
• Knowledge source diversity: RAG systems are constructed upon a diverse corpus of source documents collected over many years from various domains, encompassing a wide range of file formats like scanned images, digital text files, and web data, sometimes accompanied by specialized databases. In contrast, widely-used datasets [28, 60, 51] typically feature pre-segmented, simplified corpora that do not capture the complexity of real-world data. Existing methods designed for these benchmarks struggle to efficiently extract specialized knowledge and uncover underlying rationales from diverse sources, particularly in industrial applications. For example, an LED product datasheet typically comprises specifications such as performance characteristics presented in complex tables, electrical properties depicted in charts, and installation instructions illustrated with figures. Addressing queries that involve such non-textual knowledge presents significant challenges for existing RAG approaches.
• Domain specialization deficit: In industrial applications, RAG systems are expected to leverage the specialized knowledge and rationale of professional fields. However, this specialized knowledge is characterized by domain-specific terminologies, expertise, and distinctive logical frameworks that are integral to its functioning. RAG approaches built on common knowledge-centric datasets demonstrate unsatisfactory performance when applied to professional fields, as LLMs exhibit deficiencies in extracting, understanding, and organizing domain-specific knowledge and rationale [38]. For example, in the field of semiconductor design, research relies heavily on a deep understanding of underlying physical properties. When LLMs are utilized to extract and organize the specialized knowledge and rationale from research documents, they often fail to properly capture essential physical principles and achieve a comprehensive understanding due to their inherent limitations. Consequently, RAG systems frequently produce incomplete or inaccurate interpretations of critical problem elements and generate responses that lack proper rationale grounded in physical principles. Moreover, assessing the quality of professional content generation poses a significant challenge. This issue not only impedes the development and optimization of RAG algorithms but also complicates their practical deployment across various industrial applications.
• One-size-fits-all: Various RAG application scenarios, although based on a similar framework, present different challenges that require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale. The complexity and focus of questions vary across these scenarios, and within a single scenario, the difficulty can also differ. For example, in rule-based query scenarios, such as determining the legal conditions for mailing items, RAG systems primarily focus on retrieving relevant factual rules by bridging the semantic gap between the query and the rules. In multi-hop query scenarios, such as comparing products across multiple aspects, RAG systems emphasize extracting information from diverse sources and performing multi-hop reasoning to arrive at accurate answers. Most existing RAG approaches [62] adopt a one-size-fits-all strategy, failing to account for the varying complexities and specific demands both within and across scenarios. This results in solutions that do not meet the comprehensive accuracy standards required for practical applications, thereby limiting the development and integration of RAG systems in real-world environments.
We believe that the key to addressing these challenges lies in advancing beyond traditional retrieval augmentation: effectively extracting, understanding, and applying specialized knowledge, and developing appropriate reasoning logic tailored to the specific tasks and the knowledge involved. We refer to this approach as sPecIalized KnowledgE and Rationale Augmentation. Given that various tasks require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale, we summarize and categorize the questions commonly encountered into four types with respect to their difficulty: factual questions, linkable-reasoning questions, predictive questions, and creative questions. Accordingly, we propose a classification of RAG system capability levels, aligned with the system's ability to solve these different types of problems. This classification serves as a guideline for systematically advancing the system's capabilities in a controllable and measurable manner.
Furthermore, we propose the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, which not only supports phased system development and deployment, demonstrating excellent versatility, but also enhances capabilities by effectively leveraging specialized knowledge and rationale. Within this framework, knowledge extraction components are employed to extract specialized knowledge from diverse source data, laying a robust foundation for knowledge-based retrieval and reasoning. Additionally, a task decomposer is utilized to dynamically manage the routing of retrieval and reasoning operations, creating specialized rationale based on available knowledge. PIKE-RAG enables a phased exploration of RAG capabilities, which facilitates the progressive refinement of RAG algorithms and the staged implementation of RAG applications. For each development phase, the RAG framework and its modules are tailored to address specific challenges. For example, in the knowledge base construction phase, a multi-layer heterogeneous graph is employed to effectively represent relationships between various components of the data, enhancing knowledge organization and integration. The RAG system designed for factual questions introduces multi-granularity retrieval, allowing for multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph to improve factual retrieval accuracy. In the advanced RAG system, aimed at addressing complex queries, knowledge atomizing is introduced to fully explore the intrinsic knowledge within data chunks, while knowledge-aware task decomposition manages the retrieval and organization of multiple pieces of atomic knowledge to construct a coherent rationale.
Extensive experiments are conducted to evaluate the performance of the proposed PIKE-RAG framework on both open-domain and legal benchmarks, and the experimental results demonstrate the effectiveness of PIKE-RAG. Our framework and staged development strategy could further advance the current research and application of RAG in industrial contexts. In summary, the contributions of this work are as follows:

• A task classification paradigm that categorizes questions by the complexity of knowledge extraction and application, together with a corresponding classification of RAG system capability levels that supports systematic evaluation and phased development.
• The PIKE-RAG framework, which extracts, understands, and applies specialized knowledge while constructing coherent rationale, built on a multi-layer heterogeneous graph knowledge base.
• Knowledge atomizing and knowledge-aware task decomposition, which extract multifaceted knowledge from data chunks and iteratively construct rationale from the original query and accumulated knowledge, achieving strong performance across various benchmarks.
2 Related work
2.1 RAG
Retrieval-Augmented Generation (RAG) has emerged as a promising solution that effectively incorporates external knowledge to enhance response generation. Initially, retrieval-augmented techniques were introduced to improve the performance of pre-trained language models on knowledge-intensive tasks [35, 29, 12]. With the rapid rise of Large Language Models [5, 9, 50, 6], most research in the RAG paradigm has shifted towards a framework that first retrieves pertinent information from external data sources and subsequently integrates it into the context of the query prompt as supplementary knowledge for contextually relevant generation [46]. Following this framework, the naive RAG research paradigm [25] converts raw data into uniform plain text and segments it into smaller chunks, which are encoded into vector space for query-based retrieval. The top-k relevant chunks are used to expand the context of the prompt for generation. To enhance the retrieval quality of naive RAG, advanced RAG approaches implement specific enhancements across the pre-retrieval, retrieval, and post-retrieval processes, including query optimization [39, 63], multi-granularity chunking [16, 65], mixed retrieval, and chunk re-ranking.
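To make the naive paradigm concrete, the sketch below walks through chunking, embedding-based retrieval, and context expansion. It is a minimal illustration, not any particular system's implementation: `embed` is a toy stand-in for a real embedding model, and `llm` is a placeholder generation callback.

```python
from typing import Callable, List
import math

# Placeholder embedder: a real system would call an embedding model here.
def embed(text: str) -> List[float]:
    # Toy bag-of-characters embedding, for illustration only.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def naive_rag(query: str, corpus: str, llm: Callable[[str], str],
              chunk_size: int = 200, top_k: int = 3) -> str:
    # 1) Convert raw data to plain text and segment into fixed-size chunks.
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    # 2) Encode chunks into vector space and rank by similarity to the query.
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    # 3) Expand the prompt context with the top-k chunks and generate.
    context = "\n\n".join(ranked[:top_k])
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```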
Beyond the aforementioned RAG paradigms, numerous sophisticated enhancements in RAG pipelines and system modules are introduced within modular RAG systems [26], aiming to improve system capability and versatility. These advancements have enabled the processing of a wider variety of source data, facilitating the transformation of raw information into structured data and, ultimately, into valuable knowledge [56, 20]. Furthermore, the indexing and retrieval modules have been refined with multi-granularity and multi-architecture approaches [58, 65]. Various pre-retrieval [24, 64] and post-retrieval [18, 30] functions are proposed to enhance both the retrieval effectiveness and the quality of subsequent generation. It has been recognized that naïve RAG systems are insufficient to tackle complex tasks such as summarization [27] and multi-hop reasoning [51, 28]. Consequently, most recent research focuses on developing advanced coordination schemes that leverage existing modules to collaboratively address these challenges. ITER-RETGEN [48] and DSP [33] employ retrieve-read iteration to leverage the generated response as the context for the next round of retrieval. FLARE [31] proposes a confidence-based active retrieval mechanism that dynamically adjusts the query with respect to low-confidence tokens in the regenerated sentences. These loop-based RAG pipelines progressively converge towards the correct answer and provide enhanced flexibility to RAG systems in addressing diverse requirements.
2.2 Knowledge bases for RAG
In naïve RAG approaches, source data is converted to plain text and chunked for retrieval. However, as RAG applications expand and demand for diversity grows, plain text-based retrieval becomes insufficient for several reasons: (1) textual information is generally redundant and noisy, leading to decreased retrieval quality; (2) complex problems require the integration of multiple data sources, and plain text alone cannot adequately represent the intricate relationships between objects. As a result, researchers are exploring diverse data sources to enrich the corpus, incorporating search engines [59, 53], databases [55, 41, 47], knowledge graphs [49, 56], and multimodal corpora [17, 15]. Concurrently, there is an emphasis on developing efficient knowledge representations for the corpus to enhance knowledge retrieval. A graph is regarded as a powerful knowledge representation because of its capacity to intuitively model complex relationships. GraphRAG [20] combines knowledge graph generation and query-focused summarization with RAG to address both local and global questions. HOLMES [42] constructs hyper-relational KGs and prunes them to distilled graphs, which serve as input to LLMs for multi-hop question answering. However, the construction of knowledge graphs is extremely resource-intensive, and the associated costs scale up with the size of the corpus.
2.3 Multi-hop QA
Multi-hop Question Answering (MHQA) [60] involves answering questions that require reasoning over multiple pieces of information, often scattered across different documents or paragraphs. This task presents unique challenges, as it necessitates not only retrieving relevant information but also effectively combining and reasoning over the retrieved pieces to arrive at a correct answer. Traditional graph-based methods in MHQA solve the problem by building graphs and inferring over graph neural networks (GNNs) to predict answers [44, 21]. With the advent of LLMs, recent graph-based methods [36, 42] have evolved to construct knowledge graphs for retrieval and generate responses through LLMs. Another branch of methods dynamically converts multi-hop questions into a series of sub-queries by generating subsequent questions based on the answers to previous ones [52, 33, 23]. The sub-queries guide the sequential retrieval, and the retrieved results in turn are used to improve reasoning. Treating MHQA as a supervised problem, Self-RAG [61] trains an LM to learn to retrieve, generate, and critique text passages, and Beam Retrieval [7] models the multi-hop retrieval process in an end-to-end manner by jointly optimizing an encoder and classification heads across all hops. Self-Ask [43] improves Chain-of-Thought (CoT) prompting by explicitly asking follow-up questions before answering the initial question. This method enables the automatic decomposition of questions and can be seamlessly integrated with retrieval mechanisms to tackle multi-hop question answering.
3 Problem formulation
Existing research mainly concentrates on algorithmic enhancements to improve the performance of RAG systems. However, there is limited effort in providing a comprehensive and systematic discussion of the RAG framework. In this work, we conceptualize the RAG framework from three key perspectives: knowledge base, task classification, and system development. We assert that the knowledge base serves as the fundamental cornerstone of RAG, underpinning all retrieval and generation processes. Furthermore, we recognize that RAG tasks can vary significantly in complexity and difficulty, depending on the required generation capabilities and the availability of supporting corpora. By categorizing tasks according to their difficulty levels, we classify RAG systems into distinct levels based on their problem-solving capabilities across the different types of questions.
3.1 Knowledge base
In industrial applications, specialized knowledge primarily originates from years of accumulated data within specific fields such as manufacturing, energy, and logistics. For example, in the pharmaceutical industry, data sources include extensive research and development documentation, as well as drug application files amassed over many years. These sources are not only diverse in file formats, but also encompass a significant amount of multi-modal content such as tables, charts, and figures, which are also crucial for problem-solving. Furthermore, there are often functional connections between files within a specialized domain, such as hyperlinks, references, and relational database links, which explicitly or implicitly reflect the logical organization of knowledge within the professional field. Currently, existing datasets provide pre-segmented corpora and do not account for the complexities encountered in real-world applications, such as the integration of multi-format data and the maintenance of referential relationships between documents. Therefore, the construction of a comprehensive knowledge base is foundational for Retrieval-Augmented Generation (RAG) in the industrial field. As the architecture and quality of the knowledge base directly influence the retrieval methods and their performance, we propose structuring the knowledge base as a multi-layer heterogeneous graph, denoted as $G$, with corresponding nodes and edges represented by $(V, E)$. The graph nodes can include documents, sections, chunks, figures, tables, and customized nodes from distilled knowledge. The edges signify the relationships among these nodes, encapsulating the interconnections and dependencies within the graph. This multi-layer heterogeneous graph encompasses three distinct layers: the information resource layer $G_{i}$, the corpus layer $G_{c}$, and the distilled knowledge layer $G_{dk}$. Each layer corresponds to a different stage of information processing, representing varying levels of granularity and abstraction in knowledge.
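As a rough illustration of this structure, the following sketch builds a tiny three-layer graph with networkx. The node identifiers, attribute keys, and relation names are our own illustrative choices, not a schema prescribed by the paper.

```python
import networkx as nx

# A minimal sketch of the multi-layer heterogeneous graph G = (V, E).
G = nx.MultiDiGraph()

# Information resource layer G_i: source files and their referential links.
G.add_node("doc:drug_application.pdf", layer="information_resource", type="document")
G.add_node("doc:research_report.pdf", layer="information_resource", type="document")
G.add_edge("doc:drug_application.pdf", "doc:research_report.pdf", relation="references")

# Corpus layer G_c: sections, chunks, and summarized multi-modal content.
G.add_node("sec:1", layer="corpus", type="section")
G.add_node("chunk:1.1", layer="corpus", type="chunk", text="...")
G.add_node("table:1", layer="corpus", type="table", summary="LLM-written summary")
G.add_edge("doc:drug_application.pdf", "sec:1", relation="contains")
G.add_edge("sec:1", "chunk:1.1", relation="contains")
G.add_edge("sec:1", "table:1", relation="contains")

# Distilled knowledge layer G_dk: structured knowledge extracted from chunks.
G.add_node("entity:biosimilar_X", layer="distilled_knowledge", type="entity")
G.add_edge("chunk:1.1", "entity:biosimilar_X", relation="mentions")

# Retrieval can then filter nodes by layer and granularity, e.g.:
chunks = [n for n, d in G.nodes(data=True) if d.get("type") == "chunk"]
```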
3.2 Task classification
Contemporary RAG frameworks frequently overlook the intricate difficulty and logistical demands inherent to diverse tasks, typically employing a one-size-fits-all methodology. However, even with comprehensive knowledge retrieval, current RAG systems are insufficient to handle tasks of varying difficulty with equal effectiveness. Therefore, it is essential to categorize tasks and analyze the typical strategies for overcoming the challenges inherent to each category. The difficulty of a task is closely associated with several critical factors.
Figure 1: Illustrative examples of distinct question types (panels shown include Factual Questions and Linkable & Reasoning Questions).
• Effectiveness of Knowledge Utilization: The sophistication involved in applying the extracted knowledge to formulate responses, including synthesizing, organizing, and generating insights or predictions.
In categorizing real-world RAG tasks within industries, we focus on the processes of knowledge extraction, understanding, organization, and utilization to provide structured and insightful responses. Taking the aforementioned factors into account, we identify four distinct classes of questions that address a broad spectrum of demands. The first type, Factual Questions, involves extracting specific, explicit information directly from the corpus, relying on retrieval mechanisms to identify the relevant facts. Linkable-Reasoning Questions demand a deeper level of knowledge integration, often requiring multi-step reasoning and linking across multiple sources. Predictive Questions extend beyond the available data, requiring inductive reasoning and structuring of retrieved facts into analyzable forms, such as time series, for future-oriented predictions. Finally, Creative Questions engage domainspecific logic and creative problem-solving, encouraging the generation of innovative solutions by synthesizing knowledge and identifying patterns or influencing factors. This categorization, driven by varying levels of reasoning and knowledge management, ensures a comprehensive approach to addressing industry-specific queries.
The criteria defining each category are elaborated in the following sections, with representative examples for each provided in Figure 1. For each question type, we also present the associated support data and the expected reasoning processes to illustrate the differences between these categories. These inquiries are formulated by experts in pharmaceutical applications, based on the data released by the FDA.
Factual Questions These questions seek specific, concrete pieces of information explicitly presented in the original corpus. The referenced text can be processed within the context of a conversation in LLMs. As shown in Figure 1, this class of questions can be effectively answered if the relevant fact is successfully retrieved.
Linkable-Reasoning Questions Answering these questions necessitates gathering pertinent information from diverse sources and/or executing multi-step reasoning. The answers may be implicitly distributed across multiple texts. Due to variations in the linking and reasoning processes, we further divide this category into four subcategories: bridging questions, comparative questions, quantitative questions, and summarizing questions. Examples of each subcategory are illustrated in Figure 1. Specifically, bridging questions involve sequentially bridging multiple entities to derive the answer. Quantitative questions require statistical analysis based on the retrieved data. Comparative questions focus on comparing specified attributes of two entities. Summarizing questions require condensing or synthesizing information from multiple sources or large volumes of text into a concise, coherent summary, and they often involve integrating key points, identifying main themes, or drawing conclusions based on the aggregated content. Summarizing questions may combine elements of other question types, such as bridging, comparative, or quantitative questions, as they frequently require the extraction and integration of diverse pieces of information to generate a comprehensive and meaningful summary. Given that these questions require multi-step retrieval and reasoning, it is crucial to establish a reasonable operation route for answer-seeking in interaction with the knowledge base.
Predictive Questions For this type of question, the answers are not directly available in the original text and may not be purely factual, necessitating inductive reasoning and prediction based on existing facts. To harness the predictive capabilities of LLMs or other external prediction tools, it is essential to gather and organize relevant knowledge to generate structured data for further analysis. For instance, as illustrated in Figure 1, all biosimilar products and their approval dates are retrieved, and the total number of approvals for each year is calculated and organized into year-indexed time series data for prediction purposes. Furthermore, it is important to note that the correct answer to predictive questions may not be unique, reflecting the inherent uncertainty and variability in predictive tasks.
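A minimal sketch of this organize-then-predict step is shown below. The retrieved records and the linear-trend forecast are purely illustrative, standing in for real retrieval output and an external prediction tool.

```python
from collections import Counter

# Hypothetical retrieved records: (product, approval_year) pairs gathered
# by the retrieval step; the values here are made up for illustration.
retrieved = [("biosimilar_A", 2019), ("biosimilar_B", 2019),
             ("biosimilar_C", 2020), ("biosimilar_D", 2021),
             ("biosimilar_E", 2021), ("biosimilar_F", 2021)]

# Organize the facts into a year-indexed time series of approval counts.
counts = Counter(year for _, year in retrieved)
years = sorted(counts)
series = [counts[y] for y in years]   # e.g. [2, 1, 3]

# A deliberately simple forecast: extrapolate the average year-over-year
# change. A real system might hand `series` to an external prediction tool.
deltas = [b - a for a, b in zip(series, series[1:])]
trend = sum(deltas) / len(deltas) if deltas else 0.0
forecast_next_year = series[-1] + trend
print(years[-1] + 1, round(forecast_next_year, 1))
```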
Creative Questions One significant demand of RAG is to mine valuable domain-specific logic from professional knowledge bases and introduce novel perspectives that can innovate and advance existing solutions. Addressing creative questions necessitates creative thinking based on the availability of factual information and an understanding of the underlying principles and rules. As illustrated in the example, it is essential to organize the extracted information to highlight key stages and their duration, and then identify common patterns and influential factors. Subsequently, solutions are developed with the objective of evaluating potential outcomes and stimulating fresh ideas. The goal of these responses is to inspire experts to generate innovative ideas, rather than to provide ready-to-implement solutions.
It is crucial to recognize that the classification of a question may shift with changes in the knowledge base. Questions Q1, Q2, and Q3 in Figure 1, although seemingly similar, are categorized differently depending on the availability of information and the logical steps required to derive an answer. For instance, Q1 is classified as a factual question because it can be directly answered using a table that concisely lists all biosimilar products along with their respective approval dates, providing sufficient explicit information. In contrast, Q2, which inquires about the total count of interchangeable biosimilar products, cannot be resolved by directly referencing a single explicit source. To answer Q2, one must identify all the products meeting the specified criteria and subsequently calculate the total, necessitating an additional step of statistical aggregation. Therefore, Q2 is categorized as a linkable-reasoning question due to the need for intermediate processing. Finally, Q3 poses a challenge because the answer does not explicitly exist within the knowledge base. Addressing Q3 requires gathering relevant data, organizing it to infer hidden patterns, and making predictions based on these inferred rules. As a result, Q3 is categorized as a predictive question, indicating the requirement to extrapolate beyond the existing data to forecast potential outcomes or trends.
Table 1: Level definition based on RAG system's capability

Level | System Capability Description
---|---
L1 | An L1 system aims to provide accurate and reliable answers to factual questions, ensuring a solid foundation for basic information retrieval.
L2 | An L2 system extends its functionality to include accurate and reliable responses to both factual questions and linkable-reasoning questions, supporting more complex multi-step retrieval and reasoning tasks.
L3 | An L3 system further enhances its capabilities with rational predictions for predictive questions, while maintaining accuracy and reliability for factual and linkable-reasoning questions.
L4 | An L4 system can propose well-reasoned plans or solutions for creative questions. It also retains the ability to make rational predictions for predictive questions and to answer factual and linkable-reasoning questions accurately and reliably.
3.3 RAG system level
In industrial RAG systems, inquiries encompass a broad spectrum of difficulties and are approached from diverse perspectives. Although RAG systems can leverage the general question-answering (QA) abilities of LLMs, their limited comprehension of expert-level knowledge often leads to inconsistent response quality across questions of varying complexity. In response to this status quo, we propose categorizing RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions outlined in the previous subsection. This stratified approach facilitates the phased development of RAG systems, allowing capabilities to be incrementally enhanced through iterative module refinement and algorithmic optimization. Our framework is strategically designed to provide a standardized, objective methodology for developing RAG systems that effectively meet the specialized needs of various industry scenarios. The definitions of RAG systems at different levels are presented in Table 1. It highlights the systems' capabilities to handle increasingly complex queries, demonstrating the evolution from simple information retrieval to advanced predictive and creative problem-solving. Each level represents a step towards more sophisticated interactions with knowledge bases, requiring the RAG systems to demonstrate higher levels of understanding, reasoning, and innovation.
More specifically, at the foundational level, RAG systems respond to factual questions with answers that are directly extractable from provided texts. Advancing to the second level, RAG systems are equipped to handle complex questions involving linkage and reasoning. These queries necessitate the synthesis of information from disparate sources or multi-step reasoning processes. The RAG system can address a variety of composite questions, including bridging questions that necessitate a sequence of logical reasoning, comparative questions demanding parallel analysis, and summarizing questions that involve condensing information into comprehensive responses. At the third level, the systems are intricately designed to tackle predictive questions where answers are not immediately discernible from the original text. Finally, RAG systems at the fourth level demonstrate the capacity for creative problem-solving, utilizing a solid factual base to foster novel concepts or strategies. While these systems may not offer ready-to-implement solutions, they play a crucial role in stimulating expert creativity to advance fields such as analytics or treatment design.
4 Methodology
4.1 Framework
Based on the formulation of RAG systems in terms of knowledge base, task classification, and system-level division, we propose a versatile and expandable RAG framework. Within this framework, progression in the levels of RAG systems can be achieved by adjusting submodules within the main modules. The overview of our framework is depicted in Figure 2. The framework primarily consists of several fundamental modules, including file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. In this framework, domain-specific documents of diverse formats are processed by the file parsing module to convert the files to machine-readable formats, and file units are generated to build up the graph in the information source layer. The knowledge extraction module chunks the text and generates corpus and knowledge units to construct the graphs in the corpus layer and distilled knowledge layer. The heterogeneous graph thus established is utilized as the knowledge base for retrieval. Extracted knowledge is stored in multiple structured formats, and the knowledge retrieval module employs a hybrid retrieval strategy to access relevant information. Note that the knowledge base not only serves as the source of knowledge gathering but also benefits from a feedback loop, where organized and verified knowledge is regarded as feedback to refine and improve the knowledge base.
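The module boundaries described above can be pictured as the following interface sketch. The class and method names are our own illustration and do not correspond to the actual PIKE-RAG codebase.

```python
from dataclasses import dataclass, field
from typing import List, Protocol

@dataclass
class Chunk:
    text: str
    tags: List[str] = field(default_factory=list)

class FileParser(Protocol):
    def parse(self, path: str) -> str:
        """Convert a source file into machine-readable text."""

class KnowledgeExtractor(Protocol):
    def extract(self, text: str) -> List[Chunk]:
        """Chunk text and generate corpus/knowledge units."""

class KnowledgeRetriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[Chunk]:
        """Hybrid retrieval over the knowledge base."""

class KnowledgeOrganizer(Protocol):
    def organize(self, chunks: List[Chunk]) -> str:
        """Process retrieved chunks into organized context."""

class Reasoner(Protocol):
    def reason(self, query: str, context: str) -> str:
        """Knowledge-centric reasoning over the organized context."""
```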
Figure 2: Overview of the PIKE-RAG framework, comprising several key components: file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, task decomposition and coordination, and knowledge-centric reasoning. Each component can be tailored to meet the evolving demands of system capability.
As highlighted in the task classification examples, questions of different classes require distinct rationale routing for answer-seeking, influenced by multiple factors such as the availability of relevant information, the complexity of knowledge extraction, and the sophistication of reasoning. It is challenging to address these questions in a single retrieval and generation pass. To tackle this, we propose an iterative retrieval-generation mechanism supervised by task decomposition and coordination. This iterative mechanism enables the gradual collection of relevant information and progressive reasoning over incremental context, ensuring a more accurate and comprehensive response. More specifically, the questions in industrial applications are fed into the task decomposition module to produce a preliminary decomposition scheme. This scheme outlines the retrieval steps, reasoning steps, and other necessary operations. Following these instructions, the knowledge retrieval module retrieves relevant information, which is then passed to the knowledge organization module for processing and organization. The organized knowledge is used to perform knowledge-centric reasoning, yielding an intermediate answer. With the updated relevant information and intermediate answer, the task decomposition module regenerates an updated scheme for the next iteration. This design boasts excellent adaptability, allowing us to tackle problems of varying difficulties and perspectives by adjusting the modules and iterative mechanisms.
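The iterative retrieval-generation mechanism can be sketched as a loop over the module interfaces above; `next_step` is a hypothetical decomposer method, assumed here to return the next retrieval query or `None` once the rationale is complete.

```python
def answer_with_decomposition(question: str, decomposer, retriever,
                              organizer, reasoner, max_iters: int = 5) -> str:
    """Sketch of the iterative retrieve-organize-reason loop driven by
    task decomposition; all collaborators are placeholder modules."""
    gathered, intermediate = [], ""
    for _ in range(max_iters):
        # The decomposer proposes the next retrieval step from the original
        # question, the accumulated knowledge, and the intermediate answer;
        # it returns None once it judges the rationale to be complete.
        step = decomposer.next_step(question, gathered, intermediate)
        if step is None:
            break
        gathered += retriever.retrieve(step, k=5)
        context = organizer.organize(gathered)
        intermediate = reasoner.reason(question, context)
    return intermediate
```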
Table 2: Proposed frameworks for different system levels. To address the challenges faced at each level, we propose customized frameworks based on the framework illustrated in Figure 2. The following abbreviations are used: "PA" for file parsing, "KE" for knowledge extraction, "RT" for knowledge retrieval, "KO" for knowledge organization, and "KR" for knowledge-centric reasoning.
Level | Challenges | Proposed Framework
---|---|---
L0 | Knowledge extraction is challenged by the diverse formats of source documents, requiring sophisticated file parsing techniques. Building a high-quality knowledge base from raw, heterogeneous data introduces significant complexity in knowledge organization and integration. | PA, KE
L1 | Improper chunking disrupts semantic coherence, hindering knowledge understanding and extraction and complicating accurate retrieval. Knowledge retrieval suffers from the limitations of embedding models in aligning professional terminology and aliases, reducing system precision. | PA, KE, RT, KO, KR
L2 | Effective knowledge extraction and utilization are critical, as chunked text typically mixes relevant and irrelevant information, and retrieving high-quality data is essential for accurate generation. Task understanding and decomposition, and the rationale behind them, often ignore the availability of supporting data and over-rely on LLM capabilities. | PA, KE, RT, KO, KR, Task Decomp. & Coord.
L3 | Challenges at this level center on knowledge collection and organization, which are essential for supporting predictive reasoning. LLMs are limited in applying specialized reasoning logic, restricting their effectiveness in predictive tasks. | PA, KE, RT, KO, KR, Task Decomp. & Coord.
L4 | The difficulty lies in extracting coherent logical rationale from a complex knowledge base, where interdependencies among multiple factors can lead to non-unique solutions. The open-ended nature of creative questions complicates the evaluation of reasoning and knowledge synthesis, making answer quality hard to assess quantitatively. | PA, KE, RT, KO, KR, Task Decomp. & Coord., Multi-agent Plan.
4.2 Phased system development
We have categorized RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions, as outlined in Table 1. Recognizing the pivotal role of knowledge base generation in RAG systems, we designate the construction of the knowledge base as the L0 stage of system development. The challenges faced by RAG systems vary across different levels. We analyze these challenges for each level and propose corresponding frameworks in Table 2. This stratified approach facilitates the phased development of RAG systems, enabling incremental enhancement of capabilities through iterative module refinement and algorithmic optimization.
We observe that from L0 to L4, higher-level systems can inherit modules from lower levels and add new modules to enhance system capabilities. For instance, compared to an L1 system, an L2 system not only introduces a task decomposition and coordination module to leverage iterative retrieval-generation routing but also incorporates more advanced knowledge extraction modules, such as distilled knowledge generation, indicated in dark green in Figure 2. In the L3 system, the growing emphasis on predictive questioning necessitates enhanced requirements for knowledge organization and reasoning. Consequently, the knowledge organization module introduces additional submodules for knowledge structuring and knowledge induction, indicated in dark orange. Similarly, the knowledge-centric reasoning module has been expanded to include a forecasting submodule, highlighted in dark purple. In the L4 system, extracting complex rationale from an established knowledge base is highly challenging. To address this, we introduce multi-agent planning module to activate reasoning from diverse perspectives.
Figure 3: Multi-layer heterogeneous graph as the knowledge base. The graph comprises three distinct layers: information resource layer, corpus layer and distilled knowledge layer.
5 Detailed Implementation
In this section, we delve into the implementation specifics of each module within our proposed versatile and expandable RAG framework. By elucidating the details at each level, we aim to provide a comprehensive understanding of how the framework operates and how its modularity and expandability are achieved. The subsections that follow will cover the file parsing, knowledge extraction, knowledge storage, knowledge-centric reasoning, and task decomposition and coordination modules, providing insights into their individual functionalities and interactions.
5.1 Level-0: Knowledge Base Construction
The foundational stage of the proposed RAG systems, designated as the L0 system, focuses on the construction of a robust and comprehensive knowledge base. This stage is critical for enabling effective knowledge retrieval at subsequent levels. The primary objective of the L0 system is to process and structure domain-specific documents, transforming them into a machine-readable format and organizing the extracted knowledge into a heterogeneous graph. This graph serves as the backbone for all higher-level reasoning and retrieval tasks. The L0 system encompasses several key modules: file parsing, knowledge extraction, and knowledge storage. Each of these modules plays a crucial role in ensuring that the knowledge base is both extensive and accurately reflects the underlying information contained within the source documents.
5.1.1 File parsing
The ability to effectively parse and read various types of files is a critical component in the development of RAG systems that rely on diverse data sources. Frameworks such as LangChain provide a comprehensive suite of tools for natural language processing (NLP), including modules for parsing and extracting information from unstructured text documents. Its file reader capabilities are designed to handle a wide range of file formats, ensuring that data from heterogeneous sources can be seamlessly integrated into the system. Additionally, several deep learning-based tools [2, 3] and commercial cloud APIs [1, 4] have been developed to conduct robust Optical Character Recognition (OCR) and accurate table extraction, enabling the conversion of scanned documents and images into structured, machine-readable text. Given that domain-specific files often encompass sophisticated tables, charts, and figures, text-based conversion may lead to information loss and disrupt the inherent logical structure. Therefore, we propose conducting layout analysis for these files and preserving multi-modal elements such as charts and figures. The layout information can aid the chunking operation, maintaining the completeness of chunked text, while figures and charts can be described by Vision-Language Models (VLMs) to assist in knowledge retrieval. This approach ensures that the integrity and richness of the original documents are retained, enhancing the efficacy of RAG systems.
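A layout-aware parsing pass might look like the sketch below; `detect_layout` and `describe_with_vlm` are placeholder stubs standing in for a real layout-analysis model, OCR engine, and Vision-Language Model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayoutElement:
    kind: str       # "text" | "table" | "figure" | "chart"
    content: str    # region content (OCR'd text or an image reference)

# Placeholder components; a real system would plug in a layout-analysis
# model, an OCR engine, and a Vision-Language Model here.
def detect_layout(path: str) -> List[LayoutElement]:
    return [LayoutElement("text", "Electrical characteristics ..."),
            LayoutElement("figure", "img_ref_001")]

def describe_with_vlm(image_ref: str) -> str:
    return f"[VLM description of {image_ref}]"

def parse_file(path: str) -> List[dict]:
    units = []
    for el in detect_layout(path):
        if el.kind == "text":
            # Layout info keeps chunk boundaries aligned with the document
            # structure instead of cutting across sections.
            units.append({"type": "text", "text": el.content})
        else:
            # Preserve multi-modal elements; a VLM description makes them
            # retrievable alongside plain text.
            units.append({"type": el.kind, "ref": el.content,
                          "description": describe_with_vlm(el.content)})
    return units
```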
Figure 4: The process of distilling knowledge from corpus text. The corpus text are processed to extract knowledge units following customized extraction patterns. These knowledge units are then organized to structured knowledge in the distilled knowledge layer, which may take the form of knowledge graphs, atomic knowledge, tabular knowledge, and other induced knowledge.
5.1.2 Knowledge Organization
The proposed knowledge base is structured as a multi-layer heterogeneous graph, representing different levels of information granularity and abstraction. The graph captures relationships between various components of the data (e.g., documents, sections, chunks, figures, and tables) and organizes them into nodes and edges, reflecting their interconnections and dependencies. As depicted in Figure 3, this multi-layer structure, encompassing the information resource layer, corpus layer, and distilled knowledge layer, enables both semantic understanding and rationale-based retrieval for downstream tasks.
Information Resource Layer: This layer captures the diverse information sources, treating them as source nodes with edges that denote referential relationships among them. This structure aids in cross-referencing and contextualizing the knowledge, establishing a foundation for reasoning that depends on multiple sources.
Corpus Layer: This layer organizes the parsed information into sections and chunks while preserving the document’s original hierarchical structure. Multi-modal content such as tables and figures is summarized by LLMs and integrated as chunk nodes, ensuring that multi-modal knowledge is available for retrieval. This layer enables knowledge extraction with varying levels of granularity, allowing for accurate semantic chunking and retrieval across diverse content types.
Distilled Knowledge Layer: The corpus is further distilled into structured forms of knowledge (e.g., knowledge graphs, atomic knowledge, and tabular knowledge). This process, driven by techniques like Named Entity Recognition (NER) [19] and relationship extraction [40], ensures that the distilled knowledge captures key logical relationships and entities, supporting advanced reasoning processes. By organizing this structured knowledge in a distilled layer, we enhance the system’s ability to reason and synthesize based on deeper domain-specific knowledge. The knowledge distillation process is depicted in Figure 4. Below are the detailed distillation processes for typical knowledge forms.
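The distillation step can be sketched as follows; the entity and relation extractors are placeholder stubs standing in for NER or relation-extraction models (or LLM prompts), and the sample outputs are made up for illustration.

```python
from typing import Dict, List

def extract_entities(chunk: str) -> List[str]:
    # Placeholder NER output; a real system would run an NER model here.
    return ["biosimilar_X", "FDA"]

def extract_relations(chunk: str) -> List[tuple]:
    # Placeholder relation-extraction output as (head, relation, tail) triples.
    return [("biosimilar_X", "approved_by", "FDA")]

def distill(chunk: str) -> Dict[str, list]:
    entities = extract_entities(chunk)
    triples = extract_relations(chunk)
    # Atomic knowledge: self-contained statements derived from the triples,
    # ready to be attached as nodes in the distilled knowledge layer.
    atomic = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples]
    return {"entities": entities, "triples": triples, "atomic": atomic}
```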
Figure 5: Illustration of enhanced chunking with recurrent text splitting.
5.2 Level-1: Factual Question focused RAG System
Building upon the L0 system, the L1 system introduces knowledge retrieval and knowledge organization to realize its retrieval and generation capabilities. The primary challenges at this level are semantic alignment and chunking. The abundance of professional terminology and aliases can affect the accuracy of chunk retrieval, and unreasonable chunking can disrupt semantic coherence and introduce noise interference. To mitigate these issues, the L1 system incorporates more sophisticated query analysis techniques and basic knowledge extraction modules. The architecture is expanded to include components that facilitate task decomposition, coordination, and initial stages of knowledge organization (KO), ensuring that the system can manage more complex queries effectively.
Figure 6: Overview of the L1 RAG framework. The highlighted squares indicate the enhanced chunking and auto-tagging sub-modules in the knowledge extraction module.
5.2.1 Enhanced chunking
Chunking involves breaking down a large corpus of text into smaller, more manageable segments. The primary chunking strategies commonly utilized in RAG systems include fixed-size chunking, semantic chunking, and hybrid chunking. Chunking is essential for improving both the efficiency and accuracy of the retrieval process, which consequently affects the overall performance of RAG models in multiple dimensions. In our system, each chunk serves dual purposes: (i) it becomes a unit of information that is vectorized and stored in a database for retrieval, and (ii) it acts as a source for further knowledge extraction and information summarization. Improper chunking not only fails to ensure that text vectors encapsulate the necessary semantic information, but also hinders knowledge extraction based on complete context. For instance, in the context of laws and regulations, a fixed-size chunking approach is prone to destroying text semantics and omitting key conditions, thereby affecting the quality and accuracy of subsequent knowledge extraction.
We propose a text split algorithm to enhance existing chunking methods by breaking down large text documents into smaller, manageable chunks while preserving context and enabling effective summary generation for each chunk. The chunking process is illustrated in Figure 5. Given a source text, the algorithm iteratively splits the text into chunks. During the first iteration, it generates a forward summary of the initial chunk, providing context for generating summaries of subsequent chunks and maintaining a coherent narrative across splits. Each chunk is summarized using a predefined prompt template that incorporates both the forward summary and the current chunk. This summary is then stored alongside the chunk. The algorithm adjusts the text by removing the processed chunk and updating the forward summary with the summary of the current chunk, preparing for the next iteration. This process continues until the entire text is split and summarized. Additionally, the algorithm can dynamically adjust chunk sizes based on the content and structure of the text.
我们提出了一种文本分割算法,通过将大型文本文档分解为更小、更易管理的块来增强现有的分块方法,同时保留上下文并为每个块生成有效的摘要。分块过程如图 5 所示。给定源文本,该算法迭代地将文本分割成块。在第一次迭代中,它会生成初始块的前向摘要,为后续块摘要的生成提供上下文,并在分割之间保持连贯的叙述。每个块都使用预定义的提示模板进行摘要,该模板结合了前向摘要和当前块。然后将此摘要与块一起存储。该算法通过移除已处理的块并使用当前块的摘要更新前向摘要来调整文本,为下一次迭代做准备。此过程持续进行,直到整个文本被分割并摘要完毕。此外,该算法可以根据文本的内容和结构动态调整块的大小。
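To make the procedure concrete, the following Python sketch restates the loop under stated assumptions: `take_chunk` and `llm_summarize` are hypothetical helpers standing in for a size-aware splitter and an LLM call with the summarization prompt template; neither name comes from the released code.

```python
from typing import Callable, List, Tuple

def split_with_forward_summary(
    text: str,
    take_chunk: Callable[[str], str],          # returns the next chunk from the front of `text`
    llm_summarize: Callable[[str, str], str],  # (forward_summary, chunk) -> chunk summary
) -> List[Tuple[str, str]]:
    """Split `text` into chunks, each stored alongside a context-aware summary."""
    results: List[Tuple[str, str]] = []
    forward_summary = ""                        # empty before the first iteration
    while text:
        chunk = take_chunk(text)                # may adapt chunk size to content/structure
        if not chunk:
            break
        summary = llm_summarize(forward_summary, chunk)
        results.append((chunk, summary))        # store the chunk with its summary
        text = text[len(chunk):]                # remove the processed chunk
        forward_summary = summary               # carry context into the next split
    return results
```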
Figure 7: Illustration of the auto-tagging module.
5.2.2 Auto-tagging
In domain-specific RAG scenarios, the corpus is typically characterized by formal, professional, and rigorously expressed content, whereas the questions posed are often articulated in plain, easily understandable colloquial language. For instance, in medical question-answering (medQA) tasks [32], symptoms of diseases described in the questions are generally phrased in simple, conversational terms. In contrast, the corresponding medical knowledge within the corpus is often expressed using specialized professional terminology. This discrepancy introduces a domain gap that adversely affects the accuracy of chunk retrieval, especially given the limitations of the embedding models employed for this purpose.
To address the domain gap, we propose an auto-tagging module designed to minimize the disparity between the source documents and the queries. This module preprocesses the corpus to extract a comprehensive collection of domain-specific tags or to establish tag mapping rules. Prior to retrieval, tags are extracted from the query and then mapped to the corpus domain using the preprocessed tag collection or tag pair collection. This tag-based domain adaptation can be employed for query rewriting or keyword retrieval within sequential information retrieval frameworks, thereby enhancing both the recall and precision of the retrieval process.
Specifically, we leverage the capabilities of the LLMs to identify key factors within the corpus chunks, summarize these factors, and generalize them into category names, which we refer to as "tag classes." We generate semantic tag extraction prompts based on these tag classes to facilitate accurate tag extraction. In scenarios where only the corpus is available, LLMs are employed with meticulously designed prompts to extract semantic tags from the corpus, thereby forming a comprehensive corpus tag collection. When practical QA samples are available, semantic tag extraction is performed on both the queries and the corresponding retrieved answer chunks. Using the tag sets extracted from the chunks and queries, LLMs are utilized to map cross-domain semantic tags and generate a tag pair collection. After establishing both the corpus tag collection and the tag pair collection, tags can be extracted from the query, and the corresponding mapped tags can be identified within the collections. These mapped tags are then used to enhance subsequent information retrieval processes, improving both recall and precision. This workflow leverages the advanced understanding and contextual capabilities of LLMs for domain adaptation.
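The sketch below illustrates how this workflow might be wired together; the prompt wording, `call_llm`, and the collection formats are illustrative assumptions rather than the system's actual interfaces.

```python
from typing import Callable, Dict, List, Set

# Hypothetical prompt derived from the tag classes discovered in the corpus.
TAG_PROMPT = (
    "Extract the domain-specific tags (key factors, entities, symptoms) "
    "mentioned in the following text, one per line:\n{text}"
)

def extract_tags(text: str, call_llm: Callable[[str], str]) -> List[str]:
    """Use an LLM to extract semantic tags from a query or a corpus chunk."""
    reply = call_llm(TAG_PROMPT.format(text=text))
    return [line.strip() for line in reply.splitlines() if line.strip()]

def map_to_corpus_domain(
    query_tags: List[str],
    corpus_tags: Set[str],          # corpus tag collection
    tag_pairs: Dict[str, str],      # colloquial tag -> professional corpus tag
) -> List[str]:
    """Map query tags into corpus terminology before retrieval."""
    mapped: List[str] = []
    for tag in query_tags:
        if tag in corpus_tags:      # already expressed in corpus terminology
            mapped.append(tag)
        elif tag in tag_pairs:      # mapping learned from QA samples
            mapped.append(tag_pairs[tag])
    return mapped                   # used for query rewriting or keyword retrieval
```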
Figure 8: Overview of multi-layer, multi-granularity retrieval over a heterogeneous graph.
5.2.3 Multi-Granularity Retrieval
The L1 system is designed to enable multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph, which was constructed in the L0 system. Each layer of the graph (e.g., information source layer, corpus layer, distilled knowledge layer) represents knowledge at different levels of abstraction and granularity, allowing the system to explore and retrieve relevant information at various scales. For example, queries can be mapped to entire documents (information source layer) or specific chunks of text (corpus layer), ensuring that knowledge can be retrieved at the appropriate level for a given task. To support this, similarity scores between queries and graph nodes are computed to measure the alignment between the query and the retrieved knowledge. These scores are then propagated through the layers of the graph, allowing the system to aggregate information from multiple levels. This multi-layer propagation ensures that retrieval can be fine-tuned based on both the broader context (e.g., entire documents) and finer details (e.g., specific chunks or distilled knowledge). The final similarity score is generated through a combination of aggregation and propagation, ensuring that knowledge extraction and utilization are optimized for both precision and efficiency in factual question answering. The retrieval process can be iterative, refining the results based on sub-queries generated through task decomposition, further enhancing the system’s ability to generate accurate and con textually relevant answers.
The overview of multi-layer, multi-granularity retrieval is depicted in Figure 8. For each layer of the graph, both the query $Q$ and the graph nodes are transformed into high-dimensional vector embeddings for similarity evaluation. We denote the similarity evaluation operation as $g(\cdot)$. Here, $I$, $C$, and $D$ indicate the node sets in the information source layer, corpus layer, and distilled knowledge layer, respectively. The propagation and aggregation operations are represented by the function $f(\cdot)$. The final chunk similarity score $S$ is obtained by aggregating the scores from the other layers and nodes.
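The text leaves $f(\cdot)$ unspecified; one plausible instantiation, assuming cosine similarity for $g(\cdot)$ and weighted-maximum propagation across linked nodes, would be

$$
S(Q, c) \;=\; \alpha\, g(Q, c) \;+\; \beta \max_{i \in I_c} g(Q, i) \;+\; \gamma \max_{d \in D_c} g(Q, d),
$$

where $I_c \subseteq I$ and $D_c \subseteq D$ are the information-source and distilled-knowledge nodes linked to chunk $c \in C$, and $\alpha$, $\beta$, $\gamma$ weight each layer's contribution. Other aggregation choices, such as sums or learned weights, are equally compatible with the framework.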
5.3 Level-2: Linkable and Reasoning Question focused RAG System
The core functionality of the L2 system lies in its ability to efficiently retrieve relevant information from multiple sources and perform complex reasoning on top of it. To facilitate this, the L2 system integrates an advanced knowledge extraction module that comprehensively identifies and extracts pertinent information. Furthermore, a task decomposition and coordination module is implemented to break down intricate tasks into smaller, manageable sub-tasks, thereby enhancing the system's efficiency in handling them. The proposed framework of the L2 RAG system is illustrated in Figure 9.
Chunked text contains multifaceted information, which increases the complexity of retrieval. Recent studies have focused on extracting triple knowledge units from chunked text and constructing knowledge graphs to facilitate efficient information retrieval [20, 42]. However, constructing knowledge graphs is costly, and the inherent knowledge may not always be fully explored. To better expose the knowledge embedded in the documents, we propose atomizing the original documents during the Knowledge Extraction phase, a process we refer to as Knowledge Atomizing. Moreover, industrial tasks often require multiple pieces of knowledge, implicitly demanding the capability to decompose the original question into several sequential or parallel atomic questions; we refer to this operation as Task Decomposition. By combining the extracted atomic knowledge with the original chunks, we construct an atomic hierarchical knowledge base. Each time we decompose a task, the hierarchical knowledge base provides insights into the available knowledge, enabling knowledge-aware task decomposition.
5.3.1 Knowledge Atomizing
We believe that a single document chunk often encompasses multiple pieces of knowledge, and the information necessary to address a specific task typically represents only a subset of that knowledge. Therefore, consolidating these pieces within a single chunk, as traditionally done in information retrieval, may not facilitate the efficient retrieval of the precise information required. To align the granularity of knowledge with the queries generated during task solving, we propose a method called knowledge atomizing. This approach leverages the context understanding and content generation capabilities of LLMs to automatically tag atomic knowledge pieces within each document chunk. Note that these chunks could be segments of an original reference document, description chunks generated for tables, images, or videos, or summary chunks of entire sections, chapters, or even documents.
Atomic knowledge can be represented in various forms. Instead of declarative sentences or subject-relation-object tuples, we propose using questions as knowledge indexes to further bridge the gap between stored knowledge and queries. Unlike the semantic tagging process, in the knowledge atomizing process we input the document chunk to the LLM as context and ask it to generate as many relevant questions as possible that can be answered by the given chunk. These generated atomic questions are saved as atomic question tags together with the given chunk. An example of knowledge atomizing is shown in Figure 10(c), where the atomic questions capture various aspects of the knowledge contained within the chunk. A hierarchical knowledge base can accommodate queries of varying granularity. Figure 11 illustrates the retrieval process over an atomic knowledge base comprising chunks and atomic questions: queries can directly retrieve reference chunks as usual, and, since each chunk is tagged with multiple atomic questions, an atomic query can also be used to locate relevant atomic questions, which then lead to the associated reference chunks.
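A minimal sketch of the atomizing step and the second retrieval path of Figure 11 follows; `call_llm`, `embed`, and `sim` are hypothetical stand-ins for an LLM call, an embedding model, and a similarity metric, and the prompt wording is an assumption.

```python
from typing import Callable, Dict, List

ATOMIZE_PROMPT = (
    "Given the following document chunk, generate as many questions as "
    "possible that can be answered by this chunk, one per line:\n{chunk}"
)

def atomize_chunk(chunk: str, call_llm: Callable[[str], str]) -> List[str]:
    """Generate atomic-question tags for a single chunk."""
    reply = call_llm(ATOMIZE_PROMPT.format(chunk=chunk))
    return [q.strip() for q in reply.splitlines() if q.strip()]

def build_atomic_kb(chunks: List[str], call_llm) -> Dict[str, List[str]]:
    """Tag every chunk with its atomic questions (the path (b) index)."""
    return {chunk: atomize_chunk(chunk, call_llm) for chunk in chunks}

def retrieve_via_atoms(query: str, kb: Dict[str, List[str]],
                       embed, sim, top_k: int = 3) -> List[str]:
    """Path (b): match the query to atomic questions, then return source chunks."""
    scored = [
        (sim(embed(query), embed(q)), chunk)
        for chunk, questions in kb.items()
        for q in questions
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked: List[str] = []
    for _, chunk in scored:                 # deduplicate chunks, keep best-first order
        if chunk not in ranked:
            ranked.append(chunk)
        if len(ranked) == top_k:
            break
    return ranked
```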
5.3.2 Knowledge-Aware Task Decomposition
For a specific task, multiple decomposition strategies might be applicable. Consider Q2 in Figure 1 as an example. The two-step analytical reasoning process depicted may be effective if an interchangeable biosimilar products list is available. However, if only a general list of biosimilar products exists, with attributes dispersed throughout multiple documents, a different decomposition strategy may be necessary: (1) Retrieve the biosimilar product list; (2) Determine whether each product is interchangeable; (3) Count the total number of interchangeable products. The critical factor in selecting the most effective decomposition approach lies in understanding the contents of the specialized knowledge base. Motivated by this, we design the Knowledge-Aware Task Decomposition workflow, which is illustrated in Figure 10(a). The complete algorithm for task solving using Knowledge-Aware Task Decomposition is presented in Algorithm 1.
The reference context $\mathcal{C}_{t}$ is initialized as an empty set, and the original question is denoted by $q$. As illustrated in the for-loop starting at line 2 of the algorithm, in the $t$-th iteration we use an LLM, denoted by $\mathcal{LLM}$, to generate query proposals potentially useful for task completion, denoted as $\hat{q}_{i}^{t}$.
Figure 10: The illustration of knowledge atomizing and knowledge-aware task decomposition: (a) Workflow of task solving with knowledge-aware task decomposition, (b) Workflow of knowledge atomizing, (c) Example of knowledge atomizing, (d) RAG case with knowledge atomizing and knowledge-aware task decomposition.
In this step, the chosen reference chunks $\mathcal{C}_{t}$ are provided as context to avoid generating proposals tied to already-acquired knowledge. These proposals are then used as atomic queries to determine whether relevant knowledge exists within the knowledge base, denoted as $\mathcal{KB}$. For each atomic question proposal, we retrieve its relevant atomic question candidates along with their source chunks $\{(q_{ij}^{t}, c_{ij}^{t})\}$ from $\mathcal{KB}$. Any score metric $\mathrm{sim}$ can be used to retrieve atomic questions; in our experiments, we use the cosine similarity of the corresponding embeddings to retrieve the top-$K$ atomic questions whose similarity to a proposed atomic question is greater than or equal to a given threshold $\delta$. Given the original question $q$, the accumulated context $\mathcal{C}_{t}$, and the list of retrieved atomic questions $q_{ij}^{t}$, $\mathcal{LLM}$ selects the most useful atomic question $q^{t}$ from $q_{ij}^{t}$ and retrieves the relevant chunk $c^{t}$. This retrieved chunk is aggregated into the reference context $\mathcal{C}_{t}$ for the next round of decomposition. Knowledge-aware decomposition can iterate up to $N$ times, where $N$ is a hyperparameter set to control computational cost. The iteration can terminate early if there are no high-quality question proposals, no highly relevant atomic candidates, no suitable atomic knowledge selections, or if $\mathcal{LLM}$ determines that the acquired knowledge is sufficient to complete the task. Finally, the accumulated context $\mathcal{C}_{t}$ is used to generate the answer $\hat{a}$ for the given question $q$ in line 14.
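The loop can be summarized in Python as below; the helper callables (`propose`, `retrieve_atoms`, `select`, `is_sufficient`, `answer`) are hypothetical stand-ins for the LLM calls and knowledge-base retriever described above, not the released implementation.

```python
from typing import List, Optional, Tuple

def solve_with_kad(q: str, KB, propose, retrieve_atoms, select,
                   is_sufficient, answer,
                   N: int = 5, K: int = 5, delta: float = 0.5) -> str:
    """Task solving with knowledge-aware decomposition (cf. Algorithm 1)."""
    context: List[str] = []                          # C_t, initialized empty
    for t in range(N):                               # at most N iterations
        proposals = propose(q, context)              # atomic-query proposals \hat{q}_i^t
        if not proposals:
            break                                    # no high-quality proposals
        candidates: List[Tuple[str, str]] = [        # (atomic question, source chunk)
            pair
            for p in proposals
            for pair in retrieve_atoms(KB, p, K, delta)  # top-K with sim >= delta
        ]
        if not candidates:
            break                                    # nothing relevant in the KB
        chosen: Optional[Tuple[str, str]] = select(q, context, candidates)
        if chosen is None:
            break                                    # no suitable atomic selection
        context.append(chosen[1])                    # aggregate chunk c^t into C_t
        if is_sufficient(q, context):
            break                                    # knowledge already sufficient
    return answer(q, context)                        # generate \hat{a}
```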
Figure 11: Retrieval process from an atomic knowledge base. It supports two retrieval paths: (a) using queries to directly retrieve chunks as usual; (b) locating atomic nodes first and then retrieving the associated chunks.
Algorithm 1 Task Solving with Knowledge-Aware Decomposition
5.3.3 Knowledge-Aware Task Decomposer Training
It is worth mentioning that knowledge-aware decomposition can be implemented as a learnable component: a proposer trained to suggest atomic queries $q^{t}$ directly during inference. With such a trained proposer, lines 3 to 5 in Algorithm 1 can be replaced by a single call to it, reducing both inference time and computational cost. To train the knowledge-aware decomposer, we collect data about the rationale behind each step by sampling contexts and creating diverse interaction trajectories. With this data, we train a decomposer that incorporates domain-specific rationale into the task decomposition and result-seeking process.
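For concreteness, one record of such an interaction trajectory might look like the following; the field names are a hypothetical schema for illustration, not the actual data format.

```python
# One collected interaction step, as sketched in Figure 12 (hypothetical schema).
trajectory_record = {
    "question": "...",                    # original task q
    "context": ["chunk_a", "chunk_b"],    # sampled reference context for this step
    "proposals": ["...", "..."],          # generated atomic-query proposals
    "chosen": "...",                      # proposal kept after retrieval and selection
    "answer_score": 0.0,                  # evaluation of the final answer
}
```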
The data collection process, depicted in Figure 12 and Algorithm 2, uses a dual-dictionary system for managing and tracking information: dictionary $\boldsymbol{S}$ maintains score records, and dictionary $\mathcal{V}$ tracks the visit frequencies of candidate chunks. During initialization, all scores are set to zero and all visit counters to one, establishing the baseline for dynamic updates throughout the subsequent processing stages.
In each iteration of the decomposition process, the system retrieves the top-$K^{\prime}$ chunks most relevant to the current atomic question. These chunks must satisfy a similarity threshold criterion (similarity exceeding $\delta^{\prime}$, where $\delta^{\prime} < \delta$), with $K^{\prime}$ intentionally set larger than $K$ to ensure broad coverage. Following this initial retrieval, the data chunks corresponding to the top-$K$ most relevant atomic retrieved pairs are selected and integrated into the context. The retrieved chunks that do not make it into the top-$K$ selection are incorporated into $\boldsymbol{S}$, and their scores are updated based on the computed relevance metrics.
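A sketch of that bookkeeping step is shown below; the additive score update is an assumption about how relevance feeds into $\boldsymbol{S}$.

```python
from typing import Dict, List, Set, Tuple

def update_candidates(S: Dict[str, float], V: Dict[str, int],
                      retrieved: List[Tuple[str, float]],
                      top_k_ids: Set[str]) -> None:
    """Fold non-selected top-K' retrievals into the score/visit dictionaries."""
    for chunk_id, relevance in retrieved:            # chunks with similarity > delta'
        if chunk_id in top_k_ids:
            continue                                 # already placed into the context
        S[chunk_id] = S.get(chunk_id, 0.0) + relevance   # accumulate relevance score
        V[chunk_id] = V.get(chunk_id, 1) + 1             # count this visit
```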
Figure 12: Data collection process for decomposer training, comprising four main components: a) sampling data chunks from the context sampling pool to serve as the reference context for question decomposition, b) saving the generated atomic query proposals, c) after retrieval and selection, saving the chosen atomic query proposals as part of the reasoning trajectories, d) evaluating the answer to generate a score.
Figure 13: An example of context sampling and an illustration of decomposer training with collected data.
To ensure comprehensive exploration of the solution space, we implement a sampling mechanism that selects additional chunks from $\boldsymbol{S}$ when available and incorporates them into the reference context. Our implementation leverages the Upper Confidence Bound (UCB) algorithm [8] for context sampling, balancing exploitation against exploration. Exploitation comes from the retriever-selected chunks, which focus on the options with the currently highest estimated rewards to optimize immediate performance. Exploration comes from sampling $\boldsymbol{S}$, enabling systematic investigation of less-certain options to accumulate valuable data and potentially uncover superior long-term alternatives.
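The following sketch shows one way such UCB-based sampling could score candidates in $\boldsymbol{S}$; the exploration constant `c` and the mean-score exploitation term are assumptions, since the text does not specify the exact UCB form used.

```python
import math
from typing import Dict, List

def ucb_sample(S: Dict[str, float], V: Dict[str, int],
               total_visits: int, n_samples: int = 2, c: float = 1.0) -> List[str]:
    """Pick extra context chunks, trading off high scores against rarely visited ones."""
    def ucb(chunk_id: str) -> float:
        mean_score = S[chunk_id] / V[chunk_id]                                # exploitation
        bonus = c * math.sqrt(math.log(max(total_visits, 2)) / V[chunk_id])  # exploration
        return mean_score + bonus
    ranked = sorted(S, key=ucb, reverse=True)
    return ranked[:n_samples]
```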
This strategy serves a dual purpose: it facilitates the generation of diverse and comprehensive atomic query proposals, and it enables systematic exploration of multiple potential reasoning pathways. Through this approach, we progressively work toward optimal final answers while balancing immediate performance gains against the long-term discovery of potentially superior solutions.