From Matching to Generation: A Survey on Generative Information Retrieval
XIAOXI LI and JIAJIE JIN, Renmin University of China, China
YUJIA ZHOU, Tsinghua University, China
YUYAO ZHANG and PEITIAN ZHANG, Renmin University of China, China
YUTAO ZHU and ZHICHENG DOU∗, Renmin University of China, China
Information Retrieval (IR) systems are crucial tools for users to access information, and have long been dominated by traditional methods relying on similarity matching. With the advancement of pre-trained language models, generative information retrieval (GenIR) has emerged as a novel paradigm, attracting increasing attention. Based on the form of information provided to users, current research in GenIR can be categorized into two aspects: (1) Generative Document Retrieval (GR) leverages the generative model’s parameters to memorize documents, enabling retrieval by directly generating relevant document identifiers without explicit indexing. (2) Reliable Response Generation employs language models to directly generate the information users seek, breaking the limitations of traditional IR in terms of document granularity and relevance matching while offering flexibility, efficiency, and creativity to meet practical needs. This paper aims to systematically review the latest research progress in GenIR. We summarize the advancements in GR regarding model training and structure, document identifiers, incremental learning, etc., as well as progress in reliable response generation in aspects of internal knowledge memorization, external knowledge augmentation, etc. We also review the evaluation, challenges, and future developments of GenIR systems. This review aims to offer a comprehensive reference for researchers, encouraging further development in the GenIR field.
CCS Concepts: • Information systems $\rightarrow$ Retrieval models and ranking.
Additional Key Words and Phrases: Generative Information Retrieval; Generative Document Retrieval; Reliable Response Generation
1 INTRODUCTION
Information retrieval (IR) systems are crucial for navigating the vast sea of online information in today’s digital landscape. From using search engines such as Google [76], Bing [196], and Baidu [209], to engaging with question-answering or dialogue systems like ChatGPT [209] and Bing Chat [197], and discovering content via recommendation platforms like Amazon [4] and
Fig. 1. Exploring IR Evolution: From Traditional to Generative Methods - This diagram illustrates the shift from traditional similarity-based document matching (a) to GenIR techniques. Current GenIR methods can be categorized into two types: generative retrieval (b), which retrieves documents by directly generating relevant DocIDs constrained by a DocID prefix tree; and response generation (c), which directly generates reliable and user-centric answers.
YouTube [77], IR technologies are integral to our everyday online experiences. These systems are reliable and play a key role in spreading knowledge and ideas globally.
Traditional IR systems primarily rely on sparse retrieval methods based on word-level matching. These methods, which include Boolean Retrieval [242], BM25 [238], SPLADE [65], and UniCOIL [163], establish connections between vocabulary and documents, offering high retrieval efficiency and robust system performance. With the rise of deep learning, dense retrieval methods such as DPR [117] and ANCE [324], based on the bidirectional encoding representations from the BERT model [121], capture the deep semantic information of documents, significantly improving retrieval precision. Although these methods have achieved leaps in accuracy, they rely on large-scale document indices [57, 187] and cannot be optimized in an end-to-end way. Moreover, when people search for information, what they really need is a precise and reliable answer. This ranked-list-based IR approach still requires users to spend time distilling the answers they need from the returned documents, which is not ideal for information seeking [195].
Transformer-based pre-trained language models such as T5 [231], BART [138], and GPT [228] have demonstrated strong text generation capabilities. In recent years, large language models (LLMs) have brought about revolutionary changes in the field of AI-generated content (AIGC) [19, 359]. Based on large pre-training corpora and advanced training techniques like RLHF [36], LLMs [8, 105, 209, 286] have made significant progress in natural language tasks, such as dialogue [209, 282] and question answering [174, 225]. The rapid development of LLMs is transforming IR systems, giving rise to a new paradigm of generative information retrieval (GenIR), which achieves IR goals through generative approaches.
As envisioned by Metzler et al. [195], in order to build an IR system that can respond like a domain expert, the system should not only provide accurate responses but also include source citations to ensure the credibility of the results. To achieve this, GenIR models must possess both sufficient memorized knowledge and the ability to recall the associations between knowledge and source documents, which could be the final goal of GenIR systems. Currently, research in GenIR primarily focuses on two main patterns: (1) Generative Document Retrieval (GR), which involves retrieving documents by generating their identifiers; and (2) Reliable Response Generation, which entails directly generating user-centric responses with reliability enhancement strategies. Note that although these two methods have not yet been integrated technically, they represent two primary forms by which IR systems present information to users in generative manners: either by generating lists of document identifiers or by generating reliable and user-centric responses. Figure 1 illustrates the difference between these two forms. These strategies are essential to the next generation of information retrieval and constitute the central focus of this survey.
Generative document retrieval, a new retrieval paradigm based on generative models, is garnering increasing attention. This approach leverages the parametric memory of generative models to directly generate document identifiers (DocIDs) related to the documents [18, 281, 307, 371]. Figure 1 illustrates this transition, where traditional IR systems match queries to documents based on an indexed database (Figure 1(a)), while generative methods use language models to retrieve by directly generating relevant document identifiers (Figure 1(b)). Specifically, GR assigns a unique identifier to each document, which can be numeric-based or text-based, and then trains a generative retrieval model to learn the mapping from queries to the relevant DocIDs. This allows the model to index documents using its internal parameters. During inference, GR models use constrained beam search to limit the generated DocIDs to be valid within the corpus, ranking them based on generation probability to produce a ranked list of DocIDs. This eliminates the need for large-scale document indexes in traditional methods, enabling end-to-end training of the model.
Recent studies on generative retrieval have delved into model training and structure [6, 153, 281, 307, 365, 369, 372], document identifier design [18, 265, 281, 288, 330], continual learning on dynamic corpora [80, 124, 192], downstream task adaptation [27, 28, 152], multi-modal generative retrieval [157, 178, 357], and generative recommender systems [74, 233, 304]. The progress in GR is shifting retrieval systems from matching to generation. It has also led to the emergence of workshops [10] and tutorials [279]. However, there is currently no comprehensive review that systematically organizes the research, challenges, and prospects of this emerging field.
Reliable response generation is also a promising direction in the IR field, offering user-centric and accurate answers that directly meet users’ needs. LLMs are particularly adept at following instructions [359], capable of generating customized responses, and can even cite their knowledge sources [204, 223], making direct response generation a new and intuitive way to access information [54, 75, 241, 315, 367]. As illustrated in Figure 1, the generative approach marks a significant shift from traditional IR systems, which return a ranked list of documents (as shown in Figure 1(a,b)). Instead, response generation methods (depicted in Figure 1(c)) offer a more dynamic form of information access by directly generating detailed, user-centric responses, thereby providing a richer and more immediate understanding of the information need behind the users’ queries.
However, the responses generated by language models may not always be reliable. They have the potential to generate irrelevant answers [85], contradict factual information [90, 104], provide outdated data [291], or generate toxic content [93, 263]. Consequently, these limitations render them unsuitable for many scenarios that require accurate and up-to-date information. To address these challenges, the academic community has developed strategies across four key aspects: enhancing internal knowledge [16, 37, 56, 119, 132, 193, 243, 267, 285]; augmenting external knowledge [5, 113, 139, 151, 204, 245, 333]; generating responses with citation [129, 142, 156, 204, 314]; and improving personal information assistance [149, 172, 295, 327]. Despite these efforts, there is still a lack of a systematic review that organizes the existing research under this new paradigm of generative information access.
This paper systematically reviews the latest research progress and future developments in the field of GenIR. Figure 2 displays the classification of research related to GenIR systems. We introduce background knowledge in Section 2, generative document retrieval technologies in Section 3, direct information access with generative language models in Section 4, evaluation in Section 5, and current challenges and future directions in Section 6. Section 7 summarizes the content of this review. This article is the first to systematically organize the research, evaluation, challenges, and prospects of generative IR, while also looking forward to the potential and importance of GenIR’s future development. Through this review, readers will gain a deep understanding of the latest progress in developing GenIR systems and how it shapes the future of information access. The main contributions of this survey are summarized as follows:
Fig. 2. Taxonomy of research on generative information retrieval: investigating generative document retrieval, reliable response generation, evaluation, challenges and prospects.
• First comprehensive survey on generative information retrieval (GenIR): This survey is the first to comprehensively organize the techniques, evaluation, challenges, and prospects of the emerging field of GenIR, providing a deep understanding of the latest progress in developing GenIR systems and their future in shaping information access.
• Systematic categorization and in-depth analysis: The survey offers a systematic categorization of research related to GenIR systems, including generative document retrieval and reliable response generation. It provides an in-depth analysis of each category, covering model training and structure, document identifiers, etc. in generative document retrieval; and internal knowledge memorization, external knowledge enhancement, etc. for reliable response generation.
• Comprehensive review of evaluation metrics and benchmarks: The survey reviews a range of widely used evaluation metrics and benchmark datasets for assessing GenIR methods, alongside analysis of the effectiveness and weaknesses of existing GenIR methods.
• Discussions of current challenges and future directions: The survey identifies and discusses the current challenges faced in the GenIR field. We also provide potential solutions for each challenge and outline future research directions for building GenIR systems.
2 BACKGROUND AND PRELIMINARIES
Information retrieval techniques aim at efficiently obtaining, processing, and understanding information from massive data. Technological advancements have continuously driven the evolution of these methods: from early keyword-based sparse retrieval to deep learning-based dense retrieval, and more recently, to generative retrieval, large language models, and their augmentation techniques. Each advancement enhances retrieval accuracy and efficiency, catering to the complex and diverse query needs of users.
2.1 Traditional Information Retrieval
Sparse Retrieval. In the field of traditional information retrieval, sparse retrieval techniques implement fast and accurate document retrieval through the inverted index method. Inverted indexing technology maps each unique term to a list of all documents containing that term, providing an efficient means for information retrieval in large document collections. Among these methods, TF-IDF (Term Frequency-Inverse Document Frequency) [235] is a particularly important statistical tool used to assess the importance of a word in a document collection, thereby widely applied in various traditional retrieval systems.
The core of sparse retrieval technology lies in evaluating the relevance between documents and user queries. Specifically, given a document collection $\mathcal{D}$ and a user query $q$ , traditional information retrieval systems identify and retrieve information by calculating the relevance $\mathcal{R}$ between document $d$ and query $q$ . This relevance evaluation typically relies on the similarity measure between document $d$ and query $q$ , as shown below:
$$
\mathcal{R}(q,d)=\sum_{t\in q\cap d}\operatorname{tf-idf}(t,d)\cdot\operatorname{tf-idf}(t,q),
$$
ACM Trans. Inf. Syst., Vol. 1, No. 1, Article . Publication date: March 2025.
where $t$ represents the terms common to both query $q$ and document $d$, and $\operatorname{tf-idf}(t,d)$ and $\operatorname{tf-idf}(t,q)$ represent the TF-IDF weights of term $t$ in document $d$ and query $q$, respectively. Although sparse retrieval methods like TF-IDF [235] and BM25 [238] excel at fast retrieval, they struggle with complex queries involving synonyms, specialized terms, or context, as term matching and TF-IDF may not fully meet users’ information needs [180].
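The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration of the $\mathcal{R}(q,d)$ formula over tokenized texts; the function names, the corpus representation, and the IDF smoothing variant are illustrative assumptions, not the conventions of any particular retrieval system:

```python
import math
from collections import Counter

def tf_idf(term, text_terms, corpus):
    """TF-IDF weight of `term` in a bag of terms, w.r.t. `corpus` (a list of term lists)."""
    tf = Counter(text_terms)[term] / len(text_terms)
    df = sum(1 for doc in corpus if term in doc)   # document frequency of the term
    idf = math.log(len(corpus) / (1 + df))         # one common smoothed-IDF variant
    return tf * idf

def relevance(query_terms, doc_terms, corpus):
    """R(q, d): sum over shared terms t of tf-idf(t, d) * tf-idf(t, q)."""
    shared = set(query_terms) & set(doc_terms)
    return sum(tf_idf(t, doc_terms, corpus) * tf_idf(t, query_terms, corpus)
               for t in shared)
```

Documents are then ranked by their `relevance` scores; production systems precompute these weights offline in an inverted index rather than scoring every document on the fly.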
Dense Retrieval. The advent of pre-trained language models like BERT [121] has revolutionized information retrieval, leading to the development of dense retrieval methods, like DPR [117], ANCE [324], E5 [298], SimLM [299]. Unlike traditional sparse retrieval, these methods leverage Transformer-based encoders to create dense vector representations for both queries and documents. This approach enhances the capability to grasp the underlying semantics, thereby improving retrieval accuracy.
The core of dense retrieval lies in converting documents and queries into vector representations. Given a document $d$ and a query $q$, each document $d$ is transformed into a dense vector $\mathbf{v}_{d}$ through a pre-trained language model; similarly, query $q$ is transformed into a vector $\mathbf{v}_{q}$. Specifically, we can use encoder functions $E_{d}(\cdot)$ and $E_{q}(\cdot)$ to represent the encoding process for documents and queries, respectively:
$$
\mathbf{v}_ {d}=E_{d}(d),\quad\mathbf{v}_ {q}=E_{q}(q),
$$
where $E_{d}(\cdot)$ and $E_{q}(\cdot)$ can be the same model or different models optimized for specific tasks.
Dense retrieval methods evaluate relevance by calculating the similarity between the query vector and document vector, which can be measured by cosine similarity, expressed as follows:
$$
\mathcal{R}(q,d)=\cos({\mathbf{v}_ {q}},{\mathbf{v}_ {d}})=\frac{{\mathbf{v}_ {q}}\cdot\mathbf{v}_ {d}}{|\mathbf{v}_ {q}||\mathbf{v}_{d}|},
$$
where $\mathbf{v}_ {q}\cdot\mathbf{v}_ {d}$ represents the dot product of query vector $\mathbf{v}_ {q}$ and document vector ${\bf v}_ {d}$ , and $|\mathbf{v}_ {q}|$ and $|\mathbf{v}_{d}|$ respectively represent the magnitudes of the query and document vector. Finally, documents are ranked based on these similarity scores to identify the most relevant ones for the user.
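As a sketch, the scoring step follows directly from the cosine formula. The vectors here are toy placeholders for the outputs of the encoders $E_{q}(\cdot)$ and $E_{d}(\cdot)$, which in practice would come from a model such as BERT:

```python
import math

def cosine(v_q, v_d):
    """R(q, d) = cos(v_q, v_d): dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(v_q, v_d))
    norm_q = math.sqrt(sum(a * a for a in v_q))
    norm_d = math.sqrt(sum(b * b for b in v_d))
    return dot / (norm_q * norm_d)

def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]
```

At corpus scale this exhaustive comparison is replaced by approximate nearest-neighbor search over the precomputed document vectors.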
2.2 Generative Retrieval
With the significant progress of language models, generative retrieval has emerged as a new direction in the field of information retrieval [195, 281, 328]. Unlike traditional index-based retrieval methods, generative retrieval relies on pre-trained generative language models, such as T5 [231] and BART [138], to directly generate document identifiers (DocIDs) relevant to the query, thereby achieving end-to-end retrieval without relying on large-scale pre-built document indices.
DocID Construction and Prefix Constraints. To facilitate generative retrieval, each document $d$ in the corpus $\mathcal{D}=\{d_{1},d_{2},\ldots,d_{N}\}$ is assigned a unique document identifier $d^{\prime}$, forming the set $\mathcal{D}^{\prime}=\{d_{1}^{\prime},d_{2}^{\prime},\ldots,d_{N}^{\prime}\}$. This mapping is typically established via a bijective function $\phi:{\mathcal{D}}\rightarrow{\mathcal{D}}^{\prime}$, ensuring that:
$$
\phi(d_{i})=d_{i}^{\prime},\quad\forall d_{i}\in{\mathcal{D}}.
$$
To enable the language model to generate only valid DocIDs during inference, we construct prefix constraints based on ${\mathcal{D}}^{\prime}$ . This is typically implemented using a trie (prefix tree), where each path from the root to a leaf node corresponds to a valid DocID.
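A minimal sketch of such a prefix tree, assuming each DocID is given as a sequence of tokens (the class and method names are illustrative, not from any specific implementation):

```python
class DocIDTrie:
    """Prefix tree over the valid DocID token sequences in D'."""

    def __init__(self, docids):
        self.root = {}
        for docid in docids:              # each DocID is a sequence of tokens
            node = self.root
            for token in docid:
                node = node.setdefault(token, {})

    def allowed_next(self, prefix):
        """Tokens that extend `prefix` into a prefix of some valid DocID."""
        node = self.root
        for token in prefix:
            if token not in node:
                return set()              # prefix is not part of any valid DocID
            node = node[token]
        return set(node.keys())
```

During decoding, the model's vocabulary is masked at each step so that only tokens in `allowed_next(current_prefix)` can be generated, which restricts the output to valid DocIDs.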
Constrained Beam Search. Given a query $q$ , the generative retrieval model aims to generate the top $k$ DocIDs that are most relevant to $q$ . The language model $P(\cdot|q;\theta)$ generates DocIDs token by token, guided by the prefix constraints. At each decoding step $i$ , only those tokens that extend the current partial sequence $d_{<i}^{\prime}$ into a valid prefix of some DocIDs in $\mathcal{D}^{\prime}$ are considered. Formally, the set of allowable next tokens is:
$$
\mathcal{V}(d_{<i}^{\prime})=\{v\mid\exists\,d^{\prime}\in\mathcal{D}^{\prime}\text{ such that }d_{<i}^{\prime}v\text{ is a prefix of }d^{\prime}\}.
$$
By employing constrained beam search, the model efficiently explores the space of valid DocIDs, maintaining a beam of the most probable sequences at each decoding step while adhering to the DocID prefix constraints.
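The procedure can be sketched as follows. Here `log_prob` stands in for the language model's conditional log-probability $\log P(v \mid d_{<i}^{\prime}, q;\theta)$, and for simplicity the sketch assumes no DocID is a proper prefix of another (in practice an end-of-sequence token guarantees this):

```python
def allowed_next(prefix, docids):
    """Tokens v such that prefix + (v,) is still a prefix of some valid DocID."""
    n = len(prefix)
    return {d[n] for d in docids if len(d) > n and d[:n] == prefix}

def constrained_beam_search(log_prob, docids, k):
    """Return up to k complete DocIDs, ranked by cumulative log-probability."""
    beams = [((), 0.0)]                  # (partial DocID, cumulative log-prob)
    finished = []
    for _ in range(max(len(d) for d in docids)):
        candidates = []
        for prefix, score in beams:
            for token in allowed_next(prefix, docids):
                seq = prefix + (token,)
                s = score + log_prob(prefix, token)
                if seq in docids:        # a complete, valid DocID
                    finished.append((s, seq))
                else:
                    candidates.append((seq, s))
        beams = sorted(candidates, key=lambda x: -x[1])[:k]  # keep top-k partials
        if not beams:
            break
    finished.sort(key=lambda x: -x[0])
    return [seq for _, seq in finished[:k]]
```

Here `docids` is a set of DocID token tuples; a trie makes `allowed_next` efficient for large corpora instead of scanning the whole identifier set at every step.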
Document Relevance. The relevance between the query $q$ and a document $d$ is quantified by the probability of generating its corresponding DocID $d^{\prime}$ given $q$ . This is computed as:
$$
\mathcal{R}(q,d)=P(d^{\prime}|q;\theta)=\prod_{i=1}^{T}P(d_{i}^{\prime}\mid d_{<i}^{\prime},q;\theta),
$$
where $T$ is the length of the DocID $d^{\prime}$ in tokens, $d_{i}^{\prime}$ is the token at position $i$, and $d_{<i}^{\prime}$ denotes the sequence of tokens generated before position $i$. The constrained beam search produces a ranked list of top-$k$ DocIDs $\{d^{\prime(1)},d^{\prime(2)},\ldots,d^{\prime(k)}\}$ based on their generation probabilities $\{\mathcal{R}(q,d^{(1)}),\mathcal{R}(q,d^{(2)}),\ldots,\mathcal{R}(q,d^{(k)})\}$. The corresponding documents $\{d^{(1)},d^{(2)},\ldots,d^{(k)}\}$ are then considered the most relevant to the query $q$.
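As a toy illustration of this scoring rule: given each candidate DocID's conditional token probabilities (stand-ins for real model outputs), ranking reduces to sorting by their product:

```python
import math

def rank_docids(candidates):
    """Rank DocIDs by R(q, d), the product of their conditional token probabilities.

    `candidates` maps each DocID to the sequence of probabilities
    P(d'_i | d'_{<i}, q; theta) assigned by the model.
    """
    scores = {docid: math.prod(probs) for docid, probs in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In practice beam search accumulates log-probabilities rather than multiplying raw probabilities, to avoid numerical underflow on long identifier sequences.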
Model Optimization. Generative retrieval models are typically optimized using cross-entropy loss, which measures the discrepancy between the generated DocID sequence and the ground truth DocID. Given a query $q$ and its corresponding DocID $d^{\prime}$ , the cross-entropy loss is defined as:
$$
\mathcal{L}=-\sum_{i=1}^{T}\log P(d_{i}^{\prime}\mid d_{<i}^{\prime},q;\theta),
$$
where $T$ is the length of the DocID in tokens, $d_{i}^{\prime}$ is the token at position $i$ , and $d_{<i}^{\prime}$ denotes the sequence of tokens generated before position $i$ . This loss function encourages the model to learn the association between query and labeled DocID sequence.
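For a single (query, DocID) pair, this loss is just the negative sum of the log-probabilities the model assigns to the ground-truth tokens; a minimal sketch, where the probabilities are placeholders for real decoder outputs:

```python
import math

def docid_cross_entropy(token_probs):
    """L = -sum_i log P(d'_i | d'_{<i}, q; theta) over the labeled DocID tokens."""
    return -sum(math.log(p) for p in token_probs)
```

A perfectly confident model (probability 1.0 on every ground-truth token) incurs zero loss; in a real training loop this quantity is computed in batch by the framework's cross-entropy over the decoder logits.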
This approach allows the generative retrieval model to produce a relevance-ordered list of documents without relying on traditional indexing structures. The core of this approach lies in leveraging the language model’s capability to generate DocID sequences within prefix constraints. This section discusses the simplest generative retrieval method. In Section 3, we will delve into advanced methods from multiple perspectives, including model architectures, training strategies, and DocID design, to further enhance retrieval performance across various scenarios.
2.3 Large Language Models
The evolution of Large Language Models (LLMs) marks a significant leap in natural language processing (NLP), rooted in early statistical and neural network-based language models [374]. These models, through pre-training on vast text corpora, learned deep semantic features of language, greatly enriching the understanding of text. Subsequently, generative language models, most notably the GPT series [16, 228, 229], significantly improved text generation and understanding capabilities with increased model size and numbers of parameters.
LLMs can be mainly divided into two categories: encoder-decoder models and decoder-only models. Encoder-decoder models, like T5 [231] and BART [138], convert input text into vector representations through their encoder, then the decoder generates output text based on these representations. This perspective treats various NLP tasks as text-to-text conversion problems, solving them through text generation. On the other hand, decoder-only models, like GPT [228] and GPT-2 [229], rely entirely on the Transformer decoder, generating text step by step through the self-attention mechanism. The introduction of GPT-3 [16], with its 175 billion parameters, marked a significant milestone in this field and led to the creation of models like InstructGPT [210], Falcon [215], PaLM [34] and the Llama series [59, 285, 286]. These models, all using a decoder-only architecture and trained on large-scale datasets, have shown astonishing language processing capabilities [359].
For information retrieval tasks, large language models (LLMs) play a crucial role in directly generating the exact information users seek [55, 173, 374]. This capability marks a significant step towards a new era of generative information retrieval. In this era, the retrieval process is not solely about locating existing information but also about creating new content that meets the specific needs of users. This feature is especially advantageous in situations where users might not know how to phrase their queries or when they are in search of complex and highly personalized information, scenarios where traditional matching-based methods fall short.
2.4 Augmented Language Models
Despite the advances of LLMs, they still face significant challenges such as hallucination, particularly in complex tasks or those requiring access to long-tail or real-time information [90, 359]. To address these issues, retrieval augmentation and tool augmentation have emerged as effective strategies. Retrieval augmentation involves integrating external knowledge sources into the language model’s workflow. This integration allows the model to access up-to-date and accurate information during the generation process, thereby grounding its responses in verified data and reducing the likelihood of hallucinations [139, 252, 271]. Tool augmentation, on the other hand, extends the capabilities of LLMs by incorporating specialized tools or APIs that can perform specific functions like mathematical computations, data retrieval, or executing predefined commands [226, 245, 276]. With retrieval and tool augmentations, language models can provide more precise and contextually relevant responses, thereby improving factuality and functionality in practical applications.
尽管大语言模型取得了进展,但它们仍面临重大挑战,例如幻觉问题(hallucination),尤其是在复杂任务或需要获取长尾或实时信息的任务中 [90, 359]。为解决这些问题,检索增强(retrieval augmentation)和工具增强(tool augmentation)已成为有效策略。检索增强涉及将外部知识源集成到语言模型的工作流程中,使模型在生成过程中能够访问最新且准确的信息,从而将其响应建立在已验证数据的基础上,降低幻觉发生的可能性 [139, 252, 271]。另一方面,工具增强通过整合专用工具或 API 来扩展大语言模型的能力,这些工具或 API 可以执行特定功能,如数学计算、数据检索或执行预定义命令 [226, 245, 276]。通过检索增强和工具增强,语言模型能够提供更精确且与上下文相关的响应,从而在实际应用中提高事实性和功能性。
Moreover, due to the aforementioned issue of hallucinations, the responses generated by LLMs are often considered unreliable because users are unaware of the sources behind the generated content, making it difficult to verify its accuracy. To enhance the credibility of responses, some studies have focused on generating responses with citations [143, 204, 256]. This approach involves enabling language models to cite the source documents of their generated content, thereby increasing the trustworthiness of the responses. All these methods are effective strategies for improving both the quality and reliability of language model outputs and are essential technologies for building the next generation of generative information retrieval systems.
3 GENERATIVE DOCUMENT RETRIEVAL: FROM SIMILARITY MATCHING TO GENERATING DOCUMENT IDENTIFIERS
In recent advancements in AIGC, generative retrieval (GR) has emerged as a promising approach in the field of information retrieval, garnering increasing interest from the academic community. Figure 3 showcases a timeline of GR methods. Initially, GENRE [18] proposed to retrieve entities by generating their unique names through constrained beam search over a pre-built entity prefix tree, achieving advanced entity retrieval performance. Subsequently, Metzler et al. [195] envisioned a model-based information retrieval framework aiming to combine the strengths of traditional document retrieval systems and pre-trained language models to create systems capable of providing expert-quality answers in various domains.
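To make the prefix-tree constraint concrete, here is a minimal sketch in Python. The toy entity list, the trie, and the `scores` dictionary are all illustrative stand-ins; a real system like GENRE constrains beam search over a subword vocabulary using the language model's token probabilities:

```python
# Toy sketch of GENRE-style constrained decoding (assumption: a trie built
# from known entity names stands in for the pre-built prefix tree, and a
# simple scoring dict stands in for the language model's token probabilities).

def build_trie(names):
    """Map every prefix of every identifier to its set of valid next tokens."""
    trie = {}
    for name in names:
        tokens = name.split()
        for i in range(len(tokens)):
            prefix = tuple(tokens[:i])
            trie.setdefault(prefix, set()).add(tokens[i])
        trie.setdefault(tuple(tokens), set()).add("<eos>")
    return trie

def constrained_greedy_decode(trie, score):
    """Greedily pick the highest-scoring token among those the trie allows,
    so the decoder can only ever emit an identifier that exists in the corpus."""
    prefix = ()
    while True:
        allowed = trie[prefix]
        best = max(allowed, key=lambda t: score.get(t, 0.0))
        if best == "<eos>":
            return " ".join(prefix)
        prefix = prefix + (best,)

entities = ["New York City", "New York Times", "London"]
trie = build_trie(entities)
# A hypothetical model that strongly prefers "New", "York", "Times".
scores = {"New": 0.9, "York": 0.8, "Times": 0.7, "City": 0.2, "London": 0.1}
print(constrained_greedy_decode(trie, scores))  # → "New York Times"
```

Because every decoding step is restricted to the trie's continuations, the generated sequence is guaranteed to be a valid identifier from the corpus.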
Following their lead, a diverse range of methods including DSI [281], DynamicRetriever [370], SEAL [13], NCI [307], etc., have been developed, with a continuously growing body of work. These methods explore various aspects such as model training and architectures, document identifiers, incremental learning, task-specific adaptation, and generative recommendations. Figure 4 presents an overview of the GR system, and we will provide an in-depth discussion of each associated challenge in the following sections.
3.1 Model Training and Structure
One of the core components of GR is model training and structure, which aims to enhance the model's ability to memorize documents in the corpus.
Fig. 3. Timeline of research in generative retrieval: focus on model training and structure, document identifier design, incremental learning and downstream task adaptation.
3.1.1 Model Training. To effectively train generative models for indexing documents, the standard approach is to learn the mapping from queries to relevant DocIDs using standard sequence-to-sequence (seq2seq) training, as described in Equation (2). This method has been widely used in numerous GR research works, such as DSI [281], NCI [307], and SEAL [13]. Moreover, a series of works have proposed various model training methods tailored for GR tasks to further enhance retrieval performance, such as sampling documents or generating queries from document content to serve as pseudo queries for data augmentation, or including training objectives for document ranking.
Specifically, DSI [281] proposed two training strategies: one is “indexing”, i.e., training the model to associate document tokens with their corresponding DocIDs, where DocIDs are pre-built from documents in the corpus (discussed in detail in Section 3.2); the other is “retrieval”, using labeled query-DocID pairs to fine-tune the model. Notably, DSI was the first to realize a differentiable search index based on the Transformer [290] structure, showing good performance in web search [205] and question answering [126] scenarios. Subsequently, a series of methods proposed training techniques for data augmentation and for improving the ranking ability of GR models.
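The two DSI training strategies reduce to constructing two kinds of training pairs. The sketch below is a toy illustration (the corpus and `build_training_pairs` helper are assumptions for this example); in practice the pairs would be fed to a seq2seq model such as T5 with a standard cross-entropy objective:

```python
# Minimal sketch of DSI-style training-pair construction (assumption: DocIDs
# are pre-assigned strings; a real system trains a seq2seq model on these pairs).

def build_training_pairs(corpus, labeled_queries):
    pairs = []
    # "Indexing": associate document text with its DocID.
    for docid, text in corpus.items():
        pairs.append((text, docid))
    # "Retrieval": fine-tune on labeled query -> DocID examples.
    for query, docid in labeled_queries:
        pairs.append((query, docid))
    return pairs

corpus = {"doc-0": "the eiffel tower is in paris",
          "doc-1": "python is a programming language"}
labeled = [("where is the eiffel tower", "doc-0")]
pairs = build_training_pairs(corpus, labeled)
print(len(pairs))  # → 3
```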
Sampling Document Pieces as Pseudo Queries. Around the same time, DynamicRetriever [370], also based on the encoder-decoder model, constructed a model-based IR system by initializing the encoder with a pre-trained BERT [121]. In addition, DynamicRetriever utilizes passages, sampled terms, and $N$-grams as pseudo queries to enhance the model's memorization of DocIDs. Formally, the training methods can be summarized as follows:
$$
\begin{aligned}
\mathrm{Sampled\ Document:}\ d_{s_i} &\longrightarrow \mathrm{DocID}, \quad i \in \{1, \dots, k_{d_s}\}, \\
\mathrm{Labeled\ Query:}\ q_i &\longrightarrow \mathrm{DocID}, \quad i \in \{1, \dots, k_q\},
\end{aligned}
$$
where $d_{s_i}$ and $q_i$ denote the $i$-th of the $k_{d_s}$ sampled document pieces and the $i$-th of the $k_q$ labeled queries for the corresponding DocID, respectively.
Generating Pseudo Queries from Documents. Following DSI, the NCI [307] model was trained on a combination of labeled query-document pairs and augmented pseudo query-document pairs. Specifically, NCI proposed two strategies: one uses the DocT5Query [208] model as a query generator, generating pseudo queries for each document in the corpus through beam search; the other directly uses the document as a query, as stated in Equation (8). Similarly, DSI-QG [375] also proposed using a query generator to augment training data, establishing a bridge between indexing and retrieval in DSI. Subsequent works have shown this data augmentation to be an effective way to enhance the model's memorization of DocIDs,
Fig. 4. A conceptual framework for a generative retrieval system, with a focus on challenges in incremental learning, identifier construction, model training and structure, and integration with downstream tasks and recommendation systems.
which can be expressed as follows:
$$
\mathrm{Pseudo\ Query:}\ q_{s_i} \longrightarrow \mathrm{DocID}, \quad i \in \{1, \dots, k_{q_s}\},
$$
where $q_{s_i}$ represents the $i$-th of the $k_{q_s}$ generated pseudo queries for the corresponding DocID.
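Both augmentation strategies amount to attaching extra (pseudo query, DocID) pairs to the training set. The sketch below is a toy illustration: `fake_query_generator` is a hypothetical stand-in for a learned generator such as DocT5Query, and `sample_spans` mimics the document-as-query strategy:

```python
# Sketch of the two augmentation strategies above (assumption: a trivial
# stand-in that emits the document's leading terms replaces the learned
# query generator; real systems generate pseudo queries with beam search).

def sample_spans(text, span_len=3):
    """Use fixed-length spans of the document itself as pseudo queries."""
    words = text.split()
    return [" ".join(words[i:i + span_len]) for i in range(0, len(words), span_len)]

def fake_query_generator(text, n=2):
    """Hypothetical stand-in for a learned query generator."""
    words = text.split()
    return [" ".join(words[:k]) for k in range(1, n + 1)]

def augment(corpus):
    pairs = []
    for docid, text in corpus.items():
        for q in sample_spans(text) + fake_query_generator(text):
            pairs.append((q, docid))
    return pairs

corpus = {"doc-0": "generative retrieval maps queries to identifiers"}
print(len(augment(corpus)))  # → 4 (two spans plus two generated queries)
```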
Improving Ranking Capability. Additionally, a series of methods focus on further optimizing the ranking capability of GR models. Chen et al. [30] proposed a multi-task distillation method to improve retrieval quality without changing the model structure, thereby obtaining better indexing and ranking capabilities. Meanwhile, LTRGR [159] introduced a ranking loss to train the model to rank passages. Subsequently, GenRRL [365] improved ranking quality through reinforcement learning with relevance feedback, aligning token-level DocID generation with document-level relevance estimation. Moreover, DGR [161] enhances generative retrieval through knowledge distillation: it uses a cross-encoder as a teacher model to provide fine-grained passage ranking supervision signals, and then optimizes the model with a distilled RankNet loss. ListGR [280] defined positional conditional probabilities, emphasizing the importance of the generation order of each DocID in the list. In addition, ListGR employs relevance calibration, which adjusts the generated list of DocIDs to better align with the labeled ranking list. See Table 1 for a detailed comparison of GR methods.
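Many of these ranking objectives build on a pairwise margin loss over DocID scores. The following is a generic sketch in that spirit, not the exact loss of any one paper; the scores here are plain floats standing in for the model's sequence likelihoods:

```python
# Sketch of a pairwise margin ranking objective (assumption: pos_score and
# neg_score are the model's scalar scores for a relevant and an irrelevant
# DocID; real methods compute these from sequence likelihoods).

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Penalize the model when a relevant DocID is not scored at least
    `margin` higher than an irrelevant one."""
    return max(0.0, margin - (pos_score - neg_score))

print(margin_ranking_loss(2.5, 0.5))  # → 0.0 (already separated by the margin)
print(margin_ranking_loss(1.0, 0.8))  # → 0.8 (violates the margin)
```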
3.1.2 Model Structure. Basic generative retrieval models mostly use pre-trained encoder-decoder structured generative models, such as T5 [231] and BART [138], fine-tuned for the DocID generation task. To better adapt to the GR task, researchers have proposed a series of specifically designed model structures [130, 224, 237, 275, 307, 342, 346].
Model Decoding Methods. For the semantically structured DocIDs proposed by DSI [281], NCI [307] designed a Prefix-Aware Weight-Adaptive (PAWA) decoder. By adjusting the weights at different positions of DocIDs, this decoder can capture the semantic hierarchy of DocIDs. To allow the GR model to utilize both its own parametric knowledge and external information, NP-Decoding [130] proposed using non-parametric contextualized word embeddings (as external memory) instead of traditional word embeddings as the input to the decoder. Additionally, PAG [345] proposed a planning-ahead generation approach, which first decodes a set-based DocID to approximate document-level scores, and then continues to decode the sequence-based DocID on this basis.
Combining Generative and Dense Retrieval Methods. Combining seq2seq generative models with dual-encoder retrieval models, MEVI [346] utilizes Residual Quantization (RQ) [189] to organize documents into hierarchical clusters, enabling efficient retrieval of candidate clusters and precise document retrieval within those clusters. Similarly, Generative Dense Retrieval (GDR) [342] proposed to first broadly match queries to document clusters, optimizing for interaction depth and memory efficiency, and then perform precise, cluster-specific document retrieval, boosting both recall and scalability.
Utilizing Multiple Models. TOME [237] proposed to decompose the GR task into two stages, first generating text passages related to the query through an additional model, then using the GR model to generate the URL related to the passage. DiffusionRet [224] proposed to first use a diffusion model (SeqDiffuSeq [341]) to generate a pseudo-document from a query, where the generated pseudo-document is similar to real documents in length, format, and content, and rich in semantic information; it then employs another generative model to perform retrieval based on N-grams, similar to the process used by SEAL [13], leveraging an FM-Index [62] to generate N-grams found in the corpus. Self-Retrieval [275] fully integrates indexing, retrieval, and evaluation into a single large language model: it generates natural language indices and document segments, and performs self-evaluation to score and rank the generated documents.
3.2 Design of Document Identifiers
Another essential component of generative retrieval is document representation, also known as document identifiers (DocIDs), which act as the target outputs for the GR model. Accurate document representations are crucial as they enable the model to more effectively memorize document information, leading to enhanced retrieval performance. Table 1 provides a detailed comparison of the states, data types, and order of DocIDs across numerous GR methods. In the following sections, we will explore the design of DocIDs from two categories: numeric-based identifiers and text-based identifiers.
3.2.1 Numeric-based Identifiers. An intuitive method to represent documents is by using a single number or a series of numbers, referred to as DocIDs. Existing methods have designed both static and learnable DocIDs.
Static DocIDs. Initially, DSI [281] introduced three numeric DocID variants to represent documents: (1) Unstructured Atomic DocID: a unique integer identifier is randomly assigned to each document, containing no structural or semantic information. (2) Naively Structured String DocID: random integers are treated as divisible strings, enabling character-level DocID decoding to replace large softmax output layers. (3) Semantically Structured DocID: semantic structure is introduced through hierarchical $k$-means, allowing semantically similar documents to share prefixes in their identifiers and effectively reducing the search space. Concurrently, DynamicRetriever [370] also built a model-based IR system on unstructured atomic DocIDs. Subsequently, Ultron [371] encoded documents into a latent semantic space using BERT [121] and compressed the vectors into a smaller semantic space via Product Quantization (PQ) [73, 102], preserving semantic information; each document's PQ code serves as its semantic identifier. MEVI [346] clusters documents using Residual Quantization (RQ) [189] and utilizes dual-tower and seq2seq model embeddings for balanced performance in large-scale document retrieval.
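The prefix-sharing property of semantically structured DocIDs can be shown with a toy sketch. For clarity, a recursive median split over one-dimensional "embeddings" stands in for hierarchical $k$-means, but the effect is the same: semantically close documents end up sharing DocID prefixes.

```python
# Sketch of semantically structured DocIDs (assumption: documents come with
# one-dimensional embeddings, and a median split replaces hierarchical k-means).

def assign_docids(docs, prefix="", max_leaf=1):
    """Recursively split documents and extend the DocID prefix, so that
    semantically close documents share DocID prefixes."""
    if len(docs) <= max_leaf:
        return {name: prefix for name, _ in docs}
    docs = sorted(docs, key=lambda d: d[1])
    mid = len(docs) // 2
    ids = {}
    ids.update(assign_docids(docs[:mid], prefix + "0"))
    ids.update(assign_docids(docs[mid:], prefix + "1"))
    return ids

docs = [("cats", 0.1), ("dogs", 0.2), ("stocks", 0.8), ("bonds", 0.9)]
ids = assign_docids(docs)
print(ids)  # "cats" and "dogs" share a prefix; "stocks" and "bonds" share another
```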
Learnable DocIDs. Unlike the previous static DocIDs, GenRet [265] proposed learnable document representations: it transforms documents into DocIDs through an encoder, then reconstructs documents from DocIDs using a decoder, trained to minimize reconstruction error. It further employs progressive training and diversity clustering for optimization. To ensure that DocID embeddings reflect document content, Tied-Atomic [206] proposed to link document text with token embeddings and employs a contrastive loss for DocID generation. LMIndexer [112] and ASI [330] learn optimal DocIDs through semantic indexing, with LMIndexer using a reparameterization mechanism for unified optimization, facilitating efficient retrieval by aligning semantically similar documents under common DocIDs. ASI extends this by establishing an end-to-end retrieval framework, incorporating semantic loss functions and reparameterization to enable joint training.
Table 1. Comparisons of representative generative retrieval methods, focusing on document identifier, training data augmentation, and training objective.
| Method | Identifier State | Identifier Data Type | Identifier Order | Sample Doc | Doc2Query | Seq2seq DocID | Ranking |
|---|---|---|---|---|---|---|---|
| GENRE [18] | Static | Text | Sequence | | | | |
| DSI [281] | Static | Numeric | Sequence | | | | |
| DynamicRetriever [370] | Static | Numeric | Sequence | | | | |
| SEAL [13] | Static | Text | Sequence | | | | |
| DSI-QG [375] | Static | Numeric | Sequence | | | | |
| NCI [307] | Static | Numeric | Sequence | | | | |
| Ultron [371] | Static | Numeric/Text | Sequence | | | | |
| CorpusBrain [28] | Static | Text | Sequence | | | | |
| GenRet [265] | Learnable | Numeric | Sequence | | | | |
| AutoTSG [352] | Static | Text | Set | | | | |
| SE-DSI [278] | Static | Text | Sequence | | | | |
| Chen et al. [30] | Static | Numeric | Sequence | √ | | | |
| LLM-URL [376] | Static | Text | Sequence | | | | |
| MINDER [160] | Static | Text | Sequence | | | | |
| LTRGR [159] | Static | Text | Sequence | | | | |
| NOVO [311] | Learnable | Text | Set | | | | |
| GenRRL [365] | Static | Text | Sequence | √ | | | |
| LMIndexer [112] | Learnable | Numeric | Sequence | | | | |
| ASI [330] | Learnable | Numeric | Sequence | | | | |
| RIPOR [344] | Learnable | Numeric | Sequence | | | | |
| GLEN [135] | Learnable | Text | Sequence | | | | |
| DGR [161] | Static | Text | Sequence | | | | |
| ListGR [280] | Static | Numeric | Sequence | | | | |
Furthermore, RIPOR [344] treats the GR model as a dense encoder to encode document content, then splits these representations into vectors via RQ [189], creating unique DocID sequences. In addition, RIPOR implements a prefix-guided ranking optimization, increasing relevance scores for prefixes of pertinent DocIDs through a margin-decomposed pairwise loss during decoding.
In summary, numeric-based document representations can utilize the embeddings of dense retrievers, obtaining semantically meaningful DocID sequences through methods such as $k$-means, PQ [102], and RQ [189]; they can also combine encoder-decoder GR models with bi-encoder DR models to achieve complementary advantages [206, 346].
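As a toy illustration of PQ-style numeric DocIDs, the sketch below assigns each sub-vector of a document embedding to its nearest centroid and uses the resulting index sequence as the DocID. The codebooks are hand-picked assumptions for this example; in practice they are learned with $k$-means over the corpus:

```python
# Sketch of Product-Quantization-style DocIDs (assumption: subspace centroids
# are given; a real system learns them with k-means over document embeddings).

def pq_code(vec, codebooks):
    """Split the vector into one chunk per codebook and emit, for each chunk,
    the index of its nearest centroid; the index sequence is the DocID."""
    m = len(codebooks)
    chunk = len(vec) // m
    code = []
    for i, centroids in enumerate(codebooks):
        sub = vec[i * chunk:(i + 1) * chunk]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids]
        code.append(dists.index(min(dists)))
    return code

codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],  # centroids for the first two dimensions
    [(0.0, 1.0), (1.0, 0.0)],  # centroids for the last two dimensions
]
print(pq_code((0.9, 1.1, 0.1, 0.8), codebooks))  # → [1, 0]
```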
3.2.2 Text-based Identifiers. Text-based DocIDs have the inherent advantage of effectively leveraging the strong capabilities of pre-trained language models while offering better interpretability.
Document Titles. The most straightforward text-based identifier is the document title, which requires each title to uniquely represent a document in the corpus; otherwise, it would not be possible to accurately retrieve a specific document. The Wikipedia corpus used in the KILT [218] benchmark, thanks to its well-regulated manual annotation, has a unique title for each document. Thus, GENRE [18], using the title as the DocID and leveraging the generative model BART [138] with pre-built DocID prefixes, achieved superior retrieval performance across 11 datasets in KILT. Following GENRE, GERE [27], CorpusBrain [28], Re3val [257], and CorpusBrain++ [80] also based their work on title DocIDs for Wikipedia-based tasks. Notably, LLM-URL [376] directly generated URLs using ChatGPT prompts, achieving commendable performance after removing invalid URLs. However, in the web search scenario [205], document titles in the corpus often contain significant duplication and many meaningless titles, making it unfeasible to use titles alone as DocIDs. Ultron [371] effectively addressed this issue by combining URLs and titles as DocIDs, identifying documents through keywords in web page URLs and titles.
Sub-strings of Documents. To increase the flexibility of DocIDs, SEAL [13] proposed a substring identifier, representing documents with any N-grams within them. Using FM-Index (a compressed full-text sub-string index) [62], SEAL could generate N-grams present in the corpus to retrieve all documents containing those N-grams, scoring and ranking documents based on the frequency of N-grams in each document and the importance of N-grams. Following SEAL, various GR models [26, 159–161] also utilized sub-string DocIDs and FM-Index during inference. For a more comprehensive representation of documents, MINDER [160] proposed multi-view identifiers, including generated pseudo queries from document content via DocT5Query [208], titles, and sub-strings. This multi-view DocID was also used in LTRGR [159] and DGR [161].
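The scoring idea can be sketched without a real FM-Index: count weighted occurrences of the generated n-grams in each document. The n-gram weights below are illustrative stand-ins for the learned importance scores SEAL computes, and the plain per-document `count` replaces the compressed index:

```python
# Sketch of substring-identifier scoring (assumption: a plain per-document
# n-gram count stands in for the compressed FM-Index, and the weights stand
# in for learned n-gram importance).

def score_documents(ngrams, corpus):
    """Score each document by the weighted frequency of generated n-grams."""
    scores = {}
    for docid, text in corpus.items():
        s = 0.0
        for ngram, weight in ngrams:
            s += weight * text.count(ngram)
        if s > 0:
            scores[docid] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = {"d1": "the eiffel tower in paris", "d2": "paris is in france"}
ranked = score_documents([("paris", 2.0), ("eiffel tower", 3.0)], corpus)
print(ranked[0][0])  # → "d1"
```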
Term Sets. Unlike the sequential DocIDs described above, AutoTSG [352] proposed a term-set-based document representation using keywords extracted from titles and content rather than predefined sequences, allowing the target document to be retrieved as long as the generated term set is contained in the extracted keywords. Recently, PAG [345] also constructed DocIDs based on sets of key terms, disregarding term order, which is used to approximate document relevance during decoding.
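Order-free term-set matching reduces to set containment, as the following toy sketch shows (the pre-extracted `keyword_sets` are illustrative assumptions):

```python
# Sketch of term-set DocIDs (assumption: each document's keyword set was
# extracted in advance; order-free containment is the retrieval criterion).

def term_set_retrieve(generated_terms, keyword_sets):
    """A document matches when the generated term set is contained in its
    extracted keyword set, regardless of generation order."""
    query = set(generated_terms)
    return [docid for docid, kws in keyword_sets.items() if query <= kws]

keyword_sets = {
    "d1": {"generative", "retrieval", "survey"},
    "d2": {"dense", "retrieval", "bert"},
}
print(term_set_retrieve(["retrieval", "generative"], keyword_sets))  # → ["d1"]
```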
Learnable DocIDs. Text-based identifiers can also be learnable. Also based on term sets, NOVO [311] proposed learnable continuous N-grams constituting term-set DocIDs. Through denoising query modeling, the model learns to generate queries from documents with noise, thereby implicitly learning to filter out the document N-grams most relevant to queries. NOVO also improves the document's semantic representation by updating N-gram embeddings. Later, GLEN [135] uses dynamic lexical DocIDs and follows a two-phase index learning strategy: first, it assigns DocIDs by extracting keywords from documents using self-supervised signals; then, it refines DocIDs by integrating query-document relevance through two loss functions. During inference, GLEN ranks documents using DocID weights without additional overhead.
3.3 Incremental Learning on Dynamic Corpora
Prior studies have focused on generative retrieval from static document corpora. However, in reality, the documents available for retrieval are continuously updated and expanded. To address this challenge, researchers have developed a range of methods to optimize GR models for adapting to dynamic corpora.
Optimizer and Document Rehearsal. DSI++ [192] was the first to address the incremental learning challenges encountered by DSI [281]. DSI++ modifies training by optimizing for flat loss basins through the Sharpness-Aware Minimization (SAM) optimizer, stabilizing the model's learning process. It also employs DocT5Query [208] to generate pseudo queries for documents in the existing corpus as training data augmentation, mitigating the forgetting issue of GR models.
Constrained Optimization. Addressing scenarios where new documents are added in real time, such as news or scientific literature IR systems, IncDSI [124] views the addition of new documents as a constrained optimization problem that finds optimal representations for the new documents. This approach aims to (1) ensure new documents can be correctly retrieved by their relevant queries, and (2) keep the retrieval performance of existing documents unaffected.
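As a rough illustration of this constrained-optimization view, the sketch below represents each DocID as a vector, inserts a new document by placing its vector where its training queries point, and leaves existing vectors untouched. This mean-of-queries placement is a crude stand-in for IncDSI's actual optimization, and all of the vectors are toy assumptions:

```python
# Sketch of adding a document without retraining (assumption: DocIDs are
# classification vectors and retrieval is a nearest-vector lookup; the mean
# of training-query embeddings stands in for the constrained optimization).

def retrieve(query_vec, doc_vecs):
    """Return the DocID whose vector has the largest dot product with the query."""
    return max(doc_vecs,
               key=lambda d: sum(q * v for q, v in zip(query_vec, doc_vecs[d])))

doc_vecs = {"old-1": (1.0, 0.0), "old-2": (0.0, 1.0)}
new_queries = [(0.8, 0.9), (1.0, 1.1)]
# Insert the new document where its queries point, leaving old vectors untouched.
doc_vecs["new-1"] = tuple(sum(q[i] for q in new_queries) / len(new_queries)
                          for i in range(2))
print(retrieve((1.0, 1.0), doc_vecs))  # → "new-1" (its queries retrieve it)
print(retrieve((1.0, 0.0), doc_vecs))  # → "old-1" (old documents unaffected)
```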
Incremental Product Quantization. CLEVER [25], based on Product Quantization (PQ) [102], proposes Incremental Product Quantization (IPQ) for generating PQ codes as DocIDs for documents. Compared to traditional PQ methods, IPQ designs two adaptive thresholds to update only a subset of centroids instead of all, maintaining the indices of updated centroids constant. This method reduces computational costs and allows the system to adapt flexibly to new documents.
Fine-tuning Adapters for Specific Tasks. CorpusBrain++ [80] introduces the KILT++ benchmark for continuously updated KILT [218] tasks and designs a dynamic architecture paradigm with a backbone-adapter structure. By fixing a shared backbone model to provide basic retrieval capabilities while introducing task-specific adapters to incrementally learn new documents for each task, it effectively avoids catastrophic forgetting. During training, CorpusBrain++ generates pseudo queries for new document sets and continually pre-trains the adapters for specific tasks.
3.4 Downstream Task Adaptation
Generative retrieval methods, apart from addressing retrieval tasks individually, have been tailored to various downstream generative tasks. These include fact verification [284], entity linking [86], open-domain QA [126], dialogue [51], and slot filling [137], as well as knowledge-intensive tasks [218], code retrieval [179], conversational QA [3], and multi-modal retrieval scenarios [165], demonstrating superior performance and efficiency. These methods are discussed below in terms of separate training, joint training, and multi-modal generative retrieval.
3.4.1 Separate Training. For fact verification tasks [284], which involve determining the correctness of input claims, GERE [27] proposed using an encoder-decoder-based GR model to replace traditional indexing-based methods. Specifically, GERE first utilizes a claim encoder to encode input claims, and then generates document titles related to the claim through a title decoder to obtain candidate sentences for corresponding documents.
Knowledge-Intensive Language Tasks. For Knowledge-Intensive Language Tasks (KILT) [218], CorpusBrain [28] introduced three pre-training tasks to enhance the model's understanding of query-document relationships at various granularities: Internal Sentence Selection, Leading Paragraph Selection, and Hyperlink Identifier Prediction. Similarly, UGR [26] proposed using N-gram DocIDs of different granularities to adapt to various downstream tasks, unifying different retrieval tasks into a single generative form. UGR achieves this by letting the GR model learn task-specific prompts, generating corresponding document, passage, sentence, or entity identifiers.
Furthermore, DearDR [283] utilizes distant supervision and self-supervised learning techniques, using Wikipedia page titles and hyperlinks as training data. The model samples sentences from Wikipedia documents as input and trains an autoregressive model to decode page titles, hyperlinks, or both, without the need for manually labeled data. Re3val [257] proposes a retrieval framework combining generative reranking and reinforcement learning: it first reranks retrieved page titles using context information obtained from a dense retriever, then optimizes the reranking using the REINFORCE algorithm to maximize rewards generated by constrained decoding.
Multi-hop retrieval. Multi-hop retrieval tasks require iterative document retrievals to gather adequate evidence for answering a query. GMR [131] employs language model memorization and multi-hop memorization to train a generative retrieval model: the model memorizes the target corpus and simulates real retrieval scenarios through constructed pseudo multi-hop query data, achieving dynamic stopping and efficient performance on multi-hop retrieval tasks.
多跳检索。在多跳检索任务中,需要通过迭代文档检索来收集足够的证据以回答查询,GMR [131] 提出利用语言模型记忆和多跳记忆训练生成式检索模型,使其能够记忆目标语料库,并通过构建伪多跳查询数据模拟真实检索场景,从而在多跳检索任务中实现动态停止和高效性能。
Code Retrieval. CodeDSI [203] is an end-to-end generative code search method that directly maps queries to pre-stored code samples’ DocIDs instead of generating new code. Similar to DSI [281], it includes indexing and retrieval stages, learning to map code samples and real queries to their respective DocIDs. CodeDSI explores different DocID representation strategies, including direct and clustered representation, as well as numerical and character representations.
代码检索。CodeDSI [203] 是一种端到端的生成式代码搜索方法,它直接将查询映射到预存储代码样本的 DocID,而非生成新代码。与 DSI [281] 类似,它包含索引和检索两个阶段,学习将代码样本和真实查询映射到各自的 DocID。CodeDSI 探索了不同的 DocID 表示策略,包括直接表示和聚类表示,以及数值和字符表示。
Conversational Question Answering. GCoQA [158] is a generative retrieval method for conversational QA systems that directly generates DocIDs for passage retrieval. This method focuses on key information in the dialogue context at each decoding step, achieving more precise and efficient passage retrieval and answer generation, thereby improving retrieval performance and overall system efficiency.
会话问答。GCoQA [158] 是一种用于会话问答系统的生成式检索方法,可直接生成用于段落检索的 DocID。该方法在每一步解码时聚焦对话上下文中的关键信息,实现更精准高效的段落检索与答案生成,从而提升检索性能及系统整体效率。
3.4.2 Joint Training. The methods in the previous section involve separately training generative retrievers and downstream task generators. However, due to the inherent nature of GR models as generative models, a natural advantage lies in their ability to be jointly trained with downstream generators to obtain a unified model for retrieval and generation tasks.
3.4.2 联合训练。上一节的方法涉及分别训练生成式检索器(Generative Retriever)和下游任务生成器。然而,由于GR模型作为生成式模型的固有特性,其天然优势在于能够与下游生成器进行联合训练,从而获得一个适用于检索和生成任务的统一模型。
Multi-decoder Structure. UniGen [155] proposes a unified generation framework to integrate retrieval and question answering tasks, bridging the gap between query input and generation targets using connectors generated by large language models. UniGen employs shared encoders and task-specific decoders for retrieval and question answering, introducing iterative enhancement strategies to continuously improve the performance of both tasks.
多解码器结构。UniGen [155] 提出了一种统一的生成框架,用于整合检索和问答任务,通过大语言模型生成的连接器弥合查询输入与生成目标之间的差距。UniGen 采用共享编码器和任务专用解码器分别处理检索与问答,并引入迭代增强策略持续提升两项任务的性能。
Multi-task Training. Later, CorpusLM [152] introduces a unified language model that integrates GR, closed-book generation, and retrieval-augmented generation to handle various knowledge-intensive tasks. The model adopts a multi-task learning approach and introduces ranking-guided DocID decoding strategies and continuous generation strategies to improve retrieval and generation performance. In addition, CorpusLM designs a series of auxiliary DocID understanding tasks to deepen the model’s understanding of DocID semantics.
多任务训练。随后,CorpusLM [152] 提出了一种统一的大语言模型,整合了生成式检索 (GR)、闭卷生成和检索增强生成,以处理各种知识密集型任务。该模型采用多任务学习方法,并引入了基于排名的DocID解码策略和连续生成策略,以提升检索和生成性能。此外,CorpusLM还设计了一系列辅助DocID理解任务,以加深模型对DocID语义的理解。
3.4.3 Multi-modal Generative Retrieval. Generative retrieval methods can also leverage multimodal data such as text, images, etc., to achieve end-to-end multi-modal retrieval.
3.4.3 多模态生成式检索。生成式检索方法同样可利用文本、图像等多模态数据,实现端到端的多模态检索。
Tokenizing Images to DocID Sequences. IRGen [357] transforms the image retrieval problem into a generation problem, predicting relevant discrete visual tokens, i.e., image identifiers, through a seq2seq model given a query image. IRGen proposes a semantic image tokenizer, which converts global image features into short sequences that capture high-level semantic information.
将图像转换为DocID序列的Token化方法。首先,IRGen [357] 将图像检索问题转化为生成式问题,通过一个seq2seq模型在给定查询图像的情况下预测相关的离散视觉token(即图像标识符)。IRGen提出了一种语义图像token化器,能够将全局图像特征转换为捕捉高层语义信息的短序列。
Advanced Model Training and Structure. Later, GeMKR [178] combines LLMs’ generation capabilities with visual-text features, designing a generative knowledge retrieval framework. It first guides multi-granularity visual learning using object-aware prefix tuning techniques to align visual features with LLMs’ text feature space, achieving cross-modal interaction. GeMKR then employs a two-step retrieval process: generating knowledge clues closely related to the query and then retrieving corresponding documents based on these clues. GRACE [178] achieves generative cross-modal retrieval by assigning unique identifier strings to images and training multimodal large language models (MLLMs) [7] to memorize the association between images and their identifiers. The training process includes (1) learning to memorize images and their corresponding identifiers, and (2) learning to generate the target image identifiers from textual queries. GRACE explores various types of image identifiers, including strings, numbers, semantic and atomic identifiers, to adapt to different memory and retrieval requirements.
高级模型训练与结构。随后,GeMKR [178] 将大语言模型的生成能力与视觉-文本特征相结合,设计了一个生成式知识检索框架。它首先利用对象感知前缀调优技术引导多粒度视觉学习,将视觉特征与大语言模型的文本特征空间对齐,实现跨模态交互。GeMKR 随后采用两步检索流程:生成与查询密切相关的知识线索,然后根据这些线索检索对应文档。GRACE [178] 通过为图像分配唯一标识符字符串并训练多模态大语言模型 (MLLMs) [7] 来记忆图像与其标识符的关联,实现了生成式跨模态检索方法。训练过程包括:(1) 学习记忆图像及其对应标识符,(2) 学习从文本查询生成目标图像标识符。GRACE 探索了多种类型的图像标识符,包括字符串、数字、语义和原子标识符,以适应不同的记忆和检索需求。
3.4.4 Generative Recommender Systems. Recommendation systems, as an integral part of information retrieval, are currently undergoing a paradigm shift from discriminative models to generative models. Generative recommendation systems do not require computing a ranking score for each item followed by database indexing, but instead accomplish item recommendations through the direct generation of IDs. In this section, several seminal works, including P5 [74], GPT4Rec [146], TIGER [233], SEATER [254], IDGenRec [273], LC-Rec [360] and ColaRec [309], are summarized to outline the development trends in generative recommendations.
3.4.4 生成式推荐系统
推荐系统作为信息检索的重要组成部分,当前正经历从判别式模型向生成式模型的范式转变。生成式推荐系统无需计算每个物品的排序分数再进行数据库索引,而是通过直接生成ID来完成物品推荐。本节总结了P5 [74]、GPT4Rec [146]、TIGER [233]、SEATER [254]、IDGenRec [273]、LC-Rec [360]和ColaRec [309]等开创性工作,以概述生成式推荐的发展趋势。
P5 [74] transforms various recommendation tasks into different natural language sequences, designing a universal, shared framework for completing recommendations. By setting unique training objectives, prompts, and prediction paradigms for the downstream tasks of each recommendation domain, it serves well as a backbone model, accomplishing various recommendation tasks through generated text. In generative retrieval, effective indexing identifiers have been proven to significantly enhance the performance of generative methods. Similarly, TIGER [233] first learns a residual-quantized autoencoder to generate semantically informative indexing identifiers for different items. It then trains a Transformer-based encoder-decoder model on these semantic identifier sequences to generate item identifiers, recommending the next item based on historical sequences.
P5 [74] 将各类推荐任务转化为不同的自然语言序列,设计了一个通用的共享框架来完成推荐。该方法通过为每个推荐领域的下游任务设置独特的训练目标、提示和预测范式,很好地充当了骨干模型,通过生成文本来完成各种推荐任务。在生成式检索中,有效的索引标识符已被证明能显著提升生成方法的性能。类似地,TIGER [233] 首先学习残差量化自编码器,为不同物品生成具有语义信息的索引标识符,随后利用这些语义丰富的索引标识符序列训练基于Transformer的编码器-解码器模型,从而根据历史序列生成物品标识符来推荐下一个物品。
Focusing solely on semantic information and overlooking the collaborative filtering information in the recommendation context might limit the further development of generative models. Therefore, after generating semantic indexing identifiers similar to TIGER using a residual-quantized autoencoder with uniform semantic mapping, LC-Rec [360] also engages in a series of alignment tasks, including sequential item prediction, explicit index-language alignment, and recommendation-oriented implicit alignment. Based on the learned item identifiers, it integrates semantic and collaborative information, enabling large language models to better adapt to sequence recommendation tasks.
仅关注语义信息而忽略推荐场景下的协同过滤信息,可能会限制生成模型的进一步发展。因此,在通过具有统一语义映射的残差量化自编码器生成类似TIGER的语义索引标识符后,LC-Rec [360]还进行了一系列对齐任务,包括序列物品预测、显式索引-语言对齐以及面向推荐的隐式对齐。基于学习到的物品标识符,它整合了语义和协同信息,使大语言模型能更好地适应序列推荐任务。
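The residual quantization step that TIGER-style methods use to turn an item embedding into a short sequence of semantic codes can be sketched as follows. The codebooks here are random stand-ins; in practice they are learned jointly with an RQ-VAE.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Quantize an embedding level by level: each codebook encodes the
    residual left over by the previous levels, yielding one code per level."""
    residual = np.asarray(embedding, dtype=float).copy()
    codes = []
    for codebook in codebooks:                        # codebook shape: (K, d)
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]           # pass on what is left
    return codes

rng = np.random.default_rng(0)
d, K, levels = 8, 16, 3
codebooks = [rng.normal(size=(K, d)) for _ in range(levels)]
item_embedding = rng.normal(size=d)

codes = residual_quantize(item_embedding, codebooks)
print(codes)  # a 3-code semantic identifier, each code in [0, 16)
```

The resulting code sequence serves as the item's DocID: similar items share prefixes because their early residuals fall near the same codebook entries.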
IDGenRec [273] innovatively combines generative recommendation systems with large language models by using human language tokens to generate unique, concise, semantically rich and platform-agnostic textual identifiers for recommended items. The framework includes a text ID generator trained on item metadata with a diversified ID generation algorithm, and an alternating training strategy that optimizes both the ID generator and the LLM-based recommendation model for improved performance and accuracy in sequential recommendations. SEATER [254] designs balanced $k$-ary tree-structured indexes, using a constrained $k$-means clustering method to recursively cluster vectors encoded from item texts, obtaining equal-length identifiers. Compared to the method proposed by DSI [281], this balanced $k$-ary tree index maintains semantic consistency at every level. It then trains a Transformer-based encoder-decoder model and enhances the semantics of each level of indexing through contrastive learning and multi-task learning. ColaRec [309] integrates collaborative filtering signals and content information by deriving generative item identifiers from a pretrained recommendation model and representing users via aggregated item content. Then it uses an item indexing generation loss and contrastive loss to align content-based semantic spaces with collaborative interaction spaces, enhancing the model’s ability to recommend items in an end-to-end framework.
IDGenRec [273] 创新性地将生成式推荐系统与大语言模型相结合,通过使用人类语言token为推荐项生成独特、简洁、语义丰富且与平台无关的文本标识符。该框架包含一个基于物品元数据训练的文本ID生成器(采用多样化ID生成算法),以及一种交替训练策略,可同步优化ID生成器和基于大语言模型的推荐模型,从而提升序列推荐的性能和准确性。SEATER [254] 设计了平衡的 $\mathbf{k}$ 叉树结构索引,采用约束型 $\mathbf{k}$ 均值聚类方法递归聚类从物品文本编码得到的向量,生成等长标识符。与DSI [281] 提出的方法相比,这种平衡 $\mathbf{k}$ 叉树索引在每一层级都保持语义一致性。随后训练基于Transformer的编码器-解码器模型,并通过对比学习和多任务学习增强各层级索引的语义。ColaRec [309] 通过从预训练推荐模型中导出生成式物品标识符,并基于聚合物品内容表示用户,从而整合协同过滤信号与内容信息。接着使用物品索引生成损失和对比损失,将基于内容的语义空间与协同交互空间对齐,增强模型在端到端框架中的物品推荐能力。
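A toy version of SEATER-style balanced identifier assignment is sketched below. For brevity it replaces constrained $k$-means with a simple sort-and-split on a rotating coordinate (an assumption, not SEATER's actual clustering); the point is that recursive equal-size partitioning gives every item a unique, equal-length identifier.

```python
import numpy as np

def balanced_ids(items, k, depth, axis=0):
    """Recursively split items into k near-equal groups, assigning one digit
    in [0, k) per level, so every item ends up with an equal-length ID.
    items: list of (original_index, vector) pairs."""
    if depth == 0:
        return {idx: [] for idx, _ in items}
    dim = len(items[0][1])
    order = sorted(items, key=lambda pair: pair[1][axis])
    size = (len(order) + k - 1) // k               # near-equal group sizes
    ids = {}
    for g in range(k):
        group = order[g * size:(g + 1) * size]
        if not group:
            continue
        for idx, digits in balanced_ids(group, k, depth - 1,
                                        axis=(axis + 1) % dim).items():
            ids[idx] = [g] + digits
    return ids

rng = np.random.default_rng(1)
vectors = rng.normal(size=(8, 4))                  # 8 items, 4-dim embeddings
ids = balanced_ids(list(enumerate(vectors)), k=2, depth=3)
print(ids)  # 8 unique 3-digit identifiers, one per item
```

Because each level partitions the items handed down by its parent, the identifier digits form a path in a balanced $k$-ary tree, which is what the decoder later generates digit by digit.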
4 RELIABLE RESPONSE GENERATION: DIRECT INFORMATION ACCESSING WITH GENERATIVE LANGUAGE MODELS
4 可靠的响应生成:利用生成式语言模型直接获取信息
The rapid advancement of large language models has positioned them as a novel form of IR system, capable of generating reliable responses directly aligned with users’ informational needs. This not only saves the time users would otherwise spend on collecting and integrating information but also provides personalized, user-centric answers tailored to individual users.
大语言模型的快速发展使其成为一种新型的信息检索(IR)系统,能够直接生成与用户信息需求高度匹配的可靠响应。这不仅节省了用户原本需要花费在收集和整合信息上的时间,还能提供针对个体用户量身定制的个性化、以用户为中心的答案。
However, challenges remain in creating a grounded system that delivers faithful answers, such as hallucination, prolonged inference time, and high operational costs. This section will outline strategies for constructing a faithful GenIR system by: (1) Optimizing the GenIR model internally, (2) Enhancing the model with external knowledge, (3) Increasing accountability, and (4) Developing personalized information assistants.
然而,构建一个能提供真实答案的可靠系统仍面临诸多挑战,例如幻觉问题、推理时间过长以及高昂的运营成本。本节将概述构建可信生成式信息检索(Generative IR)系统的策略:(1) 内部优化GenIR模型,(2) 通过外部知识增强模型,(3) 提高问责机制,(4) 开发个性化信息助手。
4.1 Internal Knowledge Memorization
4.1 内部知识记忆
To develop a user-friendly and reliable IR system, the generative model should be equipped with comprehensive internal knowledge. Optimization of the backbone generative model can be categorized into three aspects: structural enhancements, training strategies, and inference techniques. The overview of this section is shown in the green part of Figure 5.
为开发用户友好且可靠的信息检索(IR)系统,生成式模型需具备全面的内部知识。主干生成模型的优化可分为三个层面:结构增强、训练策略和推理技术。本节概述如图5绿色部分所示。
4.1.1 Model Structure. With the advent of generative models, various methods have been introduced to improve model structure and enhance generative reliability. We aim to discuss the crucial technologies contributing to this advancement in this subsection.
4.1.1 模型结构
随着生成式模型(Generative Model)的出现,各种改进模型结构并增强生成可靠性的方法被提出。本节将重点讨论推动这一进步的关键技术。
Fig. 5. An illustration of strategies for enhancing language models to generate user-centric and reliable responses, including model internal knowledge memorization and external knowledge augmentation.
图 5: 提升语言模型生成以用户为中心且可靠回答的策略示意图,包括模型内部知识记忆和外部知识增强。
(1) Model Scaling. Model parameter scaling is a pivotal factor influencing performance. Contemporary language models predominantly employ the Transformer architecture, and scaling both the model parameters and the training data enhances the model’s capacity to retain knowledge and capabilities [116]. For instance, in the GPT [2, 16, 228, 229] series and LLaMA [285, 286] family, larger models tend to perform better on diverse downstream tasks, including few-shot learning, language understanding, and generation [34]. Additionally, scaling the model contributes to improved instruction-following capabilities [227], enabling a more adept comprehension of user intent and generating responses that better satisfy user requests.
(1) 模型扩展
模型参数扩展是影响性能的关键因素。当代语言模型主要采用Transformer架构,扩展模型参数和训练数据能增强模型保留知识和能力的能力[116]。例如,在GPT[2,16,228,229]系列和LLaMA[285,286]家族中,更大的模型往往在多种下游任务中表现更优,包括少样本学习、语言理解和生成[34]。此外,扩展模型有助于提升指令跟随能力[227],使模型更擅长理解用户意图并生成更符合用户需求的响应。
(2) Model Integration. Model integration is an effective method to enhance the reliability of generated outputs by capitalizing on the diverse strengths of various models. The predominant approach is the Mixture of Experts (MoE) [96], which utilizes a gating mechanism to selectively activate sections of network parameters during inference, greatly increasing the effective parameters without inflating inference costs [58, 61, 106, 136]. This method also boasts impressive scalability, with efficacy growing alongside the parameter volume and the number of expert models [38]. Alternatively, the LLM-Blender framework [107] employs a ranker and a fuser to combine answers from various LLMs, including black-box models, but faces high deployment costs.
(2) 模型集成
模型集成是一种通过利用不同模型的多样化优势来增强生成输出可靠性的有效方法。主流方法是混合专家模型 (Mixture of Experts, MoE) [96],它采用门控机制在推理时选择性激活部分网络参数,大幅提升有效参数量而不增加推理成本 [58, 61, 106, 136]。该方法还具有出色的可扩展性,其效能随着参数量与专家模型数量的增加而提升 [38]。另一种方案 LLM-Blender 框架 [107] 通过排序器和融合器整合包括黑盒模型在内的多种大语言模型输出,但面临较高的部署成本。
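The MoE gating idea can be sketched in a few lines: a softmax gate picks the top-k experts for each input, and only those experts run, so effective capacity grows with the expert count while per-token compute stays roughly constant. The shapes, the random gate, and the tanh experts below are all illustrative.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Sparse MoE layer: a softmax gate selects top_k experts; only those run,
    and their outputs are mixed by the renormalized gate weights."""
    logits = gate_w @ x                              # one logit per expert
    chosen = np.argsort(logits)[-top_k:]             # indices of top-k experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                         # renormalize over top-k
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
gate_w = rng.normal(size=(n_experts, d))
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: np.tanh(W @ x) for W in expert_weights]

y = moe_forward(rng.normal(size=d), gate_w, experts, top_k=2)
print(y.shape)  # (4,): same output shape as a dense layer, at 2/8 the compute
```

Real MoE layers add load-balancing losses so the gate does not collapse onto a few experts, a detail omitted here.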
4.1.2 Training and Inference. In the model training stage, methods to enhance the reliability of answers can be categorized into two aspects: training data optimization and training methods optimization.
4.1.2 训练与推理。在模型训练阶段,提升答案可靠性的方法可分为两方面:训练数据优化和训练方法优化。
(1) Training Data Optimization. The quality of training data substantially affects the reliability of model outputs. Noise, misinformation, and incomplete information can disrupt the learning process, leading to hallucinations and other issues. To address this, [79] used GPT-3.5 to artificially create textbooks filled with examples and language descriptions as training data, resulting in significant improvements on downstream tasks after minor fine-tuning. LIMA [363] used dialogues from community forums to construct a small-scale fine-tuning dataset, enhancing the model’s conversation capabilities during the alignment phase. To reduce redundancies in crawled internet data, Lee et al. [132] combined suffix arrays [188] and MinHash [15] to approximately match and deduplicate the training dataset, reducing direct reproduction from the same source.
(1) 训练数据优化
训练数据的质量显著影响模型输出的可靠性。噪声数据、错误信息和不完整信息会干扰学习过程,导致幻觉等问题。为解决这一问题,[79] 使用 GPT-3.5 人工生成包含示例和语言描述的教科书作为训练数据,经过少量微调后在下游任务中取得了显著提升。LIMA [363] 采用社区论坛的对话构建小规模微调数据集,在对齐阶段增强了模型的对话能力。为减少网络爬取数据的冗余,Lee 等人 [132] 结合后缀数组 [188] 和 MinHash [15] 实现近似匹配与训练数据集去重,降低了同一来源的直接复现。
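The MinHash side of such deduplication can be sketched as follows: near-identical texts share most of their minimum hash values, so signature overlap approximates Jaccard similarity between documents. The shingle length, hash count, and use of md5 are illustrative choices, not those of Lee et al.

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_len=5):
    """MinHash over character shingles: for each seeded hash, record the
    minimum hash value over all shingles of the text."""
    shingles = {text[i:i + shingle_len]
                for i in range(len(text) - shingle_len + 1)}
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles)
            for seed in range(num_hashes)]

def similarity(a, b):
    """Fraction of matching signature positions ~ Jaccard similarity."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

doc = "the quick brown fox jumps over the lazy dog"
near_dup = "the quick brown fox jumps over the lazy cat"
other = "completely unrelated text about language models"

print(similarity(doc, near_dup))  # high: flag as near-duplicate
print(similarity(doc, other))     # low: keep both documents
```

A deduplication pipeline would bucket signatures with locality-sensitive hashing and drop one member of each high-similarity pair, avoiding the quadratic all-pairs comparison shown here.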
(2) Training Methods Optimization. Beyond conventional training methods, additional techniques have been proposed to improve the factuality of model outputs. MixCL [264] incorporates contrastive learning into the training objective, using an external knowledge base to identify correct snippets and reduce the probability of generating incorrect tokens, thus enhancing model reliability. CaliNet [56] utilizes a contrastive method to assess erroneous knowledge learned by the model and fine-tunes the parameters of the FFN layer to rectify these errors. FactTune [211] incorporates factuality assessment during the RLHF phase, using automatic evaluation methods like FactScore [198] to rank outputs and employing DPO [230] to teach the model factuality preference.
(2) 训练方法优化
除常规训练方法外,研究者还提出了多种提升模型输出事实性的技术。MixCL [264] 将对比学习融入训练目标,通过外部知识库识别正确片段并降低错误token生成概率,从而增强模型可靠性。CaliNet [56] 采用对比方法评估模型习得的错误知识,并通过微调FFN层参数进行修正。FactTune [211] 在RLHF阶段引入事实性评估,使用FactScore [198] 等自动评估方法对输出排序,并采用DPO [230] 教导模型事实性偏好。
Apart from enhancing the internal knowledge reliability during training, the inference stage significantly impacts the reliability of answers. The overall inference process consists of user input and the model’s token decoding, and approaches to increase generation reliability can be divided into prompt engineering and decoding strategy.
除了提升训练过程中内部知识的可靠性,推理阶段对答案的可靠性也有显著影响。整体推理流程包含用户输入和模型的Token解码,提高生成可靠性的方法可分为提示工程(prompt engineering)和解码策略(decoding strategy)两大类。
(3) Prompt Engineering. Prompting methods play a vital role in guiding the model. A well-designed prompt can better promote the model’s internal capabilities to provide more accurate answers. The Chain-of-Thought (CoT) [313] prompting method guides the model to explicitly decompose the question into a reasoning chain during decoding, improving response accuracy by grounding the final answer on accurate intermediate steps. Further, CoT-SC [306] samples multiple answers and chooses the most consistent one as the final answer. The Tree of Thoughts [332] expands CoT’s single reasoning path to multiple paths, synthesizing their outcomes to arrive at the final answer. The Chain-of-Verification (CoVE) [49] introduces a self-reflection mechanism where the LLM generates a draft response, then validates each statement for factual inaccuracies, correcting errors to enhance factual accuracy. Additionally, methods like RECITE [268] and GenRead [339] prompt the model to output relevant internal knowledge fragments, which are then used to bolster the question-answering process.
(3) 提示工程 (Prompt Engineering)
提示方法在引导模型方面起着至关重要的作用。精心设计的提示可以更好地激发模型的内部能力,从而提供更准确的答案。思维链 (Chain-of-Thought, CoT) [313] 提示方法引导模型在解码过程中将问题显式分解为推理链,通过基于准确的中间步骤得出最终答案,从而提高响应准确性。进一步地,CoT-SC [306] 对多个答案进行采样,并选择最一致的答案作为最终结果。思维树 (Tree of Thoughts) [332] 将 CoT 的单一推理路径扩展为多条路径,综合其结果以得出最终答案。验证链 (Chain-of-Verification, CoVE) [49] 引入了自反思机制,大语言模型先生成草稿响应,然后验证每个陈述的事实准确性,纠正错误以提高事实准确性。此外,RECITE [268] 和 GenRead [339] 等方法提示模型输出相关的内部知识片段,随后用于增强问答过程。
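The self-consistency step of CoT-SC reduces to a majority vote over the final answers extracted from independently sampled reasoning chains, which can be sketched in a few lines (the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(sampled_answers):
    """CoT-SC in miniature: keep only the final answers of several sampled
    reasoning chains and return the most frequent one."""
    answer, _count = Counter(sampled_answers).most_common(1)[0]
    return answer

# Final answers extracted from 5 hypothetical sampled reasoning chains:
samples = ["42", "42", "41", "42", "40"]
print(self_consistency(samples))  # "42"
```

The vote is over answers, not chains: two chains with different intermediate reasoning but the same conclusion reinforce each other, which is the source of the method's robustness.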
(4) Decoding Strategy. Decoding strategies are another critical factor influencing the reliability of model-generated responses. An appropriate decoding method can maintain the reliability and diversity of a model’s response. Nucleus Sampling [133] samples from the smallest set of top-ranked tokens whose cumulative probability exceeds a threshold, balancing diversity and reliability. Building on this, Factual-Nucleus Sampling [134] employs a dynamic, decaying threshold for token sampling, ensuring later tokens are not influenced by earlier less factual tokens. Wan et al. [292] proposed a faithfulness-aware decoding method to enhance the faithfulness of the beam-search approach by incorporating a Ranker to reorder generated sequences and a lookahead method to avoid unfaithful tokens.
(4) 解码策略
解码策略是影响模型生成响应可靠性的另一个关键因素。适当的解码方法可以保持模型响应的可靠性和多样性。Nucleus Sampling [133] 在token的设定概率范围内进行采样,在平衡多样性与可靠性的同时确保更好的多样性。在此基础上,FactualNucleus Sampling [134] 采用动态衰减阈值进行token采样,确保后续token不受先前低事实性token的影响。Wan等人 [292] 提出了一种基于忠实性的解码方法,通过引入Ranker对生成序列重新排序,并采用前瞻性方法避免不忠实token,从而提升beam-search方法的忠实性。
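A small numpy sketch of nucleus filtering, plus a factual-nucleus-style decaying threshold; the schedule constants are illustrative, not those of [134]:

```python
import numpy as np

def nucleus_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches p; zero out the rest and renormalize."""
    order = np.argsort(probs)[::-1]                # tokens by descending prob
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum, p)) + 1     # how many tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def decayed_p(p0, step, decay=0.9, p_min=0.3):
    """Factual-nucleus-style schedule: tighten p within a sentence so later
    tokens are sampled more conservatively."""
    return max(p_min, p0 * decay ** step)

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(nucleus_filter(probs, p=0.75))  # keeps only the top-2 tokens
print(decayed_p(0.9, step=5))         # threshold after 5 tokens
```

Sampling then proceeds from the filtered distribution; lowering `p` over the course of a sentence trades diversity for factual stability exactly where hallucinations tend to compound.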
Apart from directly modifying the decoding method, several studies influence the decoding distribution by leveraging hidden layer information. DoLa [37] uses distributional differences between earlier (premature) layers and the final (mature) layer to prioritize newly learned factual knowledge or key terms, increasing their generation likelihood. Inference-Time Intervention (ITI) [147] identifies attention heads strongly correlated with response correctness, adjusts their orientations, and moderates their activation, achieving more truthful generation with minimal model interference. Shi et al. [251] proposed CAD, comparing output distributions before and after adding extra information, reducing reliance on the model’s own knowledge to avoid conflicts leading to inaccuracies.
除了直接修改解码方法外,多项研究通过利用隐藏层信息来影响解码分布。DoLa [37] 利用隐藏层与输出层之间的分布差异,优先处理新学习的事实知识或关键术语,从而提高其生成概率。推理时干预 (ITI) [147] 识别与回答正确性高度相关的注意力头,调整其方向并调节激活强度,以最小化模型干扰实现更真实的生成。Shi等 [251] 提出CAD方法,通过比较添加额外信息前后的输出分布,减少对模型自身知识的依赖,从而避免因知识冲突导致的错误。
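The contrastive adjustment behind CAD can be sketched by combining the model's output logits computed with and without the extra context, so tokens the context newly supports get amplified. The toy four-token logits below are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cad_distribution(logits_with_ctx, logits_without_ctx, alpha=1.0):
    """Context-aware decoding in the spirit of CAD [251]: contrast the two
    output distributions to down-weight tokens the model would predict from
    parametric memory alone."""
    adjusted = (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx
    return softmax(adjusted)

# Toy vocabulary of 4 tokens; the retrieved context shifts mass to token 2.
with_ctx = np.array([1.0, 0.5, 2.0, 0.2])
without_ctx = np.array([1.0, 0.5, 0.5, 0.2])

p = cad_distribution(with_ctx, without_ctx)
print(p.argmax())  # 2: the context-supported token, boosted by the contrast
```

Tokens whose logits are unchanged by the context cancel out in the adjustment, which is precisely how the method suppresses answers the model would have produced regardless of the retrieved evidence.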
4.1.3 Knowledge Updating. In real-life scenarios, information is constantly evolving, and therefore, the GenIR system needs to continuously acquire the latest knowledge to meet users’ information needs. Since the model’s knowledge storage is limited, knowledge updating is necessary to ensure more reliable generated responses. In this section, we will discuss existing methods for knowledge updating from two perspectives: incremental learning and knowledge editing.
4.1.3 知识更新。在现实场景中,信息不断演变,因此GenIR系统需要持续获取最新知识以满足用户信息需求。由于模型的知识存储有限,必须通过知识更新来确保生成更可靠的响应。本节将从增量学习和知识编辑两个角度探讨现有的知识更新方法。
(1) Incremental Learning. Incremental learning refers to the ability of machine learning models to continuously learn new skills and tasks while retaining previously acquired knowledge [301,
(1) 增量学习
增量学习指机器学习模型在保留已掌握知识的同时持续学习新技能和任务的能力 [301,
Table 2. Comparison of representative reliable response generation methods, considering model configurations, specializations, and evaluations. For simplicity, "LM" stands for Language Modeling and "ODQA" stands for Open-Domain Question Answering.
Method | Backbone | Parameters | Trained | Capability | Evaluation Task
---|---|---|---|---|---
GPT-3 [16] | Transformer | 175B | | General | General Tasks (LM, QA, Reasoning, ...)
Llama-3.1 [59] | Transformer | 8B/70B/405B | | General | General Tasks
Mistral [105] | Transformer | 7B/22B/123B | | General | General Tasks
PaLM [34] | Transformer | 540B | | General | General Tasks
FactTune [211] | Llama-2 | 7B | | Factuality | Domain-specific QA
GenRead [339] | InstructGPT | 175B | | Factuality | Knowledge-intensive Tasks
DoLa [37] | LLaMA | 7B/65B | | Factuality | Multi-choice QA, Open-ended Generation
RAG [139] | BART | 400M | | Factuality | Knowledge-intensive Tasks
REPLUG [252] | GPT-3 | 175B | | Factuality | LM, Multi-choice QA, ODQA
FLARE [111] | GPT-3 | 175B | | Factuality | Knowledge-intensive Tasks
Self-RAG [5] | Llama-2 | 7B/13B | | Factuality | ODQA, Reasoning, Fact Check.
IR-CoT [287] | GPT-3/Flan-T5 | 175B/11B | × | Factuality | Multi-hop QA
ReAct [333] | PaLM | 540B | | Tools | Multi-hop QA, Fact Check., Decision Making
StructGPT [110] | GPT-3/GPT-3.5 | 175B/- | × | Tools | KG-based QA, Table-based QA, Text-to-SQL
ToolFormer [245] | GPT-J | 6B | | Tools | LM, Math, QA, Temporal Tasks
ToolLLM [226] | LLaMA | 7B | | Tools | Tool Use
HuggingGPT [250] | GPT-3.5 | | | Tools | Various Complex AI Tasks
According to [314] | GPT-3/Flan-T5/... | 175B/11B/... | | Accountability | ODQA
IFL [129] | GPT-J | 6B | | Accountability | Long-form QA
WebGPT [204] | GPT-3 | 175B | | Accountability | Long-form QA
WebBrain [223] | BART | 400M | | Accountability | Long-form QA
RARR [70] | PaLM | 540B | | Accountability | ODQA, Reasoning, Conversational QA
SearChain [326] | GPT-3.5 | | × | Accountability | Knowledge-intensive Tasks
P2Bot [172] | Transformer | | | Personalization | Personalized Dialogue
P-Soups [98] | nn | 7B | | Personalization | Personalized Dialogue
OPPU [274] | Llama-2 | 7B | | Personalization | Language Model Personalization Tasks
Zhongjing [11] | Ziya-LLaMA | 13B | | Healthcare | Chinese Medical Dialogue
Mental-LLM [327] | Alpaca/GPT-3.5/... | 7B/-/... | √/× | Healthcare | Mental Health Reasoning Tasks
Edu-Chat [48] | LLaMA | 13B | | Education | ODQA, Education Tasks
表 2: 代表性可靠响应生成方法对比,涵盖模型配置、专业领域和评估指标。为简洁起见,"LM"代表语言建模,"ODQA"代表开放域问答。
方法 | 主干模型 | 参数量 | 是否微调 | 目标领域 | 评估任务 |
---|---|---|---|---|---|
GPT-3 [16] | Transformer | 175B | 通用 | 通用任务 (LM、QA、推理等) | |
Llama-3.1 [59] | Transformer | 8B/70B/405B | 通用 | 通用任务 | |
Mistral [105] | Transformer | 7B/22B/123B | 通用 | 通用任务 | |
PaLM [34] | Transformer | 540B | 通用 | 通用任务 | |
FactTune [211] | Llama-2 | 7B | 事实性 | 领域特定QA | |
GenRead [339] | InstructGPT | 175B | 事实性 | 知识密集型任务 | |
DoLa [37] | LLaMA | 7B/65B | 事实性 | 多选QA、开放生成 | |
RAG [139] | BART | 400M | 事实性 | 知识密集型任务 | |
REPLUG [252] | GPT-3 | 175B | 事实性 | LM、多选QA、ODQA | |
FLARE [111] | GPT-3 | 175B | 事实性 | 知识密集型任务 | |
Self-RAG [5] | Llama-2 | 7B/13B | 事实性 | ODQA、推理、事实核查 | |
IR-CoT [287] | GPT-3/Flan-T5 | 175B/11B | × | 事实性 | 多跳QA |
ReAct [333] | PaLM | 540B | 工具 | 多跳QA、事实核查、决策制定 | |
StructGPT [110] | GPT-3/GPT-3.5 | 175B/- | × | 工具 | 基于KG的QA、基于表格的QA、Text-to-SQL |
ToolFormer [245] | GPT-J | 6B | 工具 | LM、数学、QA、时序任务 | |
ToolLLM [226] | LLaMA | 7B | 工具 | 工具使用 | |
HuggingGPT [250] | GPT-3.5 | 工具 | 各类复杂AI任务 | ||
According to [314] | GPT-3/Flan-T5/... | 175B/11B/... | 可问责性 | ODQA | |
IFL [129] | GPT-J | 6B | 可问责性 | 长格式QA | |
WebGPT [204] | GPT-3 | 175B | 可问责性 | 长格式QA | |
WebBrain [223] | BART | 400M | 可问责性 | 长格式QA | |
RARR [70] | PaLM | 540B | 可问责性 | ODQA、推理、对话式QA | |
SearChain [326] | GPT-3.5 | × | 可问责性 | 知识密集型任务 | |
P2Bot [172] | Transformer | 个性化 | 个性化对话 | ||
P-Soups [98] | nn | 7B | 个性化 | 个性化对话 | |
OPPU [274] | Llama-2 | 7B | 个性化 | 语言模型个性化任务 | |
Zhongjing [11] | Ziya-LLaMA | 13B | 医疗 | 中文医疗对话 | |
Mental-LLM [327] | Alpaca/GPT-3.5/... | 7B/-/... | √/× | 医疗 | 心理健康推理任务 |
Edu-Chat [48] | LLaMA | 13B | 教育 | ODQA、教育任务 |
303, 321, 351]. In the GenIR system, it is crucial to enable the language model to memorize the latest information while preventing the forgetting of previous knowledge.
303, 321, 351]。在GenIR系统中,关键是要让语言模型既能记住最新信息,又能避免遗忘已有知识。
One approach is Incremental Pre-training, which does not rely on supervised data but continues pre-training on continuously updated corpora to alleviate catastrophic forgetting. For example, Baidu proposed the ERNIE 2.0 framework [267], enhancing language understanding through continuous multi-task learning. Jang et al. [100] introduced Continual Knowledge Learning (CKL) to explore how LLMs update and retain knowledge amidst rapidly changing information, creating benchmarks like FUAR. Cossu et al. [39] studied continual pre-training for language and vision, finding that self-supervised or unsupervised methods are more effective in retaining previous knowledge compared to supervised learning. Additionally, Ke et al. [119] proposed Domain Adaptive Pre-training (DAP-training) to improve the model’s adaptability to new domains while preventing forgetting using techniques like soft masking and contrastive learning. For domain-specific model construction, Xie et al. [323] introduced FinPythia-6.9B, an efficient continual pre-training method specifically designed for large-scale language models in the financial domain.
一种方法是增量预训练(Incremental Pre-training),它不依赖监督数据,而是在持续更新的语料库上继续预训练以缓解灾难性遗忘。例如,百度提出的ERNIE 2.0框架[267]通过持续多任务学习增强语言理解能力。Jang等人[100]提出持续知识学习(CKL)方法,探索大语言模型如何在快速变化的信息中更新和保留知识,并创建了FUAR等基准测试。Cossu等人[39]研究了语言和视觉的持续预训练,发现自监督或无监督方法比监督学习更能有效保留先前知识。此外,Ke等人[119]提出领域自适应预训练(DAP-training),通过软掩码和对比学习等技术提升模型对新领域的适应能力,同时防止遗忘。针对领域专用模型构建,Xie等人[323]推出了FinPythia-6.9B,这是专为金融领域大语言模型设计的高效持续预训练方法。
On the other hand, Incremental Fine-tuning utilizes only labeled data for training. Progressive Prompts [236] appends new soft prompts for each new task, facilitating knowledge transfer and reducing forgetting. DynaInst [201] enhances lifelong learning in pre-trained language models through parameter regularization and experience replay, employing dynamic instance and task selection for efficient learning under resource constraints. Jang et al. [99] challenge traditional multi-task prompt fine-tuning by refining expert models on individual tasks. Suhr et al. [260] introduce a feedback-driven continual learning approach for instruction-following agents, where natural language feedback is converted into immediate rewards via contextual bandits to optimize learning. O-LoRA [305] achieves superior continual learning by training new tasks in orthogonal low-rank subspaces, significantly minimizing task interference. Peng et al. [216] propose a scalable language model that dynamically adjusts parameters based on task requirements, effectively preventing the forgetting of previously learned tasks.
另一方面,增量微调 (Incremental Fine-tuning) 仅使用标注数据进行训练。Progressive Prompts [236] 为每个新任务追加新的软提示 (soft prompts),促进知识迁移并减少遗忘。DynaInst [201] 通过参数正则化 (parameter regularization) 和经验回放 (experience replay) 增强预训练语言模型的终身学习能力,采用动态实例和任务选择机制在资源受限条件下实现高效学习。Jang等人 [99] 通过优化单任务专家模型,对传统多任务提示微调提出了挑战。Suhr等人 [260] 为指令跟随型AI智能体提出反馈驱动的持续学习方法,通过上下文老虎机 (contextual bandits) 将自然语言反馈转化为即时奖励以优化学习过程。O-LoRA [305] 通过在正交低秩子空间训练新任务实现卓越的持续学习性能,显著降低了任务间干扰。Peng等人 [216] 提出可扩展的大语言模型,能根据任务需求动态调整参数,有效防止已学习任务的遗忘。
(2) Knowledge Editing Knowledge editing refers to the process of modifying and updating existing knowledge within language models [191, 303], distinct from incremental learning that focuses on adapting to new domains or tasks. By editing the weights or layers of a model, knowledge editing methods can correct erroneous facts and incorporate new knowledge, making it important before deploying GenIR systems. There are primarily three paradigms for internal knowledge editing within language models: adding trainable parameters, locate-then-edit, and meta-learning.
(2) 知识编辑
知识编辑指修改和更新语言模型中现有知识的过程 [191, 303],与专注于适应新领域或任务的增量学习不同。通过编辑模型的权重或层,知识编辑方法可以纠正错误事实并整合新知识,这对部署生成式信息检索(GenIR)系统至关重要。语言模型内部知识编辑主要有三种范式:添加可训练参数、定位后编辑和元学习。
One approach to Adding Trainable Parameters integrates new individual neurons (patches) into the final feed-forward network (FFN) layer, as in T-Patcher [94] and CaliNet [56]; these serve as trainable parameters to adjust the model’s behavior. Alternatively, discrete codebook modules are introduced into the middle layers of the language model, as in GRACE [83], to adjust and correct information.
一种添加可训练参数的方法是在最后的全连接神经网络(FFN)层中集成新的单神经元(patches),如T-Patcher [94]和CaliNet [56]所做的那样,这些神经元作为可训练参数来调整模型的行为。另一种方法是在语言模型的中间层引入离散码本模块,如GRACE [83]所做的那样,来调整和校正信息。
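A toy illustration of the patch idea: appending one key-value neuron to an FFN block changes the output only when the key fires on the trigger input, leaving unrelated inputs untouched. The dimensions, random weights, and fixed bias below are illustrative, not T-Patcher's actual training procedure.

```python
import numpy as np

def ffn(x, W_in, b_in, W_out):
    """A standard two-layer FFN block: W_out @ relu(W_in @ x + b_in)."""
    return W_out @ np.maximum(W_in @ x + b_in, 0.0)

def add_patch(W_in, b_in, W_out, key, value, bias=-0.5):
    """Append one patch neuron: its key row decides when it activates and
    its value column decides how it steers the FFN output."""
    W_in2 = np.vstack([W_in, key[None, :]])
    b_in2 = np.append(b_in, bias)
    W_out2 = np.hstack([W_out, value[:, None]])
    return W_in2, b_in2, W_out2

rng = np.random.default_rng(0)
d = 4
W_in, b_in, W_out = rng.normal(size=(8, d)), np.zeros(8), rng.normal(size=(d, 8))

trigger = rng.normal(size=d)
trigger /= np.linalg.norm(trigger)   # key @ trigger = 1 activates the patch
value = np.ones(d)

W_in2, b_in2, W_out2 = add_patch(W_in, b_in, W_out, key=trigger, value=value)
delta = ffn(trigger, W_in2, b_in2, W_out2) - ffn(trigger, W_in, b_in, W_out)
print(delta)  # relu(1 - 0.5) * value = 0.5 * value on the trigger input
```

Because inputs orthogonal to the key give the patch a pre-activation of `bias < 0`, the ReLU gates it off and the original model behavior is preserved everywhere except near the edited fact.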
Moreover, the Locate-then-Edit method first identifies the parameters corresponding to specific knowledge and then updates these targeted parameters directly. Common techniques involve identifying key-value pairs in the FFN matrix, known as "knowledge neurons," and updating them [45]. Techniques like ROME [193] use causal mediation analysis to pinpoint areas needing editing, and MEMIT [194] builds on ROME to implement synchronized editing in various scenarios. Methods such as PMET [154] employ attention mechanisms for editing, while BIRD [182] introduces a bidirectional inverse relation modeling approach.
此外,定位后编辑(Locate-then-Edit)方法首先识别与特定知识对应的参数,然后直接更新这些目标参数。常见技术包括识别FFN矩阵中的键值对(称为"知识神经元(knowledge neurons)"并对其进行更新[45]。ROME[193]等技术使用因果中介分析(causal mediation analysis)来精确定位需要编辑的区域,MEMIT[194]则在ROME基础上实现了多种场景下的同步编辑。PMET[154]等方法采用注意力机制进行编辑,而BIRD[182]引入了双向逆关系建模(bidirectional inverse relation modeling)方法。
Meta-Learning, another paradigm, uses hyper-networks to generate the necessary updates for model editing. KE (Knowledge Editor) [17] predicts weight updates for each data point using a hyper-network. MEND [199], by taking the low-rank decomposition of gradients as input, learns to rapidly edit language models to enhance performance. Additionally, MALMEN [270] separates the computations of hyper-networks and language models, facilitating the editing of multiple facts under a limited memory budget. These meta-learning mechanisms enable models to swiftly adapt to new knowledge and tasks. A detailed comparison of representative reliable response generation methods is provided in Table 2.
元学习 (Meta-Learning) 是另一种范式,它使用超网络 (hyper-network) 为模型编辑生成必要的更新。KE (Knowledge Editor) [17] 通过超网络预测每个数据点的权重更新。MEND [199] 以梯度的低秩分解作为输入,学习快速编辑语言模型以提升性能。此外,MALMEN [270] 将超网络和语言模型的计算分离,在有限内存预算下实现多事实编辑。这些元学习机制使模型能够快速适应新知识和任务。表 2 提供了代表性可靠响应生成方法的详细对比。
4.2 External Knowledge Augmentation
4.2 外部知识增强
Although large language models have demonstrated significant effectiveness in response generation, issues such as susceptibility to hallucinations, difficulty handling in-domain knowledge, and challenges with knowledge updating persist. Augmenting the model’s generative process with external knowledge sources can serve as an effective way to tackle these issues. Based on the form of external knowledge employed, these approaches can be classified into retrieval augmentation and tool augmentation. The blue area in Figure 5 provides an overview of this section.
尽管大语言模型在生成响应方面表现出显著效果,但幻觉倾向、领域知识处理困难以及知识更新挑战等问题依然存在。通过外部知识源增强模型的生成过程,可作为解决这些问题的有效途径。根据所采用的外部知识形式,这些方法可分为检索增强和工具增强两类。图5中的蓝色区域概述了本节内容。
4.2.1 Retrieval Augmentation. Retrieval-Augmented Generation (RAG) enhances the response of generative models by combining them with a retrieval mechanism [95, 139, 368]. By querying a large collection of documents, information that is relevant to the input query can be fetched and integrated into the input of the generative model. RAG enables generative models to be grounded in existing reliable knowledge, significantly improving the reliability of model generation. Typically, a RAG method involves a retriever and a generator. Based on the interaction flow between these two, RAG methods can be divided into four categories [72].
4.2.1 检索增强。检索增强生成 (Retrieval-Augmented Generation, RAG) 通过将生成模型与检索机制相结合来增强其响应能力 [95, 139, 368]。通过查询大量文档集合,可以获取与输入查询相关的信息并将其整合到生成模型的输入中。RAG 使生成模型能够基于现有可靠知识进行生成,显著提高了模型生成结果的可靠性。典型的 RAG 方法包含检索器和生成器两个组件。根据二者交互流程的不同,RAG 方法可分为四类 [72]。
(1) Sequential RAG: Sequential RAG operates on a linear progression, where the retriever first retrieves relevant information and the generator utilizes this information to directly complete the response generation process.
(1) 顺序RAG: 顺序RAG采用线性流程运作, 检索器首先获取相关信息, 生成器随后利用这些信息直接完成响应生成过程。
The basic form of sequential RAG is a “Retrieve-Read” framework [183], where early works perform joint [14, 81, 139] or separate [95] training of retriever and generator but require costly pre-training. In-Context RALM [234] addresses this by directly using retrieved documents as input, leveraging the model’s in-context learning without additional training.
顺序RAG的基本形式是"检索-阅读"框架[183],早期研究通过联合[14, 81, 139]或分离[95]训练检索器和生成器来实现,但需要昂贵的预训练成本。In-Context RALM[234]通过直接使用检索文档作为输入,利用模型的上下文学习能力,避免了额外训练。
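The "Retrieve-Read" flow can be sketched in a few lines. This is a hedged illustration rather than any specific system: the term-overlap scorer stands in for a real retriever, and the assembled prompt would then be fed to a frozen generator (omitted here), as In-Context RALM does with retrieved documents.

```python
# Minimal "Retrieve-Read" sketch: rank the corpus with a simple term-overlap
# score, then prepend the top documents to the query so a frozen LLM can read
# them in-context without any additional training.
def score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query, corpus, k=2):
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, corpus, k=2):
    docs = retrieve(query, corpus, k)
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Eiffel Tower is located in Paris .",
    "Python is a programming language .",
    "Paris is the capital of France .",
]
prompt = build_prompt("Where is the Eiffel Tower ?", corpus)
print(prompt)
```

In a real pipeline the prompt would be sent to the generator, whose answer is then grounded in the retrieved context.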
With the widespread adoption of LLMs, most subsequent works are built on the foundation of a frozen generator. AAR [340] fine-tunes a general retriever to adapt to the information acquisition preferences of the generative model. LLM-embedder [353] uses rewards produced by LLM to train an embedding model dedicated to retrieval augmentation. ARL2 [350] leverages LLM to annotate relevance scores in the training set and trains a retriever using contrastive learning.
随着大语言模型(LLM)的广泛采用,大多数后续工作都建立在冻结生成器的基础上。AAR [340]通过微调通用检索器来适应生成模型的信息获取偏好。LLM-embedder [353]利用大语言模型产生的奖励来训练专用于检索增强的嵌入模型。ARL2 [350]借助大语言模型标注训练集中的相关性分数,并采用对比学习训练检索器。
Several works introduce pre-retrieval and post-retrieval processes [72] into the sequential pipeline to enhance the overall efficiency. In the pre-retrieval process, the RRR model [183] introduces a rewriter module before the retriever, trained using the generator’s feedback to enable the retrieval system to provide more suitable information for generation.
多项研究在顺序流程中引入了检索前(pre-retrieval)和检索后(post-retrieval)处理过程[72]以提升整体效率。在检索前阶段,RRR模型[183]在检索器前加入了重写模块,该模块通过生成器的反馈进行训练,使检索系统能为生成过程提供更合适的信息。
In the post-retrieval process, information compressors are proposed to filter out irrelevant content from documents, avoiding misleading the generator’s response [43, 114, 170]. RECOMP [325] uses both abstractive and extractive compressors to generate concise summaries of retrieved documents. LLMLingua [109] retains important tokens by calculating token importance based on the perplexity provided by the generative model. LongLLMLingua [108] introduces query-aware compression and reranks retrieved documents based on importance scores to alleviate the “lost in the middle” phenomenon [170]. PRCA [329] employs reinforcement learning to train a text compressor adaptable to black-box LLMs and various retrievers, serving as a versatile plug-in.
在检索后处理过程中,信息压缩器被提出来过滤文档中的无关内容,避免误导生成器的响应 [43, 114, 170]。RECOMP [325] 同时使用抽象式和抽取式压缩器来生成检索文档的简洁摘要。LLMLingua [109] 通过基于生成模型提供的困惑度计算 token 重要性来保留重要 token。LongLLMLingua [108] 引入了查询感知压缩,并根据重要性分数对检索文档重新排序,以缓解"中间丢失"现象 [170]。PRCA [329] 采用强化学习训练可适配黑盒大语言模型和各种检索器的文本压缩器,作为通用插件使用。
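Perplexity-based compression of the LLMLingua family can be sketched as follows. The per-token surprisal scores here are stubbed constants standing in for the negative log-probabilities a small language model would supply; the pruning rule itself (keep the most informative tokens up to a budget) is the part being illustrated.

```python
# Hedged sketch of LLMLingua-style prompt compression: keep only the tokens a
# small LM finds most "surprising" (high surprisal = hard to predict = more
# informative), dropping the rest until a target keep ratio is met.
def compress(tokens, surprisal, keep_ratio=0.5):
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank positions by informativeness, then restore original order.
    ranked = sorted(range(len(tokens)), key=lambda i: surprisal[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]

tokens    = ["the", "Eiffel", "Tower", "is", "in", "Paris"]
surprisal = [0.1,   3.2,      2.8,     0.2,  0.3,  3.5]   # stubbed LM scores
kept = compress(tokens, surprisal, keep_ratio=0.5)
print(kept)
```

Function words with low surprisal are dropped first, which is why compressed prompts stay informative while shrinking substantially.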
(2) Branching RAG: In the Branching RAG framework, the input query is processed across multiple pipelines, and each pipeline may involve the entire process in the sequential pipeline. The outputs from all pipelines are merged to form the final response, allowing for finer-grained handling of the query or retrieval results.
(2) 分支RAG (Branching RAG): 在分支RAG框架中,输入查询会通过多个流程并行处理,每个流程可能包含顺序流程中的完整步骤。所有流程的输出结果会被合并形成最终响应,从而实现对查询或检索结果更细粒度的处理。
In the pre-retrieval stage, TOC [123] uses few-shot prompting to recursively decompose complex questions into clear sub-questions in a tree structure, retrieving relevant documents for each and generating a comprehensive answer. BlendFilter [297] enhances the original query using prompts with internal and external knowledge, retrieves related documents with the augmented queries, and merges them for a comprehensive response.
在预检索阶段,TOC [123]采用少样本提示递归地将复杂问题分解为树状结构的清晰子问题,为每个子问题检索相关文档并生成综合答案。BlendFilter [297]通过结合内部与外部知识的提示增强原始查询,用增强后的查询检索相关文档,最后合并生成全面响应。
In the post-retrieval stage, REPLUG [252] processes each retrieved document with the query through the generator separately and combines the resulting probability distributions to form the final prediction. GenRead [339] prompts LLM to generate related documents and merges them with retrieved documents from the retriever as input, enhancing content coverage.
在检索后阶段,REPLUG [252] 将每个检索到的文档与查询分别通过生成器处理,并合并所得概率分布以形成最终预测。GenRead [339] 提示大语言模型生成相关文档,并将其与检索器获取的文档合并作为输入,从而提升内容覆盖率。
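The REPLUG-style post-retrieval merge can be sketched as a weighted mixture of distributions. The toy dict-based "distributions" below stand in for the generator's next-token probabilities computed once per retrieved document; the retrieval scores and their softmax weighting follow the ensembling idea, not the exact implementation.

```python
# Sketch of REPLUG-style ensembling: run the generator once per retrieved
# document and mix the per-document next-token distributions, weighting each
# by its softmax-normalized retrieval score.
import math

def replug_mix(doc_scores, doc_distributions):
    exp = [math.exp(s) for s in doc_scores]
    weights = [e / sum(exp) for e in exp]
    mixed = {}
    for w, dist in zip(weights, doc_distributions):
        for token, p in dist.items():
            mixed[token] = mixed.get(token, 0.0) + w * p
    return mixed

scores = [2.0, 0.0]                       # retrieval scores for two documents
dists = [{"Paris": 0.9, "London": 0.1},   # p(token | query, doc1)  (toy)
         {"Paris": 0.4, "London": 0.6}]   # p(token | query, doc2)  (toy)
mixed = replug_mix(scores, dists)
print(max(mixed, key=mixed.get))          # highest-probability token
```

Because each document is processed separately, the context window constraint applies per document rather than to the concatenation, which is the practical motivation for this branching design.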
(3) Conditional RAG: The Conditional RAG framework adapts to various query types through distinct processes, improving the system’s flexibility. Since there can be knowledge conflict between the knowledge from retrieved documents and the generator’s own knowledge, RAG’s effectiveness is not consistent across all scenarios. To address this, common conditional RAG methods include a decision-making module that determines whether to engage the retrieval process for each query.
(3) 条件式RAG:条件式RAG框架通过差异化流程适配各类查询类型,提升系统灵活性。由于检索文档中的知识与生成器自身知识可能存在冲突,RAG的效能并非在所有场景中都稳定。为解决该问题,常见的条件式RAG方法会引入决策模块,用以判断是否对每个查询启动检索流程。
SKR [308] trains a binary classifier on a dataset of questions LLMs can or cannot answer, determining at inference whether to use retrieval. Training labels are obtained by prompting the model to assess if external knowledge is needed. Self-DC [296] uses the model’s confidence score to decide on retrieval necessity, categorizing queries into unknown, uncertain, and known, with unknown queries processed through sequential RAG and uncertain ones decomposed into sub-questions. Rowen [52] introduces a multilingual detection module that perturbs the original question and measures response consistency to decide on retrieval.
SKR [308] 在LLM能回答或不能回答的问题数据集上训练二元分类器,在推理时判断是否使用检索。训练标签通过提示模型评估是否需要外部知识获得。Self-DC [296] 利用模型的置信度分数决定检索必要性,将查询分为未知、不确定和已知三类,未知查询通过顺序RAG处理,不确定查询则分解为子问题。Rowen [52] 提出多语言检测模块,通过扰动原始问题并测量响应一致性来决定是否检索。
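The routing logic shared by these conditional methods reduces to a small dispatch over a confidence signal. The sketch below follows the three-way split described for Self-DC; the thresholds and the source of the confidence score are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of conditional RAG routing in the spirit of Self-DC: a
# confidence score from the generator decides whether to answer directly,
# decompose the question into sub-questions, or fall back to sequential RAG.
def route(confidence, low=0.3, high=0.8):
    if confidence >= high:
        return "answer-directly"     # "known": rely on parametric knowledge
    if confidence >= low:
        return "decompose"           # "uncertain": split into sub-questions
    return "retrieve-then-read"      # "unknown": run the sequential pipeline

print(route(0.9), route(0.5), route(0.1))
```

SKR replaces the threshold test with a trained binary classifier, and Rowen replaces it with a consistency check over perturbed questions, but the dispatch structure is the same.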
(4) Loop RAG: Loop RAG involves deep interactions between the retriever and generator components. Owing to multi-turn retrieval and generation processes, accompanied by comprehensive interactions, it excels at handling complex and diverse input queries, yielding superior results in response generation.
(4) 循环RAG (Loop RAG):循环RAG通过检索器(retriever)与生成器(generator)组件的深度交互,借助多轮检索与生成过程及全面交互机制,擅长处理复杂多样的输入查询,在响应生成方面表现优异。
ITER-RETGEN [248] introduces an iterative framework alternating between retrieval-augmented generation and generation-augmented retrieval, repeating this process to produce the final answer. IR-COT [287] follows a similar procedure to ITER-RETGEN but the iteration pauses based on the model’s own generative process. FLARE [111] conducts concurrent retrieval during generation, evaluating the need for retrieval at each new sentence based on the LLM’s confidence score, dynamically supplementing information to enhance content reliability. COG [128] models generation as continual retrieval and copying of segments from an external corpus, with the generator producing conjunctions to maintain fluency. Self-RAG [5] adds special tokens into the vocabulary, allowing the generator to decide on retrieval, document importance, and whether to perform a critique.
ITER-RETGEN [248] 提出了一种迭代框架,交替进行检索增强生成和生成增强检索,通过重复这一过程生成最终答案。IR-COT [287] 采用与 ITER-RETGEN 类似的流程,但会根据模型自身的生成过程暂停迭代。FLARE [111] 在生成过程中并行执行检索,基于大语言模型的置信度分数评估每个新句子是否需要检索,动态补充信息以提高内容可靠性。COG [128] 将生成建模为从外部语料库持续检索和复制片段的过程,生成器负责生成连接词以保持流畅性。Self-RAG [5] 在词表中添加特殊 Token,使生成器能够自主决定检索行为、文档重要性评估以及是否执行批判性验证。
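The active-retrieval loop of FLARE can be sketched as follows. Both the generator and the retriever below are stubs: the scripted `demo_generate` returns a sentence with a confidence value, standing in for an LLM whose token probabilities supply the confidence; the regeneration-with-evidence step is the part being illustrated.

```python
# Minimal active-retrieval loop in the spirit of FLARE: generate one sentence
# at a time; when confidence in a sentence drops below a threshold, retrieve
# evidence and regenerate that sentence with the evidence in context.
def flare_loop(question, generate, retrieve, max_sents=5, threshold=0.6):
    answer, context = [], []
    for _ in range(max_sents):
        sent, conf = generate(question, answer, context)
        if sent is None:
            break
        if conf < threshold:                   # low confidence: look it up
            context = retrieve(sent)
            sent, conf = generate(question, answer, context)
        answer.append(sent)
    return " ".join(answer)

def demo_generate(question, answer, context):
    # Scripted stand-in for an LLM: confident only once evidence is present.
    script = [("The tower is in Paris.", 0.4 if not context else 0.9),
              ("It was built in 1889.", 0.95)]
    if len(answer) >= len(script):
        return None, 1.0
    return script[len(answer)]

def demo_retrieve(sentence):
    return ["Eiffel Tower, Paris, France"]

result = flare_loop("Where is the Eiffel Tower?", demo_generate, demo_retrieve)
print(result)
```

The first sentence triggers retrieval (confidence 0.4) and is regenerated with evidence in context, while the second passes through unchanged.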
Some works focus on deconstructing complex inquiries into sub-questions, addressing these individually to produce a more dependable response. [221] guides LLM to decompose complex questions into sub-questions, answer each using retrieved results, and synthesize the answers; RET-Robust [338] builds upon this by incorporating an NLI model to verify that retrieved documents support the sub-question answers, reducing misinformation.
一些研究致力于将复杂查询拆解为子问题,通过分别处理这些子问题来生成更可靠的回答。[221] 指导大语言模型将复杂问题分解为子问题,利用检索结果逐一解答后综合答案;RET-Robust [338] 在此基础上引入自然语言推理 (NLI) 模型来验证检索文档是否支持子问题答案,从而减少错误信息。
4.2.2 Tool Augmentation. Although retrieval-augmented techniques have significantly improved upon the blind spots of a generator’s self-knowledge, these methods struggle with the rapid and flexible update of information since they rely on the existence of information within an external corpus of documents. Tool augmentation, on the other hand, excels in addressing this issue by invoking various tools that allow for the timely acquisition and usage of the latest data, such as financial and news data. Moreover, tool augmentation expands the scope of responses a model can offer, such as language translation, image generation, and other tasks, to more comprehensively meet users’ information retrieval needs.
4.2.2 工具增强 (Tool Augmentation)
尽管检索增强技术显著改善了生成器自知识盲点的问题,但这些方法依赖于外部文档语料库中信息的存在,因此在信息的快速灵活更新方面存在困难。相比之下,工具增强通过调用各类工具(如金融、新闻等最新数据的及时获取与使用)更擅长解决这一问题。此外,工具增强还扩展了模型可提供的响应范围(例如语言翻译、图像生成等任务),从而更全面地满足用户的信息检索需求。
There are four categories of tools that can be utilized to construct a more reliable information retrieval system:
构建更可靠的信息检索系统可利用的工具可分为四类:
(1) Search Engine: Common search engine tools like Google Search and Bing Search help answer frequent and time-sensitive queries effectively. Self-Ask [221] initially decomposes complex questions into multiple sub-questions, then uses a search engine to answer each sub-question, and finally generates a comprehensive answer to the complex question. ReAct [333] embeds search engine calls into the model’s reasoning process, allowing the generative model to determine when to make calls and what queries to input for more flexible reasoning. New Bing can automatically search relevant information from Bing based on user input, yielding reliable and detailed answers, including citation annotations in the generated content.
(1) 搜索引擎: 常见搜索引擎工具如Google Search和Bing Search能有效回答高频且时效性强的问题。Self-Ask [221] 首先将复杂问题分解为多个子问题,随后使用搜索引擎回答每个子问题,最终生成复杂问题的综合答案。ReAct [333] 将搜索引擎调用嵌入模型的推理过程,使生成式模型能自主决定调用时机及查询内容,实现更灵活的推理。New Bing可根据用户输入自动从Bing搜索相关信息,生成包含引用标注的可靠详细答案。
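The ReAct pattern of embedding search calls into the reasoning process can be sketched as a parse-act-observe loop. Both the model and the search engine below are stand-in stubs, and the `Search[...]` / `Finish[...]` action syntax follows the common ReAct prompt format as an illustrative assumption.

```python
# Sketch of a ReAct-style loop: scan the model's output for an action of the
# form "Search[...]", append the result as an Observation, and call the model
# again, until it emits "Finish[...]" with the final answer.
import re

def react_loop(llm, search, prompt, max_steps=5):
    trace = prompt
    for _ in range(max_steps):
        step = llm(trace)
        trace += "\n" + step
        done = re.search(r"Finish\[(.*?)\]", step)
        if done:
            return done.group(1)
        act = re.search(r"Search\[(.*?)\]", step)
        if act:
            trace += "\nObservation: " + search(act.group(1))
    return None

def stub_llm(trace):
    # Scripted stand-in: searches first, answers once evidence is observed.
    if "Observation" not in trace:
        return "Thought: I need to look this up.\nAction: Search[capital of France]"
    return "Thought: The answer is clear.\nAction: Finish[Paris]"

def stub_search(query):
    return "Paris is the capital of France."

answer = react_loop(stub_llm, stub_search, "Question: What is the capital of France?")
print(answer)
```

The key design point is that the model itself decides when to call the tool and with what query, by emitting actions inside its generated text.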
Some works have also built advanced conversational systems based on tools like search engines. Internet-Augmented Generation [125] enhances the quality of conversational replies by using search engines during conversations. LaMDA [282] and BlenderBot [253] combine search engines with conversational agents, constantly accessing internet information to enrich conversation factuality. WebGPT [204] and WebCPM [225] directly teach models to perform human-like browser operations by generating commands such as Search, Click, and Quote, facilitating the automated retrieval and acquisition of information.
一些研究还基于搜索引擎等工具构建了高级对话系统。联网增强生成 (Internet-Augmented Generation) [125] 通过在对话过程中使用搜索引擎来提升回复质量。LaMDA [282] 和 BlenderBot [253] 将搜索引擎与对话智能体结合,持续获取网络信息以增强对话的事实性。WebGPT [204] 和 WebCPM [225] 则直接教导模型执行类人浏览器操作,通过生成搜索 (Search)、点击 (Click)、引用 (Quote) 等指令,实现信息的自动化检索与获取。
(2) Knowledge Graph (KG): Compared to search engines, KGs are particularly useful for extracting structured, explicit knowledge. Relevant knowledge from a knowledge graph can be extracted and used as a prompt input to enhance the generative process [262]. StructGPT [110] introduces an iterative reading-and-reasoning framework where the model can access a knowledge graph through a well-designed interface, continually acquiring information and reasoning until an answer is obtained. RoG [181] generates plausible reasoning paths from a KG, executes them in parallel, and integrates outcomes for a final answer; ToG [262] allows the model to explore entities and links without pre-planning paths, continuously assessing reasoning feasibility.
(2) 知识图谱 (Knowledge Graph, KG): 与搜索引擎相比,知识图谱特别适用于提取结构化的显性知识。可以从知识图谱中提取相关知识并作为提示输入,以增强生成过程 [262]。StructGPT [110] 提出了一种迭代式阅读-推理框架,模型通过精心设计的接口访问知识图谱,持续获取信息并进行推理直至获得答案。RoG [181] 从知识图谱生成合理的推理路径,并行执行这些路径并整合结果以得出最终答案;ToG [262] 允许模型在不预先规划路径的情况下探索实体和链接,持续评估推理可行性。
(3) API-based Tools: An important part of the tools is the real-world APIs, which enable the model to obtain information from specific data sources, such as real-time stock information, movie services, code interpreters, and so on. However, the multitude and diversity of APIs, coupled with the adherence to certain operational protocols, make the teaching of API usage to models a focal point of this area.
(3) 基于API的工具: 工具中重要的一部分是现实世界的API, 它们使得模型能够从特定数据源获取信息, 例如实时股票信息、电影服务、代码解释器等。然而, API的多样性和数量庞大, 加上需要遵循特定的操作协议, 使得教授模型使用API成为该领域的重点。
Toolformer [245] trains language models in a self-supervised manner to automatically call APIs when needed, using prompts to generate API calls, executing them, and filtering ineffective ones to form the dataset. Training with standard language modeling objectives yields models that can autonomously invoke APIs across tasks without losing language modeling capabilities. RestGPT [258] formulates a framework prompting LLMs to invoke RESTful APIs, comprising an online planner, an API selector, and an executor. ToolLLM [226] uses a large corpus of scraped APIs to build a dataset for fine-tuning. Gorilla [214] introduces an information retriever providing the model with reference API documentation, facilitating retrieval-based information utilization during fine-tuning. ToolkenGPT [82] incorporates each tool as a new token into the vocabulary, enabling the model to invoke APIs during inference as naturally as generating text.
Toolformer [245] 通过自监督方式训练语言模型,使其能在需要时自动调用API:利用提示生成API调用指令,执行这些指令并过滤无效调用以构建数据集。采用标准语言建模目标训练后,模型能在跨任务场景中自主调用API,同时保持语言建模能力。RestGPT [258] 提出了一个框架,通过提示大语言模型调用RESTful API,该框架包含在线规划器、API选择器和执行器。ToolLLM [226] 利用爬取的大规模API语料构建微调数据集。Gorilla [214] 引入了信息检索器,为模型提供参考API文档,促进微调过程中基于检索的信息利用。ToolkenGPT [82] 将每个工具作为新token加入词汇表,使模型在推理时能像生成文本一样自然地调用API。
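The inference-time mechanics of inline API invocation can be sketched as marker parsing and in-place substitution. The `[Tool(args) -> result]` format mirrors the call syntax Toolformer uses, and the tool names in the registry are illustrative stubs, not real APIs.

```python
# Hedged sketch of Toolformer-style API-call filling: generated text may
# contain markers like "[Calculator(3*7)]"; each marker is executed through a
# registry of tools and replaced in-place by "[Calculator(3*7) -> 21]".
import re

TOOLS = {
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "Date": lambda _: "2024-01-01",      # stubbed external lookup
}

def fill_api_calls(text):
    def run(match):
        name, args = match.group(1), match.group(2)
        result = TOOLS[name](args)
        return f"[{name}({args}) -> {result}]"
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

filled = fill_api_calls("The total is [Calculator(3*7)] items.")
print(filled)
```

ToolkenGPT achieves the same effect at the vocabulary level, reserving one token per tool so the decoder can "emit" a call as naturally as any other token.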
Beyond learning to invoke APIs, CREATOR [222] prompts models to write code based on actual problems as new tool implementations, with generated tools functioning through a code interpreter and demonstrating impressive results on complex tasks.
除了学习调用API外,CREATOR [222] 还能基于实际问题提示模型编写代码作为新工具实现,生成的工具通过代码解释器运行,并在复杂任务上展现出令人印象深刻的效果。
Some works additionally support multimodal inputs, broadening the application scope of the models. AssistGPT [69] offers a framework including modules like Planner, Executor, Inspector, and Learner, utilizing language and code for intricate inference. ViperGPT [269] feeds Codex with user queries and visual API information to generate Python code invoking APIs, successfully completing complex visual tasks.
一些研究还支持多模态输入,从而拓宽了模型的应用范围。AssistGPT [69] 提供了一个包含 Planner、Executor、Inspector 和 Learner 等模块的框架,利用语言和代码进行复杂推理。ViperGPT [269] 将用户查询和视觉 API 信息输入 Codex,生成调用 API 的 Python 代码,成功完成复杂的视觉任务。
(4) Model-based Tools: With the swift expansion of diverse AI communities (e.g., Hugging Face, ModelScope, GitHub), various types of AI models have become readily accessible for use, serving as a pivotal tool in enhancing generative retrieval systems. These AI models encompass a wide array of tasks, each accompanied by comprehensive model descriptions and usage examples.
(4) 基于模型的工具: 随着多样化AI社区(如Hugging Face、ModelScope、GitHub)的快速扩张,各类AI模型已变得触手可及,成为增强生成式检索系统的关键工具。这些AI模型涵盖广泛的任务范围,每个模型都配有详细的模型说明和使用示例。
HuggingGPT [250] employs ChatGPT as a controller to deconstruct user queries into tasks, determining which models to invoke for execution. Similarly, Visual ChatGPT [317] integrates a visual foundation model with LLMs, leveraging ChatGPT as a prompt manager to mobilize visual foundation models like BLIP [145] and ControlNet [349], processing image-based requests more efficiently than general multi-modal models.
HuggingGPT [250] 采用 ChatGPT 作为控制器,将用户查询分解为任务,并确定调用哪些模型来执行。类似地,Visual ChatGPT [317] 将视觉基础模型与大语言模型相结合,利用 ChatGPT 作为提示管理器来调动 BLIP [145] 和 ControlNet [349] 等视觉基础模型,相比多模态模型,它更擅长高效处理基于图像的请求。
4.3 Generating Response with Citation
4.3 带引用的响应生成
To build a reliable GenIR system, generating responses with citations is a promising approach [88, 168, 195]. Citations allow users to clearly understand the source of each piece of knowledge in the response, enhancing trust and facilitating widespread adoption. Existing methods can be divided into directly generating responses with citations and using a retrieval module to enhance the generated content. Refer to the green portion in Figure 6 for an overview of this section.
要构建可靠的生成式信息检索(GenIR)系统,生成带引用的响应是一种有效方法[88, 168, 195]。引用能让用户清晰了解响应中每个知识点的来源,增强可信度并促进广泛应用。现有方法可分为直接生成带引用的响应,以及使用检索模块增强生成内容两类。具体分类请参见图6绿色部分。
4.3.1 Direct Generating Response with Citation. This method uses the model’s intrinsic memory to generate source citations without relying on a retrieval module.
4.3.1 直接生成带引用的响应。该方法利用模型的内在记忆生成来源引用,无需依赖检索模块。
(1) Model Intrinsic Knowledge. Leveraging the capabilities of the language model itself, according-to prompting [314] guides LLMs to more accurately cite information from pretraining data by adding phrases like "according to Wikipedia" in the prompts.
(1) 模型内在知识。利用语言模型自身的能力,according-to 提示方法 [314] 通过在提示中添加"根据维基百科"等短语,引导大语言模型更准确地引用预训练数据中的信息。
To improve citation quality, Iterative Feedback Learning (IFL) [129] employs a critique model to assess and provide feedback on generated text, iteratively enhancing LLMs’ citation accuracy, content correctness, and fluency. Additionally, Fierro et al. [63] introduce a plan-based approach using a series of questions as a blueprint for content generation, with abstractive and extractive attribution models, showing that planning improves citation quality.
为提高引用质量,迭代反馈学习 (Iterative Feedback Learning, IFL) [129] 采用批评模型来评估生成文本并提供反馈,通过迭代提升大语言模型的引用准确性、内容正确性和流畅性。此外,Fierro 等人 [63] 提出了一种基于计划的方法,使用一系列问题作为内容生成的蓝图,结合抽象式和抽取式归因模型,证明规划能提升引用质量。
Fig. 6. Generating response with citation and personal information assistant are also crucial approaches for building a reliable and user-centric GenIR system.
图 6: 生成带有引用的响应及个人信息助手功能,同样是构建可靠且以用户为中心的生成式信息检索 (GenIR) 系统的关键方法。
(2) Incorporating Generative Retrieval. As envisioned by Metzler et al. [195], allowing the model to directly generate responses with citations is a promising approach for building an expert-level reliable IR system. Users receive reliable responses tailored to their needs without searching through returned documents. Moreover, the cited document is generated by the model through the generative retrieval approach described in Section 3, directly producing corresponding DocIDs.
(2) 融入生成式检索 (Generative Retrieval)。正如 Metzler 等人 [195] 所设想的那样,允许模型直接生成带有引用的响应是构建专家级可靠信息检索 (IR) 系统的有前景的方法。用户无需搜索返回的文档即可获得量身定制的可靠响应。此外,引用的文档由模型通过第 3 节所述的生成式检索方法生成,直接产生相应的 DocID。
Utilizing generative retrieval, 1-PAGER [97] combines answer generation and evidence retrieval by generating N-gram DocIDs through constrained decoding using FM-Index [62], enabling step-by-step corpus partitioning, document selection, and response generation. This method matches retrieval-then-read methods in accuracy and surpasses closed-book QA models by attributing predictions to specific evidence, offering a new scheme for integrating retrieval into seq2seq generation.
利用生成式检索 (generative retrieval) 技术,1-PAGER [97] 通过 FM-Index [62] 进行约束解码生成 N-gram DocIDs,将答案生成与证据检索相结合,实现了逐步的语料库分区、文档选择和响应生成。该方法在准确度上与检索后阅读 (retrieval-then-read) 方法相当,并通过将预测归因于特定证据超越了闭卷问答 (closed-book QA) 模型,为将检索整合到序列到序列 (seq2seq) 生成中提供了新方案。
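Constrained decoding of identifiers can be sketched with a trie over valid DocID token sequences. This is a simplified stand-in for the FM-Index machinery 1-PAGER actually uses: at each step the decoder may only choose a continuation that exists in the corpus, and the per-token scores below are stubbed model logits.

```python
# Sketch of constrained decoding for generative retrieval: a trie built over
# valid DocID token sequences restricts the decoder to identifiers that truly
# exist in the corpus; generation greedily picks the highest-scoring token
# among the continuations the trie allows.
def build_trie(docids):
    root = {}
    for seq in docids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return root

def constrained_greedy(trie, score):
    node, out = trie, []
    while True:
        allowed = list(node.keys())
        if allowed == ["<end>"] or not allowed:
            return out
        best = max((t for t in allowed if t != "<end>"), key=score)
        out.append(best)
        node = node[best]

docids = [["sports", "2021", "finals"],
          ["sports", "2022", "draft"],
          ["news", "2022", "draft"]]
score = {"sports": 0.9, "news": 0.4, "2021": 0.2, "2022": 0.7,
         "finals": 0.5, "draft": 0.6}.get          # stubbed token scores
result = constrained_greedy(build_trie(docids), score)
print(result)
```

Each generated token narrows the candidate set, which is the "step-by-step corpus partitioning" behavior described above.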
Recently, [122] proposes a source-aware training method where models learn to associate DocIDs with knowledge during pre-training and provide supporting citations during instruction tuning, effectively achieving knowledge attribution and enhancing LLM verifiability.
[122] 近期提出了一种源感知训练方法,该方法让模型在预训练阶段学习将文档ID (DocID) 与知识关联,并在指令微调阶段提供支持性引用,有效实现了知识溯源并增强了大语言模型的可验证性。
4.3.2 Retrieval-based Response with Citation. To enhance the accuracy of citations, several methods have been developed based on retrieval techniques to fetch relevant documents, thereby improving the quality of responses with embedded citations.
4.3.2 基于检索的引用回复。为提高引用准确性,目前已开发出多种基于检索技术的方法来获取相关文档,从而提升嵌入引用回复的质量。
(1) Citation within Generation. Following retrieval, models directly generate responses that include citations. Initially, systems like WebGPT [204], LaMDA [282], and WebBrain [223] utilized web pages or Wikipedia to construct large-scale pre-training datasets, teaching models how to generate responses with citations.
(1) 生成式引用。检索完成后,模型会直接生成包含引用的回答。最初,像WebGPT [204]、LaMDA [282]和WebBrain [223]这样的系统利用网页或维基百科构建大规模预训练数据集,教会模型如何生成带引用的回答。
Subsequently, more advanced strategies for citation generation were proposed. For instance, Search-in-the-Chain (SearChain) [326] first generates a reasoning chain (Chain-of-Query, CoQ) via LLM prompts, then interacts with each CoQ node using retrieval for verification and completion, ultimately generating the reasoning process and marking citations at each inference step.
随后,研究者们提出了更先进的文献引用生成策略。例如,Search-in-the-Chain (SearChain) [326] 首先通过大语言模型提示生成推理链 (Chain-of-Query, CoQ),随后利用检索与每个 CoQ 节点进行交互以验证和补全,最终生成推理过程并在每个推断步骤中标注引用。
LLatrieval [156] suggests continuously improving retrieval results through iterative updating, verifying whether retrieved documents support the generated answers until the results are satisfactory. AGREE [335] uses a Natural Language Inference (NLI) model to verify consistency between LLM-generated answers and retrieved documents, employing a Test-Time Adaptation (TTA) strategy that allows LLMs to actively search and cite current information during generation, enhancing response accuracy and reliability. VTG [261] integrates an evolved memory system and a dual-layer validator for generating verifiable text, combining long-term and short-term memories to adapt to dynamic content, and uses an NLI model to evaluate logical support between claims and evidence.
LLatrieval [156] 提出通过迭代更新持续改进检索结果,验证检索到的文档是否支持生成的答案直至满意。AGREE [335] 采用自然语言推理 (Natural Language Inference, NLI) 模型验证大语言模型生成答案与检索文档的一致性,并运用测试时适应 (Test-Time Adaptation, TTA) 策略,使大语言模型在生成过程中主动搜索并引用实时信息,提升回答的准确性与可靠性。VTG [261] 整合进化记忆系统和双层验证器来生成可验证文本,结合长期与短期记忆以适应动态内容,同时利用NLI模型评估主张与证据间的逻辑支持关系。
Based on the Graph of Thoughts (GoT) [12], HGOT [60] improves context learning in retrieval-augmented settings by constructing a hierarchical GoT, leveraging the LLM’s planning capabilities to break down complex queries into smaller sub-queries and introducing a scoring mechanism to assess the quality of retrieved paragraphs.
HGOT [60] 基于思维图 (Graph of Thoughts, GoT) [12],通过构建分层思维图来增强检索增强设置中的上下文学习能力,利用大语言模型的规划能力将复杂查询分解为更小的子查询,并引入评分机制来评估检索段落的质量。
Employing reinforcement learning, Huang et al. [87] introduce a fine-grained reward mechanism to train language models, allocating specific rewards for each generated sentence and citation to teach models accurate external source citation. This approach uses rejection sampling and reinforcement learning algorithms to enhance citation-inclusive text generation through localized reward signals. APO [142] reimagines attributive text generation as a preference learning problem, automatically generating preference data pairs to reduce annotation costs, and uses progressive preference optimization and experience replay to reinforce preference signals without overfitting or text degradation.
黄等人 [87] 采用强化学习,引入细粒度奖励机制来训练语言模型,为每个生成的句子和引用分配特定奖励,以教导模型准确引用外部来源。该方法利用拒绝采样和强化学习算法,通过局部奖励信号提升包含引用的文本生成能力。APO [142] 将归因文本生成重新构想为偏好学习问题,自动生成偏好数据对以降低标注成本,并采用渐进式偏好优化和经验回放来强化偏好信号,避免过拟合或文本质量下降。
(2) Citation after Generation. This approach involves models first generating a response, then adding citations through models like NLI. RARR [70] improves attributability by automatically finding external evidence for the language model’s output and post-editing to correct content while preserving the original output, enhancing attribution capabilities without altering the existing model. PURR [24] employs an unsupervised learning method where LLMs generate text noise themselves, then trains an editor to eliminate this noise, improving attribution performance and significantly speeding up generation. CEG [150] searches for supporting documents related to generated content and uses an NLI-based citation generation module to ensure each statement is supported by citations. "Attribute First, then Generate" [256] decomposes the generation process, first selecting relevant source text details and then generating based on these details, achieving localized attributability with each sentence supported by a clear source, greatly reducing manual fact-checking workload.
(2) 生成后引用。这种方法先让模型生成回答,再通过NLI等模型添加引用。RARR [70] 通过自动查找大语言模型输出的外部证据并进行后编辑来修正内容,同时保留原始输出,从而在不改变现有模型的情况下提升归因能力。PURR [24] 采用无监督学习方法,让大语言模型自行生成文本噪声,然后训练一个编辑器来消除这些噪声,既改善了归因表现又显著加快了生成速度。CEG [150] 会搜索与生成内容相关的支持文档,并使用基于NLI的引用生成模块确保每个陈述都有引用支撑。"Attribute First, then Generate" [256] 将生成过程分解,先选择相关源文本细节再基于这些细节生成,实现局部归因——每个句子都有明确来源支撑,大幅减少了人工事实核查的工作量。
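Post-hoc citation attachment can be sketched as a per-sentence support check. In systems like CEG the support test is a trained NLI model; the crude word-overlap score below is an explicitly labeled stand-in for it, and the threshold is an illustrative assumption.

```python
# Sketch of "citation after generation": check each generated sentence against
# the retrieved documents and append a citation when a document supports it;
# a word-overlap ratio stands in for the NLI entailment model used in practice.
def supports(sentence, doc, threshold=0.7):
    s, d = set(sentence.lower().split()), set(doc.lower().split())
    return len(s & d) / (len(s) or 1) >= threshold

def add_citations(sentences, docs):
    out = []
    for sent in sentences:
        ids = [i + 1 for i, doc in enumerate(docs) if supports(sent, doc)]
        tag = "".join(f"[{i}]" for i in ids) if ids else " [unsupported]"
        out.append(sent + tag)
    return out

docs = ["the eiffel tower is in paris", "the tower opened in 1889"]
sentences = ["the tower is in paris", "it is made of cheese"]
cited = add_citations(sentences, docs)
print(cited)
```

Sentences that no document supports are flagged rather than cited, which is exactly the signal RARR-style pipelines use to trigger post-editing.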
4.4 Personal Information Assistant
4.4 个人信息助手
The core of the GenIR system is the user, so understanding user intent is crucial. Researchers have explored various methods like personalized search [302, 364, 373], dialogue [172, 184, 354], and recommender [46, 169, 361] systems to explore users’ interests. Specifically, personalized information assistants aim to better understand users’ personalities and preferences, generating personalized responses to better meet their information needs. This section reviews the progress in research on personalized dialogue and domain-specific personalization. An overview of this section is provided in the blue area of Figure 6.
GenIR系统的核心是用户,因此理解用户意图至关重要。研究者们探索了多种方法来挖掘用户兴趣,例如个性化搜索[302, 364, 373]、对话系统[172, 184, 354]和推荐系统[46, 169, 361]。具体而言,个性化信息助手旨在更深入地理解用户的个性和偏好,生成个性化响应以更好地满足其信息需求。本节回顾了个性化对话和领域特定个性化方面的研究进展,相关概述见图6蓝色区域。
4.4.1 Personalized Dialogue System. To better understand user needs, researchers have explored two main approaches: personalized prompt design and model fine-tuning.
4.4.1 个性化对话系统。为了更好地理解用户需求,研究者探索了两种主要方法:个性化提示(prompt)设计和模型微调(fine-tuning)。
(1) Personalized Prompt. For personalized prompt design, Liu et al. [169] and Dai et al. [46] input users’ interaction and rating history into ChatGPT [209] for in-context learning, effectively generating personalized responses. LaMP [240] enhances the language model’s personalized output by retrieving personalized history from user profiles. Using long-term history, [35] designs prompts describing users’ long-term interests, needs, and goals for input into LLMs. BookGPT [361] uses LLM prompts, interactive querying methods, and result verification frameworks to obtain personalized book recommendations. PerSE [294] infers preferences from several reviews by a specific reviewer and provides personalized evaluations for new story inputs.
(1) 个性化提示。在个性化提示设计方面,Liu等人[169]和Dai等人[46]将用户的交互和评分历史输入ChatGPT[209]进行上下文学习,有效生成个性化响应。LaMP[240]通过从用户档案中检索个性化历史记录来增强语言模型的个性化输出。利用长期历史数据,[35]设计了描述用户长期兴趣、需求和目标的提示输入到大语言模型中。BookGPT[361]采用大语言模型提示、交互式查询方法和结果验证框架来获取个性化图书推荐。PerSE[294]通过分析特定评论者的多条评论推断其偏好,并为新故事输入提供个性化评估。
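The profile-retrieval step behind LaMP-style personalization can be sketched as follows. Word overlap stands in for the dense retriever a real system would use, and the history entries and prompt template are made-up illustrations.

```python
# Hedged sketch of LaMP-style personalized prompting: retrieve the user-profile
# entries most related to the current query and prepend them, so a generic LLM
# can condition its response on the user's history.
def overlap(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb)

def personalized_prompt(query, history, k=2):
    top = sorted(history, key=lambda h: overlap(query, h), reverse=True)[:k]
    profile = "; ".join(top)
    return f"User history: {profile}\nQuery: {query}\nPersonalized answer:"

history = ["rated sci-fi movies 5 stars",
           "enjoys hiking in the alps",
           "asked about sci-fi book releases"]
prompt = personalized_prompt("recommend a sci-fi movie", history)
print(prompt)
```

Selecting only the relevant slice of the profile keeps the prompt short while still steering the frozen model toward the user's preferences.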
Using prompt rewriting, [140] proposes a method combining supervised and reinforcement learning to better generate responses from frozen LLMs. Similarly, [31] rewrites user input prompts using extensive user text-to-image interaction history to align better with expected visual outputs.
通过提示词重写,[140]提出了一种结合监督学习和强化学习的方法,以更好地从冻结的大语言模型中生成响应。类似地,[31]利用大量用户文生图交互历史来重写用户输入提示,从而更好地与预期视觉输出对齐。
(2) Personalized Fine-tuning. This line of work focuses on fine-tuning models for personalized response generation. Zhang et al. [354] introduced the Persona-Chat dataset with 5 million personas to train models for more personalized dialogues. Mazaré et al. [190] created a dataset of over 700 million conversations extracted from Reddit, demonstrating the effectiveness of training dialogue models on large-scale personal profiles. $\mathcal{P}^{2}\mathrm{Bot}$ [172] generates personalized and consistent dialogues by simulating the perception of personalities between conversation participants. DHAP [184] designs a novel Transformer structure to automatically learn implicit user profiles from dialogue history without explicit personal information. Wu et al. [322] propose a generative segmentation memory network to integrate diverse personal information. Fu et al. [67] developed a variational approach to model the relationship between personal memory and knowledge selection, with a bidirectional learning mechanism.
(2) 个性化微调 (Personalized Fine-tuning)。这类研究专注于为个性化响应生成微调模型。Zhang等人[354]提出了包含500万个人设的Persona-Chat数据集,用于训练生成更个性化对话的模型。Mazaré等人[190]创建了从Reddit提取的超过7亿对话数据集,证明了在大规模个人资料上训练对话模型的有效性。$\mathcal{P}^{2}\mathrm{Bot}$[172]通过模拟对话参与者之间的个性感知,生成个性化且一致的对话。DHAP[184]设计了一种新型Transformer结构,无需显式个人信息即可从对话历史中自动学习隐式用户画像。Wu等人[322]提出生成式分段记忆网络来整合多样化个人信息。Fu等人[67]开发了基于变分的方法来建模个人记忆与知识选择之间的关系,并采用双向学习机制。
Using reinforcement learning, Cheng et al. [32] collected a domain-specific preference (DSP) dataset and proposed a three-stage reward model learning scheme, including base model training, general preference fine-tuning, and customized preference fine-tuning. Jang et al. [98] developed "Personalized Soups," first optimizing multiple policy models with different preferences using PPO [246], then dynamically combining parameters during inference.
Cheng等[32]采用强化学习方法,收集了领域特定偏好(DSP)数据集,并提出三阶段奖励模型学习方案,包括基础模型训练、通用偏好微调和定制化偏好微调。Jang等[98]开发了"Personalized Soups"方法,先通过PPO[246]优化具有不同偏好的多个策略模型,再在推理阶段动态组合参数。
Using retrieval-enhanced methods, LAPDOG [91] retrieves relevant information from story documents to enhance personal profiles and generate better personalized responses. SAFARI [295] leverages LLMs’ planning and knowledge integration to generate responses consistent with character settings. Inspired by writing education, Li et al. [141] proposed a multi-stage, multi-task framework including retrieval, ranking, sum mari z ation, synthesis, and generation to teach LLMs personalized responses. For subjective tasks, [316] studied the s