Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers
Abstract—Retrieval-Augmented Generation (RAG) is a prevalent approach for combining a private knowledge base of documents with Large Language Models (LLMs) to build generative question-answering (Q&A) systems. However, RAG accuracy becomes harder to maintain as the document corpus scales up, and the Retriever plays an outsized role in overall accuracy because it extracts the most relevant documents from the corpus to provide context to the LLM. In this paper, we propose the 'Blended RAG' method, which leverages semantic search techniques, such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid query strategies. Our study achieves better retrieval results and sets new benchmarks on Information Retrieval (IR) datasets such as NQ and TREC-COVID. We further extend this 'Blended Retriever' to the RAG system and demonstrate far superior results on generative Q&A datasets such as SQuAD, even surpassing fine-tuning performance.
Index Terms—RAG, Retrievers, Semantic Search, Dense Index, Vector Search
I. INTRODUCTION
RAG represents an approach to text generation that is based not only on patterns learned during training but also on dynamically retrieved external knowledge [1]. This method combines the creative flair of generative models with the encyclopedic recall of a search engine. The efficacy of the RAG system relies fundamentally on two components: the Retriever (R) and the Generator (G), the latter representing the size and type of LLM.
The language model can easily craft sentences, but it might not always have all the facts. This is where the Retriever (R) steps in, quickly sifting through vast amounts of documents to find relevant information that can inform and enrich the language model's output. Think of the retriever as the researcher of the AI system: it feeds contextually grounded text to the Generator (G), which uses it to generate knowledgeable answers. Without the retriever, RAG would be like a well-spoken individual who delivers irrelevant information.
II. RELATED WORK
Search has been a focal point of research in information retrieval, with numerous studies exploring various methodologies. Historically, the BM25 (Best Match) algorithm, which uses similarity search, has been a cornerstone in this field, as explored by Robertson and Zaragoza (2009) [2]. BM25 ranks documents according to their pertinence to a query, using Term Frequency (TF), Inverse Document Frequency (IDF), and Document Length to compute a relevance score.
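To make the roles of TF, IDF, and document length concrete, the following is a minimal sketch of BM25 scoring over pre-tokenized text. It is an illustration only, not the implementation used in this study: the function name `bm25_scores`, the toy documents, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant); the +1 keeps the value positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, hypothetical and for illustration only.
docs = [
    "hybrid search blends keyword and vector search".split(),
    "the weather is sunny today".split(),
    "vector search uses dense embeddings".split(),
]
scores = bm25_scores("vector search".split(), docs)
```

A document that shares no term with the query (the second one here) receives a score of zero, which is the keyword-matching limitation that semantic retrievers aim to overcome.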
Dense vector models, particularly those employing KNN (k-Nearest Neighbours) algorithms, have gained attention for their ability to capture deep semantic relationships in data. Studies by Johnson et al. (2019) demonstrated the efficacy of dense vector representations in large-scale search applications. The kinship between data entities (including the search query) is assessed by computing vector proximity (for example, via cosine similarity). During search execution, the model identifies the k vectors closest to the query vector and returns the corresponding data entities as results. The ability of these models to map text into a vector space, where semantic similarity can be quantitatively assessed, marks a significant advancement over traditional keyword-based approaches [3].
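The nearest-neighbour lookup described above can be sketched as an exact (brute-force) search with cosine similarity. This is a simplified illustration under stated assumptions: the three-dimensional vectors, the document ids, and the helper names `cosine` and `knn` are hypothetical, and production systems typically use approximate indexes rather than a full scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn(query_vec, doc_vecs, k):
    """Return the ids of the k document vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings; real ones come from an embedding model.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.0, 1.0, 0.0],
    "d3": [0.7, 0.6, 0.1],
}
top2 = knn([1.0, 0.0, 0.0], doc_vecs, k=2)
print(top2)  # ['d1', 'd3']
```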
On the other hand, sparse encoder-based vector models have also been explored for their precision in representing document semantics. The work of Zaharia et al. (2010) illustrates the potential of these models in efficiently handling high-dimensional data while maintaining interpretability, a challenge often faced in dense vector representations. In Sparse Encoder indexes, the indexed documents and the user's search query are mapped into an extensive array of associated terms derived from a vast corpus of training data, so as to encapsulate the relationships and contextual use of concepts. The resulting expanded terms for documents and queries are encoded into sparse vectors, an efficient data representation format when handling an extensive vocabulary.
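As a rough illustration of why sparse vectors suit large vocabularies, the sketch below stores only the non-zero term weights and scores a query against a document with a dot product over shared terms. The term weights are hand-written placeholders standing in for the output of a learned sparse encoder, and `sparse_dot` is a hypothetical helper name.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict; only terms present in both contribute.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(weight * large[term] for term, weight in small.items() if term in large)

# Hypothetical expanded terms and weights (a learned model would produce these).
doc_vec = {"covid": 1.8, "coronavirus": 1.5, "virus": 0.9, "vaccine": 0.4}
query_vec = {"coronavirus": 1.2, "covid": 1.0, "symptoms": 0.7}

score = sparse_dot(query_vec, doc_vec)
```

Because each vector lists only its active terms, the matched terms can be inspected directly, which is the interpretability advantage noted above.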
A. Limitations in the current RAG system
Most current retrieval methodologies employed in Retrieval-Augmented Generation (RAG) pipelines rely on keyword and similarity-based searches, which can restrict the RAG system's overall accuracy. Table I provides a summary of the current benchmarks for retriever accuracy.
TABLE I: Current Retriever Benchmarks
Dataset | Benchmark Metrics | NDCG@10 | P@20 | F1 |
---|---|---|---|---|
NQ dataset | P@20 | 0.633 | 86 | 79.6 |
TREC-COVID | NDCG@10 | 80.4 | | |
HotpotQA | F1, EM | 0.85 | | |
While most prior efforts to improve RAG accuracy have focused on the G part, by tweaking LLM prompts, tuning, etc. [9], they have limited impact on the overall accuracy of the RAG system: if the R part feeds irrelevant context, the answer will be inaccurate.
Finding the best search method for RAG is still an emerging area of research. The goal of this study is to enhance retriever and RAG accuracy by incorporating Semantic Search-Based Retrievers and Hybrid Search Queries.
III. BLENDED RETRIEVERS
For RAG systems, we explored three distinct search strategies: keyword-based similarity search, dense vector-based search, and semantic-based sparse encoder search, integrating these to formulate hybrid queries. Unlike conventional keyword matching, semantic search delves into the nuances of a user's query, deciphering context and intent. This study systematically evaluates an array of search techniques across three primary indices: BM25 [4] for keyword-based search, KNN [5] for vector-based search, and Elastic Learned Sparse Encoder (ELSER) for sparse encoder-based semantic search.
A. Methodology
Our methodology unfolds in a sequence of progressive steps, starting with the elementary match query within the BM25 index. We then escalate to hybrid queries that combine diverse search techniques across multiple fields, leveraging the multi-match query within the Sparse Encoder-Based Index. This method proves invaluable when the exact location of the query text within the document corpus is unknown, since it ensures comprehensive match retrieval.
The multi-match queries are categorized as follows:
• Cross Fields: Targets concurrence across multiple fields
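A multi-match query of this kind can be expressed as a request body in the Elasticsearch Query DSL, where `multi_match` accepts a `type` such as `cross_fields` or `best_fields`. The sketch below only builds the body as a plain dictionary; the field names `title` and `text`, the helper name `multi_match_query`, and the example question are assumptions, not the exact queries used in this study.

```python
import json

def multi_match_query(text, fields, match_type="cross_fields", size=10):
    """Build an Elasticsearch-style multi_match request body as a dict."""
    return {
        "size": size,  # number of hits to return (top-k)
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,    # hypothetical field names
                "type": match_type,  # e.g. "cross_fields" or "best_fields"
            }
        },
    }

body = multi_match_query("who wrote the origin of species", ["title", "text"])
print(json.dumps(body, indent=2))
```

Switching `match_type` changes only how per-field scores are combined, so the same helper can be reused when comparing query types over one index.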
After initial match queries, we incorporate dense vector (KNN) and sparse encoder indices, each with their bespoke hybrid queries. This strategic approach synthesizes the strengths of each index, channeling them towards the unified goal of refining retrieval accuracy within our RAG system. We calculate the top-k retrieval accuracy metric to distill the essence of each query type.
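The text does not spell out how top-k retrieval accuracy is computed. One common definition, assumed in the sketch below, counts a query as a hit when at least one relevant document appears among its first k results. The query ids, document ids, and the function name `top_k_accuracy` are hypothetical.

```python
def top_k_accuracy(results, relevant, k=10):
    """Fraction of queries with at least one relevant document in the top k.

    results:  {query_id: [doc_id, ...]} in ranked order
    relevant: {query_id: {doc_id, ...}} ground-truth relevant documents
    """
    hits = 0
    for query_id, ranked_ids in results.items():
        if set(ranked_ids[:k]) & relevant.get(query_id, set()):
            hits += 1
    return hits / len(results)

# Toy rankings, for illustration only.
results = {
    "q1": ["d3", "d7", "d1"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
    "q4": ["d0", "d1", "d2"],
}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d0"}, "q4": {"d0"}}

print(top_k_accuracy(results, relevant, k=2))  # 0.25
print(top_k_accuracy(results, relevant, k=3))  # 0.75
```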
In Figure 1, we introduce a scheme designed to create Blended Retrievers by blending semantic search with hybrid queries.
B. Constructing RAG System
From the many possible permutations, the six hybrid queries (top 6) with the highest retrieval efficacy were chosen for further scrutiny. These queries were then evaluated rigorously across the benchmark datasets to ascertain the precision of the retrieval component within RAG. The six queries represent the culmination of retriever experimentation, combining our best query strategies with the various index types. The six blended queries are then fed to generative question-answering systems. This process finds the best retrievers to feed to the Generator of RAG, which is necessary given the exponential growth in the number of potential query combinations stemming from the integration with distinct index types.
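The selection step can be pictured as enumerating (index, query type) pairs and keeping the six with the highest measured accuracy. The sketch below is illustrative only: the index and query-type labels are shorthand, `select_top_retrievers` is a hypothetical helper, and the scores are placeholders assigned by position, not results from this study.

```python
from itertools import product

indexes = ["bm25", "knn", "sparse_encoder"]
query_types = ["match", "cross_fields", "best_fields"]

# Every (index, query type) pair is a candidate blended retriever: 3 x 3 = 9.
candidates = list(product(indexes, query_types))

def select_top_retrievers(accuracy_by_retriever, n=6):
    """Return the n retrievers with the highest accuracy, best first."""
    ranked = sorted(
        accuracy_by_retriever.items(), key=lambda item: item[1], reverse=True
    )
    return [retriever for retriever, _ in ranked[:n]]

# Placeholder scores (0.1, 0.2, ...) assigned by position, NOT measured values.
placeholder_accuracy = {
    pair: round(0.1 * (i + 1), 1) for i, pair in enumerate(candidates)
}
top_six = select_top_retrievers(placeholder_accuracy)
```

In practice the accuracy values would come from running each candidate against a benchmark dataset, and adding further indexes or query types grows the candidate list multiplicatively.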
Constructing an effective RAG system is intricate in several respects, particularly when source datasets are diverse and complex. We undertook a comprehensive evaluation of a myriad of hybrid query formulations, scrutinizing their performance across benchmark datasets, including Natural Questions (NQ), TREC-COVID, the Stanford Question Answering Dataset (SQuAD), and HotPotQA.
IV. EXPERIMENTATION FOR RETRIEVER EVALUATION
We used top-10 retrieval accuracy to narrow down the six best types of blended retrievers (index + hybrid query) for comparison for each benchmark dataset.
- Top-10 retrieval accuracy on the NQ dataset: For the NQ dataset [6], our empirical analysis has demonstrated the superior performance of hybrid query strategies, attributable to their ability to utilize multiple data fields effectively. In Figure 2, our findings reveal that the hybrid query approach employing the Sparse Encoder with Best Fields attains the highest retrieval accuracy, reaching 88.77%. This result surpasses all other formulations, establishing a new benchmark for retrieval tasks within this dataset.
- Top-10 retrieval accuracy on the TREC-COVID dataset: For the TREC-COVID dataset [7], which encompasses relevancy scores spanning from -1 to 2, with -1 indicating irrelevance and 2 denoting high relevance, our initial assessments targeted documents with a relevancy score of 1, deemed partially relevant.
Fig. 1: Scheme of Creating Blended Retrievers using Semantic Search with Hybrid Queries. (Figure title: Blended Retriever Queries using Similarity and Semantic Search Indexes.)
Analysis of Figure 3 reveals the superior performance of vector search hybrid queries over those based on keywords. In particular, hybrid queries that leverage the Sparse Encoder with Best Fields demonstrate the highest efficacy across all index types, at 78% accuracy.
Fig. 2: Top-10 Retriever Accuracy for NQ Dataset
Fig. 3: Top 10 retriever accuracy for Trec-Covid Score-1
Subsequent to the initial evaluation, the same spectrum of queries was assessed against the TREC-COVID dataset with a relevancy score of 2, denoting that the documents were entirely pertinent to the associated queries. The results in Figure 4, where documents fully meet the relevance criteria for the associated queries, reinforce the efficacy of vector search hybrid queries over conventional keyword-based methods. Notably, the hybrid query incorporating the Sparse Encoder with Best Fields demonstrates a 98% top-10 retrieval accuracy, eclipsing all other formulations. This suggests that a methodological pivot towards more nuanced blended search, particularly queries that effectively utilize Best Fields, can significantly enhance retrieval outcomes in information retrieval (IR) systems.
Fig. 4: Top 10 retriever accuracy for Trec-Covid Score-2
Fig. 5: Top 10 retriever accuracy for HotPotQA dataset
- Top-10 retrieval accuracy on the HotPotQA dataset: The HotPotQA [8] dataset, with its extensive corpus of over 5M documents and a query set comprising 7,500 items, presents a formidable challenge for comprehensive evaluation due to compute requirements. Consequently, the assessment was confined to a select subset of hybrid queries. Despite these constraints, the analysis provided insightful data, as shown in Figure 5.
Figure 5 shows that hybrid queries, specifically those utilizing Cross Fields and Best Fields search strategies, demonstrate superior performance. Notably, the hybrid query that blends the Sparse Encoder with Best Fields queries achieved the highest accuracy on the HotPotQA dataset, at 65.70%.
Fig. 6: NQ dataset Benchmarking using NDCG@10 Metric
TABLE II: Retriever Benchmarking using NDCG@10 Metric
Dataset | Model/Pipeline | NDCG@10 |
---|---|---|
TREC-COVID | COCO-DR Large | 0.804 |
TREC-COVID | Blended RAG | 0.87 |
NQ dataset | monoT5-3B | 0.633 |
NQ dataset | Blended RAG | 0.67 |
A. Retriever Benchmarking
Now that we have identified the best set of combinations of Index + Query types, we will use these six queries on IR datasets for benchmarking using NDCG@10 [9] scores (Normalised Discounted Cumulative Gain metric).
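For readers unfamiliar with the metric, the sketch below computes NDCG@k for a single query. It assumes linear gain (the relevance grade itself, rather than an exponential gain), non-negative grades, and an ideal ranking built from all judged documents for the query; the function names and the toy grades are hypothetical, and benchmark toolkits may differ in these details.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the first k graded results."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, judged_relevances, k=10):
    """NDCG@k: DCG of the returned ranking divided by the ideal DCG.

    ranked_relevances: grades of the returned documents, in ranked order
    judged_relevances: grades of all judged documents for the query
    """
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy example: a highly relevant hit first, a miss, then a partial hit.
score = ndcg([2, 0, 1], [2, 1, 0, 0], k=10)
```

A ranking that returns the judged documents in descending order of relevance scores 1.0, and the benchmark figure for a dataset is the mean of this value over all queries.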
- NQ dataset benchmarking: The results for NDCG@10 using the six queries and the current benchmark on the NQ dataset are shown in Figure 6.