A Thorough Comparison of Cross-Encoders and LLMs for Reranking SPLADE
Hervé Déjean, Stéphane Clinchant, Thibault Formal first.lastname@naverlabs.com Naver Labs Europe Meylan, France
ABSTRACT
We present a comparative study of cross-encoder and LLM rerankers in the context of re-ranking effective SPLADE retrievers. We conduct a large evaluation on TREC Deep Learning datasets and out-of-domain datasets such as BEIR and LoTTE. In the first set of experiments, we show that cross-encoder rerankers are hard to distinguish when re-ranking SPLADE on MS MARCO. Observations shift in the out-of-domain scenario, where both the type of model and the number of documents to re-rank have an impact on effectiveness. Then, we focus on listwise rerankers based on Large Language Models – especially GPT-4. While GPT-4 demonstrates impressive (zero-shot) performance, we show that traditional cross-encoders remain very competitive. Overall, our findings aim to provide a more nuanced perspective on the recent excitement surrounding LLM-based re-rankers – by positioning them as another factor to consider in balancing effectiveness and efficiency in search systems.
KEYWORDS
Information Retrieval, Neural Search, Reranking, Cross-Encoders, Large Language Models
Therefore, we conduct an extensive experimental study on the TREC Deep Learning datasets (19-23) [4–7] for in-domain evaluation, as well as the BEIR [27] and LoTTE [24] collections for out-of-domain evaluation. Overall, it is difficult to draw general conclusions from this extensive evaluation, but our findings reveal that:
• Cross-encoder rerankers behave slightly differently on in-domain and out-of-domain datasets.
• Cross-encoders remain competitive against LLM-based rerankers – in addition to being far more efficient.
• Open LLMs under-perform compared to GPT-4, but still exhibit good ranking abilities under some constraints (e.g., small prompts).
1 INTRODUCTION
Reranking models significantly enhance the quality of Information Retrieval (IR) systems. Due to their complexity, they are usually bound to reorder a limited number of documents provided by an efficient first-stage retriever such as BM25 [22, 23]. Traditional reranking methods used to rely on manually defined features, and employed specific learning-to-rank losses [15]. Since the advent of models like BERT [8], cross-encoders have become the standard reranking “machinery” [10, 17]. More recent architectures have gradually been tested, including encoder-decoder [19, 32, 33] or decoder-only models [16]. More recently, Large Language Models (LLMs) have been shown to be effective zero-shot rerankers. For instance, RankGPT [25] – relying on OpenAI GPT-4 [18] – provides puzzling outcomes: it performs very well as a listwise reranker out-of-the-box, and can even be iteratively re-applied to incrementally improve the reranked lists.
However, we notice that strong baselines are often absent or not systematically used in recent works evaluating LLM-based rerankers (e.g., [25]). For instance, it remains unclear whether such approaches significantly outperform standard cross-encoders when re-ordering the results of strong retrievers – and if so, in which setting (e.g., how many documents to consider). Therefore, this study aims to shed light on such questions.
2 LLMS AS RERANKERS
RankGPT [25] is the first approach to investigate the direct use of LLMs as rerankers – the model generating an ordered list of document ids as output. To bypass the inherent prompt length limit of GPT models, Sun et al. introduce a sliding window strategy that allows LLMs to rank an arbitrary number of passages. Both GPT-3.5 Turbo and GPT-4 [18] are evaluated – the latter providing remarkable results (especially given the zero-shot nature of the task). Later on, Tang et al. [26] present a more effective approach (PSC) to rerank documents by comparing permuted input lists.
Experiments using open-source LLMs – OpenAI LLMs being closed and sometimes very expensive – are however more underwhelming: Qin et al. [21] show that listwise ranking approaches generate completely useless outputs on open moderate-sized LLMs. They therefore propose a pairwise approach with an advanced sorting algorithm (PRP-Sorting) to improve and speed up reranking. Zhuang et al. [33] further propose to compare the different ways to conduct reranking: pointwise, pairwise, and listwise. They introduce a new setwise prompting method which leads to a more effective and efficient listwise (zero-shot) ranking. Their experiments are conducted by fine-tuning a FLAN-T5-XXL model [2]. Besides, this work seems to be the first one really questioning the (in)efficiency of LLMs as rerankers.
In the meantime, several approaches have studied fine-tuning of LLMs for the task of reranking [16, 20, 31]. For instance, Pradeep et al. [20] fine-tune a moderate-sized LLM based on the Zephyr-7B model [29], and achieve competitive results on par with GPT-4.
While demonstrating impressive capabilities (both in zero-shot or fine-tuning settings), these models are relatively inefficient compared to traditional rerankers based on cross-encoders – which are themselves considered as “slow” components in IR systems. How to make these LLMs more efficient remains unclear.
3 EXPERIMENTAL SETTING
We describe our experimental setting: the retrievers we consider, and the cross-encoder and LLM-based re-rankers we aim to compare in in- and out-of-domain settings.
3.1 An Effective Retriever – SPLADE-v3
While demonstrating impressive results, rerankers based on LLMs haven’t been thoroughly evaluated when coupled with more effective first-stage models. For instance, RankGPT [25] solely reorders BM25 top documents. However, state-of-the-art retrievers such as SPLADE++ [9] or SPLADE-v3 [14] already achieve better results than the ones presented in RankGPT. Obviously, a good first-stage retriever “makes things easier” for the reranker, but it is still unclear how they interact.
We, therefore, propose to study reranking for highly effective retrievers – and focus on SPLADE models due to their good results on several tracks of the TREC Deep Learning evaluation campaign [13] as well as out-of-domain benchmarks. More specifically, we test the three variants of the latest series of SPLADE models [14]: (1) SPLADE-v3, (2) SPLADE-v3-DistilBERT, (3) SPLADE-v3-Doc.
While SPLADE-v3 outperforms its two more efficient counterparts, we aim to assess the impact of reranking on the final performance with “weaker” models.
3.2 Rerankers Based on Cross-Encoders
We then evaluate two cross-encoders specifically trained to re-rank SPLADE models [11, 13]. Specifically, we select the ones based on DeBERTa-v3 large [12] and ELECTRA-large [3]. The DeBERTa-v3 large model comes with 24 layers and a hidden size of 1024. It has 304M parameters, and a Byte-Pair-Encoding vocabulary containing 128k tokens. The ELECTRA model has similar specifications – besides the WordPiece vocabulary containing 30k tokens – with 335M parameters in total.
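For concreteness, the snippet below is a minimal sketch of how such a cross-encoder is typically applied on top of the retriever output: each (query, passage) pair is scored jointly and the top-k list is re-sorted by the predicted relevance score. It uses the Hugging Face transformers API for illustration; the checkpoint name is a placeholder, not one of the models evaluated in this study.

```python
# Illustrative sketch: re-ranking a retriever's top-k with a cross-encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "some-org/some-cross-encoder"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()


def rerank(query: str, passages: list[str], batch_size: int = 32) -> list[int]:
    """Return passage indices sorted by decreasing cross-encoder score."""
    scores = []
    for i in range(0, len(passages), batch_size):
        batch = passages[i:i + batch_size]
        inputs = tokenizer([query] * len(batch), batch, padding=True,
                           truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            # assumes a single-logit (regression-style) relevance head
            logits = model(**inputs).logits
        scores.extend(logits.squeeze(-1).tolist())
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
```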
3.3 Under the Hood of RankGPT
Following Sun et al. [25], we use GPT-3.5 and GPT-4 as zero-shot rerankers. To bypass the limited prompt length of GPT, we also use the sliding window strategy that allows ranking a larger pool by iteratively ranking overlapping chunks of documents. We show (Section 4.3) that using a more effective retriever (in our case, SPLADE-v3) allows us to rerank a smaller number of documents without resorting to this window strategy.
Note that OpenAI models – especially GPT-4 – are rather expensive. Due to budget limits, some of our experiments are not as comprehensive as we would have liked them to be.
For the experiments, we rely on the RankGPT code, and modify it to be able to use other (open) models. We now describe in detail the pre- and post-processing steps used in RankGPT – which are critical aspects for such re-ranking approaches based on generative models.
3.3.1 Sliding Window Strategy. A key concern when using LLM-based rerankers is the number of documents their prompt can ingest. A key mechanism used by RankGPT is a sliding window strategy: the system starts by ranking the $N$ last documents, then creates a $k$-length overlapping window with the $N$ previous documents, ranks them, and iterates the process until reaching the $N$ first documents. Because ranking has to be performed several times per query, this approach is rather costly and inefficient. We show in the experiments that, depending on the model and the dataset, this mechanism is either useful or can be ignored.
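A minimal sketch of this mechanism is given below (illustrative only; window size and stride are free parameters, and `rerank_window` stands in for one LLM call over a chunk of documents).

```python
# Illustrative sketch of the sliding-window strategy: the LLM re-ranks a
# fixed-size window, starting from the tail of the candidate list, then the
# window is shifted towards the head with some overlap (stride < window).
from typing import Callable, List


def sliding_window_rerank(doc_ids: List[str],
                          rerank_window: Callable[[List[str]], List[str]],
                          window: int = 20,
                          stride: int = 10) -> List[str]:
    ranking = list(doc_ids)
    end = len(ranking)
    while end > 0:
        start = max(0, end - window)
        # One LLM call: returns the window's ids re-ordered by relevance.
        ranking[start:end] = rerank_window(ranking[start:end])
        if start == 0:
            break
        end -= stride  # shift the window towards the top of the list
    return ranking
```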
3.3.2 Pre-Processing: Truncating Documents. Another mechanism used to manage the prompt size is document truncation. As long as the $N$ documents do not fit into the prompt, documents are shortened by one token. The default document length in RankGPT is $|d|=300$ (which, however, is already quite large when considering MS MARCO passages). It is much more efficient than the window mechanism, but can negatively impact effectiveness.
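The sketch below illustrates this truncation loop under the simplifying assumption of a whitespace tokenizer and a single `prompt_budget` parameter for the document part of the prompt; the actual implementation operates on model tokens.

```python
# Illustrative sketch of the truncation mechanism: shrink the per-document
# length budget one token at a time until the N documents fit in the prompt.
from typing import List


def truncate_to_fit(docs: List[str], prompt_budget: int,
                    max_doc_len: int = 300) -> List[str]:
    """Whitespace-token approximation of the document-truncation step."""
    while max_doc_len > 0:
        truncated = [" ".join(d.split()[:max_doc_len]) for d in docs]
        if sum(len(d.split()) for d in truncated) <= prompt_budget:
            return truncated
        max_doc_len -= 1  # shorten every document by one token and retry
    return ["" for _ in docs]
```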
3.3.3 Post-Processing. The output of the model – assuming the instructions are followed by the LLM – is a list of document identifiers ordered by decreasing relevance, and separated by a “>” token. For strong LLMs such as GPT-4, almost no post-processing is needed, but it is strongly recommended to use it with other models that do not always respect the formatting instructions. Additionally, document identifiers that are not present in the LLM output are added to the final output (according to the original ordering). It allows an LLM that does not generate anything “meaningful” to perform as well as the retriever.
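The following sketch illustrates this post-processing step: parsing the generated permutation, dropping malformed or duplicate identifiers, and back-filling any missing documents in their original (retriever) order.

```python
# Illustrative sketch: parse the "a > b > c" permutation produced by the LLM
# and back-fill missing ids so the output is always a complete ranking.
import re
from typing import List


def parse_permutation(generated: str, original_ids: List[str]) -> List[str]:
    seen, ranking = set(), []
    for token in generated.split(">"):
        doc_id = re.sub(r"[^\w-]", "", token)  # strip brackets, spaces, etc.
        if doc_id in original_ids and doc_id not in seen:
            ranking.append(doc_id)
            seen.add(doc_id)
    # Append documents the LLM did not mention, keeping the original order.
    ranking.extend(d for d in original_ids if d not in seen)
    return ranking
```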
3.3.4 Prompting. Pradeep et al. [20] use the prompt designed for RankGPT – we also follow the same strategy for all the LLMs in this study.
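As an illustration, the sketch below builds a listwise prompt in the spirit of the RankGPT template; the wording is paraphrased and is not the exact prompt used in [20] or [25]. Passages are numbered with identifiers such as [1], [2], and the model is asked to output only the ordered identifiers separated by “>”.

```python
# Illustrative, paraphrased listwise prompt in the spirit of RankGPT --
# not the exact template used in the cited works.
from typing import List


def build_listwise_prompt(query: str, passages: List[str]) -> str:
    lines = [f"I will provide you with {len(passages)} passages, each labelled "
             f"with an identifier such as [1]. Rank them by relevance to the query."]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    lines.append(f"Query: {query}")
    lines.append("Output only the ranked identifiers, most relevant first, "
                 "separated by ' > ' (e.g., [2] > [1] > [3]).")
    return "\n".join(lines)
```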
3.4 Datasets
For evaluation, we use two types of datasets: in-domain and out-of-domain. For in-domain datasets, we use all the available TREC Deep Learning datasets based on the MS MARCO collection (passages): from DL19 to DL23 [4–7]. For the cross-encoder rerankers, we use BEIR [27] – with the 13 readily available datasets – and LoTTE [24] as out-of-domain datasets. For the OpenAI LLM rerankers, their cost prevents us from evaluating them at this scale. Therefore, we select datasets from BEIR for which the number of queries is around 50: TREC-COVID, TREC-NEWS, and Touché-2020. Finally, we also consider the NovelEval dataset [25] to assess GPT-4’s effectiveness on unseen data (data issued after GPT-4 training).
4 RESULTS
We first discuss the results with cross-encoders for in-domain and out-of-domain evaluation settings. Then we compare cross-encoders with LLM rerankers. Finally, we conduct various experiments on the TREC-COVID dataset, to highlight the different behavior between cross-encoders and LLMs.
Our results show that it isn’t obvious to draw clear conclusions, as observations are quite dependent on datasets.
In the following, we report nDCG@10. Note that we haven’t (yet) performed statistical significance tests – but given the small number of queries in the DL datasets, a difference of less than 1 point nDCG@10 is usually not considered significant.
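For reference, the snippet below is a minimal sketch of the reported metric. It uses the exponential-gain formulation of nDCG@10, which is one common convention; evaluation toolkits may use a linear gain instead.

```python
# Minimal sketch of nDCG@10 with exponential gains: (2^rel - 1) / log2(rank + 1),
# normalised by the ideal DCG obtained from the sorted relevance judgements.
import math
from typing import Dict, List


def ndcg_at_10(ranked_ids: List[str], qrels: Dict[str, int]) -> float:
    gains = [2 ** qrels.get(d, 0) - 1 for d in ranked_ids[:10]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted((2 ** r - 1 for r in qrels.values()), reverse=True)[:10]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```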
4.1 Cross-Encoders: In-Domain Evaluation
As we provide results for several models, rerankers, and $\mathrm{top}_k$ – $k$ being the number of reranked documents – we decompose the analysis by first comparing the SPLADE models (Table 1), then the rerankers (Table 2), and finally the $\mathrm{top}_k$ (Table 3). In these tables, we fix one reference – in grey – and display the differences $\Delta$ in performance between the reference and other results. The full table is also given in Appendix 5.
4.1.1 Comparing Retrievers. Taking SPLADE-v3-DistilBERT as the reference point, we compare it with SPLADE-v3 and SPLADE-v3-Doc. Considering the baseline row (i.e., first-stage only), we can observe that SPLADE-v3 is usually far more effective – except for the DL19 dataset – while SPLADE-v3-Doc lags behind. When looking at the number of documents to re-rank, we however see that the gaps between models gradually diminish when increasing $k$.
4.1.2 Comparing Rerankers. We now compare the DeBERTa-v3 (taken as the reference) and ELECTRA rerankers in Table 2. It is difficult to see whether one model is really better than the other: observations vary depending on the dataset. For instance, the model based on ELECTRA performs better for DL20 (and marginally better for DL21) – but this trend tends to reverse for DL19 or DL22.
4.1.3 Impact of $k$. We investigate the impact of $k$ – the number of documents to re-rank. Using $k=50$ as a reference, we compare it with $k=100$ and $k=200$. We observe that the smallest model (SPLADE-v3-DistilBERT) benefits the most from the increase of $k$, while the others are relatively stable over the $k$ values. This is sensible, as less effective models will tend to retrieve relevant documents at lower ranks. Yet, the impact of $k$ also seems to depend on the dataset used, which makes it difficult to draw any general conclusion.
To conclude, there is no general trend to be observed from these in-domain comparisons. We can see that the best first-stage model usually leads to the best end performance, but the rerankers act as expected by narrowing the initial gaps between the three retrievers.
4.2 Cross-Encoder: Out-of-Domain Evaluation
The out-of-domain evaluation brings more contrast to the results. In the following, we provide the same three comparisons as in Section 4.1: first-stage models, rerankers models, and the ${\mathrm{top}}_{k}$ . The full results are provided in Appendix 5, Table 14.
A note on BEIR. Looking at the BEIR dataset, the improvements at first glance were a bit disappointing. With a closer look, we spotted a very strange behavior for the ArguAna dataset. Indeed, the performance tended to (drastically) diminish when increasing the number of documents to re-rank – e.g., from 50 $(\mathrm{nDCG}@10)$ down to approximately 15 for all SPLADE models, leading to almost no improvement for the ELECTRA reranker. This may be explained by the fact that the ArguAna task consists in finding the counter-argument of the “query”. When discarding ArguAna (column called BEIR-12), we are “back on our feet”, and the behavior of BEIR-12 datasets is aligned with LoTTE. We therefore differentiate these two versions of BEIR in our study.
4.2.1 Comparing Retrievers. We first compare in Table 4 the impact of the first-stage model. Overall, we can draw similar observations to the in-domain ones: rerankers flatten the differences between models, and with a large enough $k$ , most of the differences are below one $\mathrm{nDCG@10}$ point – while previously reaching up to three points.
4.2.2 Comparing Rerankers. In Table 5, we further compare rerankers. In out-of-domain, we observe a clear “winner”: DeBERTa-v3 consistently improves over the ELECTRA-based model.
4.2.3 Impact of $k$. We additionally show in Table 6 that increasing $k$ is a good way to increase the effectiveness of the models. As expected, this is especially true for the weakest models, but even the more effective ones benefit from re-ordering more documents – especially for the LoTTE dataset.
The findings for the out-of-domain setting are different from the in-domain ones. In this case, a more effective re-ranker (DeBERTa-v3) consistently outperforms its ELECTRA-based counterpart, no matter the retriever. Additionally, increasing the number of documents to re-rank is consistently beneficial (considering the BEIR-12 version).
4.3 LLM as Rerankers
We now discuss the evaluation of LLMs as rerankers. First, we focus on the OpenAI models used by Sun et al. [25]: GPT-3.5 Turbo and GPT-4. Unfortunately, the cost of using such models prevents us from conducting extensive experiments. For some configurations, we have conducted 3 runs to assess the LLMs’ variability due to sampling, the standard deviation being around 0.2 to 0.5 points for the nDCG@10 measure. These results are indicated in Table 7 with the format: AVG (STD).
We present in Table 7 the evaluation of GPT-3.5 Turbo and GPT-4 re-ranking documents from SPLADE-v3, for the in-domain datasets and for the following out-of-domain datasets: TREC-COVID, TREC-NEWS, Touché-2020 and NovelEval. As a comparison, we also report the DeBERTa-v3 results from Table 2.
As we can see, GPT-3.5 Turbo is able to improve performance when reranking, but falls short compared to DeBERTa-v3. In some cases, it even degrades the results of the retriever (DL20, DL22, Touché). It is interesting to note that this degradation is not visible when BM25 is used as a first-stage [25] (see DL20 results). On the other hand, GPT-4 performs well – on par with DeBERTa-v3 – and even better for some datasets (especially DL23 and NovelEval).
Table 1: Effectiveness of three SPLADE-v3 models on TREC DL datasets. SPLADE-v3-DistilBERT serves as the reference point (absolute nDCG@10) – and the comparisons are given in Δ(nDCG@10). The baselines correspond to the retrievers only (no re-ranking).
| Reranker | $\mathrm{top}_k$ | SPLADE-v3-DistilBERT | | | | | SPLADE-v3 (Δ) | | | | | SPLADE-v3-Doc (Δ) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | DL19 | DL20 | DL21 | DL22 | DL23 | DL19 | DL20 | DL21 | DL22 | DL23 | DL19 | DL20 | DL21 | DL22 | DL23 |
| baseline | – | 72.3 | 75.4 | 70.7 | 61.9 | 50.7 | -3.2 | 0.9 | 5.5 | 6.8 | 4.1 | -3.9 | -4.1 | -0.3 | -3.5 | -0.2 |
| DeBERTa-v3 | 50 | 78.1 | 75.3 | 74.1 | 65.4 | 52.2 | -0.9 | 0.6 | 0.6 | 2.7 | 6.1 | -1.3 | 0.7 | -1.2 | -0.3 | 4.1 |
| DeBERTa-v3 | 100 | 78.5 | 75.5 | 73.4 | 66.3 | 55.8 | -1.1 | 0.0 | 0.6 | 1.7 | 2.3 | -0.8 | -0.1 | -0.1 | 0.4 | 0.6 |
| DeBERTa-v3 | 200 | 78.2 | 75.5 | 74.0 | 67.0 | 57.0 | -0.7 | 0.1 | 0.0 | 0.8 | 1.3 | -0.3 | 0.0 | -0.6 | 0.5 | -0.2 |
| ELECTRA | 50 | 77.5 | 77.1 | 74.5 | 64.1 | 55.5 | -0.6 | 0.6 | 0.2 | 3.9 | 2.4 | -0.8 | 0.1 | -1.3 | 0.4 | 1.4 |
| ELECTRA | 100 | 77.4 | 77.3 | 74.2 | 65.6 | 57.0 | -0.4 | 0.1 | -0.1 | 1.6 | 0.6 | -0.4 | 0.0 | -0.2 | 0.44 | 0.8 |
| ELECTRA | 200 | 76.8 | 77.5 | 74.0 | 66.6 | 57.2 | 0.4 | 0.0 | -0.1 | 0.0 | 0.4 | 0.2 | 0.0 | -0.1 | -0.4 | -0.3 |
Table 2: Comparison of DeBERTa-v3 and ELECTRA rerankers for the three SPLADE-v3 models. DeBERTa-v3 serves as the reference point (absolute nDCG@10) – and the comparisons are given in Δ(nDCG@10).
| Reranker | $\mathrm{top}_k$ | SPLADE-v3-DistilBERT | | | | | SPLADE-v3 | | | | | SPLADE-v3-Doc | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | DL19 | DL20 | DL21 | DL22 | DL23 | DL19 | DL20 | DL21 | DL22 | DL23 | DL19 | DL20 | DL21 | DL22 | DL23 |
| DeBERTa-v3 | 50 | 78.1 | 75.3 | 74.1 | 65.4 | 52.2 | 77.2 | 75.9 | 74.7 | 68.2 | 58.3 | 76.8 | 76.0 | 72.9 | 65.1 | 56.4 |
| DeBERTa-v3 | 100 | 78.5 | 75.5 | 73.4 | 66.3 | 55.8 | 77.4 | 75.5 | 74.0 | 68.0 | 58.2 | 77.7 | 75.4 | 73.2 | 66.8 | 56.3 |
| DeBERTa-v3 | 200 | 78.2 | 75.5 | 74.0 | 67.0 | 57.0 | 77.5 | 75.6 | 74.0 | 67.9 | 58.3 | 77.8 | 75.5 | 73.4 | 67.4 | 56.9 |
| ELECTRA (Δ) | 50 | -0.6 | 1.9 | 0.5 | -1.3 | 3.3 | -0.3 | 1.9 | 0.0 | -0.2 | -0.4 | -0.2 | 1.3 | 0.4 | -0.7 | 0.6 |
| ELECTRA (Δ) | 100 | -1.0 | 1.8 | 0.8 | -0.7 | 1.2 | -0.4 | 1.9 | 0.1 | -0.8 | -0.5 | -0.7 | 2.0 | 0.8 | -0.7 | 1.4 |
| ELECTRA (Δ) | 200 | -1.4 | 2.0 | 0.0 | -0.4 | 0.2 | -0.3 | 1.9 | 0.0 | -1.2 | -0.7 | -0.8 | 1.9 | 0.5 | -1.3 | 0.1 |
Table 3: Impact of the number of documents to re-rank ($\mathrm{top}_k$) on effectiveness. $\mathrm{top}_k = 50$ serves as the reference point (absolute nDCG@10) – and the comparisons are given in Δ(nDCG@10).
| Reranker | $\mathrm{top}_k$ | SPLADE-v3-DistilBERT | | | | | SPLADE-v3 | | | | | SPLADE-v3-Doc | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | DL19 | DL20 | DL21 | DL22 | DL23 | DL19 | DL20 | DL21 | DL22 | DL23 | DL19 | DL20 | DL21 | DL22 | DL23 |
| DeBERTa-v3 | 50 | 78.1 | 75.3 | 74.1 | 65.4 | 52.2 | 77.2 | 75.9 | 74.8 | 68.2 | 58.3 | 76.8 | 76.0 | 72.9 | 65.1 | 56.4 |
| DeBERTa-v3 | 100 (Δ) | 0.3 | 0.3 | -0.7 | 0.9 | 3.5 | 0.2 | -0.3 | -0.7 | -0.2 | -0.2 | 0.9 | -0.6 | 0.3 | 1.6 | -0.1 |
| DeBERTa-v3 | 200 (Δ) | 0.0 | 0.2 | -0.1 | 1.5 | 4.8 | 0.2 | -0.3 | -0.7 | -0.3 | 0.0 | 1.0 | -0.4 | 0.5 | 2.3 | 0.4 |
| ELECTRA | 50 | 77.5 | 77.1 | 74.5 | 64.1 | 55.5 | 76.9 | 77.7 | 74.7 | 68.1 | 57.9 | 76.7 | 77.2 | 73.3 | 64.5 | 56.9 |
| ELECTRA | 100 (Δ) | -0.1 | 0.2 | -0.4 | 1.5 | 1.5 | 0.1 | -0.3 | -0.6 | -0.9 | -0.3 | 0.3 | 0.2 | 0.7 | 1.5 | 0.8 |
| ELECTRA | 200 (Δ) | -0.7 | 0.3 | -0.6 | 2.4 | 1.7 | 0.2 | -0.3 | -0.8 | -1.5 | -0.3 | 0.3 | 0.2 | 0.6 | 1.7 | 0.0 |
We tested several configurations with GPT-4: various $\mathrm{top}_k$ documents ($k \in \{25, 50, 75, 100\}$), and with or without the sliding window mechanism. These various configurations used for GPT-4 tend to show that the sliding window mechanism does not seem necessary: the shortening mechanism (Section 3.3.2) often provides results on par with (or even better than) the sliding window mechanism. This observation may be different with BM25, where more documents ($k$=100) may be needed to get similar results. Therefore, a more effective retriever makes the job of rerankers easier.
To extend the comparison and add more reference points, we added results from some TREC participants: these results are usually very competitive, but obtained by combining a set of various models. We also add the recent results obtained by RankZephyr [26], which fine-tunes a Zephyr model for the listwise reranking task. The $\rho$ version corresponds to the RankZephyr model progressively reranking the input three times using a SPLADE++ ED model [9]. This model in general performs very well, especially for the DL20 dataset, but is on par with the DeBERTa-v3 reranker for many other datasets. Please note that, for these results, the retriever is slightly less effective than SPLADE-v3, so results are not entirely comparable – but are given as another comparison point (the RankZephyr model is not publicly available as of January 14, 2024).
Table 4: Effectiveness of three SPLADE-v3 models on out-of-domain datasets (BEIR and LoTTE). SPLADE-v3-DistilBERT serves as the reference point (absolute mean nDCG@10) – and the comparisons are given in Δ(mean nDCG@10). The baselines correspond to the retrievers only (no re-ranking).
| Reranker | $\mathrm{top}_k$ | SPLADE-v3-DistilBERT | | | SPLADE-v3 (Δ) | | | SPLADE-v3-Doc (Δ) | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | BEIR | BEIR-12 | LoTTE | BEIR | BEIR-12 | LoTTE | BEIR | BEIR-12 | LoTTE |
| baseline | – | 50.0 | 50.1 | 39.9 | 2.2 | 2.1 | 3.5 | -3.0 | -3.1 | -0.1 |
| DeBERTa-v3 | 50 | 54.5 | 56.9 | 53.4 | 0.5 | 0.6 | 1.8 | -0.8 | -0.6 | -0.3 |
| DeBERTa-v3 | 100 | 54.6 | 57.4 | 55.0 | 0.3 | 0.4 | 1.3 | -0.2 | -0.3 | -0.5 |
| DeBERTa-v3 | 200 | 54.9 | 57.5 | 56.1 | 0.1 | 0.4 | 0.7 | -0.3 | 0.0 | -0.4 |
| ELECTRA | 50 | 52.8 | 55.7 | 51.9 | 0.2 | 0.3 | 1.6 | -0.4 | -0.8 | -0.3 |
| ELECTRA | 100 | 52.5 | 55.8 | 53.0 | 0.1 | 0.2 | 1.1 | -0.3 | -0.2 | -0.2 |
| ELECTRA | 200 | 52.5 | 55.8 | 53.9 | -0.3 | 0.3 | 0.7 | -0.3 | -0.1 | -0.4 |
Table 5: Comparison of DeBERTa-v3 and ELECTRA rerankers for the three SPLADE-v3 models on out-of-domain datasets (BEIR and LoTTE). DeBERTa-v3 serves as the reference point (absolute mean nDCG@10) – and the comparisons are given in Δ(mean nDCG@10).
| Reranker | $\mathrm{top}_k$ | SPLADE-v3-DistilBERT | | | SPLADE-v3 | | | SPLADE-v3-Doc | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | BEIR | BEIR-12 | LoTTE | BEIR | BEIR-12 | LoTTE | BEIR | BEIR-12 | LoTTE |
| DeBERTa-v3 | 50 | 54.5 | 56.9 | 53.4 | 55.0 | 57.5 | 55.2 | 53.8 | 56.3 | 53.1 |
| DeBERTa-v3 | 100 | 54.6 | 57.4 | 55.0 | 55.0 | 57.8 | 56.3 | 54.4 | 57.1 | 54.5 |
| DeBERTa-v3 | 200 | 54.9 | 57.5 | 56.1 | 54.9 | 57.9 | 56.8 | 54.6 | 57.5 | 55.8 |
| ELECTRA (Δ) | 50 | -1.7 | -1.2 | -1.5 | -2.0 | -1.5 | -1.7 | -1.3 | -1.4 | -1.5 |
| ELECTRA (Δ) | 100 | -2.1 | -1.6 | -2.0 | -2.3 | -1.7 | -2.2 | -2.1 | -1.4 | -1.7 |
| ELECTRA (Δ) | 200 | -2.4 | -1.7 | -2.2 | -2.7 | -1.7 | -2.2 | -2.4 | -1.8 | -2.2 |
Table 6: Impact of the number of documents to re-rank ($\mathrm{top}_k$) on effectiveness on out-of-domain datasets (BEIR and LoTTE). $\mathrm{top}_k = 50$ serves as the reference point (absolute mean nDCG@10) – and the comparisons are given in Δ(mean nDCG@10).
| Reranker | $\mathrm{top}_k$ | SPLADE-v3-DistilBERT | | | SPLADE-v3 | | | SPLADE-v3-Doc | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | BEIR | BEIR-12 | LoTTE | BEIR | BEIR-12 | LoTTE | BEIR | BEIR-12 | LoTTE |
| DeBERTa-v3 | 50 | 54.48 | 56.95 | 53.36 | 54.97 | 57.52 | 55.20 | 53.72 | 56.33 | 53.09 |
| DeBERTa-v3 | 100 (Δ) | 0.14 | 0.43 | 1.65 | 0.00 | 0.26 | 1.09 | 0.66 | 0.72 | 1.44 |
| DeBERTa-v3 | 200 (Δ) | 0.40 | 0.57 | 2.78 | -0.03 | 0.39 | 1.64 | 0.89 | 1.17 | 2.68 |
| ELECTRA | 50 | 52.79 | 55.72 | 51.89 | 54.94 | 57.91 | 56.84 | 54.61 | 57.50 | 55.77 |
| ELECTRA | 100 (Δ) | -0.26 | 0.08 | 1.11 | -0.32 | 0.06 | 0.64 | -0.12 | 0.66 | 1.20 |
| ELECTRA | 200 (Δ) | -0.29 | 0.12 | 2.05 | -0.76 | 0.17 | 1.13 | -0.20 | 0.79 | 1.96 |
We also test the recent GPT-4 Turbo in Table 8: it has a longer prompt length (100k instead of 8k), but the results are a bit underwhelming. In particular, the performance does not increase with the number of documents to re-rank: it performs better with $k$=25 documents than with $k$=100 documents. Therefore, even if a model can deal with longer prompts, the “naive” strategy of increasing $k$ might be suboptimal.
Table 7: Evaluation of GPT-based models as zero-shot rerankers on top of SPLADE-v3 (strong baseline) – nDCG@10. We also report the best TREC runs for each year as comparison points.
| Reranker | $\mathrm{top}_k$ | W (sliding window) | DL19 | DL20 | DL21 | DL22 | DL23 | TREC-COVID | TREC-NEWS | Touché-2020 | NovelEval |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SPLADE-v3 | – | – | 72.3 | 75.4 | 70.7 | 61.9 | 50.6 | 74.7 | 41.8 | 29.3 | 70.0 |
| Best TREC-DL | – | – | 76.4 | 80.3 | 74.9 | 71.8 | 69.9 | | | | |
| DeBERTa-v3 | 200 | – | 77.6 | 75.6 | 73.4 | 67.5 | 58.3 | 89.2 | 51.9 | 33.3 | 82.8 |
| GPT-3.5 Turbo | 50 | – | 73.6 | 67.1 | 71.0 | 60.7 | 54.1 | 78.6 | 43.4 | 24.1 | 80.2 |
| GPT-4 | 25 | – | 78.5 (0.3) | 77.9 | 75.4 | 70.1 | 60.2 | 86.9 (0.3) | 49.6 | 32.0 | 90.9 |
| GPT-4 | 50 | – | 77.6 | 79.8 | 77.2 | 70.0 | 63.0 | 86.9 (0.5) | 50.2 | 31.6 | 87.7 |
| GPT-4 | 75 | – | 78.8 | 78.2 | 77.3 | 70.2 | 63.1 | 87.5 (0.4) | | | |
| GPT-4 | 50 | 25/10 | | | | | | 81.0 | | 30.8 | |
| GPT-4 | 100 | 20/10 | | | | | | | | | |