[论文翻译]BM25S:通过即时稀疏评分实现数量级加速的词汇搜索


原文地址:https://arxiv.org/pdf/2407.03618v1


BM25S: Orders of magnitude faster lexical search via eager sparse scoring

BM25S:通过即时稀疏评分实现数量级加速的词汇搜索

Abstract

摘要

We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy1 and Scipy2. BM25S achieves up to a $500\mathrm{x}$ speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s

我们推出BM25S,这是一种基于Python语言的高效BM25实现,仅依赖Numpy1和Scipy2。通过索引期间主动计算BM25分数并将其存储为稀疏矩阵,BM25S相比最流行的Python框架实现了高达$500\mathrm{x}$的加速。与采用高度优化的Java实现的主流商业产品相比,BM25S也取得了显著的加速效果。此外,通过采用新颖的分数偏移方法将主动评分扩展到非稀疏变体,BM25S完整复现了Kamphuis等人(2020)提出的五种BM25变体实现。代码详见https://github.com/xhluca/bm25s

1 Background

1 背景

Sparse lexical search algorithms, such as the BM25 family (Robertson et al., 1995) remain widely used as they do not need to be trained, can be applied to multiple languages, and are generally faster, especially when using highly efficient Java-based implementations. Those Java implementations, usually based on Lucene3, are accessible inside Python via the Pyserini reproducibility toolkit (Lin et al., 2021), and through HTTP by using the Elasticsearch web client4. The Lucene-based libraries are known to be faster than existing Python-based implementations, such as Rank-BM255.

稀疏词法搜索算法(如BM25系列(Robertson et al., 1995))因其无需训练、可应用于多种语言且通常速度更快(尤其是在使用高效的基于Java语言的实现时)而仍被广泛使用。这些通常基于Lucene3的Java实现可通过Pyserini可复现工具包(Lin et al., 2021)在Python语言中访问,或通过Elasticsearch web client4使用HTTP协议调用。众所周知,基于Lucene的库比现有的基于Python语言的实现(如Rank-BM255)速度更快。

This work shows that it is possible to achieve a significant speedup compared to existing Python-based implementations by introducing two improvements: eagerly calculating all possible scores that can be assigned to any future query token when indexing a corpus, and storing those calculations inside sparse matrices to enable faster slicing and summations. The idea of sparse matrices was previously explored in $\mathrm{BM}25\mathrm{-PT}^{6}$ , which precomputes BM25 scores using PyTorch and multiplies them with a bag-of-words encoding of the query via sparse matrix multiplication.

本工作表明,通过引入两项改进措施,可以显著提速现有基于Python语言的实现方案:在索引语料库时预先计算所有可能分配给未来查询token的分数,并将这些计算结果存储在稀疏矩阵中以实现更快的切片与求和操作。稀疏矩阵的思路曾在$\mathrm{BM}25\mathrm{-PT}^{6}$中被探讨过,该方法使用PyTorch预计算BM25分数,并通过稀疏矩阵乘法将其与查询的词袋编码相乘。

This work expands upon the initial idea proposed by the BM25-PT project by significantly simplifying the implementation and introducing a strategy to generalize to other variants of the original BM25. Unlike BM25-pt, BM25S does not rely on PyTorch, and instead uses Scipy’s sparse matrix implementation. Whereas BM25-PT multiplies bag-of-words with the document matrix, BM25S instead slices relevant indices and sums across the token-dimension, removing the need of matrix multiplications.

本工作基于BM25-PT项目提出的初始构想进行了扩展,通过大幅简化实现方式并引入策略来推广至原始BM25的其他变体。与BM25-pt不同,BM25S不再依赖PyTorch,而是采用Scipy的稀疏矩阵实现。BM25-PT通过词袋与文档矩阵相乘的方式计算,而BM25S改为切片相关索引并在token维度求和,从而省去了矩阵乘法运算。

At the implementation level, BM25S also introduces a simple but fast Python-based tokenizer that combines Scikit-Learn’s text splitting (Pedregosa et al., 2011), Elastic’s stopword list7, and (optionally) integrates a C-based implementation of the Snowball stemmer (Bouchet-Valat, 2014). This achieves a better performance compared to subword tokenizers (Kudo and Richardson, 2018) used by BM25-PT. Finally, it implements top-k retrieval using an average $O(n)$ time complexity when selecting the $K$ most relevant documents from a set of $n$ scores associated with each document.

在实现层面,BM25S还引入了一个简单但快速的基于Python语言的tokenizer,它结合了Scikit-Learn的文本分割 (Pedregosa等人,2011) 、Elastic的停用词列表7,并(可选地)集成了基于C语言的Snowball词干提取器 (Bouchet-Valat,2014) 。与BM25-PT使用的子词tokenizer (Kudo和Richardson,2018) 相比,这实现了更好的性能。最后,当从与每个文档相关联的n个分数集合中选择K个最相关的文档时,它使用平均$O(n)$时间复杂度实现了top-k检索。

2 Implementation

2 实现

The implementation described below follows the study by Kamphuis et al. (2020).

以下实现遵循Kamphuis等人 (2020) 的研究。

Calculation of BM25 Many variants of BM25 exist, which could lead to significant confusion about the exact scoring method used in a given implementation (Kamphuis et al., 2020). By default, we use the scoring method proposed by Lucene. Thus, for a given query $Q$ (tokenized into $q_{1},\dots,q_{|Q|})$ and document $D$ from collection $C$ , we compute the following score8:

BM25的计算
存在许多BM25的变体,这可能导致对特定实现中所用评分方法的严重混淆 (Kamphuis et al., 2020)。默认情况下,我们采用Lucene提出的评分方法。因此,对于给定查询$Q$(被分词为$q_{1},\dots,q_{|Q|})$和来自集合$C$的文档$D$,我们计算以下score8:

$$
\begin{aligned}
B(Q,D) &= \sum_{i=1}^{|Q|} S(q_{i},D) \\
&= \sum_{i=1}^{|Q|} \mathrm{IDF}(q_{i},C)\,\frac{\mathrm{TF}(q_{i},D)}{\mathcal{D}}
\end{aligned}
$$

$$
\begin{aligned}
B(Q,D) &= \sum_{i=1}^{|Q|} S(q_{i},D) \\
&= \sum_{i=1}^{|Q|} \mathrm{IDF}(q_{i},C)\,\frac{\mathrm{TF}(q_{i},D)}{\mathcal{D}}
\end{aligned}
$$

where $\begin{array}{r}{\mathcal{D}=\mathrm{TF}(t,D)+k_{1}\left(1-b+b\frac{|D|}{L_{a v g}}\right),}\end{array}$ $L_{a v g}$ is the average length of documents in corpus $C$ (calculated in number of tokens), $\mathrm{TF}(q_{i},D)$ is the term frequency of token $q_{i}$ within the set of tokens in $D$ . The $I D F$ is the inverse document frequency, which is calculated as:

其中 $\begin{array}{r}{\mathcal{D}=\mathrm{TF}(t,D)+k_{1}\left(1-b+b\frac{|D|}{L_{a v g}}\right),}\end{array}$ $L_{a v g}$ 表示语料库 $C$ 中文档的平均长度 (以token数量计算),$\mathrm{TF}(q_{i},D)$ 是token $q_{i}$ 在文档 $D$ 的token集合中的词频。$I D F$ 为逆文档频率,计算公式为:

$$
\mathrm{IDF}(q_{i},C)=\ln\left(\frac{|C|-\mathrm{DF}(q_{i},C)+0.5}{\mathrm{DF}(q_{i},C)+0.5}+1\right)
$$

$$
\mathrm{IDF}(q_{i},C)=\ln\left(\frac{|C|-\mathrm{DF}(q_{i},C)+0.5}{\mathrm{DF}(q_{i},C)+0.5}+1\right)
$$

Where document frequency $\mathrm{DF}(q_{i},C)$ is the number of documents in $C$ containing $q_{i}$ . Although $B(Q,D)$ depends on the query, which is only given during retrieval, we show below how to reformulate the equation to eagerly calculate the TF and IDF during indexing.

其中文档频率 $\mathrm{DF}(q_{i},C)$ 表示集合 $C$ 中包含 $q_{i}$ 的文档数量。虽然 $B(Q,D)$ 依赖于查询(仅在检索时给定),但我们将在下文展示如何重构该公式以便在索引阶段提前计算词频 (TF) 和逆文档频率 (IDF)。
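As a small numerical sketch of the IDF formula above (the `idf` helper below is ours for illustration, not part of the BM25S API):

```python
import numpy as np

# Minimal sketch of the Lucene-style IDF above (illustrative, not the BM25S API).
# df: per-token document frequencies DF(q_i, C); n_docs: |C|.
def idf(df, n_docs):
    df = np.asarray(df, dtype=np.float64)
    return np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

# A token found in 1 of 4 documents scores higher than one found in all 4;
# the +1 inside the log keeps every IDF value positive.
rare, common = idf([1, 4], 4)
```

Because the IDF of every vocabulary token depends only on the corpus, it can be computed once at indexing time, which is what makes the eager scoring below possible.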

Eager index-time scoring Let’s now consider all tokens in a vocabulary $V$ , denoted by $t\in V$ . We can reformulate $S(t,D)$ as:

即时索引评分
现在考虑词汇表 $V$ 中的所有 token,记为 $t\in V$。我们可以将 $S(t,D)$ 重新表述为:

$$
S(t,D)=\mathrm{TF}(t,D)\cdot\mathrm{IDF}(t,C)\frac{1}{\mathcal{D}}
$$

$$
S(t,D)=\mathrm{TF}(t,D)\cdot\mathrm{IDF}(t,C)\frac{1}{\mathcal{D}}
$$

When $t$ is a token that is not present in document $D$ , then $\mathrm{TF}(t,D)=0$ , leading to $S(t,D)=0$ as well. This means that, for most tokens in vocabulary $V$ , we can simply set the relevance score to 0, and only compute values for $t$ that are actually in the document $D$ . This calculation can be done during the indexing process, thus avoiding the need to compute $S(q_{i},D)$ at query time, apart from straightforward summations.

当 $t$ 是文档 $D$ 中不存在的 token 时,$\mathrm{TF}(t,D)=0$,进而导致 $S(t,D)=0$。这意味着对于词汇表 $V$ 中的大多数 token,我们可以直接将相关性分数设为 0,仅需计算实际存在于文档 $D$ 中的 $t$ 值。该计算可在索引过程中完成,从而避免在查询时计算 $S(q_{i},D)$(除简单求和外)。
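A minimal sketch of this eager index-time computation for the Lucene variant (our own simplified code under the equations above; function and variable names are illustrative, not the BM25S API):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Simplified sketch of eager index-time scoring (Lucene variant); helper
# names are illustrative, not the BM25S API. corpus_ids holds each
# document as a list of integer token ids.
def build_score_matrix(corpus_ids, vocab_size, k1=1.5, b=0.75):
    n_docs = len(corpus_ids)
    doc_lens = np.array([len(d) for d in corpus_ids], dtype=np.float64)
    avg_len = doc_lens.mean()
    df = np.zeros(vocab_size)  # document frequency DF(t, C)
    for doc in corpus_ids:
        for t in set(doc):
            df[t] += 1
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    rows, cols, vals = [], [], []
    for j, doc in enumerate(corpus_ids):
        tf = {}
        for t in doc:  # term frequency TF(t, D)
            tf[t] = tf.get(t, 0) + 1
        len_norm = k1 * (1 - b + b * doc_lens[j] / avg_len)
        for t, f in tf.items():  # only tokens present in D get a score
            rows.append(t)
            cols.append(j)
            vals.append(idf[t] * f / (f + len_norm))
    # |V| x |C| matrix; absent (t, D) pairs are implicit zeros
    return csc_matrix((vals, (rows, cols)), shape=(vocab_size, n_docs))
```

Only the nonzero $(t, D)$ pairs are materialized, so the memory cost scales with the number of token occurrences rather than $|V|\times|C|$.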

Assigning Query Scores Given our sparse matrix of shape $|V|\times|C|$ , we can use the query tokens to select relevant rows, leaving us with a matrix of shape $|Q|\times|C|$ , which we can then sum across the column dimension, resulting in a single $|C|$ -dimensional vector (representing the score of each document for the query).

给定形状为$|V|\times|C|$的稀疏矩阵,我们可以使用查询token选择相关行,得到一个形状为$|Q|\times|C|$的矩阵,然后沿列维度求和,最终生成一个$|C|$维向量(表示每篇文档相对于该查询的得分)。
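This slicing-and-summing step can be sketched as follows (illustrative code; `score_matrix` stands in for the eagerly computed $|V|\times|C|$ sparse matrix):

```python
import numpy as np
from scipy.sparse import csc_matrix

def query_scores(score_matrix, query_ids):
    # select the |Q| rows for the query tokens, then sum over the token axis,
    # yielding one score per document (a |C|-dimensional vector)
    return np.asarray(score_matrix[query_ids].sum(axis=0)).ravel()

# toy |V| = 3, |C| = 2 score matrix
S = csc_matrix(np.array([[0.5, 0.0],
                         [0.2, 0.3],
                         [0.0, 0.7]]))
scores = query_scores(S, [0, 2])  # query consists of tokens 0 and 2
```

Note that no matrix multiplication is involved, only row selection and a column-wise sum.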

Efficient Matrix Sparsity We implement a sparse matrix in Compressed Sparse Column (CSC) format (scipy.sparse.csc_matrix)9, which provides an efficient conversion between the coordinate and CSC formats. Since we slice and sum along the column dimension, this implementation is the optimal choice among sparse matrix implementations. In practice, we replicate the sparse operations directly using Numpy arrays.

高效矩阵稀疏化
我们采用压缩稀疏列 (CSC) 格式 (scipy.sparse.csc_matrix)9 实现稀疏矩阵,该格式提供了坐标格式和CSC格式之间的高效转换。由于我们沿列维度进行切片和求和,该实现在各种稀疏矩阵实现中是最优选择。实际应用中,我们直接使用Numpy数组复现稀疏操作。

Tokenization To split the text, we use the same regular expression pattern used by Scikit-Learn (Pedregosa et al., 2011) for their own tokenizers, which is r"(?u)\b\w\w+\b". This pattern conveniently parses words in UTF-8 (allowing coverage of various languages), with \b handling word boundaries. Then, if stemming is desired, we can stem all words in the vocabulary, which can be used to look up the stemmed version of each word in the collection. Finally, we build a dictionary mapping each unique (stemmed) word to an integer index, which we use to convert the tokens into their corresponding indices, thus significantly reducing memory usage and allowing them to be used to slice Scipy matrices and Numpy arrays.

Token化
为分割文本,我们采用了与Scikit-Learn (Pedregosa等人, 2011) 分词器相同的正则表达式模式:r"(?u)\b\w\w+\b"。该模式能高效解析UTF-8编码的单词(支持多语言处理),其中\b用于处理单词边界。若需词干提取,可对词汇表中所有单词进行词干化处理,并通过该词汇表查询集合中每个单词的词干形式。最后,我们构建一个将每个唯一(词干化)单词映射为整数索引的字典,用于将token转换为对应索引,从而显著降低内存占用,并支持对Scipy矩阵和Numpy数组的切片操作。
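A minimal sketch of this pipeline (stemming omitted for brevity; helper names are ours, not the BM25S API):

```python
import re

# Same splitting pattern as Scikit-Learn's default tokenizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text, stopwords=frozenset()):
    # lowercase, split on word boundaries, drop stopwords
    return [w for w in TOKEN_PATTERN.findall(text.lower()) if w not in stopwords]

def to_ids(docs_tokens):
    # map each unique word to an integer index to save memory;
    # the resulting ids can directly slice Scipy matrices / Numpy arrays
    vocab = {}
    ids = []
    for toks in docs_tokens:
        ids.append([vocab.setdefault(t, len(vocab)) for t in toks])
    return ids, vocab

docs = [tokenize("The quick brown fox", {"the"}),
        tokenize("Quick dogs", {"the"})]
ids, vocab = to_ids(docs)
```

Note that the `\w\w+` pattern drops single-character tokens, matching Scikit-Learn's behavior.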

Top-k selection Upon computing scores for all documents in a collection, we can complete the search process by selecting the top $k$ most relevant elements. A naive approach would be to sort the score vector and select the last $k$ elements; instead, we take a partition of the array, selecting only the last $k$ documents (unordered). Using an algorithm such as Quickselect (Hoare, 1961), we can accomplish this in an average time complexity of $O(n)$ for $n$ documents in the collection, whereas sorting requires $O(n\log n)$ . If the user wishes to receive the top $k$ results in order, sorting the partitioned documents takes an additional $O(k\log k)$ , which is a negligible increase in time complexity assuming $k\ll n$ . In practice, BM25S allows the use of two implementations: one based on numpy, which leverages np.argpartition, and another in jax, which relies on XLA's top-$k$ implementation. Numpy's argpartition uses10 the introspective selection algorithm (Musser, 1997), which modifies the quickselect algorithm to ensure that the worst-case performance remains in $O(n)$ . Although this guarantees optimal time complexity, we observe that JAX's implementation achieves better performance in practice.

Top-k选择
在计算完集合中所有文档的得分后,我们可以通过选择前$k$个最相关的元素来完成搜索过程。一种简单的方法是对得分向量进行排序并选择最后$k$个元素;但我们采用数组分区的方式,仅选择最后$k$个文档(无序)。使用类似快速选择 (Quick select) (Hoare, 1961) 的算法,我们可以在平均时间复杂度$O(n)$内完成这一操作(集合包含$n$个文档),而排序则需要$O(n\log n)$。如果用户希望按顺序获取前$k$个结果,对分区后的文档进行排序将额外消耗$O(k\log k)$的时间,在假设$k\ll n$的情况下,这一时间复杂度的增加可以忽略不计。

实践中,BM25S支持两种实现方式:一种基于numpy(利用np.argpartition),另一种基于jax(依赖XLA的top-$k$实现)。Numpy的argpartition使用了自省选择算法 (introspective selection algorithm) (Musser, 1997),该算法改进了快速选择算法以确保最坏情况下的性能仍为$O(n)$。尽管这保证了最优时间复杂度,但我们观察到JAX的实现在实际应用中性能更优。
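A minimal numpy sketch of this selection step (the `topk` helper is ours for illustration, not the BM25S API):

```python
import numpy as np

# O(n) average top-k selection via np.argpartition (introselect under the hood).
def topk(scores, k, sort=True):
    idx = np.argpartition(scores, -k)[-k:]  # k best indices, unordered
    if sort:  # optional O(k log k) ordering of just the k winners
        idx = idx[np.argsort(scores[idx])[::-1]]
    return idx, scores[idx]

indices, values = topk(np.array([0.1, 0.9, 0.4, 0.7]), k=2)
```

The partition alone suffices when the caller does not need the results ranked, which is where the $O(n)$ versus $O(n\log n)$ saving comes from.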

Table 1: To calculate the throughput, we calculate the number of queries per second (QPS) that each model can process for each task in the public section of the BEIR leaderboard; instances achieving over $50\mathrm{QPS}$ are shown in bold. We compare BM25S, BM25-PT (PT), Elasticsearch (ES) and Rank-BM25 (Rank). OOM indicates failure due to out-of-memory issues.

表 1: 为计算吞吐量,我们统计了每个模型在BEIR排行榜公开部分各任务上每秒处理的查询数(QPS),超过$50\mathrm{QPS}$的实例以粗体显示。对比BM25S、BM25-PT(PT)、Elasticsearch(ES)和Rank-BM25(Rank),OOM表示内存不足导致的失败。

Dataset BM25S ES PT Rank
ArguAna 573.91 13.67 110.51 2.00
Climate-FEVER 13.09 4.02 OOM 0.03
CQADupstack 170.91 13.38 OOM 0.77
DBPedia 13.44 10.68 OOM 0.11
FEVER 20.19 7.45 OOM 0.06
FiQA 507.03 16.96 20.52 4.46
HotpotQA 20.88 7.11 OOM 0.04
MSMARCO 12.20 11.88 OOM 0.07
NFCorpus 1196.16 45.84 256.67 224.66
NQ 41.85 12.16 OOM 0.10
Quora 183.53 21.80 6.49 1.18
SCIDOCS 767.05 17.93 41.34 9.01
SciFact 952.92 20.81 184.30 47.60
TREC-COVID 85.64 7.34 3.73 1.48
Touche-2020 60.59 13.53 OOM 1.10

Multi-threading We implement optional multithreading capabilities through pooled executors11 to achieve further speed-up during retrieval.

多线程
我们通过线程池执行器11实现可选的多线程功能,以在检索过程中进一步提升速度。
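A minimal sketch of such a pooled executor using Python's standard `concurrent.futures` (the `retrieve` argument is a placeholder for any single-query search function, not the BM25S API):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: process a batch of queries with a thread pool.
def batch_retrieve(retrieve, queries, n_threads=4):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map preserves input order, so results[i] answers queries[i]
        return list(pool.map(retrieve, queries))

results = batch_retrieve(lambda q: q.upper(), ["apple", "banana"])
```

Since the scoring work is dominated by Numpy/Scipy operations that release the GIL, threads (rather than processes) can already yield a useful speed-up here.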

Alternative BM25 implementations Above, we describe how to implement BM25S for one variant of BM25 (namely, Lucene). However, we can easily extend the BM25S method to many variants of BM25; the sparsity can be directly applied to Robertson’s original design (Robertson et al., 1995), ATIRE (Trotman et al., 2014), and Lucene. For other models, a modification of the scoring described above is needed.

替代 BM25 实现方案
上文我们描述了如何为 BM25 的一种变体 (即 Lucene) 实现 BM25S。但我们可以轻松将 BM25S 方法扩展到 BM25 的多种变体:稀疏性可直接应用于 Robertson 的原始设计 (Robertson et al., 1995)、ATIRE (Trotman et al., 2014) 和 Lucene。对于其他模型,则需要修改上述评分方式。

2.1 Extending sparsity via non-occurrence adjustments

2.1 通过非出现调整扩展稀疏性

For BM25L (Lv and Zhai, 2011), $\mathrm{BM}25+$ (Lv and Zhai, 2011) and $\mathrm{TF}_{l\circ\delta\circ p}\times\mathrm{IDF}$ (Rousseau and Vazirgiannis, 2013), we notice that when $T F(t,D)=0$ , the value of $S(t,D)$ will not be zero; we denote this value as a scalar12 $S^{\theta}(t)$ , which represents the score of $t$ when it does not occur in document $D$ .

对于BM25L (Lv and Zhai, 2011)、BM25+ (Lv and Zhai, 2011)以及TFₗ∘δ∘p×IDF (Rousseau and Vazirgiannis, 2013),我们注意到当TF(t,D)=0时,S(t,D)的值不会为零;我们将该值记作标量Sθ(t),表示词项t未在文档D中出现时的得分。

Clearly, constructing a $|V|\times|C|$ dense matrix would use up too much memory13. Instead, we can still achieve sparsity by subtracting $S^{\theta}(t)$ from each token $t$ and document $D$ in the score matrix (since most tokens $t$ in the vocabulary will not be present in any given document $D$ , their value in the score matrix will be 0). Then, during retrieval, we can simply compute $S^{\theta}(q_{i})$ for each query $q_{i}\in Q$ , and sum it up to get a single scalar that we can add to the final score (which would not affect the rank).

显然,构建一个 $|V|\times|C|$ 的稠密矩阵会消耗过多内存13。相反,我们仍可通过从得分矩阵中每个 token $t$ 与文档 $D$ 对应的项减去 $S^{\theta}(t)$ 来实现稀疏性(因为词汇表中大多数 token 不会出现在任意给定文档 $D$ 中,它们在得分矩阵中的值将为 0)。检索时,只需为每个查询 token $q_{i}\in Q$ 计算 $S^{\theta}(q_{i})$ 并求和,得到一个可加至最终得分的标量(这不会影响排序结果)。

More formally, for an empty document $\varnothing$ , we define $S^{\theta}(t)=S(t,\varnothing)$ as the non-occurrence score for token $t$ . Then, the differential score $S^{\Delta}(t,D)$ is defined as:

更正式地说,对于一个空文档$\varnothing$,我们将$S^{\theta}(t)=S(t,\emptyset)$定义为token $t$的非出现分数。然后,差分分数$S^{\Delta}(t,D)$定义为:

$$
S^{\Delta}(t,D)=S(t,D)-S^{\theta}(t)
$$

$$
S^{\Delta}(t,D)=S(t,D)-S^{\theta}(t)
$$

Then, we reformulate the BM25 $(B)$ score as:

然后,我们将BM25 $(B)$ 分数重新表述为:

$$
\begin{aligned}
B(Q,D) &= \sum_{i=1}^{|Q|} S(q_{i},D) \\
&= \sum_{i=1}^{|Q|}\left(S(q_{i},D)-S^{\theta}(q_{i})+S^{\theta}(q_{i})\right) \\
&= \sum_{i=1}^{|Q|}\left(S^{\Delta}(q_{i},D)+S^{\theta}(q_{i})\right) \\
&= \sum_{i=1}^{|Q|}S^{\Delta}(q_{i},D)+\sum_{i=1}^{|Q|}S^{\theta}(q_{i})
\end{aligned}
$$

$$
\begin{aligned}
B(Q,D) &= \sum_{i=1}^{|Q|} S(q_{i},D) \\
&= \sum_{i=1}^{|Q|}\left(S(q_{i},D)-S^{\theta}(q_{i})+S^{\theta}(q_{i})\right) \\
&= \sum_{i=1}^{|Q|}\left(S^{\Delta}(q_{i},D)+S^{\theta}(q_{i})\right) \\
&= \sum_{i=1}^{|Q|}S^{\Delta}(q_{i},D)+\sum_{i=1}^{|Q|}S^{\theta}(q_{i})
\end{aligned}
$$

where $\sum_{i=1}^{|Q|}S^{\Delta}(q_{i},D)$ can be efficiently computed using the differential sparse score matrix (the same way as ATIRE, Lucene and Robertson) in scipy. Also, $\sum_{i=1}^{|Q|}S^{\theta}(q_{i})$ only needs to be computed once for the query $Q$ , and can subsequently be applied to every retrieved document to obtain the exact scores.

其中 $\begin{array}{r}{\sum_{i=1}^{|Q|}S^{\Delta}(q_{i},D)}\end{array}$ 可通过 scipy 中的差分稀疏评分矩阵高效计算(方法与 ATIRE、Lucene 和 Robertson 相同)。此外,$\textstyle\sum_{i=1}^{|Q|}S^{\theta}(q_{i})$ 只需为查询 $Q$ 计算一次,后续可应用于每个检索文档以获取精确分数。
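As a small sanity check of this identity (dense NumPy arrays stand in for the sparse differential matrix; all names are ours for illustration):

```python
import numpy as np

# Sketch of the score-shifting identity above: with differential scores
# S_delta(t, D) = S(t, D) - S_theta(t), the exact score is recovered by
# adding back a single per-query scalar.
def shifted_scores(S_delta, s0, query_ids):
    per_doc = S_delta[query_ids].sum(axis=0)  # sum_i S_delta(q_i, D)
    shift = s0[query_ids].sum()               # sum_i S_theta(q_i), one scalar
    return per_doc + shift

rng = np.random.default_rng(0)
S = rng.random((5, 3))        # full scores S(t, D) for |V| = 5, |C| = 3
s0 = rng.random(5)            # non-occurrence scores S_theta(t)
S_delta = S - s0[:, None]     # differential scores (sparse in practice)
q = np.array([1, 4])          # a query of two token ids
exact = S[q].sum(axis=0)      # B(Q, D) computed directly
```

Because the scalar shift is identical for every document, it leaves the ranking unchanged even if it is omitted; adding it back merely restores the exact scores.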

3 Benchmarks

3 基准测试

Throughput For benchmarking, we use the publicly available datasets from the BEIR benchmark (Thakur et al., 2021). Results in Table 1 show that BM25S is substantially faster than Rank-BM25, as it achieves over $100\mathrm{x}$ higher throughput on 10 out of the 14 datasets; in one instance, it achieves a $500\mathrm{x}$ speedup. Further details can be found in Appendix A.

吞吐量
在基准测试中,我们使用了BEIR基准测试(Thakur et al., 2021)公开可用的数据集。表1结果显示,BM25S明显快于Rank-BM25,在14个数据集中的10个上实现了超过$100\mathrm{x}$的吞吐量提升;在一个实例中,甚至达到了$500\mathrm{x}$的加速比。更多细节详见附录A。

Impact of Tokenization We further examine the impact of tokenization on each model in Table 2 by comparing BM25S Lucene with $k_{1}=1.5$ and $b=0.75$ (1) without stemming, (2) without stopwords, (3) with neither, and (4) with both. On average, adding a stemmer improves the score, whereas the stopwords have minimal impact. However, in individual cases, the stopwords can have a bigger impact, such as in the case of Trec-COVID (TC) and ArguAna (AG).

Token化影响分析
我们进一步通过比较BM25S Lucene在$k_{1}=1.5$和$b=0.75$参数下的四种配置:(1) 无词干提取, (2) 无停用词, (3) 两者皆无, (4) 两者兼具,来检验表2中各模型的token化影响。平均而言,添加词干提取器(Stemmer)能提升评分,而停用词影响较小。但在特定案例中,停用词会产生较大影响,例如Trec-COVID(TC)和ArguAna(AG)数据集。

Table 2: NDCG $@10$ results of different tokenization schemes (including and excluding stopwords and the Snowball stemmer) on all BEIR datasets (Appendix A provides a list of datasets). We notice that including both stopwords and stemming modestly improves the performance of the BM25 algorithm.

表 2: 不同 Token 化方案 (包括和不包括停用词及 Snowball 词干提取器) 在所有 BEIR 数据集上的 NDCG $@10$ 结果 (数据集列表见附录 A)。我们注意到同时包含停用词和词干提取能略微提升 BM25 算法的性能。

Stop Stem Avg. AG CD CF DB FQ FV HP MS NF NQ QR SD SF TC WT
Eng. None 38.4 48.3 29.4 13.1 27.0 23.3 48.2 56.3 21.2 30.6 27.3 74.8 15.4 66.2 59.5 35.8
Eng. Snow. 39.7 49.3 29.9 13.6 29.9 25.1 48.1 56.9 21.9 32.1 28.5 80.4 15.8 68.7 62.3 33.1
None None 38.3 46.8 29.6 13.6 26.6 23.2 48.8 56.9 21.1 30.6 27.8 74.2 15.2 66.1 58.3 35.9
None Snow. 39.6 47.7 30.2 13.9 29.5 25.1 48.7 57.5 21.7 32.0 29.1 79.7 15.6 68.5 61.6 33.4
k1 b Variant Avg. AG CD CF DB FQ FV HP MS NF NQ QR SD SF TC WT
1.5 0.75 BM25PT 44.9 22.5 31.9 75.1 14.7 67.8 58.0
1.5 0.75 PSRN 40.0* 48.4 14.2 30.0 25.3 50.0 57.6 22.1 32.6 28.6 80.6 15.6 68.8 63.4 33.5
1.5 0.75 R-BM25 39.6 49.5 29.6 13.6 29.9 25.3 49.3 58.1 21.1 32.1 28.5 80.3 15.8 68.5 60.1 32.9
1.5 0.75 Elastic 42.0 47.7 29.8 17.8 31.1 25.3 62.0 58.6 22.1 34.4 31.6 80.6 16.3 69.0 68.0 35.4
1.5 0.75 Lucene 39.7 49.3 29.9 13.6 29.9 25.1 48.1 56.9 21.9 32.1 28.5 80.4 15.8 68.7 62.3 33.1
0.9 0.4 Lucene 41.1 40.8 28.2 16.2 31.9 23.8 63.8 62.9 22.8 31.8 30.5 78.7 15.0 67.6 58.9 44.2
1.2 0.75 Lucene 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.0 61.0 33.2
1.2 0.75 ATIRE 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.1 61.0 33.2
1.2 0.75 BM25+ 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.1 61.0 33.2
1.2 0.75 BM25L 39.5 49.6 29.8 13.5 29.4 25.0 46.6 55.9 21.4 32.2 28.1 80.3 15.8 68.7 62.9 33.0
1.2 0.75 Robertson 39.9 49.2 29.9 13.7 30.3 25.4 50.3 58.5 22.6 31.9 29.2 80.4 15.5 68.3 59.0 33.8

Table 3: Comparison of different variants and parameters on all BEIR datasets (Appendix A provides a list of datasets). Following the recommended range of $k_{1}\in[1.2,2]$ by Schütze et al. (2008), we try both $k_{1}=1.5$ and $k_{1}=1.2$ with $b=0.75$ . Additionally, we use $k_{1}=0.9$ and $b=0.4$ following the parameters recommended in BEIR. We additionally benchmark five of the BM25 variants described in Kamphuis et al. (2020). *Note that Pyserini's average results are estimated, as the experiments for CQADupstack (CD) did not terminate due to OOM errors.

表 3: 不同变体及参数在全部BEIR数据集上的对比(附录A列出了数据集列表)。根据Schütze等人(2008)推荐的$k_{1}\in[1.2,2]$范围,我们同时测试了$k_{1}=1.5$和$k_{1}=1.2$(固定$b=0.75$)的情况。此外,按照BEIR推荐的参数,我们还测试了$k_{1}=0.9$和$b=0.4$的组合。我们额外评估了Kamphuis等人(2020)描述的5种BM25变体。*注:由于CQADupStack(CD)实验因OOM错误未能完成,Pyserini的平均结果是估算值。

Comparing model variants In Table 3, we compare many implementation variants, including commercial offerings (Elasticsearch) and reproducibility toolkits (Pyserini). We notice that most implementations achieve an average score between 39.7 and 40, with the exception of Elastic, which achieves a marginally higher score. The variance can be attributed to differences in the tokenization scheme; notably, the subword tokenizer used in BM25-PT likely leads to the difference in its results, considering the implementation is a hybrid between ATIRE and Lucene, both of which achieve better results with a word-level tokenizer. Moreover, although Elasticsearch is built on top of Lucene, it remains an independent commercial product, and the documentation14 does not clearly describe how the text is split15, nor whether additional processing is incorporated beyond access to a Snowball stemmer and the removal of stopwords.

比较模型变体
在表3中,我们比较了多种实现变体,包括商业产品(Elasticsearch)和复现工具包(Pyserini)。我们发现大多数实现的平均得分在39.7到40之间,只有Elasticsearch略高。这种差异可能源于分词方案的不同:值得注意的是,BM25-PT使用的子词分词器(subword tokenizer)可能导致结果差异,因为该实现是ATIRE和Lucene的混合体,而两者使用词级分词器(word-level tokenizer)时表现更好。此外,尽管Elasticsearch基于Lucene构建,但它仍是独立的商业产品,其文档[14]并未明确说明文本分割方式[15],也未提及除Snowball词干提取器和停用词移除外是否包含其他处理步骤。

4 Conclusion

4 结论

We provide a novel method for calculating BM25 scores, BM25S, which also offers fast tokenization out-of-the-box and efficient top-$k$ selection during querying, minimizes dependencies, and is usable directly inside Python. As a result, BM25S naturally complements previous implementations: BM25-pt can be used with PyTorch, Rank-BM25 allows changing the parameter $k_1$ during inference, and Pyserini provides a large collection of both sparse and dense retrieval algorithms, making it the best framework for reproducible retrieval research. On the other hand, BM25S remains focused on sparse and mathematically accurate implementations of BM25 that leverage the eager sparse scoring method, with optional Python dependencies like PyStemmer for stemming and Jax for top-$k$ selection. By minimizing dependencies, BM25S becomes a good choice in scenarios where storage might be limited (e.g. for edge deployments) and can be used in the browser via WebAssembly frameworks like Pyodide16 and Pyscript17. We believe our fast and accurate implementation will make lexical search more accessible to a broader audience.

我们提出了一种计算BM25分数的新方法BM25S,该方法还提供开箱即用的快速分词(Tokenization)功能,并在查询时实现高效的top-$k$筛选,同时最小化依赖项使其可直接在Python语言中使用。因此,BM25S自然成为现有实现的补充:BM25-pt可与PyTorch配合使用,Rank-BM25允许在推理时调整参数$k_1$,而Pyserini则提供了包含稀疏与稠密检索算法的大型集合,使其成为可复现检索研究的最佳框架。相比之下,BM25S仍专注于利用即时稀疏评分方法实现数学精确的BM25稀疏检索,并支持PyStemmer(词干提取)和Jax(top-$k$筛选)等可选Python依赖项。通过最小化依赖,BM25S成为存储受限场景(如边缘部署)的理想选择,并能通过Pyodide16和Pyscript17等WebAssembly框架在浏览器中运行。我们相信这一快速精确的实现将使词汇检索技术更广泛地惠及用户群体。

Limitations

局限性

A customized Python-based tokenizer (also known as an analyzer) was created for BM25S, which allows the use of stemmers and stopwords. Because it prioritizes a readable, extensible and fast implementation, it may not achieve the highest possible performance. When reporting benchmark results in research papers, it is worth considering different lexical search implementations in addition to BM25S.

为BM25S定制了一个基于Python语言的tokenizer(也称为analyzer),支持使用词干提取器和停用词。该实现注重可读性、可扩展性和速度,但可能无法达到最高性能。在研究论文中报告基准测试结果时,除了BM25S外,还值得考虑其他词法搜索实现方案。

Additionally, in order to ensure reproducibility and accessibility, our experiments are all performed on free and readily available hardware (Appendix A). As a result, experiments that are less memory efficient terminated with OOM errors.

此外,为了确保可复现性和可访问性,我们的实验均在免费且易于获取的硬件上完成 (附录 A)。因此,内存效率较低的实验因 OOM (内存不足) 错误而终止。
