BM25S: Orders of magnitude faster lexical search via eager sparse scoring
Abstract
We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a $500\mathrm{x}$ speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
1 Background
Sparse lexical search algorithms, such as the BM25 family (Robertson et al., 1995), remain widely used as they do not need to be trained, can be applied to multiple languages, and are generally faster, especially when using highly efficient Java-based implementations. Those Java implementations, usually based on Lucene, are accessible inside Python via the Pyserini reproducibility toolkit (Lin et al., 2021), and through HTTP by using the Elasticsearch web client. The Lucene-based libraries are known to be faster than existing Python-based implementations, such as Rank-BM25.
This work shows that it is possible to achieve a significant speedup compared to existing Python-based implementations by introducing two improvements: eagerly calculating all possible scores that can be assigned to any future query token when indexing a corpus, and storing those calculations inside sparse matrices to enable faster slicing and summations. The idea of sparse matrices was previously explored in BM25-PT, which precomputes BM25 scores using PyTorch and multiplies them with a bag-of-words encoding of the query via sparse matrix multiplication.
This work expands upon the initial idea proposed by the BM25-PT project by significantly simplifying the implementation and introducing a strategy to generalize to other variants of the original BM25. Unlike BM25-PT, BM25S does not rely on PyTorch, and instead uses Scipy's sparse matrix implementation. Whereas BM25-PT multiplies a bag-of-words query vector with the document matrix, BM25S instead slices the relevant indices and sums across the token dimension, removing the need for matrix multiplication.
At the implementation level, BM25S also introduces a simple but fast Python-based tokenizer that combines Scikit-Learn’s text splitting (Pedregosa et al., 2011), Elastic’s stopword list, and (optionally) a C-based implementation of the Snowball stemmer (Bouchet-Valat, 2014). This achieves better performance compared to the subword tokenizers (Kudo and Richardson, 2018) used by BM25-PT. Finally, it implements top-k retrieval with an average $O(n)$ time complexity when selecting the $K$ most relevant documents from the set of $n$ scores (one per document in the collection).
2 Implementation
The implementation described below follows the study by Kamphuis et al. (2020).
Calculation of BM25 Many variants of BM25 exist, which can lead to significant confusion about the exact scoring method used in a given implementation (Kamphuis et al., 2020). By default, we use the scoring method proposed by Lucene. Thus, for a given query $Q$ (tokenized into $q_{1},\dots,q_{|Q|}$) and document $D$ from collection $C$, we compute the following score:
$$
B(Q,D)=\sum_{i=1}^{|Q|}S(q_{i},D)=\sum_{i=1}^{|Q|}\mathrm{IDF}(q_{i},C)\,\frac{\mathrm{TF}(q_{i},D)}{\mathcal{D}}
$$
where $\mathcal{D}=\mathrm{TF}(q_{i},D)+k_{1}\left(1-b+b\frac{|D|}{L_{avg}}\right)$, $L_{avg}$ is the average length of documents in corpus $C$ (calculated in number of tokens), and $\mathrm{TF}(q_{i},D)$ is the term frequency of token $q_{i}$ within the set of tokens in $D$. The IDF is the inverse document frequency, which is calculated as:
$$
\mathrm{IDF}(q_{i},C)=\ln\left(\frac{|C|-\mathrm{DF}(q_{i},C)+0.5}{\mathrm{DF}(q_{i},C)+0.5}+1\right)
$$
where the document frequency $\mathrm{DF}(q_{i},C)$ is the number of documents in $C$ containing $q_{i}$. Although $B(Q,D)$ depends on the query, which is only given during retrieval, we show below how to reformulate the equation to eagerly calculate the TF and IDF during indexing.
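For intuition, a small worked example of the IDF formula (with made-up numbers, not results from the paper): in a corpus of $|C|=4$ documents where a token appears in only one document, so that $\mathrm{DF}(q_{i},C)=1$,

$$
\mathrm{IDF}(q_{i},C)=\ln\left(\frac{4-1+0.5}{1+0.5}+1\right)=\ln\left(\frac{3.5}{1.5}+1\right)\approx 1.20,
$$

whereas a token appearing in all four documents would receive $\ln\left(\frac{0.5}{4.5}+1\right)\approx 0.11$, reflecting its lower discriminative value.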
Eager index-time scoring Let’s now consider all tokens in a vocabulary $V$ , denoted by $t\in V$ . We can reformulate $S(t,D)$ as:
$$
S(t,D)=\mathrm{TF}(t,D)\cdot\mathrm{IDF}(t,C)\,\frac{1}{\mathcal{D}}
$$
When $t$ is a token that is not present in document $D$ , then $\mathrm{TF}(t,D)=0$ , leading to $S(t,D)=0$ as well. This means that, for most tokens in vocabulary $V$ , we can simply set the relevance score to 0, and only compute values for $t$ that are actually in the document $D$ . This calculation can be done during the indexing process, thus avoiding the need to compute $S(q_{i},D)$ at query time, apart from straightforward summations.
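A minimal sketch of this eager index-time scoring is shown below. It assumes documents have already been tokenized into integer token ids; the function and variable names are illustrative and do not correspond to the actual BM25S API.

```python
import numpy as np
from scipy import sparse

def build_score_matrix(docs, vocab_size, k1=1.5, b=0.75):
    """Eagerly compute S(t, D) for every token t that occurs in each document D.

    `docs` is a list of documents, each given as a list of integer token ids.
    Returns a |V| x |C| matrix in scipy CSC format. This is an illustrative
    sketch, not the actual BM25S implementation.
    """
    n_docs = len(docs)
    doc_lens = np.array([len(d) for d in docs], dtype=np.float64)
    avg_len = doc_lens.mean()

    # Document frequency DF(t, C): number of documents containing token t.
    df = np.zeros(vocab_size, dtype=np.float64)
    for doc in docs:
        for t in set(doc):
            df[t] += 1

    # Lucene-style IDF(t, C).
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

    rows, cols, vals = [], [], []
    for j, doc in enumerate(docs):
        # Term frequencies TF(t, D) for the tokens actually present in D.
        tokens, tf = np.unique(doc, return_counts=True)
        denom = tf + k1 * (1.0 - b + b * doc_lens[j] / avg_len)
        rows.extend(tokens.tolist())
        cols.extend([j] * len(tokens))
        vals.extend((idf[tokens] * tf / denom).tolist())

    # Coordinate (COO) triplets converted to CSC; absent tokens are implicit zeros.
    return sparse.csc_matrix((vals, (rows, cols)), shape=(vocab_size, n_docs))
```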
Assigning Query Scores Given our sparse matrix of shape $|V|\times|C|$, we can use the query tokens to select the relevant rows, leaving us with a matrix of shape $|Q|\times|C|$, which we can then sum across the column dimension, resulting in a single $|C|$-dimensional vector (representing the score of each document for the query).
Efficient Matrix Sparsity We implement the sparse matrix in Compressed Sparse Column (CSC) format (scipy.sparse.csc_matrix), which provides an efficient conversion between the coordinate and CSC formats. Since we slice and sum along the column dimension, this implementation is the optimal choice among sparse matrix implementations. In practice, we replicate the sparse operations directly using Numpy arrays.
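Continuing the sketch above, query scoring then amounts to slicing out the pre-computed rows for the query tokens and summing them, with no matrix multiplication involved. Again, this is an illustrative sketch rather than the actual BM25S code, which replicates these operations with Numpy arrays:

```python
import numpy as np

def score_query(score_matrix, query_token_ids):
    """Sum the pre-computed rows S(q_i, .) into a |C|-dimensional score vector."""
    sliced = score_matrix[query_token_ids]         # shape: |Q| x |C|
    # Sum the |Q| selected rows (one per query token) to score every document.
    return np.asarray(sliced.sum(axis=0)).ravel()  # shape: |C|

# Example usage with the matrix built above and hypothetical query token ids:
# scores = score_query(score_matrix, [3, 17, 42])
```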
Tokenization To split the text, we use the same Regular Expression pattern used by Scikit-Learn (Pedregosa et al., 2011) for their own tokenizers, which is `r"(?u)\b\w\w+\b"`. This pattern conveniently parses words in UTF-8 (allowing coverage of various languages), with `\b` handling word boundaries. Then, if stemming is desired, we can stem all words in the vocabulary, which can be used to look up the stemmed version of each word in the collection. Finally, we build a dictionary mapping each unique (stemmed) word to an integer index, which we use to convert the tokens into their corresponding indices, thus significantly reducing memory usage and allowing them to be used to slice Scipy matrices and Numpy arrays.
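A simplified version of this pipeline might look as follows; the stopword set here is a tiny illustrative subset (not Elastic's full list), and the optional stemming step assumes the PyStemmer package, which provides C bindings to the Snowball stemmer:

```python
import re

# Same token pattern Scikit-Learn uses for its default tokenizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Tiny illustrative subset of an English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tokenize(texts, stemmer=None):
    """Split, filter stopwords, optionally stem, and map tokens to integer ids."""
    vocab = {}          # (stemmed) word -> integer index
    token_ids = []      # one list of ids per input text
    for text in texts:
        words = [w for w in TOKEN_PATTERN.findall(text.lower()) if w not in STOPWORDS]
        if stemmer is not None:
            words = stemmer.stemWords(words)  # e.g. PyStemmer's Snowball stemmer
        ids = [vocab.setdefault(w, len(vocab)) for w in words]
        token_ids.append(ids)
    return token_ids, vocab

# Optional stemming, assuming the PyStemmer package is installed:
# import Stemmer
# token_ids, vocab = tokenize(corpus, stemmer=Stemmer.Stemmer("english"))
```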
Top-k selection Upon computing scores for all documents in a collection, we can complete the search process by selecting the top $k$ most relevant elements. A naive approach would be to sort the score vector and select the last $k$ elements; instead, we partition the array, selecting only the last $k$ documents (unordered). Using an algorithm such as Quickselect (Hoare, 1961), we can accomplish this in an average time complexity of $O(n)$ for $n$ documents in the collection, whereas sorting requires $O(n\log n)$. If the user wishes to receive the top $k$ results in order, sorting the partitioned documents takes an additional $O(k\log k)$, which is a negligible increase in time complexity assuming $k\ll n$. In practice, BM25S allows the use of two implementations: one based on numpy, which leverages np.argpartition, and another based on jax, which relies on XLA's top-k implementation. Numpy's argpartition uses the introspective selection algorithm (Musser, 1997), which modifies the Quickselect algorithm to ensure that the worst-case performance remains $O(n)$. Although this guarantees optimal time complexity, we observe that JAX's implementation achieves better performance in practice.
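The numpy-based selection can be sketched as follows; `np.argpartition` performs the (unordered) partition in average $O(n)$ time, and only the $k$ selected entries are then sorted (function name is illustrative):

```python
import numpy as np

def topk(scores, k):
    """Return the indices and scores of the k highest-scoring documents, in order."""
    # Unordered selection of the k largest entries: average O(n).
    top_idx = np.argpartition(scores, -k)[-k:]
    # Order just those k entries in descending score: O(k log k).
    top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]
    return top_idx, scores[top_idx]
```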
Table 1: To measure throughput, we report the number of queries per second (QPS) that each model can process for each task in the public section of the BEIR leaderboard; instances achieving over 50 QPS are shown in bold. We compare BM25S, BM25-PT (PT), Elasticsearch (ES) and Rank-BM25 (Rank). OOM indicates failure due to out-of-memory issues.
| Dataset | BM25S | ES | PT | Rank |
|---|---|---|---|---|
| ArguAna | **573.91** | 13.67 | **110.51** | 2.00 |
| Climate-FEVER | 13.09 | 4.02 | OOM | 0.03 |
| CQADupstack | **170.91** | 13.38 | OOM | 0.77 |
| DBPedia | 13.44 | 10.68 | OOM | 0.11 |
| FEVER | 20.19 | 7.45 | OOM | 0.06 |
| FiQA | **507.03** | 16.96 | 20.52 | 4.46 |
| HotpotQA | 20.88 | 7.11 | OOM | 0.04 |
| MSMARCO | 12.20 | 11.88 | OOM | 0.07 |
| NFCorpus | **1196.16** | 45.84 | **256.67** | **224.66** |
| NQ | 41.85 | 12.16 | OOM | 0.10 |
| Quora | **183.53** | 21.80 | 6.49 | 1.18 |
| SCIDOCS | **767.05** | 17.93 | 41.34 | 9.01 |
| SciFact | **952.92** | 20.81 | **184.30** | 47.60 |
| TREC-COVID | **85.64** | 7.34 | 3.73 | 1.48 |
| Touche-2020 | **60.59** | 13.53 | OOM | 1.10 |
Multi-threading We implement optional multithreading capabilities through pooled executors to achieve further speed-up during retrieval.
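As a rough sketch of what pooled-executor retrieval can look like (the exact executor, batching, and function names used by BM25S may differ), queries can simply be dispatched to a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_batch(queries, retrieve_fn, n_threads=4):
    """Run a single-query retrieval function over many queries using a thread pool."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(retrieve_fn, queries))
```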
Alternative BM25 implementations Above, we describe how to implement BM25S for one variant of BM25 (namely, Lucene). However, we can easily extend the BM25S method to many variants of BM25; the sparsity can be directly applied to Robertson’s original design (Robertson et al., 1995), ATIRE (Trotman et al., 2014), and Lucene. For other models, a modification of the scoring described above is needed.
2.1 Extending sparsity via non-occurrence adjustments
For BM25L (Lv and Zhai, 2011), BM25+ (Lv and Zhai, 2011) and $\mathrm{TF}_{l\circ\delta\circ p}\times\mathrm{IDF}$ (Rousseau and Vazirgiannis, 2013), we notice that when $\mathrm{TF}(t,D)=0$, the value of $S(t,D)$ will not be zero; we denote this value as a scalar $S^{\theta}(t)$, which represents the score of $t$ when it does not occur in document $D$.
Clearly, constructing a $|V|\times|C|$ dense matrix would use up too much memory. Instead, we can still achieve sparsity by subtracting $S^{\theta}(t)$ from the score of each token $t$ and document $D$ in the score matrix (since most tokens $t$ in the vocabulary will not be present in any given document $D$, their value in the shifted score matrix will be 0). Then, during retrieval, we can simply compute $S^{\theta}(q_{i})$ for each query token $q_{i}\in Q$ and sum these up to get a single scalar that we can add back to the final score (which does not affect the ranking).
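To make the shift concrete, here is a hedged sketch; the `s` and `s0` callables stand in for the chosen variant's scoring function and its non-occurrence score, and none of these names come from BM25S itself:

```python
def shifted_value(s, s0, t, d):
    """Value stored in the sparse matrix at index time for token t and document d.

    Because s(t, d) == s0(t) whenever t does not occur in d, the shifted value
    is exactly 0 for absent tokens, so the matrix stays sparse.
    """
    return s(t, d) - s0(t)

def final_score(sparse_sum, s0, query_token_ids):
    """Recover the exact score at query time.

    `sparse_sum` is the sliced-and-summed value from the shifted sparse matrix;
    adding the per-query constant sum of s0(q_i) restores the unshifted score
    (and, being identical for every document, it does not affect the ranking).
    """
    return sparse_sum + sum(s0(t) for t in query_token_ids)
```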
More formally, for an empty document $\varnothing$, we define $S^{\theta}(t)=S(t,\varnothing)$ as the non-occurrence score for token $t$. Then, the differential score $S^{\Delta}(t,D)$ is defined as: