[Paper Translation] Searching in One Billion Vectors: Re-rank with Source Coding


Original article: https://arxiv.org/pdf/1102.3828


SEARCHING IN ONE BILLION VECTORS: RE-RANK WITH SOURCE CODING

Hervé Jégou INRIA Rennes

Romain Tavenard University Rennes I

Matthijs Douze INRIA Grenoble

Laurent Amsaleg CNRS, IRISA

ABSTRACT

Recent indexing techniques inspired by source coding have been shown to successfully index billions of high-dimensional vectors in memory. In this paper, we propose an approach that re-ranks the neighbor hypotheses obtained by these compressed-domain indexing methods. In contrast to the usual post-verification scheme, which performs exact distance calculations on the short-list of hypotheses, the estimated distances are refined based on short quantization codes, avoiding the need to read the full vectors from disk.

We have released a new public dataset of one billion 128- dimensional vectors and proposed an experimental setup to evaluate high dimensional indexing algorithms on a realistic scale. Experiments show that our method accurately and efficiently re-ranks the neighbor hypotheses using little memory compared to the full vectors representation.

Index Terms— nearest neighbor search, quantization, source coding, high dimensional indexing, large databases

1. INTRODUCTION

Approximate nearest neighbor (ANN) search methods [3, 10, 14, 15] are required to handle large databases, especially for computer vision [12] and music retrieval [2] applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. However, most of these approaches are memory consuming, since several hash tables or trees are required. The methods of [4, 15], which embed the vectors into a binary space, better satisfy the memory constraint. They are, however, significantly outperformed in terms of the trade-off between memory usage and accuracy by recent methods that cast high dimensional indexing as a source coding problem [11, 5, 6]; in particular, the product quantization-based method of [5] exhibits impressive results for large scale image search [6].

State-of-the-art approaches usually perform a re-ranking stage to produce a ranked list of nearest neighbors. This is always done in partitioning-based methods such as LSH [3] or FLANN [10], as the neighbor hypotheses are not ranked on output of the index. But as shown in [5], this post-verification is also important for methods based on binary [4, 15] or quantized codes [11, 5, 6], as the ranking provided on output of the large scale search is significantly improved when verifying the few first hypotheses. In all these approaches, re-ranking is performed by computing the exact Euclidean distance between each query and the returned hypotheses. For large datasets, this raises a memory issue: the vectors cannot be stored in central memory and must be read from disk on-the-fly. Moreover, the vectors are to a large extent accessed randomly, which in practice limits the number of hypotheses that can be verified.

In this paper, we propose a new post-verification scheme in the context of compression-based indexing methods. We focus on the method presented in [5], which offers state-of-the-art performance, outperforming FLANN, which was itself previously shown to outperform LSH [10]. It also provides an explicit approximation of the indexed vectors. Our algorithm exploits the approximation resulting from the first ranking, and refines it using codes stored in RAM. There is an analogy between this approach and the scalable compression techniques proposed in the last decade [13], where the term “scalable” means that a reconstruction of the compressed signal is refined in an incremental manner by successive description layers.

In order to evaluate our approach, we introduce a dataset of one billion vectors extracted from millions of images using the standard SIFT descriptor [8]. Testing on a large scale is important, as most ANN methods are usually evaluated on sets of unrealistic size, thereby ignoring memory issues that arise in real applications, where billions of vectors have to be handled [4, 9]. The ground truth nearest-neighbors have been computed for 10000 queries using exact distance computations. To our knowledge, this set is the largest ever released to evaluate ANN algorithms against an exact linear scan on real data: the largest other experiment we are aware of is the one reported in [7], where a private set of 179 million vectors was considered. [1] reports an experiment on 1 billion vectors, but on synthetic data with a known model exploited by the algorithm.

Experiments performed on this dataset show that the proposed approach offers an alternative to the standard post-verification scheme. The precision of the search is significantly improved by the re-ranking step, leading to state-of-the-art performance on this scale, without accessing the disk.

2. CONTEXT: COMPRESSION BASED INDEXING

In this section, we briefly review the indexing method of [5], which finds the approximate $k$ nearest neighbors using a source coding approach. For the sake of presentation, we describe only the Asymmetric Distance Computation (ADC) method proposed in [5]. We also assume that we search for the nearest neighbor (i.e., $k=1$).

Let $x\in\mathbb{R}^{d}$ be a query vector and $\mathcal{V}=\{y_{1},\dots,y_{n}\}$ a set of vectors in which we want to find the nearest neighbor $\operatorname{NN}(x)$ of $x$. The ADC approach consists in encoding each vector $y_{i}$ by a quantized version $c_{i}=\mathrm{q}_ {\mathrm{c}}(y_{i})\in\mathbb{R}^{d}$. For a quantizer $\mathrm{q}_ {\mathrm{c}}(.)$ with $K$ centroids, the vector is encoded by $b_{\mathrm{c}}=\log_{2}(K)$ bits, assuming $K$ is a power of 2. An approximate distance $d_{\mathrm{c}}(x,y_{i})$ between a query $x$ and a database vector is computed as

$$
d_{\mathrm{c}}(x,y_{i})^{2}=|x-\operatorname{q}_ {\mathrm{c}}(y_{i})|^{2}.
$$

The approximate nearest neighbor $\mathrm{NN}_{a}(x)$ of $x$ is obtained by minimizing this distance estimator:

$$
\mathbf{NN}_ {a}(x)=\arg\operatorname*{min}_ {i}d_{\mathrm{c}}(x,y_{i})^{2}=\arg\operatorname*{min}_ {i}|x-\operatorname{q}_ {\mathrm{c}}(y_{i})|^{2},
$$

which is an approximation of the exact distance calculation

$$
\mathbf{NN}(x)=\arg\operatorname*{min}_ {i}\left|x-y_{i}\right|^{2}.
$$

Note that, in contrast with the binary embedding method of [15], the query $x$ is not converted to a code: there is no approximation error on the query side.

To get a good vector approximation, $K$ should be large ($K=2^{64}$ for a 64-bit code). For such large values of $K$, learning a $K$-means codebook is not tractable, neither is the assignment of the vectors to their nearest centroids. To address this issue, [5] uses a product quantizer, for which there is no need to explicitly enumerate the centroids. A vector $y\in\mathbb{R}^{d}$ is first split into $m$ subvectors $y^{1},...,y^{m}\in\mathbb{R}^{d/m}$. A product quantizer is then defined as a function

$$
\begin{array}{r}{\mathrm{q}_{\mathrm{c}}(y)=\bigl(\mathrm{q}^{1}(y^{1}),...,\mathrm{q}^{m}(y^{m})\bigr),}\end{array}
$$

which maps the input vector $y$ to a tuple of indices by separately quantizing the subvectors. Each individual quantizer $\mathrm{q}^{j}(.)$ has $K_{\mathrm{s}}$ reproduction values, learned by $K$-means. To limit the assignment complexity, $\mathcal{O}(m\times K_{\mathrm{s}})$, $K_{\mathrm{s}}$ is set to a small value (e.g. $K_{\mathrm{s}}=256$). However, the set of $K$ centroids induced by the product quantizer $\mathrm{q}_ {\mathrm{c}}(.)$ is large, as $K=(K_{\mathrm{s}})^{m}$. The squared distances in Equation 2 are computed using the decomposition

$$
d_{\mathrm{c}}(\boldsymbol{x},\boldsymbol{y})^{2}=|\boldsymbol{x}-\boldsymbol{\mathrm{q}}_ {\mathrm{c}}(\boldsymbol{y})|^{2}=\sum_{j=1,\dots,m}|\boldsymbol{x}^{j}-\boldsymbol{\mathrm{q}}^{j}(\boldsymbol{y}^{j})|^{2},
$$

where $y^{j}$ is the $j^{\mathrm{th}}$ subvector of $y$. The squared distances in the sum are read from look-up tables. These tables are constructed on-the-fly for a given query, prior to the search in the set of quantization codes, from each subvector $x^{j}$ and the $K_{\mathrm{s}}$ centroids associated with the corresponding quantizer $\mathrm{q}^{j}$. The complexity of the table generation is $\mathcal{O}(d\times K_{\mathrm{s}})$. When $K_{\mathrm{s}}\ll n$, this complexity is negligible compared to the summation cost of $\mathcal{O}(d\times n)$ in Equation 2.
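As an illustration of this look-up table mechanism, the following sketch (not the authors' code: the dimensions, data and codebooks are made up, with random centroids standing in for the $K$-means trained reproduction values) encodes a small database with a product quantizer and computes the ADC estimates of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, Ks = 8, 2, 4        # toy sizes: dimension, subvectors, centroids per subquantizer
ds = d // m               # subvector dimension d/m

# Illustrative codebooks, one (Ks, ds) matrix per subquantizer.
# In [5] these reproduction values are learned by k-means.
codebooks = rng.normal(size=(m, Ks, ds))

def encode(y):
    """Map y to m centroid indices (one byte each when Ks <= 256)."""
    return [int(((codebooks[j] - y[j*ds:(j+1)*ds]) ** 2).sum(axis=1).argmin())
            for j in range(m)]

def adc_distances(x, codes):
    """Estimate d(x, y_i)^2 for every code via per-subquantizer look-up tables."""
    # Table j holds the squared distance from x^j to every centroid of q^j:
    # built once per query, in O(d * Ks).
    tables = [((codebooks[j] - x[j*ds:(j+1)*ds]) ** 2).sum(axis=1)
              for j in range(m)]
    # Each estimate is then just m table look-ups and additions.
    return [sum(tables[j][c[j]] for j in range(m)) for c in codes]

database = rng.normal(size=(1000, d))
codes = [encode(y) for y in database]
x = rng.normal(size=d)
est = adc_distances(x, codes)
nn_a = int(np.argmin(est))    # approximate nearest neighbor NN_a(x)
```

Because squared distances add over subvectors, each estimate equals $|x-\mathrm{q}_ {\mathrm{c}}(y_{i})|^{2}$ exactly; the approximation lies only in replacing $y_{i}$ by its centroid.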

This approximate nearest neighbor method implicitly sees multidimensional indexing as a vector approximation problem: a database vector $y$ can be decomposed as

$$
y=\operatorname{q_{c}}(y)+\operatorname{r}(y),
$$

where $\operatorname{q}_{\mathrm{c}}(y)$ is the centroid associated with $y$ and $\operatorname{r}(y)$ the error vector resulting from the quantization, called the residual vector. It is proved [5] that the square error between the distance and its estimation is bounded, on average, by the quantization error. This ensures, asymptotically, perfect search results when increasing the number of bits allocated for the quantization indexes.

The ADC indexing method is fully parametrized by the number of subvectors $m$ and the total number of bits $b_{\mathrm{c}}$. In the following, we allocate 8 bits per subquantizer (i.e., $K_{\mathrm{s}}=256$), as suggested in [5], which means that we use exactly $m$ bytes per indexed vector.

3. RE-RANKING NEIGHBORS USING SOURCE CODING

3.1. Refinement: principle

The objective of the method proposed in this paper is to avoid the costly post-verification scheme adopted in most state-of-the-art approximate search techniques [3, 10]. The idea is to take advantage of the information on the database point provided by the indexing. This is possible when using the ADC method [5]: this search algorithm provides an explicit approximation $\operatorname{q}_{\mathrm{c}}(y)$ of database vector $y$ .

We first assume that the first retrieval stage returns a set of $k^{\prime}$ hypotheses. These vectors are the ones for which a post-verification is required. For each database vector $y$, the error vector is equal to

$$
\operatorname{r}(y)=y-\operatorname{q}_{\mathbf{c}}(y).
$$

The proposed method consists in reducing the energy of this residual vector to limit the impact of the approximation error on the


Fig. 1. Illustration of the proposed refinement process. For each database vector $y$, the distance $d_{\mathrm{c}}(x,y)=d(x,\operatorname{q}_ {\mathrm{c}}(y))$ is computed to build the short-list of potential nearest neighbors. For selected $y$ vectors, the distance is re-estimated by $d_{\mathrm{r}}(x,y)$, which is obtained by computing the distance between $x$ and the improved approximation $\hat{y}=\mathrm{q_{c}}(y)+\mathrm{q_{r}}(y-\mathrm{q_{c}}(y))$.

estimated distances. This is done by encoding the residual vector $\operatorname{r}(y)$ using another product quantizer $\mathrm{q_{r}}$ defined by its reproduction values $\mathcal{C}_{\mathrm{r}}$ :

$$
\mathrm{q_{r}}(\mathrm{r}(y))=\arg\operatorname*{min}_ {c\in\mathcal{C}_{\mathrm{r}}}\left|\mathrm{r}(y)-c\right|^{2},
$$

where the product quantizer $\operatorname{q}_ {\mathrm{r}}(.)$ is learned on an independent set of residual vectors. Similar to $\mathrm{q}_ {\mathrm{c}}(.)$ , the set of reproduction values $\mathcal{C}_{\mathrm{r}}$ is never exhaustively listed, as all operations are performed using the product space decomposition.

The coded residual vector can be seen as the “least significant bits”, except that the term “bits” usually refers to scalar quantization. An improved estimation $\hat{y}$ of $y$ is the sum of the approximation vector and the decoded residual vector:

$$
\hat{y}=\operatorname{q_{c}}(y)+\operatorname{q_{r}}(\operatorname{r}(y)).
$$

As shown in Figure 1, this estimator will be used at search time to update the distance estimation between the query $_x$ and the database vectors $y$ that are selected as potential neighbors:

$$
d(x,y)^{2}\approx d_{\mathrm{r}}(x,y)^{2}=|\mathrm{q}_ {\mathrm{c}}(y)+\mathrm{q}_{\mathrm{r}}(\mathrm{r}(y))-x|^{2}.
$$

The refinement product quantizer $\mathrm{q_{r}}$ is parametrized by its number of subquantizers and the total number of bits. Similar to $\mathrm{q_{c}}$, we use 8 bits per subquantizer. Therefore the only parameter for the refinement quantizer is the number $m^{\prime}$ of bytes for each code. The total memory usage per indexed vector is $m+m^{\prime}$ bytes.
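The two-stage code can be sketched as follows. This is an illustrative toy, not the paper's implementation: it uses small made-up dimensions and random codebooks in place of the $K$-means trained $\mathrm{q_{c}}$ (learned on training vectors) and $\mathrm{q_{r}}$ (learned on residuals), but it shows how $m+m^{\prime}$ bytes yield the reconstruction $\hat{y}=\mathrm{q_{c}}(y)+\mathrm{q_{r}}(\mathrm{r}(y))$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, Ks = 8, 2, 16       # toy sizes; here the refinement also uses m' = m subquantizers
ds = d // m

def pq_encode(cb, v):
    """Nearest-centroid index for each subvector."""
    return [int(((cb[j] - v[j*ds:(j+1)*ds]) ** 2).sum(axis=1).argmin())
            for j in range(m)]

def pq_decode(cb, code):
    """Reproduction value: concatenation of the selected centroids."""
    return np.concatenate([cb[j][code[j]] for j in range(m)])

# Illustrative random codebooks standing in for the learned quantizers.
cb_c = rng.normal(size=(m, Ks, ds))   # first-stage quantizer q_c
cb_r = rng.normal(size=(m, Ks, ds))   # refinement quantizer q_r

y = rng.normal(size=d)
code_c = pq_encode(cb_c, y)             # m bytes, stored by the index
residual = y - pq_decode(cb_c, code_c)  # r(y) = y - q_c(y)
code_r = pq_encode(cb_r, residual)      # m' bytes, stored for re-ranking
y_hat = pq_decode(cb_c, code_c) + pq_decode(cb_r, code_r)   # refined estimate of y
```

At query time, $d_{\mathrm{r}}(x,y)^{2}=|\hat{y}-x|^{2}$ replaces the first estimate for the short-listed vectors.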

3.2. Algorithm

This subsection details how the refinement codes are used to re-rank the hypotheses provided by the ADC. The resulting approach will be denoted by $\mathrm{ADC}{+}\mathrm{R}$ in the experimental section. As for most indexing algorithms, we distinguish between the offline stage and the query stage, during which the system needs to be very efficient. The offline stage consists in learning the indexing parameters and building the indexing structure associated with a vector dataset. It is performed as follows.

  1. The quantizers $\mathrm{q}_ {\mathrm{c}}(.)$ and $\operatorname{q}_ {\mathrm{r}}(.)$ are learned on a training set.
  2. The vector dataset $\mathcal{V}=\{y_{1},\dots,y_{n}\}$ to be indexed is encoded using $\mathrm{q_{c}}$, producing codes $\operatorname{q}_ {\mathrm{c}}(y_{i})$ for $i=1,\ldots,n$.
  3. The residual vectors are encoded, producing the codes $\operatorname{q}_ {\mathrm{r}}\left(y_{i}-\operatorname{q}_ {\mathrm{c}}\left(y_{i}\right)\right)$ associated with all the indexed vectors.

Searching a query vector $x$ proceeds as follows:

  1. The first retrieval stage computes the estimates $d_{\mathrm{c}}(x,y_{i})$ and selects the $k^{\prime}$ database vectors with the smallest estimated distances, producing the short-list of hypotheses.
  2. For each hypothesis, the refinement code is decoded and the improved estimate $d_{\mathrm{r}}(x,y)$ is computed from $\hat{y}$.
  3. The hypotheses are re-ranked based on $d_{\mathrm{r}}$, and the $k$ first ones are returned.

On output, we obtain a re-ranked list of $k$ approximate nearest neighbors. The choice of the number $k^{\prime}$ of vectors in the short-list depends on the parameters $m$, $m^{\prime}$, $k$ and on the distribution of the vectors. In order for the post-verification scheme to have a negligible complexity, we typically set the ratio $k^{\prime}/k$ to 2.
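The two-stage ranking logic itself is independent of the underlying quantizers; a minimal sketch (the helper name and the toy distance values are ours, not the paper's):

```python
import heapq

def search_with_reranking(dc_list, dr_fn, k, k_prime):
    """Rank all items by the coarse estimate d_c, keep the k' best as the
    short-list, then re-rank only those with the refined estimate d_r."""
    shortlist = heapq.nsmallest(k_prime, range(len(dc_list)),
                                key=dc_list.__getitem__)
    return sorted(shortlist, key=dr_fn)[:k]

# Toy values: d_r is taken as the exact distance, d_c as a noisy version of it.
exact = [5.0, 1.0, 3.0, 0.5, 4.0, 2.0]
coarse = [5.2, 1.4, 2.9, 0.9, 3.8, 1.9]
result = search_with_reranking(coarse, lambda i: exact[i], k=2, k_prime=4)
# → [3, 1]: vectors 3 and 1 have the smallest refined distances
```

Only the $k^{\prime}$ short-listed items pay the re-ranking cost, which is why the ratio $k^{\prime}/k$ can stay small.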

3.3. Non-exhaustive variant

Up to now, we have only considered the ADC method of [5], which requires an exhaustive scan of the dataset codes. Note however that the re-ranking method proposed in this paper can be applied to any method for which an approximate reconstruction of the indexed vectors is possible, e.g., [11]. In particular, in the experimental section we evaluate our approach with the IVFADC variant of [5], that avoids the aforementioned exhaustive scan by using an inverted file structure. This requires an additional coarse quantizer.

Adapting our re-ranking method to the IVFADC method is straightforward, as this method also provides an explicit approximation of the indexed vectors. In addition to the numbers $m$ and $m^{\prime}$ of bytes used to encode the vector, this variant requires two additional parameters: the number $c$ of reproduction values of the coarse quantizer and the number $v$ of inverted lists that are visited for a given query. The main advantage of this variant is to compute the distance estimators only for a fraction (in the order of $v/c)$ of the database, at the risk of missing some nearest neighbors if $v/c$ is not large enough. Note finally that the memory usage is increased by $\log_{2}(c)$ bits (typically 4 bytes), due to the inverted file structure. In the following, the IVFADC method used jointly with our re-ranking method will be denoted by IVFADC+R.

4. EXPERIMENTS

4.1. BIGANN: a billion-sized evaluation dataset

To evaluate ANN search methods, we propose a new evaluation dataset available online: http://corpus-texmex.irisa.fr. This benchmark, called BIGANN, consists of 128-dimensional SIFT descriptors (widely adopted image descriptors [8]) extracted from approximately 1 million images. It comprises three distinct subsets:

• base vectors: one billion vectors to search in
• query vectors: 10000 vectors that are submitted to the system
• learning vectors: a set of 100 million vectors to compute the parameters involved in the indexing method.

The ground truth has been pre-computed: for each query, we provide the $k$ nearest neighbors that are obtained when computing exact Euclidean distances, as well as their square distance to the query vector. The ground truth for smaller sets ($n{=}1$M, 2M, 5M, 10M, ..., 200M vectors) is also provided. As our own approach does not require many training vectors, we only used the first million vectors from the learning set. All measurements (accuracy and timings) were averaged over the 1000 first queries.

Table 1. Performance and efficiency measured on 1 billion vectors, $m=8$. The query time is measured in seconds per query. The timings validate the limited impact of the re-ranking stage on efficiency.

Method     m′   recall@1   @10     @100    time/query
ADC        0    0.075      0.274   0.586   5.626
ADC+R      8    0.258      0.683   0.951   5.686
           16   0.434      0.895   0.982   5.692
           32   0.656      0.970   0.985   5.689
IVFADC     0    0.088      0.372   0.733   0.074
IVFADC+R   8    0.262      0.701   0.962   0.116
           16   0.429      0.894   0.982   0.119
           32   0.630      0.977   0.983   0.120

4.2. Evaluation protocol

The search quality is measured by the recall@$r$ measure, i.e., the proportion of queries whose nearest neighbor is ranked in the first $r$ positions. The curve obtained by varying $r$ corresponds to the distribution function of the ranks, and the point $r{=}1$ corresponds to the “precision” measure used in [10] to evaluate ANN methods. Also, the recall@$r$ is the fraction of queries for which the nearest neighbor would be retrieved correctly if a short-list of $k=r$ vectors was verified using exact Euclidean distances.
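The recall@$r$ measure can be computed directly from the returned rankings; a small sketch with made-up rankings and ground-truth ids:

```python
def recall_at_r(rankings, ground_truth, r):
    """Proportion of queries whose true nearest neighbor appears
    within the first r positions of the returned ranking."""
    hits = sum(gt in ranking[:r]
               for ranking, gt in zip(rankings, ground_truth))
    return hits / len(ground_truth)

# Made-up returned id rankings (one per query) and true nearest-neighbor ids.
rankings = [[7, 2, 9], [1, 5, 3], [4, 8, 0], [6, 4, 2]]
truth = [2, 1, 0, 9]
r1 = recall_at_r(rankings, truth, 1)   # 0.25: only the second query ranks its NN first
r3 = recall_at_r(rankings, truth, 3)   # 0.75: the last query never retrieves id 9
```

Recall@$r$ is non-decreasing in $r$, which is why the curves in Figures 2 and 3 saturate toward 1.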

The efficiency is measured by actual timings.

4.3. Evaluation of the proposed approach

Unless explicitly specified, our approach is evaluated by querying in the whole BIGANN set (i.e., one billion vectors). The performance of the ADC and IVFADC algorithms is given for reference, and compared with the re-ranked versions ADC+R and IVFADC+R. In all experiments, we have set $k{=}10000$ and $k^{\prime}{=}20000$. In addition, for the IVFADC+R we have fixed $c{=}8192$ and $v{=}64$, which means that the query is compared to approximately $1/128^{\mathrm{th}}$ of the indexed vectors.

The re-ranking gain: We first consider the improvement brought by the re-ranking stage compared with the reference methods. Figure 2 shows the importance of this re-ranking stage on the recall@$r$ measure: the performance of ADC+R (resp. IVFADC+R) is significantly better than that of ADC (resp. IVFADC).

These observations are confirmed by Table 1, which additionally shows that the impact of the re-ranking stage on the query time is limited. As already reported in [5], the IVFADC version is better than the ADC method, at the cost of 4 additional bytes per indexed vector. As expected, this observation remains valid when comparing IVFADC+R with ADC+R.

Performance for a fixed memory usage: Table 2 shows that, for a given memory usage (i.e., for $m+m^{\prime}$ constant), the re-ranking approach ADC+R achieves recall similar to that of ADC, at a lower computing cost. The search is approximately two times faster, as the search time is dominated by the first retrieval stage, whose complexity is asymptotically linear in $m$ and $n$ for large values of $n$. One can also observe that near perfect neighbors are obtained by increasing the number $m^{\prime}$ of bytes used for re-ranking.


Fig. 2. Searching in one billion vectors: impact of the re-ranking stage on the search accuracy (recall@$r$), $m=8$.

Table 2. Comparative accuracy of ADC $(m^{\prime}{=}0)$ and ADC+R for the same total amount of memory (bytes per indexed vector). In these experiments, the ADC+R approach with $m=m^{\prime}$ is approximately $2\times$ more efficient than the ADC approach, as the cost of the re-ranking stage is almost negligible, as shown in Table 1. These experiments have been performed on the whole BIGANN set.

bytes/vector   m    m′   recall@1   @10     @100
8              8    0    0.075      0.274   0.586
               4    4    0.056      0.223   0.504
16             16   0    0.245      0.671   0.952
               8    8    0.258      0.683   0.951
32             32   0    0.487      0.956   0.999
               16   16   0.571      0.977   1.000
64             64   0    0.791      1.000   1.000
               32   32   0.832      0.999   1.000

Impact of the dataset size: Figure 3 shows the impact of the database size on the recall $@10$ measure. The re-ranking stage becomes more important as the database grows in size, due to an increasing number of outliers. Interestingly, the quality of the search degrades more gracefully when using a precise post-verification scheme (with $m^{\prime}{=}16$ bytes).

5. CONCLUSION

In this paper, following recent works on multi-dimensional indexing based on source coding, we propose a method to re-rank the vectors with a limited amount of memory, thereby avoiding costly disk accesses. Refining the neighbor hypotheses strongly improves recall at a rather low cost in response time. Our experimental validation is performed on a new public dataset of one billion vectors.

6. ACKNOWLEDGEMENTS

This work was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation.


Fig. 3. Impact of the database size on search performance, measured by recall@10, $m=8$.

7. REFERENCES
