[Paper Translation] Searching in One Billion Vectors: Re-rank with Source Coding


Original article: https://arxiv.org/pdf/1102.3828


SEARCHING IN ONE BILLION VECTORS: RE-RANK WITH SOURCE CODING

Hervé Jégou INRIA Rennes

Romain Tavenard University Rennes I

Matthijs Douze INRIA Grenoble

Laurent Amsaleg CNRS, IRISA

ABSTRACT

Recent indexing techniques inspired by source coding have been shown to successfully index billions of high-dimensional vectors in memory. In this paper, we propose an approach that re-ranks the neighbor hypotheses obtained by these compressed-domain indexing methods. In contrast to the usual post-verification scheme, which performs exact distance calculations on the short-list of hypotheses, the estimated distances are refined based on short quantization codes, avoiding the need to read the full vectors from disk.

We have released a new public dataset of one billion 128- dimensional vectors and proposed an experimental setup to evaluate high dimensional indexing algorithms on a realistic scale. Experiments show that our method accurately and efficiently re-ranks the neighbor hypotheses using little memory compared to the full vectors representation.

Index Terms— nearest neighbor search, quantization, source coding, high dimensional indexing, large databases

1. INTRODUCTION

Approximate nearest neighbor (ANN) search methods [3, 10, 14, 15] are required to handle large databases, especially for computer vision [12] and music retrieval [2] applications. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [3]. However, most of these approaches are memory consuming, since several hash tables or trees are required. The methods of [4, 15], which embed the vectors into a binary space, better satisfy the memory constraint. They are, however, significantly outperformed in terms of the trade-off between memory usage and accuracy by recent methods that cast high dimensional indexing as a source coding problem [11, 5, 6]; in particular, the product quantization-based method of [5] exhibits impressive results for large scale image search [6].

State-of-the-art approaches usually perform a re-ranking stage to produce a ranked list of nearest neighbors. This is always done in partitioning-based methods such as LSH [3] or FLANN [10], as the neighbor hypotheses are not ranked on output of the index. But as shown in [5], this post-verification is also important for methods based on binary [4, 15] or quantized codes [11, 5, 6], as the ranking provided on output of the large scale search is significantly improved when verifying the few first hypotheses. In all these approaches, re-ranking is performed by computing the exact Euclidean distance between each query and the returned hypotheses. For large datasets, this raises a memory issue: the vectors cannot be stored in central memory and must be read from disk on-the-fly. Moreover, the vectors are to a large extent accessed randomly, which in practice limits the number of hypotheses that can be verified.

In this paper, we propose a new post-verification scheme in the context of compression-based indexing methods. We focus on the method presented in [5], which offers state-of-the-art performance, outperforming FLANN, which was itself previously shown to outperform LSH [10]. It also provides an explicit approximation of the indexed vectors. Our algorithm exploits the approximation resulting from the first ranking, and refines it using codes stored in RAM. There is an analogy between this approach and the scalable compression techniques proposed in the last decade [13], where the term “scalable” means that a reconstruction of the compressed signal is refined in an incremental manner by successive description layers.

In order to evaluate our approach, we introduce a dataset of one billion vectors extracted from millions of images using the standard SIFT descriptor [8]. Testing on a large scale is important, as most ANN methods are usually evaluated on sets of unrealistic size, thereby ignoring memory issues that arise in real applications, where billions of vectors have to be handled [4, 9]. The ground truth nearest-neighbors have been computed for 10000 queries using exact distance computations. To our knowledge, this set is the largest ever released to evaluate ANN algorithms against an exact linear scan on real data: the largest other experiment we are aware of is the one reported in [7], where a private set of 179 million vectors was considered. [1] reports an experiment on 1 billion vectors, but on synthetic data with a known model exploited by the algorithm.

Experiments performed on this dataset show that the proposed approach offers an alternative to the standard post-verification scheme. The precision of the search is significantly improved by the re-ranking step, leading to state-of-the-art performance on this scale, without accessing the disk.

2. CONTEXT: COMPRESSION BASED INDEXING

In this section, we briefly review the indexing method of [5], which finds the approximate $k$ nearest neighbors using a source coding approach. For the sake of presentation, we describe only the Asymmetric Distance Computation (ADC) method proposed in [5]. We also assume that we search for the nearest neighbor (i.e., $k=1$).

Let $x\in\mathbb{R}^{d}$ be a query vector and $\mathcal{V}=\{y_{1},\dots,y_{n}\}$ a set of vectors in which we want to find the nearest neighbor $\operatorname{NN}(x)$ of $x$. The ADC approach consists in encoding each vector $y_{i}$ by a quantized version $c_{i}=\mathrm{q}_ {\mathrm{c}}(y_{i})\in\mathbb{R}^{d}$. For a quantizer $\mathrm{q}_ {\mathrm{c}}(.)$ with $K$ centroids, the vector is encoded by $b_{\mathrm{c}}=\log_{2}(K)$ bits, assuming $K$ is a power of 2. An approximate distance $d_{\mathrm{c}}(x,y_{i})$ between a query $x$ and a database vector is computed as

$$
d_{\mathrm{c}}(x,y_{i})^{2}=|x-\operatorname{q}_ {\mathrm{c}}(y_{i})|^{2}.
$$

The approximate nearest neighbor $\mathrm{NN}_{a}(x)$ of $x$ is obtained by minimizing this distance estimator:

$$
\mathbf{NN}_ {a}(x)=\arg\operatorname*{min}_ {i}d_{\mathrm{c}}(x,y_{i})^{2}=\arg\operatorname*{min}_ {i}|x-\operatorname{q}_ {\mathrm{c}}(y_{i})|^{2},
$$

which is an approximation of the exact distance calculation

$$
\mathbf{NN}(x)=\arg\operatorname*{min}_ {i}\left|x-y_{i}\right|^{2}.
$$

Note that, in contrast with the binary embedding method of [15], the query $x$ is not converted to a code: there is no approximation error on the query side.

To get a good vector approximation, $K$ should be large ($K=2^{64}$ for a 64-bit code). For such large values of $K$, learning a $K$-means codebook is not tractable, neither is the assignment of the vectors to their nearest centroids. To address this issue, [5] uses a product quantizer, for which there is no need to explicitly enumerate the centroids. A vector $y\in\mathbb{R}^{d}$ is first split into $m$ subvectors $y^{1},...,y^{m}\in\mathbb{R}^{d/m}$. A product quantizer is then defined as a function

$$
\begin{array}{r}{\mathrm{q}_{\mathrm{c}}(y)=\bigl(\mathrm{q}^{1}(y^{1}),...,\mathrm{q}^{m}(y^{m})\bigr),}\end{array}
$$

which maps the input vector $y$ to a tuple of indices by separately quantizing the subvectors. Each individual quantizer $\mathrm{q}^{j}(.)$ has $K_{\mathrm{s}}$ reproduction values, learned by $K$-means. To limit the assignment complexity, $\mathcal{O}(m\times K_{\mathrm{s}})$, $K_{\mathrm{s}}$ is set to a small value (e.g. $K_{\mathrm{s}}=256$). However, the set of $K$ centroids induced by the product quantizer $\mathrm{q}_ {\mathrm{c}}(.)$ is large, as $K=(K_{\mathrm{s}})^{m}$. The squared distances in Equation 2 are computed using the decomposition

$$
d_{\mathrm{c}}(\boldsymbol{x},\boldsymbol{y})^{2}=|\boldsymbol{x}-\boldsymbol{\mathrm{q}}_ {\mathrm{c}}(\boldsymbol{y})|^{2}=\sum_{j=1,\dots,m}|\boldsymbol{x}^{j}-\boldsymbol{\mathrm{q}}^{j}(\boldsymbol{y}^{j})|^{2},
$$

where $y^{j}$ is the $j^{\mathrm{th}}$ subvector of $y$. The squared distances in the sum are read from look-up tables. These tables are constructed on-the-fly for a given query, prior to the search in the set of quantization codes, from each subvector $x^{j}$ and the $K_{\mathrm{s}}$ centroids associated with the corresponding quantizer $\mathrm{q}^{j}$. The complexity of the table generation is $\mathcal{O}(d\times K_{\mathrm{s}})$. When $K_{\mathrm{s}}\ll n$, this complexity is negligible compared to the summation cost of $\mathcal{O}(d\times n)$ in Equation 2.
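As an illustration of this look-up table mechanism, the following sketch (not the authors' code: the dimensions, data and codebooks are made up, with random centroids standing in for the $K$-means trained reproduction values) encodes a small database with a product quantizer and computes the ADC estimates of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, Ks = 8, 2, 4        # toy sizes: dimension, subvectors, centroids per subquantizer
ds = d // m               # subvector dimension d/m

# Illustrative codebooks, one (Ks, ds) matrix per subquantizer.
# In [5] these reproduction values are learned by k-means.
codebooks = rng.normal(size=(m, Ks, ds))

def encode(y):
    """Map y to m centroid indices (one byte each when Ks <= 256)."""
    return [int(((codebooks[j] - y[j*ds:(j+1)*ds]) ** 2).sum(axis=1).argmin())
            for j in range(m)]

def adc_distances(x, codes):
    """Estimate d(x, y_i)^2 for every code via per-subquantizer look-up tables."""
    # Table j holds the squared distance from x^j to every centroid of q^j:
    # built once per query, in O(d * Ks).
    tables = [((codebooks[j] - x[j*ds:(j+1)*ds]) ** 2).sum(axis=1)
              for j in range(m)]
    # Each estimate is then just m table look-ups and additions.
    return [sum(tables[j][c[j]] for j in range(m)) for c in codes]

database = rng.normal(size=(1000, d))
codes = [encode(y) for y in database]
x = rng.normal(size=d)
est = adc_distances(x, codes)
nn_a = int(np.argmin(est))    # approximate nearest neighbor NN_a(x)
```

Because squared distances add over subvectors, each estimate equals $|x-\mathrm{q}_ {\mathrm{c}}(y_{i})|^{2}$ exactly; the approximation lies only in replacing $y_{i}$ by its centroid.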

This approximate nearest neighbor method implicitly sees multidimensional indexing as a vector approximation problem: a database vector $y$ can be decomposed as

$$
y=\operatorname{q_{c}}(y)+\operatorname{r}(y),
$$

where $\operatorname{q}_{\mathrm{c}}(y)$ is the centroid associated with $y$ and $\operatorname{r}(y)$ the error vector resulting from the quantization, called the residual vector. It is proved [5] that the square error between the distance and its estimation is bounded, on average, by the quantization error. This ensures, asymptotically, perfect search results when increasing the number of bits allocated for the quantization indexes.

The ADC indexing method is fully parametrized by the number of subvectors $m$ and the total number of bits $b_{\mathrm{c}}$. In the following, we allocate 8 bits per subquantizer (i.e., $K_{\mathrm{s}}=256$), as suggested in [5], which means that we use exactly $m$ bytes per indexed vector.

3. RE-RANKING NEIGHBORS USING SOURCE CODING

3.1. Refinement: principle

The objective of the method proposed in this paper is to avoid the costly post-verification scheme adopted in most state-of-the-art approximate search techniques [3, 10]. The idea is to take advantage of the information on the database point provided by the indexing. This is possible when using the ADC method [5]: this search algorithm provides an explicit approximation $\operatorname{q}_{\mathrm{c}}(y)$ of database vector $y$ .

We first assume that the first retrieval stage returns a set of $k^{\prime}$ hypotheses. These vectors are the ones for which a post-verification is required. For each database vector $y$, the error vector is equal to

$$
\operatorname{r}(y)=y-\operatorname{q}_{\mathbf{c}}(y).
$$

The proposed method consists in reducing the energy of this residual vector to limit the impact of the approximation error on the


Fig. 1. Illustration of the proposed refinement process. For each database vector $y$, the distance $d_{\mathrm{c}}(x,y)=d(x,\operatorname{q}_ {\mathrm{c}}(y))$ is computed to build the short-list of potential nearest neighbors. For selected $y$ vectors, the distance is re-estimated by $d_{\mathrm{r}}(x,y)$, which is obtained by computing the distance between $x$ and the improved approximation $\hat{y}=\mathrm{q_{c}}(y)+\mathrm{q_{r}}(y-\mathrm{q_{c}}(y))$.

estimated distances. This is done by encoding the residual vector $\operatorname{r}(y)$ using another product quantizer $\mathrm{q_{r}}$ defined by its reproduction values $\mathcal{C}_{\mathrm{r}}$ :

$$
\mathrm{q_{r}}(\mathrm{r}(y))=\arg\operatorname*{min}_ {c\in\mathcal{C}_{\mathrm{r}}}\left|\mathrm{r}(y)-c\right|^{2},
$$

where the product quantizer $\operatorname{q}_ {\mathrm{r}}(.)$ is learned on an independent set of residual vectors. Similar to $\mathrm{q}_ {\mathrm{c}}(.)$ , the set of reproduction values $\mathcal{C}_{\mathrm{r}}$ is never exhaustively listed, as all operations are performed using the product space decomposition.

The coded residual vector can be seen as the “least significant bits”, except that the term “bits” usually refers to scalar quantization. An improved estimation $\hat{y}$ of $y$ is the sum of the approximation vector and the decoded residual vector:

$$
\hat{y}=\operatorname{q_{c}}(y)+\operatorname{q_{r}}(\operatorname{r}(y)).
$$

As shown in Figure 1, this estimator will be used at search time to update the distance estimation between the query $_x$ and the database vectors $y$ that are selected as potential neighbors:

$$
d(x,y)^{2}\approx d_{\mathrm{r}}(x,y)^{2}=|\mathrm{q}_ {\mathrm{c}}(y)+\mathrm{q}_{\mathrm{r}}(\mathrm{r}(y))-x|^{2}.
$$

The refinement product quantizer $\mathrm{q_{r}}$ is parametrized by its number of subquantizers and the total number of bits. Similar to $\mathrm{q_{c}}$, we use 8 bits per subquantizer. Therefore the only parameter for the refinement quantizer is the number $m^{\prime}$ of bytes for each code. The total memory usage per indexed vector is $m+m^{\prime}$ bytes.
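The two-stage code can be sketched as follows. This is an illustrative toy, not the paper's implementation: it uses small made-up dimensions and random codebooks in place of the $K$-means trained $\mathrm{q_{c}}$ (learned on training vectors) and $\mathrm{q_{r}}$ (learned on residuals), but it shows how $m+m^{\prime}$ bytes yield the reconstruction $\hat{y}=\mathrm{q_{c}}(y)+\mathrm{q_{r}}(\mathrm{r}(y))$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, Ks = 8, 2, 16       # toy sizes; here the refinement also uses m' = m subquantizers
ds = d // m

def pq_encode(cb, v):
    """Nearest-centroid index for each subvector."""
    return [int(((cb[j] - v[j*ds:(j+1)*ds]) ** 2).sum(axis=1).argmin())
            for j in range(m)]

def pq_decode(cb, code):
    """Reproduction value: concatenation of the selected centroids."""
    return np.concatenate([cb[j][code[j]] for j in range(m)])

# Illustrative random codebooks standing in for the learned quantizers.
cb_c = rng.normal(size=(m, Ks, ds))   # first-stage quantizer q_c
cb_r = rng.normal(size=(m, Ks, ds))   # refinement quantizer q_r

y = rng.normal(size=d)
code_c = pq_encode(cb_c, y)             # m bytes, stored by the index
residual = y - pq_decode(cb_c, code_c)  # r(y) = y - q_c(y)
code_r = pq_encode(cb_r, residual)      # m' bytes, stored for re-ranking
y_hat = pq_decode(cb_c, code_c) + pq_decode(cb_r, code_r)   # refined estimate of y
```

At query time, $d_{\mathrm{r}}(x,y)^{2}=|\hat{y}-x|^{2}$ replaces the first estimate for the short-listed vectors.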

3.2. Algorithm

This subsection details how the refinement codes are used to re-rank the hypotheses provided by the ADC. The resulting approach will be denoted by $\mathrm{ADC}{+}\mathrm{R}$ in the experimental section. As for most indexing algorithms, we distinguish between the offline stage and the query stage, during which the system needs to be very efficient. The offline stage consists in learning the indexing parameters and building the indexing structure associated with a vector dataset. It is performed as follows.

  1. The quantizers $\mathrm{q}_ {\mathrm{c}}(.)$ and $\operatorname{q}_ {\mathrm{r}}(.)$ are learned on a training set.
  2. The vector dataset $\mathcal{V}=\{y_{1},\dots,y_{n}\}$ to be indexed is encoded using $\mathrm{q_{c}}$, producing codes $\operatorname{q}_ {\mathrm{c}}(y_{i})$ for $i=1,\ldots,n$.
  3. The residual vectors are encoded, producing the codes $\operatorname{q}_ {\mathrm{r}}\left(y_{i}-\operatorname{q}_ {\mathrm{c}}\left(y_{i}\right)\right)$ associated with all the indexed vectors.

Searching a query vector $x$ proceeds as follows:

  1. The first retrieval stage computes the estimates $d_{\mathrm{c}}(x,y_{i})$ and selects the $k^{\prime}$ database vectors with the smallest estimated distances, producing the short-list of hypotheses.
  2. For each hypothesis, the refinement code is decoded and the improved estimate $d_{\mathrm{r}}(x,y)$ is computed from $\hat{y}$.
  3. The hypotheses are re-ranked based on $d_{\mathrm{r}}$, and the $k$ first ones are returned.

On output, we obtain a re-ranked list of $k$ approximate nearest neighbors. The choice of the number $k^{\prime}$ of vectors in the short-list depends on the parameters $m$, $m^{\prime}$, $k$ and on the distribution of the vectors. In order for the post-verification scheme to have a negligible complexity, we typically set the ratio $k^{\prime}/k$ to 2.
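The two-stage ranking logic itself is independent of the underlying quantizers; a minimal sketch (the helper name and the toy distance values are ours, not the paper's):

```python
import heapq

def search_with_reranking(dc_list, dr_fn, k, k_prime):
    """Rank all items by the coarse estimate d_c, keep the k' best as the
    short-list, then re-rank only those with the refined estimate d_r."""
    shortlist = heapq.nsmallest(k_prime, range(len(dc_list)),
                                key=dc_list.__getitem__)
    return sorted(shortlist, key=dr_fn)[:k]

# Toy values: d_r is taken as the exact distance, d_c as a noisy version of it.
exact = [5.0, 1.0, 3.0, 0.5, 4.0, 2.0]
coarse = [5.2, 1.4, 2.9, 0.9, 3.8, 1.9]
result = search_with_reranking(coarse, lambda i: exact[i], k=2, k_prime=4)
# → [3, 1]: vectors 3 and 1 have the smallest refined distances
```

Only the $k^{\prime}$ short-listed items pay the re-ranking cost, which is why the ratio $k^{\prime}/k$ can stay small.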

3.3. Non-exhaustive variant

Up to now, we have only considered the ADC method of [5], which requires an exhaustive scan of the dataset codes. Note however that the re-ranking method proposed in this paper can be applied to any method for which an approximate reconstruction of the indexed vectors is possible, e.g., [11]. In particular, in the experimental section we evaluate our approach with the IVFADC variant of [5], that avoids the aforementioned exhaustive scan by using an inverted file structure. This requires an additional coarse quantizer.

Adapting our re-ranking method to the IVFADC method is straightforward, as this method also provides an explicit approximation of the indexed vectors. In addition to the numbers $m$ and $m^{\prime}$ of bytes used to encode the vector, this variant requires two additional parameters: the number $c$ of reproduction values of the coarse quantizer and the number $v$ of inverted lists that are visited for a given query. The main advantage of this variant is to compute the distance estimators only for a fraction (in the order of $v/c)$ of the database, at the risk of missing some nearest neighbors if $v/c$ is not large enough. Note finally that the memory usage is increased by $\log_{2}(c)$ bits (typically 4 bytes), due to the inverted file structure. In the following, the IVFADC method used jointly with our re-ranking method will be denoted by IVFADC+R.

4. EXPERIMENTS

4.1. BIGANN: a billion-sized evaluation dataset

To evaluate ANN search methods, we propose a new evaluation dataset available online: http://corpus-texmex.irisa.fr. This benchmark, called BIGANN, consists of 128-dimensional SIFT descriptors (widely adopted image descriptors [8]) extracted from approximately 1 million images. It comprises three distinct subsets:

• base vectors: one billion vectors to search in
• query vectors: 10000 vectors that are submitted to the system
• learning vectors: a set of 100 million vectors to compute the parameters involved in the indexing method.

The ground truth has been pre-computed: for each query, we provide the $k$ nearest neighbors that are obtained when computing exact Euclidean distances, as well as their square distance to the query vector. The ground truth for smaller sets ($n{=}1$M, 2M, 5M, 10M, ..., 200M vectors) is also provided. As our own approach does not require many training vectors, we only used the first million vectors from the learning set. All measurements (accuracy and timings) were averaged over the 1000 first queries.

Table 1. Performance and efficiency measured on 1 billion vectors, $m=8$. The query time is measured in seconds per query. The timings validate the limited impact of the re-ranking stage on efficiency.

Method     m′   recall@1   @10     @100    time/query
ADC        0    0.075      0.274   0.586   5.626
ADC+R      8    0.258      0.683   0.951   5.686
           16   0.434      0.895   0.982   5.692
           32   0.656      0.970   0.985   5.689
IVFADC     0    0.088      0.372   0.733   0.074
IVFADC+R   8    0.262      0.701   0.962   0.116
           16   0.429      0.894   0.982   0.119
           32   0.630      0.977   0.983   0.120

4.2. Evaluation protocol

The search quality is measured by the recall@$r$ measure, i.e., the proportion of queries whose nearest neighbor is ranked in the first $r$ positions. The curve obtained by varying $r$ corresponds to the distribution function of the ranks, and the point $r{=}1$ corresponds to the “precision” measure used in [10] to evaluate ANN methods. Also, the recall@$r$ is the fraction of queries for which the nearest neighbor would be retrieved correctly if a short-list of $k=r$ vectors was verified using exact Euclidean distances.
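The recall@$r$ measure can be computed directly from the returned rankings; a small sketch with made-up rankings and ground-truth ids:

```python
def recall_at_r(rankings, ground_truth, r):
    """Proportion of queries whose true nearest neighbor appears
    within the first r positions of the returned ranking."""
    hits = sum(gt in ranking[:r]
               for ranking, gt in zip(rankings, ground_truth))
    return hits / len(ground_truth)

# Made-up returned id rankings (one per query) and true nearest-neighbor ids.
rankings = [[7, 2, 9], [1, 5, 3], [4, 8, 0], [6, 4, 2]]
truth = [2, 1, 0, 9]
r1 = recall_at_r(rankings, truth, 1)   # 0.25: only the second query ranks its NN first
r3 = recall_at_r(rankings, truth, 3)   # 0.75: the last query never retrieves id 9
```

Recall@$r$ is non-decreasing in $r$, which is why the curves in Figures 2 and 3 saturate toward 1.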

The efficiency is measured by actual timings.

4.3. Evaluation of the proposed approach

Unless explicitly specified, our approach is evaluated by querying in the whole BIGANN set (i.e., one billion vectors). The performance of the ADC and IVFADC algorithms is given for reference, and compared with the re-ranked versions ADC+R and IVFADC+R. In all experiments, we have set $k{=}10000$ and $k^{\prime}{=}20000$. In addition, for the IVFADC+R we have fixed $c{=}8192$ and $v{=}64$, which means that the query is compared to approximately $1/128^{\mathrm{th}}$ of the indexed vectors.

The re-ranking gain: We first consider the improvement brought by the re-ranking stage compared with the reference methods. Figure 2 shows the importance of this re-ranking stage on the recall@$r$ measure: the performance of ADC+R (resp. IVFADC+R) is significantly better than that of ADC (resp. IVFADC).

These observations are confirmed by Table 1, which additionally shows that the impact of the re-ranking stage on the query time is limited. As already reported in [5], the IVFADC version is better than the ADC method, at the cost of 4 additional bytes per indexed vector. As expected, this observation remains valid when comparing IVFADC+R with ADC+R.

Performance for a fixed memory usage: Table 2 shows that, for a given memory usage (i.e., for $m+m^{\prime}$ constant), the re-ranking approach ADC+R achieves recall similar to that of ADC, at a lower computing cost. The search is approximately two times faster, as the search time is dominated by the first retrieval stage, whose complexity is asymptotically linear in $m$ and $n$ for large values of $n$. One can also observe that near perfect neighbors are obtained by increasing the number $m^{\prime}$ of bytes used for re-ranking.


Fig. 2. Searching in one billion vectors: impact of the re-ranking stage on the search accuracy (recall@$r$), $m=8$.

Table 2. Comparative accuracy of ADC $(m^{\prime}{=}0)$ and ADC+R for the same total amount of memory (bytes per indexed vector). In these experiments, the ADC+R approach with $m=m^{\prime}$ is approximately $2\times$ more efficient than the ADC approach, as the cost of the re-ranking stage is almost negligible, as shown in Table 1. These experiments have been performed on the whole BIGANN set.

bytes/vector   m    m′   recall@1   @10     @100
8              8    0    0.075      0.274   0.586
               4    4    0.056      0.223   0.504
16             16   0    0.245      0.671   0.952
               8    8    0.258      0.683   0.951
32             32   0    0.487      0.956   0.999
               16   16   0.571      0.977   1.000
64             64   0    0.791      1.000   1.000
               32   32   0.832      0.999   1.000

Impact of the dataset size: Figure 3 shows the impact of the database size on the recall $@10$ measure. The re-ranking stage becomes more important as the database grows in size, due to an increasing number of outliers. Interestingly, the quality of the search degrades more gracefully when using a precise post-verification scheme (with $m^{\prime}{=}16$ bytes).

5. CONCLUSION

In this paper, following recent works on multi-dimensional indexing based on source coding, we propose a method to re-rank the vectors with a limited amount of memory, thereby avoiding costly disk accesses. Refining the neighbor hypotheses strongly improves recall at a rather low cost in response time. Our experimental validation is performed on a new public dataset of one billion vectors.

6. ACKNOWLEDGEMENTS

This work was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation.


Fig. 3. Impact of the database size on search performance, measured by recall@10, $m=8$.

7. REFERENCES
