[论文翻译][部分转载]当SIFT遇上CNN:图像检索任务十年总结 SIFT Meets CNN: A Decade Survey of Instance R


下载PDF:https://arxiv.org/pdf/1608.01807v2.pdf


SIFT Meets CNN: A Decade Survey of Instance Retrieval 当 SIFT 遇上 CNN:图像检索任务十年总结

ABSTRACT

In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single-pass of an image to the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.

在基于内容的图像检索技术(CBIR)发展早期,研究人员大多基于图像的全局特征进行研究。自 2003 年开始,由于 SIFT 特征在图像变换问题中的优异表现,十多年来基于局部描述算子(如 SIFT 描述算子)的图像检索方法一直被广泛研究。最近,基于卷积神经网络(CNN)的图像表示方法吸引了社区越来越多的关注,同时这种方法也展现出了令人赞叹的性能。我们领域正处于快速发展时期,本文对实例检索近十多年来的进展进行了综合且全面的调查研究,主要展示了基于 SIFT 和 CNN 特征的两类主要方法。对 SIFT 一类的方法,我们根据字典本大小,将相关文献按照字典的大/中/小规模进行组织。对 CNN 一类的方法,我们主要依据预训练模型,微调模型和混合模型进行分类和讨论。预训练模型和微调模型方法采用了单通道的图像输入方法而混合模型则采用了基于块的特征提取策略。本篇综述选取了在现代实例检索任务中先前的各类工作,展现了该任务中的里程碑时刻,并提出了关于 SIFT 与 CNN 的内在联系的见解。在分析与比较了各种方法在几个数据集上的检索性能后,我们分别讨论了通用实例检索和专用实例检索任务未来的发展前景。

Instance retrieval, SIFT, convolutional neural network, literature survey.

Introduction

Content-based image retrieval (CBIR) has been a long-standing research topic in the computer vision society. In the early 1990s, the study of CBIR truly started. Images were indexed by the visual cues, such as texture and color, and a myriad of algorithms and image retrieval systems have been proposed. A straightforward strategy is to extract global descriptors. This idea dominated the image retrieval community in the 1990s and early 2000s. Yet, a well-known problem is that global signatures may fail the invariance expectation to image changes such as illumination, translation, occlusion and truncation. These variances compromise the retrieval accuracy and limit the application scope of global descriptors. This problem has given rise to local feature based image retrieval.

基于内容的图像检索任务(CBIR)是计算机视觉领域一项由来已久的研究课题。CBIR 研究在 20 世纪 90 年代早期正式开始,研究人员根据诸如纹理、颜色这样的视觉特征对图像建立索引,在这一时期大量的算法和图像检索系统被提出。其中一种简单明了的策略就是提取出图像的全局描述符,这种策略在 1990s 和 2000s 早期是图像检索社区研究的重点。然而,众所周知,全局描述符这种方法在诸如光照,形变,遮挡和裁剪这些情况下难以达到预想的效果。这些缺陷也导致了图像检索准确率的低下,也局限了全局描述符算法的应用范围。恰在这时,基于局部特征的图像检索算法给解决这一问题带来了曙光。

The focus of this survey is instance-level image retrieval. In this task, given a query image depicting a particular object/scene/architecture, the aim is to retrieve images containing the same object/scene/architecture that may be captured under different views, illumination, or with occlusions. Instance retrieval departs from class retrieval [1] in that the latter aims at retrieving images of the same class with the query. In the following, if not specified, we use “image retrieval” and “instance retrieval” interchangeably.
本篇综述主要关注于实例级的图像检索任务。在这个任务中,给定一张物体/场景/建筑类型的待查询图片,查询出包含拍摄自不同角度、光照或有遮挡的,含有相同物体/场景/建筑的图片。实例检索不同于类别检索任务,因为后者的目标是检索出同类别的图片。接下来,如果没有特别指出的话,“图像检索”和“实例检索”两个名词可以相互替代。

The milestones of instance retrieval in the past years are presented in Fig. 1, in which the times of the SIFT-based and CNN-based methods are highlighted. The majority of traditional methods can be considered to end in 2000 when Smeulders et al. [2] presented a comprehensive survey of CBIR “at the end of the early years”. Three years later (2003) the Bag-of-Words (BoW) model was introduced to the image retrieval community [3], and in 2004 was applied to image classification [4], both relying on the SIFT descriptor [5]. The retrieval community has since witnessed the prominence of the BoW model for over a decade during which many improvements were proposed. In 2012, Krizhevsky et al. [6] with the AlexNet achieved the state-of-the-art recognition accuracy in ILSRVC 2012, exceeding previous best results by a large margin. Since then, research focus has begun to transfer to deep learning based methods [7, 8, 9, 10], especially the convolutional neural network (CNN).
在图 1 中我们展示了多年来实例检索任务中的里程碑时刻,并且在图中着重标出了基于 SIFT 特征和 CNN 特征算法的提出的时间。2000 年可以认为是大部分传统方法结束的时间节点,当时 Smeulders 等撰写了“图像检索早期发展的终结”这篇综述。三年后(2003),词袋模型(BoW)进入图像检索社区的视野,并在 2004 年结合了 SIFT 方法符被应用于图像分类任务。这后来的近 10 年时间里,社区见证了 BoW 模型的优越性,它给图像检索任务带来了各种提升。在 2012 年,Krizhevsky 等人使用 AlexNet 神经网络模型在 ILSRVC 2012 上取得了当时世界上最高的识别准确率。从那以后,研究的重心开始向基于深度学习特别是卷积神经网络(CNN)的方法转移。

Milestones of instance retrieval. After a survey of methods before the year 2000 by Smeulders
Fig. 1: Milestones of instance retrieval.After a survey of methods before the year 2000 by Smeulders et al [2], Sivic and Zisserman [3] proposed Video Google in 2003, marking the beginning of the BoW model. Then, the hierarchical k-means and approximate k-means were proposed by Stewénius and Nistér [11] and Philbin et al. [12], respectively, marking the use of large codebooks in retrieval. In 2008, Jégou et al. [13] proposed Hamming Embedding, a milestone in using medium-sized codebooks. Then, compact visual representations for retrieval were proposed by Perronnin et al. [14] and Jégou et al. [15] in 2010. Although SIFT-based methods were still moving forward, CNN-based methods began to gradually take over, following the pioneering work of Krizhevsky et al. [6]. In 2014, Razavian et al. [7] proposed a hybrid method extracting multiple CNN features from an image. Babenko et al. [8] were the first to fine-tune a CNN model for generic instance retrieval. Both [9, 10] employ the column features from pre-trained CNN models, and [10] inspires later state-of-the-art methods. These milestones are the representative works of the categorization scheme in this survey.
图 1:图像检索里程碑
在 Smeulders [ 2 ]在 2000 年之前对方法进行了调查之后,Sivic 和 Zisserman [ 3 ]在 2003 年提出了 Video Google,标志着 BoW 模型的开始。然后,由 Stewénius 和 Nistér [ 11 ]以及 Philbin 等人 提出了层次 k-均值和近似 k-均值*。[ 12 ]分别标记了在检索中使用大型码本。在 2008 年,Jégou等人。* [ 13 ]提出了汉明嵌入(Hamming Embedding),这是使用中型码本的里程碑。然后,Perronnin 等人 提出了用于检索的紧凑视觉表示*。[ 14 ]和 Jégou等。* [ 15 ]在 2010 年。尽管基于 SIFT 的方法仍在向前发展,但随着 Krizhevsky 等人 的开创性工作,基于 CNN 的方法开始逐渐被人们所接受*。[ 6 ]。在 2014 年,Razavian等人。* [ 7 ]提出了一种从图像中提取多个 CNN 特征的混合方法。Babenko 等。 [ 8 ]是第一个为通用实例检索微调 CNN 模型的公司。既[ 910 ]采用柱从预先训练 CNN 模型的特征,和[ 10 ]激发后状态的最先进的方法。这些里程碑是本次调查中分类方案的代表作品。

The SIFT-based methods mostly rely on the BoW model. BoW was originally proposed for modeling documents because the text is naturally parsed into words. It builds a word histogram for a document by accumulating word responses into a global vector. In the image domain, the introduction of the scale-invariant feature transform (SIFT) makes the BoW model feasible . Originally, SIFT is comprised of a detector and descriptor, but which are used in isolation now; in this survey, if not specified, SIFT usually refers to the 128-dim descriptor, a common practice in the community. With a pre-trained codebook (vocabulary), local features are quantized to visual words. An image can thus be represented in a similar form to a document, and classic weighting and indexing schemes can be leveraged.

In recent years, the popularity of SIFT-based models seems to be overtaken by the convolutional neural network (CNN), a hierarchical structure that has been shown to outperform hand-crafted features in many vision tasks. In retrieval, competitive performance compared to the BoW models has been reported, even with short CNN vectors . The CNN-based retrieval models usually compute compact representations and employ the Euclidean distance or some approximate nearest neighbor (ANN) search methods for retrieval. Current literature may directly employ the pre-trained CNN models or perform fine-tuning for specific retrieval tasks. A majority of these methods feed the image into the network only once to obtain the descriptor. Some are based on patches which are passed to the network multiple times, a similar manner to SIFT; we classify them into hybrid methods in this survey.

基于 SIFT 特征的方法大多依赖于 BoW 模型。BoW 模型最初是为解决文档建模问题而提出的,因为文本本身就是由单词组成的。它通过累加单词响应到一个全局向量来给文档建立单词直方图。在图像领域,尺度不变(SIFT)特征的引入使得 BoW 模型变得可行。最初,SIFT 由检测器和描述符组成,但现在描述符被单独提取出来使用。在这篇综述中,如果没有特别指明的话,SIFT 往往是指 128 维的描述符(译者注:OpenCV 的 SIFT 实现也是默认生成 128 维向量),这也是社区的惯例。通过一个预训练的字典(译者注:补充说明一下,在工业界的项目中,待检索的图像往往有特定的范围,使用特定范围内的有代表性的图片构建出预训练字典可以取得比较好的效果),局部特征被量化表示为视觉词汇。一张图片能够被表示成类似文档的形式,这样就可以使用经典的权重索引方案。

近几年,CNN 这种层次结构模型在许多视频相关的任务上取得的成绩远好于手工特征,基于 SIFT 特征的模型的风头似乎被 CNN 盖过了。基于 CNN 的检索模型通常计算出紧密的图像表示向量,并使用欧氏距离或 ANN(approximate nearest neighbor)查找算法进行检索。最近的文献可能会直接使用预训练好的 CNN 模型或微调后应用于特定的检索任务。这些方法大多只将图像输入到网络中一次就可以获取描述符。一些基于图像块的方法则是将图像多次输入到网络中,这和 SIFT 方法的习惯有些类似;在这篇综述中,我们将这些方法称为混合型方法。

Organization of This Paper 本文结构

Upon the time of change, this paper provides a comprehensive literature survey of both the SIFT-based and CNN-based instance retrieval methods. We first present the categorization methodology in Section 2. We then describe the two major method types in Section 3 and Section 4, respectively. On several benchmark datasets, Section 5 summarizes the comparisons between SIFT- and CNN-based methods. In Section 6, we point out two possible future directions. This survey will be concluded in Section 7.

随着时间的推移,本文对基于 SIFT 的实例检索方法和基于 CNN 的实例检索方法进行了全面的文献综述。我们首先在第 2 节中介绍分类方法。然后,我们分别在第 3 节和第 4 节中描述两种主要的方法类型。在一些基准数据集上,第 5 节总结了基于 SIFT 和 CNN 的方法之间的比较。在第 6 节中,进一步地,我们将基于 SIFT 的方法根据编码本大小又分为大,中,小编码本三类。我们注意到,编码本的大小与所选取的编码方法紧密相关。基于 CNN 的方法分为使用预训练的模型,微调的模型以及混合模型方法。他们的异同点列于表 1。我们指出了两个可能的未来方向。该调查将在第 7 节中结束。

Categorization Methodology 分类方法

image.png

TABLE I: Major differences between various types of instance retrieval models. For SIFT-based methods, hand-crafted local invariant features are extracted, and according to the codebook sizes, different encoding and indexing strategies are leveraged. For CNN-based methods, pre-trained, fine-tuned CNN models and hybrid methods are the primary types; fixed-length compact vectors are usually produced, combined with approximate nearest neighbor (ANN) methods.
表 I:各种类型的实例检索模型之间的主要区别。对于基于 SIFT 的方法,将提取手工制作的局部不变特征,并根据码本的大小,利用不同的编码和索引策略。对于基于 CNN 的方法,预训练的,经过微调的 CNN 模型和混合方法是主要类型。通常生成固定长度的紧凑向量,并结合近似最近邻(ANN)方法。

According to the different visual representations, this survey categorizes the retrieval literature into two broad types: SIFT-based and CNN-based. The SIFT-based methods are further organized into three classes: using large, medium-sized or small codebooks. We note that the codebook size is closely related to the choice of encoding methods. The CNN-based methods are categorized into using pre-trained or fine-tuned CNN models, as well as hybrid methods. Their similarities and differences are summarized in Table I.

The SIFT-based methods had been predominantly studied before 2012 [6] (good works also appear in recent years [18, 19]). This line of methods usually use one type of detector, e.g., Hessian-Affine, and one type of descriptor, e.g., SIFT. Encoding maps a local feature into a vector. Based on the size of the codebook used during encoding, we classify SIFT-based methods into three categories as below.

  • [leftmargin=0pt]
  • Using small codebooks. The visual words are fewer than several thousand. Compact vectors are generated [15, 14] before dimension reduction and coding.
  • Using medium-sized codebooks. Given the sparsity of BoW and the low discriminative ability of visual words, the inverted index and binary signatures are used [13]. The trade-off between accuracy and efficiency is a major influencing factor [20].
  • Using large codebooks. Given the sparse BoW histograms and the high discriminative ability of visual words, the inverted index and memory-friendly signatures are used [21]. Approximate methods are used in codebook generation and encoding [11, 12].

基于 SIFT 的方法在 2012 年之前一直是研究的重点(当然近年来也有不少相关的杰出工作)。这一类方法通常使用如 Hessian-Affine 这种探测器,同时也使用 SIFT 这种描述符。编码本将局部特征映射到一组向量中。基于编码本大小,我们将基于 SIFT 的方法分为如下三类。

  • 使用小型编码本。视觉词汇少于几千个,紧凑编码向量在降维和编码之前生成。
  • 使用中型编码本。鉴于 BoW 的稀疏性和视觉词汇的低区分度,使用倒排索引和二进制签名方法。准确率和效率间的权衡是影响算法的主要因素。
  • 使用大型编码本。鉴于 BoW 直方图的稀疏性和视觉词汇的高区分度,在算法中使用了倒排索引和存储友好型的签名方式。在编码本的生成和编码中使用了类似的方法。

The CNN-based methods extract features using CNN models. Compact (fixed-length) representations are usually built. There are three classes:

  • Hybrid methods. Image patches are fed into CNN multiple times for feature extraction [7]. Encoding and indexing are similar to SIFT-based methods [22].

  • Using pre-trained CNN models. Features are extracted in a single pass using CNN pre-trained on some large-scale datasets like ImageNet [23]. Compact Encoding/pooling techniques are used [9, 10].

  • Using fine-tuned CNN models. The CNN model (e.g., pre-trained on ImageNet) is fine-tuned on a training set in which the images share similar distributions with the target database [8]. CNN features can be extracted in an end-to-end manner through a single pass to the CNN model. The visual representations exhibit improved discriminative ability [24, 17].

    基于 CNN 的方法使用 CNN 模型提取特征,建立紧凑向量(固定长度)。它们也分为三类:

  • 混合型方法。图像块被多次输入进 CNN 用于特征提取。编码与索引方法和基于 SIFT 的检索方法近似。

  • 使用预训练的模型。通过在大规模图像集(例如 ImageNet)上预训练的 CNN 模型进行单通道传播提取特征。使用紧凑编码/池化技术进行检索。

  • 使用微调的模型。在训练图像与目标数据库具有相似的分布的训练集上,对 CNN 模型进行微调。通过单通道 CNN 模型,运用端到端的方法提取出 CNN 特征。这种视觉表示方法提升了模型的区分能力。

SIFT-based Image Retrieval 基于 SIFT 特征的图像检索系统

Pipeline

The pipeline of SIFT-based retrieval is introduced in Fig. pipeline. Suppose we have a gallery $ \mathcal{G} $ consisting of $ N $ images. Given a feature detector, we extract local descriptors from the regions around the sparse interest points or dense patches. We denote the local descriptors of $ D $ detected regions in an image as $ {f_i}{i = i}^D, f_i\in\mathbb{R}^p $ . SIFT-based methods train a codebook offline. Each visual word in the codebook lies in the center of a subspace, called the Voronoi cell. A larger codebook corresponds to a finer partitioning, resulting in more discriminative visual words and vice versa. Suppose that a pool of local descriptors $ \mathcal{F}\equiv{f_i}{i=1}^M $ are computed from an unlabeled training set. The baseline approach, i.e., k-means, partitions the $ M $ points into $ K $ clusters; the $ K $ visual words thus constitute a codebook of size $ K $ . A local descriptor $ f_i \in\mathbb{R}^p $ is mapped into a feature embedding $ g_i \in\mathbb{R}^l $ through the feature encoding process, $ f_i \rightarrow g_i $ . When k-means clustering is used, $ f_i $ can be encoded according to its distances to the visual words. For large codebooks, hard and soft quantization are good choices. In the former, the resulting embedding $ g_i $ has only one non-zero entry; in the latter, $ f_i $ can be quantized to a small number of visual words. A global signature is produced after a sum-pooling of all the embeddings of local features. For medium-sized codebooks, additional binary signatures can be generated to preserve the original information. When using small codebooks, popular encoding schemes include vector of locally aggregated descriptors (VLAD) , Fisher vector (FV) , etc.

基于 SIFT 特征与 CNN 特征的图像检索流程如图 2 所示。假设我们有一个由$ N $图像组成的图库$\mathcal{G} $。给定特征检测器,我们从稀疏兴趣点或密集斑块周围的区域中提取本地描述符。我们表示$ D $检测到的本地描述符作为$ {f_i}^D_{i = i}, f_i\in\mathbb{R}^p $。表示图像中被检测的区域。

编码本的训练。基于 SIFT 的方法你先训练编码本。编码本中的每一个视觉词汇位于子空间的中心,这称为“沃罗诺伊 voronoi 单元”。更大的码本对应于更精细的划分,从而产生更多辨别性的视觉词,反之亦然。假设从未标记的训练集计算了一款本地描述符$F = {f_i}_{i=1}^M $。基线方法,即, k 均值,将$ M $分区为$ K $集群;因此,$ K $视觉单词构成大小$ K $的码本。本地描述符$ f_i \in\mathbb{R}^p $被映射到嵌入特征$ g_i \in\mathbb{R}^l $通过特征编码过程,$ f_i \rightarrow g_i $。当使用 K-Means 群集时,可以根据其距离对视觉单词进行编码$ f_i $。对于大规模的编码本,硬量化和软量化方法都是很好的选择。前者量化得到只有一个非零条目的嵌入特征$ g_i $;在后者中,$ f_i $可以量化为少量视觉单词。在总结本地特征的嵌入式的总和汇总后生成全局签名。对于中型码本,可以生成其他二进制签名以保留原始信息。使用小型码本时,流行的编码方案包括当地聚合描述符(VLAD),Fisher Vector(FV),* ETC *的向量。

A general pipeline of SIFT- and CNN-based retrieval models. Features are computed from hand-crafted detectors for SIFT, and densely applied filters or image patches for CNN. In both methods, under small codebooks, encoding/pooling is employed to produce compact vectors. In SIFT-based methods, the inverted index is necessary under large/medium-sized codebooks. The CNN features can also be computed in an end-to-end way using fine-tuned CNN models.
Fig. 2: A general pipeline of SIFT- and CNN-based retrieval models. Features are computed from hand-crafted detectors for SIFT, and densely applied filters or image patches for CNN. In both methods, under small codebooks, encoding/pooling is employed to produce compact vectors. In SIFT-based methods, the inverted index is necessary under large/medium-sized codebooks. The CNN features can also be computed in an end-to-end way using fine-tuned CNN models.
图 2:基于 SIFT 和 CNN 的检索模型的一般流水线。特征是通过手工制作的 SIFT 检测器和密集应用的滤镜或 CNN 图像补丁计算得出的。在这两种方法中,在小型码本下,都采用编码/合并来生成紧凑向量。在基于 SIFT 的方法中,在大/中型码本下,倒排索引是必需的。也可以使用微调的 CNN 模型以端到端的方式计算 CNN 功能。

Local Feature Extraction 局部特征提取

Local invariant features aim at accurate matching of local structures between images [26]. SIFT-based methods usually share a similar feature extraction step composed of a feature detector and a descriptor.

Local detector. The interest point detectors aim to reliably localize a set of stable local regions under various imaging conditions. In the retrieval community, finding affine-covariant regions has been preferred. It is called “covariant” because the shapes of the detected regions change with the affine transformations, so that the region content (descriptors) can be invariant. This kind of detectors are different from keypoint-centric detectors such as the Hessian detector [27], and from those focusing on scale-invariant regions such as the difference of Gaussians (DoG) [5] detector. Elliptical regions which are adapted to the local intensity patterns are produced by affine detectors. This ensures that the same local structure is covered under deformations caused by viewpoint variances, a problem often encountered in instance retrieval. In the milestone work [3], the Maximally Stable Extremal Region (MSER) detector [28] and the affine extended Harris-Laplace detector are employed, both of which are affine-invariant region detectors. MSER is used in several later works [29, 11]. Starting from [12], the Hessian-affine detector [30] has been widely adopted in retrieval. It has been shown to be superior to the difference of Gaussians (DoG) detector [13, 31], due to its advantage in reliably detecting local structures under large viewpoint changes. To fix the orientation ambiguity of these affine-covariant regions, the gravity assumption is made [32]. The practice which dismisses the orientation estimation is employed by later works [33, 34] and demonstrates consistent improvement on architecture datasets where the objects are usually upright. Other non-affine detectors have also been tested in retrieval, such as the Laplacian of Gaussian (LOG) and Harris detectors used in [35]. For objects with smooth surfaces [36], few interest points can be detected, so the object boundaries are good candidates for local description.

On the other hand, some employ the dense region detectors. In the comparison between densely sampled image patches and the detected patches, Sicre et al. [37] report the superiority of the former. To recover the rotation invariance of dense sampling, the dominant angle of patches is estimated in [38]. A comprehensive comparison of various dense sampling strategies, the interest point detectors, and those in between can be accessed in [39].

Local Descriptor. With a set of detected regions, descriptors encode the local content. SIFT [5] has been used as the default descriptor. The 128-dim vector has been shown to outperform competing descriptors in matching accuracy [40]. In an extension, PCA-SIFT [41] reduces the dimension from 128 to 36 to speed up the matching process at the cost of more time in feature computation and loss of distinctiveness. Another improvement is RootSIFT [33], calculated by two steps: 1) ℓ1 normalize the SIFT descriptor, 2) square root each element. RootSIFT is now used as a routine in SIFT-based retrieval. Apart from SIFT, SURF [42] is also widely used. It combines the Hessian-Laplace detector and a local descriptor of the local gradient histograms. The integral image is used for acceleration. SURF has a comparable matching accuracy with SIFT and is faster to compute. See [43] for comparisons between SIFT, PCA-SIFT, and SURF. To further accelerate the matching speed, binary descriptors [44] replace Euclidean distance with Hamming distance during matching.

Apart from hand-crafted descriptors, some also propose learning schemes to improve the discriminative ability of local descriptors. For example, Philbin et al. [45] proposes a non-linear transformation so that the projected SIFT descriptor yields smaller distances for true matches. Simoyan et al. [34] improve this process by learning both the pooling region and a linear descriptor projection.

局部不变特征针对精准匹配图像局部结构而提出。基于 SIFT 的方法和大多特征提取步骤类似,都是由特征检测器和描述符组成。

局部检测器。感兴趣点检测器旨在于在多样的图像场景中定位出一系列特征稳定的局部区域。在检索社区中,寻找图像的仿射协变区域(affine-covariant regions)一直是首选方法。它之所以称之为“协变的”是因为检测区域随着仿射变化而改变,因此区域描述符具有不变性。这种仿射协变区域检测器和以关键点为中心的海森检测器(Hessian detector)不同,当然也和以寻找尺度不变区域为目标的高斯差分检测器(DoG detector)。适应于局部强度模式的椭圆区域由仿射检测器探测到。这就确保了相同的图像局部结构即使是因为视角变化而产生形变时也能被检测到,视角形变问题也是图像检索任务中的常见问题。在里程碑事件中,我们也提到了最稳定连通域(MSER)检测器和仿射拓展的海尔-拉普拉斯检测器( affine extended Harris-Laplace detector)这两种具有仿射不变性的区域检测器。鉴于海森仿射检测器在解决视角变化问题中的优异性能,社区认为它是要优于 DoG 检测器的。为了解决这些仿射协变区域方向模糊的问题,重力假设方法应运而生。这种方法抛弃了方向估计的思路,并在建筑物数据集上的效果不断改善。在图像检索中也尝试了其他的非仿射检测器,例如拉普拉斯-高斯(LOG)和海尔检测器。对于表明光滑的物体,仅有少量的关键点会产生响应,因此可以用物体边缘作为局部特征描述。

另一方面,针对密集区域检测器也有不少研究。在对比了密集采样图像块和探测图像块两种方法后,Sicre 等指出前者表现更优。为了恢复密集采样图像块的旋转不变性,提出了图像块主旋转角方法。各种密集采样策略以及关键点检测器的综合比较可以在《A comparison of dense region detectors for image search and fine-grained classification》这篇文献中查阅到。

局部描述符。局部描述符使用一系列检测区域对局部图像内容进行编码。SIFT 描述符一直以来都是大家默认使用的描述符。这种 128 维的向量在匹配准确率上从众多描述符中脱颖而出。更进一步地,PCA-SIFT 描述符将特征向量的维度从 128 维减少到 36 维,通过增加特征建立计算量和降低区分度来加快匹配速度。另一种改进方法是 RootSIFT,它首先将 SIFT 描述符进行 $ \ell_1 $归一化,随后对每个元素开方。RootSIFT 现在是基于 SIFT 的检索方法惯用方法。除了 SIFT 之外,SURF 描述符也被广泛应用。SURF 描述符结合了海森-拉普拉斯检测器和局部梯度直方图。积分图技巧可以用于加速特征的计算(译者注:积分图技巧会对原图像首先生成一张积分图,这种通过空间换取时间的策略,在计算图像的诸如海尔特征时可以大幅提高计算速度)。SURF 可以取得和 SIFT 近乎一样的准确率,同时 SURF 计算速度更快。文献《A comparison of sift, pca-sift and surf》对 SIFT,PCA-SIFT 和 SURFF 进行了比较。为了进一步加快匹配速度,二值描述符用汉明距离替代了欧氏距离。

除了手工特征,一些研究人员也提出了学习式的方案来提高局部描述符特征的区分度。例如,Philbin 等提出了一种非线性的变换使得投影 SIFT 描述符为真实匹配产生更小的差异。Simoyan 等更进一步地设计了学习池化区域和线性描述符投影的方案来改进 Philbin 等的方案。

Retrieval Using Small Codebooks 使用小规模编码本进行检索

A small codebook has several thousand, several hundred or fewer visual words, so the computational complexity of codebook generation and encoding is moderate. Representative works include BoW , VLAD and FV . We mainly discuss VLAD and FV and refer readers to for a comprehensive evaluation of the BoW compact vectors.
小规模编码本一般包含几千、几百甚至更少的视觉词汇,因此编码本生成以及编码算法的时间复杂度不高。这方面一些有代表性的工作包括 BoW、VLAD 和 FV。我们主要讨论了 VLAD 和 FV 模型,同时根据《Multiple measurements and joint dimensionality reduction for large scale image search with short vectors》 这篇文献综合评价了 BoW 压缩向量。

Codebook Generation 生成编码本

Clustering complexity depends heavily on the codebook size. In works based on VLAD or FV , the codebook sizes are typically small, e.g., 64, 128, 256. For VLAD, flat k-means is employed for codebook generation. For FV, the Gaussian mixture model (GMM), i.e., $ u_{\lambda}(x) = \sum_{i=1}^K w_i u_i(x) $ , where $ K $ is the number of Gaussian mixtures, is trained using the maximum likelihood estimation. GMM describes the feature space with a mixture of $ K $ Gaussian distributions, and can be denoted as $ \lambda = {w_i, \mu_i, \sum_i, i = 1,...,K} $ , where $ w_i $ , $ \mu_i $ and $ \sum_i $ represent the mixture weight, the mean vector and the covariance matrix of Gaussian $ u_i $ , respectively.

聚类时的算法复杂度很大程度上依赖于编码本的大小。用 VLAD 或 FV 生成打的编码本通常很小,一般是 64,128,256。在 VLAD 中使用平面 k-means 聚类算法生成编码本。在 FV 中使用 GMM 算法,即, $ u_{\lambda}(x) = \sum_{i=1}^K w_i u_i(x) $,其中$ K $是使用最大似然估计训练的高斯混合的数量。 GMM 描述了具有$ K $高斯分布的混合的特征空间,并且可以被表示为$\lambda = {w_i, \mu_i, \sum_i, i = 1,...,K} $,其中$ w_i $,$\mu_i $和$\sum_i $别表示混合权重,均值向量和高斯曲线$ u_i $的协方差矩阵。

Encoding 编码方法

Due to the small codebook size, relative complex and information-preserving encoding techniques can be applied. We mainly describe FV, VLAD and their improvements in this section. With a pre-trained GMM model, FV describes the averaged first and second order difference between local features and the GMM centers. Its dimension is $ 2pK $ , where $ p $ is the dimension of the local descriptors and $ K $ is the codebook size of GMM. FV usually undergoes power normalization , to suppress the burstiness problem (to be described in Section tf-idf). In this step, each component of FV undergoes non-linear transformation featured by parameter $ \alpha $ , $ x_i := {sign}(x_i)|x_i|^\alpha $ . Then $ \ell_2 $ normalization is employed. Later, FV is improved from different aspects. For example, Koniusz et al. augment each descriptor with its spatial coordinates and associated tunable weights. In , larger codebooks (up to 4,096) are generated and demonstrate superior classification accuracy to smaller codebooks, at the cost of computational efficiency. To correct the assumption that local regions are identically and independently distributed (iid), Cinbis et al. propose non-iid models that discount the burstiness effect and yield improvement over the power normalization. The VLAD encoding scheme proposed by J'gou et al. can be thought of as a simplified version of FV. It quantizes a local feature to its nearest visual word in the codebook and records the difference between them. Nearest neighbor search is performed because of the small codebook size. The residual vectors are then aggregated by sum pooling followed by normalizations. The dimension of VLAD is $ pK $ . Comparisons of some important encoding techniques are presented in . Again, the improvement of VLAD comes from multiple aspects. In , J'gou and Chum suggest the usage of PCA and whitening (denoted as PCA $ _w $ in Table compare_methods_datasets) to de-correlate visual word co-occurrences, and the training of multiple codebooks to reduce quantization loss. In , Arandjeloviet al. extend VLAD in three aspects: 1) normalize the residual sum within each coarse cluster, called intra-normalization, 2) vocabulary adaptation to address the dataset transfer problem and 3) multi-VLAD for small object discovery.

因为小规模编码本尺寸小的缘故,相对复杂的和存储信息的方法可以在这上面使用。我们在这个小节中主要调研了 FV,VLAD 方法及其发展。使用预先训练的 GMM 模型,FV 描述局部特征和 GMM 中心之间的平均一阶和二阶差异。它的维度是$ 2pK $,其中$ p $是本地描述符的尺寸,$ K $是 GMM 编码本的长度。FV 通常进行指数归一化(power normalization)以抑制突发性问题。在这一步,FV 的每一部分在的非线性变换由参数$\alpha$,$ x_i := |x_i|^\alpha$表征。然后采用$\ell_2 $归一化。归一化,这样 FV 就从各方面得到提升。例如,Koniusz 等人用每个描述符的空间坐标和相关的可调权重来对其进行扩充。在《Revisiting the fisher vector for fine-grained classification》这篇文献中,更大的编码本(将近 4096)表现出比小编码本更好的分类准确率,当然计算花费同时也增大了。为了修正局部区域独立同分布这一假设,Cinbis 等人提出了非独立同分布模型,这个工作抑制了突发事件的影响,同时也改进了指数归一化的效果。

VLAD 编码方案由 Jégou 提出,可以认为 VLAD 是 FV 的简化版本。VLAD 量化将局部特征量化为最近邻视觉词汇,同时记录下两者的距离。由于编码本规模小,因此最近邻检索方案可行。在残差向量被总和池化聚合后进行归一化。VLAD 的维度是$ pK $。同样,研究人员对 VLAD 在多方面进行了改进。Jégou 和 Chum 提出使用 PCA 和白化(在表 5 中表示为 PCA $_w $))去消除视觉词语共现现象,并且训练多个编码本以减少量化带来的损失。Arandjelovi 等从三个方面拓展了 VLAD:1)归一化每个粗糙聚类中的残差和,称为内部归一化,2)通过词汇自适应来解决数据集迁移问题,3)用多 VLAD 模型(multi-VLAD)来解决小目标发掘问题。

Concurrent to , Delhumeau et al. propose to normalize each residual vector instead of the residual sums; they also advocate for local PCA within each Voronoi cell which does not perform dimension reduction as . A recent work employs soft assignment and empirically learns optimal weights for each rank to improve over the hard quantization. Note that some general techniques benefit various embedding methods, such as VLAD, FV, BoW, locality-constrained linear coding (LLC) and monomial embeddings. To improve the discriminative ability of embeddings, Tolias et al. propose the orientation covariant embedding to encode the dominant orientation of the SIFT regions jointly with the SIFT descriptor. It achieves a similar covariance property to weak geometric consistency (WGC) by using geometric cues within regions of interest so that matching points with similar dominant orientations are up-weighted and vice versa. The triangulation embedding only considers the direction instead of the magnitude of the input vectors. J'gou et al. also present a democratic aggregation that limits the interference between the mapped vectors. Baring a similar idea with democratic aggregation, Murray and Perronnin propose the generalized max pooling (GMP) optimized by equalizing the similarity between the pooled vector and each coding representation. The computational complexity of BoW, VLAD and FV is similar. We neglect the offline training and SIFT extraction steps. During visual word assignment, each feature should compute its distance (or soft assignment coefficient) with all the visual words (or Gaussians) for VLAD (or FV). So this step has a complexity of $ \mathcal{O}(pK) $ . In the other steps, complexity does not exceed $ \mathcal{O}(pK) $ . Considering the sum-pooling of the embeddings, the encoding process has an overall complexity of $ \mathcal{O}(pKD) $ , where $ D $ is the number of features in an image. Triangulation embedding , a variant of VLAD, has a similar complexity. The complexity of multi-VLAD is $ \mathcal{O}(pKD) $ , too, but it has a more costly matching process. Hierarchical VLAD has a complexity of $ \mathcal{O}(pKK'D) $ , where $ K' $ is the size of the secondary codebook. In the aggregation stage, both GMP and democratic aggregation have high complexity. The complexity of GMP is $ \mathcal{O}(\frac{P^2}{K}) $ , where $ P $ is the dimension of the feature embedding, while the computational cost of democratic aggregation comes from the Sinkhorn algorithm.

Delhumeau 等提出应该将残差向量归一化,而不是求残差和;他们还提倡在每个 Voronoi 格内进行 PCA 降维,而不是像《Aggregating local image descriptors into compact codes》中所提出的降维方案。《Improving large-scale image retrieval through robust aggregation of local descriptors》中提出应该使用软任务和经验性地为每个等级学习最佳权重来改进硬量化方案。

注意到许多常规的方法对 VLAD,FV,BoW,局部约束线性编码(LLC)以及单项嵌入这些嵌入方法有益。Tolias 等人提出结合 SIFT 描述符,用方向协变嵌入的方法来对 SIFT 特征主方向进行编码。它通过在感兴趣的区域内使用几何线索来实现与弱几何一致性(WGC)相似的协方差属性,使得与主方向相似的匹配点被加权,反之亦然。三角嵌入方法只考虑了输入向量的方向而没有考虑其大小。Jégou 等人同样也提出了一种限制映射向量之间干扰的民主聚合的方法。除了类似民主聚合的思想,Murray 和 Perronnin 提出,通过均衡池化向量和每个编码表示之间的相似性优化广义最大池化(GMP)方法。

BoW,VLAD 和 FV 的复杂度基本一致。我们忽视线下训练时间和 SIFT 特征提取时间。在视觉词汇分配这一步中,VLAD(FV)模型中每一个特征(高斯曲线)需要计算和每一个视觉词汇的距离(soft assignment coefficient),因此这一步的计算复杂度为$\mathcal{O}(pK) $。在其他步骤中,复杂性不超过$\mathcal{O}(pK) $。考虑到嵌入式的总和,编码过程具有$\mathcal{O}(pKD) $的总体复杂性,其中$ D $表示图像中提取的特征的数目。三角嵌入式 VLAD 的一个变异体,它和 VLAD 的计算复杂度相似,都是 $\mathcal{O}(pKD) $,但它具有更昂贵的匹配过程。分层 VLAD 具有$\mathcal{O}(pKK'D) $的复杂性,其中$ K' $是辅助码本的大小。在聚合阶段,GMP 和民主党两者都具有很高的复杂性。 GMP 的复杂性是$\mathcal{O}(\frac{P^2}{K}) $,其中$ P $是特征嵌入的维度, 民主聚类方法的计算复杂度主要来源于 Sinkhorn 算法。

ANN Search 最似最近邻检索

Due to the high dimensionality of the VLAD/FV embeddings, efficient compression and ANN search methods have been employed [61, 62]. For example, the principle component analysis (PCA) is usually adapted to for dimension reduction, and it is shown that retrieval accuracy even increases after PCA [53]. For hashing-based ANN methods, Perronnin et al. [47] use standard binary encoding techniques such as locality sensitive hashing [63] and spectral hashing [64]. Nevertheless, when being tested on the SIFT and Gist feature datasets, spectral hashing is shown to be outperformed by Product Quantization (PQ) [61]. In these quantization-based ANN methods, PQ is demonstrated to be better than other popular ANN methods such as FLANN [62] as well. A detailed discussion of VLAD and PQ can be viewed in [65]. PQ has since then been improved in a number of works. In [66], Douze et al. propose to re-order the cluster centroids so that adjacent centroids have small Hamming distances. This method is compatible with Hamming distance based ANN search, which offers significant speedup for PQ. We refer readers to [67] for a survey of ANN approaches.
由于 VLAD/FV 嵌入的维度相当高,因此研究人员提出了高效的压缩和最似最近邻检索(ANN)算法。例如,主成分分析(PCA)算法常适用于降维任务,特别是使用 PCA 降维后检索的准确度甚至会提高。对于基于哈希的最似最近邻方法,Perronnin 等人使用标准二值编码方法,如局部敏感哈希(LSH)和谱哈希(spectral hashing)。然而,在使用 SIFT 和 Gist 特征数据库进行测试时,谱哈希方法被证明要优于乘积量化方法。在这些基于量化的最似最近邻算法中,PQ 算法表现得最为出色。关于 VLAD 和 PQ 算法的详细研究可以参见《A comprehensive study over vlad and product quantization in large-scale image retrieval》。同样,PQ 算法后来也被不断改进。Douze 等人提出对聚类中心重新排序,使得相邻的中心具有较小的汉明距离。该方法与基于汉明距离的最似最近邻检索相兼容,这为 PQ 算法提供了显著的加速。我们参阅了《A survey on learning to hash》作为 ANN 方法的调研报告基础。

We also mention an emerging ANN technique, i.e., group testing [68], [69], [70]. In a nutshell, the database is decomposed into groups, each represented by a group vector. Comparisons between the query and group vectors reveal how likely a group contains a true match. Since group vectors are much fewer than the database vectors, search time is reduced. Iscen et al. [69] propose to directly find the best group vectors summarizing the database without explicitly forming the groups, which reduces the memory consumption.
我们还提到一种新兴的 ANN 算法,群组测试算法。简要地说,该算法将数据库分组,每组都由一个组向量表示。通过查询和组向量之间的比较计算出一个组包含正确匹配的可能性。因为组向量数目远少于数据库向量,因此检索时间大大缩短。Iscen 等提出直接找出数据库中最优组向量,而不用精确地分组,这个方案减少了内存的消耗。

image

image
Fig. 3: Two milestone clustering methods (a) hierarchical k-means (HKM) [11] and (b) approximate k-means (AKM) [12] for large codebook generation. Bold borders and blue discs are the clustering boundaries and centers of the first layer of HKM. Slim borders and red squares are the final clustering results in both methods.

图 3:在大规模编码本生成中的两个里程碑似的聚类算法 (a) 分层 k-means(HKM) (b) 近似 k-means(AKM)

Retrieval Using Large Codebooks 使用大规模编码本进行检索

A large codebook may contain 1 million visual words or more . Some major steps undergo important changes compared with using small codebooks.
一个大规模编码本可能含有 1 百万个甚至更多的视觉词汇。其中一些步骤和小编码本方案比起来有很大的差异。

Codebook Generation 生成编码本

Approximate methods are critical in assigning data into a large number of clusters. In the retrieval community, two representative works are hierarchical k-means (HKM) and approximate k-means (AKM) , as illustrated in Fig. timeline and Fig. clustering. Proposed in 2006, HKM applies standard k-means on the training features hierarchically. It first partitions the points into a few clusters (e.g., $ \bar{k}\ll K $ ) and then recursively partitions each cluster into further clusters. In every recursion, each point should be assigned to one of the $ \bar{k} $ clusters, with the depth of the cluster tree being $ \mathcal{O}(\log K) $ , where $ K $ is the target cluster number. The computational cost of HKM is therefore $ \mathcal{O}(\bar{k}M\log K) $ , where $ M $ is the number of training samples. It is much smaller than the complexity of flat k-means $ \mathcal{O}(MK) $ when $ K $ is large (a large codebook). The other milestone in large codebook generation is AKM . This method indexes the $ K $ cluster centers using a forest of random $ k $ -d trees so that the assignment step can be performed efficiently with ANN search. In AKM, the cost of assignment can be written as $ \mathcal{O}(K\log K + vM\log K) = \mathcal{O}(vM\log K) $ , where $ v $ is the number of nearest cluster candidates to be accessed in the $ k $ -d trees. So the computational complexity of AKM is on par with HKM and is significantly smaller than flat k-means when $ K $ is large. Experiments show that AKM is superior to HKM due to its lower quantization error (see Section quantization_large). In most AKM-based methods, the default choice for ANN search is FLANN .

在将数据分配到大量集群中,近似方法是至关重要的。在检索社区中,两个有代表性的工作是:分层 k-means(HKM)和近似 k-means(AKM),如图 1 和图 3 所示。HKM 方法在 2006 年提出,HKM 分层次地应用标准 k-means 方法进行特征训练。HKM 首先将特征空间中的点划分为几个簇,接着递归地将每个簇划分为更多的群集。在每次递归时,每个点都要被归类为$\bar{k} $集群之一,群集树的深度为$\mathcal{O}(\log K) $,其中$ K $是目标群集编号。因此,HKM 的计算成本是$\mathcal{O}(\bar{k}M\log K) $,其中$ M $是训练样本的数量。当$ K $大(一个大码本)时,它远小于扁平 k-means(flat k-means) $\mathcal{O}(MK) $的复杂性。

另一个里程碑式的大规模编码本生成算法是 AKM。该方法利用随机 K-D 树对 K 聚类中心进行索引,使得在分配步骤中能够高效地使用 ANN 搜索。在 AKM 中,分配的成本可以写入$\mathcal{O}(K\log K) = \mathcal{O} K) = \log(vM\log K) $,其中$ v $是在$ k $ -D1 -D 树中访问的最近群集候选的数量。因此,AKM 的计算复杂性与 HKM 为单位,并且当$ K $大时明显小于扁平 K-Means。实验表明,由于其较低的量化误差,AKM 优于 HKM(参见“尺寸化截式”部分)。在基于大多数基于 Akm 的方法中,ANN 搜索的默认选择是 FLANN。

Feature Encoding (Quantization) 特征编码(量化)

Feature encoding is interleaved with codebook clustering, because ANN search is critical in both components. The ANN techniques implied in some classic methods like AKM and HKM can be used in both clustering and encoding steps. Under a large codebook, the key trade-off is between quantization error and computational complexity. In the encoding step, information-preserving encoding methods such as FV [14], sparse coding [73] are mostly infeasible due to their computational complexity. It therefore remains a challenging problem how to reduce the quantization error while keeping the quantization process efficient.

Fro the ANN methods, the earliest solution is to quantize a local feature along the hierarchical tree structure [11]. Quantized tree nodes in different levels are assigned different weights. However, due to the highly imbalanced tree structure, this method is outperformed by k-d tree based quantization method [12]: one visual word is assigned to each local feature, using a k-d tree built from the codebook for fast ANN search. In an improvement to this hard quantization scheme, Philbin et al. [25] propose soft quantization by quantizing a feature into several nearest visual words. The weight of each assigned visual word relates negatively to its distance from the feature by $ \exp(-\frac{d^2}{2\sigma^2}) $ , where $ d $ is the distance between the descriptor and the cluster center. While soft quantization is based on the Euclidean distance, Mikulik et al. propose to find relevant visual words for each visual word through an unsupervised set of matching features. Built on a probabilistic model, these alternative words tend to contain descriptors of matching features. To reduce the memory cost of soft quantization and the number of query visual words, Cai et al. suggest that when a local feature is far away from even the nearest visual word, this feature can be discarded without a performance drop. To further accelerate quantization, scalar quantization suggests that local features be quantized without an explicitly trained codebook. A floating-point vector is binarized, and the first dimensions of the resulting binary vector are directly converted to a decimal number as a visual word. In the case of large quantization error and low recall, scalar quantization uses bit-flop to generate hundreds of visual words for a local feature.

特征编码与编码本聚类是相互交错的,ANN 检索对两者都至关重要。在 AKM 和 HKM 等一些经典方法中应用的 ANN 技术可用于聚类和编码中。在大规模编码本中,关键是要平衡量化误差和计算复杂度两者。在编码步骤中,诸如 FV,稀疏编码的信息存留式编码方法大都不可行,因为它们的计算复杂度过高。因此,如何在保证量化效率的同时减少量化误差仍是一个极具挑战的问题。

对 ANN 方法来说,最早的解决方案是沿着分层树结构量化局部特征。不同级别的量化树节点被赋予不同的权重。然而,由于高度不平衡的树结构,该方法优于基于 k-d 树的量化方法:一个视觉词被分配给每个局部特征,使用从码书构建的 k-d 树来进行快速 ANN 搜索。对这种硬量化方案的一种改进是 Philbin 等人提出的软量化方案,这种方案将一个特征量化为几个最近的视觉词汇。由式$\exp(-\frac{d^2}{2\sigma^2}) $指出,每个指定的视觉单词的权重与它到特征的距离呈负相关,其中 d 是描述符和聚类中心之间的距离。虽然软量化是基于欧几里得距离,但 Mikulik 等人提出通过无监督的匹配特征集为每个视觉单词找到相关的视觉单词。基于概率模型,这些备选词往往包含匹配特征的描述符。为了减少软量化的存储成本和查询视觉词汇的数量,Cai 等提出当局部特征离最近的视觉词汇距离很远时,该特征可以被丢弃而且不会带来性能的下降。为了进一步加速量化,标量量化提出局部特征在没有明确训练的编码本的情况下被量化。浮点向量是二值化的,并且所得的二进制向量的第一维直接转换为十进制数作为视觉词汇。在高量化误差和低召回率的情况下,标量量化使用位翻转(bit-flop)来为局部特征生成数百个视觉词汇。

Feature WeightingTF-IDF. TF-IDF 特征加权

The visual words in codebook $ \mathcal{C} $ are typically assigned specific weights, called the term frequency and inverse document frequency (TF-IDF), which are integrated with the BoW encoding. TF is defined as:
码本$\mathcal{C}$中往往被分配给指定的权重。称为频率与逆文档频率(TF-IDF),这种策略被集成在 BoW 编码中。TF 定义如下:

$$ {TF}(c_i^j) = o_i^j, $$
where $ o_i^j $ is the number of occurrences of a visual word $ c_i $ within an image $ j $ . TF is thus a local weight. IDF, on the other hand, determines the contribution of a given visual word through global statistics. The classic IDF weight of visual word $ c_i $ is calculated as:
,其中$ o_i^j $是图像$ j $内的视觉字$ c_i $的出现次数。因此,TF 是局部重量。另一方面,IDF 通过全局统计来确定给定的视觉单词的贡献。计算出视觉字$ c_i $的经典 IDF 重量为:

$$ {IDF}(c_i) = \log{\frac{N}{n_i}}, {where} n_i = \sum_{j\in \mathcal{G}}{1}(o_i^j > 0), $$
where $ N $ is the number of gallery images, and $ n_i $ encodes the number of images in which word $ c_i $ appears. The TF-IDF weight for visual word $ c_i $ in image $ j $ is:
,其中$ N $是图库图像的数量,$ n_i $编码一个显示字$ c_i $的图像的数量。图像$ j $中的可视字$ c_i $的 tf-idf 重量是:

$$ w(c_i^j) = {TF}(c_i^j){IDF}(c_i). $$

A major problem associated with visual word weighting is burstiness . It refers to the phenomenon whereby repetitive structures appear in an image. This problem tends to dominate image similarity. Jgou et al. propose several TF variants to deal with burstiness. An effective strategy consists in exerting a square operation on TF. Instead of grouping features with the same word index, Revaud et al. propose detecting keypoint groups frequently happening in irrelevant images which are down-weighted in the scoring function. While the above two methods detect bursty groups after quantization, Shi et al propose detecting them in the descriptor stage. The detected bursty descriptors undergo average pooling and are fed in the BoW architectures. From the aspect of IDF, Zheng et al. propose the $ \mathcal{L}_p $ -norm IDF to tackle burstiness and Murata et al. design the exponential IDF which is later incorporated into the BM25 formula. When most works try to suppress burstiness, Torii et al view it as a distinguishing feature for architectures and design new similarity measurement following burstiness detection. Another feature weighting strategy is feature augmentation on the database side , . Both methods construct an image graph offline, with edges indicating whether two images share a same object. For , only features that pass the geometric verification are preserved, which reduces the memory cost. Then, the feature of the base image is augmented with all the visual words of its connecting images. This method is improved in by only adding those visual words which are estimated to be visible in the augmented image, so that noisy visual words can be excluded.

改进方案与视觉单词加权相关的一个主要问题是突变性。它指的是图像中出现重复结构的现象。这个问题往往在图像相似度中占主要位置。Jégou 等人提出了几个 TF 的变种来解决突变问题。一个有效的策略是在 TF 上进行平方运算。Revaud 等人提出了检测在不相关图像中频繁出现的关键点组来降低评分函数的计算值,而不是用相同的单词索引来分组特征。尽管上述两种方法都提出在量化后检测突变组,Shi 等人提出在描述符阶段检测它们。检测到的突变描述符经过平均池化并且送往 BoW 结构中。在改进 IDF 方面,Zheng 等人提出了$\mathcal{L}_p $ -NORM IDFIDF 方法来处理突变情况,同时 Murata 等人设计了后来被并入到 BM25 公式的指数 IDF。在大多数方案都以抑制突变性为目的时,Torii 等人将这个问题视为体系结构的一个显着特征,并在突发性检测后设计新的相似性度量方法。

另一个特征加权策略是数据库端的特征增强,《Better matching with fewer features: The selection of useful features in large database recognition problems》和《Three things everyone should know to improve object retrieval》在这方面进行了研究。两篇文献中的方法都离线构建图像的图结构,通过边缘指示两个图像是否共享同一对象。对第一种方案来说,只有通过几何验证的特征才会被被保留,这降低了存储成本。然后,利用其连接图像的所有视觉字来增强基础图像的特征。第二种方案进一步进行对其进行改进,通过只添加那些被认为在增强图像中可见的视觉词汇,从而干扰性的视觉词被排除。

The Inverted Index 倒排

The inverted index is designed to enable efficient storage and retrieval and is usually used under large/medium-sized codebooks. Its structure is illustrated in Fig. inv_index. The inverted index is a one-dimensional structure where each entry corresponds to a visual word in the codebook. An inverted list is attached to each word entry, and those indexed in the each inverted list are called indexed features or postings. The inverted index takes advantages of the sparse nature of the visual word histogram under a large codebook.
倒排是一种提高存储和检索效率的算法,它常被用于大/中等规模的编码本中,结构如图 4 所示。倒排是一种单一尺寸的结构,其中每一个条目对应编码本中低的一个视觉词汇。每一个视觉词汇都包含一个倒排表,每个倒排表中的索引被称为索引特征或者记录。倒排索引很好地发挥了大规模编码本词汇直方图稀疏性的特点。

image

In literature, it is required that new retrieval methods be adjustable to the inverted index. In the baseline , the image ID and term frequency (TF) are stored in a posting. When other information is integrated, they should be small in size. For example, in , the metadata are quantized, such as descriptor contextual weight, descriptor density, mean relative log scale and the mean orientation difference in each posting. Similarly, quantized spatial information such as the orientation can also be stored , . In co-indexing , when the inverted index is enlarged with globally consistent neighbors, semantically isolated images are deleted to reduce memory consumption. In , the original one-dimensional inverted index is expanded to two-dimensional for ANN search, which learns a codebook for each SIFT sub-vector. Later, it is applied to instance retrieval by to fuse local color and SIFT descriptors.

。新的文献提出新的检索方法来适应倒排算法。在基准方案中,图像 ID 和 TF 值都被存储在一条记录中。但其他的信息被整合进来时,它们的尺寸应该足够小。例如,在《Contextual weighting for vocabulary tree based image retrieval》中,原始数据在一条记录中被描述符上下文,描述符密度,平均关联日志规模和平均方向差异等属性量化。相似地,方向等空间信息也会被量化。在联合检索的方法中,当倒排索引随着全局一直近邻增长是,单独分割的图片将会被删除以减少内存消耗。在《The inverted multi-index》中提出,原始的单一尺寸倒排索引可以拓展为二维结构来进行替代了 SITF 特征向量的 ANN 检索。后来,这种方法被《Packing and padding: Coupled multi-index for accurate image retrieval》改进,融合局部颜色和 SIFT 描述符进行实例检索。

Retrieval Using Medium-sized Codebooks 使用中等规模编码本进行检索

Medium-sized codebooks refer to those having 10-200k visual words. The visual words exhibit medium discriminative ability, and the inverted index is usually constructed.
中等规模编码本一般含有 10——200k 个视觉词汇。视觉词汇展现了中等区分能力,同时检索时也使用了倒排索引。

Codebook Generation and Quantization 编码本的生成与量化

Considering the relatively small computational cost compared with large codebooks (Section codebook_large), flat k-means can be adopted for codebook generation , . It is also shown in that using AKM for clustering also yields very competitive retrieval accuracy. For quantization, nearest neighbor search can be used to find the nearest visual words in the codebook. Practice may tell that using some strict ANN algorithms produces competitive retrieval results. So comparing with the extensive study on quantization under large codebooks (Section quantization_large) , relatively fewer works focus on the quantization problem under a medium-sized codebook.
考虑到中等规模编码本和大规模编码本相比,计算成本较低,扁平 k-means 可以在中等规模编码本的生成中使用。同样也有文献指出使用 AKM 算法可以在聚类中取得很好的效果。

在量化过程中,最近邻检索用来搜索最近的视觉词汇。实践表明使用高精度的 ANN 算法可以得到更好的检索效果。和大规模编码本下量化算法(Quantization_Large)的研究热度比起来,中等规模编码本的研究明显低了很多。

Hamming Embedding and its improvements 汉明嵌入算法及其改进

The discriminative ability of visual words in medium-sized codebooks lies in between that of small and large codebooks. So it is important to compensate the information loss during quantization. To this end, a milestone work, i.e., Hamming embedding (HE) has been dominantly employed. Proposed by Jgou et al. , HE greatly improves the discriminative ability of visual words under medium-sized codebooks. HE first maps a SIFT descriptor $ f\in\mathbb{R}^p $ from the $ p $ -dimensional space to a $ p_b $ -dimensional space:
在中等规模编码本下视觉词汇的区分度介于小规模编码本和大规模编码本之间。因此,对量化过程中带来的信息损失需要进行补偿。最终,汉明嵌入(HE)这个里程碑式的工作成为实践中的主流算法。

HE 算法由 Jégou 等人在论文《Hamming embedding and weak geometric consistency for large scale image search》中提出,它提升了中等规模编码本下视觉词汇的区分能力,也是首次将一个 SIFT 描述符$ f\in\mathbb{R}^p $从$ p $ -dimensional 空间映射到$ p_b $ -dimension 空间:

$$ x = P\cdot f = (x_1,...x_{p_b}), $$

where $ P\in\mathbb{R}^p_b\times p $ is a projecting matrix, and $ x $ is a low-dimensional vector. By creating a matrix of random Gaussian values and applying a QR factorization to it, matrix $ P $ is taken as the first $ p_b $ rows of the resulting orthogonal matrix. To binarize $ x $ , Jegou et al. propose to compute the median vector $ \overline{x_i} = (\overline{x_{1,i}},...,\overline{x_{p_b, i}}) $ of the low-dimensional vector using descriptors falling in each Voronoi cell $ c_i $ . Given descriptor $ f $ and its projected vector $ x $ , HE computes its visual word $ c_t $ , and the HE binary vector is computed as:
其中$ P\in\mathbb{R}^p_b\times p $是投影矩阵,而$ x $是低维向量。通过创建随机高斯值的矩阵并将 QR 分解应用于它,矩阵$ P $作为所得到的正交矩阵的第一$ p_b $行。对于二进制化$ x $,,Jegou 等人提出在每个 Voronoi 单元$ c_i $中使用描述符下降法来计算低维向量的中值向量$\overline{x_i},...,\overline{x_{p_b, i}},...,\overline{x_{p_b, i}}) $。给定描述符$ f $及其投影矢量$ x $,,HE 计算它的视觉词汇$ c_t $,,HE 二值向量计算公式如下:

$$ b_j(x) =
1 , if x_j > \overline{x_{j, t}},
0,otherwise $$

where $ b(x) = (b_1(x),...,b_{p_b}(x)) $ is the resulting HE vector of dimension $ p_b $ . The binary feature $ b(x) $ serves as a secondary check for feature matching. A pair of local features are a true match when two criteria are satisfied: 1) identical visual words and 2) small Hamming distance between their HE signatures. The extension of HE estimates the matching strength between feature $ f_1 $ and $ f_2 $ reversely to the Hamming distance by an exponential function:

,其中$ b(x) = (b_1(x),...,b_{p_b}(x)) $是由此产生的尺寸$ p_b $的载体。二进制功能$ b(x) $作为特征匹配的第二重校验。当满足以下两个标准时,一对局部特征可以认为是匹配的:1) 它们是同一个视觉词汇;2) 它们的 HE 哈希值距离很小。HE 的扩展方法通过指数函数估计特征。他估计特征$ f_1 $和$ f_2 $与 Hamming 距离的匹配强度::

$$ w_{HE}(f_1, f_2) = \exp(-\frac{\mathcal{H}(b(x_1), b(x_2))}{2\gamma^2}), $$

where $ b(x_1) $ and $ b(x_2) $ are the HE binary vector of $ f_1 $ and $ f_2 $ , respectively, $ \mathcal{H}(\cdot,\cdot) $ computes the Hamming distance between two binary vectors, and $ \gamma $ is a weighting parameter. As shown in Fig. performance_improvement, HE and its weighted version improves accuracy considerably in 2008 and 2010.

,其中$ b(x_1) $和$ b(x_2) $是$ f_1 $和$ f_2 $的二进制向量,分别,$\mathcal{H}(\cdot,\cdot) $计算两个二值向量的汉明距离,$\gamma $是权重参数。如图 6 所示,HE 及其加权版本在 2008 和 2010 年准确率大大提高。

Applications of HE include video copy detection , image classification and re-ranking . For example, in image classification, patch matching similarity is efficiently estimated by HE which is integrated into linear kernel-based SVM . In image re-ranking, Tolias et al. use lower HE thresholds to find strict correspondences which resemble those found by RANSAC, and the resulting image subset is more likely to contain true positives for query reformulation. The improvement over HE has been observed in a number of works, especially from the view of match kernel . To reduce the information loss on the query side, Jain et al. propose a vector-to-binary distance comparison. It exploits the vector-to-hyperplane distance while retaining the efficiency of the inverted index. Further, Qin et al. design a higher-order match kernel within a probabilistic framework and adaptively normalize the local feature distances by the distance distribution of false matches. This method is in the spirit similar to , in which the word-word distance, instead of the feature-feature distance , is normalized, according to the neighborhood distribution of each visual word. While the average distance between a word to its neighbors is regularized to be almost constant in , the idea of democratizing the contribution of individual embeddings has later been employed in . In , Tolias et al. show that VLAD and HE share similar natures and propose a new match kernel which trades off between local feature aggregation and feature-to-feature matching, using a similar matching function to . They also demonstrate that using more bits (e.g., 128) in HE is superior to the original 64 bits scheme at the cost of decreased efficiency. Even more bits (256) are used in , but this method may be prone to relatively low recall.
HE 应用于视频拷贝检测、图像分类和重排序等场合。例如,在图像分类中,例如,在图像分类中,将 HE 集成到基于线性核的 SVM 中,有效地提高了了图像块匹配相似度的速度。在图像重排序任务中,Tolias 等人使用更低的 HE 阈值来找到类似于 RANSAC 得到的(图像局部特征)严格对应关系,并且得到的图像子集更可能包含真正应查得的图像。

有很多工作都对 HE 提升,特别是从匹配核的角度对 HE 进行改进。为了减少查询上的信息损失,Jain 等人提出一种矢量二值距离比较法。它利用向量到超平面距离,同时保持倒排索引的效率。更进一步地,Qin 等人在概率框架内设计一个高阶匹配核函数,并通过假匹配的距离分布自适应地标准化局部特征距离。该方法的思想类似于《Accurate image search using the contextual dissimilarity measure》,其中,根据每个视觉词汇的邻域分布,将字-字距离而不是特征-特征距离归一化。虽然在《Accurate image search using the contextual dissimilarity measure》中,一个词与它的邻居之间的平均距离被规范为几乎为恒定值,但后来在《Triangulation embedding and democratic aggregation for image search》中采用了将单个嵌入向量的贡献民主化的想法。在《To aggregate or not to aggregate: Selective match kernels for image search》中,Tolias 等人表明 VLAD 和 HE 向量具有相似的性质,并提出了一种新的匹配核函数,它在局部特征聚合和特征到特征匹配之间进行折衷,使用和《Query adaptive similarity for large scale object retrieval》相似的匹配函数。他们还证明了在 HE 中使用更多比特位(例如 128bit)优于原始 64 比特方案,代价是效率的降低。在《Scalar quantization for large scale image search》中使用了更多的比特位(256),但是这种方法可能使得结果的召回率相对较低。

image
Fig. 5: False match removal by (A) HE [13], (B) local-local feature fusion, and (C) local-global feature fusion.
图 5:通过(A)HE [ 13 ],(B)局部-局部特征融合和(C)局部-全局特征融合消除错误匹配。

Other Important Issues 其他重要问题

Feature FusionLocal-local fusion. 功能融合

Local-local fusion. A problem with the SIFT feature is that only local gradient description is provided. Other discriminative information encoded in an image is still not leveraged. In Fig. 5 (B), a pair of false matches cannot be rejected by HE due to their similarity in the SIFT space, but the fusion of other local (or regional) features may correct this problem. A good choice for local-local fusion is to couple SIFT with color descriptors. The usage of color-SIFT descriptors can partially address the trade-off between invariance and discriminative ability. Evaluation has been conducted on several recognition benchmarks [93] of the descriptors such as HSV-SIFT [94], HueSIFT [95] and OpponentSIFT [93]. Both HSV-SIFT and HueSIFT are scale-invariant and shift-invariant. OpponentSIFT describes all the channels in the opponent color space using the SIFT descriptor and is largely robust to the light color changes. In [93], OpponentSIFT is recommended when no prior knowledge about the datasets is available. In more recent works, the binary color signatures are stored in the inverted index [96], [31]. Despite the good retrieval accuracy on some datasets, the potential problem is that intensive variation in illumination may compromise the effectiveness of colors.
局部-局部特征融合。 SIFT 特征的一个问题就是它只提供了局部梯度描述,在图像中编码的其他判别信息仍然没有被利用。在图 5(B)中,由于一对错误匹配在 SIFT 空间中的相似性,因此这对匹配没有被 HE 编码拒绝,但是其他局部(或区域)特征的融合可以纠正这个问题。
将 SIFT 与颜色描述符耦合是局部-局部特征融合的一个好选择。颜色-SIFT 描述符融合特征的使用可以部分地解决不变性和辨别能力之间的权衡问题。在几个基准识别测试集上已经对几个诸如 HSV-SIFT,HueSIFT 和 OpponentSIFT 几个融合特征进行了评估。HSV-SIFT 和 HueSIFT 特征都属于尺度,平移不变性特征。OpponentSIFT 使用 SIFT 描述符描述对立的颜色空间中的所有通道,并且对光照颜色变化大的图像具有很强的鲁棒性。在《Evaluating color descriptors for object and scene recognition》中认为 OpponentSIFT 是当有关数据没有先验知识时的优先选择。在最近的工作中,二进制颜色签名都存储在倒排索引中。尽管现有的图像检索方法在一些数据集上取得了很好的检索精度,但一个潜在不容忽视:照明的密集变化可能会有损颜色特征检索的有效性。

Local-global fusion. Local and global features describe images from different aspects and can be complementary. In Fig. 5 (C), when local (and regional) cues are not enough to reject a false match pair, it would be effective to further incorporate visual information from a larger context scale. Early and late fusion are two possible ways. In early fusion, the image neighborhood relationship mined by global features such as FC8 in AlexNet [6] is fused in the SIFT-based inverted index [72]. In late fusion, Zhang el al. [97] build an offline graph for each type of feature, which is subsequently fused during the online query. In an improvement of [97], Deng et al. [98] add weakly supervised anchors to aid graph fusion. Both works on the rank level. For score-level fusion, automatically learned category-specific attributes are combined with pre-trained category-level information [99]. Zheng et al. [86] propose the query-adaptive late fusion by extracting a number of features (local or global, good or bad) and weighting them in a query-adaptive manner.
局部-全局特征融合。 局部特征和全局特征从不同的角度来描述图像并互为补充。在图 5(C)中,但局部(以及区域)信息不足以判断出一个错误的匹配对时,进一步整合更广的上下文尺度视觉信息是有效的。前期和后期融合是两种可能的方式。在前期融合中,图像邻域关系由如 AlexNet 中的 FC8 这样的全局特征挖掘出,并融合在基于 SIFT 的倒排索引中。在后期融合中,Zhang 等人为每种类型的特征创建一个离线图,随后在在线查询期间进行特征融合。在对《Query specific fusion for image retrieval》中的方法进行改进时,Deng 等人在《Visual reranking through weakly supervised multi-graph learning》提出增加弱监督锚(weakly supervised anchors)来协助图融合。两个工作都是在排序方面进行研究。对于分数级别的融合,将自动学习的类别的特定属性与预训练的类级别信息相结合。Zheng 等人在《Query-adaptive late fusion for image search and person re-identification》中提出通过提取一些特征(局部或全局特征,好的或坏的特征)进行后期特征融合的自适应查询,并且以自适应查询的方式赋给特征相应的权重。

Geometric Matching 几何学上的匹配问题

A frequent concern with the BoW model is the lack of geometric constraints among local features. Geometric verification can be used as a critical pre-processing step various scenarios, such as query expansion [100, 101], feature selection [81], database-side feature augmentation [81], [33], large-scale object mining [102], etc. The most well-known method for global spatial verification is RANSAC [12]. It calculates affine transformations for each correspondence repeatedly which are verified by the number of inliers that fit the transformation. RANSAC is effective in re-ranking a subset of top-ranked images but has efficiency problems. As a result, how to efficiently and accurately incorporate spatial cues in the SIFT-based framework has been extensively studied.

A good choice is to discover the spatial context among local features. For example, visual phrases [103, 104, 105, 106] are generated among individual visual words to provide more strict matching criterion. Visual word co-occurrences in the entire image are estimated [107] and aggregated [108], while in [109, 110], [29] visual word clusters within local neighborhoods are discovered. Visual phrases can also be constructed from adjacent image patches [103], random spatial partitioning [106], and localized stable regions [29] such as MSER [28].

Another strategy uses voting to check geometric consistency. In the voting space, a bin with a larger value is more likely to represent the true transformation. An important work is weak geometrical consistency (WGC) [13], which focuses on the difference in scale and orientation between matched features. The space of difference is quantized into bins. Hough voting is used to locate the subset of correspondences similar in scale or orientation differences. Many later works can be viewed as extensions of WGC. For example, the method of Zhang et al. [21] can be viewed as WGC using x, y offsets instead of scale and orientation. This method is invariant to object translations, but may be sensitive to scale and rotation changes due to the rigid coordinate quantization. To regain the scale and the rotation variance, Shen et al. [111] quantize the angle and scale of the query region after applying several transformations. A drawback of [111] is that query time and memory cost are both increased. To enable efficient voting and alleviate quantization artifacts, Hough pyramid matching (HPM) [112] distributes the matches over a hierarchical partition of the transformation space. HPM trades off between flexibility and accuracy and is very efficient. Quantization artifact can also be reduced by allowing a single correspondence to vote for multiple bins [113]. HPM and [113] are much faster than RANSAC and can be viewed as extensions in the rotation and the scale invariance to the weak geometry consistency proposed along with Hamming Embedding [13]. In [114], a rough global estimate of orientation and scale changes is made by voting, which is used to verify the transformation obtained by the matched features. A recent method [115] combines the advantage of hypothesis-based methods such as RANSAC [12] and voting-based methods [112, 21, 113, 114]. Possible hypothesises are identified by voting and later verified and refined. This method inherits efficiency from voting and supports query expansion since it outputs an explicit transformation and a set of inliers.
BoW 模型的一个常见问题是缺乏局部特征间的几何约束。几何验证可以用作各种场景的关键预处理步骤,例如拓展查询,特征选择,数据库端的特征增强,大规模咯物体挖掘等。著名的全局空间验证方法是 RANSAC。RANSAC 它重复计算每个对应的仿射变换,并通过适合变换的内点数来验证。RANSAC 算法有效地重新排列排名最高的图像的子集,但存在效率问题。最终,如何在 SIFT 为基础的框架下有效、准确地结合空间信息被广泛地研究。

一个好的方法是研究局部特征间的空间上下文。例如,视觉短语在独立的视觉词汇中产生以提供更加精准的匹配规范。估计和聚合整个图像中的视觉词汇共现是一种研究思路,同时也有研究员研究视觉词汇在局部邻域中的聚类。视觉短语也可以通过临近图像块,随机空间分割和局部稳定区域(如 MSER)的方式来组成。

另一种策略使用投票机制来检查几何一致性。在投票空间中,具有较大值的容器更可能代表真正的转换。其中一项重要的工作就是弱几何一致性(WGC),这种方法关注匹配特征在尺度和方向上的差异,不同空间则被量化到容器中。Hough 投票方法被用来定位在规模或方向上相似或相异的子集。许多后来的研究工作可以看作是 WGC 的扩展。例如,Zhang 等人的工作《Image retrieval with geometrypreserving visual phrases》可以被视为使用 x,y 偏移量而不是比例和方向的 WGC 方法。该方法具有目标平移不变性,但由于采用了刚体坐标量化,因此对尺度和旋转变化敏感。为了重新获得目标的尺度和旋转的不变性,Shen 等人在《Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking》中提出在应用多个变换后,量化查询区域的角度和尺度。Shen 等人的这个方法的一个缺点就是,查询时间和存储效率的降低。实现高效的投票方法并减轻量化损失,论文《Hough pyramid matching: Speeded-up geometry re-ranking for large scale image retrieval》提出了霍夫金字塔匹配(HPM)方法,通过分层划分变换空间来分配匹配结果。HPM 在灵活性和准确性之间取得了平衡,非常高效。还可以通过允许单个通信对多个容器进行投票来减少量化损失。HPM 和这种方法都在速度上快于 RANSAC 算法,同时也可以被看作是对和 HE 一起提出的 WGC 在旋转和尺度不变性上的拓展。在《Pairwise geometric matching for large-scale object retrieval》中提出了一种基于投票的全局方向和尺度的粗略估计方法,以此来检验通过匹配特征得到的变形参数。《A vote-and-verify strategy for fast spatial verification in image retrieval》结合了基于假设的方法(如 RANSAC)和基于投票的方法的优点,通过投票和后续的验证、精确微调来确定可能的假设。该方法保有了投票方法的效率,同时因为它的输出是显式变换和一组内值,因此还支持了查询扩展。

Query Expansion 拓展查询

As a post-processing step, query expansion (QE) significantly improves the retrieval accuracy. In a nutshell, a number of top-ranked images from the original rank list are employed to issue a new query which is in turn used to obtain a new rank list. QE allows additional discriminative features to be added to the original query, thus improving recall.

In instance retrieval, Chum et al. [100] are the first to exploit this idea. They propose the average query expansion (AQE) which averages features of the top-ranked images to issue the new query. Usually, spatial verification [12] is employed for re-ranking and obtaining the ROIs from which the local features undergo average pooling. AQE is used by many later works [17, 10, 24] as a standard tool. The recursive AQE and the scale-band recursive QE are effective improvement but incur more computational cost [100]. Four years later, Chum et al. [101] improve QE from the perspectives of learning background confusers, expanding the query region and incremental spatial verification. In [33], a linear SVM is trained online using the top-ranked and bottom-ranked images as positive and negative training samples, respectively. The learned weight vector is used to compute the average query. Other important extensions include “hello neighbor” based on reciprocal neighbors [116], QE with rank-based weighting [111], Hamming QE [89] (see Section 3.5), etc.
作为后处理步骤,拓展查询(QE)对提高检索的准确度很有帮助。简单地说,QE 就是采用来自原始排名列表的多个排在前列的图像来发布新的查询,新的查询用于获得新的排名列表。QE 可以增加额外的有区分度的特征到原始查询中,因此提高了召回率。

在实例检索任务中,Chum 等人是第一个提出研究这项工作的。他们提出了平均拓展查询(AQE)方法,用排名靠前的图像的平均特征来发出新的查询。通常,空间验证用于重排序以及局部特征通过平均池化获得感兴趣区域。AQE 被后来许多工作作为标准工具来使用。递归 AQE 和尺度-带递归 QE 是对 AQE 有效的改进,但它们的计算成本更大。四年后,Chum 等从学习背景的混淆、扩展查询区域和增加空间验证的角度来改进 QE。在《Three things everyone should know to improve object retrieval》中分别使用最靠前和最靠后图片作为训练正负样本。在线训练了一个线性支持向量机。学习到的权重向量用于计算平均查询。其他重要的 QE 算法的扩展包括基于互惠邻居思想的“hello neighbor”算法,基于排序的权重 QE 算法,汉明 QE 算法等。

Small Object Retrieval 小目标检索

Retrieving objects that cover a small portion of images is a challenging task due to 1) the few detected local features and 2) the large amount of background noise. The Instance Search task in the TRECVID campaign [117] and the task of logo retrieval are important venues/applications for this task.

Generally speaking, both TRECVID and logo retrieval can be tackled with similar pipelines. For keypoint-based methods, the spatial context among the local features is important to discriminative target objects from others, especially in cases of rigid objects. Examples include [118, 119, 120]. Other effective methods include burstiness handling [77] (discussed in Section 3.4.3), considering the different inlier ratios between the query and target objects [121], etc. In the second type of methods, effective region proposals [122] or multi-scale image patches [123] can be used as object region candidates. In [123], a recent state-of-the-art method, a regional diffusion mechanism based on neighborhood graphs is proposed to further improve the recall of small objects.

检索图像中的一小部分对象是一项具有挑战性的任务由于 1) 检测到的局部特征数量少,2) 背景噪声过大。TRECVID 活动中的实例检索任务和 logo 检索任务都是小目标检索任务中的重要竞赛/应用。

一般来数,TRECVID 任务和 logo 检索都可以用相似的流程来处理。对于基于关键点的检索方法,局部特征之间的空间上下文对区分目标是至关重要的,特别是对要求苛刻的小目标检索任务来说。其他有效的方法包括突发性处理,考虑查询对象和目标对象之间不同的比率。在第二种方法中,有效的可能区域或多尺度图像块可用作候选对象区域。在《Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations》中,提出了一种基于邻域图的区域扩散机制,以进一步提高小对象的查全率,达到了当时最高水平。

CNN-based Image Retrieval 基于 CNN 的图像检索系统

CNN-based retrieval methods have constantly been proposed in recent years and are gradually replacing the hand-crafted local detectors and descriptors. In this survey, CNN-based methods are classified into three categories: using pre-trained CNN models, using fine-tuned CNN models and hybrid methods. The first two categories compute the global feature with a single network pass, and the hybrid methods may require multiple network passes (see Fig. 2).

基于 CNN 的图像检索方法近年来不断被提出,并且在逐渐取代基于手工检测器和描述符的方法。在这篇综述中,基于 CNN 的方法被分为三类:使用预训练的 CNN 模型,使用微调的 CNN 模型以及使用混合模型。前两类方法使用单向传递网络来提取全局特征,混合模型方法可能需要多个网络传递。如图 2 所示

Retrieval Using Pre-trained CNN Models 使用预训练 CNN 模型的图像检索系统

This type of methods is efficient in feature computation due to the single-pass mode. Given the transfer nature, its success lies in the feature extraction and encoding steps. We will first describe some commonly used datasets and networks for pre-training, and then the feature computation process.

由于预训练 CNN 模型是单通模式,因此这种方法在特征计算中非常高效。考虑到传输特性,它的成功在于特征提取和编码步骤。我们将首先描述一些常用的数据集和网络进行预训练,然后进行特征计算。

Pre-trained CNN Models 预训练的 CNN 模型

Popular CNN architectures. Several CNN models serve as good choices for extracting features, including AlexNet , VGGNet , GoogleNet and ResNet , which are listed in Table CNN-models. Briefly, CNN can be viewed as a set of non-linear functions and is composed of a number of layers such as convolution, pooling, non-linearities, etc. CNN has a hierarchical structure. From bottom to top layers, the image undergoes convolution with filters, and the receptive field of these image filters increases. Filters in the same layer have the same size but different parameters. AlexNet was proposed the earliest among these networks, which has five convolutional layers and three fully connected (FC) layers. It has 96 filters in the first layer of sizes $ 11\times11\times3 $ and has 256 filters of size $ 3\times 3\times 192 $ in the 5th layer. Zeiler et al. observe that the filters are sensitive to certain visual patterns and that these patterns evolve from low-level bars in bottom layers to high-level objects in top layers. For low-level and simple visual stimulus, the CNN filters act as the detectors in the local hand-crafted features, but for the high-level and complex stimulus, the CNN filters have distinct characteristics that depart from SIFT-like detectors. AlexNet has been shown to be outperformed by newer ones such as VGGNet, which has the largest number of parameters. ResNet and GoogleNet won the ILSVRC 2014 and 2015 challenges, respectively, showing that CNNs are more effective with more layers. A full review of these networks is beyond the scope of this paper, and we refer readers to , for details.

流行的 CNN 网络结构 AlexNet,VGGNet,GoogleNet 以及 ResNet 这几个 CNN 网络适用于特征提取,详见表 2.简单来说,CNN 网络可以视为一系列非线性函数的集合,它由如卷积,池化,非线性等多个层组成。CNN 是一个分层次的结构。自网络的底层到顶层,图像经过滤波器的卷积,同时这些图像滤波器的感受野随增长而增加。同一层的滤波器尺寸相同但是参数不同。AlxNet 是这些网络中最早被提出的的,它有五个卷积层和三个全连接(FC)层。它的第一层大小 96 个 11×11×3 的滤波器,在第五层中有 256 个大小为 3×3×192 的滤波器。Zeiler 等人观察到滤波器对某些视觉模式十分敏感,这些模式从底层的低级的图像纹理演变到顶层的高级的图像目标。对于低层次和简单的视觉刺激,CNN 滤波器类似局部手工制作的特征中的检测器,但是对于高层次和复杂的刺激,CNN 滤波器具有不同于 SIFT 类检测器的特质。AlxNET 已被证明被新的的如具有最大数量参数的 VGGNet 超越。ResNet 和 GoogleNet 分别赢得了 ILSVRC 2014 和 2015 的挑战,表明 CNN 网络的效果和网络层数成正比。如果要调研全部这些网络超出了本文的范围,我们建议读者参阅《Imagenet classification with deep convolutional neural networks》,《Return of the devil in the details: Delving deep into convolutional nets》和《Very deep convolutional networks for large-scale image recognition》中的细节。

Datasets for pre-training. Several large-scale recognition datasets are used for CNN pre-training. Among them, the ImageNet dataset [23] is mostly commonly used. It contains 1.2 million images of 1000 semantic classes and is usually thought of as being generic. Another data source for pre-training is the Places-205 dataset [129] which is twice as large as ImageNet but has five times fewer classes. It is a scene-centric dataset depicting various indoor and outdoor scenes. A hybrid dataset combining the Places-205 and the ImageNet datasets has also been used for pre-training [129]. The resulting HybridNet is evaluated in [130, 131, 125, 126] for instance retrieval.

用于预训练模型的数据集。 一些大规模的识别数据集被用于 CNN 网络的预训练。在其中,ImageNet 数据集常被研究员拿来使用。它包含 1000 个语义类的 120 万个图像,并且通常被认为是具有普适性的。用于预训练模型的另一个数据集是 PASES-205,它的数据规模是 ImageNet 的两倍但图像种类却要比 ImageNet 少五倍。它是一个以场景为主的数据集,描绘了各种室内场景和室外场景。在《Learning deep features for scene recognition using places database》中,混合了 ImageNet 和 PASES-205 的数据集也同样会被拿来用于模型的预训练。HybridNet 在《Going deeper with convolutions》,《Deep residual learning for image recognition》,《Factors of transferability for a generic convnet representation》和《A practical guide to cnns and fisher vectors for image instance retrieval》中被用于实例检索任务的评估。

models size # layers training Set used in
OverFeat [132] 144M 6+3 ImageNet [7]
AlexNet [6] 60M 5+3 ImageNet [22, 133]
PlacesNet [129] Places [130, 131]
HybridNet [129] ImageNet+Places [130, 131]
VGGNet [124] 138M 13+3 ImageNet [10]
GoogleNet [125] 11M 22 ImageNet [9]
ResNet [126] 44.6M 101 ImageNet n.a

TABLE II: Pre-trained CNN models that can be used.

The transfer issue. Comprehensive evaluation of various CNNs on instance retrieval has been conducted in several recent works [130, 131, 134]. The transfer effect is mostly concerned. It is considered in [130] that instance retrieval, as a target task, lies farthest from the source, i.e., ImageNet. Studies reveal some critical insights in the transfer process. First, during model transfer, features extracted from different layers exhibit different retrieval performance. Experiments confirm that the top layers may exhibit lower generalization ability than the layer before it. For example, for AlexNet pre-trained on ImageNet, it is shown that FC6, FC7, and FC8 are in descending order regarding retrieval accuracy [130]. It is also shown in [134, 10] that the pool5 feature of AlexNet and VGGNet is even superior to FC6 when proper encoding techniques are employed. Second, the source training set is relevant to retrieval accuracy on different datasets. For example, Azizpour et al. [130] report that HybridNet yields the best performance on Holidays after PCA. They also observe that AlexNet pre-trained on ImageNet is superior to PlacesNet and HybridNet on the Ukbench dataset [11] which contains common objects instead of architectures or scenes. So the similarity of the source and target plays a critical role in instance retrieval when using a pre-trained CNN model.

迁移问题。 最近的一些工作综合评估了各种 CNN 网络在实例检索任务中的表现,模型迁移是大家都比较关心的一个问题。在《Factors of transferability for a generic convnet representation》中将实例检索任务认为是距离原始数据集最远的(计算机视觉)目标。首先,在模型迁移过程中,从不同层提取的特征表现出不同的检索性能。实验表明高层网络的泛化能力要低于较低层的网络。例如,在 ImageNet 上预训练的网络 AlexNet 表明,FC6、FC7 和 FC8 在检索精度上呈递减顺序。《Particular object retrieval with integral max-pooling of cnn activations》和《Good practice in cnn feature transfer》也指出,当使用适当的编码技术时,AlexNet 和 VGGNet 的 pool5 层特征甚至优于 FC6 层特征。其次,当原始的训练集不同时,模型的准确率也会受到影响。例如,Azizpour 等人指出 HybridNet 在 Holidays 数据集上展现出的性能要劣于 PCA。他们同样发现在 ImageNet 上预训练的 AlexNet 模型在包含常见物体而非建筑场景图像的 Ukbench 数据集上的表现要好于 PlacesNet 和 HybridNet(译者注:AlexNet,PlacesNet 和 HybridNet 预训练模型使用的训练集不同)。因此,当使用预训练的 CNN 模型时,源和目标的相似度在实例检索中起着至关重要的作用。

Feature Extraction 特征提取

The most straightforward idea is to extract the descriptor from the fully-connected (FC) layer of the network , e.g., the 4,096-dim FC6 or FC7 descriptor in AlexNet. The FC descriptor is generated after layers of convolutions with the input image, has a global receptive field, and thus can be viewed as a global feature. It yields fair retrieval accuracy under Euclidean distance and can be improved with power normalization . Many recent retrieval methods focus on local descriptors in the intermediate layers. In these methods, lower-level convolutional filters (kernels) are used to detect local visual patterns. Viewed as local detectors, these filters have a smaller receptive field and are densely applied on the entire image. Compared with the global FC feature, local detectors are more robust to image transformations such as truncation and occlusion, in ways that are similar to the local invariant detectors (Section local_feature_extraction).

FC 描述符。 最直接的想法就是网络的全连接层(FC layer)提取描述符,在 AlexNet 中就是 FC6 或 FC7 中的描述符。FC 描述符是在与输入图像卷积的层之后生成的,具有全局表示性,因此可以被视为全局特征。它在欧几里德距离下产生较好的检索精度,并且可以使用指数归一化来提高检索精度。

中间局部特征。 许多最新的检索方法专注于研究中间层的描述符。在这种方法中,低层网络的卷积核用于检测局部视觉模式。作为局部检测器,这些滤波器具有较小的感受野并密集地应用于整张图像。与全局 FC 特征相比,局部检测器对于诸如截断和遮挡的图像变换更鲁棒,其方式类似于局部不变量检测器(Local_Feature_extraction)。

Local descriptors are tightly coupled with these intermediate local detectors, i.e., they are the responses of the input image to these convolution operations. In other words, after the convolutions, the resulting activation maps can be viewed as a feature ensemble, which is called the column feature in this survey. For example in AlexNet , there are $ n=96 $ detectors (convolutional filters) in the 1st convolutional layer. These filters produces $ n=96 $ heat maps of size $ 27\times27 $ (after max pooling). Each pixel in the maps has a receptive field of $ 19\times19 $ and records the response of the image w.r.t the corresponding filter . The column feature is therefore of size $ 1\times1\times96 $ (Fig. pipeline) and can be viewed as a description of a certain patch in the original image. Each dimension of this descriptor denotes the level of activation of the corresponding detector and resembles the SIFT descriptor to some extent. The column feature initially appears in , where Razavian et al. first do max-pooling over regularly partitioned windows on the feature maps and then concatenate them across all filter responses, yielding column-like features. In , column features from multiple layers of the networks are concatenated, forming the hypercolumn feature.

局部描述符与这些中间局部检测器紧密耦合,换而言之,它们是输入图像对这些卷积运算的响应。另一方面,在卷积运算后等到的激活图层可以看做是特征的集成,在这篇综述中将其称为“列特征”。例如,在 AlexNet 中第一层有$ n=96 $探测器(卷积滤波器)。这些过滤器产生$ n=96 $大小$ 27\times27 $的热力图(在最大池化后)。热力图中的每个像素点具有大小为$ 19\times19 $的感受野,同时记录了图像对滤波器的响应。因此列特征的大小是$ 1\times1\times96 $它可以看作是对原始图像中某个图像块的描述。该描述符的每个维度表示相应检测器的激活程度,并且在某种程度上类似于 SIFT 描述符。列特征最早出现在《Visual instance retrieval with deep convolutional networks》中,Razavian 等人首先在分好块的特征图上进行最大池化,然后将它们连接在所有过滤器上,最终生成列特征。在《Hypercolumns for object segmentation and fine-grained localization》中,来自多层的列特征被连接形成“超列”(hypercolumn)特征。

Feature Encoding and Pooling 特征编码与池化

When column features are extracted, an image is represented by a set of descriptors. To aggregate these descriptors into a global representation, currently two strategies are adopted: encoding and direct pooling (Fig. pipeline).

Encoding A set of column features resembles a set of SIFT features. So standard encoding schemes can be directly employed. The most commonly used methods are VLAD and FV . A brief review of VLAD and FV can be seen in Section encoding_small. A milestone work is , in which the column features are encoded into VLAD for the first time. This idea was later extended to CNN model fine-tuning . The BoW encoding can also be leveraged, as the case in . The column features within each layer are aggregated into a BoW vector which is then concatenated across the layers. An exception to these fix-length representations is , in which the column features are quantized with a codebook of size 25k and an inverted index is employed for efficiency.

PoolingA major difference between the CNN column feature and SIFT is that the former has an explicit meaning in each dimension, i.e., the response of a particular region of the input image to a filter. Therefore, apart from the encoding schemes mentioned above, direct pooling techniques can produce discriminative features as well. A milestone work in this direction consists in the Maximum activations of convolutions (MAC) proposed by Tolias *et al.

当提取列特征时,图像由一组描述符表示。为了将这些描述符聚合为全局表示,目前采用了两种策略:编码和直接池合并(如图 2 所示)。
编码。 一组列特征类似于一组 SIFT 特征,因此可以直接使用标准编码方案。常用的方法就是 VLAD 和 FV 算法,两个算法的简要介绍可以参加本文 3.3.2 节。一个里程碑式的工作发布于《Exploiting local features from deep networks for image retrieval》,文中后首次将列特征用 VLAD 算法编码。这个想法后来扩展为 CNN 的微调。BoW 编码同样也可以使用,具体工作可以参见《Hybrid multi-layer deep cnn/aggregator feature for image classification》。每个层内的列特征被聚集成一个 BoW 向量,然后跨层连接。《Bags of local convolutional features for scalable instance search》是固定长度表示的一个例外,这篇文章将列特征用大小为 25K 的码本量化,还采用了倒排索引结构来提升效率。

池化。 CNN 特征与 SIFT 的主要区别在于前者在每个维度上都有明确的含义,也就是对输入图像的特定区域的滤波器响应。因此,除了上面提到的编码方案之外,直接池化技术也可以产生具有区分度的特征。这方面的一项里程碑工作包括 Tolias 等人提出的最大卷积激活(MAC)。

Without distorting or cropping images, MAC computes a global descriptor with a single forward pass. Specifically, MAC calculates the maximum value of each intermediate feature map and concatenates all these values within a convolutional layer. In its multi-region version, the integral image and an approximate maximum operator are used for fast computation. The regional MAC descriptors are subsequently sum-pooled along with a series of normalization and PCA-whitening operations . We also note in this survey that several other works , also employ similar ideas with in employing max or average pooling on the intermediate feature maps and that Razavian et al. are the first. It has been observed that the last convolutional layer (e.g., pool5 in VGGNet), after pooling usually yields superior accuracy to the FC descriptors and the other convolutional layers . Apart from direct feature pooling, it is also beneficial to assign some specific weights to the feature maps within each layer before pooling. In , Babenko et al. propose the injection of the prior knowledge that objects tend to be located toward image centers, and impose a 2-D Gaussian mask on the feature maps before sum pooling. Xie et al. improve the MAC representation by propagating the high-level semantics and spatial context to low-level neurons for improving the descriptive ability of these bottom-layer activations. With a more general weighting strategy, Kalantidis et al. perform both feature map-wise and channel-wise weighing, which aims to highlight the highly active spatial responses while reducing burstiness effects.

在没有扭曲或裁剪图像的情况下,MAC 用单个前向传递来计算全局描述符。特别地,MAC 计算每个中间特征映射的最大值,并将所有这些值串联在一个卷积层内。在其多区域版本中,使用积分图算法和最似最大算子进行快速计算。随后局部的 MAC 描述符随着一系列归一化和 PCA 白化操作被一起合并。我们在本次调研中也注意到了其他一些工作同样也采用了相似的思想,在中间特征映射上采用最大或平均池化,其中 Razavian 等人的《Visual instance retrieval with deep convolutional networks》是打开先河的工作。同时大家也发现最后一层卷积层(如 VGGNet 的 pool5)在池化后达到的准确率要高于 FC 描述符以及其他卷积层。

除了直接特征池化,在池化之前给每个层内的特征图分配一些特定的权重也是有益的。在《Aggregating local deep features for image retrieval》中,Babenko 等人提出“目标对象往往出现在图像中心”这样一个先验知识,并在总池化前对特征图施加一个 2-D 高斯掩膜。Xie 等人在《Interactive: Inter-layer activeness propagation》中改进了 MAC 表示法,他们将高层语义和空间上下文传播到底层神经元,以提高这些底层激活神经元的描述能力。Kalantidis 等人使用了一个更常规的加权策略,他们同时执行特征映射和信道加权以突出高激活的空间响应,同时减少异常突发情况的影响。

Image Retrieval with Fine-Tuned CNN Models 使用微调 CNN 模型的图像检索系统

Although pre-trained CNN models have achieved impressive retrieval performance, a hot topic consists in fine-tuning the CNN model on specific training sets. When a fine-tuned CNN model is employed, the image-level descriptor is usually generated in an end-to-end manner, i.e., the network will produce a final visual representation without additional explicit encoding or pooling steps.

虽然预先训练的 CNN 模型已经取得了令人惊叹的检索性能,但在指定训练集上对 CNN 模型进行微调也是一个热门话题。当采用微调的 CNN 模型时,图像级的描述符通常以端到端的方式生成,那么网络将产生最终的视觉表示,而不需要额外的显式编码或合并步骤。

Datasets for Fine-Tuning 用于微调网络的数据集

The nature of the datasets used in fine-tuning is the key to learning discriminative CNN features. ImageNet only provides images with class labels. So the pre-trained CNN model is competent in discriminating images of different object/scene classes, but may be less effective to tell the difference between images that fall in the same class (e.g., architecture) but depict different instances (e.g., Eiffel Tower and Notre-Dame). Therefore, it is important to fine-tune the CNN model on task-oriented datasets.

name # images # classes content
Landmarks [8] 213,678 672 Landmark
3D Landmark [24] 163,671 713 Landmark
Tokyo TM [137] 112,623 n.a Landmark
MV RGB-D [142] 250,000 300 House. object
Product [143] 101,945×2 n.a Furniture

TABLE III: Statistics of instance-level datasets having been used in fine-tuning.

The datasets having been used for fine-tuning in recent years are shown in Table ft_datasets. Buildings and common objects are the focus. The milestone work on fine-tuning is . It collects the by a semi-automated approach: automated searching for the popular landmarks in Yandex search engine, followed by a manual estimation of the proportion of relevant image among the top ranks. This dataset contains 672 classes of various architectures, and the fine-tuned network produces superior features on landmark related datasets such as Oxford5k and Holidays , but has decreased performance on Ukbench where common objects are presented. Babenko et al. have also fine-tuned CNNs on the containing turntable views of 300 household objects, in order to improve performance on Ukbench. The Landmark dataset is later used by Gordo et al. for fine-tuning, after an automatic cleaning approach based on SIFT matching. In , Radenoviet al. employ the retrieval and Structure-From-Motion methods to build models so that images depicting the same architecture can be grouped. Using this labeled dataset, the linear discriminative projections (denoted as $ L_w $ in Table compare_methods_datasets) outperform the previous whitening technique . Another dataset called is collected using Google Street View Time Machine which provides images depicting the same places over time . While most of the above datasets focus on landmarks, Bell et al. build a consisting of furniture by developing a crowd-sourced pipeline to draw connections between in-situ objects and the corresponding products. It is also feasible to fine-tune on the query sets suggested in , but this method may not be adaptable to new query types.

微调网络时使用的数据集对学习高区分度的 CNN 特征具有至关重要的作用。ImageNet 仅提供了图像的类别标签,因此预训练的 CNN 模型可以对图像的类别进行分类,但却难以区分同一类的图像。因此要面向任务数据集进行 CNN 模型微调。

近年来用于微调网络方法数据集统计在表 3 中。数据集主要集中于建筑物和普通物体中。微调网络方向一个里程碑式的工作是《Neural codes for image retrieva》。这篇文章通过一个半自动化的方法收集地标数据集:在 Yandex 搜索引擎中自动地爬取流行的地标,然后手动估计排名靠前的相关图像的比例。该数据集包含 672 类不同的地标建筑,微调网络在相关的地标数据集,如 Oxford5k 和假日数据集上表现优异,但是在 Ukbench 数据集(包含有普通物体)上性能降低了。Babenko 等人也在含有 300 个多角度拍摄的日常物品图像的多视图 RGB-D 数据集上对 CNN 模型进行了精细调整,以提高在 Ukbench 数据集上的性能。地标数据集后来被 Gordo 等人使用,他们使用基于 SIFT 匹配的自动清洗方法后再微调网络。在《Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples》中,Radenovi 等人利用检索和运动结构的方法来构建三维地标模型,以便将描述相同建筑的图像进行分组。使用这个标记的数据集,线性判别投影方法(在表 5 中表示为 $L_w $优于先前的白化方法。另一个名为** Tokyo Time Machine**的数据集使用谷歌街景时间机器工具来收集图像,谷歌提供的这个工具可以提供同一地点不同时间的图像。上述的大部分数据集主要关注了地标图像,而 Bell 等人则建立了一个由家具组成的产品数据集,通过开发众包流程来绘制现场的目标和相应产品之间的连接。对所得到的查询集进行微调也是可行的,但是这种方法可能不适合于新的查询类型。

Networks in Fine-Tuning 微调的网络

The CNN architectures used in fine-tuning mainly fall into two types: the classification-based network and the verification-based network. The classification-based network is trained to classify architectures into pre-defined categories. Since there is usually no class overlap between the training set and the query images, the learned embedding e.g., FC6 or FC7 in AlexNet, is used for Euclidean distance based retrieval. This train/test strategy is employed in , in which the last FC layer is modified to have 672 nodes corresponding to the number of classes in the Landmark dataset. The verification network may either use a siamese network with pairwise loss or use a triplet loss and has been more widely employed for fine-tuning. A standard siamese network based on AlexNet and the contrastive loss is employed in . In , Radenoviet al. propose to replace the FC layers with a MAC layer . Moreover, with the 3D architecture models built in , training pairs can be mined. Positive image pairs are selected based on the number of co-observed 3D points (matched SIFT features), while hard negatives are defined as those with small distances in their CNN descriptors. These image pairs are fed into the siamese network, and the contrastive loss is calculated from the $ \ell_2 $ normalized MAC features. In a concurrent work to , Gordo et al. fine-tune a triplet-loss network and a region proposal network on the Landmark dataset . The superiority of consists in its localization ability, which excludes the background in feature learning and extraction. In both works, the fine-tuned models exhibit state-of-the-art accuracy on landmark retrieval datasets including Oxford5k, Paris6k and Holidays, and also good generalization ability on Ukbench (Table compare_methods_datasets). In , a VLAD-like layer is plugged in the network at the last convolutional layer which is amenable to training via back-propagation. Meanwhile, a new triplet loss is designed to make use of the weakly supervised Google Street View Time Machine data.

用于微调的 CNN 结构主要分为两类:基于分类的网络和基于验证的网络。基于分类的网络被训练以将建筑分类为预定义的类别。由于训练集和查询图像之间通常不存在类重叠,因此在 AlexNet 中如 FC6 或 FC7 的学习到的嵌入特征用于基于欧氏距离的检索。该训练/测试策略采用在方框中,其中最后的 FC 层被修改为具有对应于地标数据集中类的数目的 672 个节点。在《Neural codes for image retrieval》中采用训练/测试策略,其网络最后的 FC 层被修改为 672 个节点,对应于地标数据集中类别数目。

验证网络可以使用孪生网络(siamese network)结合成对损失函数(pairwise loss)或三元损失函数(triplet loss),这种方法已经被更广泛地用于微调网络任务中。在《Learning visual similarity for product design with convolutional neural networks》中采用了基于 AlexNet 的孪生网络和对比损失函数。在《Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples》中 Radenovi´c 等人提出用 MAC 成代替全连接层。更进一步地,可以通过建立的 3 维建筑模型挖掘训练对。基于共同观测的 3D 点云(匹配的 SIFT 特征)的数目来选择正例图像对,而 CNN 描述符中距离较小的那些图像对被认为是负例样本。这些图像输入到孪生网络中,并且用$ \ell_2 $ 正则后的 MAC 层输出计算对比损失函数。与《Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples》同时进行的一项工作是《Deep image retrieval: Learning global representations for image search》,Gordo 等人在 Landmark 数据库上对三元损失网络和区域提取网络进行微调。《Deep image retrieval: Learning global representations for image search》这项工作的的优越性在于物体其定位能力,它很好地在特征学习和提取步骤中排除了图像背景。在这两项工作中,微调模型在 landmark,OxFoD5K、PARIS6K 和 Holidays 数据集上表现出了最先进的精度,以及在 UKBayes 数据集上表现出良好的泛化能力(将表 5)。在《Netvlad: Cnn architecture for weakly supervised place recognition》中,在最后一个卷积层中插入一个类似 VLAD 编码层,通过反向传播进行训练。与此同时,设计了一个新的三元损失函数来利用弱监督的 Google Street View Time Machine 数据。

Hybrid CNN-based Methods 基于 CNN 模型的混合式方法

For the hybrid methods, multiple network passes are performed. A number of image patches are generated from an input image, which are fed into the network for feature extraction before an encoding/pooling stage. Since the manner of detector + descriptor is similar to SIFT-based methods, we call this method type hybrid. It is usually less efficient than the single-pass methods.

混合式方法中使用多网络传递方式。许多图像块从输入图像中获得并被输入网络中进行特征提取,随后进行编码/池化。由于“检测器 + 描述符”的方式和基于 SIFT 的方法很相似,因此我们称其为“混合式”方法。这种方法的效率通常比单通传递要低。

Feature Extraction 特征提取

In hybrid methods, the feature extraction process consists of patch detection and description steps. For the first step, the literature has seen three major types of region detectors. The first is grid image patches. For example, in [22], a two-scale sliding window strategy is employed to generate patches. In [7], the dataset images are first cropped and rotated, and then divided into patches of different scales, the union of which covers the whole image. The second type is invariant keypoint/region detectors. For instance, the difference of Gaussian feature points are used in [145]; the MSER region detector is leveraged in [146]. Third, region proposals also provide useful information on the locations of the potential objects. Mopuri et al. [147] employ selective search [148] to generate image patches, while EdgeBox [149] is used in [150]. In [144], the region proposal network (RPN) [151] is applied to locate the potential objects in an image.

The use of CNN as region descriptors is validated in [146], showing that CNN is superior to SIFT in image matching except on blurred images. Given the image patches, the hybrid CNN method usually employs the FC or pooled intermediate CNN features. Examples using the FC descriptors include [22, 7, 152, 147]. In these works, the 4,096-dim FC features are extracted from the multi-scale image regions [22, 7, 152] or object proposals [147]. On the other hand, Razavian et al. [133] also uses the intermediate descriptors after max-pooling as region descriptors.

The above methods use pre-trained models for patch feature extraction. Based on the hand-crafted detectors, patch descriptors can also be learned through CNN in either supervised [153] or unsupervised manner [145], which improves over the previous works on SIFT descriptor learning [45], [34]. Yi et al. [154] further propose an end-to-end learning method integrating region detector, orientation estimator and feature descriptor in a single pipeline.

在混合方法中,特征提取过程包括图像块检测和描述符生成。对第一步而言,主要有三种区域检测器。第一种检测器是网格化图像块。例如,在《Multi-scale orderless pooling of deep convolutional activation features》中使用了两个尺寸滑动窗口的策略来生成图像块。在《Cnn features off-the-shelf: an astounding baseline for recognition》中首先对数据集进行裁剪和旋转,然后将其划分为不同尺度的图像块。第二类是具有不变性的关键点/区域检测器。例如高斯差分特征点在《Learning to compare image patches via convolutional neural networks》中使用。MSER 区域检测器在《Descriptor matching with convolutional neural networks: a comparison to sift》中被使用。第三种是区域建议方法,它也同样提供了潜在对象可能的位置信息。Mopuri 等人使用选择性搜索策略来提取图像块,而边缘区域方法在《Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging》中使用。在《Faster r-cnn features for instance search》中使用区域建议网络(RPN)来对目标进行定位。

《Descriptor matching with convolutional neural networks: a comparison to sift》证实了 CNN 一类的区域描述是有效的,并且在出模糊图像之外的图像匹配任务繁重要优于 SIFT 描述符。对于给定的图像块,混合 CNN 方法通常使用全连接层或池化的方法来整合 CNN 特征,相关文献对此均有研究。这些研究从多尺度的图像区域中提取 4096 维 FC 特征或目标建议区域。另一方面,Razavian 等人还在最大池化后采用中间描述符来作为区域描述符。

上述方法采用预训练模型进行图像块特征提取。以手工检测器为基础,图像块描述符也可以通过有监督或无监督方式进行 CNN 训练学习,这相对于之前关于 SIFT 描述符学习的工作有所改进。Yi 等人进一步提出了一种在单个流程中集成了区域检测器、方向估计和特征描述符结果的端到端学习方法。

Feature Encoding and Indexing

The encoding/indexing procedure of hybrid methods resembles SIFT-based retrieval, *e.g., *VLAD/FV encoding under a small codebook or the inverted index under a large codebook.

The VLAD/FV encoding, such as [22, 147], follow the standard practice in the case of SIFT features [15, 14], so we do not detail here. On the other hand, several works exploit the inverted index on the patch-based CNN features [155, 156, 139]. Again, standard techniques in SIFT-based methods such as HE are employed [156]. Apart from the above-mentioned strategies, we notice that several works [7, 133, 152] extract several region descriptors per image to do a many-to-many matching, called “spatial search” [7]. This method improves the translation and scale invariance of the retrieval system but may encounter efficiency problems. A reverse strategy to applying encoding on top of CNN activations is to build a CNN structure (mainly consisting of FC layers) on top of SIFT-based representations such as FV. By training a classification model on natural images, the intermediate FC layer can be used for retrieval [157].

混合方法的编码/索引过程类似于基于 SIFT 的检索,如同在小码本下的 VLAD / FV 编码或大码本下的倒排索引。

VLAD/FV 编码过程紧随 SIFT 特征提取后,在上文已经详细描述过这样的流程,不再赘述。另一方面,有一些工作研究探索了图像块的 CNN 特征的倒排索引。同样,在 SIFT 方法流程中诸如 HE 之类的编码方法也被使用。除了上述提到的编码策略,我们注意到《Cnn features off-the-shelf: an astounding baseline for recognition》,《Visual instance retrieval with deep convolutional networks》,《Image classification and retrieval are one》这些工作提取每个图像的多个区域描述符进行多对多匹配,这种方法称为称为“空间搜索”。该方法提高了检索系统对平移和尺度变化的鲁棒性,但可能会遇到效率问题。另一种使用 CNN 最高层特征编码的策略是在基于 SIFT 编码(如 FV)的最后面建立一个 CNN 结构(主要由全连接层组成)。通过在自然图像上训练一个分类模型,中间的全连接层可以被用来进行检索任务。

Discussions 讨论

Relationship between SIFT- and CNN-based Methods 基于 SIFT 和 CNN 的方法间的关系

In this survey, we categorize current literature into six fine-grained classes. The differences and some representative works of the six categories are summarized in Table I and Table V. Our observation goes below.

First, the hybrid method can be viewed as a transition zone from SIFT- to CNN-based methods. It resembles the SIFT-based methods in all the aspects except that it extracts CNN features as the local descriptor. Since the network is accessed multiple times during patch feature extraction, the efficiency of the feature extraction step may be compromised.

Second, the single-pass CNN methods tend to combine the individual steps in the SIFT-based and hybrid methods. In Table V, the “pre-trained single-pass” category integrates the feature detection and description steps; in the “fine-tuned single-pass” methods, the image-level descriptor is usually extracted in an end-to-end mode, so that no separate encoding process is needed. In [17], a “PCA” layer is integrated for discriminative dimension reduction, making a further step towards end-to-end feature learning.

Third, fixed-length representations are gaining more popularity due to efficiency considerations. It can be obtained by aggregating local descriptors (SIFT or CNN) [15], [18, 22], [9], direct pooling [147], [10], or end-to-end feature computation [8, 17]. Usually, dimension reduction methods such as PCA can employed on top of the fixed-length representations, and ANN search methods such as PQ [15] or hashing [47] can be used for fast retrieval.

在本篇综述中,我们将现有的文献分为六个精细的类,表 1 和表 5 总结了六个类别的差异和代表性作品。我们的观察结果如下。

第一,混合方法可被视为从 SIFT-到基于 CNN 的方法的过渡方法,除了将 CNN 特征提取为局部描述符之外,它在所有方面都类似于基于 SIFT 的方法。由于在图像块特征提取期间需要多次访问网络,因此特征提取步骤的效率可能会受到影响。

第二,单向 CNN 方法倾向于将 SIFT 和混合方法中的各个步骤结合起来。在表 5 中,“预训练单向网络”一类方法整合了特征检测和描述步骤;在“微调单向网络”中,图像级描述符通常是在端到端模式下提取的,因此不需要单独的编码过程。在《Deep image retrieval: Learning global representations for image search》中,集成了类似“PCA”层以减少区分维数,进一步完善了端到端的特征学习。

第三,出于效率上的考虑,特征编码的固定长度表示方法越来越流行。它可以通过聚集局部描述符(SIFT 或 CNN)、直接汇或端到端特征计算的方法来获得。通常,诸如 PCA 的降维方法可以在固定长度的特征表达中使用,ANN 搜索方法(如 PQ 或哈希)可用于快速检索。

Hashing and Instance Retrieval 哈希与实例检索

Hashing is a major solution to the approximate nearest neighbor problem. It can be categorized into locality sensitive hashing (LSH) and learning to hash. LSH is data-independent and is usually outperformed by learning to hash, a data-dependent hashing approach. For learning to hash, a recent survey categorizes it into quantization and pairwise similarity preserving. The quantization methods are briefly discussed in Section encoding_small. For the pairwise similarity preserving methods, some popular hand-crafted methods include Spectral hashing , LDA hashing , etc. Recently, hashing has seen a major shift from hand-crafted to supervised hashing with deep neural networks. These methods take the original image as input and produce a learned feature before binarization . Most of these methods, however, focus on class-level image retrieval, a different task with instance retrieval discussed in this survey. For instance retrieval, when adequate training data can be collected, such as architecture and pedestrians, the deep hashing methods may be of critical importance.
哈希方法是最似最近邻问题的主流解决方案。它可以被分类类为局部敏感哈希(LSH)算法和哈希学习方法。LSH 是数据无关的且常通过学习哈希来获得更优异的性能。对于学习哈希方法,最近的一项调研《A survey on learning to hash》将其归类为量化和成对相似性保留这两类。我们在 3.3.2 节已经详细讨论过量化方法热,不再赘述。成对相似性保留方法包括一些常用的手工设计哈希方法,如谱哈希,LDA 哈希等。

近年来随着深度网络的发展,哈希方法也从手工设计的方式转变到受监督的训练方式。这些方法将原始图像作为输入,并在二值化之前生成学习的特征。然而,这些方法大多集中于图像分类式的检索任务,与本次调研所中讨论的实例图像检索不同。实例检索任务中,当可以收集到足够的训练数据时(例如建筑和行人和数据)时,深度散列方法可能是至关重要的。

Experimental Comparisons 实验比较

Image Retrieval Datasets 图像检索数据集

Five popular instance retrieval datasets are used in this survey. Statistics of these datasets can be accessed in Table IV.

在本次调研中使用了五个流行的实例检索数据集,这些数据集的统计数据如表 4 所示。

name # images # queries content
Holidays [13] 1,491 500 scene
Ukbench [11] 10,200 10,200 common objects
Paris6k [25] 6,412 55 buildings
Oxford5k [12] 5,062 55 buildings
Flickr100k [25] 99,782 - from Flickr’s popular tags

TABLE IV: Statistics of popular instance-level datasets.

Holidays [13] is collected by Jégou et al. from personal holiday albums, so most of the images are of various scene types. The database has 1,491 images composed of 500 groups of similar images. Each image group has 1 query, totaling 500 query images. Most SIFT-based methods employ the original images, except [71, 32] which manually rotate the images into upright orientations. Many recent CNN-based methods [140], [137], [16] also use the rotated version of Holidays. In Table V, results of both versions of Holidays are shown (separated by “/”). Rotating the images usually brings 2-3% mAP improvement.
Holidays 数据集由 Jégou 等人从个人假日相册中收集,因此图像包含各种各样的场景。该数据库由 500 组 1,491 幅相似图像组成,每组图像有一条查询记录,共计 500 条。除了《Efficient representation of local geometry for large scale object retrieval》和《Learning a fine vocabulary》,大多数基于 SIFT 的方法手动地将图像旋转成直立方向。最近许多基于 CNN 的方法也使用了旋转版的 Holidays 数据集。在表 5 中这两个版本数据集上的结果用”/“间隔,旋转图像可以带来 2%-3% 的 mAP 值。

Ukbench [11] consists of 10,200 images of various content, such as objects, scenes, and CD covers. All the images are divided into 2,550 groups. Each group has four images depicting the same object/scene, under various angles, illuminations, translations, etc. Each image in this dataset is taken as the query in turn, so there are 10,200 queries.
Ukbench 数据集包括 10,200 种不同内容的图像,如物体、场景和 CD 封面。所有图像被分为 2,550 组,每个组有四个图像描述相同的物体/场景,在不同的角度,照明,形变,等情况下的表现。该数据集中的每个图像依次作为查询记录,因此有 10,200 条查询记录。

Oxford5k [12] is collected by crawling images from Flickr using the names of 11 different landmarks in Oxford. A total of 5,062 images form the image database. The dataset defines five queries for each landmark by hand-drawn bounding boxes, so that 55 query Regions of Interest (ROI) exist in total. Each database image is assigned one of four labels, good, OK, junk, or bad. The first two labels are true matches to the query ROIs, while “bad” denotes the distractors. In junk images, less than 25% of the objects are visible, or they undergo severe occlusion or distortion, so these images have zero impact on retrieval accuracy.

Oxford5k 数据集用牛津 11 个地标名从 Flickr 网站爬取共计 5062 幅图像组建数据集。该数据集通过手绘边界框为每个地标的定义五个查询记录,从而总共存在 55 个感兴趣区域(ROI)查询记录。每个数据库图像被分配了好的还可以的垃圾的,或坏的四个标签之一。前两个标签表示与查询的感兴趣区域是匹配的,而“坏”表示不匹配。在糟糕的图像中,只有不到 25%的对象是可见的,或者它们遭受严重的遮挡或变形,因此这些图像对检索精度影响不大。
Flickr100k [25] contains 99,782 high resolution images crawled from Flickr’s 145 most popular tags. In literature, this dataset is typically added to Oxford5k to test the scalability of retrieval algorithms.

Flickr100k 数据集包括 99,782 张来 Flickr 网站 145 个最流行标签的高清图像。在文献中,通常将该数据集添加到 Oxford5k 中,以测试检索算法的可扩展性。

Paris6k [25] is featured by 6,412 images crawled from 11 queries on specific Paris architecture. Each landmark has five queries, so there are also 55 queries with bounding boxes. The database images are annotated with the same four types of labels as Oxford5k. Two major evaluation protocols exist for Oxford5k and Paris6k. For SIFT-based methods, the cropped regions are usually used as query. For CNN-based methods, some employ the full-sized query images [8, 137]; some follow the standard cropping protocol, either by cropping the ROI and feeding it into CNN [16] or extracting CNN features using the full image and selecting those falling in the ROI [144]. Using the full image may lead to mAP improvement. These protocols are used in Table V.

Paros6k 数据集从 11 指定的巴黎建筑查询中爬出 6,412 中图像。每个地标有五个查询记录,因此这个数据集同样有 55 个带有边界框的查询记录。数据库图像使用和 Oxford5k 一样的四种类型的标签作为 Oxford5k 数据集标签。针对 Oxford5k 和 Paris6k 数据集有两个评估准则。对于基于 SIFT 的方法,被裁剪的区域通常用于查询。对于基于 CNN 的方法,有些工作采用的是全尺寸图像,有些工作采用的是将裁剪的 ROI 传入 CNN 或者用 CNN 提取全图特征后再裁剪得到 ROI。使用完整的图像的方法可以提高 mAP 指标。详细的指标可以参见表 5。
640?wx_fmt=png

TABLE V: Performance summarization of some representative methods of the six categories on the benchmarks. “+100k” –> the addition of Flickr100k into Oxford5k. “pw.” –> power low normalization [14]. “MP” –> max pooling. “SP” –> sum pooling. * and parentheses –> results are obtained with post-processing steps such as spatial verification or QE. \lx@sectionsign –> numbers are estimated from the curves. † numbers are reported by our implementation. For Holidays, results using the rotated images are presented after “/”. For Oxford5k (+100k) and Paris6k, results using the full-sized queries are shown after “/”. ♮ –> the full query image is fed into the network, but only the features whose centers fall into the query region of interest are aggregated. Note that in many fixed-length representations, ANN algorithms such as PQ are not used to report the results, but ANN can be readily applied after PCA during indexing.

image

image

image
Fig. 6: The state of the art over the years on the (a) Holidays, (b) Ukbench, and (c) Oxford5k datasets. Six fine-grained categories are summarized (see Section 2). For each year, the best accuracy of each category is reported. For the compact representations, results of 128-bit vectors are preferentially selected. The purple star denotes the results produced by 2,048-dim vectors [17], the best performance in fine-tuned CNN methods. Methods with a pink asterisk denote using rotated images on Holidays, full-sized queries on Oxford5k, or spatial verification and QE on Oxford5k (see Table V).

Evaluation MetricsPrecision-recall. 评价指标

Recall denotes the ratio of returned true matches to the total number or true matches in the database, while precision refers to the fraction of true matches in the returned images. Given a subset of $ n $ returned images, assuming there are $ n_p $ true matches among them, and a total of $ N_p $ true matches exist in the whole database, then recall $ @n $ ( $ r@n $ ) and precision $ @n $ ( $ p@n $ ) are calculated as $ \frac{n_p}{N_p} $ and $ \frac{n_p}{n} $ , respectively. In image retrieval, given a query image and its rank list, a precision-recall curve can be drawn on the (precision, recall) points $ (r@1, p@1), (r@2, p@2), ..., (r@N, p@N) $ , where $ N $ is the number of images in the database. . To more clearly record the retrieval performance, average precision (AP) is used, which amounts to the area under the precision-recall curve. Typically, a larger AP means a higher precision-recall curve and thus better retrieval performance. Since retrieval datasets typically have multiple query images, their respective APs are averaged to produce a final performance evaluation, i.e., the mean average precision (mAP). Conventionally, we use mAP to evaluate retrieval accuracy on the Oxford5k, Paris6k, and Holidays datasets.

**精准度-召回率。**recall 表示返回的 true 匹配与数据库中总数或 true 匹配的比率,而精度是指返回的图像中的真实匹配的分数。给定$ n $返回的图像子集,假设它们之间存在$ n_p $ TRUE 匹配,并且整个数据库中存在$ N_p $ TRUE 匹配,然后召回$ @n $($ r@n $)和精密$ @n $($ p@n $)分别计算为$\frac{n_p}{N_p} $和$\frac{n_p}{n} $。在图像检索中,给定查询图像及其等级列表,可以在(精度,召回)点$(r@1, p@1), (r@2, p@2), ..., (r@N, p@N) $上绘制精确调用曲线,其中$ N $是数据库中的图像数。

平均准确率和平均精度。 为了更加清晰地记录图像检索系统的性能,我们使用平均准确率(average precision)对其进行衡量,它相当于精准度-召回率曲线下的面积。通常,较大的 AP 值意味着更高的精准度-召回率曲线,亦即更好的检索性能。由于图像检索数据集通常具有多个查询图像,所以对它们各自的 AP 值进行平均,以产生最终的性能评价,即平均精度(mean average precision, mAP)。传统地,我们使用 mAP 来评估检索系统在 Oxford5k、Paris6k 和 Holidays 数据集上的准确度。

N-S 得分。 N-S 得分专用于 Ukbench 数据集,它是以 David Nistér 和 Henrik Stewénius 的名字来命名的。N-S 得分其实等价于精准度或者召回率,因为在 Ukbench 数据集中的每个查询在数据库中都有四个正确的匹配项。N-S 得分用总排名列表中前四中的真实匹配的平均数量来计算。

image

image

image

image

Fig. 7: The impact of feature dimension on retrieval accuracy. Compact (fixed-length) representations are shown, i.e., SIFT small voc., hybrid CNN methods, pre-trained CNN methods, and fine-tuned CNN methods. Curves with a pink asterisk on the end indicates using rotated images or full-sized queries on Holidays and Oxford5k (see Table V), resp.

Comparison and Analysis 比较和分析

Performance Improvement Over the Years 多年来性能的改进

We present the improvement in retrieval accuracy over the past ten years in Fig. performance_improvement and the numbers of some representative methods in Table compare_methods_datasets. The results are computed using codebooks trained on independent datasets . We can clearly observe that the field of instance retrieval has constantly been improving. The baseline approach (HKM) proposed over ten years ago only yields a retrieval accuracy of 59.7%, 2.85, 44.3%, 26.6%, and 46.5% on Holidays, Ukbench, Oxford5k, Oxford5k+Flickr100k, and Paris6k, respectively. Starting from the baseline approaches , methods using large codebooks improve steadily when more discriminative codebooks , spatial constraints , and complementary descriptors are introduced. For medium-sized codebooks, the most significant accuracy advance has been witnessed in the years 2008-2010 with the introduction of Hamming Embedding and its improvements . From then on, major improvements come from the strength of feature fusion , , with the color and CNN features, especially on the Holidays and Ukbench datasets.

我们在图 6 中展示了过去十年图像检索精度的改善以及在表 5 中展示了一些有代表性的方法。实验结果通过在独立的数据集上建立的编码本来计算。我们可以清楚地看到,实例检索的领域一直在不断改进。10 多年前提出的基线方法(HKM)在 Holidays, Ukbench, Oxford5k, Oxford5k+Flickr100k 以及 Paris6k 数据集上的准确率分别仅为 59.7%, 2.85, 44.3%, 26.6% 以及 46.5%。从基线方法开始,通过引入高区分度编码本、空间约束和互补描述符,大规模编码本方法开始稳定地提升。对于中型编码本方法来说,随着 Hamming 嵌入及其改进的方法,在 2008 年至 2010 年间它见证了最显著的精度提升。从那时起,主要的改进来自特征融合的强度,特别是使用在 Holiday 和 Ukbench 数据集上提取的的颜色和 CNN 特征。

| method type | efficiency | accuracy |
| feat. ext. | retr. | mem. | train | generic | specific |
| - | - | - |
| SIFT large voc. | fair | high | fair | fair | high | fair |
| SIFT mid voc. | fair | low | low | fair | high | high |
| SIFT small voc. | fair | high | high | high | low | Low |
| CNN hybrid | low | varies | varies | varies | fair | fair |
| CNN pre-trained | high | high | high | high | high | fair |
| CNN fine-tuned | high | high | high | low | high | high |

TABLE VI: A summary of efficiency and accuracy comparison between different categories. Note: feature extraction time is estimated using CPUs and GPUs (as is usually done) for SIFT and CNN, resp. When using GPUs for SIFT extraction, the efficiency could be high as well.
表 6:不同类别方法间的效率和准确度的比较

On the other hand, CNN-based retrieval models have quickly demonstrated their strengths in instance retrieval. In the year 2012 when the AlexNet was introduced, the performance of the off-the-shelf FC features is still far from satisfactory compared with SIFT models during the same period. For example, the FC descriptor of AlexNet pre-trained on ImageNet yields 64.2%, 3.42, and 43.3% in mAP, N-S score, and mAP, respectively, on the Holidays, Ukbench, and Oxford5k datasets. These numbers are lower than by 13.85%, 0.14 on Holidays and Ukbench, respectively, and lower than by 31.9% on Oxford5k. However, with the advance in CNN architectures and fine-tuning strategies, the performance of the CNN-based methods is improving fast, being competitive on the Holidays and Ukbench datasets , and slightly lower on Oxford5k but with much smaller memory cost .

另一方面,基于 CNN 的检索模型在图像例检索中迅速显示出其优势。在 2012 年 AlexNet 刚提出时,当时的 FC 特征的性能与 SIFT 模型相比仍然远不能令人满意。例如,在 ImageNet 上预训练的 AlexNet,其 FC 描述符在 Holidays,Ukbench 和 Oxford5k 数据集上的 AP,N-S 得分和 mAP 上的得分分别为 64.2%,3.42,43.3%。这些指标是要比《Contextual weighting for vocabulary tree based image retrieval》在 Holidays 和 Ukbench 数据集上的成绩低 13.85% 和 0.14,比《Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking》在 Oxford5k 上的成绩低 31.9%。然而,然而,CNN 网络结构和微调策略的进步,基于 CNN 的方法的性能迅速提高,在 Holidays 和 Ukbench 数据集上极具竞争力,并且在 Oxford5k 数据集上的指标略低,但它具的内存消耗更小。

Accuracy Comparisons 准确率比较

The retrieval accuracy of different categories on different datasets can be viewed in Fig. 6, Table V and Table VI. From these results, we arrive at three observations.

First, among the SIFT-based methods, those with medium-sized codebooks [13, 31], [19] usually lead to superior (or competitive) performance, while those based on small codebook (compact representations) [15, 18, 56] exhibit inferior accuracy. On the one hand, the visual words in the medium-sized codebooks lead to relatively high matching recall due to the large Voronoi cells. The further integration of HE methods largely improves the discriminative ability, achieving a desirable trade-off between matching recall and precision. On the other hand, although the visual words in small codebooks have the highest matching recall, their discriminative ability is not significantly improved due to the aggregation procedure and the small dimensionality. So its performance can be compromised.

Second, among the CNN-based categories, the fine-tuned category [8, 17, 24] is advantageous in specific tasks (such as landmark/scene retrieval) which have similar data distribution with the training set. While this observation is within expectation, we find it interesting that the fine-tuned model proposed in [17] yields very competitive performance on generic retrieval (such as Ukbench) which has distinct data distribution with the training set. In fact, Babenko et al. [8] show that the CNN features fine-tuned on Landmarks compromise the accuracy on Ukbench. The generalization ability of [17] could be attributed to the effective training of the region proposal network. In comparison, using pre-trained models may exhibit high accuracy on Ukbench, but only yields moderate performance on landmarks. Similarly, the hybrid methods have fair performance on all the tasks, when it may still encounter efficiency problems [152, 7].

Third, comparing all the six categories, the “CNN fine-tuned” and “SIFT mid voc.” categories have the best overall accuracy, while the “SIFT small voc.” category has a relatively low accuracy.
不同数据集上不同类别的检索精度可以在图 6,表 5 和表 6 中查看。从这些结果中,我们有三个发现。

第一,在基于 SIFT 的方法中,中等规模编码本对的表现要优于小规模编码本。一方面,由于大的沃罗诺伊方格,中等规模编码本的视觉词汇可以使相关匹配的召回率变高。HE 方法的进一步集成在很大程度上提高了模型区分度,实现了匹配图像召回率和精度之间较好的平衡。另一方面,虽然小规模编码本中的视觉词具有最高的匹配召回率,但由于聚合过程和维度小,它们的图像区分能力没有显著提高。因此它的表现可以认为是不佳的。

第二,在基于 CNN 的方法中,微调的模型在特定任务(如地标/场景检索)中的表现要有很大优势,这些任务的数据一般和训练集数据分布相似。虽然这一观察是在预期之内,有趣的是我们发现在《Deep image retrieval: Learning global representations for image search》中提出的微调模型在通用检索(例如 Ukbench 数据集)上的表现极具竞争力,而它与训练集的数据分布并不同。事实上,Babenko 等人在《Neural codes for image retrieval》中表明,在 Landmarks 数据集上进行微调的 CNN 特征会降低在 Ukbench 上的的准确率。《Deep image retrieval: Learning global representations for image search》这项工作的泛化能力可以归因于对区域提取网络的有效训练。相比之下,使用预先训练模型可以在 Ukbench 上表现出较高的精度,但在 landmarks 数据集上的表现中等。相似地,混合方法在所有的任务中的表现都相当,但它仍然可能遇到效率问题时。

第三,比较这六中方法,“CNN 微调模型”和“SIFT 中等编码本”方法具有最好的总体准确度,而“SIFT 小编码本”类别具有相对较低的准确度。

image
Fig. 8: Memory cost vs. retrieval accuracy on Oxford5k. In the legend, the first 8 methods are based on large codebooks, while the last 7 use medium-sized codebooks. This figure shares the same legend with Fig. 7(c) except the newly added numbers (in black).

Efficiency Comparisons

Feature computation time. For the SIFT-based methods, the dominating step is local feature extraction. Usually, it takes 1-2s for a CPU to extract the Hessian-Affine region based SIFT descriptors for a 640×480 image, depending on the complexity (texture) of the image. For the CNN-based method, it takes 0.082s and 0.347s for a single forward pass of a 224×224 and 1024×768 image through VGG16 on a TitanX card, respectively. It is reported in [17] that four images (with largest side of 724 pixels) can be processed in 1 second. The encoding (VLAD or FV) time of the pre-trained column features is very fast. For the CNN Hybrid methods, extracting CNN features out of tens of regions may take seconds. Overall speaking, the CNN pre-trained and fine-tuned models are efficient in feature computation using GPUs. Yet it should be noted that when using GPUs for SIFT extraction, high efficiency could also be achieved.
特征计算时间。 在基于 SIFT 的方法中,主要的步骤就是局部特征的提取。通常情况下,根据图像的复杂度(纹理),CPU 提取 640×480 大小图像的基于 Hessian 仿射区域的 SIFT 描述符需要 1-2s。对于基于 CNN 的方法,在 TitanX 卡上通过 VGG16 网络对一个 224×224 和 1024×768 的图像进行单向传递分别需要 0.082s 和 0.34 7s。据报道,四幅图像(最大边 724 像素)可以在 1s 内处理。预训练列特征的编码(VLAD 或 FV)的时间非常快。对于 CNN 混合方法,提取几十个区域的 CNN 特征可能需要几秒钟。总体而言,CNN 预训练模型和微调模型在用 GPU 进行特征计算时的效率高。同样应该注意的是,当使用 GPU 进行 SIFT 提取时,也可以实现高效率。

Retrieval time. The efficiency of nearest neighbor search is high for “SIFT large voc.”, “SIFT small voc.”, “CNN pre-trained” and “CNN fine-tuned”, because the inverted lists are short for a properly trained large codebook, and because the latter three have a compact representation to be accelerated by ANN search methods like PQ [61]. Efficiency for the medium-sized codebook is low because the inverted list contains more postings compared to a large codebook, and the filtering effect of HE methods can only correct this problem to some extent. The retrieval complexity for hybrid methods, as mentioned in Section 4.3, may suffer from the expensive many-to-many matching strategy [7, 133, 152].

检索时间。 最似最近邻搜索算法用于“SIFT 大编码本”,“SIFT 小编码本”,“CNN 预训练模型”和“CNN 微调模型”时都是相当高效的,这是因为倒排列表对于适当训练的大码本来说是简短的,并且因为后者三有一个紧凑的表示,用像 PQ 这样的 ANN 搜索方法来加速是可行的。中等规模编码本的效率较低,因为它的倒排索引与大码本相比包含更多的条目,并且汉明嵌入方法的过滤效果只能在一定程度上修正这个问题。如 4.3 节所述,混合方法的检索复杂度会因为多对多匹配策略的影响而变得低效率。

Training time. Training a large or medium-sized codebook usually takes several hours with AKM or HKM. Using small codebooks reduces the codebook training time. For the fine-tuned model, Gordo et al. [17] report using five days on a K40 GPU for the triplet-loss model. It may take less time for the siamese [24] or the classification models [8], but should still much longer than SIFT codebook generation. Therefore, in terms of training, those using direct pooling [10, 134] or small codebooks [15], [9] are more time efficient.

训练时间。 用 AKM 或 HKM 训练大型或中型编码本通常需要几个小时,使用小型编码本可以缩短训练时间。对于微调模型,Gordo 等人在一块 K40 GPU 上花费了 5 天训练三元损失模型。可能在孪生网络或者分类模型上这会花费更少的时间,但是要比生成 SIFT 编码本的时间长得多。因此,在训练方面,使用直接池或小码本的效率更高。

Memory cost. Table V and Fig. 8 show that the SIFT methods with large codebooks and the compact representations are both efficient in memory cost. But the compact representations can be compressed into compact codes [53] using PQ or other competing quantization/hashing methods, so their memory consumption can be further reduced. In comparison, the methods using medium-sized codebooks are the most memory-consuming because the binary signatures should be stored in the inverted index. The hybrid methods somehow have mixed memory cost because the many-to-many strategy requires storing a number of region descriptors per image [7, 152] while some others employ efficient encoding methods [22, 147].

存储代价。 表 5 和图 8 表明具有大码本的 SIFT 方法和紧凑方法在存储成本上都是高效的。还可以使用 PQ 或其他有效的量化/散列方法将紧凑表示压缩成紧凑编码,从而可以进一步减少它们的存储消耗。相比之下,使用中等码本的方法是最消耗内存的,因为二进制签名应该存储在倒排索引中。混合方法总要有混合存储成本,因为多对多策略需要存储每个图像的多个区域描述符,而其他一些方法则采用高效的编码方法。

Spatial verification and query expansion. Spatial verification which provides refined rank lists is often used in conjunction with QE. The RANSAC verification proposed in [12]The RANSAC verification proposed in has a complexity of $ \mathcal{O}(z^2) $ , where $ z $ is the number of matched features. So this method is computationally expensive. The ADV approach is less expensive with $ \mathcal{O}(z\log z) $ complexity due to its ability to avoid unrelated Hough votes. The most efficient methods consist in which has a complexity of $ \mathcal{O}(z) $ , and further outputs the transformation and inliers for QE.
空间验证与查询拓展。 空间验证通常和 QE 算法一起使用,可以使得检索结果排列表更加精准。RANSAC 验证在《Object retrieval with large vocabularies and fast spatial matching》中提出,它的复杂度为$ \mathcal{O}(z^2) $ ,其中 z 是匹配的特征数目,可以看出算法的复杂度较高。ADV 方法的复杂度相对较小,为$ \mathcal{O}(z\log z) $ ,因为它能够避免不相关的 Hough 选票。《Hough pyramid matching: Speeded-up geometry re-ranking for large scale image retrieval》和《A vote-and-verify strategy for fast spatial verification in image retrieval》提出的方法最有效,复杂度仅为$ \mathcal{O}(z) $ ,同时后一项工作进一步地输出 QE 的变换和内值。

image

image
Fig. 9: The impact of codebook size on SIFT-based methods using (a) large codebooks and (b) medium-sized codebooks on the Oxford5k dataset.

From the perspective of query expansion, since new queries are issued, search efficiency is compromised. For example, AQE [100] almost doubles the search time due to the new query. For the recursive AQE and the scale-band recursive QE [100], the search time is much longer because several new searches are conducted. For other QE variants [101], [33], the proposed improvements only add marginal cost compared to performing another search, so their complexity is similar to basic QE methods.

从查询扩展的角度来看,由于提出了新的查询,搜索效率会受到影响。例如,由于新查询,AQE 的搜索时间几乎增加了一倍。对于递归 AQE 和带尺度递归 QE 方法,搜索时间更加长了,因为要执行好几个新的搜索。其他 QE 变体所提出的改进只比执行另一搜索增加了边际成本,因此它们的复杂性类似于 QE 方法。

Important Parameters 重要的参数

We summarize the impact of codebook size on SIFT methods using large/medium-sized codebooks, and the impact of dimensionality on compact representations including SIFT small codebooks and CNN-based methods.

Codebook size. The mAP results on Oxford5k are drawn in Fig. 9, and methods using large/medium-sized codebooks are compared. Two observations can be made. First, mAP usually increases with the codebook size but may reach saturation when the codebook is large enough. This is because a larger codebook improves the matching precision, but if it is too large, matching recall is lower, leading to saturated or even compromised performance [12]. Second, methods using the medium-sized codebooks have more stable performance when codebook size changes. This can be attributed to HE [13], which contributes more for a smaller codebook, compensating the lower baseline performance.

Dimensionality. The impact of dimensionality on compact vectors is presented in Fig. 7. Our finding is that the retrieval accuracy usually remains stable under larger dimensions, and drops quickly when the dimensionality is below 256 or 128. Our second finding favors the methods based on region proposals [147], [17]. These methods demonstrate very competitive performance under various feature lengths, probably due to their superior ability in object localization.

我们总结编码本大小对使用 SIFT 特征的大/中型码本的影响,以及维数对包括 SIFT 小编码本和基于 CNN 方法的紧凑表示的影响。

编码本规模。 图 8 展示了模型在 Oxford5k 上的 mAP 结果,对大规模编码本和中规模编码本的方法进行对比。有两点值得注意。第一,mAP 值通常随着编码本增大而增加,但当码本足够大时 aMP 值可能达到饱和。这是因为更大的码本提高了匹配精度,但是如果它太大,匹配的召回率变低,导致性能饱和甚至损害性能。第二,当编码本规模变化时,使用中等规模编码本的方法表现更稳定。这可以归因于 HE 方法,它对更小的码本贡献更多,弥补了较低的基线方法的性能。

维数。 维数对紧凑向量的影响在图 7 中给出。我们的发现检索精度通常在较大的尺寸下较为稳定,而当维数低于 256 或 128 时精度迅速下降。我们第二个发现是关于区域提取的。这些方法在各种特征长度下都表现出非常出色的性能,这可能是由于它们在目标定位方面的优越能力。

Discussions 讨论

We provide a brief discussion on when to use CNN over SIFT and the other way around. The above discussions provide comparisons between the two features. On the one hand, CNN-based methods with fixed-length representations have advantages in nearly all the benchmarking datasets. Specifically, in two cases, CNN-based methods can be assigned with higher priority. First, for specific object retrieval (e.g., buildings, pedestrians) when sufficient training data is provided, the ability of CNN embedding learning can be fully utilized. Second, for common object retrieval or class retrieval, the pre-trained CNN models are competitive.

On the other hand, despite the usual advantages of CNN-based methods, we envision that the SIFT feature still has merits in some cases. For example, when the query or some target images are gray-scale, CNN may be less effective than SIFT because SIFT is computed on gray-scale images without resorting to color information. A similar situation involves when object color change is highly intense. In another example, for small object retrieval or when the queried object undergoes severe occlusions, the usage of local features like SIFT is favored. In applications like book/CD cover retrieval, we can also expect good performance out of SIFT due to the rich textures.

我们简要地讨论何时使用 CNN 或 SIFT 以及其他相关方法。上文对两者特征进行了详细的比较。

一方面,表示向量长度固定的 CNN 方法几乎在所有的基准数据集上的性能都占有优势。具体而言,在两种情况下基于 CNN 的方法可以考虑优先使用。第一种是对于特定对象的检索(例如建筑物、行人),当提供的训练数据足够时,可以充分利用 CNN 网络嵌入学习的能力。第二种,对于常见的对象检索或类检索,预训练的 CNN 模型是有竞争力的。

另一方面,尽管基于 CNN 方法的通常是具有优势的,我们仍认为 SIFT 特征在某些情况下仍然具有优势。例如,当查询或一些目标图像是灰度图像时,CNN 可能不如 SIFT 有效,因为 SIFT 是在灰度图像上计算而不诉诸于颜色信息。当物体颜色变化非常剧烈时也同样如此。另外,在小对象检索中或当查询对象被严重遮挡时,使用诸如 SIFT 之类的局部特征是更好的选择。在书籍/CD 封面检索等应用中,由于丰富的纹理,我们也可以期待 SIFT 的良好性能。

Future Research Directions 未来的研究方向

Towards Generic Instance Retrieval

A critical direction is to make the search engine applicable to generic search purpose. Towards this goal, two important issues should be addressed. First, large-scale instance-level datasets are to be introduced. While several instance datasets have been released as shown in Table ft_datasets, these datasets usually contain a particular type of instances such as landmarks or indoor objects. Although the RPN structure used by Gordo et al. has proven competitive on Ukbench in addition to the building datasets, it remains unknown if training CNNs on more generic datasets will bring further improvement. Therefore, the community is in great need of large-scale instance-level datasets or efficient methods for generating such a dataset in either a supervised or unsupervised manner. Second, designing new CNN architectures and learning methods are important in fully exploiting the training data. Previous works employ standard classification , pairwise-loss or Triplet-loss , CNN models for fine-tuning. The introduction of Faster R-CNN to instance retrieval is a promising starting point towards more accurate object localization . Moreover, transfer learning methods are also important when adopting a fine-tuned model in another retrieval task .

面向通用任务的实例检索

图像检索一个非常重要的方向就是使用搜索引擎实现通用检索。为了实现这个目标需要解决两个重要问题。

第一,需要引入大规模图像数据集。虽然如表 3 所示展示了多个图像数据集,但这些数据集通常包含特定类型的实例,例如地标或室内物品。虽然 Gordo 等人在《Deep image retrieval: Learning global representations for image search》中使用的 RPN 结构除了在构建数据集之外,还在在 Ukbench 数据集上表现得富有竞争力,但如果在更通用的数据集上训练 CNN 能否带来进一步的改进,则仍然是未知数。因此,社区迫切需要大规模的图像数据集或一种可以以监督或非监督的方式生成这样一个数据集的有效方法。

第二,设计新的 CNN 网络和学习策略对于充分利用训练数据具有重要意义。先前有工作采用标准分类模型,成对损失或三重损失模型对 CNN 网络进行微调。Faster R-CNN 在实例检索中的引入对更精确的对象定位来说是一个良好的开始。此外,在另一个检索任务中采用微调模型时,迁移学习方法也是非常重要。

Towards Specialized Instance Retrieval

To the other end, there are also increasing interests in specialized instance retrieval. Examples include place retrieval , pedestrian retrieval , vehicle retrieval , logo retrieval , etc. Images in these tasks have specific prior knowledge that can be made use of. For example in pedestrian retrieval, the recurrent neural network (RNN) can be employed to pool the body part or patch descriptors. In vehicle retrieval, the view information can be inferred during feature learning, and the license plate can also provide critical information when being captured within a short distance. Meanwhile, the process of training data collection can be further explored. For example, training images of different places can be collected via Google Street View . Vehicle images can be accessed either through surveillance videos or internet images. Exploring new learning strategies in these specialized datasets and studying the transfer effect would be interesting. Finally, compact vectors or short codes will also become important in realistic retrieval settings.

面向专用任务的实例检索

另一方面,在专用实例检索中的研究也越来越多。例如地点检索,行人检索,车辆检索,标志检索等。在这些任务中的图像具有特定的先验知识。例如在行人检索任务中,循环神经网络(RNN)可以连接身体部分的描述符,在车辆检索任务中,在特征学习期间可以推断视图信息,同时牌照图像也可以提供关键信息。

同时,训练数据的收集过程可以进一步研究。例如,可以通过谷歌街景收集不同地点的训练图像。车辆图像可以通过监视视频或互联网图像来获取。在这些特定的数据集上探索新的学习策略以及研究迁移学习的效果将是有趣的。最后,紧凑向量编码或短编码也将在现实的检索任务设置中变得重要

Concluding Remarks 总结

This survey reviews instance retrieval approaches based on the SIFT and CNN features. According to the codebook size, we classify the SIFT-based methods into three classes: using large, medium-sized, and small codebook. According to the feature extraction process, the CNN-based methods are categorized into three classes, too: using pre-trained models, fine-tuned models, and hybrid methods. A comprehensive survey of the previous approaches is conducted under each of the defined categories. The category evolution suggests that the hybrid methods are in the transition position between SIFT and CNN-based methods, that compact representations are getting popular, and that instance retrieval is working towards end-to-end feature learning and extraction. Through the collected experimental results on several benchmark datasets, comparisons are made between the six method categories. Our findings favor the usage of CNN fine-tuning strategy, which yields competitive accuracy on various retrieval tasks and has advantages in efficiency. Future research may focus on learning more generic feature representations or more specialized retrieval tasks. The authors would like to thank the pioneer researchers in image retrieval and other related fields. This work was partially supported by the Google Faculty Award and the Data to Decisions Cooperative Research Centre. This work was supported in part to Dr. Qi Tian by ARO grant W911NF-15-1-0290, Faculty Research Gift Awards by NEC Laboratories of America and Blippar, and National Science Foundation of China (NSFC) 61429201.

本篇综述回顾了基于 SIFT 和 CNN 特征的实例检索方法。根据编码本的规模,我们将基于 SIFT 的方法分为三类:使用大,中,小规模的编码本。基于 CNN 的方法也被分为了三类:使用预训练模型,微调模型和混合模型的方法。在每个类别下都对先前的方法进行了全面的调研。从各种方法的演变可以看出,混合方法处于 SIFT 和 CNN 方法的过渡位置,紧凑编码方法越来越流行,并且实例检索正朝着端到端的特征学习和提取的方向发展。

通过在几个基准数据集上收集的实验结果,对六种方法进行了比较。我们发现 CNN 微调模型策略在不同的检索任务上都得到了较高准确率,并且在效率上也具有优势。未来的研究可能集中于学习更通用的特征表示或更特定场景的检索任务。

翻译参考 :【TPAMI 重磅综述】 SIFT 与 CNN 的碰撞:万字长文回顾图像检索任务十年探索历程