[论文翻译]deconvnet ZFNet:卷积神经网络的可视化和理解


Matthew D.Zeiler Rob Fergus

Abstract 摘要:

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

  近些年,大型卷积神经网络模型在 ImageNet 数据集上表现出令人印象深刻的效果(如 2012 年的 Krizhevsky),但是很多人没有弄明白为什么这些卷积网络会取得如此好的效果,以及如何提高分类效果。在这篇文章中,我们对这两个问题均进行了讨论。我们介绍了一种创新性的可视化技术深入观察中间的特征层函数的作用以及分类器的行为。作为一项类似诊断性的技术,可视化操作可以使我们找到比 Krizhevsky(AlexNet 模型)更好的模型架构。在 ImageNet 分类数据集上,我们还进行了一项抽丝剥茧的工作,以发现不同的层对结果的影响。我们看到,当 Softmax 分类器重新训练后,我们的模型在 ImageNet 数据集上可以很好地泛化到其他数据集,瞬间就击败了现如今 Caltech-101 以及 Caltech-256 上的最好的方法。

1,引言 Introduction

Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

  自从 1989 年 LeCun 等人研究推广卷积神经网络(以下称为 CNN)之后,在 1990 年代,CNN 在一些图像应用领域展现出极好的效果,例如手写字体分类,人脸识别等等。在去年,许多论文都表示他们可以在一些有难度的数据集上取得较好的分类效果,Ciresan 等人于 2012 年在 NORB 和 CIFAR-10 数据集上取得了最好的效果。更具有代表性的是 Krizhevsky 等人 2012 的论文,在 ImageNet 2012 数据集分类挑战中取得了绝对的优势,他们的错误率仅有 16.4%,与此相对的第二名则是 26.1%。造成这种有趣的现象的因素有很多:
-(ii)强大的 GPU 训练;
-(iii)很好的模型泛化技术如 Dropout(Hinton et al,2012)

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.


Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

  通过对神经网络可视化方法来获得一些科研灵感是很常有的做法,但大多数都局限于第一层,因为第一层比较容易映射到像素空间。在较高的网络层就难以处理了,只有有限的解释节点活跃性的方法。Erhan 等人 2009 的方法,通过在图像空间中执行梯度下降以最大化单元的激活来找到每个单元的最大响应刺激方式。这需要很细心的操作,而且也没有给出任何关于单位某种恒定属性的信息。受此启发有一种改良的方法,(Leet al ,2010)在(Berkes & Wiskott 2006)的基础上做一些延伸,通过计算一个节点的 Hessian 矩阵来观测节点的一些稳定的属性,问题在于对于更高层次而言,不变性非常复杂,因此通过简单的二次近似法(quadratic approximation)很难描述。相反,我们的方法提供了非参数化的不变性视图,显示了训练集中的哪些模式激活了特征映射。(Donahue 等,2013)显示了可视化,看可视化结果能表明模型中高层的节点究竟是被哪一块区域给激活。我们的可视化效果不同,因为它不仅仅是输入图像的作物,而是自上而下的投影,它揭示了每个补丁内的结构,这些结构会刺激特定的特征图谱。


We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image xi, via a series of layers, to a probability vector ^yi over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x)=max(x,0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
  本文采用了由(LeCun etal. 1989)以及(Krizhevsky etal 2012)提出的标准的有监督学习的卷积网模型,该模型通过一系列隐含层,将输入的二维彩色图像映射成长度为 C 的一维概率向量,向量的每个概率分别对应 C 个不同分类,每层包含以下部分:
2,矫正层,对每个卷积结果都进行矫正运算 relu(x) = Max(0, 1);
3 [可选] max pooling 层,对矫正运算结果进行一定领域内的 max pooling 操作,获得降采样图;
4 [可选] 对降采样图进行对比度归一化操作,使得输出特征平稳。更多操作细节,请参考(Krizhevsky et al 2012)以及(Jarrett et al 2009)。
5 最后几乎是全连接网络,输出层是一个 Softmax 分类器,图 3 上部展示了这个模型。

We train these models using a large set of N labeled images {x,y}, where label yi is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ^yi and yi. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.

  我们使用 N 张标签图片(x, y)构成的数据集来训练模型,其中标签 yi 是一个离散变量,用来表示图片的类别。用交叉熵误差函数来评估输出标签 yihat 和真实标签 yi 的差异。整个网络参数(包括卷积层的卷积核,全连接层的权值矩阵和偏置值)通过反向传播算法进行训练,选择随机梯度下降法更新权值,具体细节参见章节 3。

通过监督的方式利用 SGD 来训练网络,要了解卷积,需要解释中间层的特征激活值;我们利用 Deconvnet 来映射特这激活值返回到初始的像素层。

2.1 通过反卷积网络(Deconvnet)实现可视化 VISUALIZATION WITH A DECONVNET

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

  要想深入了解卷积网络,就需要了解中间层特征的作用。本文将中间层特征反向映射到像素空间,观察出什么输入会导致特点的输出,可视化过程基于(Zeiler et al 2011)提出的反卷积网络实现。一层反卷积网可以看成是一层卷积网络的逆操作,他们拥有相同的卷积核和 pooling 函数(准确来讲,应该是逆函数),因此反卷积网是将输出特征逆映射成输入信号。在(Zeiler et al 2011)中,反卷积网络被用作无监督学习,本文则用来可视化演示。

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

  为了检查一个网络,在本文的模型中,卷积网络的每一层都附加了一个反卷积层,参见图 1,提供了一条由输出特征到输入图像的反通路。首先,输入图像通过卷积网模型,每层都会产生特定特征,而后,我们将反卷积网中观测层的其他连接权值全部置零,将卷积网观测层产生的特征当做输入,送给对应的反卷积层,依次进行以下操作:1 unpool,2 rectify(矫正),3 反卷积(过滤以重新构建所选激活下的图层中的活动,然后重复此操作直至达到输入像素空间)

Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

Unpooling:严格来讲,max pooling 操作是不可逆的,本文用了一种近似方法来计算 max pooling 的逆操作:在 max pooling 过程中,用 Max locations “Switches”表格记录下每个最大值的位置,在 unpooling 过程中,我们将最大值标注回记录所在位置,其余位置填 0,图 1 底部显示了这一过程。

  简单说就是,在训练过程中记录每一个池化操作的 z*z 的区域内输入的最大值的位置,这样在反池化的时候,就将最大值返回到其应该在的位置,其他位置的值补 0。

Rectification:在卷积网络中,为了保证特征有效性,我们通过 relu 非线性函数来保证所有输出都为非负数,这个约束对反卷积过程依然成立,因此将重构信号送入 relu 函数中。


Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward to the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

  使用原卷积核的转秩和 feature map 进行卷积。反卷积其实是一个误导,这里真正的名字就是 转秩卷积操作。

  在 unpooling 过程中,由于“Switches”只记录了极大值的位置信息,其余位置均用 0 填充,因此重构出的图片看起来会不连续,很像原始图片中的某个碎片,这些碎片就是训练出高性能卷积网的关键。由于这些重构图像不是从模型中采样生成,因此中间不存在生成式过程。

  图 1:Top:一层反卷积网络(左)附加在一层卷积网(右)上。反卷积网络层会近似重构出下卷积网络层产生的特征。 Bottom:反卷积网络 unpooling 过程的演示,使用 Switches 表格记录极大值点的位置,从而近似还原出 pooling 操作前的特征。


We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

  图 3 中的网络模型与(Krizhevsky et al 2012)使用的卷积模型很相似,不同点在于:1,Krizhevsky 在 3, 4, 5 层使用的是稀疏连接(由于该模型被分配到了两个 GPU 上),而本文用了稠密连接。2,另一个重要的不同将在章节 4.1 和 图 6 中详细阐述。

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10−2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10−2 and biases are set to 0.

  本文选择了 ImageNet 2012 作为训练集(130 万张图片,超过 1000 个不同类别),首先截取每张 RGB 图片最中心的 256256 区域,然后减去整张图片颜色均值,再截出 10 个不同的 224224 窗口(可对原图进行水平翻转,窗口可在区域中滑动)。采用随机梯度下降法学习,batchsize 选择 128, 学习率选择 0.01,动量系数选择 0.9;当误差趋于收敛时,手动停止训练过程:Dropout 策略(Hinton et al 2012)运用在全连接层中,系数设为 0.5,所有权值初始化值设为 0.01, 偏置值设为 0。

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10−1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).

  图 6(a)展示了部分训练得到的第 1 层卷积核,其中有一部分核数值过大,为了避免这种情况,我们采取了如下策略:均方根超过 0.1 的核将重新进行归一化,使其均方根为 0.1.该步骤非常关键,因为第 1 层的输入变化范围在 [-128, 128]之间。前面提到了,我们通过滑动窗口截取和对原始图像的水平翻转来提高训练集的大小,这一点和(Krizhevsky et al 2012)相同。整个训练过程基于(Krizhevsky et al 2012)的代码实现,在单块 GTX580 GPU 上进行,总共进行了 70 次全库迭代,运行了 12 天。


Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
  通过章节 3 描述的结构框架,我们开始使用反卷积网来可视化 ImageNet 验证集上的特征激活。

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

  特征可视化:图 2 展示了训练结束后,模型各个隐含层提取的特征,图 2 显示了在给定输出特征的情况下,反卷积层产生的最强的 9 个输入特征。将这些计算所得的特征,用像素空间表示后,可以清洗的看出:一组特定的输入特征(通过重构获得),将刺激卷积网络产生一个固定的输出特征。这一点解释了为什么当输入存在一定畸变时,网络的输出结果保持不变。在可视化结果的右边是对应的输入图片,和重构特征相比,输入图片之间的差异性很大,而重构特征只包含哪些具有判别能力纹理结构。举例说明:层 5 第 1 行第 2 列的 9 张输入图片各不相同,差异很大,而对应的重构输入特征则都显示了背景的草地,没有显示五花八门的前景。

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

  每层的可视化结果都展示了网络的层次化特点。层 2 展示了物体的边缘和轮廓,以及与颜色的组合;层 3 拥有了更复杂的不变性,主要展示了相似的纹理(例如:第 1 行第 1 列的网格模型;第 2 行第 4 列的花纹);层 4 不同组重构特征存在着重大差异性,开始体现类与类之间的差异:狗狗的脸(第一行第一列),鸟的腿(第 4 行第 2 列);层 5 每组图片都展示了存在重大差异的一类物体,例如:键盘(第 1 行第 1 列),狗(第 4 行)

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

  特征在训练过程中的演化:图 4 展示了在训练过程中,由特定输出特征反向卷积,所获得的最强重构输入特征(从所有训练样本中选出)是如何演化的,当输入图片中的最强刺激源发生变换时,对应的输出特征轮廓发生跳变。经过一定次数的迭代后,底层特征趋于稳定。但更高层的特征则需要更多的迭代才能收敛(约 40~50 个周期),这表明:只有所有层都收敛时,分类模型才堪用。

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for object with rotational symmetry (e.g. entertainment center).

  特征不变性:图 5 展示了 5 个不同的例子,他们分别被平移,旋转和缩放。图 5 右边显示了不同层特征向量所具有的不变性能力。在第 1 层,很小的微变都会导致输出特征变换明显,但越往高层走,平移和尺度变换对最终结果的影响越小。总体来说:卷积网络无法对旋转操作产生不变性,除非物体具有很强的对称性。

  图 2 :训练好的卷积网络,显示了层 2 到层 5 通过反卷积层的计算,得到 9 个最强输入特征,并将输入特征映射到了像素空间。本文的重构输入特征不是采样生成的:他们是固定的,由特定的输出特征反卷积计算产生。每一个重构输入特征都对应的显示了它的输入图像。三点启示:1 每组重构特征(9 个)都有强相关性;2 层次越高,不变性越强;3 都是原始输入图片具有强辨识度部分的夸张展现。例如:狗的眼睛,鼻子(层 4, 第一行第 1 列)


While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al. ’s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.

  观察 Krizhevsky 的网络模型可以帮助我们在一开始就选择一个好的模型。反卷积网络可视化技术显示了 Krizhevsky 卷积网络的一些问题。如图 6(a)以及 6(d)所示,第 1 层卷积核混杂了大量的高频和低频信息,缺少中频信息;第二层由于卷积过程选择 4 作为步长,产生了混乱无用的特征。为了解决这些问题,我们做了如下的工作:
-(i)将第一层的卷积核大小由 11×11 调整到 7×7;
-(ii)将卷积步长由 4 调整为 2;新的模型不但保留了 1, 2 层绝大部分的有用特征,如图 6(c),6(e)所示,还提高了最终分类性能,我们将在章节 5.1 看到具体结果。

(a): 1st layer features without feature scale clipping. Note
that one feature dominates. (b): 1st layer features from


With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

  当模型达到期望的分类性能时,一个自然而然的想法是:分类器究竟使用了什么信息实现分类?是图像中具体位置的像素值,还是图像中的上下文。我们试图回答这个问题,图 7 中使用了一个灰色矩阵对输入图像的每个部分进行遮挡,并测试在不同遮挡情况下,分类器的输出结果,可以清楚地看到:当关键区域发生遮挡时,分类器性能急剧下降。图 7 还展示了最上层卷积网的最强响应特征,展示了遮挡位置和响应强度之间的关系:当遮挡发生在关键物体出现的位置时,响应强度急剧下降。该图真实的反映了输入什么样的刺激,会促使系统产生某个特定的输出特征,用这种方法可以一一查找出图 2 和图 4 中特定特征的最佳刺激是什么。
Three test examples where we systematically cover up different portions
of the scene with a gray square (1st column) and see how the top
(layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each
position of the gray scale, we record the total activation in one
layer 5 feature map (the one with the strongest response in the
unoccluded image). (c): a visualization of this feature map projected
down into the input image (black square), along with visualizations of
this map from other images. The first row example shows the strongest feature
to be the dog’s face. When this is covered-up the activity in the
feature map decreases (blue area in (b)). (d): a map of correct class
probability, as a function of the position of the gray
square. E.g. when the dog’s face is obscured, the probability for
“pomeranian” drops significantly. (e): the most probable label as a
function of occluder position. E.g. in the 1st row, for most locations
it is “pomeranian”, but if the dog’s face is obscured but not the
ball, then it predicts “tennis ball”. In the 2nd example, text on
the car is the strongest feature in layer 5, but the classifier is
most sensitive to the wheel. The 3rd example contains multiple
objects. The strongest feature in layer 5 picks out the faces, but the
classifier is sensitive to the dog (blue region in (d)),
since it uses multiple feature maps.

Figure 7: Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability for “pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is “pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.


Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image i, we then compute:

  与其他许多已知的识别模型不同,深度神经网络没有一套有效理论来分析特定物体部件之间的关系(例如:如何解释人脸眼睛和鼻子在空间位置上的关系),但深度网络很可能非显式的计算了这些特征。为了验证这些假设,本文随机选择了 5 张狗狗的正面图片,并系统性地挡住狗狗所有照片的一部分(例如:所有的左眼,参见图 8)。对于每张图 I,计算:

  其中 xil 和 xilhat 分别表示原始图片和被遮挡图片所产生的的特征,然后策略所有图片对(i, j)的误差向量 epsilon 的一致性:

H is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the Δ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer l=5 and l=7. The lower score for these parts, relative to random object regions, for the layer 5 features show the model does establish some degree of correspondence.

  其中,H 是 Hamming distance, Δl 值越小,对应操作对狗狗分类的影响越一致,就表明这些不同图片上被遮挡的部件越存在紧密联系。表 1 中我们对比了遮挡 左眼,右眼,鼻子的 Δ 比随机遮挡的 Δ 更低,说明眼睛图片和鼻子图片内部存在相关性。第 5 层鼻子和眼睛的得分差异明显,说明第 5 层卷积网对部件级(鼻子,眼睛等等)的相关性更为关注;第 7 层各个部分得分差异不大,说明第 7 层卷积网络开始关注更高层的信息(狗狗的品种等等)。

  图 3 本文使用 8 层卷积网络模型。输入层为 224224 的 3 通道 RGB 图像,从原始图像裁剪产生。层 1 包含了 96 个卷积核(红色表示),每个核大小为 77 ,x 和 y 方向的跨度均为 2。获得的卷积图进行如下操作:1 通过矫正函数 relu(x) = max(0, x),使所有卷积值均不小于 0(图中未显示);2 进行 max pooling 操作(33 区域,跨度为 2);3 对比度归一化操作。最终产生 96 个不同的特征模板,大小为 5555。层 2, 3, 4, 5 都是类似操作,层 5 输出 256 个 6*6 的特征图。最后两层网络为全连接层,最后层是一个 C 类 softmax 函数,C 为类别个数。所有的卷积核与特征图均为正方形。

  图 4 模型特征逐层演化过程。从左至右的块,依次为层 1 到层 5 的重构特征。块展示在随机选定一个具体输出特征时,计算所得的重构输入特征在第 1, 2, 5, 10, 20, 30, 40, 64 次迭代时(训练集所有图片跑 1 遍为 1 次迭代),是什么样子(1 列为 1 组)。显示效果经过了人工色彩增强。

  图 5 图像的垂直移动,尺度变换,旋转以及卷积网络模型中相应的特征不变性。列 1:对图像进行各种变形;列 2 和列 3:原始图片和变形图片分别在层 1~层 7 所产生特征间的欧式距离。列 4:真实类别在输出中的概率。

  图 6 (a)层 1 输出的特征,还未经过尺度约束操作,可以看到有一个特征十分巨大;(b)(Krizhevsky et al 2012)第 1 层产生的特征;(c)本文模型第 1 层产生的特征。更小的跨度(2 vs 4),更小的核尺寸(7×7 vs 11×11)从而产生了更具辨识度的特征和更少的“无用特征”;(d):(Krizhevsky et al 2012)第二层产生的特征;(e)本文模型第二层产生的特征,很明显,没有(d)中的模糊特征。

  图 7:输入图片被遮挡时的情况。灰度方块遮挡了不同区域(第 1 列),会对第 5 层的输出强度产生影响(b 和 c),分类结果也发生改变(d 和 e)(b):图像遮挡位置对第 5 层特定输出强度的影响。(c)将第 5 层特定输出特征投影到像素空间的情形(带黑框的),第 1 行展示了狗狗图片产生的最强特征。当存在遮挡时,对应输入图片对特征产生的刺激强度降低(蓝色区域表示降低)。(d)正确分类对应的概率,是关于遮挡位置的函数,当小狗面部发生遮挡时,波西米亚小狗的概率急剧降低。(e)最可能类的分布图,也是一个关于遮挡位置的函数。在第 1 行中,只要遮挡区域不在狗狗面部,输出结果都是波西米亚小狗,当遮挡区域发生在狗狗面部但有没有遮挡网球时,输出结果是“网球”。在第 2 行中,车上的纹理是第 5 层卷积网络的最强输出特征,但也很容易被误判为“车轮”。第三行包含了多个物体,第 5 层卷积网对应的最强输出特征是人脸,但分类器对“狗狗”十分敏感(d)中的蓝色区域,原因在于 Softmax 分类器使用了多组特征(即有人的特征,又有狗的特征)。

  图 8 其他用于遮挡实验的图片,第 1 列:原始图片,第 2,3,4 列;遮挡分别发送在右眼,左眼和鼻子部位;其余列显示了随机遮挡。


5.1 ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.

Using the exact architecture specified in (Krizhevsky et al., 2012), we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7×7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset11This performance has been surpassed in the recent Imagenet 2013 competition (http://www.image-net.org/challenges/LSVRC/2013/results.php). (despite only using the 2012 training set). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

  该图像库共包含了(130 万/5 万/10 万)张(训练/确认/测试)样例,种类数超过 1000。表 2 显示了本文模型的测试结果。

  首先,本文重构了(Krizhevsky et al 2012)的模型,重构模型的错误率与作者给出的错误率十分一致,误差在 0.1% 以内,一次作为参考标准。

  而后,本文将第 1 层的卷积核大小调整为 7*7,将第 1 层和第 2 层卷积运算的步长改为 2,获得了相当不错的结果,与(Krizhevsky et al 2012)相比,我们的错误率为 14.8%,比(Krizhevsky et al 2012)的 15.3% 提高了 0.5 个百分点。

  表 2,ImageNet 2012 分类错误率,星号表示了使用 ImageNet2011 和 ImageNet2012 两个训练集

Varying ImageNet Model Sizes: In Table 3, we first explore the architecture of (Krizhevsky et al., 2012) by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6,7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small different to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3, we modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers goes give a useful gain in performance. But increasing these, while also enlarging the fully connected layers results in over-fitting.

改变卷积网络结构:如图 3 所示,本文测试了改变(Krizhevsky et al 2012)模型的结构会对最终分类造成什么样的影响,例如:调节隐藏层节点个数,或者将某隐含层直接删除等等。每种情况下,都将改变后的结构从头训练。当层 6, 7 被完全删除后,错误率只有轻微上升;删除掉两个隐含卷积层,错误率也只有轻微上升。然而当所有的中间卷积层都被删除后,仅仅只有 4 层的模型分类能力急剧下降。这个现象或许说明了模型的深度与分类效果密切相关,深度越大,效果越好。改变全连接层的节点个数对分类性能影响不大;扩大中间卷积层的节点数对训练效果有提升,但也同时加大了全连接出现过拟合的可能。


The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 (Fei-fei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.

  上面的实验显示了我们的 ImageNet 模型的卷积部分在获得最新性能方面的重要性。这由图 2 的可视化支持,它显示了卷积层中的复杂不变性。我们现在探索这些特征提取层的能力,以推广到其他数据集,即 Caltech-100(Feifei 等人, 2006),Caltech-256(Grifffi 等人, 2006)和 PASCAL VOC 2012。为此,我们将我们的 ImageNet 训练的模型的第 1~7 层固定并使用新数据集的训练图像在最上面训练一个新的 Softmax 分类器(使用适当的类别数)。由于 Softmax 包含的参数相对较少,因此可以从相对较小的示例中快速训练,如某些数据集的情况。

The classifiers used by our model (a softmax) and other approaches (typically a linear SVM) are of similar complexity, thus the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and train them, as well as the softmax, on the training images of the dataset.

  我们的模型(Softmax)和其他方法(通常是线性 SVM)使用的分类器具有相似的复杂性,因此实验将我们从 ImageNet 学习到的特征表示与其他方法使用的手工标注的特征进行比较。需要注意的是,我们的特征表示和手工标注的特征都是使用 Caltech 和 PASCAL 训练集的图像设计的。例如,HOG 描述中的超参数是通过对行人数据集进行系统实验确定的(Dalal & Triggs,2005)。我们还尝试了第二种从头开始训练模型的策略,即将层 1~7 重新设置为随机值,并在数据集的训练图像上训练他们以及 Softmax。

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few ‘‘overlap’’ images2 and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.

2For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.

  其中一个复杂因素是 Caltech 数据集中有一些图像也是在 ImageNet 训练数据中。使用归一化相关性,我们确定了这些“重叠”图像,并将其从我们的 ImageNet 训练集中移除,然后重新训练我们的 ImageNet 模型,从而避免训练/测试 污染的可能性。

Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training and test on up to 50 images per class reporting the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.

Caltech-101:我们遵循(Fei-fei 等 ,2006)的程序,每类随机选择 15 或 30 幅图像进行训练,每类测试 50 幅图像,表 4 中报告了每类准确度的平均值,使用 5 次训练/测试折叠。训练 30 张图像/类的数据用时 17 分钟。预先训练的模型在 30 幅图像/类上的结果比(Bo et al 2013)的成绩提高 2.2%,然而,从零开始训练的 Convnet 模型非常糟糕,只能达到 46.5%。说明基于 ImageNet 学习到的特征更有效。

Caltech-256: We follow the procedure of (Griffin et al., 2006), selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 5. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. (Bo et al., 2013) by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 9, we explore the “one-shot learning” (Fei-fei et al., 2006) regime. With our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

Caltech-256:遵循(Griffin et al 2006)的测试方法进行测试,为每个类选择 15, 30, 45,或 60 个训练图片,结果如表 5 所示,基于 ImageNet 预先学习的模型在每类 60 训练图像上以准确率高出 19% (74.2% VS 55.2%)的巨大优势击败了历史最好的成绩。图 9 从另一个角度(一次性学习)描述了基于 ImageNet 预先学习模型的成功,只需要 6 张 Caltech-256 训练图像即可击败使用 10 倍图像训练的先进方法,这显示了 ImageNet 特征提取器的强大功能。

PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 6 shows the results on the test set. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading (Yan et al., 2012) result, however we do beat them on 5 classes, sometimes by large margins.

PASCAL 2012:我们使用标准的训练和验证图像在 ImageNet 预训练的网络顶部训练 20 类 Softmax。这并不理想,因为 PASCAL 图像可以包含多个对象,而我们的模型仅为每个图像提供一个单独的预测。表 6 显示了测试集的结果。PASCAL 和 ImageNet 的图像本质上是完全不同的,前者是完整的场景而后者不是。这可能解释我们的平均表现比领先(Yan et al 2012)的结果低 3.2%,但是我们确实在 5 类的结果上击败了他们,有时甚至大幅度上涨。


We explore how discriminative the features in each layer of our Imagenet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and place either a linear SVM or softmax classifier on top. Table 7 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

  我们研究了 ImageNet 预训练模型在每一层如何区分特征的。我们通过改变 ImageNet 模型中的层数来完整这一工作,并将线性 SVM 或者 Softmax 分类器置于顶层。表 7 显示了 Caltech-101 和 Caltech-256 的结果。对于这两个数据集,随着我们提升模型,可以看到一个稳定的改进,使用所有层可以获得最佳结果。该结果证明了:当深度增加时,网络可学到更好的特征。


We explored large convolutional neural network models, trained for image classification, in a number ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualization can be used to debug problems with the model to obtain better results, for example improving on Krizhevsky et al. ’s (Krizhevsky et al., 2012) impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model’s performance.

  我们以多种方式探索了大型神经网络模型,并对图像分类进行了训练。首先,我们提出了一种新颖的方式来可视化模型中的活动。这揭示了这些特征远不是随机的,不可解释的模式。相反,提升模型时,他们显示出许多直观上令人满意的属性,例如组合性,增强不变性和类别区别等。我们还展示了如何使用这些可视化来调试模型的问题以获得更好的结果,例如改进(Krizhevsky et al 2012)令人印象深刻的 ImageNet 2012 结果。


Finally, we showed how the ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. This result brings into question to utility of benchmarks with small (i.e. <104) training sets. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias (Torralba & Efros, 2011), although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle the object detection as well.

  最后我们展示了 imageNet 训练模型如何很好地推广到其他数据集。对于 Caltech-101 和 Clatech-256,数据集足够相似,以至于我们可以击败最好的结果,在后一种情况下,结果有一个显著的提高。这个结果带来了对具有小(数量级 1000)训练集的基准效用的问题。我们的 ConvNet 模型对 PASCAL 数据的推广程度较差,这可能源于数据集偏差(Torrarba & Efors, 2011),虽然它仍处理报告的最佳结果的 3.2% 之内,尽管没有调整任务。例如,如果允许对每个图像的多个对象的使用不同的损失函数,我们的性能可能会提高。这自然也能使网络能够很好地处理对象检测。