[Paper translation] deconvnet / ZFNet: Visualizing and Understanding Convolutional Networks


Matthew D.Zeiler Rob Fergus

Abstract:

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

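  In recent years, large convolutional network models have shown impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012), yet there is no clear understanding of why they work so well or how they might be improved. This paper addresses both questions. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used as a diagnostic tool, these visualizations let us find architectures that outperform Krizhevsky et al. (AlexNet) on the ImageNet classification benchmark. We also perform an ablation study to measure how much each layer contributes to performance. Finally, we show that our ImageNet model generalizes well to other datasets: once the softmax classifier is retrained, it convincingly beats the current best results on Caltech-101 and Caltech-256.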

1 Introduction

Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

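  Since LeCun et al. introduced convolutional networks (convnets, or CNNs) in 1989, they have performed very well through the 1990s on tasks such as handwritten digit classification and face detection. In the past year, several papers have shown strong results on harder visual classification tasks: Ciresan et al. (2012) reached state-of-the-art performance on NORB and CIFAR-10, and most notably Krizhevsky et al. (2012) won the ImageNet 2012 classification benchmark with an error rate of 16.4%, against 26.1% for second place. Several factors explain this renewed interest in convnets:
-(i) much larger training sets, with millions of labeled examples;
-(ii) powerful GPU implementations, which make training very large models practical;
-(iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).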

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.


Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

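  Visualizing features to build intuition about a network is common practice, but it is mostly limited to the first layer, where projection to pixel space is straightforward. For higher layers, few methods exist for interpreting activity. Erhan et al. (2009) find the optimal stimulus for each unit by gradient descent in image space to maximize the unit's activation; this needs careful initialization and says nothing about the unit's invariances. Motivated by that shortcoming, Le et al. (2010), extending an idea of Berkes & Wiskott (2006), compute the Hessian of a unit numerically around its optimal response, which gives some insight into invariances. For higher layers, however, the invariances are too complex to be captured by a simple quadratic approximation. Our approach instead gives a non-parametric view of invariance by showing which training-set patterns activate a feature map. Donahue et al. (2013) show visualizations that identify the dataset patches responsible for strong activations in higher layers. Our visualizations differ in that they are not just crops of the input images but top-down projections that reveal the structures within each patch that stimulate a particular feature map.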


We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image xi, via a series of layers, to a probability vector ^yi over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x)=max(x,0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
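  Throughout the paper we use the standard fully supervised convnet models defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). Through a series of layers, the model maps a 2D color input image to a probability vector over the C classes. Each layer consists of:
1 Convolution: the previous layer's output (or the input image, for layer 1) is convolved with a set of learned filters;
2 Rectification: each response is passed through relu(x) = max(x, 0);
3 [optional] Max pooling over local neighborhoods, which gives a downsampled map;
4 [optional] Local contrast normalization, which normalizes the responses across feature maps. For details, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009).
The top few layers of the network are conventional fully-connected layers and the final layer is a softmax classifier. Fig. 3 shows the model.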
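The layer structure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the non-overlapping 2x2 pooling default, and the omission of contrast normalization are all assumptions made to keep the example short.

```python
import numpy as np

def conv2d_valid(image, filters, stride=1):
    """Cross-correlate an (H, W, C) image with (K, kh, kw, C) filters, no padding."""
    H, W, _ = image.shape
    K, kh, kw, _ = filters.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            # Sum over the filter's height, width and channel axes.
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def relu(x):
    """Rectified linear function: relu(x) = max(x, 0)."""
    return np.maximum(x, 0)

def max_pool(x, size=2, stride=2):
    """Max pooling over local size x size neighborhoods of an (H, W, K) array."""
    H, W, K = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * stride:i * stride + size, j * stride:j * stride + size, :]
            out[i, j, :] = region.max(axis=(0, 1))
    return out

def layer_forward(image, filters, conv_stride=1):
    """One layer: convolution, then rectification, then max pooling."""
    return max_pool(relu(conv2d_valid(image, filters, conv_stride)))
```

Stacking several calls to the hypothetical `layer_forward`, each taking the previous output as its input, gives the convolutional part of the network; the fully-connected layers and the softmax would follow.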

We train these models using a large set of N labeled images {x,y}, where label yi is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ^yi and yi. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.

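  We train the model on a set of N labeled images {x, y}, where the label yi is a discrete variable giving the true class. A cross-entropy loss compares the predicted ^yi with the true yi. The network parameters (convolutional filters, fully-connected weight matrices, and biases) are trained by back-propagating the derivative of the loss through the network and are updated with stochastic gradient descent; see Section 3 for details.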
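As a rough illustration of the training objective, the sketch below computes a softmax cross-entropy loss and applies one SGD-with-momentum update. The function names are hypothetical, and the default learning rate and momentum simply reuse the values quoted in the training details (10^-2 and 0.9); the exact form of the authors' update rule is an assumption.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax of an (N, C) array, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each example."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels])))

def sgd_momentum_step(param, grad, velocity, lr=1e-2, momentum=0.9):
    """One classical momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

With uniform predictions over C classes the loss equals log(C), which is a handy sanity check at the start of training.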


2.1 Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

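  Understanding a convnet requires interpreting the feature activity in its intermediate layers. We map these activities back to input pixel space to show which input pattern caused a given activation. The mapping uses the Deconvolutional Network (deconvnet) of (Zeiler et al., 2011). A deconvnet can be seen as a convnet run in reverse: it uses the same components (filtering, pooling; strictly, their approximate inverses) but maps features to pixels instead of pixels to features. In (Zeiler et al., 2011) deconvnets were used for unsupervised learning; here they involve no learning and serve only as a probe of an already trained convnet.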

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

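  To examine a convnet, a deconvnet is attached to each of its layers (Fig. 1, top), giving a continuous path from features back to image pixels. An input image is first passed through the convnet and the features at every layer are computed. To examine a given activation, we set all other activations in that layer to zero and feed the feature maps to the attached deconvnet layer. We then successively (1) unpool, (2) rectify, and (3) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation, and repeat until input pixel space is reached.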

Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

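Unpooling: Strictly speaking, max pooling is not invertible. We approximate its inverse by recording, during pooling, the location of the maximum within each pooling region in a table of "switches". During unpooling, each value is placed back at its recorded location and all other positions are set to 0. Fig. 1 (bottom) illustrates the procedure.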

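  Put simply: during the forward pass, the position of the maximum inside each z×z pooling region is recorded; when unpooling, the maximum is returned to that position and every other position is filled with 0.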
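The switch mechanism can be made concrete with a small NumPy sketch. It assumes a single 2D feature map and non-overlapping pooling regions, which is a simplification chosen for clarity; the function names are hypothetical.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling of a 2D map.

    Returns the pooled map and the switches: for each pooling region,
    the flat (row-major) index of the maximum inside that region.
    """
    out_h, out_w = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((out_h, out_w))
    switches = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = np.argmax(region)
            pooled[i, j] = region.max()
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Approximate inverse: put each value back at its recorded location, zeros elsewhere."""
    out_h, out_w = pooled.shape
    out = np.zeros((out_h * size, out_w * size))
    for i in range(out_h):
        for j in range(out_w):
            di, dj = divmod(int(switches[i, j]), size)
            out[i * size + di, j * size + dj] = pooled[i, j]
    return out
```

Because the switches come from one particular input image, unpooling the same pooled values with switches from a different image would place them at different locations.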

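Rectification: The convnet uses relu non-linearities so that the feature maps are always non-negative. The same constraint applies to the reconstructions, so the reconstructed signal at each layer is also passed through a relu.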


Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward to the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

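  Filtering: the feature maps are convolved with the transposed versions of the original filters. "Deconvolution" is a misleading name; the operation is really a transposed convolution.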
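The link between "transposed filters" and "flipped filters" can be checked numerically. The sketch below uses a 1D signal for brevity (in 2D the flip is applied both vertically and horizontally); the helper name `correlation_matrix` and the example values are made up for illustration.

```python
import numpy as np

def correlation_matrix(f, n):
    """Matrix A such that A @ x is the 'valid' cross-correlation of x (length n) with f."""
    k = len(f)
    A = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        A[i, i:i + k] = f
    return A

f = np.array([1.0, 2.0, 3.0])             # a learned filter (toy values)
x = np.array([1.0, 0.0, -1.0, 2.0, 0.5])  # input signal (toy values)

A = correlation_matrix(f, len(x))
y = A @ x                                  # forward filtering in the convnet

# Going back down: apply the transpose of the forward operator ...
back_transpose = A.T @ y
# ... which matches a 'full' cross-correlation with the flipped filter.
back_flipped = np.correlate(y, f[::-1], mode="full")
```

Both `back_transpose` and `back_flipped` hold the same values, which is why applying the transposed filters amounts in practice to flipping each filter.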

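  Projecting down from higher layers uses the switch settings generated by max pooling on the way up. Because the switches record only the locations of the maxima for one particular input image and all other positions are filled with 0, the reconstruction from a single activation resembles a small piece of the original image, with structures weighted by their contribution to the feature activation. Since the model is trained discriminatively, these pieces implicitly show which parts of the input are discriminative. The projections are not samples from the model, because no generative process is involved.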

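  Figure 1: Top: a deconvnet layer (left) attached to a convnet layer (right). The deconvnet reconstructs an approximate version of the convnet features from the layer beneath. Bottom: illustration of the unpooling operation in the deconvnet, which uses switches that record the location of the maximum in each pooling region to approximately restore the features from before pooling.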


We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

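  The model in Fig. 3 is similar to the convnet of (Krizhevsky et al., 2012), with two differences: (1) Krizhevsky's layers 3, 4, 5 use sparse connections (because the model was split across 2 GPUs), whereas ours use dense connections; (2) layers 1 and 2 were changed after inspecting the visualizations in Fig. 6, as described in Section 4.1.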

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.

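  The model was trained on the ImageNet 2012 training set (1.3 million images over 1000 classes). Each RGB image was resized so that its smallest dimension is 256, the central 256x256 region was cropped, the per-pixel mean (computed across all images) was subtracted, and 10 sub-crops of size 224x224 were taken (four corners and center, with and without horizontal flips). Training used stochastic gradient descent with a mini-batch size of 128, a learning rate starting at 0.01, and momentum 0.9; the learning rate was annealed manually when the validation error plateaued. Dropout (Hinton et al., 2012) with rate 0.5 was applied in the fully connected layers (6 and 7). All weights were initialized to 0.01 and biases to 0.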
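The 10 sub-crops mentioned above (four corners plus center, each with and without a horizontal flip) can be sketched as follows. The function name `ten_crops` and the output ordering are assumptions; the input is taken to be an already resized and center-cropped 256x256 image with the mean subtracted.

```python
import numpy as np

def ten_crops(image, crop=224):
    """Return 10 crops of an (H, W, C) image: 4 corners + center, each with its mirror."""
    H, W = image.shape[:2]
    offsets = [
        (0, 0),                             # top-left
        (0, W - crop),                      # top-right
        (H - crop, 0),                      # bottom-left
        (H - crop, W - crop),               # bottom-right
        ((H - crop) // 2, (W - crop) // 2)  # center
    ]
    crops = []
    for top, left in offsets:
        patch = image[top:top + crop, left:left + crop]
        crops.append(patch)
        crops.append(patch[:, ::-1])        # horizontal flip
    return np.stack(crops)
```

Crops are stored in pairs, so index 2k is a plain crop and index 2k+1 is its horizontally flipped copy.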

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).

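  Visualizing the first-layer filters during training (Fig. 6(a)) shows that a few of them dominate. To counter this, any convolutional filter whose RMS value exceeds a fixed radius of 0.1 is renormalized to that radius. This step is crucial, especially in the first layer, where the inputs lie roughly in [-128, 128]. As in (Krizhevsky et al., 2012), multiple crops and flips of each training example are used to enlarge the training set. Training was stopped after 70 epochs, which took about 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).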
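A minimal sketch of the filter renormalization step, assuming the filters are stored as an array whose first axis indexes the filter. The function name and the small epsilon guarding against division by zero are assumptions.

```python
import numpy as np

def renormalize_filters(filters, radius=0.1):
    """Rescale every filter whose RMS value exceeds `radius` so that its RMS equals `radius`.

    `filters` has shape (K, ...); filters already within the radius are left unchanged.
    """
    flat = filters.reshape(filters.shape[0], -1)
    rms = np.sqrt(np.mean(flat ** 2, axis=1))
    scale = np.where(rms > radius, radius / np.maximum(rms, 1e-12), 1.0)
    # Broadcast one scale factor per filter over the remaining axes.
    return filters * scale.reshape((-1,) + (1,) * (filters.ndim - 1))
```

In a training loop this would plausibly be applied to the convolutional filters after each parameter update.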


Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.


The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).


Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

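  Feature evolution during training: Fig. 4 shows how the strongest activation (across all training examples) of a given feature map, projected back to pixel space, evolves during training. Sudden jumps in appearance occur when the image producing the strongest activation changes. The lower layers converge within a few epochs, whereas the upper layers develop only after a considerable number of epochs (40-50), which shows the need to train the model until it has fully converged.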

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for object with rotational symmetry (e.g. entertainment center).


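  Figure 2: Visualization of features in a fully trained model. For layers 2-5, the top 9 activations of a feature map are projected down to pixel space with the deconvnet. The reconstructions are not samples from the model: they are fixed patterns computed from the activations of a given feature map. Each reconstruction is shown with its corresponding image patch. Three observations: (1) the 9 reconstructions in each group are strongly related; (2) invariance increases at higher layers; (3) the reconstructions exaggerate the discriminative parts of the image, e.g. the eyes and noses of dogs (layer 4, row 1, col 1).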


While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al. ’s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.

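  Visualizing a trained model not only gives insight into its operation but can also help in selecting a good architecture in the first place. Visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) and (d)) reveals several problems: the first-layer filters mix extremely high- and low-frequency information with little coverage of the mid frequencies, and the second-layer visualization shows aliasing artifacts caused by the large stride of 4 in the first-layer convolutions. To remedy these problems, we:
-(i) reduced the first-layer filter size from 11×11 to 7×7;
-(ii) changed the convolution stride from 4 to 2.
The new architecture retains much more information in the first- and second-layer features, as shown in Fig. 6(c) and (e), and also improves classification performance, as shown in Section 5.1.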



With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.
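The occlusion experiment can be sketched as a sliding-window loop. Everything specific here is an assumption for illustration: `predict_proba` stands for any function that maps an image to a vector of class probabilities, and the square size, stride, and gray value are placeholder defaults rather than the settings used in the paper.

```python
import numpy as np

def occlusion_map(image, predict_proba, true_class, square=32, stride=16, fill=128.0):
    """Slide a gray square over the image and record the true-class probability.

    Returns a 2D array with one entry per occluder position; low values mark
    regions the classifier depends on.
    """
    H, W = image.shape[:2]
    tops = range(0, H - square + 1, stride)
    lefts = range(0, W - square + 1, stride)
    heat = np.zeros((len(tops), len(lefts)))
    for a, top in enumerate(tops):
        for b, left in enumerate(lefts):
            occluded = image.copy()
            occluded[top:top + square, left:left + square] = fill
            heat[a, b] = predict_proba(occluded)[true_class]
    return heat
```

Replacing `predict_proba(occluded)[true_class]` with the summed activation of one feature map would give the corresponding feature-activity map as a function of occluder position.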

Figure 7: Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability for “pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is “pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.


Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image i, we then compute:


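$$\epsilon_i^l = x_i^l - \hat{x}_i^l$$

  where $x_i^l$ and $\hat{x}_i^l$ are the feature vectors at layer l for the original and occluded images respectively. We then measure the consistency of this difference vector $\epsilon$ between all image pairs (i, j):

$$\Delta_l = \sum_{i,j=1,\, i \neq j}^{5} H\left(\operatorname{sign}(\epsilon_i^l),\ \operatorname{sign}(\epsilon_j^l)\right)$$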

H is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the Δ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer l=5 and l=7. The lower score for these parts, relative to random object regions, for the layer 5 features show the model does establish some degree of correspondence.

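  Here H is the Hamming distance. The smaller Δl is, the more consistent the change caused by the masking operation, and hence the tighter the correspondence between the same object parts in different images. Table 1 compares the Δ score for the left eye, right eye and nose with that for random parts of the object: the lower scores for the facial parts indicate that the model establishes some degree of correspondence. At layer 5 the gap between the facial parts and random regions is clear, suggesting that layer 5 captures part-level correspondence (nose, eyes, etc.); at layer 7 the scores are more similar, suggesting that layer 7 attends to higher-level information such as the dog's breed.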
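The consistency score can be sketched directly from its description: take the difference between the original and occluded feature vectors of each image, keep only the signs, and sum the Hamming distances over all ordered pairs of distinct images. The function name and the array layout (one row per image) are assumptions.

```python
import numpy as np

def correspondence_score(orig_feats, occluded_feats):
    """Sum of Hamming distances between the sign patterns of the feature changes.

    Both inputs have shape (n_images, n_features). A lower score means that
    occlusion changes the features of the different images more consistently.
    """
    eps = orig_feats - occluded_feats   # change caused by the occlusion
    signs = np.sign(eps)
    n = signs.shape[0]
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Hamming distance: number of positions where the signs differ.
                total += int(np.sum(signs[i] != signs[j]))
    return total
```

Note that the double loop counts each unordered pair twice, following the sum over all i, j with i ≠ j.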

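  Figure 3: Architecture of our 8-layer convnet model. The input is a 224×224 crop of an image with 3 RGB channels. Layer 1 convolves it with 96 filters (red), each of size 7×7, using a stride of 2 in both x and y. The resulting feature maps are then (1) passed through the rectified linear function relu(x) = max(0, x) (not shown), (2) max pooled (within 3×3 regions, using stride 2), and (3) contrast normalized, giving 96 feature maps of size 55×55. Similar operations are repeated in layers 2, 3, 4, 5; layer 5 outputs 256 feature maps of size 6×6. The last two layers are fully connected, and the final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square.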

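  Figure 4: Evolution of model features through training, layer by layer. The blocks from left to right show the reconstructions for layers 1 to 5. Within each block, the reconstruction for a randomly chosen feature is shown at epochs 1, 2, 5, 10, 20, 30, 40, 64 (one epoch is one pass over all training images; one column per feature). The display has been artificially enhanced.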

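  Figure 5: Vertical translation, scale, and rotation invariance within the model. Col 1: images undergoing the transformations. Cols 2 and 3: Euclidean distance between the feature vectors of the original and transformed images in layers 1 and 7 respectively. Col 4: probability of the true class in the output.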

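  Figure 6: (a) Layer 1 features without feature scale clipping; one feature dominates. (b) Layer 1 features from (Krizhevsky et al., 2012). (c) Our layer 1 features: the smaller stride (2 vs 4) and filter size (7×7 vs 11×11) give more distinctive features and fewer useless ones. (d) Layer 2 features from (Krizhevsky et al., 2012). (e) Our layer 2 features, which clearly show none of the artifacts visible in (d).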

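  Figure 7: Occlusion experiments. A gray square covers different portions of the scene (1st column), which changes the layer 5 feature maps ((b) and (c)) and the classifier output ((d) and (e)). (b): total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image) as a function of occluder position. (c): this feature map projected down into the input image (black square), along with visualizations of this map from other images. In the first row the strongest feature is the dog's face; when it is covered, the activity in the feature map drops (blue area in (b)). (d): probability of the correct class as a function of occluder position; when the dog's face is obscured, the probability of "pomeranian" drops sharply. (e): the most probable label as a function of occluder position. In the 1st row it is "pomeranian" for most locations, but when the dog's face is obscured and the ball is not, the prediction is "tennis ball". In the 2nd row, the text on the car is the strongest layer 5 feature, but the classifier is most sensitive to the wheel. The 3rd row contains multiple objects: the strongest layer 5 feature picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), because it uses multiple feature maps.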

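  Figure 8: Images used for the occlusion (correspondence) experiments. Col 1: original image. Cols 2, 3, 4: occlusion of the right eye, left eye, and nose respectively. The remaining columns show random occlusions.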


5.1 ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.

Using the exact architecture specified in (Krizhevsky et al., 2012), we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7×7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset1 (despite only using the 2012 training set). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

1This performance has been surpassed in the recent Imagenet 2013 competition (http://www.image-net.org/challenges/LSVRC/2013/results.php).

  The dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.

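  First, we replicate the model of (Krizhevsky et al., 2012) using their exact architecture. Our error rate on the ImageNet 2012 validation set is within 0.1% of their reported value, and this replication serves as the reference point.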

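  Next, we apply the architectural changes of Section 4.1 (7×7 filters in layer 1 and stride 2 convolutions in layers 1 and 2). The resulting model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single-model result by 1.7% (test top-5). Combining multiple models gives a test error of 14.8%, which is 0.5 percentage points lower than the 15.3% of (Krizhevsky et al., 2012).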

  Table 2: ImageNet 2012 classification error rates. The asterisk marks models trained on both the ImageNet 2011 and ImageNet 2012 training sets.

Varying ImageNet Model Sizes: In Table 3, we first explore the architecture of (Krizhevsky et al., 2012) by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6,7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3, we modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers results in over-fitting.

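Varying ImageNet model sizes: In Table 3, we first test how changes to the architecture of (Krizhevsky et al., 2012) affect classification, by adjusting the size of layers or removing them entirely. In each case the revised model is trained from scratch. Removing the fully connected layers (6, 7) increases the error only slightly, as does removing two of the middle convolutional layers. Removing both the middle convolutional layers and the fully connected layers, however, leaves a model with only 4 layers whose performance is dramatically worse, which suggests that overall depth is important for good performance. For our own model, changing the size of the fully connected layers makes little difference; enlarging the middle convolutional layers gives a useful gain, but doing so while also enlarging the fully connected layers leads to over-fitting.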


The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 (Fei-fei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.

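  The experiments above show the importance of the convolutional part of our ImageNet model for state-of-the-art performance, which is supported by the visualizations in Fig. 2 of the complex invariances learned in the convolutional layers. We now explore how well these feature extraction layers generalize to other datasets, namely Caltech-101 (Fei-fei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012. To do this, we keep layers 1-7 of the ImageNet-trained model fixed and train a new softmax classifier on top (with the appropriate number of classes) using the training images of the new dataset. Because the softmax has relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for some of these datasets.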

The classifiers used by our model (a softmax) and other approaches (typically a linear SVM) are of similar complexity, thus the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and train them, as well as the softmax, on the training images of the dataset.

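  The classifiers used by our model (a softmax) and by other approaches (typically a linear SVM) are of similar complexity, so the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. Note that both our representation and the hand-crafted features were designed using images beyond the Caltech and PASCAL training sets; for example, the hyper-parameters of HOG descriptors were chosen through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch: layers 1-7 are reset to random values and trained, together with the softmax, on the training images of the dataset.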

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few ‘‘overlap’’ images2 and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.

2For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.

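  One complication is that some Caltech images also appear in the ImageNet training data. Using normalized correlation, we identified these "overlap" images, removed them from our ImageNet training set, and retrained our ImageNet models, which avoids any train/test contamination.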

Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training and test on up to 50 images per class reporting the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.

Caltech-101:我们遵循(Fei-fei et al., 2006)的流程,每类随机选择 15 或 30 幅图像进行训练,每类最多用 50 幅图像进行测试,表4中报告了各类准确率的平均值,采用 5 组训练/测试划分。每类 30 幅图像时训练用时 17 分钟。预训练模型在每类 30 幅图像上的结果比(Bo et al., 2013)报告的最好成绩高 2.2%。然而,从零开始训练的卷积网络模型表现非常糟糕,只能达到 46.5%。这说明基于 ImageNet 学习到的特征更有效。
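The evaluation protocol above (a fixed number of training images per class, up to 50 test images per class, and the mean of per-class accuracies) can be sketched as follows. The helper names and the seeding scheme are hypothetical; the paper only specifies the counts and the use of 5 folds.

```python
import random
from collections import defaultdict

def split_per_class(labels, n_train, max_test=50, seed=0):
    """Randomly pick n_train indices per class for training and up to
    max_test of the remaining indices per class for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:n_train + max_test])
    return train, test

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the accuracies computed separately for each class,
    so that large classes do not dominate the score."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy label list: class 0 has 40 images, class 1 has 100.
toy_labels = [0] * 40 + [1] * 100
train_idx, test_idx = split_per_class(toy_labels, n_train=30)

# A classifier that always answers 0 gets 80% plain accuracy here,
# but only 50% mean per-class accuracy.
score = mean_per_class_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
```

Repeating `split_per_class` with five different seeds and averaging the resulting scores would be one way to obtain the 5-fold figure.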

Caltech-256: We follow the procedure of (Griffin et al., 2006), selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 5. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by (Bo et al., 2013) by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 9, we explore the "one-shot learning" (Fei-fei et al., 2006) regime. With our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

Caltech-256:我们遵循(Griffin et al., 2006)的流程,每类选择 15、30、45 或 60 幅训练图像,表5报告了各类准确率的平均值。我们基于 ImageNet 预训练的模型以显著优势超过了(Bo et al., 2013)取得的当前最好结果:在每类 60 幅训练图像时为 74.2% 对 55.2%。然而,与 Caltech-101 一样,从头训练的模型表现不佳。在图9中,我们探究了“一次性学习”(Fei-fei et al., 2006)的情形:使用我们的预训练模型,只需 6 幅 Caltech-256 训练图像即可击败使用 10 倍图像数量的领先方法。这显示了 ImageNet 特征提取器的强大能力。

PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 6 shows the results on the test set. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading (Yan et al., 2012) result; however, we do beat them on 5 classes, sometimes by large margins.

PASCAL 2012:我们使用标准的训练和验证图像,在 ImageNet 预训练的卷积网络之上训练一个 20 类 Softmax。这并不理想,因为 PASCAL 图像可以包含多个对象,而我们的模型只为每幅图像给出一个互斥的单一预测。表6显示了测试集上的结果。PASCAL 和 ImageNet 的图像本质上差别很大,前者是完整的场景,而后者不是。这或许可以解释我们的平均表现比领先的(Yan et al., 2012)结果低 3.2%;不过我们确实在 5 个类别上击败了他们,有时优势还很大。


We explore how discriminative the features in each layer of our ImageNet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and placing either a linear SVM or softmax classifier on top. Table 7 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

  我们研究了 ImageNet 预训练模型中每一层特征的区分能力。我们通过改变从 ImageNet 模型中保留的层数来完成这一工作,并将线性 SVM 或者 Softmax 分类器置于其上。表7显示了 Caltech-101 和 Caltech-256 的结果。对于这两个数据集,随着层数逐层上升,可以看到结果稳步提升,使用所有层可以获得最佳结果。这支持了如下前提:随着特征层次加深,网络学到的特征越来越强大。
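The layer-wise comparison described above amounts to truncating the network at each depth and scoring a classifier on the truncated features. A minimal sketch, assuming the network is a simple feed-forward list of callables and that the classifier is supplied by the caller through a hypothetical `fit_and_score` function (in the paper this role is played by a linear SVM or a softmax):

```python
def features_up_to(layers, x, depth):
    """Apply only the first `depth` layers of a feed-forward stack to x."""
    out = x
    for layer in layers[:depth]:
        out = layer(out)
    return out

def probe_depths(layers, inputs, labels, fit_and_score):
    """For each depth 1..len(layers), extract truncated features and
    record the score returned by the caller-supplied classifier."""
    results = {}
    for depth in range(1, len(layers) + 1):
        feats = [features_up_to(layers, x, depth) for x in inputs]
        results[depth] = fit_and_score(feats, labels)
    return results

# Toy two-layer "network" (hypothetical): doubling, then adding one.
toy_layers = [
    lambda v: [2 * a for a in v],
    lambda v: [a + 1 for a in v],
]
shallow = features_up_to(toy_layers, [1, 2], 1)
deep = features_up_to(toy_layers, [1, 2], 2)

# Placeholder scorer that just reports the feature dimension per depth.
dims = probe_depths(toy_layers, [[1, 2], [3, 4]], [0, 1],
                    lambda feats, labels: len(feats[0]))
```

Replacing the placeholder scorer with a real train-and-evaluate routine would yield one accuracy per depth, which is the shape of the comparison reported in Table 7.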


We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualizations can be used to debug problems with the model to obtain better results, for example improving on the impressive ImageNet 2012 result of (Krizhevsky et al., 2012). We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model's performance.

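  我们以多种方式探索了为图像分类而训练的大型卷积神经网络模型。首先,我们提出了一种新颖的方式来可视化模型中的活动。这揭示了这些特征远不是随机的、不可解释的模式。相反,随着层数上升,它们显示出许多直观上令人满意的属性,例如组合性、不断增强的不变性和类别区分能力。我们还展示了如何使用这些可视化来调试模型的问题以获得更好的结果,例如改进(Krizhevsky et al., 2012)令人印象深刻的 ImageNet 2012 结果。随后,我们通过一系列遮挡实验证明,该模型虽然是为分类而训练的,但对图像中的局部结构高度敏感,而不只是利用宽泛的场景上下文。对模型的消融研究表明,对模型性能至关重要的是网络具有一个最小深度,而不是其中任何单独的部分。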


Finally, we showed how the ImageNet-trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. This result brings into question the utility of benchmarks with small (i.e. < 10^4) training sets. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias (Torralba & Efros, 2011), although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle object detection as well.

  最后,我们展示了 ImageNet 训练的模型如何很好地泛化到其他数据集。对于 Caltech-101 和 Caltech-256,数据集足够相似,以至于我们可以击败已报告的最好结果,在后一种情况下优势十分显著。这个结果使人质疑小规模(即少于 10^4 个样本)训练集基准的效用。我们的卷积网络模型对 PASCAL 数据的泛化程度较差,这可能源于数据集偏差(Torralba & Efros, 2011),不过尽管没有针对该任务进行调优,它仍处于已报告最佳结果的 3.2% 之内。例如,如果使用一种允许每幅图像包含多个对象的不同损失函数,我们的性能可能会提高。这自然也能使网络处理目标检测任务。
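To make the remark about a loss that permits multiple objects per image concrete, the sketch below contrasts softmax cross-entropy with independent per-class sigmoid cross-entropy. The paper does not name a specific alternative loss, so the sigmoid form is only one plausible candidate, chosen here as an assumption.

```python
import math

def softmax_cross_entropy(scores, target):
    """Cross-entropy of a softmax over the scores against one target
    class index. Classes compete: probabilities must sum to 1."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def sigmoid_cross_entropy(scores, targets):
    """Sum of independent binary cross-entropies, one per class.

    `targets` is a 0/1 list, so several classes can be present at once.
    Uses the numerically stable form max(s, 0) - s*t + log(1 + exp(-|s|)).
    """
    loss = 0.0
    for s, t in zip(scores, targets):
        loss += max(s, 0.0) - s * t + math.log1p(math.exp(-abs(s)))
    return loss

# Hypothetical scores for an image containing both class 0 and class 1.
scores = [10.0, 10.0, -10.0]
single_label_loss = softmax_cross_entropy(scores, 0)
multi_label_loss = sigmoid_cross_entropy(scores, [1, 1, 0])
```

With these scores the softmax loss stays near log 2, because the two present classes are forced to share probability mass, whereas the per-class sigmoid loss is close to zero.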