[Paper translation] deconvnet ZFNet: Visualizing and Understanding Convolutional Networks

author

Matthew D. Zeiler, Rob Fergus
2013
http://arxiv.org/pdf/1311.2901v3.pdf

Abstract

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

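In recent years, large convolutional network models have shown impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012), yet there is no clear understanding of why they perform so well or how they might be improved. This paper addresses both questions. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used as a diagnostic tool, these visualizations let us find model architectures that outperform Krizhevsky et al. (the AlexNet model) on the ImageNet classification benchmark. We also perform an ablation study to measure how much each model layer contributes to performance. Finally, we show that our ImageNet model generalizes well to other datasets: once the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256.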

1. Introduction

Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

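Since LeCun et al. introduced convolutional networks (convnets) in 1989, they have performed very well on tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that convnets also do well on harder visual classification tasks: Ciresan et al. (2012) report state-of-the-art results on the NORB and CIFAR-10 datasets. Most notably, Krizhevsky et al. (2012) won the ImageNet 2012 classification benchmark by a wide margin, with an error rate of 16.4% against 26.1% for the second-place entry. Several factors explain this renewed interest in convnet models:
- (i) much larger training sets, with millions of labeled examples;
- (ii) powerful GPU implementations that make training very large models practical;
- (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).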

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.

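Despite this progress, there is still little insight into the internal operation of these complex models or into how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory: without a clear understanding of how and why they work, developing better models is reduced to trial and error. In this paper we use a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. (2011), to project feature activations back to the input pixel space. The resulting visualizations reveal which input stimuli excite individual feature maps at any layer, show how features evolve during training, and help diagnose potential problems with the model. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, which reveals the parts of the scene that matter most for classification.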

Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

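Visualizing features to gain intuition about a network is common practice, but it is mostly limited to the first layer, where projection to pixel space is straightforward. For higher layers there are only limited methods for interpreting activity. Erhan et al. (2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit's activation. This requires careful initialization and gives no information about the unit's invariances. Motivated by that shortcoming, Le et al. (2010), extending an idea by Berkes & Wiskott (2006), show how the Hessian of a given unit can be computed numerically around the optimal response, which gives some insight into invariances. The problem is that for higher layers the invariances are extremely complex and are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. Donahue et al. (2013) show visualizations that identify the patches within a dataset responsible for strong activations at higher layers of the model. Our visualizations differ in that they are not just crops of input images but top-down projections that reveal the structures within each patch that stimulate a particular feature map.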

2. Approach

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image x_i, via a series of layers, to a probability vector ŷ_i over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
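This paper uses the standard fully supervised convnet models defined by LeCun et al. (1989) and Krizhevsky et al. (2012). Through a series of layers, the model maps a 2D color input image to a probability vector of length C, with one probability for each of the C classes. Each layer consists of:
1. Convolution: the output of the previous layer (for the first layer, the input image) is convolved with a set of learned filters;
2. Rectification: each response is passed through a rectified linear function, relu(x) = max(x, 0);
3. [Optional] Max pooling: the rectified responses are max-pooled over local neighborhoods, producing a downsampled map;
4. [Optional] Local contrast normalization: the responses are normalized across feature maps. For more details on these operations, see Krizhevsky et al. (2012) and Jarrett et al. (2009).
5. The top few layers of the network are conventional fully-connected layers and the final layer is a softmax classifier. Fig. 3 shows this model.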
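
As a rough illustration of steps 1 to 3 in the list above, here is a minimal single-channel sketch in plain Python. It is a simplification under stated assumptions, not the paper's implementation: real layers have many input and output channels, the function names are made up for this example, the convolution is unpadded with stride 1, and contrast normalization is left out.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Rectified linear function, relu(x) = max(x, 0), applied element-wise."""
    return [[max(v, 0.0) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size neighborhoods."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def layer_forward(image, kernel, pool=2):
    """One simplified layer: filter, rectify, then pool."""
    return max_pool(relu(conv2d_valid(image, kernel)), pool)
```

For example, `relu([[-1.0, 2.0]])` zeroes the negative response and keeps the positive one.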

We train these models using a large set of N labeled images {x, y}, where label y_i is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ŷ_i and y_i. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.

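We train the model on a dataset of N labeled images {x, y}, where the label y_i is a discrete variable giving the true class. A cross-entropy loss compares the predicted ŷ_i with the true label y_i. The network parameters (filters in the convolutional layers, weight matrices in the fully-connected layers, and biases) are trained by back-propagation and updated with stochastic gradient descent; details are given in Section 3.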
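
To make the loss concrete, the sketch below computes a softmax probability vector and its cross-entropy against a true class index, plus the well-known gradient of that loss with respect to the pre-softmax scores. The function names and the use of raw "scores" as input are assumptions of this example; the paper does not give code.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])


def loss_gradient(probs, true_class):
    """Derivative of the cross-entropy w.r.t. the scores: probs minus one-hot label."""
    return [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]
```

The gradient is what back-propagation would pass down from the classifier into the layers beneath it.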

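The network is trained in a supervised manner with SGD. To understand a convnet, we need to interpret the feature activations in its intermediate layers; we use a deconvnet to map these activations back to the input pixel space.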

2.1 Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

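Understanding a convnet requires interpreting the feature activity in its intermediate layers. We map these activities back to the input pixel space to show which input pattern caused a given activation in the feature maps. The mapping is done with the Deconvolutional Network (deconvnet) of Zeiler et al. (2011). A deconvnet layer can be seen as the reverse of a convnet layer: it uses the same components (filtering, pooling; strictly speaking, their approximate inverses), so it maps features back to pixels instead of pixels to features. In Zeiler et al. (2011) deconvnets were proposed for unsupervised learning; here they are not used for learning at all, only as a probe of an already trained convnet.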

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

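To examine a convnet, a deconvnet is attached to each of its layers (see Fig. 1, top), providing a continuous path from the features back to the image pixels. First, an input image is presented to the convnet and features are computed throughout the layers. To examine a given activation, we set all other activations in that layer to zero and pass the feature maps as input to the attached deconvnet layer. We then successively (1) unpool, (2) rectify, and (3) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation, and repeat this until the input pixel space is reached.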
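
The first step, keeping one activation and zeroing the rest of the layer, can be sketched as follows. The nested-list layout (feature map, row, column) and both function names are assumptions made for this illustration.

```python
def strongest_location(fmap):
    """Return (row, col) of the largest value in one feature map."""
    best = max((v, r, c) for r, row in enumerate(fmap) for c, v in enumerate(row))
    return best[1], best[2]


def isolate_activation(feature_maps, map_index, row, col):
    """Copy the layer's feature maps with every activation zeroed except the chosen one."""
    isolated = [[[0.0 for _ in fmap_row] for fmap_row in fmap] for fmap in feature_maps]
    isolated[map_index][row][col] = feature_maps[map_index][row][col]
    return isolated
```

The isolated maps would then be handed to the attached deconvnet layer for unpooling, rectification, and filtering.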

Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

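Unpooling: Strictly speaking, max pooling is non-invertible. We obtain an approximate inverse by recording the location of the maximum within each pooling region in a set of switch variables ("switches"). During unpooling, each value is placed back at its recorded location and all other positions are filled with 0. See Fig. 1 (bottom) for an illustration.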

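Put simply, during the convnet's forward pass we record the position of the maximum input within each z×z pooling region; when unpooling, the value is returned to that position and every other position is set to 0.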
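
The switch mechanism can be sketched in a few lines of plain Python. This is an illustrative version with non-overlapping pooling regions on a single 2D map; the function names and the choice to store switches as (row, col) pairs are assumptions of the example.

```python
def max_pool_with_switches(fmap, size=2):
    """Max-pool a 2D map and record where each maximum came from."""
    pooled, switches = [], []
    for r in range(0, len(fmap) - size + 1, size):
        pooled_row, switch_row = [], []
        for c in range(0, len(fmap[0]) - size + 1, size):
            best_val, best_pos = fmap[r][c], (r, c)
            for i in range(size):
                for j in range(size):
                    if fmap[r + i][c + j] > best_val:
                        best_val, best_pos = fmap[r + i][c + j], (r + i, c + j)
            pooled_row.append(best_val)
            switch_row.append(best_pos)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value back at its recorded location; fill the rest with 0."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out
```

Because the switches depend on the input image, unpooling the same pooled values with switches from a different image would place them elsewhere.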

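Rectification: The convnet uses relu non-linearities so that the feature maps are always non-negative. The same constraint applies to the reconstructions, so the reconstructed signal is also passed through a relu.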

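Filtering: The convnet convolves the output of the previous layer with learned filters to obtain features. To invert this, the deconvnet uses the transposed versions of the same filters, applied to the rectified maps.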

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

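The feature map is convolved with the transpose of the original filter. "Deconvolution" is really a misleading name here; the operation is better described as transposed convolution.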
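
The link between "transposed filter" and "flipped filter" can be checked numerically. The sketch below is a single-channel illustration with made-up function names: the forward pass slides the kernel over the image without padding, and the transposed operation zero-pads the feature map and slides the kernel flipped vertically and horizontally. With several channels the transpose would also swap the roles of input and output channels, which this example leaves out.

```python
def correlate_valid(image, kernel):
    """Forward filtering: slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]


def flip(kernel):
    """Flip a kernel vertically and horizontally (a 180 degree rotation)."""
    return [row[::-1] for row in kernel[::-1]]


def transposed_filter(fmap, kernel):
    """Apply the transpose of correlate_valid: zero-pad, then slide the flipped kernel."""
    kh, kw = len(kernel), len(kernel[0])
    width = len(fmap[0]) + 2 * (kw - 1)
    blank = [[0.0] * width for _ in range(kh - 1)]
    middle = [[0.0] * (kw - 1) + list(row) + [0.0] * (kw - 1) for row in fmap]
    return correlate_valid(blank + middle + blank, flip(kernel))


def dot(a, b):
    """Inner product of two equally sized 2D maps."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

A transposed operator must satisfy `dot(correlate_valid(x, k), y) == dot(x, transposed_filter(y, k))` for any image `x` and feature map `y` of matching sizes, which is a convenient sanity check for the sketch.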

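Projecting down from higher layers uses the switch settings generated by max pooling on the way up. Because the switches record only the locations of the maxima for a given input image and all other positions are filled with 0, the reconstruction obtained from a single activation resembles a small piece of the original input image, with structures weighted by their contribution to the feature activation. Since the model is trained discriminatively, these reconstructions implicitly show which parts of the input image are discriminative. They are not samples from the model, since no generative process is involved.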

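Figure 1: Top: a deconvnet layer (left) attached to a convnet layer (right). The deconvnet layer reconstructs an approximate version of the convnet features from the layer beneath. Bottom: an illustration of the unpooling operation in the deconvnet, which uses switches that record the location of each local maximum in order to approximately restore the features from before pooling.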

3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

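The architecture in Fig. 3 is similar to the one used by Krizhevsky et al. (2012) for ImageNet classification. The differences are: (1) the sparse connections in Krizhevsky's layers 3, 4, and 5 (a consequence of splitting the model across 2 GPUs) are replaced with dense connections in our model; (2) other important differences in layers 1 and 2 follow from inspecting the visualizations in Fig. 6, as described in Section 4.1.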

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.

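The model was trained on the ImageNet 2012 training set (1.3 million images spread over 1000 classes). Each RGB image was resized so that its smallest dimension was 256, the central 256x256 region was cropped, the per-pixel mean (computed across all images) was subtracted, and 10 different 224x224 sub-crops were taken (four corners plus center, each with and without horizontal flip). Stochastic gradient descent with a mini-batch size of 128 was used, starting with a learning rate of 0.01 and a momentum of 0.9; the learning rate was annealed manually whenever the validation error plateaued. Dropout (Hinton et al., 2012) with a rate of 0.5 was used in the fully connected layers (6 and 7). All weights were initialized to 0.01 and biases to 0.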
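
Two pieces of this recipe are easy to sketch: the positions of the 10 sub-crops and a single momentum update. Both functions below are illustrations with hypothetical names. The momentum rule shown is the classical form (velocity = momentum * velocity - lr * gradient); the text does not spell out the exact update, so treat it as an assumption.

```python
def ten_crop_boxes(image_size=256, crop_size=224):
    """Return (top, left, flipped) for four corner crops and the center crop,
    each once without and once with a horizontal flip."""
    last = image_size - crop_size          # largest valid offset (32 here)
    mid = last // 2                        # offset of the center crop (16 here)
    corners = [(0, 0), (0, last), (last, 0), (last, last), (mid, mid)]
    return [(top, left, flipped)
            for flipped in (False, True)
            for top, left in corners]


def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One parameter update with classical momentum; returns new weights and velocity."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

With the defaults, `ten_crop_boxes()` yields 10 entries whose offsets are 0, 16, or 32 pixels.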

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).

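Visualizing the first-layer filters during training shows that a few of them dominate (Fig. 6(a)). To counter this, every filter in the convolutional layers whose RMS value exceeds a fixed radius of 0.1 is renormalized to that radius. This step is crucial, especially in the first layer, where the input images lie roughly in the range [-128, 128]. As in Krizhevsky et al. (2012), we produce multiple crops and flips of each training example to enlarge the training set. Training used an implementation based on Krizhevsky et al. (2012) and was stopped after 70 epochs, which took about 12 days on a single GTX580 GPU.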
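
The renormalization rule is simple enough to write down directly. In this sketch a filter is a flat list of weights and the function names are invented for the example; only filters whose RMS exceeds the radius are rescaled, the others are returned unchanged.

```python
import math


def rms(filter_weights):
    """Root-mean-square value of a filter's weights."""
    return math.sqrt(sum(w * w for w in filter_weights) / len(filter_weights))


def clip_filter_rms(filter_weights, radius=0.1):
    """Rescale a filter so its RMS equals `radius` if it currently exceeds it."""
    current = rms(filter_weights)
    if current <= radius:
        return list(filter_weights)
    scale = radius / current
    return [w * scale for w in filter_weights]
```

Rescaling preserves the filter's shape (the ratios between weights) and only shrinks its overall magnitude.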

4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
With the model from Section 3 in place, we use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

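Feature visualization: Fig. 2 shows the features of our model once training is complete. For each feature map we show the top 9 activations rather than only the single strongest one. Projecting each of them separately down to pixel space reveals the different structures that excite that feature map, and hence its invariance to input deformations. Next to the visualizations are the corresponding image patches. The patches vary more than the visualizations, because the visualizations focus solely on the discriminative structure within each patch. For example, in layer 5, row 1, column 2, the nine patches seem to have little in common, but the visualizations show that this feature map focuses on the grass in the background rather than on the varied foreground objects.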
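
Selecting the top 9 activations for a feature map amounts to a top-k query over the dataset. A minimal sketch using the standard library, assuming the activations have already been collected as (image_id, value) pairs (a layout chosen for this example):

```python
import heapq


def top_k_activations(activations, k=9):
    """Return the k (image_id, value) pairs with the largest activation values,
    strongest first."""
    return heapq.nlargest(k, activations, key=lambda item: item[1])
```

Each returned pair would then be projected down to pixel space separately with the deconvnet.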

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

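The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances and captures similar textures (e.g. mesh patterns (row 1, column 1); text (row 2, column 4)). Layer 4 shows significant variation but is more class-specific: dog faces (row 1, column 1); birds' legs (row 4, column 2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (row 1, column 11) and dogs (row 4).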

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

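Feature evolution during training: Fig. 4 shows how the strongest activation (across all training examples) within a given feature map, projected back to pixel space, progresses during training. Sudden jumps in appearance occur when the image that produces the strongest activation changes. The lower layers converge within a few epochs, whereas the upper layers only develop after a considerable number of epochs (40-50), which shows the need to train the model until it has fully converged.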

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for object with rotational symmetry (e.g. entertainment center).

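Feature invariance: Fig. 5 shows 5 sample images that are translated, rotated, and scaled by varying degrees, together with the resulting changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer but a lesser impact at the top feature layer, where the effect is quasi-linear for translation and scaling. The network output is stable under translation and scaling. In general the output is not invariant to rotation, except for objects with rotational symmetry.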
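
One way to quantify "change in the feature vector relative to the untransformed feature" is a distance between the two vectors. The sketch below uses Euclidean distance, which is an assumption of this example, and takes the feature extractor and the transformations as caller-supplied functions (both hypothetical placeholders for a real network layer and real image warps).

```python
import math


def feature_distance(features_a, features_b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))


def invariance_curve(extract_features, image, transforms):
    """Distance between the features of `image` and of each transformed copy.

    extract_features: callable mapping an image to a feature vector (placeholder).
    transforms: list of callables, each returning a transformed image.
    """
    reference = extract_features(image)
    return [feature_distance(reference, extract_features(t(image)))
            for t in transforms]
```

A distance that stays near zero across the transforms indicates a feature that is invariant to them; a large distance indicates sensitivity.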

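Figure 2: Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations of each displayed feature map, projected down to pixel space with the deconvnet. The reconstructions are not samples from the model: they are deterministic patterns computed from a specific feature activation by the deconvnet. Each visualization is shown together with its corresponding image patch. Three observations: (1) the 9 reconstructions within each feature map are strongly related; (2) invariance increases at higher layers; (3) the reconstructions exaggerate the discriminative parts of the input image, e.g. the eyes and noses of dogs (layer 4, row 1, column 1).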

4.1 Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al. ’s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.

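While visualizing a trained model gives insight into its operation, it can also help in selecting a good architecture in the first place. Visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) and (d)) reveals several problems. The first-layer filters are a mix of extremely high- and low-frequency information, with little coverage of the mid frequencies. The second-layer visualization shows aliasing artifacts caused by the large stride of 4 used in the first-layer convolutions. To remedy these problems, we:
- (i) reduced the first-layer filter size from 11×11 to 7×7;
- (ii) reduced the convolution stride from 4 to 2. The new architecture retains much more information in the first- and second-layer features, as shown in Fig. 6(c) and (e), and also improves classification performance, as shown in Section 5.1.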
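
The effect of the smaller stride on spatial resolution can be seen with the usual output-size formula, floor((n + 2p - k) / s) + 1. The helper below is a sketch with a hypothetical name; the padding value is an assumption, since the text above does not state it.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Number of filter positions along one dimension of a strided convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# 224-pixel input, first-layer settings before and after the change:
old = conv_output_size(224, 11, 4)   # 11x11 filters, stride 4 -> 54 positions
new = conv_output_size(224, 7, 2)    # 7x7 filters, stride 2  -> 109 positions
```

Halving the stride roughly doubles the number of sampled positions per dimension, which is consistent with the first layer retaining more information.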

Figure 6: (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from

4.2 Occlusion Sensitivity

With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

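With image classification approaches, a natural question is whether the model truly identifies the location of the object in the image or merely uses the surrounding context. Fig. 7 attempts to answer this by systematically occluding different portions of the input image with a grey square and monitoring the classifier output. The examples clearly show that the model localizes the objects within the scene: the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, along with the activity in this map as a function of occluder position. When the occluder covers the image region that appears in the visualization, the activity in the feature map drops sharply. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, and thereby validates the other visualizations in Fig. 2 and Fig. 4.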
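
The occlusion experiment can be sketched as a sliding-window loop. Everything here is illustrative: the image is a 2D list of grey values, `class_probability` is a caller-supplied placeholder for the trained classifier's probability of the correct class, and the fill value 128 is an assumed mid-grey for 0-255 images.

```python
def occlude(image, top, left, size, fill=128):
    """Return a copy of the image with a size x size square replaced by `fill`."""
    out = [list(row) for row in image]
    for r in range(top, min(top + size, len(image))):
        for c in range(left, min(left + size, len(image[0]))):
            out[r][c] = fill
    return out


def occlusion_map(image, class_probability, size, stride):
    """Correct-class probability as a function of the occluder's position.

    class_probability: callable taking an image and returning a probability
    (a stand-in for running the trained network on the occluded image).
    """
    heat = []
    for top in range(0, len(image) - size + 1, stride):
        heat.append([class_probability(occlude(image, top, left, size))
                     for left in range(0, len(image[0]) - size + 1, stride)])
    return heat
```

Low values in the resulting map mark the occluder positions that hurt the classifier most, which is how the object's location shows up.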

Figure 7: Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability for “pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations