Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Download PDF: http://arxiv.org/pdf/1502.01852v1.pdf

Kaiming He   Xiangyu Zhang   Shaoqing Ren   Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren,

ABSTRACT

Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]). To our knowledge, our result is the first to surpass human-level performance (5.1%, [22]) on this visual recognition challenge.

1 INTRODUCTION

Convolutional neural networks (CNNs) [17, 16] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [30, 28], and hand-written digits [3, 31]. In this work, we present a result that surpasses human-level performance on a more generic and challenging recognition task - the classification task in the 1000-class ImageNet dataset [22].

In the last few years, we have witnessed tremendous improvements in recognition performance, mainly due to advances in two technical directions: building more powerful models, and designing effective strategies against overfitting. On one hand, neural networks are becoming more capable of fitting training data, because of increased complexity (e.g., increased depth [25, 29], enlarged width [33, 24], and the use of smaller strides [33, 24, 2, 25]), new nonlinear activations [21, 20, 34, 19, 27, 9], and sophisticated layer designs [29, 11]. On the other hand, better generalization is achieved by effective regularization techniques [12, 26, 9, 31], aggressive data augmentation [16, 13, 25, 29], and large-scale data [4, 22].

Among these advances, the rectifier neuron [21, 8, 20, 34], e.g., Rectified Linear Unit (ReLU), is one of several keys to the recent success of deep networks [16]. It expedites convergence of the training procedure [16] and leads to better solutions [21, 8, 20, 34] than conventional sigmoid-like units. Despite the prevalence of rectifier networks, recent improvements of models [33, 24, 11, 25, 29] and theoretical guidelines for training them [7, 23] have rarely focused on the properties of the rectifiers.

In this paper, we investigate neural networks from two aspects particularly driven by the rectifiers. First, we propose a new generalization of ReLU, which we call Parametric Rectified Linear Unit (PReLU). This activation function adaptively learns the parameters of the rectifiers, and improves accuracy at negligible extra computational cost. Second, we study the difficulty of training rectified models that are very deep. By explicitly modeling the nonlinearity of rectifiers (ReLU/PReLU), we derive a theoretically sound initialization method, which helps with convergence of very deep models (e.g., with 30 weight layers) trained directly from scratch. This gives us more flexibility to explore more powerful network architectures.

On the 1000-class ImageNet 2012 dataset, our PReLU network (PReLU-net) leads to a single-model result of 5.71% top-5 error, which surpasses all existing multi-model results. Further, our multi-model result achieves 4.94% top-5 error on the test set, which is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]). To the best of our knowledge, our result surpasses for the first time the reported human-level performance (5.1% in [22]) on this visual recognition challenge.


2 APPROACH

In this section, we first present the PReLU activation function (Sec. 2.1). Then we derive our initialization method for deep rectifier networks (Sec. 2.2). Lastly, we discuss our architecture designs (Sec. 2.3).

2.1 PARAMETRIC RECTIFIERS

We show that replacing the parameter-free ReLU activation by a learned parametric activation unit improves classification accuracy. [Footnote 1: Concurrent with our work, Agostinelli et al. [1] also investigated learning activation functions and showed improvement on other tasks.]

Figure 1: ReLU vs. PReLU. For PReLU, the coefficient of the negative part is not constant and is adaptively learned.

Definition

Formally, we consider an activation function defined as:

\[ f(y_i) = \begin{cases} y_i, & \text{if } y_i > 0 \\ a_i y_i, & \text{if } y_i \le 0 \end{cases} \tag{1} \]

Here $y_i$ is the input of the nonlinear activation $f$ on the $i$-th channel, and $a_i$ is a coefficient controlling the slope of the negative part. The subscript $i$ in $a_i$ indicates that we allow the nonlinear activation to vary on different channels. When $a_i = 0$, it becomes ReLU; when $a_i$ is a learnable parameter, we refer to Eqn.(1) as Parametric ReLU (PReLU). Figure 1 shows the shapes of ReLU and PReLU. Eqn.(1) is equivalent to

\[ f(y_i) = \max(0, y_i) + a_i \min(0, y_i). \]

If $a_i$ is a small and fixed value, PReLU becomes the Leaky ReLU (LReLU) in [20] ($a_i = 0.01$). The motivation of LReLU is to avoid zero gradients. Experiments in [20] show that LReLU has negligible impact on accuracy compared with ReLU. In contrast, our method adaptively learns the PReLU parameters jointly with the whole model. We hope that end-to-end training will lead to more specialized activations.
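As a quick illustration of Eqn.(1) and its equivalent max/min form, here is a minimal Python sketch. The function names `prelu` and `prelu_maxmin` are illustrative and do not come from the paper; the coefficient values in the demo loop are the ones mentioned in the text (0 for ReLU, 0.01 for LReLU, 0.25 as the PReLU initial value).

```python
def prelu(y, a):
    """Eqn.(1): identity for positive inputs, slope `a` for non-positive inputs."""
    return y if y > 0 else a * y


def prelu_maxmin(y, a):
    """Equivalent form: max(0, y) + a * min(0, y)."""
    return max(0.0, y) + a * min(0.0, y)


if __name__ == "__main__":
    for y in (-2.0, -0.5, 0.0, 1.5):
        # a = 0 gives ReLU, a = 0.01 gives Leaky ReLU, a = 0.25 is a PReLU starting point.
        print(y, prelu(y, 0.0), prelu(y, 0.01), prelu(y, 0.25))
```

In a real network `a` would be one learnable value per channel (or per layer for the channel-shared variant) rather than a plain float.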

PReLU introduces a very small number of extra parameters. The number of extra parameters is equal to the total number of channels, which is negligible when considering the total number of weights. So we expect no extra risk of overfitting. We also consider a channel-shared variant:

\[ f(y_i) = \max(0, y_i) + a \min(0, y_i), \]

where the coefficient $a$ is shared by all channels of one layer. This variant only introduces a single extra parameter into each layer.

Optimization

PReLU can be trained using backpropagation [17] and optimized simultaneously with other layers. The update formulations of $\{a_i\}$ are simply derived from the chain rule. The gradient of $a_i$ for one layer is:

\[ \frac{\partial E}{\partial a_i} = \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i}, \tag{2} \]

where $E$ represents the objective function. The term $\partial E / \partial f(y_i)$ is the gradient propagated from the deeper layer. The gradient of the activation is given by:

\[ \frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \text{if } y_i > 0 \\ y_i, & \text{if } y_i \le 0 \end{cases} \tag{3} \]

The summation $\sum_{y_i}$ runs over all positions of the feature map. For the channel-shared variant, the gradient of $a$ is $\frac{\partial E}{\partial a} = \sum_i \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a}$, where $\sum_i$ sums over all channels of the layer. The time complexity due to PReLU is negligible for both forward and backward propagation.
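The gradient in Eqn.(2) and Eqn.(3) can be sketched in a few lines of plain Python. This is only an illustration: the names `grad_a` and `grad_a_shared`, the list-based "feature map", and the toy linear objective used for the finite-difference check are assumptions made for the example, not part of the paper.

```python
def prelu(y, a):
    return y if y > 0 else a * y


def grad_a(ys, upstream):
    """Eqn.(2)-(3) for one channel: sum over positions of dE/df(y) * df(y)/da,
    where df(y)/da is 0 for y > 0 and y otherwise."""
    return sum(g * (0.0 if y > 0 else y) for y, g in zip(ys, upstream))


def grad_a_shared(channels, upstreams):
    """Channel-shared variant: additionally sum over all channels of the layer."""
    return sum(grad_a(ys, gs) for ys, gs in zip(channels, upstreams))


if __name__ == "__main__":
    ys = [-1.0, 2.0, -3.0]          # responses at three positions of one channel
    upstream = [0.5, 1.0, -2.0]     # dE/df(y) arriving from the deeper layer

    # Toy objective E(a) = sum_j upstream_j * f(y_j), used only to check the formula.
    def objective(a):
        return sum(g * prelu(y, a) for y, g in zip(ys, upstream))

    h = 1e-6
    numeric = (objective(0.25 + h) - objective(0.25 - h)) / (2 * h)
    print(grad_a(ys, upstream), numeric)
```

Only positions with non-positive responses contribute, which is why the extra cost of the backward pass is small.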

We adopt the momentum method when updating $a_i$:

\[ \Delta a_i := \mu \Delta a_i + \epsilon \frac{\partial E}{\partial a_i}. \tag{4} \]

Here $\mu$ is the momentum and $\epsilon$ is the learning rate. It is worth noticing that we do not use weight decay ($l_2$ regularization) when updating $a_i$. A weight decay tends to push $a_i$ to zero, and thus biases PReLU toward ReLU. Even without regularization, the learned coefficients rarely have a magnitude larger than 1 in our experiments. Further, we do not constrain the range of $a_i$, so the activation function may be non-monotonic. We use $a_i = 0.25$ as the initialization throughout this paper.
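A minimal sketch of the update in Eqn.(4) follows. Two things here are assumptions rather than statements from the text: the accumulated step is applied as `a - delta` (plain gradient descent), and the values of `mu` and `lr` are placeholders. No weight-decay term appears, in line with the paragraph above.

```python
def momentum_step(a, delta, grad, mu=0.9, lr=0.01):
    """One update of a PReLU coefficient.

    delta := mu * delta + lr * grad      (Eqn.(4))
    a     := a - delta                   (assumed descent step, not spelled out in the text)
    No weight decay is added to `grad`.
    """
    delta = mu * delta + lr * grad
    return a - delta, delta


if __name__ == "__main__":
    a, delta = 0.25, 0.0  # 0.25 is the initial value used for the coefficients
    for grad in (1.0, 0.5, -0.2):  # made-up gradients for illustration
        a, delta = momentum_step(a, delta, grad)
        print(round(a, 6), round(delta, 6))
```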

Comparison Experiments

We conducted comparisons on a deep but efficient model with 14 weight layers. The model was studied in [10] (model E of [10]) and its architecture is described in Table 1. We choose this model because it is sufficient for representing a category of very deep models, as well as to make the experiments feasible.

As a baseline, we train this model with ReLU applied in the convolutional (conv) layers and the first two fully-connected (fc) layers. The training implementation follows [10]. The top-1 and top-5 errors are 33.82% and 13.34% on ImageNet 2012, using 10-view testing (Table 2).

layer      filter size, filter number, /stride
conv1      7×7, 64, /2
pool1      3×3, /3
conv2_1    2×2, 128
conv2_2    2×2, 128
conv2_3    2×2, 128
conv2_4    2×2, 128
pool2      2×2, /2
conv3_1    2×2, 256
conv3_2    2×2, 256
conv3_3    2×2, 256
conv3_4    2×2, 256
conv3_5    2×2, 256
conv3_6    2×2, 256
spp        {6,3,2,1}
fc1        4096
fc2        4096
fc3        1000

Table 1: A small but deep 14-layer model [10]. The filter size and filter number of each layer are listed. The number /s indicates the stride s that is used. The learned coefficients of PReLU are also shown. For the channel-wise case, the average of $\{a_i\}$ over the channels is shown for each layer.

Then we train the same architecture from scratch, with all ReLUs replaced by PReLUs (Table 2). The top-1 error is reduced to 32.64%. This is a 1.2% gain over the ReLU baseline. Table 2 also shows that channel-wise and channel-shared PReLUs perform comparably. For the channel-shared version, PReLU only introduces 13 extra free parameters compared with the ReLU counterpart. But this small number of free parameters plays a critical role, as evidenced by the 1.1% gain over the baseline. This implies the importance of adaptively learning the shapes of activation functions.
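The parameter counts and gains quoted above can be checked with a short calculation. The sketch assumes, as the baseline description states, that an activation follows every conv layer and the first two fc layers of Table 1, and that each fc unit counts as one channel. The channel-wise total printed here is derived under those assumptions and is not a figure reported in the paper.

```python
# Output channels of the layers followed by an activation, read off Table 1:
# conv1, conv2_1..conv2_4, conv3_1..conv3_6, fc1, fc2 (fc3 has no activation after it).
channels = [64] + [128] * 4 + [256] * 6 + [4096, 4096]

shared_params = len(channels)        # one coefficient per layer
channelwise_params = sum(channels)   # one coefficient per channel (derived, see lead-in)

# Top-1 errors from Table 2.
relu_top1, shared_top1, channelwise_top1 = 33.82, 32.71, 32.64

if __name__ == "__main__":
    print("channel-shared extra parameters:", shared_params)
    print("channel-wise extra parameters:", channelwise_params)
    print("gain, channel-shared:", round(relu_top1 - shared_top1, 2))
    print("gain, channel-wise:", round(relu_top1 - channelwise_top1, 2))
```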

activation function       top-1 error (%)   top-5 error (%)
ReLU                      33.82             13.34
PReLU, channel-shared     32.71             12.87
PReLU, channel-wise       32.64             12.75

Table 2: Comparisons between ReLU and PReLU on the small model. The error rates are for ImageNet 2012 using 10-view testing. The images are resized so that the shorter side is 256, during both training and testing. Each view is 224×224. All models are trained using 75 epochs.

Table 1 also shows the learned coefficients of PReLUs for each layer. There are two interesting phenomena in Table 1. First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both positive and negative responses of the filters are respected. We believe that this is a more economical way of exploiting low-level information, given the limited number of filters (e.g., 64). Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients. This implies that the activations gradually become “more nonlinear” at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.

2.2 INITIALIZATION OF FILTER WEIGHTS FOR RECTIFIERS

Rectifier networks are easier to train [8, 16, 34] compared with traditional sigmoid-like activation networks. But a bad initialization can still hamper the learning of a highly non-linear system. In this subsection, we propose a robust initialization method that removes an obstacle of training extremely deep rectifier networks.

Recent deep CNNs are mostly initialized by random weights drawn from Gaussian distributions [16]. With fixed standard deviations (e.g., 0.01 in [16]), very deep models (e.g., >8 conv layers) have difficulty converging, as reported by the VGG team [25] and also observed in our experiments. To address this issue, in [25] they pre-train a model with 8 conv layers to initialize deeper models. But this strategy requires more training time, and may also lead to a poorer local optimum. In [29, 18], auxiliary classifiers are added to intermediate layers to help with convergence.

Glorot and Bengio [7] proposed to adopt a properly scaled uniform distribution for initialization. This is called “Xavier” initialization in [14]. Its derivation is based on the assumption that the activations are linear. This assumption is invalid for ReLU and PReLU.

In the following, we derive a theoretically more sound initialization by taking ReLU/PReLU into account. In our experiments, our initialization method allows extremely deep models (e.g., 30 conv/fc layers) to converge, while the “Xavier” method [7] does not.

Forward Propagation Case

Our derivation mainly follows [7]. The central idea is to investigate the variance of the responses in each layer.

For a conv layer, a response is:

\[ \mathbf{y}_l = W_l \mathbf{x}_l + \mathbf{b}_l. \tag{5} \]

Here, $\mathbf{x}$ is a $k^2 c$-by-1 vector that represents co-located $k \times k$ pixels in $c$ input channels. $k$ is the spatial filter size of the layer. With $n = k^2 c$ denoting the number of connections of a response, $W$ is a $d$-by-$n$ matrix, where $d$ is the number of filters and each row of $W$ represents the weights of a filter. $\mathbf{b}$ is a vector of biases, and $\mathbf{y}$ is the response at a pixel of the output map. We use $l$ to index a layer. We have $\mathbf{x}_l = f(\mathbf{y}_{l-1})$ where $f$ is the activation. We also have $c_l = d_{l-1}$.

We let the initialized elements in $W_l$ be mutually independent and share the same distribution. As in [7], we assume that the elements in $\mathbf{x}_l$ are also mutually independent and share the same distribution, and $\mathbf{x}_l$ and $W_l$ are independent of each other. Then we have:

\[ \mathrm{Var}[y_l] = n_l \mathrm{Var}[w_l x_l], \tag{6} \]

where now $y_l$, $w_l$, and $x_l$ represent the random variables of each element in $\mathbf{y}_l$, $W_l$, and $\mathbf{x}_l$ respectively. We let $w_l$ have zero mean. Then the variance of the product of independent variables gives us:

\[ \mathrm{Var}[y_l] = n_l \mathrm{Var}[w_l] E[x_l^2]. \tag{7} \]

Here $E[x_l^2]$ is the expectation of the square of $x_l$. It is worth noticing that $E[x_l^2] \ne \mathrm{Var}[x_l]$ unless $x_l$ has zero mean. For the ReLU activation, $x_l = \max(0, y_{l-1})$ and thus it does not have zero mean. This will lead to a conclusion different from [7].

If we let $w_{l-1}$ have a symmetric distribution around zero and $b_{l-1} = 0$, then $y_{l-1}$ has zero mean and has a symmetric distribution around zero. This leads to $E[x_l^2] = \frac{1}{2} \mathrm{Var}[y_{l-1}]$ when $f$ is ReLU. Putting this into Eqn.(7), we obtain:

\[ \mathrm{Var}[y_l] = \frac{1}{2} n_l \mathrm{Var}[w_l] \mathrm{Var}[y_{l-1}]. \tag{8} \]
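The factor 1/2 in Eqn.(8) comes from the identity $E[x_l^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}]$. The step can be spelled out as below, under the extra assumption (made here only for the illustration) that $y_{l-1}$ has a density $p$ with $p(y) = p(-y)$:

```latex
\begin{aligned}
E[x_l^2]
  &= \int_{-\infty}^{\infty} \max(0, y)^2 \, p(y) \, dy
   = \int_{0}^{\infty} y^2 \, p(y) \, dy \\
  &= \frac{1}{2} \int_{-\infty}^{\infty} y^2 \, p(y) \, dy
   \qquad \text{(symmetry of } p \text{)} \\
  &= \frac{1}{2} E[y_{l-1}^2]
   = \frac{1}{2} \mathrm{Var}[y_{l-1}]
   \qquad \text{(zero mean of } y_{l-1} \text{)}.
\end{aligned}
```

ReLU discards the negative half of a symmetric distribution, so exactly half of the second moment survives.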

With $L$ layers put together, we have:

\[ \mathrm{Var}[y_L] = \mathrm{Var}[y_1] \left( \prod_{l=2}^{L} \frac{1}{2} n_l \mathrm{Var}[w_l] \right). \tag{9} \]

This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. So we expect the above product to take a proper scalar (e.g., 1). A sufficient condition is:

\[ \frac{1}{2} n_l \mathrm{Var}[w_l] = 1, \quad \forall l. \tag{10} \]

This leads to a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{2/n_l}$. This is our way of initialization. We also initialize $\mathbf{b} = 0$.
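To make the effect of Eqn.(10) concrete, the sketch below pushes a random signal through a toy ReLU network and reports Var[y_L] / Var[y_1] for two choices of std. It is an illustration under simplifying assumptions: every layer is a dense width-by-width matrix (so n_l equals the width) instead of a conv layer, and the names `he_std`, `xavier_std` and `variance_ratio` are made up for this example.

```python
import math
import random


def he_std(n):
    """Std from Eqn.(10): (1/2) * n * Var[w] = 1  =>  std = sqrt(2 / n)."""
    return math.sqrt(2.0 / n)


def xavier_std(n):
    """Forward-case std of the linear analysis in [7]: n * Var[w] = 1."""
    return math.sqrt(1.0 / n)


def variance_ratio(std_fn, width=64, depth=10, seed=0):
    """Return Var[y_L] / Var[y_1] for a toy ReLU network with zero biases."""
    rng = random.Random(seed)
    std = std_fn(width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input signal, no ReLU applied to it
    first_var = None
    for _ in range(depth):
        # One response per output unit: y_j = sum_i w_ji * x_i, with w ~ N(0, std^2).
        y = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
        # E[y] = 0, so the mean square is used as the variance estimate.
        var = sum(v * v for v in y) / width
        if first_var is None:
            first_var = var
        x = [max(0.0, v) for v in y]  # ReLU
    return var / first_var


if __name__ == "__main__":
    print("std = sqrt(2/n):", variance_ratio(he_std))
    print("std = sqrt(1/n):", variance_ratio(xavier_std))
```

With the same seed, the two runs differ only by a factor of 1/2 per layer after the first, which is the product in Eqn.(9) with n_l Var[w_l] = 1 in place of 2.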

For the first layer ($l = 1$), we should have $n_1 \mathrm{Var}[w_1] = 1$ because there is no ReLU applied on the input signal. But the factor 1/2 does not matter if it just exists on one layer. So we also adopt Eqn.(10) in the first layer for simplicity.

Backward Propagation Case

For back-propagation, the gradient of a conv layer is computed by:

\[ \Delta \mathbf{x}_l = \hat{W}_l \Delta \mathbf{y}_l. \tag{11} \]

Here we use $\Delta \mathbf{x}$ and $\Delta \mathbf{y}$ to denote gradients ($\partial E / \partial \mathbf{x}$ and $\partial E / \partial \mathbf{y}$) for simplicity. $\Delta \mathbf{y}$ represents $k$-by-$k$ pixels in $d$ channels, and is reshaped into a $k^2 d$-by-1 vector. We denote $\hat{n} = k^2 d$. Note that $\hat{n} \ne n = k^2 c$. $\hat{W}$ is a $c$-by-$\hat{n}$ matrix where the filters are rearranged in the way of back-propagation. Note that $W$ and $\hat{W}$ can be reshaped from each other. $\Delta \mathbf{x}$ is a $c$-by-1 vector representing the gradient at a pixel of this layer. As above, we assume that $w_l$ and $\Delta y_l$ are independent of each other; then $\Delta x_l$ has zero mean for all $l$, when $w_l$ is initialized by a symmetric distribution around zero.

In back-propagation we also have $\Delta y_l = f'(y_l) \Delta x_{l+1}$ where $f'$ is the derivative of $f$. For the ReLU case, $f'(y_l)$ is zero or one, and their probabilities are equal. We assume that $f'(y_l)$ and $\Delta x_{l+1}$ are independent of each other. Thus we have $E[\Delta y_l] = E[\Delta x_{l+1}]/2 = 0$, and also $E[(\Delta y_l)^2] = \mathrm{Var}[\Delta y_l] = \frac{1}{2} \mathrm{Var}[\Delta x_{l+1}]$. Then we compute the variance of the gradient in Eqn.(11):

\[ \begin{aligned} \mathrm{Var}[\Delta x_l] &= \hat{n}_l \mathrm{Var}[w_l] \mathrm{Var}[\Delta y_l] \\ &= \frac{1}{2} \hat{n}_l \mathrm{Var}[w_l] \mathrm{Var}[\Delta x_{l+1}]. \end{aligned} \tag{12} \]

The scalar 1/2 in both Eqn.(12) and Eqn.(8) is the result of ReLU, though the derivations are different. With $L$ layers put together, we have:

\[ \mathrm{Var}[\Delta x_2] = \mathrm{Var}[\Delta x_{L+1}] \left( \prod_{l=2}^{L} \frac{1}{2} \hat{n}_l \mathrm{Var}[w_l] \right). \tag{13} \]

We consider a sufficient condition that the gradient is not exponentially large/small:

\[ \frac{1}{2} \hat{n}_l \mathrm{Var}[w_l] = 1, \quad \forall l. \tag{14} \]

The only difference between this equation and Eqn.(10) is that $\hat{n}_l = k_l^2 d_l$ while $n_l = k_l^2 c_l = k_l^2 d_{l-1}$. Eqn.(14) results in a zero-mean Gaussian distribution whose std is $\sqrt{2/\hat{n}_l}$.

For the first layer ($l = 1$), we need not compute $\Delta x_1$ because it represents the image domain. But we can still adopt Eqn.(14) in the first layer, for the same reason as in the forward propagation case - the factor of a single layer does not make the overall product exponentially large/small.

We note that it is sufficient to use either Eqn.(14) or Eqn.(10) alone. For example, if we use Eqn.(14), then in Eqn.(13) the product $\prod_{l=2}^{L} \frac{1}{2} \hat{n}_l \mathrm{Var}[w_l] = 1$, and in Eqn.(9) the product $\prod_{l=2}^{L} \frac{1}{2} n_l \mathrm{Var}[w_l] = \prod_{l=2}^{L} n_l / \hat{n}_l = c_2 / d_L$, which is not a diminishing number in common network designs. This means that if the initialization properly scales the backward signal, then this is also the case for the forward signal; and vice versa. For all models in this paper, both forms can make them converge.
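The value $c_2/d_L$ follows from a telescoping product. Written out with the definitions $\hat{n}_l = k_l^2 d_l$, $n_l = k_l^2 c_l$ and $c_l = d_{l-1}$ from the text (the closing numerical example uses made-up channel counts purely for illustration):

```latex
\begin{aligned}
\text{Eqn.(14)} \;\Rightarrow\; \mathrm{Var}[w_l] &= \frac{2}{\hat{n}_l}, \\
\prod_{l=2}^{L} \frac{1}{2}\, n_l \,\mathrm{Var}[w_l]
  &= \prod_{l=2}^{L} \frac{n_l}{\hat{n}_l}
   = \prod_{l=2}^{L} \frac{k_l^2\, d_{l-1}}{k_l^2\, d_l}
   = \frac{d_1}{d_2} \cdot \frac{d_2}{d_3} \cdots \frac{d_{L-1}}{d_L}
   = \frac{d_1}{d_L}
   = \frac{c_2}{d_L}. \\
\text{Example (hypothetical): } & c_2 = 64,\; d_L = 512
   \;\Rightarrow\; \frac{c_2}{d_L} = \frac{1}{8}.
\end{aligned}
```

The ratio depends only on the first and last filter counts, not on the depth, so it cannot shrink exponentially as layers are added.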

Figure 2: The convergence of a 22-layer large model (B in Table 3). The x-axis is the number of training epochs. The y-axis is the top-1 error of 3,000 random val samples, evaluated on the center crop. We use ReLU as the activation for both cases. Both our initialization (red) and “Xavier” (blue) [7] lead to convergence, but ours starts reducing error earlier.

Figure 3: The convergence of a 30-layer small model (see the main text). We use ReLU as the activation for both cases. Our initialization (red) is able to make it converge. But “Xavier” (blue) [7] completely stalls - we also verify that its gradients are all diminishing. It does not converge even given more epochs.

Discussions

If the forward/backward signal is inappropriately scaled by a factor $\beta$ in each layer, then the final propagated signal will be rescaled by a factor of $\beta^L$ after $L$ layers, where $L$ can represent some or all layers. When $L$ is large, if $\beta > 1$, this leads to extremely amplified signals and an algorithm output of infinity; if $\beta < 1$, this leads to diminishing signals. [Footnote 2: In the presence of weight decay ($l_2$ regularization of weights), when the gradient contributed by the logistic loss function is diminishing, the total gradient is not diminishing because of the weight decay. A way of diagnosing diminishing gradients is to check whether the gradient is modulated only by weight decay.] In either case, the algorithm does not converge - it diverges in the former case, and stalls in the latter.

Our derivation also explains why the constant standard deviation of 0.01 makes some deeper networks stall [25]. We take “model B” in the VGG team’s paper [25] as an example. This model has 10 conv layers all with 3×3 filters. The filter numbers ($d_l$) are 64 for the 1st and 2nd layers, 128 for the 3rd and 4th layers, 256 for the 5th and 6th layers, and 512 for the rest. The std computed by Eqn.(14) ($\sqrt{2/\hat{n}_l}$) is 0.059, 0.042, 0.029, and 0.021 when the filter numbers are 64, 128, 256, and 512 respectively. If the std is initialized as 0.01, the std of the gradient propagated from conv10 to conv2 is $1/(5.9 \times 4.2^2 \times 2.9^2 \times 2.1^4) = 1/(1.7 \times 10^4)$ of what we derive. This number may explain why diminishing gradients were observed in experiments.
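The arithmetic in the "model B" example can be reproduced directly. The sketch below recomputes the stds from Eqn.(14) and the overall attenuation for conv2 through conv10; the variable names are illustrative, and the unrounded product comes out slightly below the 1.7×10^4 obtained from the rounded per-layer factors.

```python
import math

K = 3  # all conv layers of "model B" in [25] use 3x3 filters
FIXED_STD = 0.01

# Filter numbers d_l for conv2 .. conv10, as listed in the text:
# conv2 has 64, conv3-4 have 128, conv5-6 have 256, conv7-10 have 512.
filters = [64] + [128] * 2 + [256] * 2 + [512] * 4


def derived_std(k, d):
    """Std from Eqn.(14): sqrt(2 / n_hat) with n_hat = k^2 * d."""
    return math.sqrt(2.0 / (k * k * d))


# Per-layer factor by which a fixed std of 0.01 under-scales the gradient.
factors = [derived_std(K, d) / FIXED_STD for d in filters]
attenuation = math.prod(factors)

if __name__ == "__main__":
    for d in (64, 128, 256, 512):
        print(d, round(derived_std(K, d), 3))
    print("gradient std is about 1 /", round(attenuation), "of the derived value")
```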

It is also worth noticing that the variance of the input signal can be roughly preserved from the first layer to the last. In cases when the input signal is not normalized (e.g., it is in the range of [−128, 128]), its magnitude can be so large that the softmax operator will overflow. A solution is to normalize the input signal, but this may impact other hyper-parameters. Another solution is to include a small factor on the weights among all or some layers, e.g., $\sqrt[L]{1/128}$ on $L$ layers. In practice, we use a std of 0.01 for the first two fc layers and 0.001 for the last. These numbers are smaller than they should be (e.g., $\sqrt{2/4096}$) and will address the normalization issue of images whose range is about [−128, 128].

For the initialization in the PReLU case, it is easy to show that Eqn.(10) becomes:

\[ \frac{1}{2} (1 + a^2) n_l \mathrm{Var}[w_l] = 1, \quad \forall l, \tag{15} \]

where $a$ is the initialized value of the coefficients. If $a = 0$, it becomes the ReLU case; if $a = 1$, it becomes the linear case (the same as [7]). Similarly, Eqn.(14) becomes $\frac{1}{2} (1 + a^2) \hat{n}_l \mathrm{Var}[w_l] = 1$.
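Solving Eqn.(15) for the std gives sqrt(2 / ((1 + a^2) n_l)). The helper below is a minimal sketch of that formula; the name `prelu_init_std` is illustrative, and the default `a=0.25` is the initial coefficient value mentioned earlier in the text.

```python
import math


def prelu_init_std(n, a=0.25):
    """Std implied by Eqn.(15): (1/2) * (1 + a^2) * n * Var[w] = 1."""
    return math.sqrt(2.0 / ((1.0 + a * a) * n))


if __name__ == "__main__":
    n = 3 * 3 * 256  # e.g. a 3x3 filter over 256 input channels
    print("a = 0    (ReLU case):  ", prelu_init_std(n, 0.0))
    print("a = 0.25 (PReLU init): ", prelu_init_std(n, 0.25))
    print("a = 1    (linear case):", prelu_init_std(n, 1.0))
```

For a = 0.25 the std is only about 3% smaller than in the ReLU case, since 1 + a^2 = 1.0625.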

Comparisons with “Xavier” Initialization [7]

The main difference between our derivation and the “Xavier” initialization [7] is that we address the rectifier nonlinearities. [Footnote 3: There are other minor differences. In [7], the derived variance is adopted for uniform distributions, and the forward and backward cases are averaged. But it is straightforward to adopt their conclusion for Gaussian distributions and for the forward or backward case only.] The derivation in [7] only considers the linear case, and its result is given by $n_l \mathrm{Var}[w_l] = 1$ (the forward case), which can be implemented as a zero-mean Gaussian distribution whose std is $\sqrt{1/n_l}$. When there are $L$ layers, the std will be $1/(\sqrt{2})^L$ of our derived std. This number, however, is not small enough to completely stall the convergence of the models actually used in our paper (Table 3, up to 22 layers) as shown by experiments. Figure 2 compares the convergence of a 22-layer model. Both methods are able to make it converge. But ours starts reducing error earlier. We also investigate the possible impact on accuracy. For the model in Table 2 (using ReLU), the “Xavier” initialization method leads to 33.90/13.44 top-1/top-5 error, and ours leads to 33.82/13.34. We have not observed clear superiority of one to the other on accuracy.
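To get a feel for the size of this factor, it can be evaluated at the two depths discussed in this section. The numbers below follow directly from the $1/(\sqrt{2})^L$ expression; they are a worked illustration, not values reported in the paper.

```latex
\begin{aligned}
\left(\frac{1}{\sqrt{2}}\right)^{L} &= 2^{-L/2}, \\
L = 22: \quad 2^{-11} &= \frac{1}{2048} \approx 4.9 \times 10^{-4}, \\
L = 30: \quad 2^{-15} &= \frac{1}{32768} \approx 3.1 \times 10^{-5}.
\end{aligned}
```

Going from 22 to 30 layers shrinks the factor by a further 16×.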

Next, we compare the two methods on extremely deep models with up to 30 layers (27 conv and 3 fc). We add up to sixteen conv layers with 256 2×2 filters in the model in Table 1. Figure 3 shows the convergence of the 30-layer model. Our initialization is able to make the extremely deep model converge. On the contrary, the “Xavier” method completely stalls the learning, and the gradients are diminishing as monitored in the experiments.

These studies demonstrate that we are ready to investigate extremely deep, rectified models by using a more principled initialization method. But in our current experiments on ImageNet, we have not observed the benefit from training extremely deep models. For example, the aforementioned 30-layer model has 38.56/16.59 top-1/top-5 error, which is clearly worse than the error of the 14-layer model in Table 2 (33.82/13.34). Accuracy saturation or degradation was also observed in the study of small models [10], VGG’s large models [25], and in speech recognition [34]. This is perhaps because the method of increasing depth is not appropriate, or the recognition task is not complex enough.

Though our attempts at extremely deep models have not shown benefits, our initialization method lays a foundation for further study on increasing depth. We hope this will be helpful in other more complex tasks.

input size                  VGG-19 [25]       model A           model B           model C
224                         3×3, 64           7×7, 96, /2       7×7, 96, /2       7×7, 96, /2
                            3×3, 64           -                 -                 -
                            2×2 maxpool, /2   -                 -                 -
112                         3×3, 128          -                 -                 -
                            3×3, 128          -                 -                 -
                            2×2 maxpool, /2   2×2 maxpool, /2   2×2 maxpool, /2   2×2 maxpool, /2
56                          3×3, 256          3×3, 256          3×3, 256          3×3, 384
                            3×3, 256          3×3, 256          3×3, 256          3×3, 384
                            3×3, 256          3×3, 256          3×3, 256          3×3, 384
                            3×3, 256          3×3, 256          3×3, 256          3×3, 384
                            -                 3×3, 256          3×3, 256          3×3, 384
                            -                 -                 3×3, 256          3×3, 384
                            2×2 maxpool, /2   2×2 maxpool, /2   2×2 maxpool, /2   2×2 maxpool, /2
28                          3×3, 512          3×3, 512          3×3, 512          3×3, 768
                            3×3, 512          3×3, 512          3×3, 512          3×3, 768
                            3×3, 512          3×3, 512          3×3, 512          3×3, 768
                            3×3, 512          3×3, 512          3×3, 512          3×3, 768
                            -                 3×3, 512          3×3, 512          3×3, 768
                            -                 -                 3×3, 512          3×3, 768
                            2×2 maxpool, /2   2×2 maxpool, /2   2×2 maxpool, /2   2×2 maxpool, /2
14                          3×3, 512          3×3, 512          3×3, 512          3×3, 896
                            3×3, 512          3×3, 512          3×3, 512          3×3, 896
                            3×3, 512          3×3, 512          3×3, 512          3×3, 896
                            3×3, 512          3×3, 512          3×3, 512          3×3, 896
                            -                 3×3, 512          3×3, 512          3×3, 896
                            -                 -                 3×3, 512          3×3, 896
                            2×2 maxpool, /2   spp, {7,3,2,1}    spp, {7,3,2,1}    spp, {7,3,2,1}
fc1                         4096              4096              4096              4096
fc2                         4096              4096              4096              4096
fc3                         1000              1000              1000              1000
depth (conv+fc)             19                19                22                22
complexity (ops., ×10^10)   1.96              1.90              2.32              5.30

Table 3: Architectures of large models. Here “/2” denotes a stride of 2.

2.3 ARCHITECTURES

The above investigations provide guidelines for designing our architectures, introduced as follows.

Our baseline is the 19-layer model (A) in Table 3. For a better comparison, we also list the VGG-19 model [25]. Our model A has the following modifications on VGG-19: (i) in the first layer, we use a filter size of 7×7 and a stride of 2; (ii) we move the other three conv layers on the two largest feature maps (224, 112) to the smaller feature maps (56, 28, 14). The time complexity (Table 3, last row) is roughly unchanged because the deeper layers have more filters; (iii) we use spatial pyramid pooling (SPP) [11] before the first fc layer. The pyramid has 4 levels - the numbers of bins are 7×7, 3×3, 2×2, and 1×1, for a total of 63 bins.
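The bin count of the pyramid is easy to verify, and it also determines the length of the vector fed to the first fc layer. In the sketch below, `spp_output_dim` is an illustrative helper; the fc1 input sizes it prints assume one pooled value per bin and per filter of the last conv layer (512 for models A and B, 896 for model C in Table 3), which is how SPP [11] is usually described but is not spelled out in this section.

```python
LEVELS = (7, 3, 2, 1)  # pyramid levels used by models A, B and C


def spp_bins(levels=LEVELS):
    """Total number of spatial bins: 7*7 + 3*3 + 2*2 + 1*1."""
    return sum(n * n for n in levels)


def spp_output_dim(num_filters, levels=LEVELS):
    """Length of the pooled feature vector, assuming one value per bin and filter."""
    return num_filters * spp_bins(levels)


if __name__ == "__main__":
    print("bins:", spp_bins())
    print("fc1 input, models A/B (512 filters):", spp_output_dim(512))
    print("fc1 input, model C (896 filters):", spp_output_dim(896))
```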

It is worth noticing that we have no evidence that our model A is a better architecture than VGG-19, though our model A has better results than VGG-19’s result reported by [25]. In our earlier experiments with less scale augmentation, we observed that our model A and our reproduced VGG-19 (with SPP and our initialization) are comparable. The main purpose of using model A is for faster running speed. The actual running time of the conv layers on larger feature maps is slower than those on smaller feature maps, when their time complexity is the same. In our four-GPU implementation, our model A takes 2.6s per mini-batch (128), and our reproduced VGG-19 takes 3.0s, evaluated on four Nvidia K20 GPUs.

In Table 3, our model B is a deeper version of A. It has three extra conv layers. Our model C is a wider (with more filters) version of B. The width substantially increases the complexity, and its time complexity is about 2.3× of B (Table 3, last row). Training A/B on four K20 GPUs, or training C on eight K40 GPUs, takes about 3-4 weeks.

We choose to increase the model width instead of depth, because deeper models have only diminishing improvement or even degradation on accuracy. In recent experiments on small models [10], it has been found that aggressively increasing the depth leads to saturated or degraded accuracy. In the VGG paper [25], the 16-layer and 19-layer models perform comparably. In the speech recognition research of [34], the deep models degrade when using more than 8 hidden layers (all being fc). We conjecture that similar degradation may also happen on larger models for ImageNet. We have monitored the training procedures of some extremely deep models (with 3 to 9 layers added on B in Table 3), and found both training and testing error rates degraded in the first 20 epochs (but we did not run to the end due to limited time budget, so there is not yet solid evidence that these large and overly deep models will ultimately degrade). Because of the possible degradation, we choose not to further increase the depth of these large models.

On the other hand, the recent research [5] on small datasets suggests that the accuracy should improve from the increased number of parameters in conv layers. This number depends on the depth and width. So we choose to increase the width of the conv layers to obtain a higher-capacity model.

While all models in Table 3 are very large, we have not observed severe overfitting. We attribute this to the aggressive data augmentation used throughout the whole training procedure, as introduced below.

3 IMPLEMENTATION DETAILS

Training

Our training algorithm mostly follows [16, 13, 2, 11, 25]. From a resized image whose shorter side is s, a 224×224 crop is randomly sampled, with the per-pixel mean subtracted. The scale s is randomly jittered in the range of [256,512], following [25]. One half of the random samples are flipped horizontally [16]. Random color altering [16] is also used.
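
The sampling step above can be sketched as follows. This is a minimal illustration, not the authors' code: it only draws the augmentation parameters (scale, crop position, flip) and leaves out pixel operations, mean subtraction, and color altering. The function name `sample_augmentation`, the uniform integer sampling of s, and the aspect-preserving resize are assumptions made for the example.

```python
import random

CROP = 224                # crop size stated in the text
SCALE_RANGE = (256, 512)  # jitter range for the shorter side s


def sample_augmentation(width, height, rng=random):
    """Sample resize/crop/flip parameters for one training image.

    Assumes s is drawn uniformly from the integer range and that the
    resize preserves the aspect ratio. Pixel operations are omitted.
    """
    s = rng.randint(*SCALE_RANGE)            # shorter side after resizing
    ratio = s / min(width, height)
    new_w = max(CROP, round(width * ratio))
    new_h = max(CROP, round(height * ratio))
    left = rng.randint(0, new_w - CROP)      # random 224x224 crop position
    top = rng.randint(0, new_h - CROP)
    flip = rng.random() < 0.5                # flip about half of the samples
    return {"scale": s, "size": (new_w, new_h), "crop": (left, top), "flip": flip}


params = sample_augmentation(500, 375, random.Random(0))
print(params)
```

Because the shorter side is always at least 256, a 224×224 crop always fits inside the resized image.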

Unlike [25] that applies scale jittering only during fine-tuning, we apply it from the beginning of training. Further, unlike [25] that initializes a deeper model using a shallower one, we directly train the very deep model using our initialization described in Sec. 2.2 (we use Eqn.(14)). Our end-to-end training may help improve accuracy, because it may avoid poorer local optima.

Other hyper-parameters that might be important are as follows. The weight decay is 0.0005, and momentum is 0.9. Dropout (50%) is used in the first two fc layers. The mini-batch size is fixed as 128. The learning rate is 1e-2, 1e-3, and 1e-4, and is switched when the error plateaus. The total number of epochs is about 80 for each model.
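
The text states the three learning rates but not the exact rule for deciding that the error has reached a plateau. A hedged sketch of one possible rule is given below; the class name `PlateauSchedule` and the `patience` and `min_delta` parameters are hypothetical and are not taken from the paper.

```python
LEARNING_RATES = [1e-2, 1e-3, 1e-4]   # values stated in the text


class PlateauSchedule:
    """Step through a fixed list of learning rates when the error stops improving."""

    def __init__(self, rates=LEARNING_RATES, patience=3, min_delta=0.1):
        self.rates = list(rates)
        self.patience = patience      # hypothetical: epochs without improvement
        self.min_delta = min_delta    # hypothetical: required error drop (in %)
        self.stage = 0
        self.best = float("inf")
        self.stale = 0

    @property
    def lr(self):
        return self.rates[self.stage]

    def step(self, val_error):
        """Record one epoch's validation error; return the learning rate to use next."""
        if val_error < self.best - self.min_delta:
            self.best = val_error
            self.stale = 0
        else:
            self.stale += 1
        # Move to the next (smaller) rate once the error has stalled long enough.
        if self.stale >= self.patience and self.stage < len(self.rates) - 1:
            self.stage += 1
            self.stale = 0
        return self.lr


schedule = PlateauSchedule(patience=2)
for error in [50.0, 40.0, 39.95, 39.97, 35.0]:
    print(error, schedule.step(error))
```

In practice the switch points may also be chosen by inspecting the error curves manually; the automatic rule here is only one way to express "switched when the error plateaus".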

Testing

We adopt the strategy of “multi-view testing on feature maps” used in the SPP-net paper [11]. We further improve this strategy using the dense sliding window method in [24, 25].

We first apply the convolutional layers on the resized full image and obtain the last convolutional feature map. In the feature map, each 14×14 window is pooled using the SPP layer [11]. The fc layers are then applied on the pooled features to compute the scores. This is also done on the horizontally flipped images. The scores of all dense sliding windows are averaged [24, 25]. We further combine the results at multiple scales as in [11].
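
The dense sliding-window averaging can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for "SPP pooling followed by the fc layers" on one window, and the stride of 1 on the feature map is assumed. The flipped image and the other scales would be handled by running the same routine again and combining the resulting scores.

```python
def dense_window_scores(feature_h, feature_w, score_fn, window=14, stride=1):
    """Average class scores over all window x window positions of a feature map.

    score_fn(top, left) returns a list of class scores for the window whose
    top-left corner is (top, left). Returns (averaged scores, window count).
    """
    totals, count = None, 0
    for top in range(0, feature_h - window + 1, stride):
        for left in range(0, feature_w - window + 1, stride):
            scores = score_fn(top, left)
            if totals is None:
                totals = list(scores)
            else:
                totals = [t + s for t, s in zip(totals, scores)]
            count += 1
    if count == 0:
        raise ValueError("feature map is smaller than the window")
    return [t / count for t in totals], count


# Toy stand-in: the "scores" are just the window coordinates.
avg, n = dense_window_scores(16, 20, lambda top, left: [float(top), float(left)])
print(n, avg)
```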

Multi-GPU Implementation

We adopt a simple variant of Krizhevsky’s method [15] for parallel training on multiple GPUs. We adopt “data parallelism” [15] on the conv layers. The GPUs are synchronized before the first fc layer. Then the forward/backward propagations of the fc layers are performed on a single GPU - this means that we do not parallelize the computation of the fc layers. The time cost of the fc layers is low, so it is not necessary to parallelize them. This leads to a simpler implementation than the “model parallelism” in [15]. Besides, model parallelism introduces some overhead due to the communication of filter responses, and is not faster than computing the fc layers on just a single GPU.
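
As a toy illustration of why data parallelism leaves the update unchanged, the sketch below splits a mini-batch into equal shards, computes a gradient per shard, and averages the results. The 1-D least-squares model and the function names are assumptions chosen for brevity; the text does not describe how gradients are exchanged between GPUs.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the mini-batch across workers and average the per-shard gradients."""
    shard = len(xs) // num_workers          # assumes the batch divides evenly
    grads = []
    for k in range(num_workers):
        part = slice(k * shard, (k + 1) * shard)
        grads.append(grad_mse(w, xs[part], ys[part]))
    return sum(grads) / num_workers


xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]
print(grad_mse(1.0, xs, ys), data_parallel_grad(1.0, xs, ys, 4))
```

With equal shard sizes the averaged gradient matches the full-batch gradient up to floating-point rounding, which is why the mini-batch size of 128 can stay fixed while the work is spread over several GPUs.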

We implement the above algorithm on our modification of the Caffe library [14]. We do not increase the mini-batch size (128) because the accuracy may be decreased [15]. For the large models in this paper, we have observed a 3.8x speedup using 4 GPUs, and a 6.0x speedup using 8 GPUs.
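
The reported speedups can be read as parallel efficiencies (speedup divided by the number of GPUs). The short calculation below uses only the numbers given above; the helper name `parallel_efficiency` is introduced for illustration.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved."""
    return speedup / num_gpus


for gpus, speedup in [(4, 3.8), (8, 6.0)]:
    print(f"{gpus} GPUs: {parallel_efficiency(speedup, gpus):.0%} of linear scaling")
```

This gives roughly 95% efficiency on 4 GPUs and 75% on 8 GPUs, so scaling is close to linear at 4 GPUs and noticeably less so at 8.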

4 EXPERIMENTS ON IMAGENET

We perform the experiments on the 1000-class ImageNet 2012 dataset [22] which contains about 1.2 million training images, 50,000 validation images, and 100,000 test images (with no published labels). The results are measured by top-1/top-5 error rates [22]. We only use the provided data for training. All results are evaluated on the validation set, except for the final results in Table 7, which are evaluated on the test set. The top-5 error rate is the metric officially used to rank the methods in the classification challenge [22].
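
For readers unfamiliar with the metric, a minimal sketch of the top-k error computation follows. A prediction counts as correct when the ground-truth label is among the k highest-scoring classes. The function name and the toy scores are illustrative only.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    wrong = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Toy example with 3 classes and 3 samples.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=2))
```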

| scale s | ReLU top-1 | ReLU top-5 | PReLU top-1 | PReLU top-5 |
| - | - | - | - | - |
| 256 | 26.25 | 8.25 | 25.81 | 8.08 |
| 384 | 24.77 | 7.26 | 24.20 | 7.03 |
| 480 | 25.46 | 7.63 | 24.83 | 7.39 |
| multi-scale | 24.02 | 6.51 | 22.97 | 6.28 |

Table 4: Comparisons between ReLU/PReLU on model A in ImageNet 2012 using dense testing.

Comparisons between ReLU and PReLU

In Table 4, we compare ReLU and PReLU on the large model A. We use the channel-wise version of PReLU. For fair comparisons, both ReLU/PReLU models are trained using the same total number of epochs, and the learning rates are also switched after running the same number of epochs.

Table 4 shows the results at three scales and the multi-scale combination. The best single scale is 384, possibly because it is in the middle of the jittering range [256,512]. For the multi-scale combination, PReLU reduces the top-1 error by 1.05% and the top-5 error by 0.23% (absolute) compared with ReLU.

The results in Table 2 and Table 4 consistently show that PReLU improves both small and large models. This improvement is obtained with almost no computational cost.

Comparisons of Single-model Results

Next we compare single-model results. We first show 10-view testing results [16] in Table 5. Here, each view is a 224-crop. The 10-view results of VGG-16 are based on our testing using the publicly released model [25], as they are not reported in [25]. Our best 10-view result is 7.38% (Table 5). Our other models also outperform the existing results.

Table 6 shows the comparisons of single-model results, which are all obtained using multi-scale and multi-view (or dense) testing. Our results are denoted as MSRA. Our baseline model (A+ReLU, 6.51%) is already substantially better than the best existing single-model result of 7.1% reported for VGG-19 in the latest update of [25] (arXiv v5). We believe that this gain is mainly due to our end-to-end training, without the need of pre-training shallow models.

Moreover, our best single model (C, PReLU) has 5.71% top-5 error. This result is even better than all previous multi-model results (Table 7). Comparing A+PReLU with B+PReLU, we see that the 19-layer model and the 22-layer model perform comparably. On the other hand, increasing the width (C vs. B, Table 6) can still improve accuracy. This indicates that when the models are deep enough, the width becomes an essential factor for accuracy.

| model | top-1 | top-5 |
| - | - | - |
| MSRA [11] | 29.68 | 10.95 |
| VGG-16 [25] | 28.07† | 9.33† |
| GoogLeNet [29] | - | 9.15 |
| A, ReLU | 26.48 | 8.59 |
| A, PReLU | 25.59 | 8.23 |
| B, PReLU | 25.53 | 8.13 |
| C, PReLU | 24.27 | 7.38 |

Table 5: The single-model 10-view results for ImageNet 2012 val set. †: Based on our tests.

| setting | team | top-1 | top-5 |
| - | - | - | - |
| in competition ILSVRC 14 | MSRA [11] | 27.86 | 9.08† |
| in competition ILSVRC 14 | VGG [25] | - | 8.43† |
| in competition ILSVRC 14 | GoogLeNet [29] | - | 7.89 |
| post-competition | VGG [25] (arXiv v2) | 24.8 | 7.5 |
| post-competition | VGG [25] (arXiv v5) | 24.4 | 7.1 |
| post-competition | Baidu [32] | 24.88 | 7.42 |
| post-competition | MSRA (A, ReLU) | 24.02 | 6.51 |
| post-competition | MSRA (A, PReLU) | 22.97 | 6.28 |
| post-competition | MSRA (B, PReLU) | 22.85 | 6.27 |
| post-competition | MSRA (C, PReLU) | 21.59 | 5.71 |

Table 6: The single-model results for ImageNet 2012 val set. †: Evaluated from the test set.

| setting | team | top-5 (test) |
| - | - | - |
| in competition ILSVRC 14 | MSRA, SPP-nets [11] | 8.06 |
| in competition ILSVRC 14 | VGG [25] | 7.32 |
| in competition ILSVRC 14 | GoogLeNet [29] | 6.66 |
| post-competition | VGG [25] (arXiv v5) | 6.8 |
| post-competition | Baidu [32] | 5.98 |
| post-competition | MSRA, PReLU-nets | 4.94 |

Table 7: The multi-model results for the ImageNet 2012 test set.

Comparisons of Multi-model Results

We combine six models including those in Table 6. For the time being we have trained only one model with architecture C. The other models have accuracy inferior to C by considerable margins. We conjecture that we can obtain better results by using fewer stronger models.

The multi-model results are in Table 7. Our result is 4.94% top-5 error on the test set. This number is evaluated by the ILSVRC server, because the labels of the test set are not published. Our result is 1.7% better than the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]), which represents a ∼26% relative improvement. This is also a ∼17% relative improvement over the latest result (Baidu, 5.98% [32]).
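
The relative improvements quoted above follow directly from the top-5 test errors in Table 7. The short check below reproduces the arithmetic; the helper name `relative_improvement` is introduced for illustration.

```python
def relative_improvement(baseline_error, new_error):
    """Relative reduction in error with respect to the baseline."""
    return (baseline_error - new_error) / baseline_error


ours = 4.94
for name, error in [("GoogLeNet", 6.66), ("Baidu", 5.98)]:
    absolute = error - ours
    relative = relative_improvement(error, ours)
    print(f"vs. {name}: {absolute:.2f} points absolute, {relative:.1%} relative")
```

This works out to about 25.8% relative to GoogLeNet and about 17.4% relative to Baidu, consistent with the rounded figures of ∼26% and ∼17%.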

Figure 4: Example validation images successfully classified by our method. For each image, the ground-truth label and the top-5 labels predicted by our method are listed.

Analysis of Results

Figure 4 shows some example validation images successfully classified by our method. Besides the correctly predicted labels, we also pay attention to the other four predictions in the top-5 results. Some of these four labels are other objects in the multi-object images, e.g., the “horse-cart” image (Figure 4, row 1, col 1) contains a “mini-bus” and it is also recognized by the algorithm. Some of these four labels are due to the uncertainty among similar classes, e.g., the “coucal” image (Figure 4, row 2, col 1) has predicted labels of other bird species.

Figure 6 shows the per-class top-5 error of our result (average of 4.94%) on the test set, displayed in ascending order. Our result has zero top-5 error in 113 classes - the images in these classes are all correctly classified. The three classes with the highest top-5 error are “letter opener” (49%), “spotlight” (38%), and “restaurant” (36%). The error is due to the existence of multiple objects, small objects, or large intra-class variance. Figure 5 shows some example images misclassified by our method in these three classes. Some of the predicted labels still make some sense.

In Figure 7, we show the per-class difference of top-5 error rates between our result (average of 4.94%) and our team’s in-competition result in ILSVRC 2014 (average of 8.06%). The error rates are reduced in 824 classes, unchanged in 127 classes, and increased in 49 classes.

Figure 5: Example validation images incorrectly classified by our method, in the three classes with the highest top-5 test error. Top: “letter opener” (49% top-5 test error). Middle: “spotlight” (38%). Bottom: “restaurant” (36%). For each image, the ground-truth label and the top-5 labels predicted by our method are listed.

Comparisons with Human Performance from [22]

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well trained on the validation images to be better aware of the existence of relevant classes. When annotating the test images, the human annotator is given a special interface, where each class title is accompanied by a row of 13 example training images. The reported human performance is estimated on a random subset of 1500 test images.

Our result (4.94%) exceeds the reported human-level performance. To our knowledge, our result is the first published instance of surpassing humans on this visual recognition challenge. The analysis in [22] reveals that the two major types of human errors come from fine-grained recognition and class unawareness. The investigation in [22] suggests that algorithms can do a better job on fine-grained recognition (e.g., 120 species of dogs in the dataset). The second row of Figure 4 shows some example fine-grained objects successfully recognized by our method - “coucal”, “komondor”, and “yellow lady’s slipper”. While humans can easily recognize these objects as a bird, a dog, and a flower, it is nontrivial for most humans to tell their species. On the negative side, our algorithm still makes mistakes in cases that are not difficult for humans, especially for those requiring context understanding or high-level knowledge (e.g., the “spotlight” images in Figure 5).

While our algorithm produces a superior result on this particular dataset, this does not indicate that machine vision outperforms human vision on object recognition in general. On recognizing elementary object categories (i.e., common objects or concepts in daily lives) such as the Pascal VOC task [6], machines still have obvious errors in cases that are trivial for humans. Nevertheless, we believe that our results show the tremendous potential of machine algorithms to match human-level performance on visual recognition.

Figure 6: The per-class top-5 errors of our result (average of 4.94%) on the test set. Errors are displayed in ascending order.

Figure 7: The difference of top-5 error rates between our result (average of 4.94%) and our team’s in-competition result for ILSVRC 2014 (average of 8.06%) on the test set, displayed in ascending order. A positive number indicates a reduced error rate.

REFERENCES