Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren,
ABSTRACT
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]). To our knowledge, our result is the first to surpass human-level performance (5.1%, [22]) on this visual recognition challenge.
1 INTRODUCTION
Convolutional neural networks (CNNs) [17, 16] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [30, 28], and hand-written digits [3, 31]. In this work, we present a result that surpasses human-level performance on a more generic and challenging recognition task - the classification task in the 1000-class ImageNet dataset [22].
In the last few years, we have witnessed tremendous improvements in recognition performance, mainly due to advances in two technical directions: building more powerful models, and designing effective strategies against overfitting. On one hand, neural networks are becoming more capable of fitting training data, because of increased complexity (e.g., increased depth [25, 29], enlarged width [33, 24], and the use of smaller strides [33, 24, 2, 25]), new nonlinear activations [21, 20, 34, 19, 27, 9], and sophisticated layer designs [29, 11]. On the other hand, better generalization is achieved by effective regularization techniques [12, 26, 9, 31], aggressive data augmentation [16, 13, 25, 29], and large-scale data [4, 22].
Among these advances, the rectifier neuron [21, 8, 20, 34], e.g., Rectified Linear Unit (ReLU), is one of several keys to the recent success of deep networks [16]. It expedites convergence of the training procedure [16] and leads to better solutions [21, 8, 20, 34] than conventional sigmoid-like units. Despite the prevalence of rectifier networks, recent improvements of models [33, 24, 11, 25, 29] and theoretical guidelines for training them [7, 23] have rarely focused on the properties of the rectifiers.
In this paper, we investigate neural networks from two aspects particularly driven by the rectifiers. First, we propose a new generalization of ReLU, which we call Parametric Rectified Linear Unit (PReLU). This activation function adaptively learns the parameters of the rectifiers, and improves accuracy at negligible extra computational cost. Second, we study the difficulty of training rectified models that are very deep. By explicitly modeling the nonlinearity of rectifiers (ReLU/PReLU), we derive a theoretically sound initialization method, which helps with convergence of very deep models (e.g., with 30 weight layers) trained directly from scratch. This gives us more flexibility to explore more powerful network architectures.
On the 1000-class ImageNet 2012 dataset, our PReLU network (PReLU-net) leads to a single-model result of 5.71% top-5 error, which surpasses all existing multi-model results. Further, our multi-model result achieves 4.94% top-5 error on the test set, which is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]). To the best of our knowledge, our result surpasses for the first time the reported human-level performance (5.1% in [22]) on this visual recognition challenge.
2 APPROACH
In this section, we first present the PReLU activation function (Sec. 2.1). Then we derive our initialization method for deep rectifier networks (Sec. 2.2). Lastly we discuss our architecture designs (Sec. 2.3).
2.1 PARAMETRIC RECTIFIERS
We show that replacing the parameter-free ReLU activation by a learned parametric activation unit improves classification accuracy¹.

¹Concurrent with our work, Agostinelli et al. [1] also investigated learning activation functions and showed improvement on other tasks.
Figure 1: ReLU vs. PReLU. For PReLU, the coefficient of the negative part is not constant and is adaptively learned.

Definition
Formally, we consider an activation function defined as:
$$f(y_i) = \begin{cases} y_i, & \text{if } y_i > 0 \\ a_i y_i, & \text{if } y_i \le 0. \end{cases} \qquad (1)$$
Here $y_i$ is the input of the nonlinear activation $f$ on the $i$th channel, and $a_i$ is a coefficient controlling the slope of the negative part. The subscript $i$ in $a_i$ indicates that we allow the nonlinear activation to vary on different channels. When $a_i = 0$, it becomes ReLU; when $a_i$ is a learnable parameter, we refer to Eqn.(1) as Parametric ReLU (PReLU). Figure 1 shows the shapes of ReLU and PReLU. Eqn.(1) is equivalent to
$$f(y_i) = \max(0, y_i) + a_i \min(0, y_i).$$
If $a_i$ is a small and fixed value, PReLU becomes the Leaky ReLU (LReLU) in [20] ($a_i = 0.01$). The motivation of LReLU is to avoid zero gradients. Experiments in [20] show that LReLU has negligible impact on accuracy compared with ReLU. On the contrary, our method adaptively learns the PReLU parameters jointly with the whole model. We hope end-to-end training will lead to more specialized activations.
PReLU introduces a very small number of extra parameters. The number of extra parameters is equal to the total number of channels, which is negligible when considering the total number of weights. So we expect no extra risk of overfitting. We also consider a channel-shared variant:
$$f(y_i) = \max(0, y_i) + a \min(0, y_i),$$
where the coefficient is shared by all channels of one layer. This variant only introduces a single extra parameter into each layer.
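For concreteness, the following is a minimal NumPy sketch (not from the paper's implementation) of the channel-wise and channel-shared variants, assuming activations laid out as (N, C, H, W); the function and variable names are illustrative.

```python
import numpy as np

def prelu_forward(y, a, channel_shared=False):
    """PReLU: f(y) = max(0, y) + a * min(0, y), applied elementwise.

    y: pre-activations of shape (N, C, H, W).
    a: a scalar if channel_shared, otherwise one coefficient per channel (length C).
    """
    if channel_shared:
        neg_slope = a                                   # one extra parameter per layer
    else:
        neg_slope = np.asarray(a).reshape(1, -1, 1, 1)  # one extra parameter per channel
    return np.maximum(0.0, y) + neg_slope * np.minimum(0.0, y)

# Example with the paper's initial value a = 0.25 on a tiny 2-channel input.
y = np.array([[[[1.0, -2.0]], [[-0.5, 3.0]]]])          # shape (1, 2, 1, 2)
print(prelu_forward(y, [0.25, 0.25]))                   # channel-wise
print(prelu_forward(y, 0.25, channel_shared=True))      # channel-shared
```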
Optimization
PReLU can be trained using backpropagation [17] and optimized simultaneously with other layers. The update formulations of $\{a_i\}$ are simply derived from the chain rule. The gradient of $a_i$ for one layer is:
$$\frac{\partial E}{\partial a_i} = \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i}, \qquad (2)$$
where $E$ represents the objective function. The term $\partial E / \partial f(y_i)$ is the gradient propagated from the deeper layer. The gradient of the activation is given by:
$$\frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \text{if } y_i > 0 \\ y_i, & \text{if } y_i \le 0. \end{cases} \qquad (3)$$
The summation $\sum_{y_i}$ runs over all positions of the feature map. For the channel-shared variant, the gradient of $a$ is $\frac{\partial E}{\partial a} = \sum_i \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a}$, where $\sum_i$ sums over all channels of the layer. The time complexity due to PReLU is negligible for both forward and backward propagation.
We adopt the momentum method when updating $a_i$:
$$\Delta a_i := \mu \Delta a_i + \epsilon \frac{\partial E}{\partial a_i}. \qquad (4)$$
Here $\mu$ is the momentum and $\epsilon$ is the learning rate. It is worth noticing that we do not use weight decay ($\ell_2$ regularization) when updating $a_i$. A weight decay tends to push $a_i$ to zero, and thus biases PReLU toward ReLU. Even without regularization, the learned coefficients rarely have a magnitude larger than 1 in our experiments. Further, we do not constrain the range of $a_i$, so the activation function may be non-monotonic. We use $a_i = 0.25$ as the initialization throughout this paper.
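A sketch of Eqns.(2)-(4) under the same layout assumptions as above; `grad_out` stands for the gradient ∂E/∂f(y) propagated from the deeper layer, and the descent step a := a − Δa is the usual convention, which the text leaves implicit.

```python
import numpy as np

def prelu_grad_a(y, grad_out, channel_shared=False):
    """Gradient of the objective w.r.t. the PReLU coefficient(s), Eqns.(2)-(3).

    y:        pre-activations, shape (N, C, H, W).
    grad_out: dE/df(y) from the deeper layer, same shape as y.
    """
    local = np.where(y > 0, 0.0, y)                  # df(y)/da = 0 if y > 0, else y
    if channel_shared:
        return np.sum(grad_out * local)              # sum over channels and positions
    return np.sum(grad_out * local, axis=(0, 2, 3))  # one gradient per channel

def momentum_step_a(a, delta, grad_a, mu=0.9, eps=0.01):
    """Eqn.(4): Delta a := mu * Delta a + eps * dE/da, with no weight decay on a."""
    delta = mu * delta + eps * grad_a
    return a - delta, delta                          # descent step (assumed convention)
```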
Comparison Experiments
We conducted comparisons on a deep but efficient model with 14 weight layers. The model was studied in [10] (model E of [10]) and its architecture is described in Table 1. We choose this model because it is sufficient for representing a category of very deep models, as well as to make the experiments feasible.
As a baseline, we train this model with ReLU applied in the convolutional (conv) layers and the first two fully-connected (fc) layers. The training implementation follows [10]. The top-1 and top-5 errors are 33.82% and 13.34% on ImageNet 2012, using 10-view testing (Table 2).
| layer | filter size, number, /stride |
|---|---|
| conv1 | 7×7, 64, /2 |
| pool1 | 3×3, /3 |
| conv2_1 | 2×2, 128 |
| conv2_2 | 2×2, 128 |
| conv2_3 | 2×2, 128 |
| conv2_4 | 2×2, 128 |
| pool2 | 2×2, /2 |
| conv3_1 | 2×2, 256 |
| conv3_2 | 2×2, 256 |
| conv3_3 | 2×2, 256 |
| conv3_4 | 2×2, 256 |
| conv3_5 | 2×2, 256 |
| conv3_6 | 2×2, 256 |
| spp | {6,3,2,1} |
| fc1 | 4096 |
| fc2 | 4096 |
| fc3 | 1000 |
Table 1: A small but deep 14-layer model [10]. The filter size and filter number of each layer are listed. The number /s indicates the stride s that is used. The learned coefficients of PReLU are also shown. For the channel-wise case, the average of $\{a_i\}$ over the channels is shown for each layer.
Then we train the same architecture from scratch, with all ReLUs replaced by PReLUs (Table 2). The top-1 error is reduced to 32.64%. This is a 1.2% gain over the ReLU baseline. Table 2 also shows that channel-wise/channel-shared PReLUs perform comparably. For the channel-shared version, PReLU introduces only 13 extra free parameters compared with the ReLU counterpart. But these few free parameters play a critical role, as evidenced by the 1.1% gain over the baseline. This implies the importance of adaptively learning the shapes of activation functions.
| activation function | top-1 | top-5 |
|---|---|---|
| ReLU | 33.82 | 13.34 |
| PReLU, channel-shared | 32.71 | 12.87 |
| PReLU, channel-wise | 32.64 | 12.75 |
Table 2: Comparisons between ReLU and PReLU on the small model. The error rates are for ImageNet 2012 using 10-view testing. The images are resized so that the shorter side is 256, during both training and testing. Each view is 224×224. All models are trained using 75 epochs.

Table 1 also shows the learned coefficients of PReLUs for each layer. There are two interesting phenomena in Table 1. First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both positive and negative responses of the filters are respected. We believe that this is a more economical way of exploiting low-level information, given the limited number of filters (e.g., 64). Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients. This implies that the activations gradually become "more nonlinear" at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.
2.2 INITIALIZATION OF FILTER WEIGHTS FOR RECTIFIERS
Rectifier networks are easier to train [8, 16, 34] compared with traditional sigmoid-like activation networks. But a bad initialization can still hamper the learning of a highly non-linear system. In this subsection, we propose a robust initialization method that removes an obstacle of training extremely deep rectifier networks.
Recent deep CNNs are mostly initialized by random weights drawn from Gaussian distributions [16]. With fixed standard deviations (e.g., 0.01 in [16]), very deep models (e.g., >8 conv layers) have difficulty converging, as reported by the VGG team [25] and also observed in our experiments. To address this issue, in [25] they pre-train a model with 8 conv layers to initialize deeper models. But this strategy requires more training time, and may also lead to a poorer local optimum. In [29, 18], auxiliary classifiers are added to intermediate layers to help with convergence.
Glorot and Bengio [7] proposed to adopt a properly scaled uniform distribution for initialization. This is called “Xavier” initialization in [14]. Its derivation is based on the assumption that the activations are linear. This assumption is invalid for ReLU and PReLU.
In the following, we derive a theoretically more sound initialization by taking ReLU/PReLU into account. In our experiments, our initialization method allows for extremely deep models (e.g., 30 conv/fc layers) to converge, while the “Xavier” method [7] cannot.
Forward Propagation Case
Our derivation mainly follows [7]. The central idea is to investigate the variance of the responses in each layer.
For a conv layer, a response is:
$$\mathbf{y}_l = W_l \mathbf{x}_l + \mathbf{b}_l. \qquad (5)$$
Here, $\mathbf{x}$ is a $k^2 c$-by-1 vector that represents co-located $k \times k$ pixels in $c$ input channels. $k$ is the spatial filter size of the layer. With $n = k^2 c$ denoting the number of connections of a response, $W$ is a $d$-by-$n$ matrix, where $d$ is the number of filters and each row of $W$ represents the weights of a filter. $\mathbf{b}$ is a vector of biases, and $\mathbf{y}$ is the response at a pixel of the output map. We use $l$ to index a layer. We have $\mathbf{x}_l = f(\mathbf{y}_{l-1})$ where $f$ is the activation. We also have $c_l = d_{l-1}$.
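As a small illustration of this notation (not part of the paper), a single output position of a conv layer can be written exactly as Eqn.(5); the shapes below are chosen arbitrarily.

```python
import numpy as np

# One response of a conv layer, Eqn.(5): y = W x + b at a single output pixel.
k, c, d = 3, 64, 128                        # filter size, input channels, filters
rng = np.random.default_rng(0)
x = rng.standard_normal(k * k * c)          # k^2*c-by-1 vector of co-located pixels
W = rng.standard_normal((d, k * k * c))     # d-by-n matrix, n = k^2*c; one filter per row
b = np.zeros(d)                             # bias vector
y = W @ x + b                               # responses of the d filters at this pixel
print(y.shape)                              # (128,)
```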
We let the initialized elements in $W_l$ be mutually independent and share the same distribution. As in [7], we assume that the elements in $\mathbf{x}_l$ are also mutually independent and share the same distribution, and $\mathbf{x}_l$ and $W_l$ are independent of each other. Then we have:
$$\mathrm{Var}[y_l] = n_l \mathrm{Var}[w_l x_l], \qquad (6)$$
where now $y_l$, $x_l$, and $w_l$ represent the random variables of each element in $\mathbf{y}_l$, $\mathbf{x}_l$, and $W_l$, respectively. We let $w_l$ have zero mean. Then the variance of the product of independent variables gives us:
$$\mathrm{Var}[y_l] = n_l \mathrm{Var}[w_l]\, E[x_l^2]. \qquad (7)$$
Here $E[x_l^2]$ is the expectation of the square of $x_l$. It is worth noticing that $E[x_l^2] \neq \mathrm{Var}[x_l]$ unless $x_l$ has zero mean. For the ReLU activation, $x_l = \max(0, y_{l-1})$ and thus it does not have zero mean. This will lead to a conclusion different from [7].
If we let $w_{l-1}$ have a symmetric distribution around zero and $b_{l-1} = 0$, then $y_{l-1}$ has zero mean and has a symmetric distribution around zero. This leads to $E[x_l^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}]$ when $f$ is ReLU. Putting this into Eqn.(7), we obtain:
$$\mathrm{Var}[y_l] = \frac{1}{2} n_l \mathrm{Var}[w_l]\, \mathrm{Var}[y_{l-1}]. \qquad (8)$$
With $L$ layers put together, we have:
$$\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \left( \prod_{l=2}^{L} \frac{1}{2} n_l \mathrm{Var}[w_l] \right). \qquad (9)$$
This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. So we expect the above product to take a proper scalar (e.g., 1). A sufficient condition is:
$$\frac{1}{2} n_l \mathrm{Var}[w_l] = 1, \quad \forall l. \qquad (10)$$
This leads to a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{2/n_l}$. This is our way of initialization. We also initialize $\mathbf{b} = 0$.
For the first layer ($l = 1$), we should have $n_1 \mathrm{Var}[w_1] = 1$, because there is no ReLU applied on the input signal. But the factor 1/2 does not matter if it just exists on one layer. So we also adopt Eqn.(10) in the first layer for simplicity.
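Putting the forward-case condition into code, a minimal sketch (illustrative names) of the resulting initialization for a conv layer:

```python
import numpy as np

def init_conv_forward_case(k, c, d, rng=None):
    """Zero-mean Gaussian init with std = sqrt(2 / n_l), n_l = k^2 * c  (Eqn. 10).

    k: spatial filter size; c: input channels (c_l = d_{l-1}); d: number of filters.
    Returns filters of shape (d, c, k, k) and zero biases.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = k * k * c                            # number of connections of one response
    std = np.sqrt(2.0 / n)
    W = rng.normal(0.0, std, size=(d, c, k, k))
    b = np.zeros(d)                          # we also initialize b = 0
    return W, b

# Example: a 3x3 conv layer on 64 input channels; std = sqrt(2/576) ≈ 0.059.
W, b = init_conv_forward_case(k=3, c=64, d=128)
```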
Backward Propagation Case
For back-propagation, the gradient of a conv layer is computed by:
$$\Delta \mathbf{x}_l = \hat{W}_l \Delta \mathbf{y}_l. \qquad (11)$$
Here we use $\Delta \mathbf{x}$ and $\Delta \mathbf{y}$ to denote gradients ($\frac{\partial E}{\partial \mathbf{x}}$ and $\frac{\partial E}{\partial \mathbf{y}}$) for simplicity. $\Delta \mathbf{y}$ represents $k \times k$ pixels in $d$ channels, and is reshaped into a $k^2 d$-by-1 vector. We denote $\hat{n} = k^2 d$. Note that $\hat{n} \neq n = k^2 c$. $\hat{W}$ is a $c$-by-$\hat{n}$ matrix where the filters are rearranged in the way of back-propagation. Note that $W$ and $\hat{W}$ can be reshaped from each other. $\Delta \mathbf{x}$ is a $c$-by-1 vector representing the gradient at a pixel of this layer. As above, we assume that $w_l$ and $\Delta y_l$ are independent of each other; then $\Delta x_l$ has zero mean for all $l$, when $w_l$ is initialized by a symmetric distribution around zero.
In back-propagation we also have $\Delta y_l = f'(y_l)\Delta x_{l+1}$ where $f'$ is the derivative of $f$. For the ReLU case, $f'(y_l)$ is zero or one, and their probabilities are equal. We assume that $f'(y_l)$ and $\Delta x_{l+1}$ are independent of each other. Thus we have $E[\Delta y_l] = E[\Delta x_{l+1}]/2 = 0$, and also $E[(\Delta y_l)^2] = \mathrm{Var}[\Delta y_l] = \frac{1}{2}\mathrm{Var}[\Delta x_{l+1}]$. Then we compute the variance of the gradient in Eqn.(11):
$$\mathrm{Var}[\Delta x_l] = \hat{n}_l \mathrm{Var}[w_l]\, \mathrm{Var}[\Delta y_l] = \frac{1}{2} \hat{n}_l \mathrm{Var}[w_l]\, \mathrm{Var}[\Delta x_{l+1}]. \qquad (12)$$
The scalar 1/2 in both Eqn.(12) and Eqn.(8) is the result of ReLU, though the derivations are different. With L layers put together, we have:
$$\mathrm{Var}[\Delta x_2] = \mathrm{Var}[\Delta x_{L+1}] \left( \prod_{l=2}^{L} \frac{1}{2} \hat{n}_l \mathrm{Var}[w_l] \right). \qquad (13)$$
We consider a sufficient condition that the gradient is not exponentially large/small:
$$\frac{1}{2} \hat{n}_l \mathrm{Var}[w_l] = 1, \quad \forall l. \qquad (14)$$
The only difference between this equation and Eqn.(10) is that $\hat{n}_l = k_l^2 d_l$ while $n_l = k_l^2 c_l = k_l^2 d_{l-1}$. Eqn.(14) results in a zero-mean Gaussian distribution whose std is $\sqrt{2/\hat{n}_l}$.
For the first layer ($l = 1$), we need not compute $\Delta x_1$ because it represents the image domain. But we can still adopt Eqn.(14) in the first layer, for the same reason as in the forward propagation case - the factor of a single layer does not make the overall product exponentially large/small.
We note that it is sufficient to use either Eqn.(14) or Eqn.(10) alone. For example, if we use Eqn.(14), then in Eqn.(13) the product $\prod_{l=2}^{L} \frac{1}{2}\hat{n}_l \mathrm{Var}[w_l] = 1$, and in Eqn.(9) the product $\prod_{l=2}^{L} \frac{1}{2} n_l \mathrm{Var}[w_l] = \prod_{l=2}^{L} n_l/\hat{n}_l = c_2/d_L$, which is not a diminishing number in common network designs. This means that if the initialization properly scales the backward signal, then this is also the case for the forward signal; and vice versa. For all models in this paper, both forms can make them converge.
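A quick numeric check of this argument (the filter counts below are illustrative, loosely following Table 3): if the backward condition (14) holds, the forward product in Eqn.(9) telescopes to a fixed ratio of channel counts rather than an exponentially small number.

```python
# prod_{l=2..L} (1/2) n_l Var[w_l] = prod_{l=2..L} n_l / n_hat_l
#                                  = prod_{l=2..L} d_{l-1} / d_l = c_2 / d_L.
d = [96, 256, 256, 256, 256, 512, 512, 512, 512, 512]   # illustrative filter counts d_l
forward_product = 1.0
for l in range(1, len(d)):               # layers l = 2, ..., L
    forward_product *= d[l - 1] / d[l]   # n_l / n_hat_l = (k^2 c_l) / (k^2 d_l)
print(forward_product, d[0] / d[-1])     # both equal c_2 / d_L = 96/512 = 0.1875
```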
Figure 2: The convergence of a 22-layer large model (B in Table 3). The x-axis is the number of training epochs. The y-axis is the top-1 error of 3,000 random val samples, evaluated on the center crop. We use ReLU as the activation for both cases. Both our initialization (red) and "Xavier" (blue) [7] lead to convergence, but ours starts reducing error earlier.

Figure 3: The convergence of a 30-layer small model (see the main text). We use ReLU as the activation for both cases. Our initialization (red) is able to make it converge. But "Xavier" (blue) [7] completely stalls - we also verify that its gradients are all diminishing. It does not converge even given more epochs.
Discussions
If the forward/backward signal is inappropriately scaled by a factor $\beta$ in each layer, then the final propagated signal will be rescaled by a factor of $\beta^L$ after $L$ layers, where $L$ can represent some or all layers. When $L$ is large, if $\beta > 1$, this leads to extremely amplified signals and an algorithm output of infinity; if $\beta < 1$, this leads to diminishing signals². In either case, the algorithm does not converge - it diverges in the former case, and stalls in the latter.

²In the presence of weight decay ($\ell_2$ regularization of weights), when the gradient contributed by the logistic loss function is diminishing, the total gradient is not diminishing because of the weight decay. A way of diagnosing diminishing gradients is to check whether the gradient is modulated only by weight decay.
Our derivation also explains why the constant standard deviation of 0.01 makes some deeper networks stall [25]. We take "model B" in the VGG team's paper [25] as an example. This model has 10 conv layers, all with 3×3 filters. The filter numbers ($d_l$) are 64 for the 1st and 2nd layers, 128 for the 3rd and 4th layers, 256 for the 5th and 6th layers, and 512 for the rest. The std computed by Eqn.(14) ($\sqrt{2/\hat{n}_l}$) is 0.059, 0.042, 0.029, and 0.021 when the filter numbers are 64, 128, 256, and 512 respectively. If the std is initialized as 0.01, the std of the gradient propagated from conv10 to conv2 is $1/(5.9 \times 4.2^2 \times 2.9^2 \times 2.1^4) = 1/(1.7 \times 10^4)$ of what we derive. This number may explain why diminishing gradients were observed in experiments.
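The following short computation (a sketch, using only the numbers quoted above) reproduces these values and the overall attenuation factor:

```python
import numpy as np

# Backward-case std sqrt(2 / n_hat_l) with n_hat_l = 3*3*d_l for the ten conv
# layers of VGG "model B", and the factor by which a fixed std of 0.01 shrinks
# the gradient propagated from conv10 back to conv2.
filters = [64, 64, 128, 128, 256, 256, 512, 512, 512, 512]   # d_l for conv1..conv10
derived_std = [np.sqrt(2.0 / (9 * d)) for d in filters]
print([round(s, 3) for s in derived_std])      # 0.059, 0.059, 0.042, ..., 0.021

attenuation = np.prod([0.01 / s for s in derived_std[1:]])   # layers conv2..conv10
print(attenuation)                             # ≈ 6e-5 ≈ 1 / (1.7 x 10^4)
```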
It is also worth noticing that the variance of the input signal can be roughly preserved from the first layer to the last. In cases when the input signal is not normalized (e.g., it is in the range of [−128, 128]), its magnitude can be so large that the softmax operator will overflow. A solution is to normalize the input signal, but this may impact other hyper-parameters. Another solution is to include a small factor on the weights among all or some layers, e.g., $\sqrt[L]{1/128}$ on $L$ layers. In practice, we use a std of 0.01 for the first two fc layers and 0.001 for the last. These numbers are smaller than they should be (e.g., $\sqrt{2/4096}$) and will address the normalization issue of images whose range is about [−128, 128].
For the initialization in the PReLU case, it is easy to show that Eqn.(10) becomes:
$$\frac{1}{2}(1 + a^2)\, n_l \mathrm{Var}[w_l] = 1, \quad \forall l, \qquad (15)$$
where $a$ is the initialized value of the coefficients. If $a = 0$, it becomes the ReLU case; if $a = 1$, it becomes the linear case (the same as [7]). Similarly, Eqn.(14) becomes $\frac{1}{2}(1 + a^2)\, \hat{n}_l \mathrm{Var}[w_l] = 1$.
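A one-line generalization of the sketch given after Eqn.(10), now with the $(1 + a^2)$ factor of Eqn.(15); function name is illustrative.

```python
import numpy as np

def rectifier_init_std(k, c, a=0.25):
    """Std from Eqn.(15): (1/2)(1 + a^2) n_l Var[w_l] = 1, with n_l = k^2 * c.

    a = 0 recovers the ReLU case (Eqn. 10); a = 1 recovers the linear case of [7].
    """
    n = k * k * c
    return np.sqrt(2.0 / ((1.0 + a * a) * n))

print(rectifier_init_std(3, 64, a=0.0))    # ReLU case, ≈ 0.059
print(rectifier_init_std(3, 64, a=0.25))   # PReLU with the paper's init a = 0.25
```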
Comparisons with “Xavier” Initialization [7]
The main difference between our derivation and the "Xavier" initialization [7] is that we address the rectifier nonlinearities³. The derivation in [7] only considers the linear case, and its result is given by $n_l \mathrm{Var}[w_l] = 1$ (the forward case), which can be implemented as a zero-mean Gaussian distribution whose std is $\sqrt{1/n_l}$. When there are $L$ layers, the std will be $1/\sqrt{2^L}$ of our derived std. This number, however, is not small enough to completely stall the convergence of the models actually used in our paper (Table 3, up to 22 layers), as shown by experiments. Figure 2 compares the convergence of a 22-layer model. Both methods are able to make them converge. But ours starts reducing error earlier. We also investigate the possible impact on accuracy. For the model in Table 2 (using ReLU), the "Xavier" initialization method leads to 33.90/13.44 top-1/top-5 error, and ours leads to 33.82/13.34. We have not observed clear superiority of one to the other on accuracy.

³There are other minor differences. In [7], the derived variance is adopted for uniform distributions, and the forward and backward cases are averaged. But it is straightforward to adopt their conclusion for Gaussian distributions and for the forward or backward case only.
Next, we compare the two methods on extremely deep models with up to 30 layers (27 conv and 3 fc). We add up to sixteen conv layers, each with 256 filters of size 2×2, to the model in Table 1. Figure 3 shows the convergence of the 30-layer model. Our initialization is able to make the extremely deep model converge. On the contrary, the "Xavier" method completely stalls the learning, and the gradients are diminishing as monitored in the experiments.
These studies demonstrate that we are ready to investigate extremely deep, rectified models by using a more principled initialization method. But in our current experiments on ImageNet, we have not observed the benefit from training extremely deep models. For example, the aforementioned 30-layer model has 38.56/16.59 top-1/top-5 error, which is clearly worse than the error of the 14-layer model in Table 2 (33.82/13.34). Accuracy saturation or degradation was also observed in the study of small models [10], VGG's large models [25], and in speech recognition [34]. This is perhaps because the method of increasing depth is not appropriate, or the recognition task is not complex enough.
Though our attempts at extremely deep models have not shown benefits, our initialization method lays a foundation for further study on increasing depth. We hope this will be helpful in other more complex tasks.
| input size | VGG-19 [25] | model A | model B | model C |
|---|---|---|---|---|
| 224 | 3×3, 64 | 7×7, 96, /2 | 7×7, 96, /2 | 7×7, 96, /2 |
| | 3×3, 64 | | | |
| | 2×2 maxpool, /2 | | | |
| 112 | 3×3, 128 | | | |
| | 3×3, 128 | | | |
| | 2×2 maxpool, /2 | 2×2 maxpool, /2 | 2×2 maxpool, /2 | 2×2 maxpool, /2 |
| 56 | 3×3, 256 | 3×3, 256 | 3×3, 256 | 3×3, 384 |
| | 3×3, 256 | 3×3, 256 | 3×3, 256 | 3×3, 384 |
| | 3×3, 256 | 3×3, 256 | 3×3, 256 | 3×3, 384 |
| | 3×3, 256 | 3×3, 256 | 3×3, 256 | 3×3, 384 |
| | | 3×3, 256 | 3×3, 256 | 3×3, 384 |
| | | | 3×3, 256 | 3×3, 384 |
| | 2×2 maxpool, /2 | 2×2 maxpool, /2 | 2×2 maxpool, /2 | 2×2 maxpool, /2 |
| 28 | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 768 |
| | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 768 |
| | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 768 |
| | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 768 |
| | | 3×3, 512 | 3×3, 512 | 3×3, 768 |
| | | | 3×3, 512 | 3×3, 768 |
| | 2×2 maxpool, /2 | 2×2 maxpool, /2 | 2×2 maxpool, /2 | 2×2 maxpool, /2 |
| 14 | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 896 |
| | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 896 |
| | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 896 |
| | 3×3, 512 | 3×3, 512 | 3×3, 512 | 3×3, 896 |
| | | 3×3, 512 | 3×3, 512 | 3×3, 896 |
| | | | 3×3, 512 | 3×3, 896 |
| | 2×2 maxpool, /2 | spp, {7,3,2,1} | spp, {7,3,2,1} | spp, {7,3,2,1} |
| fc1 | 4096 | 4096 | 4096 | 4096 |
| fc2 | 4096 | 4096 | 4096 | 4096 |
| fc3 | 1000 | 1000 | 1000 | 1000 |
| depth (conv+fc) | 19 | 19 | 22 | 22 |
| complexity (ops., ×10¹⁰) | 1.96 | 1.90 | 2.32 | 5.30 |
Table 3: Architectures of large models. Here “/2” denotes a stride of 2.
2.3 ARCHITECTURES
The above investigations provide guidelines for designing our architectures, introduced as follows.
Our baseline is the 19-layer model (A) in Table 3. For a better comparison, we also list the VGG-19 model [25]. Our model A has the following modifications on VGG-19: (i) in the first layer, we use a filter size of 7×7 and a stride of 2; (ii) we move the other three conv layers on the two largest feature maps (224, 112) to the smaller feature maps (56, 28, 14). The time complexity (Table 3, last row) is roughly unchanged because the deeper layers have more filters; (iii) we use spatial pyramid pooling (SPP) [11] before the first fc layer. The pyramid has 4 levels - the numbers of bins are 7×7, 3×3, 2×2, and 1×1, for a total of 63 bins.
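A quick check of the pyramid size (a sketch; the 512-channel figure assumes model A/B's last conv layer feeds the SPP layer, as in Table 3):

```python
# Total SPP bins per channel for the 4-level pyramid {7, 3, 2, 1}.
levels = [7, 3, 2, 1]
bins = sum(n * n for n in levels)
print(bins)               # 7^2 + 3^2 + 2^2 + 1^2 = 63
print(bins * 512)         # length of the fixed-size fc1 input for models A/B (512 filters)
```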
It is worth noticing that we have no evidence that our model A is a better architecture than VGG-19, though our model A has better results than VGG-19's result reported by [25]. In our earlier experiments with less scale augmentation, we observed that our model A and our reproduced VGG-19 (with SPP and our initialization) are comparable. The main purpose of using model A is faster running speed: when their time complexity is the same, conv layers on larger feature maps actually run slower than those on smaller feature maps. In our four-GPU implementation, our model A takes 2.6s per mini-batch (128), and our reproduced VGG-19 takes 3.0s, evaluated on four Nvidia K20 GPUs.
In Table 3, our model B is a deeper version of A. It has three extra conv layers. Our model C is a wider (with more filters) version of B. The width substantially increases the complexity, and its time complexity is about 2.3× that of B (Table 3, last row). Training A/B on four K20 GPUs, or training C on eight K40 GPUs, takes about 3-4 weeks.
We choose to increase the model width instead of depth, because deeper models yield only diminishing improvements in accuracy, or even degradation. In recent experiments on small models [10], it has been found that aggressively increasing the depth leads to saturated or degraded accuracy. In the VGG paper [25], the 16-layer and 19-layer models perform comparably. In the speech recognition research of [34], the deep models degrade when using more than 8 hidden layers (all being fc). We conjecture that similar degradation may also happen on larger models for ImageNet. We have monitored the training procedures of some extremely deep models (with 3 to 9 layers added on B in Table 3), and found both training and testing error rates degraded in the first 20 epochs (but we did not run to the end due to limited time budget, so there is not yet solid evidence that these large and overly deep models will ultimately degrade). Because of the possible degradation, we choose not to further increase the depth of these large models.
On the other hand, the recent research [5] on small datasets suggests that the accuracy should improve from the increased number of parameters in conv layers. This number depends on the depth and width. So we choose to increase the width of the conv layers to obtain a higher-capacity model.
While all models in Table 3