Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
> Bilingual (Chinese-English) paper collection: https://aiqianji.com/blog/articles
Authors
Christian Szegedy
Google Inc.
1600 Amphitheatre Pkwy, Mountain View, CA
Sergey Ioffe
Vincent Vanhoucke
Alex Alemi
Abstract
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections.
Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin.
We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly.
We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.
1. Introduction
Since the 2012 ImageNet competition [11] winning entry by Krizhevsky et al [8], their network “AlexNet” has been successfully applied to a larger variety of computer vision tasks, for example to object-detection [4], segmentation [10], human pose estimation [17], video classification [7], object tracking [18], and superresolution [3]. These examples are but a few of all the applications to which deep convolutional networks have been very successfully applied ever since.
In this work we study the combination of the two most recent ideas: Residual connections introduced by He et al. in [5] and the latest revised version of the Inception architecture [15]. In [5], it is argued that residual connections are of inherent importance for training very deep architectures. Since Inception networks tend to be very deep, it is natural to replace the filter concatenation stage of the Inception architecture with residual connections. This would allow Inception to reap all the benefits of the residual approach while retaining its computational efficiency.
Besides a straightforward integration, we have also studied whether Inception itself can be made more efficient by making it deeper and wider. For that purpose, we designed a new version named Inception-v4 which has a more uniform simplified architecture and more inception modules than Inception-v3. Historically, Inception-v3 had inherited a lot of the baggage of the earlier incarnations. The technical constraints chiefly came from the need for partitioning the model for distributed training using DistBelief [2]. Now, after migrating our training setup to TensorFlow [1] these constraints have been lifted, which allowed us to simplify the architecture significantly. The details of that simplified architecture are described in Section 3.
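To make the contrast between the two merge strategies concrete, here is a minimal, purely illustrative sketch in Keras-style TensorFlow: the classical Inception block concatenates its parallel branches along the filter dimension, while the residual approach adds a transformed signal back onto its input. The branch widths (32 and 48 filters) and input shape are placeholders, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def concat_merge(x):
    # Inception-style merge: parallel branches are concatenated along the channel axis.
    b1 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(48, 3, padding="same", activation="relu")(x)
    return layers.Concatenate()([b1, b2])

def residual_merge(x):
    # Residual merge: the transformed signal is added back onto the input (shortcut).
    y = layers.Conv2D(x.shape[-1], 3, padding="same", activation="relu")(x)
    return layers.Add()([x, y])

inp = tf.keras.Input(shape=(35, 35, 64))
print(concat_merge(inp).shape)    # channels grow: 32 + 48 = 80
print(residual_merge(inp).shape)  # channels preserved so the addition is possible
```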
In this report, we will compare the two pure Inception variants, Inception-v3 and v4, with similarly expensive hybrid Inception-ResNet versions. Admittedly, those models were picked in a somewhat ad hoc manner with the main constraint being that the parameters and computational complexity of the models should be somewhat similar to the cost of the non-residual models. In fact we have tested bigger and wider Inception-ResNet variants and they performed very similarly on the ImageNet classification challenge [11] dataset.
The last experiment reported here is an evaluation of an ensemble of all the best performing models presented here. As it was apparent that both Inception-v4 and Inception-ResNet-v2 performed similarly well, exceeding state-of-the-art single-frame performance on the ImageNet validation dataset, we wanted to see how a combination of those pushes the state of the art on this well-studied dataset. Surprisingly, we found that gains on the single-frame performance do not translate into similarly large gains on ensembled performance. Nonetheless, it still allows us to report 3.1% top-5 error on the validation set with four models ensembled, setting a new state of the art, to the best of our knowledge.
In the last section, we study some of the classification failures and conclude that the ensemble still has not reached the label noise of the annotations on this dataset and there is still room for improvement for the predictions.
2. Related Work
Convolutional networks have become popular in large scale image recognition tasks after Krizhevsky et al. [8]. Some of the next important milestones were Network-in-network [9] by Lin et al., VGGNet [12] by Simonyan et al. and GoogLeNet (Inception-v1) [14] by Szegedy et al.
Residual connections were introduced by He et al. in [5], in which they give convincing theoretical and practical evidence for the advantages of utilizing additive merging of signals both for image recognition, and especially for object detection. The authors argue that residual connections are inherently necessary for training very deep convolutional models. Our findings do not seem to support this view, at least for image recognition. However, it might require more measurement points with deeper architectures to understand the true extent of the beneficial aspects offered by residual connections. In the experimental section we demonstrate that it is not very difficult to train competitive very deep networks without utilizing residual connections. However, the use of residual connections seems to improve the training speed greatly, which alone is a great argument for their use.
Figure 1: Residual connections as introduced in He et al. [5].
Figure 2: Optimized version of ResNet connections by [5] to shield computation.
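Since the figures themselves are not reproduced in this text, the following is a hedged sketch of the kind of "optimized" (bottleneck-style) residual unit that Figure 2 refers to: a 1×1 reduction, a cheap 3×3 convolution, and a 1×1 expansion whose output is added to the shortcut. The filter counts are illustrative, not the configuration used in [5].

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_residual(x, reduced_filters=64):
    """Bottleneck-style residual unit: reduce with 1x1, transform with 3x3,
    expand back with 1x1, then add the shortcut."""
    shortcut = x
    y = layers.Conv2D(reduced_filters, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    y = layers.Conv2D(reduced_filters, 3, padding="same", activation="relu")(y)  # cheap 3x3 conv
    y = layers.Conv2D(x.shape[-1], 1, padding="same", activation=None)(y)        # 1x1 expansion
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)

inp = tf.keras.Input(shape=(56, 56, 256))
print(bottleneck_residual(inp).shape)  # (None, 56, 56, 256)
```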
The Inception deep convolutional architecture was introduced in [14] and was called GoogLeNet or Inception-v1 in our exposition. Later the Inception architecture was refined in various ways, first by the introduction of batch normalization [6] (Inception-v2) by Ioffe et al. Later the architecture was improved by additional factorization ideas in the third iteration [15] which will be referred to as Inception-v3 in this report.
3. Architectural Choices
3.1. Pure Inception blocks
Our older Inception models used to be trained in a partitioned manner, where each replica was partitioned into multiple sub-networks in order to be able to fit the whole model in memory. However, the Inception architecture is highly tunable, meaning that there are a lot of possible changes to the number of filters in the various layers that do not affect the quality of the fully trained network. In order to optimize the training speed, we used to tune the layer sizes carefully in order to balance the computation between the various model sub-networks. In contrast, with the introduction of TensorFlow our most recent models can be trained without partitioning the replicas. This is enabled in part by recent optimizations of memory used by backpropagation, achieved by carefully considering what tensors are needed for gradient computation and structuring the computation to reduce the number of such tensors.

Historically, we have been relatively conservative about changing the architectural choices and restricted our experiments to varying isolated network components while keeping the rest of the network stable. Not simplifying earlier choices resulted in networks that looked more complicated than they needed to be. In our newer experiments, for Inception-v4 we decided to shed this unnecessary baggage and made uniform choices for the Inception blocks for each grid size. Please refer to Figure 9 for the large-scale structure of the Inception-v4 network and Figures 3, 4, 5, 6, 7 and 8 for the detailed structure of its components. All the convolutions not marked with “V” in the figures are same-padded, meaning that their output grid matches the size of their input. Convolutions marked with “V” are valid padded, meaning that the input patch of each unit is fully contained in the previous layer and the grid size of the output activation map is reduced accordingly.
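The “V” convention can be verified with a small shape check; the layer sizes below are arbitrary examples, not taken from the figures.

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(35, 35, 192))

# Unmarked convolution: "same" padding, so the output grid matches the input grid.
same_out = layers.Conv2D(192, 3, strides=1, padding="same")(inp)
print(same_out.shape)   # (None, 35, 35, 192)

# Convolution marked "V": "valid" padding, every patch lies fully inside the input,
# so the output grid shrinks (here from 35x35 to 33x33).
valid_out = layers.Conv2D(192, 3, strides=1, padding="valid")(inp)
print(valid_out.shape)  # (None, 33, 33, 192)
```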
3.2. Residual Inception Blocks
For the residual versions of the Inception networks, we use cheaper Inception blocks than the original Inception. Each Inception block is followed by a filter-expansion layer (1×1 convolution without activation) which is used for scaling up the dimensionality of the filter bank before the addition to match the depth of the input. This is needed to compensate for the dimensionality reduction induced by the Inception block.
We tried several versions of the residual version of Inception. Only two of them are detailed here. The first one, “Inception-ResNet-v1”, roughly matches the computational cost of Inception-v3, while “Inception-ResNet-v2” matches the raw cost of the newly introduced Inception-v4 network. See Figure 15 for the large-scale structure of both variants. (However, the step time of Inception-v4 proved to be significantly slower in practice, probably due to the larger number of layers.)
Another small technical difference between our residual and non-residual Inception variants is that in the case of Inception-ResNet, we used batch normalization only on top of the traditional layers, but not on top of the summations. It is reasonable to expect that a thorough use of batch normalization should be advantageous, but we wanted to keep each model replica trainable on a single GPU. It turned out that the memory footprint of layers with a large activation size was consuming a disproportionate amount of GPU memory. By omitting the batch normalization on top of those layers, we were able to increase the overall number of Inception blocks substantially. We hope that with better utilization of computing resources, making this trade-off will become unnecessary.
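As a concrete, hedged illustration of the two points above (the linear 1×1 filter-expansion before the addition, and batch normalization on the “traditional” convolutional layers only, not on the summation), here is a minimal Inception-ResNet-style block. The branch widths are placeholders and do not reproduce any specific module from the figures.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size):
    """A 'traditional' layer: convolution + batch normalization + ReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def inception_resnet_block(x):
    # Two cheap parallel branches (placeholder widths).
    b1 = conv_bn_relu(x, 32, 1)
    b2 = conv_bn_relu(x, 32, 1)
    b2 = conv_bn_relu(b2, 48, 3)
    mixed = layers.Concatenate()([b1, b2])
    # Linear 1x1 filter-expansion: scales the depth back up to the input depth,
    # with no activation, so the result can be added to the shortcut.
    up = layers.Conv2D(x.shape[-1], 1, padding="same", activation=None)(mixed)
    # Summation with the shortcut; note there is no batch normalization on top of it.
    out = layers.Add()([x, up])
    return layers.Activation("relu")(out)

inp = tf.keras.Input(shape=(35, 35, 256))
print(inception_resnet_block(inp).shape)  # (None, 35, 35, 256)
```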
Figure 3: The schema for the stem of the pure Inception-v4 and Inception-ResNet-v2 networks. This is the input part of those networks. Cf. Figures 9 and 15.
Figure 4: The schema for the 35×35 grid modules of the pure Inception-v4 network. This is the Inception-A block of Figure 9.
Figure 5: The schema for the 17×17 grid modules of the pure Inception-v4 network. This is the Inception-B block of Figure 9.
Figure 6: The schema for the 8×8 grid modules of the pure Inception-v4 network. This is the Inception-C block of Figure 9.
Figure 7: The schema for the 35×35 to 17×17 reduction module. Different variants of this block (with various numbers of filters) are used in Figures 9 and 15, in each of the new Inception(-v4, -ResNet-v1, -ResNet-v2) variants presented in this paper. The k, l, m, n numbers represent filter bank sizes which can be looked up in Table 1.
Figure 8: The schema for the 17×17 to 8×8 grid-reduction module. This is the reduction module used by the pure Inception-v4 network in Figure 9.
Figure 9: The overall schema of the Inception-v4 network. Please refer to Figures 3, 4, 5, 6, 7 and 8 for the detailed structure of its various components.
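For orientation, the macro-structure of Figure 9 can be summarized as a simple block outline. The repetition counts below follow our reading of Figure 9 and should be checked against the figure itself; the block implementations are deliberately omitted.

```python
# Purely illustrative outline of the Inception-v4 macro-structure sketched in Figure 9.
INCEPTION_V4_OUTLINE = [
    ("stem", 1),          # Figure 3
    ("inception_a", 4),   # Figure 4
    ("reduction_a", 1),   # Figure 7, parameterized by k, l, m, n from Table 1
    ("inception_b", 7),   # Figure 5
    ("reduction_b", 1),   # Figure 8
    ("inception_c", 3),   # Figure 6
    ("average_pool", 1),
    ("dropout", 1),
    ("softmax", 1),
]

if __name__ == "__main__":
    for name, repeat in INCEPTION_V4_OUTLINE:
        print(f"{repeat} x {name}")
```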
Figure 10: The schema for 35×35 grid (Inception-ResNet-A) module of Inception-ResNet-v1 network.
Figure 11: The schema for 17×17 grid (Inception-ResNet-B) module of Inception-ResNet-v1 network.
Figure 12: “Reduction-B” 17×17 to 8×8 grid-reduction module. This module is used by the smaller Inception-ResNet-v1 network in Figure 15.
Figure 13: The schema for 8×8 grid (Inception-ResNet-C) module of Inception-ResNet-v1 network.
Figure 14: The stem of the Inception-ResNet-v1 network.
Figure 15: Schema for Inception-ResNet-v1 and Inception-ResNet-v2 networks. This schema applies to both networks but the underlying components differ. Inception-ResNet-v1 uses the blocks as described in Figures 14, 10, 7, 11, 12 and 13. Inception-ResNet-v2 uses the blocks as described in Figures 3, 16, 7, 17, 18 and 19. The output sizes in the diagram refer to the activation vector tensor shapes of Inception-ResNet-v1.
Figure 16: The schema for 35×35 grid (Inception-ResNet-A) module of the Inception-ResNet-v2 network.
Figure 17: The schema for 17×17 grid (Inception-ResNet-B) module of the Inception-ResNet-v2 network.
Figure 18: The schema for 17×17 to 8×8 grid-reduction module. This is the Reduction-B module used by the wider Inception-ResNet-v2 network in Figure 15.
Figure 19: The schema for 8×8 grid (Inception-ResNet-C) module of the Inception-ResNet-v2 network.
| Network | k | l | m | n |
|---|---|---|---|---|
| Inception-v4 | 192 | 224 | 256 | 384 |
| Inception-ResNet-v1 | 192 | 192 | 256 | 384 |
| Inception-ResNet-v2 | 256 | 256 | 384 | 384 |
Table 1: The number of filters of the Reduction-A module for the three Inception variants presented in this paper. The four numbers in the columns of the table parametrize the four convolutions of Figure 7.
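As an illustration of how k, l, m and n enter the module of Figure 7, here is a sketch of a Reduction-A-style block. The three-branch layout (a stride-2 max-pool branch, a single stride-2 3×3 convolution with n filters, and a 1×1 → 3×3 → stride-2 3×3 tower with k, l and m filters, all “V”-padded) is our reading of the figure and should be treated as illustrative rather than authoritative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def reduction_a(x, k, l, m, n):
    """Sketch of a Reduction-A-style 35x35 -> 17x17 module parameterized as in Table 1."""
    # Branch 1: stride-2 max pooling, valid padded.
    b1 = layers.MaxPooling2D(3, strides=2, padding="valid")(x)
    # Branch 2: single stride-2 3x3 convolution with n filters.
    b2 = layers.Conv2D(n, 3, strides=2, padding="valid", activation="relu")(x)
    # Branch 3: 1x1 (k) -> 3x3 (l) -> stride-2 3x3 (m) tower.
    b3 = layers.Conv2D(k, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(l, 3, padding="same", activation="relu")(b3)
    b3 = layers.Conv2D(m, 3, strides=2, padding="valid", activation="relu")(b3)
    return layers.Concatenate()([b1, b2, b3])

# Inception-v4 row of Table 1: k=192, l=224, m=256, n=384.
inp = tf.keras.Input(shape=(35, 35, 384))
print(reduction_a(inp, k=192, l=224, m=256, n=384).shape)  # (None, 17, 17, 1024)
```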
3.3. Scaling of the Residuals
Also we found that if the number of filter