Squeeze-and-Excitation Networks 挤压和激励网络


Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, much existing work has shown the benefits of enhancing spatial encoding. In this work, we focus on channels and propose a novel architectural unit, which we term the “Squeeze-and-Excitation”(SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at slight computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 eRor to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016.

卷积神经网络建立在卷积运算的基础上,通过融合局部感受野内的空间信息和通道信息来提取信息特征。为了提高网络的表示能力,许多现有的工作已经显示出增强空间编码的好处。在这项工作中,我们专注于通道,并提出了一种新颖的结构单元,我们称之为“Squeeze-and-Excitation”(SE)块,通过显式地建模通道之间的相互依赖关系,自适应地重新校准通道式的特征响应。通过将这些块堆叠在一起,我们证明了我们可以构建SENet架构,在具有挑战性的数据集中可以进行泛化地非常好。关键的是,我们发现SE块以微小的计算成本为现有的最先进的深层架构产生了显著的性能改进。SENets是我们ILSVRC 2017分类提交的基础,它在该分类赢得​​了第一名,并将top-5错误率显著减少到2.251 %,相对于2016年的获胜成绩取得了25%的相对改进。

Introduction 综述

Convolutional neural networks (CNNs) have proven to be useful models for tackling a wide range of visual tasks . At each convolutional layer in the network, a collection of filters expresses neighbourhood spatial connectivity patterns along input channels---fusing spatial and channel-wise information together within local receptive fields. By interleaving a series of convolutional layers with non-linear activation functions and downsampling operators, CNNs are able to produce image representations that capture hierarchical patterns and attain global theoretical receptive fields. A central theme of computer vision research is the search for more powerful representations that capture only those properties of an image that are most salient for a given task, enabling improved performance. As a widely-used family of models for vision tasks, the development of new neural network architecture designs now represents a key frontier in this search. Recent research has shown that the representations produced by CNNs can be strengthened by integrating learning mechanisms into the network that help capture spatial coRelations between features. One such approach, popularised by the Inception family of architectures , incorporates multi-scale processes into network modules to achieve improved performance. Further work has sought to better model spatial dependencies and incorporate spatial attention into the structure of the network .


In contrast to these methods, we investigate a diFerent aspect of architectural design —— the channel relationship, by introducing a new architectural unit, which we term the “Squeeze-and-Excitation” (SE) block. Our goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

与这些方法相反,我们引入新的架构单元,“Squeeze-and-Excitation” (SE)块,我们研究了架构设计的一个不同方向——通道关系。我们的目标是通过对网络的卷积特征通道之间的相互依赖性进行建模,从而提高网络的表示能力。为了达到这个目的,我们提出了一种机制,使网络能够执行特征重新校准,通过该机制,网络可以学习使用全局信息来选择性地强调信息性特征,并抑制不太有用的特征。

The basic structure of the SE building block is illustrated in Fig.1. For any given transformation $ F_{tr} $ mapping the input $ X $ to the feature maps $ U $ where $ U\in\mathbb{R}^{H \times W \times C} $ , (e.g. a convolution or a set of convolutions), we can construct a coResponding SE block to perform feature recalibration as follows. The features \mathbf{U} are first passed through a *squeeze* operation, which aggregates the feature maps across spatial dimensions W \times H to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an *excitation* operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps \mathbf{U}$ are then reweighted to generate the output of the SE block which can then be fed directly into subsequent layers.

SE构建块的基本结构如图1所示。对于任何给定的变换$ F_{tr} $ : $X → U $, $X \in \mathbb{R}^{W’ \times H’ \times C’}$,$ U\in\mathbb{R}^{H \times W \times C} $ ,(例如卷积或一组卷积),我们可以构造一个相应的SE块来执行特征重新校准,如下所示。特征U首先通过squeeze操作,该操作跨越空间维度W $\times $ H聚合特征映射来产生通道描述符。这个描述符嵌入了通道特征响应的全局分布,使来自网络全局感受野的信息能够被其较低层利用。这之后是一个excitation操作,其中通过基于通道依赖性的自门控机制(self-gating)为每个通道学习特定采样的激活,控制每个通道的激励。然后特征映射U被重新加权以生成SE块的输出,然后可以将其直接输入到后续层中。

Figure 1

图1. Squeeze-and-Excitation块 A Squeeze-and-Excitation block.

An SE network can be generated by simply stacking a collection of SE building blocks. SE blocks can also be used as a drop-in replacement for the original block at any depth in the architecture. However, while the template for the building block is generic, as we show in Sec. 6.3, the role it performs at diFerent depths adapts to the needs of the network. In the early layers, it learns to excite informative features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the SE block becomes increasingly specialised, and responds to diFerent inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.


The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and can be used directly with existing state-of-the-art architectures whose convolutional layers can be strengthened by direct replacement with their SE counterparts. Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets, namely SE-ResNet, SE-Inception, SE-ResNeXt and SE-Inception-ResNet and provide an extensive evaluation of SENets on the ImageNet 2012 dataset [30]. Further, to demonstrate the general applicability of SE blocks, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or a task.

新CNN架构的开发是一项具有挑战性的工程任务,通常涉及许多新的超参数和层配置的选择。相比之下,上面概述的SE块的设计是简单的,并且可以直接与现有的最新架构一起使用,其卷积层可以通过直接用对应的SE层来替换从而进行加强。另外,如第四节所示,SE块在计算上是轻量级的,并且在模型复杂性和计算负担方面仅稍微增加。为了支持这些声明,我们开发了一些SENets,即SE-ResNet,SE-Inception,SE-ResNeXt和SE-Inception-ResNet,并在ImageNet 2012数据集上对SENets进行了广泛的评估。此外,为了证明SE块的一般适用性,我们还呈现了ImageNet之外的结果,表明所提出的方法不受限于特定的数据集或任务。

Using SENets, we won the first place in the ILSVRC 2017 classification competition. Our top performing model ensemble achieves a $2.251% top-5 eRor on the test set. This represents a \sim 25% relative improvement in comparison to the winner entry of the previous year (with a top-$5 eRor of $2.991%$). Our models and related materials have been made available to the research community.

使用SENets,我们赢得了ILSVRC 2017分类竞赛的第一名。我们的表现最好的模型集合在测试集上达到了2.251%的top-5错误率。与前一年的获奖者(2.991%的top-5错误率)相比,这表示25%的相对改进。我们的模型和相关材料已经提供给研究界。

Deep architectures. A wide range of work has shown that restructuring the architecture of a convolutional neural network in a manner that eases the learning of deep features can yield substantial improvements in performance. VGGNets [35] and Inception models [39] demonstrated the benefits that could be attained with an increased depth, significantly outperforming previous approaches on ILSVRC 2014. Batch normalization (BN) [14] improved gradient propagation through deep networks by inserting units to regulate layer inputs stabilising the learning process, which enables further experimentation with a greater depth. He et al. [9, 10] showed that it was effective to train deeper networks by restructuring the architecture to learn residual functions through the use of identity-based skip connections which ease the flow of information across units. More recently, reformulations of the connections between network layers [5, 12] have been shown to further improve the learning and representational properties of deep networks.

深层架构 大量的工作已经表明,以易于学习深度特征的方式重构卷积神经网络的架构可以大大提高性能。VGGNets[35]和Inception模型[39]证明了深度增加可以获得的好处,明显超过了ILSVRC 2014之前的方法。批标准化(BN)[14]通过插入单元来调节层输入稳定学习过程,改善了通过深度网络的梯度传播,这使得可以用更深的深度进行进一步的实验。He等人[9,10]表明,通过重构架构来训练更深层次的网络是有效的,通过使用基于恒等映射的跳跃连接来学习残差函数,从而减少跨单元的信息流动。最近,网络层间连接的重新表示[5,12]已被证明可以进一步改善深度网络的学习和表征属性。

An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [13, 43] to learn richer representations. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of convolutional operators [14, 38, 39, 40]. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 18] or jointly by using standard convolutional filters [22] with 1×11×1 convolutions, while much of this work has concentrated on the objective of reducing model and computational complexity. This approach reflects an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the network with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.


Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal. The development and understanding of such mechanisms has been a longstanding area of research in the neuroscience community [15, 16, 28] and has seen significant interest in recent years as a powerful addition to deep neural networks [20, 25]. Attention has been shown to improve performance across a range of tasks, from localisation and understanding in images [3, 17] to sequence-based models [2, 24]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [11, 37]. Recent work has shown its applicability to tasks such as image captioning [4, 44] and lip reading [7], in which it is exploited to efficiently aggregate multi-modal data. In these applications, it is typically used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Highway networks [36] employ a gating mechanism to regulate the shortcut connection, enabling the learning of very deep architectures. Wang et al. [42] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [27], inspired by its success in semantic segmentation. This high capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE-block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of modules throughout the network.


Squeeze-and-Excitation Blocks SE模块

A Squeeze-and-Excitation block is a computational unit which can be built upon a transformation $ F_{tr} $ mapping an input $ x\in R^{H' \times W' \times C'} $ to feature maps $ U\in R^{H \times W \times C} $ . In the notation that follows we take $ F_{tr} $ to be a convolutional operator and use $V= [v_1, v_2, \dots, v_c] $to denote the learned set of filter kernels, where $ v_c $ refers to the parameters of the filter. We can then write the outputs as $ U=[u_1,u_2, \dots, U_c] $ , where

Squeeze-and-Excitation块是一种计算单元,可以为任何给定的变换构建:$F_{tr}:X→U $,$x\in R^{H' \times W' \times C'} $,$U\in R^{H \times W \times C} $。为了简化说明,在接下来的表示中,我们将$F_{tr} $作为卷积运算符使用$V= [v_1, v_2, \dots, v_c] $表示学习到的一组滤波器核,其中$v_c $ 是指的是第c个滤波器的参数。然后,我们可以将输出写成 $ U=[u_1,u_2, \dots, U_c] $,其中

$$ U_c = v_c ∗ X = \sum_{s=1}^{C'}v^s_c∗ x^s。 $$

Here $ \ast $ denotes convolution, $ v_c = [v^1_c, v^2_c, \dots, v_c = 0_c = 1_c = 2_c = 3 $ , $ x = [x^1, x^2, \dots, x = 0 = 1 = 2 $ and . $ v^s_c $ is a $ 2 $ D spatial kernel representing a single channel of $ v_c $ that acts on the coResponding channel of $ x $ . To simplify the notation, bias terms are omitted. Since the output is produced by a summation through all channels, channel dependencies are implicitly embedded in $ v_c $ , but are entangled with the local spatial coRelation captured by the filters. The channel relationships modelled by convolution are inherently implicit and local (except the ones at top-most layers). We expect the learning of convolutional features to be enhanced by explicitly modelling channel interdependencies, so that the network is able to increase its sensitivity to informative features which can be exploited by subsequent transformations. Consequently, we would like to provide it with access to global information and recalibrate filter responses in two steps, and , before they are fed into the next transformation. A diagram illustrating the structure of an SE block is shown in Fig.splash.

这里$\ast $表示卷积,$v_c = [v^1_c, v^2_c, \dots, v^C'_c] $,$X = [x^1, x^2, \dots, x^C'_c $和。 $v^s_c $是2d空间核,表示单个通道的$v_c $,作用于对应的通道X。由于输出由通过所有通道的求和产生,因此通道依赖性被隐式地嵌入到$v_c $中,但是这些依赖性与滤波器捕获的空间相关性纠缠在一起。我们的目标是确保能够提高网络对信息特征的敏感度,以便后续转换可以利用这些功能,并抑制不太有用的功能。我们建议通过显式建模通道依赖性来实现这一点,以便在进入下一个转换之前通过两步重新校准滤波器响应,两步为:squeeze和excitation。SE构建块的图如图所示。

Squeeze: Global Information Embedding Squeeze:全局信息嵌入

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operate with a local receptive field and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region. This is an issue that becomes more severe in the lower layers of the network whose receptive field sizes are small.


To mitigate this problem, we propose to global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $ z\in R^{C} $ is generated by shrinking $ U $ through its spatial dimensions $ H \times W $ , such that the $ c $ -th element of $ \z $ is calculated by:

为了缓解此问题,我们提出将全局空间信息压缩成一个通道描述符。这是通过使用全局平均池化生成通道统计实现的。形式上,在空间维度$ H \times W $ 收缩U统计生成$ z\in R^{C} $ ,使得$z$的第$ c$ 个元素通过如下公式计算:
$$ z_c = F_{sq}(U_c) = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i,j). $$

Discussion.The output of the transformation $ U $ can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in prior feature engineering work . We opt for the simplest aggregation technique, global average pooling, noting that more sophisticated strategies could be employed here as well.

讨论:转换$U $的输出可以被解释为局部描述子的集合,其统计信息对于整个图像呈现。利用此类信息在先前的特征工程工作中是普遍使用的。我们选择最简单的全局平均池化,注意到这里可以使用更复杂的策略。

Excitation: Adaptive Recalibration Excitation:自适应重新校准

To make use of the information aggregated in the operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship since we would like to ensure that multiple channels are allowed to be emphasised (rather than enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:


$$ s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\delta(W_1z)), $$

where $ \delta $ refers to the ReLU function, $ W_1 \in R^{\frac{C}{r} \times C} $ and $ W_2 \in R^{C \times \frac{C}{r}} $ . To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with reduction ratio $ r $ (this parameter choice is discussed in Sectionreduction), a ReLU and then a dimensionality-increasing layer returning to the channel dimension of the transformation output $ U $ . The final output of the block is obtained by rescaling $ U $ with the activations $ s $ :

其中$\delta $指的是Relu功能,$W_1 \in R^{\frac{C}{r} \times C} $和$\in R^{C \times \frac{C}{r}} $。为了限制模型复杂性和辅助泛化,我们通过形成围绕非线性的两个全连接层(Fc)的瓶颈来参数化门机制,即降维层参数为W1,降维比例为r(我们把它设置为16,这个参数选择在6.3节中讨论),一个ReLU,然后是一个参数为W2的升维层。块的最终输出通过重新调节带有激活的变换输出U得到:

$$ \widetilde{x}c = F{scale}(u_c, s_c) = s_c,u_c, $$


where X and $ F_{scale}(U_c, s_c) $ refers to channel-wise multiplication between the scalar $ s_c $ and the feature map $ U_c \in R^{H \times W} $ .

其中X 和 $F_{scale}(U_c, s_c) $指的是 标量$ s_c $和特征映射$U_c \in R^{H \times W} $之间的对应通道乘积。

The excitation operator maps the input-specific descriptor $ z $ to a set of channel weights. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, which can be regarded as a self-attention function on channels whose relationships are not confined to the local receptive field the convolutional filters are responsive to.

激励运算符将输入特定的描述符$ z $映射到一组频道权重。在这方面,SE块本质上引入了以输入为条件的动态特性,有助于提高特征辨别力。

3.3. Exemplars: SE-Inception and SE-ResNet 示例:SE-Inception和SE-ResNet

The flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by integrating SE blocks into two popular network families of architectures, Inception and ResNet. SE blocks are constructed for the Inception network by taking the transformation $ F_{tr} $to be an entire Inception module (see Fig.2). By making this change for each such module in the architecture, we construct an SE-Inception network.

SE块的灵活性意味着它可以直接应用于标准卷积之外的变换。为了说明这一点,我们通过将SE块集成到两个流行的网络架构系列Inception和ResNet中来开发SENets。通过将变换$ F_{tr} $看作一个整体的Inception模块(参见图2),为Inception网络构建SE块。通过对架构中的每个模块进行更改,我们构建了一个SE-Inception网络。

Figure 2

Figure 2. The schema of the original Inception module (left) and the SE-Inception module (right).

Residual networks and their variants have shown to be highly effective at learning deep representations. We develop a series of SE blocks that integrate with ResNet [9], ResNeXt [43] and Inception-ResNet [38] respectively. Fig.3 depicts the schema of an SE-ResNet module. Here, the SE block transformation $ F_{tr} $ is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch.

Figure 3

Figure 3. The schema of the original Residual module (left) and the SE-ResNet module (right).
图3。 最初的Residual模块架构(左)和SE-ResNet模块架构(右)。


Model and Computational Complexity 模型和计算复杂度

An SENet is constructed by stacking a set of SE blocks. In practice, it is generated by replacing each original block (i.e. residual block) with its corresponding SE counterpart (i.e. SE-residual block). We describe the architecture of SE-ResNet-50 and SE-ResNeXt-50 in Table 1.


Table 1. (Left) ResNet-50. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a 32×4d32×4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets and the number of stacked blocks in a stage is presented outside. The inner brackets following by fc indicates the output dimension of the two fully connected layers in a SE-module.


For the proposed SE block design to be of practical use, it must oFer a good trade-oF between improved performance and increased model complexity. To illustrate the computational burden associated with the module, we consider a comparison between ResNet-50 and SE-ResNet-50 as an example. ResNet-50 requires $ {\sim}3.86 $ GFLOPs in a single forward pass for a $ 224\times224 $ pixel input image. Each SE block makes use of a global average pooling operation in the phase and two small FC layers in the phase, followed by an inexpensive channel-wise scaling operation. In the aggregate, when setting the reduction ratio $ r $ (introduced in Sectionadaptive-recal) to $ 16 $ , SE-ResNet-50 requires $ {\sim}3.87 $ GFLOPs, coResponding to a $ 0.26% $ relative increase over the original ResNet-50. In exchange for this slight additional computational burden, the accuracy of surpasses that of ResNet-50 and indeed, approaches that of a deeper ResNet-101 network requiring $ {\sim}7.58 $ GFLOPs (Tableimagenet-results). In practical terms, a single pass forwards and backwards through ResNet-50 takes $ 190 $ ms, compared to $ 209 $ ms for SE-ResNet-50 with a training minibatch of $ 256 $ images (both timings are performed on a server with $ 8 $ NVIDIA Titan X GPUs). We suggest that this represents a reasonable runtime overhead, which may be further reduced as global pooling and small inner-product operations receive further optimisation in popular GPU libraries. Due to its importance for embedded device applications, we further benchmark CPU inference time for each model: for a $ 224\times 224 $ pixel input image, ResNet-50 takes $ 164 $ ms in comparison to $ 167 $ ms for SE-ResNet-50.

为提出的SE块设计具有实际使用,它必须在提高性能和提高模型复杂性之间提供良好的权衡。为了说明与模块相关的计算负担,我们考虑作为示例的Reset-50和SE-Resnet-50之间的比较。 RESET-50需要${\sim}3.86 $ GFLOP在单个向前通行证中进行$ 224\times224 $像素输入图像。每个SE块在阶段中使用相位和两个小FC层的全局平均池操作,然后是廉价的通道 - 方向缩放操作。在聚合时,在将$ r $(在SEENADAPTIVE-RECAL中引入)到$ 16 $时,SE-RESNET-50需要${\sim}3.87 $ GFLOP,对应于$ 0.26% $相对增加原始RESNET - 50。为了换取这种轻微的额外计算负担,超越Reset-50且实际上的准确性接近了更深的Reset-101网络,需要${\sim}7.58 $ GFLOPS(TableImageNet-excupt)。实际上,通过RESET-50通过RESET-50向后和向后向后,与$ 209 $ MS相比,对于SE-RESNET-50,具有$ 256 $图像的训练传单(两个定时在服务器上执行两个定时)相比$ 8 $ nvidia titan x gpus)。我们建议这代表了合理的运行时间开销,这可能进一步减少,因为全球池和小型内部产品运营接收到流行的GPU库中的进一步优化。由于其对嵌入式设备应用的重要性,我们进一步的每个型号的基准CPU推理时间:对于$ 224\times 224 $像素输入图像,RESET-50与$ 167 $ MS进行$ 164 $ MS,用于SE-RESET- 50。

Table 2. Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation column. The SENet column refers the corresponding architectures in which SE blocks have been added. The numbers in brackets denote the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted subset of the validation set (this is discussed in more detail in [38]), which may slightly improve results.

In practice, with a training mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190190ms, compared to 209ms for SE-ResNet-50 (both timings are performed on a server with 88 NVIDIA Titan X GPUs). We argue that it is a reasonable overhead as global pooling and small inner-product operations are less optimised in existing GPU libraries. Moreover, due to its importance for embedded device applications, we also benchmark CPU inference time for each model: for a 224×224pixel input image, ResNet-50 takes 164ms, compared to for SE-ResNet-50. The small additional computational overhead required by the SE block is justified by its contribution to model performance (discussed in detail in Sec. 6).

在实践中,训练的批数据大小为256张图像,ResNet-50的一次前向传播和反向传播花费190 ms,而SE-ResNet-50则花费209 ms(两个时间都在具有88个NVIDIA Titan X GPU的服务器上执行)。我们认为这是一个合理的开销,因为在现有的GPU库中,全局池化和小型内积操作的优化程度较低。此外,由于其对嵌入式设备应用的重要性,我们还对每个模型的CPU推断时间进行了基准测试:对于224×224像素的输入图像,ResNet-50花费了164ms,相比之下,SE-ResNet-50花费了167ms。SE块所需的小的额外计算开销对于其对模型性能的贡献来说是合理的(在第6节中详细讨论)。

Next, we consider the additional parameters introduced by the proposed block. All additional parameters are contained in the two fully connected layers of the gating mechanism, which constitute a small fraction of the total network capacity. More precisely, the number of additional parameters introduced is given by:


where rdenotes the reduction ratio (we set r to 16 in all our experiments), S refers to the number of stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), Cs denotes the dimension of the output channels for stage s and Nsrefers to the repeated block number. In total, SE-ResNet-50 introduces ∼∼2.5 million additional parameters beyond the ∼∼25 million parameters required by ResNet-50, corresponding to a ∼10% increase in the total number of parameters. The majority of these additional parameters come from the last stage of the network, where excitation is performed across the greatest channel dimensions. However, we found that the comparatively expensive final stage of SE blocks could be removed at a marginal cost in performance (<0.1% top-1 error on ImageNet dataset) to reduce the relative parameter increase to ∼4%, which may prove useful in cases where parameter usage is a key consideration.


During training, we follow standard practice and perform data augmentation with random-size cropping [39] to 224×224 pixels (299×299 for Inception-ResNet-v2 [38] and SE-Inception-ResNet-v2) and random horizontal flipping. Input images are normalised through mean channel subtraction. In addition, we adopt the data balancing strategy described in [32] for mini-batch sampling to compensate for the uneven distribution of classes. The networks are trained on our distributed learning system “ROCS” which is capable of handing efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a mini-batch size of 1024 (split into sub-batches of 32 images per GPU across 4 servers, each containing 8 GPUs). The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [8].


Experiments 实验

In this section we conduct extensive experiments on the ImageNet 2012 dataset [30] for the purposes: first, to explore the impact of the proposed SE block for the basic networks with different depths and second, to investigate its capacity of integrating with current state-of-the-art network architectures, which aim to a fair comparison between SENets and non-SENets rather than pushing the performance. Next, we present the results and details of the models for ILSVRC 2017 classification task. Furthermore, we perform experiments on the Places365-Challenge scene classification dataset [48] to investigate how well SENets are able to generalise to other datasets. Finally, we investigate the role of excitation and give some analysis based on experimental phenomena.

在这一部分,我们在ImageNet 2012数据集上进行了大量的实验[30],其目的是:首先探索提出的SE块对不同深度基础网络的影响;其次,调查它与最先进的网络架构集成后的能力,旨在公平比较SENets和非SENets,而不是推动性能。接下来,我们将介绍ILSVRC 2017分类任务模型的结果和详细信息。此外,我们在Places365-Challenge场景分类数据集[48]上进行了实验,以研究SENets是否能够很好地泛化到其它数据集。最后,我们研究激励的作用,并根据实验现象给出了一些分析。

Image Classification 图像分类

The ImageNet 2012 dataset is comprised of 1.28 million training images and 50K validation images from 1000 classes. We train networks on the training set and report the top-1 and the top-5 errors using centre crop evaluations on the validation set, where 224×224 pixels are cropped from each image whose shorter edge is first resized to 256 (299×299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).

ImageNet 2012数据集包含来自1000个类别的128万张训练图像和5万张验证图像。我们在训练集上训练网络,并在验证集上使用中心裁剪图像评估来报告top-1top-5错误率,其中每张图像短边首先归一化为256,然后从每张图像中裁剪出224×224个像素,(对于Inception-ResNet-v2和SE-Inception-ResNet-v2,每幅图像的短边首先归一化到352,然后裁剪出299×299个像素)。

Network depth. We first compare the SE-ResNet against a collection of standard ResNet architectures. Each ResNet and its corresponding SE-ResNet are trained with identical optimisation schemes. The performance of the different networks on the validation set is shown in Table 2, which shows that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity.


Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%. Fig. 4 depicts the training and validation curves of SE-ResNets and ResNets, respectively. While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with adding more depth to the base architecture.
值得注意的是,SE-ResNet-50实现了单裁剪图像6.62%的top-5验证错误率,超过了ResNet-50(7.48%)0.86%,接近更深的ResNet-101网络(6.52%的top-5错误率),且只有ResNet-101一半的计算开销(3.87 GFLOPs vs. 7.58 GFLOPs)。这种模式在更大的深度上重复,SE-ResNet-101(6.07%的top-5错误率)不仅可以匹配,而且超过了更深的ResNet-152网络(6.34%的top-5错误率)。图4分别描绘了SE-ResNets和ResNets的训练和验证曲线。虽然应该注意SE块本身增加了深度,但是它们的计算效率极高,即使在扩展的基础架构的深度达到收益递减的点上也能产生良好的回报。而且,我们看到通过对各种不同深度的训练,性能改进是一致的,这表明SE块引起的改进可以与增加基础架构更多深度结合使用。

Figure 4

Figure 4. Training curves on ImageNet. (Left): ResNet-50 and SE-ResNet-50; (Right): ResNet-152 and SE-ResNet-152.

Integration with modern architectures. We next investigate the effect of combining SE blocks with another two state-of-the-art architectures, Inception-ResNet-v2 [38] and ResNeXt [43]. The Inception architecture constructs modules of convolutions as multibranch combinations of factorised filters, reflecting the Inception hypothesis [6] that spatial correlations and cross-channel correlations can be mapped independently. In contrast, the ResNeXt architecture asserts that richer representations can be obtained by aggregating combinations of sparsely connected (in the channel dimension) convolutional features. Both approaches introduce prior-structured correlations in modules. We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 (32×4d) is given in Table 1). Like previous experiments, the same optimisation scheme is used for both the original networks and their SENet counterparts.


The results given in Table 2 illustrate the significant performance improvement induced by SE blocks when introduced into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of 5.49% which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost double the number of parameters and computational overhead. As for the experiments of Inception-ResNet-v2, we conjecture the difference of cropping strategy might lead to the gap between their reported result and our re-implemented one, as their original image size has not been clarified in [38] while we crop the 299×299 region from a relative larger image (where the shorter edge is resized to 352). SE-Inception-ResNet-v2 (4.79% top-5 error) outperforms our reimplemented Inception-ResNet-v2 (5.21% top-5 error) by 0.42% (a relative improvement of 8.1%) as well as the reported result in [38]. The optimisation curves for each network are depicted in Fig. 5, illustrating the consistency of the improvement yielded by SE blocks throughout the training process.


Figure 5

Figure 5. Training curves on ImageNet. (Left): ResNeXt-50 and SE-ResNeXt-50; (Right): Inception-ResNet-v2 and SE-Inception-ResNet-v2.
图5。ImageNet的训练曲线。(左): ResNeXt-50和SE-ResNeXt-50;(右):Inception-ResNet-v2和SE-Inception-ResNet-v2。

Finally, we assess the effect of SE blocks when operating on a non-residual network by conducting experiments with the BN-Inception architecture [14] which provides good performance at a lower model complexity. The results of the comparison are shown in Table 2 and the training curves are shown in Fig. 6, exhibiting the same phenomena that emerged in the residual architectures. In particular, SE-BN-Inception achieves a lower top-5 error of 7.14% in comparison to BN-Inception whose error rate is 7.89%. These experiments demonstrate that improvements induced by SE blocks can be used in combination with a wide range of architectures. Moreover, this result holds for both residual and non-residual foundations.

最后,我们通过对BN-Inception架构[14]进行实验来评估SE块在非残差网络上的效果,该架构在较低的模型复杂度下提供了良好的性能。比较结果如表2所示,训练曲线如图6所示,表现出的现象与残差架构中出现的现象一样。尤其是与BN-Inception 7.89%的错误率相比,SE-BN-Inception获得了更低的top-5错误率7.14%。这些实验表明SE块引起的改进可以与多种架构结合使用。而且,这个结果适用于残差和非残差基础。

Figure 6

Figure 6. Training curves of BN-Inception and SE-BN-Inception on ImageNet.

Results on ILSVRC 2017 Classification Competition. ILSVRC [30] is an annual computer vision competition which has proved to be a fertile ground for model developments in image classification. The training and validation data of the ILSVRC 2017 classification task are drawn from the ImageNet 2012 dataset, while the test set consists of an additional unlabelled 100K images. For the purposes of the competition, the top-5 error metric is used to rank entries.

**ILSVRC 2017分类竞赛的结果。**ILSVRC[30]是一个年度计算机视觉竞赛,被证明是图像分类模型发展的沃土。ILSVRC 2017分类任务的训练和验证数据来自ImageNet 2012数据集,而测试集包含额外的未标记的10万张图像。为了竞争的目的,使用top-5错误率度量来对输入条目进行排序。

SENets formed the foundation of our submission to the challenge where we won first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a 2.251% top-5 error on the test set. This result represents a ∼25% relative improvement on the winning entry of 2016 (2.99% top-5 error). One of our high-performing networks is constructed by integrating SE blocks with a modified ResNeXt [43] (details of the modifications are provided in Appendix A). We compare the proposed architecture with the state-of-the-art models on the ImageNet validation set in Table 3. Our model achieves a top-1 error of 18.68% and a top-5 error of 4.47% using a 224×224 centre crop evaluation on each image (where the shorter edge is first resized to 256). To enable a fair comparison with previous models, we also provide a 320×320 centre crop evaluation, obtaining the lowest error rate under both the top-1 (17.28%) and the top-5 (3.79%) error metrics.


Table 3

Table 3. Single-crop error rates of state-of-the-art CNNs on ImageNet validation set. The size of test crop is 224×224224×224 and 320×320320×320/299×299299×299 as in [10]. Our proposed model, SENet, shows a significant performance improvement on prior work.

Scene Classification 场景分类

Large portions of the ImageNet dataset consist of images dominated by single objects. To evaluate our proposed model in more diverse scenarios, we also evaluate it on the Places365-Challenge dataset [48] for scene classification. This dataset comprises 8 million training images and 36, 500 validation images across 365 categories. Relative to classification, the task of scene understanding can provide a better assessment of the ability of a model to generalise well and handle abstraction, since it requires the capture of more complex data associations and robustness to a greater level of appearance variation


We use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the evaluation protocol in [33]. Table 4 shows the results of training a ResNet-152 model and a SE-ResNet-152 for the given task. Specifically, SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets. This SENet surpasses the previous state-of-the-art model Places-365-CNN [33] which has a top-5 error of 11.48% on this task.

我们使用ResNet-152作为强大的基线来评估SE块的有效性,并遵循[33]中的评估协议。表4显示了针对给定任务训练ResNet-152模型和SE-ResNet-152的结果。具体而言,SE-ResNet-152(11.01%的top-5错误率)取得了比ResNet-152(11.61%的top-5错误率)更低的验证错误率,证明了SE块可以在不同的数据集上表现良好。这个SENet超过了先前的最先进的模型Places-365-CNN [33],它在这个任务上有11.48%的top-5错误率。

Table 4

Table 4. Single-crop error rates (%) on the Places365 validation set.

Analysis and Discussion

Reduction ratio. The reduction ratio r introduced in Eqn. (5) is an important hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the model. To investigate this relationship, we conduct experiments based on the SE-ResNet-50 architecture for a range of different r values. The comparison in Table 5 reveals that performance does not improve monotonically with increased capacity. This is likely to be a result of enabling the SE block to overfit the channel interdependencies of the training set. In particular, we found that setting r=16 achieved a good tradeoff between accuracy and complexity and consequently, we used this value for all experiments.

减少比率 公式(5)中引入的减少比率r是一个重要的超参数,它允许我们改变模型中SE块的容量和计算成本。为了研究这种关系,我们基于SE-ResNet-50架构进行了一系列不同r值的实验。表5中的比较表明,性能并没有随着容量的增加而单调上升。这可能是使SE块能够过度拟合训练集通道依赖性的结果。尤其是我们发现设置r=16在精度和复杂度之间取得了很好的平衡,因此我们将这个值用于所有的实验。

Table 5

Table 5. Single-crop error rates (%) on the ImageNet validation set and corresponding model sizes for the SE-ResNet-50 architecture at different reduction ratios rr. Here original refers to ResNet-50.
表5。 ImageNet验证集上单裁剪图像的错误率(%)和SE-ResNet-50架构在不同减少比率rr下的模型大小。这里original指的是ResNet-50。

The role of Excitation. While SE blocks have been empirically shown to improve network performance, we would also like to understand how the self-gating excitation mechanism operates in practice. To provide a clearer picture of the behaviour of SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes at different blocks. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in Fig. 7). We then draw fifty samples for each class from the validation set and compute the average activations for fifty uniformly sampled channels in the last SE block in each stage (immediately prior to downsampling) and plot their distribution in Fig. 8. For reference, we also plot the distribution of average activations across all 1000 classes.


We make the following three observations about the role of Excitation in SENets. First, the distribution across different classes is nearly identical in lower layers, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages of the network. Interestingly however, the second observation is that at greater depth, the value of each channel becomes much more class-specific as different classes exhibit different preferences to the discriminative value of features e.g. SE_4_6 and SE_5_1. The two observations are consistent with findings in previous work [21, 46], namely that lower layer features are typically more general (i.e. class agnostic in the context of classification) while higher layer features have greater specificity. As a result, representation learning benefits from the recalibration induced by SE blocks which adaptively facilitates feature extraction and specialisation to the extent that it is needed. Finally, we observe a somewhat different phenomena in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to 1 and the remainder are close to 0. At the point at which all activations take the value 1, this block would become a standard residual block. At the end of the network in the SE_5_3 (which is immediately followed by global pooling prior before classifiers), a similar pattern emerges over different classes, up to a slight change in scale (which could be tuned by the classifiers). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Sec. 4 which demonstrated that the overall parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance (<0.1% top-1 error).


Figure 7

Figure 7. Example images from the four classes of ImageNet.

Figure 8

Figure 8. Activations induced by Excitation in the different modules of SE-ResNet-50 on ImageNet. The module is named as “SE stageID blockID”.

Conclusion 结论

In this paper we proposed the SE block, a novel architectural unit designed to improve the representational capacity of a network by enabling it to perform dynamic channel-wise feature recalibration. Extensive experiments demonstrate the effectiveness of SENets which achieve state-of-the-art performance on multiple datasets. In addition, they provide some insight into the limitations of previous architectures in modelling channel-wise feature dependencies, which we hope may prove useful for other tasks requiring strong discriminative features. Finally, the feature importance induced by SE blocks may be helpful to related fields such as network pruning for compression.


Acknowledgements. We would like to thank Professor Andrew Zisserman for his helpful comments and Samuel Albanie for his discussions and writing edit for the paper. We would like to thank Chao Li for his contributions in the memory optimisation of the training system. Li Shen is supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

致谢 我们要感谢Andrew Zisserman教授的有益评论,并感谢Samuel Albanie的讨论并校订论文。我们要感谢Chao Li在训练系统内存优化方面的贡献。Li Shen由国家情报总监(ODNI),先期研究计划中心(IARPA)资助,合同号为2014-14071600010。本文包含的观点和结论属于作者的观点和结论,不应理解为ODNI,IARPA或美国政府明示或暗示的官方政策或认可。尽管有任何版权注释,美国政府有权为政府目的复制和分发重印。


[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.

[2] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In NIPS, 2016.

[3] C.Cao, X.Liu, Y.Yang, Y.Yu, J.Wang, Z.Wang, Y.Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.

[4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, 2017.

[5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. arXiv:1707.01629, 2017.

[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.

[7] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In CVPR, 2017.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.

[12] G. Huang, Z. Liu, K. Q. Weinberger, and L. Maaten. Densely connected convolutional networks. In CVPR, 2017.

[13] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.

[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[15] L. Itti and C. Koch. Computational modelling of visual attention. Nature reviews neuroscience, 2001.

[16] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 1998.

[17] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.

[18] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[20] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, 2010.

[21] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[22] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.

[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[24] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv:1706.06905, 2017.

[25] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

[26] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.

[27] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

[28] B. A. Olshausen, C. H. Anderson, and D. C. V. Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 1993.

[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

[31] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the fisher vector: Theory and practice. RR-8209, INRIA, 2013.

[32] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.

[33] L. Shen, Z. Lin, G. Sun, and J. Hu. Places401 and places365 models. https://github.com/lishen-shirley/ Places2-CNNs, 2016.

[34] L. Shen, G. Sun, Q. Huang, S. Wang, Z. Lin, and E. Wu. Multi-level discriminative dictionary learning with application to large scale image classification. IEEE TIP, 2015.

[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[36] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.

[37] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.

[38] C.Szegedy, S.Ioffe, V.Vanhoucke, and A.Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv:1602.07261, 2016.

[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[40] C.Szegedy, V.Vanhoucke, S.Ioffe, J.Shlens, and Z.Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[41] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.

[42] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.

[43] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

[45] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

[46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.

[47] X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In CVPR, 2017.

[48] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.

A. ILSVRC 2017 Classification Competition Entry Details

The SENet in Table 3 is constructed by integrating SE blocks to a modified version of the 64×4d ResNeXt-152 that extends the original ResNeXt-101 [43] by following the block stacking of ResNet-152 [9]. More differences to the design and training (beyond the use of SE blocks) were as follows: (a) The number of first 1×1 convolutional channels for each bottleneck building block was halved to reduce the computation cost of the network with a minimal decrease in performance. (b) The first 7×7 convolutional layer was replaced with three consecutive 3×3 convolutional layers. (c) The down-sampling projection 1×1 with stride-2 convolution was replaced with a 3×3 stride-2 convolution to preserve information. (d) A dropout layer (with a drop ratio of 0.2) was inserted before the classifier layer to prevent overfitting. (e) Label-smoothing regularisation (as introduced in [40]) was used during training. (f) The parameters of all BN layers were frozen for the last few training epochs to ensure consistency between training and testing. (g) Training was performed with 8 servers (64 GPUs) in parallelism to enable a large batch size (2048) and initial learning rate of 1.0.

A. ILSVRC 2017分类竞赛输入细节