[Paper translation] Xception: Deep Learning with Depthwise Separable Convolutions


Download PDF: https://arxiv.org/pdf/1610.02357v3.pdf

Download code: https://keras.io/applications/#xception

Xception: Deep Learning with Depthwise Separable Convolutions

Abstract

We present an interpretation of Inception modules in convolutional neural networks as being an intermediate step in-between regular convolution and the depthwise separable convolution operation (a depthwise convolution followed by a pointwise convolution). In this light, a depthwise separable convolution can be understood as an Inception module with a maximally large number of towers. This observation leads us to propose a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depthwise separable convolutions. We show that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet dataset (which Inception V3 was designed for), and significantly outperforms Inception V3 on a larger image classification dataset comprising 350 million images and 17,000 classes. Since the Xception architecture has the same number of parameters as Inception V3, the performance gains are not due to increased capacity but rather to a more efficient use of model parameters.

Introduction

Convolutional neural networks have emerged as the master algorithm in computer vision in recent years, and developing recipes for designing them has been a subject of considerable attention. The history of convolutional neural network design started with LeNet-style models [10], which were simple stacks of convolutions for feature extraction and max-pooling operations for spatial sub-sampling. In 2012, these ideas were refined into the AlexNet architecture [9], where convolution operations were being repeated multiple times in-between max-pooling operations, allowing the network to learn richer features at every spatial scale. What followed was a trend to make this style of network increasingly deeper, mostly driven by the yearly ILSVRC competition; first with Zeiler and Fergus in 2013 [25] and then with the VGG architecture in 2014 [18].

At this point a new style of network emerged, the Inception architecture, introduced by Szegedy et al. in 2014 [20] as GoogLeNet (Inception V1), later refined as Inception V2 [7], Inception V3 [21], and most recently Inception-ResNet [19]. Inception itself was inspired by the earlier Network-In-Network architecture [11]. Since its first introduction, Inception has been one of the best performing families of models on the ImageNet dataset [14], as well as internal datasets in use at Google, in particular JFT [5].

The fundamental building block of Inception-style models is the Inception module, of which several different versions exist. In figure 1 we show the canonical form of an Inception module, as found in the Inception V3 architecture. An Inception model can be understood as a stack of such modules. This is a departure from earlier VGG-style networks which were stacks of simple convolution layers.

While Inception modules are conceptually similar to convolutions (they are convolutional feature extractors), they empirically appear to be capable of learning richer representations with fewer parameters. How do they work, and how do they differ from regular convolutions? What design strategies come after Inception?

The Inception hypothesis

A convolution layer attempts to learn filters in a 3D space, with 2 spatial dimensions (width and height) and a channel dimension; thus a single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations.

The idea behind the Inception module is to make this process easier and more efficient by explicitly factoring it into a series of operations that would independently look at cross-channel correlations and at spatial correlations. More precisely, the typical Inception module first looks at cross-channel correlations via a set of 1x1 convolutions, mapping the input data into 3 or 4 separate spaces that are smaller than the original input space, and then maps all correlations in these smaller 3D spaces, via regular 3x3 or 5x5 convolutions. This is illustrated in figure 1. In effect, the fundamental hypothesis behind Inception is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly. (A variant of the process is to independently look at width-wise correlations and height-wise correlations. This is implemented by some of the modules found in Inception V3, which alternate 7x1 and 1x7 convolutions. The use of such spatially separable convolutions has a long history in image processing and has been used in some convolutional neural network implementations since at least 2012, possibly earlier.)

Consider a simplified version of an Inception module that only uses one size of convolution (e.g. 3x3) and does not include an average pooling tower (figure 2). This Inception module can be reformulated as a large 1x1 convolution followed by spatial convolutions that would operate on non-overlapping segments of the output channels (figure 3). This observation naturally raises the question: what is the effect of the number of segments in the partition (and their size)? Would it be reasonable to make a much stronger hypothesis than the Inception hypothesis, and assume that cross-channel correlations and spatial correlations can be mapped completely separately?

Figure 1: A canonical Inception module (Inception V3).

Figure 2: A simplified Inception module.

Figure 3: A strictly equivalent reformulation of the simplified Inception module.

Figure 4: An “extreme” version of our Inception module, with one spatial convolution per output channel of the 1x1 convolution.

The continuum between convolutions and separable convolutions

An “extreme” version of an Inception module, based on this stronger hypothesis, would first use a 1x1 convolution to map cross-channel correlations, and would then separately map the spatial correlations of every output channel. This is shown in figure 4. We remark that this extreme form of an Inception module is almost identical to a depthwise separable convolution, an operation that has been used in neural network design as early as 2014 [15] and has become more popular since its inclusion in the TensorFlow framework [1] in 2016.

A depthwise separable convolution, commonly called “separable convolution” in deep learning frameworks such as TensorFlow and Keras, consists in a depthwise convolution, i.e. a spatial convolution performed independently over each channel of an input, followed by a pointwise convolution, i.e. a 1x1 convolution, projecting the channels output by the depthwise convolution onto a new channel space. This is not to be confused with a spatially separable convolution, which is also commonly called “separable convolution” in the image processing community.
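
The definition above can be illustrated with a minimal pure-Python sketch. It is not the TensorFlow or Keras implementation: the function names, the nested-list tensor layout ([channel][row][column]), the stride of 1 and the absence of padding are all assumptions made for illustration, and, as in deep learning frameworks, the "convolution" is computed without flipping the kernel.

```python
def depthwise_conv(x, kernels):
    """Spatial convolution applied independently to each channel.

    x:       input as nested lists, indexed [channel][row][col]
    kernels: one k x k kernel per input channel, indexed [channel][row][col]
    Uses stride 1 and no padding (illustrative assumptions).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        out.append([
            [
                sum(channel[i + di][j + dj] * kernel[di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(w_out)
            ]
            for i in range(h_out)
        ])
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: mixes channels at every spatial position.

    weights is indexed [out_channel][in_channel].
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


def separable_conv(x, depthwise_kernels, pointwise_weights):
    """Depthwise convolution followed by pointwise convolution."""
    return pointwise_conv(depthwise_conv(x, depthwise_kernels), pointwise_weights)


# Two 3x3 input channels, one all-ones 3x3 kernel per channel, three output channels.
x = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
result = separable_conv(x, [ones, ones], [[1, 1], [1, -1], [0, 2]])
print(result)
```

Note that the depthwise step never mixes channels and the pointwise step never looks at neighbouring pixels, which is exactly the decoupling of spatial and cross-channel correlations discussed in the text.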

Two minor differences between an “extreme” version of an Inception module and a depthwise separable convolution would be:

  • The order of the operations: depthwise separable convolutions as usually implemented (e.g. in TensorFlow) perform first channel-wise spatial convolution and then perform 1x1 convolution, whereas Inception performs the 1x1 convolution first.
  • The presence or absence of a non-linearity after the first operation. In Inception, both operations are followed by a ReLU non-linearity, however depthwise separable convolutions are usually implemented without non-linearities.

We argue that the first difference is unimportant, in particular because these operations are meant to be used in a stacked setting. The second difference might matter, and we investigate it in the experimental section (in particular see figure 10).

We also note that other intermediate formulations of Inception modules that lie in between regular Inception modules and depthwise separable convolutions are also possible: in effect, there is a discrete spectrum between regular convolutions and depthwise separable convolutions, parametrized by the number of independent channel-space segments used for performing spatial convolutions. A regular convolution (preceded by a 1x1 convolution), at one extreme of this spectrum, corresponds to the single-segment case; a depthwise separable convolution corresponds to the other extreme where there is one segment per channel; Inception modules lie in between, dividing a few hundreds of channels into 3 or 4 segments. The properties of such intermediate modules appear not to have been explored yet.
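
To make the spectrum concrete, the following sketch counts the weights of the spatial convolution as a function of the number of independent channel-space segments. Modelling the segments as equally sized groups, ignoring biases, and the channel count of 256 are assumptions chosen for illustration; the function name is hypothetical.

```python
def spatial_conv_params(channels_in, channels_out, kernel_size, segments):
    """Weights in a k x k spatial convolution split into independent segments.

    Each segment maps channels_in / segments input channels to
    channels_out / segments output channels. Biases are ignored.
    """
    if channels_in % segments or channels_out % segments:
        raise ValueError("segments must divide both channel counts")
    per_segment = (kernel_size * kernel_size
                   * (channels_in // segments) * (channels_out // segments))
    return segments * per_segment


# 1 segment: regular convolution; 4 segments: Inception-like;
# 256 segments: one per channel, i.e. a depthwise convolution.
for segments in (1, 4, 256):
    print(segments, spatial_conv_params(256, 256, 3, segments))
```

Under these assumptions the weight count of the spatial step shrinks in proportion to the number of segments, which is why a depthwise separable stack can spend its parameter budget on more channels or more layers.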

Having made these observations, we suggest that it may be possible to improve upon the Inception family of architectures by replacing Inception modules with depthwise separable convolutions, i.e. by building models that would be stacks of depthwise separable convolutions. This is made practical by the efficient depthwise convolution implementation available in TensorFlow. In what follows, we present a convolutional neural network architecture based on this idea, with a similar number of parameters as Inception V3, and we evaluate its performance against Inception V3 on two large-scale image classification tasks.

Prior work

The present work relies heavily on prior efforts in the following areas:

  • Convolutional neural networks [10, 9, 25], in particular the VGG-16 architecture [18], which is schematically similar to our proposed architecture in a few respects.
  • The Inception architecture family of convolutional neural networks [20, 7, 21, 19], which first demonstrated the advantages of factoring convolutions into multiple branches operating successively on channels and then on space.
  • Depthwise separable convolutions, which our proposed architecture is entirely based upon. While the use of spatially separable convolutions in neural networks has a long history, going back to at least 2012 [12] (but likely even earlier), the depthwise version is more recent. Laurent Sifre developed depthwise separable convolutions during an internship at Google Brain in 2013, and used them in AlexNet to obtain small gains in accuracy and large gains in convergence speed, as well as a significant reduction in model size. An overview of his work was first made public in a presentation at ICLR 2014 [23]. Detailed experimental results are reported in Sifre’s thesis, section 6.2 [15]. This initial work on depthwise separable convolutions was inspired by prior research from Sifre and Mallat on transformation-invariant scattering [16, 15]. Later, a depthwise separable convolution was used as the first layer of Inception V1 and Inception V2 [20, 7]. Within Google, Andrew Howard [6] has introduced efficient mobile models called MobileNets using depthwise separable convolutions. Jin et al. in 2014 [8] and Wang et al. in 2016 [24] also did related work aiming at reducing the size and computational cost of convolutional neural networks using separable convolutions. Additionally, our work is only possible due to the inclusion of an efficient implementation of depthwise separable convolutions in the TensorFlow framework [1].
  • Residual connections, introduced by He et al. in [4], which our proposed architecture uses extensively.

The Xception architecture

We propose a convolutional neural network architecture based entirely on depthwise separable convolution layers. In effect, we make the following hypothesis: that the mapping of cross-channel correlations and spatial correlations in the feature maps of convolutional neural networks can be entirely decoupled. Because this hypothesis is a stronger version of the hypothesis underlying the Inception architecture, we name our proposed architecture Xception, which stands for “Extreme Inception”.

A complete description of the specifications of the network is given in figure 5. The Xception architecture has 36 convolutional layers forming the feature extraction base of the network. In our experimental evaluation we will exclusively investigate image classification and therefore our convolutional base will be followed by a logistic regression layer. Optionally one may insert fully-connected layers before the logistic regression layer, which is explored in the experimental evaluation section (in particular, see figures 7 and 8). The 36 convolutional layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules.

In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections. This makes the architecture very easy to define and modify; it takes only 30 to 40 lines of code using a high-level library such as Keras [2] or TensorFlow-Slim [17], not unlike an architecture such as VGG-16 [18], but rather unlike architectures such as Inception V2 or V3 which are far more complex to define. An open-source implementation of Xception using Keras and TensorFlow is provided as part of the Keras Applications module (https://keras.io/applications/#xception), under the MIT license.
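
As a sanity check on the counts quoted above (36 convolutional layers in 14 modules, with residual connections around all but the first and last), here is a small bookkeeping sketch. The per-module breakdown into entry, middle and exit flows is an assumption based on a reading of figure 5, which is not reproduced in the text, and the module names are hypothetical; shortcut convolutions on the residual paths are not counted.

```python
# Each entry: (module name, number of conv / separable conv layers, has residual).
# The per-module layout is an assumed reading of figure 5, not stated in the text.
ENTRY_FLOW = [
    ("entry_1", 2, False),  # two regular convolutions, no residual connection
    ("entry_2", 2, True),
    ("entry_3", 2, True),
    ("entry_4", 2, True),
]
MIDDLE_FLOW = [(f"middle_{i}", 3, True) for i in range(1, 9)]  # repeated eight times
EXIT_FLOW = [
    ("exit_1", 2, True),
    ("exit_2", 2, False),  # last module, no residual connection
]

modules = ENTRY_FLOW + MIDDLE_FLOW + EXIT_FLOW
total_layers = sum(layers for _, layers, _ in modules)
residual_modules = sum(1 for _, _, has_residual in modules if has_residual)

print(len(modules), "modules,", total_layers, "convolutional layers,",
      residual_modules, "modules with residual connections")
```

The regularity of this layout (a short list of nearly identical modules) is what makes the architecture short to define in a high-level library.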

Figure 5: The Xception architecture: the data first goes through the entry flow, then through the middle flow which is repeated eight times, and finally through the exit flow. Note that all Convolution and SeparableConvolution layers are followed by batch normalization [7] (not included in the diagram). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion).

Experimental evaluation

We choose to compare Xception to the Inception V3 architecture, due to their similarity of scale: Xception and Inception V3 have nearly the same number of parameters (table 3), and thus any performance gap could not be attributed to a difference in network capacity. We conduct our comparison on two image classification tasks: one is the well-known 1000-class single-label classification task on the ImageNet dataset [14], and the other is a 17,000-class multi-label classification task on the large-scale JFT dataset.

The JFT dataset

JFT is an internal Google dataset for large-scale image classification, first introduced by Hinton et al. in [5], which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes. To evaluate the performance of a model trained on JFT, we use an auxiliary dataset, FastEval14k.

FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average). On this dataset we evaluate performance using Mean Average Precision for top 100 predictions (MAP@100), and we weight the contribution of each class to MAP@100 with a score estimating how common (and therefore important) the class is among social media images. This evaluation procedure is meant to capture performance on frequently occurring labels from social media, which is crucial for production models at Google.

Optimization configuration

A different optimization configuration was used for ImageNet and JFT:

  • On ImageNet:

    • Optimizer: SGD
    • Momentum: 0.9
    • Initial learning rate: 0.045
    • Learning rate decay: decay of rate 0.94 every 2 epochs
  • On JFT:

    • Optimizer: RMSprop [22]
    • Momentum: 0.9
    • Initial learning rate: 0.001
    • Learning rate decay: decay of rate 0.9 every 3,000,000 samples

For both datasets, the exact same optimization configuration was used for both Xception and Inception V3. Note that this configuration was tuned for best performance with Inception V3; we did not attempt to tune optimization hyperparameters for Xception. Since the networks have different training profiles (figure 6), this may be suboptimal, especially on the ImageNet dataset, on which the optimization configuration used had been carefully tuned for Inception V3.

Additionally, all models were evaluated using Polyak averaging [13] at inference time.
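
The two learning rate schedules listed above can be sketched as follows. This is an illustration, not the authors' training code: it assumes a "staircase" decay (the rate drops in discrete steps rather than continuously), which the text does not specify, and the function names are hypothetical.

```python
def imagenet_learning_rate(epoch, initial=0.045, decay=0.94, epochs_per_decay=2):
    """Learning rate at a given epoch: multiplied by 0.94 every 2 epochs.

    Staircase decay is an assumption; the text only gives the rate and period.
    """
    return initial * decay ** (epoch // epochs_per_decay)


def jft_learning_rate(samples_seen, initial=0.001, decay=0.9,
                      samples_per_decay=3_000_000):
    """Learning rate after a number of samples: multiplied by 0.9 every 3,000,000."""
    return initial * decay ** (samples_seen // samples_per_decay)


for epoch in (0, 2, 10, 50):
    print(epoch, imagenet_learning_rate(epoch))
```

Expressing the JFT schedule in samples rather than epochs is natural for a dataset of over 350 million images, where a single epoch is very long.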

Figure 6: Training profile on ImageNet

Regularization configuration

  • Weight decay: The Inception V3 model uses a weight decay (L2 regularization) rate of 4e−5, which has been carefully tuned for performance on ImageNet. We found this rate to be quite suboptimal for Xception and instead settled for 1e−5. We did not perform an extensive search for the optimal weight decay rate. The same weight decay rates were used both for the ImageNet experiments and the JFT experiments.
  • Dropout: For the ImageNet experiments, both models include a dropout layer of rate 0.5 before the logistic regression layer. For the JFT experiments, no dropout was included due to the large size of the dataset which made overfitting unlikely in any reasonable amount of time.
  • Auxiliary loss tower: The Inception V3 architecture may optionally include an auxiliary tower which backpropagates the classification loss earlier in the network, serving as an additional regularization mechanism. For simplicity, we choose not to include this auxiliary tower in any of our models.

Training infrastructure

All networks were implemented using the TensorFlow framework [1] and trained on 60 NVIDIA K80 GPUs each. For the ImageNet experiments, we used data parallelism with synchronous gradient descent to achieve the best classification performance, while for JFT we used asynchronous gradient descent so as to speed up training. The ImageNet experiments took approximately 3 days each, while the JFT experiments took over one month each. The JFT models were not trained to full convergence, which would have taken over three months per experiment.

Comparison with Inception V3

Classification performance

All evaluations were run with a single crop of the input images and a single model. ImageNet results are reported on the validation set rather than the test set (i.e. on the non-blacklisted images from the validation set of ILSVRC 2012). JFT results are reported after 30 million iterations (one month of training) rather than after full convergence. Results are provided in table 1 and table 2, as well as figure 6, figure 7 and figure 8. On JFT, we tested both versions of our networks that did not include any fully-connected layers, and versions that included two fully-connected layers of 4096 units each before the logistic regression layer.

On ImageNet, Xception shows marginally better results than Inception V3. On JFT, Xception shows a 4.3% relative improvement on the FastEval14k MAP@100 metric. We also note that Xception outperforms ImageNet results reported by He et al. for ResNet-50, ResNet-101 and ResNet-152 [4].

               Top-1 accuracy   Top-5 accuracy
VGG-16         0.715            0.901
ResNet-152     0.770            0.933
Inception V3   0.782            0.941
Xception       0.790            0.945

Table 1: Classification performance comparison on ImageNet (single crop, single model). VGG-16 and ResNet-152 numbers are only included as a reminder. The version of Inception V3 being benchmarked does not include the auxiliary tower.

                              FastEval14k MAP@100
Inception V3 - no FC layers   6.36
Xception - no FC layers       6.70
Inception V3 with FC layers   6.50
Xception with FC layers       6.78

Table 2: Classification performance comparison on JFT (single crop, single model).
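
The relative improvements implied by Table 2 can be checked with a few lines of arithmetic. The helper name is hypothetical, and defining "relative improvement" as (new − baseline) / baseline is an assumption about how the quoted figure was computed.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# FastEval14k MAP@100 values taken from Table 2.
no_fc = relative_improvement(6.70, 6.36)
with_fc = relative_improvement(6.78, 6.50)

print(f"no FC layers:   {no_fc:.1%}")
print(f"with FC layers: {with_fc:.1%}")
```

Under this definition, the 4.3% figure quoted in the text matches the comparison with fully-connected layers; the comparison without them comes out at about 5.3%.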

Figure 7: Training profile on JFT, without fully-connected layers

Figure 8: Training profile on JFT, with fully-connected layers

The Xception architecture shows a much larger performance improvement on the JFT dataset compared to the ImageNet dataset. We believe this may be due to the fact that Inception V3 was developed with a focus on ImageNet and may thus be by design over-fit to this specific task. On the other hand, neither architecture was tuned for JFT. It is likely that a search for better hyperparameters for Xception on ImageNet (in particular optimization parameters and regularization parameters) would yield significant additional improvement.

Size and speed

In table 3 we compare the size and speed of Inception V3 and Xception. Parameter count is reported on ImageNet (1000 classes, no fully-connected layers) and the number of training steps (gradient updates) per second is reported on ImageNet with 60 K80 GPUs running synchronous gradient descent. Both architectures have approximately the same size (within 3.5%), and Xception is marginally slower. We expect that engineering optimizations at the level of the depthwise convolution operations can make Xception faster than Inception V3 in the near future. The fact that both architectures have almost the same number of parameters indicates that the improvement seen on ImageNet and JFT does not come from added capacity but rather from a more efficient use of the model parameters.

               Parameter count   Steps/second
Inception V3   23,626,728        31
Xception       22,855,952        28

Table 3: Size and training speed comparison.
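
The "within 3.5%" size claim and the speed gap can be reproduced from the numbers in Table 3. The variable names are illustrative, and taking Inception V3 as the reference for both ratios is an assumption.

```python
# Values taken from Table 3.
inception_v3_params = 23_626_728
xception_params = 22_855_952
inception_v3_steps_per_second = 31
xception_steps_per_second = 28

# Fraction by which Xception is smaller, relative to Inception V3.
size_gap = (inception_v3_params - xception_params) / inception_v3_params
# Xception training throughput as a fraction of Inception V3 throughput.
speed_ratio = xception_steps_per_second / inception_v3_steps_per_second

print(f"Xception has {size_gap:.2%} fewer parameters")
print(f"Xception runs at {speed_ratio:.1%} of Inception V3 training speed")
```

With these numbers Xception is slightly smaller than Inception V3 and trains at roughly nine tenths of its speed.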

Effect of the residual connections

To quantify the benefits of residual connections in the Xception architecture, we benchmarked on ImageNet a modified version of Xception that does not include any residual connections. Results are shown in figure 9. Residual connections are clearly essential in helping with convergence, both in terms of speed and final classification performance. However, we note that benchmarking the non-residual model with the same optimization configuration as the residual model may be uncharitable and that better optimization configurations might yield more competitive results.

Additionally, let us note that this result merely shows the importance of residual connections for this specific architecture, and that residual connections are in no way required in order to build models that are stacks of depthwise separable convolutions. We also obtained excellent results with non-residual VGG-style models where all convolution layers were replaced with depthwise separable convolutions (with a depth multiplier of 1), superior to Inception V3 on JFT at equal parameter count.

Figure 9: Training profile with and without residual connections.

Effect of an intermediate activation after pointwise convolutions 点卷积后中间激活的影响

We mentioned earlier that the analogy between depthwise separable convolutions and Inception modules suggests that depthwise separable convolutions should potentially include a non-linearity between the depthwise and pointwise operations. In the experiments reported so far, no such non-linearity was included. However, we also experimentally tested the inclusion of either ReLU or ELU [3] as intermediate non-linearity. Results are reported on ImageNet in Figure 10, and show that the absence of any non-linearity leads to both faster convergence and better final performance. This is a remarkable observation, since Szegedy et al. report the opposite result in [21] for Inception modules. It may be that the depth of the intermediate feature spaces on which spatial convolutions are applied is critical to the usefulness of the non-linearity: for deep feature spaces (e.g. those found in Inception modules) the non-linearity is helpful, but for shallow ones (e.g. the 1-channel deep feature spaces of depthwise separable convolutions) it becomes harmful, possibly due to a loss of information.

前面我们提到过,深度可分离卷积和 Inception 模块之间的类比表明,深度可分离卷积可能应包括深度和点运算之间的非线性。在迄今为止报道的实验中,没有包括这样的非线性。但是,我们还通过实验测试了 ReLU 或 ELU [ 3 ]的包含作为中间非线性。结果在 ImageNet 上的图 10 中进行了报告,结果表明,缺少任何非线性都会导致更快的收敛速度和更好的最终性能。

这是一个值得注意的发现,因为 Szegedy 等人在 [21] 中针对 Inception 模块报告了相反的结果。可能的原因是,应用空间卷积的中间特征空间的深度对于非线性是否有用至关重要:对于较深的特征空间(例如 Inception 模块中的特征空间),非线性是有帮助的;但对于较浅的特征空间(例如深度可分离卷积中深度为 1 个通道的特征空间),非线性反而有害,这可能是由于信息丢失所致。
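The three variants compared in this experiment differ only in what happens between the depthwise and the pointwise step. The NumPy sketch below shows where that optional non-linearity sits. It is a toy forward pass under stated assumptions (valid padding, stride 1, depth multiplier 1, no biases, random weights), not the training setup used in the paper, and all function names are hypothetical.

```python
import numpy as np


def depthwise_conv(x, dw_kernels):
    """Depthwise convolution with valid padding and stride 1.

    x: (H, W, C) input; dw_kernels: (k, k, C), one spatial filter per channel.
    Each channel is filtered independently, so the spatial filters operate
    on 1-channel deep feature spaces.
    """
    h, w, c = x.shape
    k = dw_kernels.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, c))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = x[i:i + k, j:j + k, :]  # (k, k, C)
            out[i, j, :] = np.sum(patch * dw_kernels, axis=(0, 1))
    return out


def separable_conv(x, dw_kernels, pw_kernel, intermediate=None):
    """Depthwise step, optional intermediate activation, then pointwise step."""
    y = depthwise_conv(x, dw_kernels)
    if intermediate is not None:
        y = intermediate(y)  # the non-linearity under test
    return y @ pw_kernel     # 1x1 convolution: (H', W', C) @ (C, C_out)


def relu(z):
    return np.maximum(z, 0.0)


def elu(z):
    # expm1 is only evaluated on the non-positive part to avoid overflow.
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))
dw = rng.standard_normal((3, 3, 4))
pw = rng.standard_normal((4, 8))

linear = separable_conv(x, dw, pw)                        # setting used in the main experiments
with_relu = separable_conv(x, dw, pw, intermediate=relu)
with_elu = separable_conv(x, dw, pw, intermediate=elu)
print(linear.shape, with_relu.shape, with_elu.shape)
```

With ReLU, every negative depthwise response is set to zero before the pointwise step can combine channels, which is one way to picture the "loss of information" hypothesis. The sketch does not test that hypothesis.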

Figure 10: Training profile with different activations between the depthwise and pointwise operations of the separable convolution layers.

Future directions 未来发展方向

We noted earlier the existence of a discrete spectrum between regular convolutions and depthwise separable convolutions, parametrized by the number of independent channel-space segments used for performing spatial convolutions. Inception modules are one point on this spectrum. We showed in our empirical evaluation that the extreme formulation of an Inception module, the depthwise separable convolution, may have advantages over a regular Inception module. However, there is no reason to believe that depthwise separable convolutions are optimal. It may be that intermediate points on the spectrum, lying between regular Inception modules and depthwise separable convolutions, hold further advantages. This question is left for future investigation.

前面我们注意到,在常规卷积和深度可分离卷积之间存在一个离散谱,其参数是用于执行空间卷积的独立通道空间段的数量。Inception 模块是这个谱上的一个点。我们在实验评估中表明,Inception 模块的极端形式(即深度可分离卷积)可能比常规 Inception 模块更具优势。但是,没有理由认为深度可分离卷积就是最优的。位于常规 Inception 模块和深度可分离卷积之间的中间点可能具有更多优势。这个问题留待将来研究。
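One way to picture this spectrum is to count the weights of the spatial convolution step as the number of segments grows. The sketch below assumes the spatial step can be modelled as a grouped convolution in which the channels are split into equally sized, non-overlapping segments with as many output as input channels. That modelling choice, the function name, and the 64-channel example are assumptions for illustration, not definitions from the paper.

```python
def grouped_spatial_params(kernel, channels, segments):
    """Weights in the spatial step when channels are split into equal segments.

    Each segment gets its own kernel x kernel convolution that maps its
    share of the channels onto itself (biases ignored).
    """
    if channels % segments != 0:
        raise ValueError("channels must be divisible by segments")
    per_segment = channels // segments
    return segments * kernel * kernel * per_segment * per_segment


channels = 64  # hypothetical width
for segments in (1, 2, 4, 8, 16, 32, 64):
    count = grouped_spatial_params(3, channels, segments)
    print(f"{segments:>2} segment(s): {count:>6,} spatial weights")
```

Under this model, one segment corresponds to a regular convolution and one segment per channel corresponds to the depthwise step of a depthwise separable convolution. The intermediate rows are the points that the text leaves for future investigation.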

Conclusions 结论

We showed how convolutions and depthwise separable convolutions lie at the two extremes of a discrete spectrum, with Inception modules being an intermediate point in between. This observation has led us to propose replacing Inception modules with depthwise separable convolutions in neural computer vision architectures. We presented a novel architecture based on this idea, named Xception, which has a parameter count similar to that of Inception V3. Compared to Inception V3, Xception shows small gains in classification performance on the ImageNet dataset and large gains on the JFT dataset. We expect depthwise separable convolutions to become a cornerstone of convolutional neural network architecture design in the future, since they offer properties similar to those of Inception modules, yet are as easy to use as regular convolution layers.

我们展示了卷积和深度可分离卷积如何位于离散频谱的两个极端,而 Inception 模块是介于两者之间的中间点。这种观察导致我们提出在神经计算机视觉体系结构中用深度可分离卷积替换 Inception 模块。我们基于此思想提出了一种新颖的架构,名为 Xception,它的参数计数与 Inception V3 相似。与 Inception V3 相比,Xception 在 ImageNet 数据集上的分类性能提高很小,而在 JFT 数据集上的分类性能提高了很多。我们期望深度可分离卷积在将来成为卷积神经网络体系结构设计的基石,因为它们提供与 Inception 模块相似的属性,但与常规卷积层一样容易使用。

REFERENCES