Dual Path Networks
Yunpeng Chen1, Jianan Li1,2, Huaxin Xiao1,3, Xiaojie Jin1, Shuicheng Yan4,1, Jiashi Feng1
1National University of Singapore
2Beijing Institute of Technology
3National University of Defense Technology
4Qihoo 360 AI Institute
ABSTRACT
In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new feature exploration, both of which are important for learning good representations. To enjoy the benefits of both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through its dual path architecture. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over the state of the art. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101 (64×4d) with a 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with about 2 times faster training speed. Experiments on the Places365 large-scale scene dataset, the PASCAL VOC detection dataset, and the PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications.
1 INTRODUCTION
“Network engineering” is increasingly important for visual recognition research. In this paper, we aim to develop new path topologies for deep architectures to further push the frontier of representation learning. In particular, we focus on analyzing and reforming the skip connection, which has been widely used in designing modern deep neural networks and has offered remarkable success in many applications [16, 7, 20, 14, 5]. A skip connection creates a path propagating information from a lower layer directly to a higher layer. During forward propagation, a skip connection enables a very top layer to access information from a distant bottom layer; during backward propagation, it facilitates gradient back-propagation to the bottom layer without diminishing magnitude, which effectively alleviates the gradient vanishing problem and eases the optimization.
Deep Residual Network (ResNet) [5] is one of the first works that successfully adopt skip connections, where each micro-block, a.k.a. residual function, is associated with a skip connection, called the residual path. The residual path element-wise adds the input features to the output of the same micro-block, making it a residual unit. Depending on the inner structure design of the micro-block, the residual network has developed into a family of various architectures, including WRN [22], Inception-resnet [20], and ResNeXt [21].
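For illustration, such a residual unit can be sketched in a few lines of PyTorch-style Python; this is a minimal sketch in which the inner micro-block is a toy two-layer convolution, not the bottleneck design used by the networks discussed later:

```python
from torch import nn

class ResidualUnit(nn.Module):
    """Minimal residual unit: output = input + F(input)."""
    def __init__(self, channels):
        super().__init__()
        # F(.) is the micro-block; here a toy 3x3 conv stack for illustration.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # The residual path adds the input element-wise to the micro-block output.
        return x + self.body(x)
```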
More recently, Huang et al. [8] proposed a different network architecture that achieves accuracy comparable to the deep ResNet [5], named the Dense Convolutional Network (DenseNet). Different from residual networks, which add the input features to the output features through the residual path, the DenseNet uses a densely connected path to concatenate the input features with the output features, enabling each micro-block to receive raw information from all previous micro-blocks. Similar to the residual network family, DenseNet can be categorized into the densely connected network family. Although the width of the densely connected path increases linearly as it goes deeper, causing the number of parameters to grow quadratically, DenseNet provides higher parameter efficiency compared with the ResNet [5].
In this work, we aim to study the advantages and limitations of both topologies and further enrich the path design by proposing a dual path architecture. In particular, we first provide a new understanding of the densely connected networks from the lens of a higher order recurrent neural network (HORNN) [19], and explore the relations between densely connected networks and residual networks. More specifically, we bridge the densely connected networks with the HORNNs, showing that the densely connected networks are HORNNs when the weights are shared across steps. Inspired by [12] which demonstrates the relations between the residual networks and RNNs, we prove that the residual networks are densely connected networks when connections are shared across layers. With this unified view on the state-of-the-art deep architecture, we find that the deep residual networks implicitly reuse the features through the residual path, while densely connected networks keep exploring new features through the densely connected path.
Based on this new view, we propose a novel dual path architecture, called the Dual Path Network (DPN). This new architecture inherits both advantages of residual and densely connected paths, enabling effective feature re-usage and re-exploitation. The proposed DPN also enjoys higher parameter efficiency, lower computational cost and lower memory consumption, and is friendlier to optimization compared with the state-of-the-art classification networks. Experimental results validate the high accuracy of DPN compared with other well-established baselines for image classification on both the ImageNet-1k dataset and the Places365-Standard dataset. Additional experiments on the object detection task and the semantic segmentation task also demonstrate that the proposed dual path architecture can be broadly applied to various tasks and consistently achieves the best performance.
2 RELATED WORK
Designing an advanced neural network architecture is one of the most challenging but effective ways to improve image classification performance, and it can also directly benefit a variety of other tasks. AlexNet [10] and VGG [18] are two of the most important works that show the power of deep convolutional neural networks. They demonstrate that building deeper networks with tiny convolutional kernels is a promising way to increase the learning capacity of the neural network. The Residual Network was first proposed by He et al. [5]; it greatly alleviates the optimization difficulty and further pushes the depth of deep neural networks to hundreds of layers by using skip connections. Since then, different kinds of residual networks have arisen, concentrating on either building a more efficient micro-block inner structure [3, 21] or exploring how to use residual connections [9]. Recently, Huang et al. [8] proposed a different network, called the Dense Convolutional Network, where skip connections are used to concatenate the input to the output instead of adding them. However, the width of the densely connected path linearly increases as the depth rises, causing the number of parameters to grow quadratically and costing a large amount of GPU memory compared with residual networks if the implementation is not specifically optimized. This limits building deeper and wider DenseNets that might further improve accuracy.
Besides designing new architectures, researchers also try to re-explore existing state-of-the-art architectures. In [6], the authors showed the importance of the residual path in alleviating the optimization difficulty. In [12], residual networks are bridged with recurrent neural networks (RNNs), which helps people better understand the deep residual network from the perspective of RNNs. In [3], several different residual functions are unified, aiming at a better understanding of how to design micro structures with higher learning capacity. For the densely connected networks, however, apart from a few intuitive explanations about better feature re-usage and efficient gradient flow, few works have been able to provide a deeper understanding.
In this work, we aim to provide a deeper understanding of the densely connected network from the lens of Higher Order RNNs, and explain why residual networks are indeed a special case of densely connected networks. Based on this analysis, we then propose a novel Dual Path Network architecture that not only achieves higher accuracy, but also enjoys high parameter and computational efficiency.
3 REVISITING RESNET, DENSENET AND HIGHER ORDER RNN
In this section, we first bridge the densely connected network [8] with higher order recurrent neural networks [19] to provide a new understanding of the densely connected network. We prove that residual networks [5, 6, 22, 21, 3] essentially belong to the family of densely connected networks, except that their connections are shared across steps. Then, we analyze the strengths and weaknesses of each path topology, which motivates us to develop the dual path network architecture.
For exploring the above relation, we provide a new view on the densely connected networks from the lens of Higher Order RNNs, explain their relations and then specialize the analysis to residual networks. Throughout the paper, we formulate the HORNN in a more generalized form. We use $ h^t $ to denote the hidden state of the recurrent neural network at the $ t $-th step and use $ k $ as the index of the current step. Let $ x^t $ denote the input at the $ t $-th step, with $ h^0=x^0 $. For each step, $ f_t^k(\cdot) $ refers to the feature extracting function which takes the hidden state as input and outputs the extracted information, and $ g^k(\cdot) $ denotes a transformation function that transforms the gathered information to the current hidden state:
$$ h^k = g^k\left[\sum\limits_{t=0}^{k-1} f_t^k(h^{t})\right] \tag{1}$$
Eqn. (1) encapsulates the update rule of various network architectures in a generalized way. For HORNNs, weights are shared across steps, i.e.
$$ \forall t,k,\; f_{k-t}^k(\cdot) \equiv f_{t}(\cdot) \quad \text{and} \quad \forall k,\; g^k(\cdot) \equiv g(\cdot) $$
For the densely connected networks, each step (micro-block) has its own parameters, which means $ f_t^k(\cdot) $ and $ g^k(\cdot) $ are not shared. Such observation shows that the densely connected path of DenseNet is essentially a higher order path which is able to extract new information from previous states. Figure 1(c)(d) graphically shows the relations between densely connected networks and higher order recurrent networks.
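To make the bookkeeping of Eqn. (1) concrete, the following minimal Python sketch implements the generalized update rule with toy scalar "features" and hypothetical callables `f` and `g`; returning shared functions across steps yields HORNN-like behaviour, while instantiating a fresh function per $(t, k)$ pair corresponds to a densely connected network:

```python
def generalized_update(x0, f, g, num_steps):
    """Illustrative sketch of Eqn. (1): h^k = g^k( sum_t f_t^k(h^t) ).

    f(t, k) returns the feature-extraction function applied to h^t at step k;
    g(k)   returns the transformation function at step k.
    For a HORNN, f(t, k) and g(k) return shared functions; for a densely
    connected network, each (t, k) pair gets its own function.
    """
    h = [x0]  # h^0 = x^0
    for k in range(1, num_steps + 1):
        gathered = sum(f(t, k)(h[t]) for t in range(k))
        h.append(g(k)(gathered))
    return h

# Toy usage with scalar "features" (purely illustrative, not a real network):
shared_f = lambda t, k: (lambda v: 0.5 * v)   # shared across steps -> HORNN-like
shared_g = lambda k: (lambda v: v)            # identity transformation
states = generalized_update(1.0, shared_f, shared_g, num_steps=4)
```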
Figure 1: The topological relations of different types of neural networks. (a) and (b) show relations between residual networks and RNN, as stated in [12]; (c) and (d) show relations between densely connected networks and higher order recurrent neural network (HORNN), which is explained in this paper. The symbol “z−1” denotes a time-delay unit; “⊕” denotes the element-wise summation; “I(⋅)” denotes an identity mapping function.
We then explain that the residual networks are special cases of densely connected networks if taking $ \forall t,k, f_t^k(\cdot) \equiv f_t(\cdot) $. Here, for succinctness we introduce $ r^{k} $ to denote the intermediate results and let $ r^{0}=0 $. Then Eqn. (1) can be rewritten as
$$ r^{k} \triangleq \sum\limits_{t=1}^{k-1} f_t(h^t) = r^{k-1} + f_{k-1}(h^{k-1}) \text{,} \tag{2}$$
$$ h^k = g^k\left( r^{k} \right) \text{.} \tag{3}$$
Thus, by substituting Eqn. (3) into Eqn. (2), Eqn. (2) can be simplified as
$$ r^{k} = r^{k-1} + f_{k-1}(h^{k-1}) = r^{k-1} + f_{k-1}(g^{k-1} \left( r^{k-1} \right)) = r^{k-1} + \phi^{k-1}(r^{k-1}) \tag{4}$$
where $ \phi^{k}(\cdot)=f_{k}(g^{k}(\cdot)) $. Obviously, Eqn. (4) has the same form as the residual network and the recurrent neural network. Specifically, when $ \forall k, \phi^{k}(\cdot) \equiv\phi(\cdot) $, Eqn. (4) degenerates to an RNN; when none of the $ \phi^{k}(\cdot) $ is shared and $ x^k=0, k>1 $, Eqn. (4) produces a residual network. Figure 1(a)(b) graphically shows the relation. Besides, recall that Eqn. (4) is derived under the condition $ \forall t,k, f_t^k(\cdot) \equiv f_t(\cdot) $ from Eqn. (1), and the densely connected networks are in the form of Eqn. (1), meaning that the residual network family essentially belongs to the densely connected network family. Figure 2(a–c) gives an example and demonstrates such equivalence, where $ f_t(\cdot) $ corresponds to the first 1×1 convolutional layer and $ g^k(\cdot) $ corresponds to the other layers within a micro-block in Figure 2(b).
From the above analysis, we observe: 1) both residual networks and densely connected networks can be seen as a HORNN when $ f_t^k(\cdot) $ and $ g^k(\cdot) $ are shared for all $ k $; 2) a residual network is a densely connected network if $ \forall t,k,\ f_t^k(\cdot)\equiv f_t(\cdot) $. By sharing the $ f_t^k(\cdot) $ across all steps, $ g^k(\cdot) $ receives the same feature from a given output state, which encourages feature re-usage and thus reduces feature redundancy. However, such an information sharing strategy makes it difficult for residual networks to explore new features. Comparatively, the densely connected networks are able to explore new information from previous outputs since the $ f_t^k(\cdot) $ is not shared across steps. However, different $ f_t^k(\cdot) $ may extract the same type of features multiple times, leading to high redundancy.
In the following section, we present the dual path networks which can overcome both inherent limitations of these two state-of-the-art network architectures. Their relations with HORNNs also imply that our proposed architecture can be used for improving HORNNs, which we leave for future work.
4 DUAL PATH NETWORKS
Above we explained the relations between residual networks and densely connected networks, showing that the residual path implicitly reuses features but is not good at exploring new features. In contrast, the densely connected network keeps exploring new features but suffers from higher redundancy.
In this section, we describe the details of our proposed novel dual path architecture, i.e. the Dual Path Network (DPN). In the following, we first introduce and formulate the dual path architecture, and then present the network structure in detail with a complexity analysis.
4.1 DUAL PATH ARCHITECTURE
Sec. 3 discusses the advantages and limitations of both residual networks and densely connected networks. Based on the analysis, we propose a simple dual path architecture which shares the $ f_t^k(\cdot) $ across all blocks to enjoy the benefits of reusing common features with low redundancy, while still retaining a densely connected path that gives the network more flexibility in learning new features. We formulate such a dual path architecture as follows:
$$x^{k} \triangleq \sum\limits_{t=1}^{k-1} f_t^{k}(h^t) \text{,} \tag{5}$$
$$y^{k} \triangleq \sum\limits_{t=1}^{k-1} v_t(h^t) = y^{k-1} + \phi^{k-1}(y^{k-1}) \text{,} \tag{6}$$
$$r^{k} \triangleq x^{k} + y^{k} \text{,} \tag{7}$$
$$h^k = g^k \left( r^{k} \right) \text{.} \tag{8}$$
where $ x^{k} $ and $ y^{k} $ denote the extracted information at the $ k $-th step from the individual paths, and $ v_t(\cdot) $ is a feature learning function, like $ f_t^k(\cdot) $. Eqn. (5) refers to the densely connected path that enables exploring new features, Eqn. (6) refers to the residual path that enables common feature re-usage, and Eqn. (7) defines the dual path that integrates them and feeds them to the last transformation function in Eqn. (8). The final transformation function $ g^k(\cdot) $ generates the current state, which is used for making the next mapping or prediction. Figure 2(d)(e) shows an example of the dual path architecture that is used in our experiments.
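The sketch below spells out one update of Eqns. (5)–(8) in plain Python; the helper functions `f_k`, `v` and `g_k` are hypothetical placeholders for the feature-learning and transformation functions, not an actual DPN implementation:

```python
def dual_path_step(h_prev_list, y_prev, f_k, v, g_k):
    """One dual-path update, mirroring Eqns. (5)-(8).

    h_prev_list : [h^1, ..., h^{k-1}], the previous states
    y_prev      : y^{k-1}, the accumulated residual-path state
    f_k         : list of functions f_t^k(.) for the densely connected path
    v           : shared feature-learning function v(.) for the residual path
    g_k         : transformation function g^k(.)
    """
    x_k = sum(f(h) for f, h in zip(f_k, h_prev_list))   # Eqn. (5): explore new features
    y_k = y_prev + v(h_prev_list[-1])                    # Eqn. (6): reuse common features
    r_k = x_k + y_k                                      # Eqn. (7): merge the two paths
    h_k = g_k(r_k)                                       # Eqn. (8): produce the new state
    return h_k, y_k
```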
Figure 2: Architecture comparison of different networks. (a) The residual network. (b) The densely connected network, where each layer can access the outputs of all previous micro-blocks. Here, a 1×1 convolutional layer (underlined) is added for consistency with the micro-block design in (a). (c) By sharing the first 1×1 connection of the same output across micro-blocks in (b), the densely connected network degenerates to a residual network. The dotted rectangle in (c) highlights the residual unit. (d) The proposed dual path architecture, DPN. (e) An equivalent form of (d) from the perspective of implementation, where the symbol “≀” denotes a split operation, and “+” denotes element-wise addition.
More generally, the proposed DPN is a family of convolutional neural networks which contain a residual-like path and a densely connected-like path, as explained later. Similar to these networks, one can customize the micro-block function of DPN for task-specific usage or for further overall performance boosting.
4.2 DUAL PATH NETWORKS
The proposed network is built by stacking multiple modularized micro-blocks as shown in Figure 2. In this work, the structure of each micro-block is designed with a bottleneck style [5] which starts with a 1×1 convolutional layer followed by a 3×3 convolutional layer, and ends with a 1×1 convolutional layer. The output of the last 1×1 convolutional layer is split into two parts: the first part is element-wise added to the residual path, and the second part is concatenated with the densely connected path. To enhance the learning capacity of each micro-block, we use a grouped convolution layer as the second layer, as in ResNeXt [21].
Considering that residual networks are more widely used than densely connected networks in practice, we choose the residual network as the backbone and add a thin densely connected path to build the dual path network. Such a design also helps slow the width increment of the densely connected path and the cost of GPU memory. Table 1 shows the detailed architecture settings. In the table, G refers to the number of groups, and k refers to the channel increment of the densely connected path. For the newly proposed DPNs, we use (+k) to indicate the width increment of the densely connected path. The overall design of DPN inherits the backbone architecture of the vanilla ResNet / ResNeXt, making it very easy to implement and apply to other tasks. One can simply implement a DPN by adding one more “slice layer” and “concat layer” on top of an existing residual network. Under a well optimized deep learning platform, none of these newly added operations requires extra computational cost or extra memory consumption, making the DPNs highly efficient.
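To make the “slice layer” and “concat layer” description concrete, below is a simplified PyTorch-style sketch of one DPN micro-block. It is an illustrative re-implementation, not the authors' MXNet code: the channel sizes are assumed, and the strided, projected shortcut used by the first block of each stage is omitted.

```python
import torch
from torch import nn

class DualPathBlock(nn.Module):
    """Simplified DPN micro-block sketch (assumed channel sizes, no stride/projection).

    in_ch    : channels of the residual part of the input
    dense_ch : current width of the densely connected part of the input
    mid_ch   : bottleneck width (1x1 -> grouped 3x3 -> 1x1)
    inc_k    : width increment (+k) appended to the dense path
    groups   : cardinality of the grouped 3x3 convolution (as in ResNeXt)
    """
    def __init__(self, in_ch, dense_ch, mid_ch, inc_k, groups):
        super().__init__()
        self.in_ch = in_ch
        self.body = nn.Sequential(
            nn.Conv2d(in_ch + dense_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            # Last 1x1 produces in_ch channels for the residual path
            # plus inc_k channels for the densely connected path.
            nn.Conv2d(mid_ch, in_ch + inc_k, 1, bias=False),
            nn.BatchNorm2d(in_ch + inc_k),
        )

    def forward(self, residual, dense):
        out = self.body(torch.cat([residual, dense], dim=1))
        res_part, dense_part = torch.split(
            out, [self.in_ch, out.size(1) - self.in_ch], dim=1)  # "slice layer"
        residual = residual + res_part                 # element-wise add (residual path)
        dense = torch.cat([dense, dense_part], dim=1)  # "concat layer" (dense path)
        return residual, dense

# Illustrative usage with assumed sizes:
# block = DualPathBlock(in_ch=256, dense_ch=16, mid_ch=96, inc_k=16, groups=32)
# r, d = block(torch.randn(1, 256, 56, 56), torch.randn(1, 16, 56, 56))
```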
In order to demonstrate the appealing effectiveness of the dual path architecture, we intentionally design a set of DPNs with a considerably smaller model size and fewer FLOPs compared with the state-of-the-art ResNeXts [21], as shown in Table 1. Due to limited computational resources, we set these hyper-parameters based on our previous experience instead of grid search experiments.
Model complexity
We measure the model complexity by counting the total number of learnable parameters within each neural network. Table 1 shows the results for different models. The DPN-92 has about 15% fewer parameters than ResNeXt-101 (32×4d), while the DPN-98 has about 26% fewer parameters than ResNeXt-101 (64×4d).
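For reference, counting learnable parameters is a one-liner in most frameworks; a generic PyTorch sketch (not tied to the DPN code) is:

```python
def count_parameters(model):
    """Total number of learnable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```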
Computational complexity
We measure the computational cost of each deep neural network using the floating-point operations (FLOPs) with an input size of 224×224, counted as the number of multiply-adds following [21]. Table 1 shows the theoretical computational cost. Though the actual time cost might be influenced by other factors, e.g. GPU bandwidth and coding quality, the computational cost indicates a speed upper bound. As can be seen from the results, DPN-92 consumes about 19% fewer FLOPs than ResNeXt-101 (32×4d), and DPN-98 consumes about 25% fewer FLOPs than ResNeXt-101 (64×4d).
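As a rough guide, the multiply-add count of a single convolutional layer can be estimated with the simplified formula below (ignoring biases and batch normalization, and counting one multiply-add as one operation); summing this over all layers gives the kind of theoretical cost reported above:

```python
def conv_mult_adds(c_in, c_out, k, h_out, w_out, groups=1):
    """Approximate multiply-adds of a 2-D convolution with a k x k kernel."""
    return (c_in // groups) * c_out * k * k * h_out * w_out
```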
Table 1: Architecture and complexity comparison of our proposed Dual Path Networks (DPNs) and other state-of-the-art networks. We compare DPNs with two baseline methods: DenseNet [8] and ResNeXt [21]. The symbol (+k) denotes the width increment on the densely connected path.
5 EXPERIMENTS
Extensive experiments are conducted for evaluating the proposed Dual Path Networks. Specifically, we evaluate the proposed architecture on three tasks: image classification, object detection and semantic segmentation, using three standard benchmark datasets: the ImageNet-1k dataset, Places365-Standard dataset and the PASCAL VOC datasets.
Key properties of the proposed DPNs are studied on the ImageNet-1k object classification dataset [17] and further verified on the Places365-Standard scene understanding dataset [24]. To verify whether the proposed DPNs can benefit other tasks besides image classification, we further conduct experiments on the PASCAL VOC dataset [4] to evaluate its performance in object detection and semantic segmentation.
5.1 EXPERIMENTS ON IMAGE CLASSIFICATION TASK
We implement the DPNs using MXNet [2] on a cluster with 40 K80 graphics cards. Following [3], we adopt standard data augmentation methods and train the networks using SGD with a mini-batch size of 32 for each GPU. For the deepest network, i.e. DPN-131 (which has 128 channels at conv1, 4 blocks at conv2, 8 blocks at conv3, 28 blocks at conv4 and 3 blocks at conv5, with #params = $79.5\times10^6$ and FLOPs = $16.0\times10^9$), the mini-batch size is limited to 24 because of the 12GB GPU memory constraint. The learning rate starts from $\sqrt{0.1}$ for DPN-92 and DPN-131, and from 0.4 for DPN-98. It drops in a “steps” manner by a factor of 0.1. Following [5], batch normalization layers are refined after training.
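A hedged PyTorch-style sketch of this optimization setup is given below purely for illustration (the paper's models are trained with MXNet); the momentum, weight decay and decay milestones are assumed values, since only the initial learning rates, batch sizes and the “steps” decay factor are stated above:

```python
import torch
from torch import nn, optim

# Hypothetical values: momentum, weight decay and the decay epochs below are
# illustrative assumptions, not the exact settings reported in the paper.
model = nn.Conv2d(3, 64, 7)                      # stand-in for a DPN backbone
optimizer = optim.SGD(model.parameters(),
                      lr=0.1 ** 0.5,             # sqrt(0.1) for DPN-92 / DPN-131
                      momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                           milestones=[30, 60, 90],  # assumed epochs
                                           gamma=0.1)                # "steps" decay
```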
ImageNet-1k dataset
Firstly, we compare the image classification performance of DPNs with current state-of-the-art models. As can be seen from the first block in Table 2, a shallow DPN with a depth of only 92 reduces the top-1 error rate by an absolute value of 0.5% compared with ResNeXt-101 (32×4d) and by an absolute value of 1.5% compared with DenseNet-161, while requiring considerably fewer FLOPs. In the second block of Table 2, a deeper DPN (DPN-98) surpasses the best residual network – ResNeXt-101 (64×4d) – and still enjoys 25% fewer FLOPs and a much smaller model size (236 MB vs. 320 MB). In order to further push the state-of-the-art accuracy, we slightly increase the depth of the DPN to 131 (DPN-131). The results are shown in the last block of Table 2. Again, the DPN shows superior accuracy over the best single model – Very Deep PolyNet [23] – with a much smaller model size (304 MB vs. 365 MB). Note that the Very Deep PolyNet adopts numerous tricks, e.g. initialization by insertion, residual scaling and stochastic paths, to assist the training process. In contrast, our proposed DPN-131 is simple and does not involve these tricks; DPN-131 can be trained using the same standard training strategy as the shallower DPNs. More importantly, the actual training speed of DPN-131 is about 2 times faster than the Very Deep PolyNet, as discussed in the following paragraph.
Method | Model Size | GFLOPs | top-1 | top-5
---|---|---|---|---
DenseNet-161 (k=48) [8] | 111 MB | 7.7 | 22.2 | –
ResNet-101* [5] | 170 MB | 7.8 | 22.0 | 6.0
ResNeXt-101 (32×4d) [21] | 170 MB | 8.0 | 21.2 | 5.6
DPN-92 (32×3d) | 145 MB | 6.5 | 20.7 | 5.4
ResNet-200 [6] | 247 MB | 15.0 | 21.7 | 5.8
Inception-resnet-v2 [20] | 227 MB | – | – | –
ResNeXt-101 (64×4d) [21] | 320 MB | 15.5 | 20.4 | 5.3
DPN-98 (40×4d) | 236 MB | 11.7 | 20.2 | 5.2
Very deep Inception-resnet-v2 [23] | 531 MB | – | – | –
Very Deep PolyNet [23] | 365 MB | – | – | –
DPN-131 (40×4d) | 304 MB | 16.0 | 19.93 | 5.12
DPN-131 (40×4d) † | 304 MB | 16.0 | 19.93 | 5.12
Table 2: Comparison with state-of-the-art CNNs on ImageNet-1k dataset. Single crop validation error rate (%) on validation set. *: Performance reported by [21], †: With Mean-Max Pooling (see Appendix A).
Method | Model Size | top-1 acc. | top-5 acc. |
---|---|---|---|
AlexNet [24] | 223 MB | 53.17 | 82.89 |
GoogLeNet [24] | 44 MB | 53.63 | 83.88 |
VGG-16 [24] | 518 MB | 55.24 | 84.91 |
ResNet-152 [24] | 226 MB | 54.74 | 85.08 |
ResNeXt-101 [3] | 165 MB | 56.21 | 86.25 |
CRU-Net-116 [3] | 163 MB | 56.60 | 86.55 |
DPN-92 (32×3d) | 138 MB | 56.84 | 86.69 |
Table 3: Comparison with state-of-the-art CNNs on Places365-Standard dataset. 10 crops validation accuracy rate (%) on validation set.
Figure 3: Comparison of total actual cost between different models during training. Evaluations are conducted on a single node with 4 K80 graphics cards with all training samples cached into memory. (For the comparison of training speed, we push the mini-batch size to its maximum value given 12GB of GPU memory to test the fastest possible training speed of each model.)

Secondly, we compare the training cost of the best performing models. Here, we focus on evaluating two key properties – the actual GPU memory cost and the actual training speed. Figure 3 shows the results. As can be seen from Figure 3(a)(b), the DPN-98 is 15% faster and uses 9% less memory than the best performing ResNeXt, with a considerably lower testing error rate. Note that theoretically the computational cost of DPN-98 shown in Table 2 is 25% less than that of the best performing ResNeXt, indicating there is still room for code optimization. Figure 3(c) presents the same result in a clearer way. The deeper DPN-131 only costs about 19% more training time than the best performing ResNeXt, but achieves state-of-the-art single model performance. The training speed of the previous state-of-the-art single model, i.e. Very Deep PolyNet (537 layers) [23], is about 31 samples per second based on our implementation using MXNet, showing that DPN-131 runs about 2 times faster than the Very Deep PolyNet during training.
Places365-Standard dataset
In this experiment, we further evaluate the accuracy of the proposed DPN on the scene classification task using the Places365-Standard dataset. The Places365-Standard dataset is a high-resolution scene understanding dataset with more than 1.8 million images of 365 scene categories. Different from object images, scene images do not have very clear discriminative patterns and require a higher level context reasoning ability.
Table 3 shows the results of different models on this dataset. To make a fair comparison, we evaluate DPN-92 on this dataset instead of using deeper DPNs. As can be seen from the results, DPN achieves the best validation accuracy compared with other methods. The DPN-92 requires far fewer parameters (138 MB vs. 163 MB), which again demonstrates its high parameter efficiency and high generalization ability.
5.2 EXPERIMENTS ON THE OBJECT DETECTION TASK
Method | mAP | areo | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbk | prsn | plant | sheep | sofa | train | tv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DenseNet-161 (k=48) | 79.9 | 80.4 | 85.9 | 81.2 | 72.8 | 68.0 | 87.1 | 88.0 | 88.8 | 64.0 | 83.3 | 75.4 | 87.5 | 87.6 | 81.3 | 84.2 | 54.6 | 83.2 | 80.2 | 87.4 | 77.2 |
ResNet-101 [16] | 76.4 | 79.8 | 80.7 | 76.2 | 68.3 | 55.9 | 85.1 | 85.3 | 89.8 | 56.7 | 87.8 | 69.4 | 88.3 | 88.9 | 80.9 | 78.4 | 41.7 | 78.6 | 79.8 | 85.3 | 72.0 |
ResNeXt-101 (32×4d) | 80.1 | 80.2 | 86.5 | 79.4 | 72.5 | 67.3 | 86.9 | 88.6 | 88.9 | 64.9 | 85.0 | 76.2 | 87.3 | 87.8 | 81.8 | 84.1 | 55.5 | 84.0 | 79.7 | 87.9 | 77.0 |
DPN-92 (32×3d) | 82.5 | 84.4 | 88.5 | 84.6 | 76.5 | 70.7 | 87.9 | 88.8 | 89.4 | 69.7 | 87.0 | 76.7 | 89.5 | 88.7 | 86.0 | 86.1 | 58.4 | 85.0 | 80.4 | 88.2 | 83.1 |
Table 4: Object detection results on PASCAL VOC 2007 test set. The performance is measured by mean of Average Precision (mAP, in %).
We further evaluate the proposed Dual Path Network on the object detection task. Experiments are performed on the PASCAL VOC 2007 dataset [4]. We train the models on the union set of VOC 2007 trainval and VOC 2012 trainval following [16], and evaluate them on the VOC 2007 test set. We use the standard evaluation metrics, Average Precision (AP) and mean of AP (mAP), following the PASCAL challenge protocols.
We perform all experiments based on the ResNet-based Faster R-CNN framework, following [5], and make comparisons by replacing the residual network while keeping other parts unchanged. Since our goal is to evaluate DPN, rather than further pushing the state-of-the-art accuracy on this dataset, we adopt the shallowest DPN-92 and baseline networks at roughly the same complexity level. Table 4 provides the detection performance comparisons of the proposed DPN with several current state-of-the-art models. It can be observed that the DPN obtains an mAP of 82.5%, a large improvement of 6.1% over ResNet-101 [16] and 2.4% over ResNeXt-101 (32×4d). The better results shown in this experiment demonstrate that the Dual Path Network is also capable of learning better feature representations for detecting objects and benefits the object detection task.
5.3 EXPERIMENTS ON THE SEMANTIC SEGMENTATION TASK
Method | mIoU | bkg | areo | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbk | prsn | plant | sheep | sofa | train | tv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DenseNet-161 (k=48) | 68.7 | 92.1 | 77.3 | 37.1 | 83.6 | 54.9 | 70.0 | 85.8 | 82.5 | 85.9 | 26.1 | 73.0 | 55.1 | 80.2 | 74.0 | 79.1 | 78.2 | 51.5 | 80.0 | 42.2 | 75.1 | 58.6 |
ResNet-101 | 73.1 | 93.1 | 86.9 | 39.9 | 87.6 | 59.6 | 74.4 | 90.1 | 84.7 | 87.7 | 30.0 | 81.8 | 56.2 | 82.7 | 82.7 | 80.1 | 81.1 | 52.4 | 86.2 | 52.5 | 81.3 | 63.6 |
ResNeXt-101 (32×4d) | 73.6 | 93.1 | 84.9 | 36.2 | 80.3 | 65.0 | 74.7 | 90.6 | 83.9 | 88.7 | 31.1 | 86.3 | 62.4 | 84.7 | 86.1 | 81.2 | 80.1 | 54.0 | 87.4 | 54.0 | 76.3 | 64.2 |
DPN-92 (32×3d) | 74.8 | 93.7 | 88.3 | 40.3 | 82.7 | 64.5 | 72.0 | 90.9 | 85.0 | 88.8 | 31.1 | 87.7 | 59.8 | 83.9 | 86.8 | 85.1 | 82.8 | 60.8 | 85.3 | 54.1 | 82.6 | 64.6 |
Table 5: Semantic segmentation results on PASCAL VOC 2012 test set. The performance is measured by mean Intersection over Union (mIoU, in %).
In this experiment, we evaluate the performance of the Dual Path Network for dense prediction, i.e. semantic segmentation, where the training target is to predict the semantic label of each pixel in the input image. We conduct experiments on the PASCAL VOC 2012 segmentation benchmark dataset [4] and use the DeepLab-ASPP-L [1] as the segmentation framework. For each compared method in Table 5, we replace the 3×3 convolutional layers in conv4 and conv5 of Table 1 with atrous convolution [1] and plug in a head of Atrous Spatial Pyramid Pooling (ASPP) [1] on the final feature maps of conv5. We adopt the same training strategy for all networks following [1] for a fair comparison.
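For reference, swapping a standard 3×3 convolution for an atrous (dilated) one preserves the spatial resolution of the feature map while enlarging the receptive field; the sketch below uses an assumed dilation rate of 2 and is not the exact DeepLab-ASPP-L configuration:

```python
from torch import nn

# Standard 3x3 convolution (stride 1, "same" padding).
standard = nn.Conv2d(512, 512, kernel_size=3, padding=1)

# Atrous 3x3 convolution with dilation rate 2: same output size, larger receptive field.
atrous = nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)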
Table 5 shows the results of different convolutional neural networks. It can be observed that the proposed DPN-92 has the highest overall mIoU accuracy. Compared with ResNet-101, which has a larger model size and higher computational cost, the proposed DPN-92 further improves the IoU for most categories and improves the overall mIoU by an absolute value of 1.7%. Considering that ResNeXt-101 (32×4d) only improves the overall mIoU by an absolute value of 0.5% compared with ResNet-101, the proposed DPN-92 gains more than 3 times the improvement of ResNeXt-101 (32×4d). The better results once again demonstrate that the proposed Dual Path Network is capable of learning better feature representations for dense prediction.
6 CONCLUSION
In this paper, we revisited the densely connected networks, bridged the densely connected networks with Higher Order RNNs and proved that residual networks are essentially densely connected networks with shared connections. Based on this new explanation, we proposed a dual path architecture that enjoys benefits from both sides. The novel network, DPN, is then developed based on this dual path architecture. Experiments on the image classification task demonstrate that the DPN enjoys high accuracy, small model size, low computational cost and low GPU memory consumption, and thus is extremely useful not only for research but also for real-world applications. Experiments on the object detection task and semantic segmentation task show that the proposed DPN can also benefit other tasks by simply replacing the base network.
REFERENCES
- Chen et al. [2016] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
- Chen et al. [2015] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
- Chen et al. [2017] Yunpeng Chen, Xiaojie Jin, Bingyi Kang, Jiashi Feng, and Shuicheng Yan. Sharing residual units through collective tensor factorization in deep neural networks. arXiv preprint arXiv:1703.02180, 2017.
- Everingham et al. [2014] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2014.
- He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016a.
- He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b.
- He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
- Huang et al. [2016] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
- Kim et al. [2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Lee et al. [2016] Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pages 464–472, 2016.
- Liao and Poggio [2016] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
- Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- Newell et al. [2016] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
- Pleiss et al. [2017] Geoff Pleiss, Danlu Chen, Gao Huang, Tongcheng Li, Laurens van der Maaten, and Kilian Q Weinberger. Memory-efficient implementation of densenets. arXiv preprint arXiv:1707.06990, 2017.
- Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Soltani and Jiang [2016] Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064, 2016.
- Szegedy et al. [2016] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
- Xie et al. [2016] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
- Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zhang et al. [2016] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. Polynet: A pursuit of structural diversity in very deep networks. arXiv preprint arXiv:1611.05725, 2016.
- Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055, 2016.
APPENDIX A TESTING WITH MEAN-MAX POOLING
Here, we introduce a new testing technique using Mean-Max Pooling, which can further improve the performance of a well trained CNN in the testing phase without any noticeable computational overhead. This testing technique is very effective for testing images with sizes larger than the training crops. The idea is to first convert a trained CNN model into a fully convolutional network [13] and then insert the following Mean-Max Pooling layer (a.k.a. Max-Avg Pooling [11]), i.e. 0.5 * (global average pooling + global max pooling), just before the final softmax layer.
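A minimal sketch of this testing head is given below, assuming the backbone has already been converted into a fully convolutional network and its classifier turned into a 1×1 convolution; the layer and variable names are illustrative:

```python
from torch import nn

class MeanMaxPoolHead(nn.Module):
    """Sketch of the testing head: a 1x1 convolutional classifier (the converted
    fully-connected layer) followed by 0.5 * (global average + global max) pooling."""
    def __init__(self, channels, num_classes):
        super().__init__()
        # Stand-in for the trained classifier converted into a 1x1 convolution [13].
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat):                  # feat: (N, C, H, W) backbone feature map
        scores = self.classifier(feat)        # per-location class scores (N, K, H, W)
        avg = scores.mean(dim=(2, 3))         # global average pooling -> (N, K)
        mx = scores.amax(dim=(2, 3))          # global max pooling     -> (N, K)
        return 0.5 * (avg + mx)               # Mean-Max Pooling, fed to the final softmax
```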
Comparisons between the models with and without Mean-Max Pooling are shown in Table 6. As can be seen from the results, the simple Mean-Max Pooling testing strategy successfully improves the testing accuracy for all models.
Method | Model Size | GFLOPs | w/o Mean-Max Pooling | w/ Mean-Max Pooling |
---|---|---|---|---|
top-1 | top-5 | top-1 | top-5 | |
DPN-92 (32×3d) | 145 MB | 6.5 | 19.34 | 4.66 |
DPN-98 (40×4d) | 236 MB | 11.7 | 18.94 | 4.44 |
DPN-131 (40×4d) | 304 MB | 16.0 | 18.62 | 4.23 |
Table 6: Comparison with different testing techniques on ImageNet-1k dataset. Single crop validation error rate (%) on validation set.