[论文翻译]DenseNet:密集连接的卷积网络


下载PDF:https://arxiv.org/pdf/1608.06993v5.pdf

下载代码:https://github.com/liuzhuang13/densenet

Densely Connected Convolutional Networks 密集连接卷积网络

Abstract

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with $ L $ layers have $ L $ connections (one between each layer and its subsequent layer), our network has $ \frac{L(L+1)}{2} $ direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.


摘要

最近的工作表明,如果它们在接近输入的层和接近输出的层之间包含较短的连接,则卷积网络可以的深度可以显著增加,准确度更高,并且更易于训练。在本文中,我们采纳这一观点,并提出了密集连接卷积网络(DenseNet),它以前馈的方式将每个层连接到其他层。而传统的卷积网络 $ L $ 层网络具有 $ L $层连接 (每个层和其后续层之间)---我们的网络有$ \frac{L(L+1)}{2} $直接连接。对于每层,所有前面图层的特征映射用作输入,并且其自己的特征映射用作所有后续层的输入。。DenseNet 有几个引人注目的优势:它们缓解了消失的渐变问题,加强了特征传播,鼓励功能重用,并大大减少参数的数量。我们在四个竞争激烈的物体识别基准任务(CIFAR-10,CiFar-100,SVHN 和 Imagenet)上评估我们所提出的架构。DenseNet 在大多数 SOTA 情况下获得显着改进,同时需要较少的计算来实现高性能。代码和预先训练的模型参见 https://github.com/liuzhuang13/densenet

INTRODUCTION 简介

Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [28], and only last year Highway Networks [33] and Residual Networks (ResNets) [11] have surpassed the 100-layer barrier.

各种卷积神经网络(CNN)已经成为视觉对象识别的主要机器学习方法。虽然它们最早是在 20 多年前被提出的[18],但是直到近年来,随着计算机硬件和网络结构的改进,才使得训练真正深层的 CNN 成为可能。最早的 LeNet5[19]由 5 层组成,VGG 有 19 层 [28],去年的 Highway Network[33]和残差网络(ResNet) [11]已经超过了 100 层。

As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and “wash out” by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [33] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different number of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.

随着 CNN 越来越深,出现了一个新的研究问题:梯度弥散。许多最近的研究致力于解决这个问题或相关的问题。ResNet [11]和 Highway Network [33]通过恒等连接将信号从一个层传递到另一层。Stochastic depth[13] 通过在训练期间随机丢弃层来缩短 ResNets,以获得更好的信息和梯度流。FractalNet [17] 重复地将几个并行层序列与不同数量的约束块组合,以获得大的标称深度,同时在网络中保持许多短路径。虽然这些不同的方法在网络拓扑和训练过程中有所不同,但它们都具有一个关键特性:它们创建从靠近输入的层与靠近输出的层的短路径。

In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. Figure 1 illustrates this layout schematically. Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the $ \ell^{th} $ layer has $ \ell $ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all $ L-\ell $ subsequent layers. This introduces $ \frac{L(L+1)}{2} $ connections in an $ L $-layer network, instead of just $ L $, as in traditional architectures. Because of its dense connectivity pattern, we refer to our approach as Dense Convolutional Network (DenseNet).
在本文中,我们提出了一种将这种关键特性简化为简单连接模式的架构:为了确保网络中各层之间的最大信息流,我们将所有层(匹配的特征图大小)直接连接在一起。为了保持前馈性质,每个层从所有先前层中获得额外的输入,并将其自身的特征图传递给所有后续层。如下图(图 1)所示:(5 层的密集连接块,增长率 k=4k=4,每一层都把先前层的输出作为输入)

在本文中,我们提出了一种架构,该架构将这一见解提炼为简单的连接模式:为了确保网络中各层之间的最大信息流,我们将所有层(具有匹配的特征图大小)直接相互连接。为了保留前馈特性,每一层都从所有先前的层中获取其他输入,并将其自身的特征图传递给所有后续层。图 1 所示。至关重要的是,与 ResNets 相比,在将特征传递到图层之前,我们绝不会通过求和来组合特征。相反,我们通过级联特征来组合它们。因此,ℓ 层有 ℓ 输入,由所有先前的卷积块的特征图组成。它自己的特征图会传递给后续所有 L-ℓ 层。 在一个连接 L 层网络中有 \ frac {l(l + 1)} {2} 个连接,而不仅仅是传统架构的 L 层连接。由于其密集的连通性模式,我们将我们的方法称为密集卷积网络(DenseNet)
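The connection count quoted above can be checked by brute force. The following is a minimal sketch, not part of the paper: the function name `count_direct_connections` is hypothetical, and it assumes the block input $ x_0 $ counts as node 0, so that an $ L $-layer block has $ L+1 $ nodes and every earlier node feeds every later one.

```python
from itertools import combinations


def count_direct_connections(num_layers: int) -> int:
    """Count direct connections in a densely connected block.

    Node 0 is the block input x_0; nodes 1..num_layers are the layers.
    Every node feeds every later node, so each pair of distinct nodes
    contributes exactly one connection.
    """
    nodes = range(num_layers + 1)
    return sum(1 for _ in combinations(nodes, 2))


# The enumeration agrees with the closed form L(L+1)/2.
for L in (1, 5, 40):
    assert count_direct_connections(L) == L * (L + 1) // 2

print(count_direct_connections(5))  # 15
```

For the 5-layer block of Figure 1 this gives 15 connections, compared with 5 in a plain chain of the same depth.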


Figure 1: A 5-layer dense block with a growth rate of k=4. Each layer takes all preceding feature-maps as input.

A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to re-learn redundant feature maps. Traditional feed-forward architectures can be viewed as algorithms with a state, which is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations. Recent variations of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks [21], but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 feature-maps per layer), adding only a small set of feature-maps to the “collective knowledge” of the network and keep the remaining feature-maps unchanged—and the final classifier makes a decision based on all feature-maps in the network.
这种密集连接模式的可能的反直觉效应是,它比传统卷积网络需要的参数少,因为不需要重新学习冗余特征图。传统的前馈架构可以被看作是具有状态的算法,它是从一层传递到另一层的。每个层从上一层读取状态并写入后续层。它改变状态,但也传递需要保留的信息。 ResNet[13] 通过加性恒等转换使此信息保持明确。 ResNet 的最新变化表明,许多层次贡献很小,实际上可以在训练过程中随机丢弃。这使得 ResNet 的状态类似于(展开的)循环神经网络(recurrent neural network)[21],但是 ResNet 的参数数量太大,因为每个层都有自己的权重。我们提出的 DenseNet 架构明确区分添加到网络的信息和保留的信息。 DenseNet 层非常窄(例如,每层 12 个卷积核),仅将一小组特征图添加到网络的“集体知识”,并保持剩余的特征图不变。最终分类器基于网络中的所有特征图。

Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision [20]. This helps training of deeper network architectures. Further, we also observe that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
除了更好的参数效率,DenseNet 的一大优点是它改善了整个网络中信息流和梯度流,从而使其易于训练。每个层都可以直接从损失函数和原始输入信号中获取梯度,从而进行了深入的监督[20]。这有助于训练更深层的网络架构。此外,我们还观察到密集连接具有正则化效应,这减少了具有较小训练集大小的任务的过拟合。

We evaluate DenseNets on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Our models tend to require much fewer parameters than existing algorithms with comparable accuracy. Further, we significantly outperform the current state-of-the-art results on most of the benchmark tasks.

我们在四个高度竞争的基准数据集(CIFAR-10,CIFAR-100,SVHN 和 ImageNet)上评估 DenseNet。在准确度相近的情况下,我们的模型往往需要比现有算法少得多的参数。此外,我们的模型在大多数测试任务中,准确度明显优于其它最先进的方法。

Figure 2: A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature map sizes via convolution and pooling.

The exploration of network architectures has been a part of neural network research since their initial discovery. The recent resurgence in popularity of neural networks has also revived this research domain. The increasing number of layers in modern networks amplifies the differences between architectures and motivates the exploration of different connectivity patterns and the revisiting of old research ideas. A structure similar to our proposed dense network layout was already studied in the neural networks literature in the 1980s [3]. This pioneering work focuses on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks to be trained with batch gradient descent were proposed [39]. Although effective on small datasets, this approach only scales to networks with a few hundred parameters. In [9, 23, 30, 40], utilizing multi-level features in CNNs through skip-connections has been found to be effective for various vision tasks. Parallel to our work, [1] derived a purely theoretical framework for networks with cross-layer connections similar to ours.

Highway Networks [33] were amongst the first architectures that provided a means to effectively train end-to-end networks with more than 100 layers. Using bypassing paths along with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypassing paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets [11], in which pure identity mappings are used as bypassing paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks, such as ImageNet and COCO object detection. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper was partly inspired by that observation. ResNets with pre-activation [12] also facilitate the training of state-of-the-art networks with $ > $ 1000 layers.
一直以来,网络架构的探索是神经网络研究的一部分。最近的神经网络普及又开始了这一研究领域的探索。现代网络中的层数越来越多,放大了架构之间的差异,并激励了不同连通模式的探索和旧研究思想的重新探讨。在 20 世纪 80 年代的神经网络文献中已经研究了类似于我们所提出的密集网络布局[ 3 ]。他们的开创性工作侧重于完全连接的多层的 Perceptrons,以层层的方式训练。最近,[ 39 ]提出了用批量梯度下降训练的完全连接的级联网络。虽然对小型数据集有效,但这种方法只能缩放到具有几百个参数的网络。在[ 9233040 ]已经发现通过 Skip-Connections 利用 CNN 中的多级功能,对各种视觉任务有效。与我们的工作同期,[ 1 ]推导了具有与我们相似的跨层连接的网络的纯理论框架。

Highway Networks 是最早提供可有效训练 100 层以上的端到端网络的方法的体系结构之一。使用旁路与门控单元,可以无困难地优化具有数百层深度的 Highway Network。旁路是使这些非常深的网络训练变得简单的关键因素。通过 RESNets 进一步支持该观点,其中纯恒等映射用作旁路。 Resnets 在许多具有挑战性的图像识别,定位和检测任务中实现了令人印象深刻的,破记录的表现,如 ImageNet 和 COCO 对象检测。最近,Stochastic depth 提出了随机深度作为成功训练 1202 层 resnet 的一种方式。随机深度通过在训练期间随机丢弃层来改善 resnet 的训练。这表明并非所有层都可能是必需的,并且突出显示在深(残差)网络中存在大量冗余。我们的论文部分受到该观察的启发。使用预激活的 ResNet *还有助于帮助训练超过 1000 层的最先进网络。

An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. The GoogLeNet [35, 36] uses an Inception module which concatenates feature-maps produced by filters of different sizes. In [37], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of ResNets can improve its performance provided the depth is sufficient [41]. FractalNets also achieve competitive results on several datasets using a wide network structure [17].

另一种使网络更深的方法(例如,借助于跳连接)是增加网络宽度。 GoogLeNet[ 3536 ]使用一个“Inception 模块”,它连接由不同大小的卷积核产生的特征图。在[ 37 ]中,提出了具有宽广残差块的 ResNet 变体。事实上,只要深度足够,简单地增加每层 ResNets 中的卷积核数量就可以提高其性能[ 41 ]。FractalNet 也可以使用更宽的网络结构在几个数据集上达到不错的效果[ 17]。

Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient. Concatenating feature-maps learned by different layers increases variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNets and ResNets. Compared to Inception networks [35, 36], which also concatenate features from different layers, DenseNets are simpler and more efficient.

DenseNet 不是通过更深或更宽的架构来获取更强的表示学习能力,而是通过特征重用来发掘网络的潜力,产生易于训练和高效利用参数的浓缩模型。由不同层次学习的串联的特征映射增加了后续层输入的变化并提高效率。这是 DenseNet 和 ResNet 之间的主要区别。与 Inception 网络[ 3536 ]相比,它也连接不同层的特征,DenseNet 更简单和更高效。

There are other notable network architecture innovations which have yielded competitive results. The Network in Network (NIN) [22] structure includes micro multi-layer perceptrons in the filters of convolutional layers to extract more complicated features. In Deeply Supervised Network (DSN) [20], internal layers are directly supervised by auxiliary classifiers, which can strengthen the gradients received by earlier layers. Ladder Networks [26, 25] introduce lateral connections into autoencoders, producing impressive accuracies on semi-supervised learning tasks. In [38], Deeply-Fused Nets (DFNs) were proposed to improve information flow by combining intermediate layers of different base networks. The augmentation of networks with pathways that minimize reconstruction losses was also shown to improve image classification models [42].

还有其他显著的网络架构创新产生了有竞争力的效果。Network in Network (NIN)的结构包括将多层感知机插入到卷积层的卷积核中,以提取更复杂的特征。在深度监督网络(DSN)[ 20 ]中,隐藏层由辅助分类器直接监督,可以加强先前层次接收的梯度。梯形网络[ 2625 ]引入横向连接到自动编码器,在半监督学习任务上产生了令人印象深刻的准确性。在[ 38 ]中,提出了通过组合不同基网络的中间层来提高信息流的深度融合网络(DFN)。通过增加最小化重建损失路径的网络也使得图像分类模型得到改善[ 42 ]。

Table 1: DenseNet architectures for ImageNet. The growth rate for the first 3 networks is k=32, and k=48 for DenseNet-161. Note that each “conv” layer shown in the table corresponds the sequence BN-ReLU-Conv.

DenseNets

Consider a single image $ x_0 $ that is passed through a convolutional network. The network comprises $ L $ layers, each of which implements a non-linear transformation $ H_\ell(\cdot) $ , where $ \ell $ indexes the layer. $ H_\ell(\cdot) $ can be a composite function of operations such as Batch Normalization (BN) , rectified linear units (ReLU), Pooling, or Convolution (Conv). We denote the output of the $ \ell^{th} $ layer as $ x_{\ell} $ . Traditional convolutional feed-forward networks connect the output of the $ \ell^{th} $ layer as input to the $ (\ell+1)^{th} $ layer, which gives rise to the following layer transition: $ x_\ell=H_\ell(x_{\ell-1}) $ . ResNets add a skip-connection that bypasses the non-linear transformations with an identity function:

考虑通过卷积网络传递的单个图像$x_0 $。该网络包括$ L $层,每个层实现非线性变换$ H_\ell(\cdot) $,其中$\ell $索引该层。 $ H_\ell(\cdot) $可以是批量归一化(BN),整流线性单元(Relu),池化(Pooling)或卷积(Conv)等操作的复合函数。我们表示$\ell^{th} $层的输出为$x_{\ell} $。

ResNet:传统的卷积前馈神经网络将$\ell^{th} $层的输出连接到$(\ell+1)^{th} $层的输入,这导致以下层转换:$x=H_\ell(x_{\ell-1}) $ 。 Resnets 添加了跳过连接,使用恒等函数跳过非线性变换:

$$ x_\ell=H_\ell(x_{\ell-1})+x_{\ell-1}. $$

An advantage of ResNets is that the gradient can flow directly through the identity function from later layers to the earlier layers. However, the identity function and the output of $ H_\ell $ are combined by summation, which may impede the information flow in the network.
ResNets 的一个优点是梯度可以通过从后续层到先前层的恒等函数直接流动。然而,恒等函数与 $ H_\ell $ 的输出是通过求和组合,这可能阻碍网络中的信息流。

To further improve the information flow between layers we propose a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers. Figure 1 illustrates the layout of the resulting DenseNet schematically. Consequently, the $ \ell^{th} $ layer receives the feature-maps of all preceding layers, $ x_0,\dots,x_{\ell-1} $ , as input:

DenseNet:为了进一步改善层之间的信息流,我们提出了不同的连接模式:我们提出从任何层到所有后续层的直接连接。因此,第 ℓ 层接收所有先前图层的特征图,x0,…,xℓ−1,作为输入:

$$ x_{\ell} = H_\ell([x_0, x_1,\ldots, x_{\ell-1}]), $$

where $ [x_0, x_1,\ldots, x_{\ell-1}] $ refers to the concatenation of the feature-maps produced in layers $ 0,\dots,\ell-1 $ . Because of its dense connectivity we refer to this network architecture as Dense Convolutional Network (DenseNet). For ease of implementation, we concatenate the multiple inputs of $ H_\ell(\cdot) $ in Eq. (2) into a single tensor.

Motivated by [12], we define $ H_\ell(\cdot) $ as a composite function of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU) and a $ 3\times 3 $ convolution (Conv).

The concatenation operation used in Eq. (2) is not viable when the size of feature-maps changes. However, an essential part of convolutional networks is down-sampling layers that change the size of feature-maps. To facilitate down-sampling in our architecture we divide the network into multiple densely connected dense blocks; see Figure 2.

其中$ [x_0, x_1,\ldots, x_{\ell-1}x0 $ 是指在层$ 0,\dots,\ell-1 $中产生的特征映射的串联。由于其密集的连接,我们将此网络体系结构称为 DenseNet。为了便于实现,我们将公式(2)中$ H_\ell(\cdot) $的多个输入连接起来变为单张量。
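The difference between the residual update of Eq. (1) and the dense update of Eq. (2) can be illustrated with a toy example. This is a rough sketch under strong simplifying assumptions: each feature-map is reduced to a single number per channel, and `make_layer` builds a made-up stand-in for $ H_\ell $ (a ReLU of the input mean plus an offset), not the BN-ReLU-Conv function used in the paper. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Channels = List[float]  # one value per channel (spatial size 1x1 for brevity)
Layer = Callable[[Channels], Channels]


def make_layer(k: int) -> Layer:
    """Toy stand-in for H_l: maps any number of input channels to k outputs."""
    def h(inputs: Channels) -> Channels:
        mean = sum(inputs) / len(inputs)
        return [max(0.0, mean + 0.1 * j) for j in range(k)]  # ReLU
    return h


def residual_chain(x0: Channels, layers: List[Layer]) -> Channels:
    """Eq. (1): x_l = H_l(x_{l-1}) + x_{l-1}; channel count stays fixed."""
    x = x0
    for h in layers:
        x = [a + b for a, b in zip(h(x), x)]
    return x


def dense_block(x0: Channels, layers: List[Layer]) -> Tuple[Channels, List[int]]:
    """Eq. (2): x_l = H_l([x_0, ..., x_{l-1}]); outputs are concatenated."""
    features = [x0]
    input_widths = []
    for h in layers:
        concatenated = [c for fmap in features for c in fmap]
        input_widths.append(len(concatenated))
        features.append(h(concatenated))
    return [c for fmap in features for c in fmap], input_widths


x0 = [1.0, 2.0, 3.0]                                   # k0 = 3 input channels
out, widths = dense_block(x0, [make_layer(4) for _ in range(5)])
print(len(out), widths)                                # 23 [3, 7, 11, 15, 19]
print(out[:3] == x0)                                   # True: x_0 is kept verbatim
print(len(residual_chain(x0, [make_layer(3) for _ in range(5)])))  # 3
```

In the dense block the original input survives unchanged inside the concatenated output, whereas the residual chain folds every layer's output into the same three channels by summation.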

复合函数:我们将$ H_\ell(\cdot) $ 定义为三个连续操作的复合功能:先批量归一化(BN),后跟整流线性单元(Relu)和最后$ 3\times 3 $卷积(CONV)。

池化层:当特征图的大小变化时,公式(2)中的级串联运算是不可行的。所以,卷积网络的一个必要的部分是改变特征图尺寸的下采样层。为了便于我们的架构进行下采样,我们将网络分为多个密集连接的密集块,如下图(图 2)所示:(一个有三个密集块的 DenseNet。两个相邻块之间的层被称为过渡层,并通过卷积和池化来改变特征图大小。)

We refer to layers between blocks as transition layers, which do convolution and pooling. The transition layers used in our experiments consist of a batch normalization layer and a 1 $ \times $ 1 convolutional layer followed by a 2 $ \times $ 2 average pooling layer.

If each function $ H_\ell $ produces $ k $ feature-maps, it follows that the $ \ell^{th} $ layer has $ k_0+k\times(\ell-1) $ input feature-maps, where $ k_0 $ is the number of channels in the input layer. An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., $ k=12 $ . We refer to the hyper-parameter $ k $ as the growth rate of the network. We show in the Experiments section that a relatively small growth rate is sufficient to obtain state-of-the-art results on the datasets that we tested on. One explanation for this is that each layer has access to all the preceding feature-maps in its block and, therefore, to the network's collective knowledge. One can view the feature-maps as the global state of the network. Each layer adds $ k $ feature-maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state. The global state, once written, can be accessed from everywhere within the network and, unlike in traditional network architectures, there is no need to replicate it from layer to layer.

Although each layer only produces $ k $ output feature-maps, it typically has many more inputs. It has been noted in [36, 11] that a 1 $ \times $ 1 convolution can be introduced as bottleneck layer before each 3 $ \times $ 3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1 $ \times $ 1)-BN-ReLU-Conv(3 $ \times $ 3) version of $ H_\ell $ , as DenseNet-B.

我们将块之间的层称为过渡层,它们进行卷积和池化。我们实验中使用的过渡层由一个批量归一化层(BN),一个 1$\times$1 的卷积层和一个 2$\times$ 2 平均池化层组成。
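The pooling half of a transition layer is easy to sketch in isolation. The snippet below is an illustration only: it implements 2 $ \times $ 2 average pooling with stride 2 on a single feature-map stored as a nested list (the stride is an assumption, since the text only gives the window size), and it omits the batch normalization and 1 $ \times $ 1 convolution that precede the pooling. The function name `avg_pool_2x2` is hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]


def avg_pool_2x2(fmap: FeatureMap) -> FeatureMap:
    """2x2 average pooling with stride 2; height and width must be even."""
    height, width = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, width, 2)
        ]
        for i in range(0, height, 2)
    ]


# A 4x4 map holding the values 0..15 row by row.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
print(avg_pool_2x2(fmap))  # [[2.5, 4.5], [10.5, 12.5]]

# Two transitions halve a 32x32 map twice.
big = [[0.0] * 32 for _ in range(32)]
once = avg_pool_2x2(big)
twice = avg_pool_2x2(once)
print(len(big), len(once), len(twice))  # 32 16 8
```

Each transition halves both spatial dimensions, which is why three dense blocks on 32 $ \times $ 32 inputs operate at 32 $ \times $ 32, 16 $ \times $ 16 and 8 $ \times $ 8.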

增长率: 如果每个功能$ H_\ell $产生$ k $功能映射,则$\ell^{th} $层具有$ k_0+k\times(\ell-1) $输入特征映射,其中$ k_0 $是输入中的通道数层。与其它网络架构的一个重要的区别是,DenseNet 可以有非常窄的层,比如 k=12。我们将超参数 k 称为网络的增长率。

实验这一节中可以看到,一个相对小的增长率已经足矣在我们测试的数据集上达到领先的结果。对此的一个解释是,每个层都可以访问其块中的所有先前的特征图,即访问网络的“集体知识”。可以将特征图视为网络的全局状态。每个层都将自己的$ k $个特征图添加到这个状态。增长率调节每层对全局状态贡献多少新信息。与传统的网络架构不同的是,一旦写入,从网络内的任何地方都可以访问全局状态,无需逐层复制。
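The input width formula $ k_0+k\times(\ell-1) $ can be tabulated directly. This is a small sketch with a hypothetical helper name; the example values $ k_0=16 $ and $ k=12 $ are chosen for illustration (16 is the initial convolution width mentioned for the basic DenseNet, 12 is the growth rate used as an example in the text).

```python
def input_channels(k0: int, k: int, layer: int) -> int:
    """Number of input feature-maps seen by the layer-th layer of a dense block.

    k0 is the number of channels entering the block, k is the growth rate,
    and layer is 1-based.
    """
    if layer < 1:
        raise ValueError("layer index is 1-based")
    return k0 + k * (layer - 1)


for layer in (1, 2, 3, 12):
    print(layer, input_channels(k0=16, k=12, layer=layer))
# 1 16
# 2 28
# 3 40
# 12 148
```

The input width grows linearly with depth while each layer still only adds $ k $ new feature-maps to the shared state.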

瓶颈层:尽管每层只产生 kk 个输出特征图,但它通常具有更多的输入。在 36 11 中提到,一个 1×1 的卷积层可以被看作是瓶颈层,放在一个 3×3 的卷积层之前可以起到减少输入数量的作用,以提高计算效率。我们发现这种设计对 DenseNet 尤其有效,我们的网络也有这样的瓶颈层,比如另一个名为 DenseNet-B 版本的$ H_\ell $是这样的:BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3),在我们的实验中,我们让 1×1 的卷积层输出 4k 个特征图。
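A rough weight count shows when the bottleneck pays off. The sketch below counts convolution weights only (no biases, no batch normalization parameters), which is a simplification, and the function names are hypothetical. It compares a plain 3 $ \times $ 3 layer with the 1 $ \times $ 1 then 3 $ \times $ 3 bottleneck variant in which the 1 $ \times $ 1 convolution outputs $ 4k $ feature-maps.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a kernel x kernel convolution, ignoring biases."""
    return c_in * c_out * kernel * kernel


def plain_layer_weights(c_in: int, k: int) -> int:
    """BN-ReLU-Conv(3x3): c_in inputs mapped straight to k outputs."""
    return conv_weights(c_in, k, 3)


def bottleneck_layer_weights(c_in: int, k: int) -> int:
    """Conv(1x1) to 4k channels, then Conv(3x3) from 4k to k channels."""
    return conv_weights(c_in, 4 * k, 1) + conv_weights(4 * k, k, 3)


k = 12
for c_in in (64, 256):
    print(c_in, plain_layer_weights(c_in, k), bottleneck_layer_weights(c_in, k))
# 64 6912 8256
# 256 27648 17472
```

Setting $ 9 c k = 4 c k + 36 k^2 $ gives a break-even input width of $ c = 7.2k $, so under this count the bottleneck only saves weights once a layer receives more than about 86 input feature-maps for $ k=12 $. Deep inside a dense block the input is usually much wider than that, which is consistent with the efficiency gain described above.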

In our experiments, we let each 1 $ \times $ 1 convolution produce $ 4k $ feature-maps.

To further improve model compactness, we can reduce the number of feature-maps at transition layers. If a dense block contains $ m $ feature-maps, we let the following transition layer generate $ \lfloor\theta m\rfloor $ output feature-maps, where $ 0<\theta\le 1 $ is referred to as the compression factor. When $ \theta=1 $ , the number of feature-maps across transition layers remains unchanged. We refer to the DenseNet with $ \theta<1 $ as DenseNet-C, and we set $ \theta=0.5 $ in our experiment. When both the bottleneck and transition layers with $ \theta<1 $ are used, we refer to our model as DenseNet-BC.

On all datasets except ImageNet, the DenseNet used in our experiments has three dense blocks that each has an equal number of layers. Before entering the first dense block, a convolution with 16 (or twice the growth rate for DenseNet-BC) output channels is performed on the input images. For convolutional layers with kernel size 3 $ \times $ 3, each side of the inputs is zero-padded by one pixel to keep the feature-map size fixed. We use 1 $ \times $ 1 convolution followed by 2 $ \times $ 2 average pooling as transition layers between two contiguous dense blocks. At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached. The feature-map sizes in the three dense blocks are 32 $ \times $ 32, 16 $ \times $ 16, and 8 $ \times $ 8, respectively. We experiment with the basic DenseNet structure with configurations $ \{L=40, k=12\} $ , $ \{L=100, k=12\} $ and $ \{L=100, k=24\} $ .

For DenseNet-BC, the networks with configurations $ \{L=100, k=12\} $ , $ \{L=250, k=24\} $ and $ \{L=190, k=40\} $ are evaluated.

In our experiments on ImageNet, we use a DenseNet-BC structure with 4 dense blocks on 224 $ \times $ 224 input images. The initial convolution layer comprises $ 2k $ convolutions of size 7 $ \times $ 7 with stride 2; the number of feature-maps in all other layers also follows from setting $ k $ . The exact network configurations we used on ImageNet are shown in Table 1.

- Avoid information loss during forward propagation;
- Ensure earlier layers receive effective gradient;
- Make the inputs of convolutional layers more diversified.

压缩:我们可以让每一个 1 $\times $ 1 卷积产生$ 4k $ feature-maps,为了进一步提高模型的紧凑性,我们可以在过渡层减少特征图的数量。如果密集块包含 m 个特征图,我们让后续的过渡层输出 $ \lfloor\theta m\rfloor $个特征图,其中 θ 为压缩因子,且 0<θ≤1。当 θ=1 时,通过过渡层的特征图的数量保持不变。我们将 θ<1 的 DenseNet 称作 DenseNet-C,我们在实验设置 θ=0.5。我们将同时使用瓶颈层和压缩的模型称为 DenseNet-BC。

实现细节:在除 ImageNet 之外的所有数据集上,我们实验中使用的 DenseNet 具有三个密集块,每个具有相等层数。在进入第一个密集块之前,对输入图像执行卷积,输出 16(DenseNet-BC 的增长率的两倍)通道的特征图。对于卷积核为 3×3 的卷积层,输入的每一边都添加 1 像素宽度的边,以 0 填充,以保持特征图尺寸不变。在两个连续的密集块之间,我们使用一个 1×1 的卷积层接一个 2×2 的平均池化层作为过渡层。在最后一个密集块的后面,执行全局平均池化,然后附加一个 softmax 分类器。三个密集块的特征图尺寸分别是 32×32,16×16 和 8×8。在实验中,基础版本的 DenseNet 超参数配置为:L=40,k=12,L=100,k=12 和 L=100,k=24。DenseNet-BC 的超参数配置为:L=100,k=12,L=250,k=24 和 L=190,k=40。
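The channel bookkeeping for a CIFAR DenseNet-BC can be traced in a few lines. This is a sketch under one labeled assumption: the number of bottleneck layers per block is taken as $ (L-4)/6 $, i.e. the depth $ L $ counts the initial convolution, two transition convolutions and the final classifier (4 layers) plus two convolutions per bottleneck layer in three equal blocks. That rule is not stated in the text above, and the function name is hypothetical.

```python
import math
from typing import List


def densenet_bc_cifar_channels(depth: int, k: int, theta: float = 0.5) -> List[int]:
    """Feature-map count at the end of each of the three dense blocks."""
    # Assumption: depth = 1 + 3 * (2 * layers_per_block) + 2 + 1.
    layers_per_block = (depth - 4) // 3 // 2
    channels = 2 * k                     # initial convolution for DenseNet-BC
    trace = []
    for block in range(3):
        channels += layers_per_block * k  # each layer appends k feature-maps
        trace.append(channels)
        if block < 2:                     # transition with compression theta
            channels = math.floor(theta * channels)
    return trace


print(densenet_bc_cifar_channels(depth=100, k=12))  # [216, 300, 342]
```

For $ \{L=100, k=12\} $ the blocks end with 216, 300 and 342 feature-maps, with the two transitions compressing to 108 and 150 in between.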

在 ImageNet 上的实验,我们使用的 DenseNet-BC 架构包含 4 个密集块,输入图像大小为 224×224。初始卷积层包含 2k 个大小为 7×7,步长为 2 的卷积。其它层的特征图数量都由超参数 kk 决定。具体的网络架构如(表 1)所示:(用于 ImageNet 的 DenseNet 网络架构。所有网络的增长率都是 k=32。注意,表格中的 Conv 层表示 BN-ReLU-Conv 的组合)- 避免在前向传播期间的信息丢失; - 确保早期的层获得有效梯度; - 使卷积层的输入更加多样化。
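The ImageNet model names can be related to their block layouts with a similar calculation. The per-block layer counts in `BLOCKS` are taken from the paper's Table 1, which is not reproduced in this text, so treat them as an assumption here; the growth rates follow the Table 1 caption. The sketch counts two convolutions per bottleneck layer, one convolution per transition, the initial convolution and the final classifier, and uses $ \theta=0.5 $ at transitions. All names are hypothetical.

```python
import math
from typing import Dict, Tuple

# Assumption: layers per dense block, as listed in the paper's Table 1.
BLOCKS: Dict[str, Tuple[int, ...]] = {
    "DenseNet-121": (6, 12, 24, 16),
    "DenseNet-169": (6, 12, 32, 32),
    "DenseNet-201": (6, 12, 48, 32),
    "DenseNet-161": (6, 12, 36, 24),
}
GROWTH_RATE = {"DenseNet-121": 32, "DenseNet-169": 32,
               "DenseNet-201": 32, "DenseNet-161": 48}


def depth_from_blocks(blocks: Tuple[int, ...]) -> int:
    """2 convs per bottleneck layer + transitions + initial conv + classifier."""
    return 2 * sum(blocks) + (len(blocks) - 1) + 2


def final_channels(blocks: Tuple[int, ...], k: int, theta: float = 0.5) -> int:
    """Feature-maps entering the classifier of a DenseNet-BC."""
    channels = 2 * k                      # initial 7x7 convolution
    for index, num_layers in enumerate(blocks):
        channels += num_layers * k
        if index < len(blocks) - 1:
            channels = math.floor(theta * channels)
    return channels


for name, blocks in BLOCKS.items():
    print(name, depth_from_blocks(blocks), final_channels(blocks, GROWTH_RATE[name]))
# DenseNet-121 121 1024
# DenseNet-169 169 1664
# DenseNet-201 201 1920
# DenseNet-161 161 2208
```

The computed depths reproduce the numbers in the model names, which is a useful sanity check on the layer-counting convention.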

Experiments 实验

We empirically demonstrate DenseNet's effectiveness on several benchmark datasets and compare with state-of-the-art architectures, especially with ResNet and its variants.

我们经验证明了几个基准数据集的效力,并与最先进的架构进行比较,特别是与 Reset 及其变体相比。

| Method | Depth | Params | C10 | C10+ | C100 | C100+ | SVHN |
|---|---|---|---|---|---|---|---|
| Network in Network [22] | - | - | 10.41 | 8.81 | 35.68 | - | 2.35 |
| All-CNN [31] | - | - | 9.08 | 7.25 | - | 33.71 | - |
| Deeply Supervised Net [20] | - | - | 9.69 | 7.97 | - | 34.57 | 1.92 |
| Highway Network [33] | - | - | - | 7.72 | - | 32.39 | - |
| FractalNet [17] | 21 | 38.6M | 10.18 | 5.22 | 35.34 | 23.30 | 2.01 |
| with Dropout/Drop-path | 21 | 38.6M | 7.33 | 4.60 | 28.20 | 23.73 | 1.87 |
| ResNet [11] | 110 | 1.7M | - | 6.61 | - | - | - |
| ResNet (reported by [13]) | 110 | 1.7M | 13.63 | 6.41 | 44.74 | 27.22 | 2.01 |
| ResNet with Stochastic Depth [13] | 110 | 1.7M | 11.66 | 5.23 | 37.80 | 24.58 | 1.75 |
| | 1202 | 10.2M | - | 4.91 | - | - | - |
| Wide ResNet [41] | 16 | 11.0M | - | 4.81 | - | 22.07 | - |
| | 28 | 36.5M | - | 4.17 | - | 20.50 | - |
| with Dropout | 16 | 2.7M | - | - | - | - | 1.64 |
| ResNet (pre-activation) [12] | 164 | 1.7M | 11.26∗ | 5.46 | 35.58∗ | 24.33 | - |
| | 1001 | 10.2M | 10.56∗ | 4.62 | 33.47∗ | 22.71 | - |
| DenseNet (k=12) | 40 | 1.0M | 7.00 | 5.24 | 27.55 | 24.42 | 1.79 |
| DenseNet (k=12) | 100 | 7.0M | 5.77 | 4.10 | 23.79 | 20.20 | 1.67 |
| DenseNet (k=24) | 100 | 27.2M | 5.83 | 3.74 | 23.42 | 19.25 | 1.59 |
| DenseNet-BC (k=12) | 100 | 0.8M | 5.92 | 4.51 | 24.15 | 22.27 | 1.76 |
| DenseNet-BC (k=24) | 250 | 15.3M | 5.19 | 3.62 | 19.64 | 17.60 | 1.74 |
| DenseNet-BC (k=40) | 190 | 25.6M | - | 3.46 | - | 17.18 | - |


Table 2: Error rates (%) on CIFAR and SVHN datasets. L denotes the network depth and k its growth rate. Results that surpass all competing methods are bold and the overall best results are blue. “+” indicates standard data augmentation (translation and/or mirroring). ∗ indicates results run by ourselves. All the results of DenseNets without data augmentation (C10, C100, SVHN) are obtained using Dropout. DenseNets achieve lower error rates while using fewer parameters than ResNet. Without data augmentation, DenseNet performs better by a large margin.

Datasets 数据集

The two CIFAR datasets consist of colored natural images with 32 $ \times $ 32 pixels. CIFAR-10 (C10) consists of images drawn from 10 classes and CIFAR-100 (C100) from 100 classes. The training and test sets contain 50,000 and 10,000 images respectively, and we hold out 5,000 training images as a validation set. We adopt a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets. We denote this data augmentation scheme by a + mark at the end of the dataset name (e.g., C10+). For preprocessing, we normalize the data using the channel means and standard deviations. For the final run we use all 50,000 training images and report the final test error at the end of training.

The Street View House Numbers (SVHN) dataset contains 32 $ \times $ 32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images for additional training. Following common practice we use all the training data without any data augmentation, and a validation set with 6,000 images is split from the training set. We select the model with the lowest validation error during training and report the test error. Following prior work, we divide the pixel values by 255 so they are in the $ [0, 1] $ range.

**CIFAR:**两个 CIFAR 数据集都是由 32×32 像素的彩色照片组成的。CIFAR-10(C10)包含 10 个类别,CIFAR-100(C100)包含 100 个类别。训练和测试集分别包含 50,000 和 10,000 张照片,我们将 5000 张训练照片作为验证集。我们采用广泛应用于这两个数据集的标准数据增强方案(镜像/移位)。我们通过数据集名称末尾的“+”标记(例如,C10+)表示该数据增强方案。对于预处理,我们使用各通道的均值和标准偏差对数据进行归一化。对于最终的训练,我们使用所有 50,000 训练图像,作为最终的测试结果。

**SVHN:**街景数字(SVHN)数据集由 32×32 像素的彩色数字照片组成。训练集有 73,257 张照片,测试集有 26,032 张照片,以及 531,131 张照片进行额外的训练。按照常规做法,我们使用所有的训练数据,没有任何数据增强,使用训练集中的 6,000 张照片作为验证集。我们选择在训练期间具有最低验证错误的模型,作最终的测试。我们遵循[2 并将像素值除以 255,使它们在[0,1]范围内。
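The two preprocessing steps mentioned for these datasets are simple enough to sketch on toy data. This is an illustration only: it works on a flat list of values for one channel rather than on real images, the function names are hypothetical, and using the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import List


def scale_to_unit_range(pixels: List[int]) -> List[float]:
    """SVHN-style scaling: map 8-bit pixel values into the [0, 1] range."""
    return [p / 255.0 for p in pixels]


def normalize_channel(values: List[float]) -> List[float]:
    """CIFAR-style normalization of one channel by its mean and std.

    Assumes the channel is not constant (non-zero standard deviation).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


print(scale_to_unit_range([0, 51, 255]))       # [0.0, 0.2, 1.0]

normalized = normalize_channel([2.0, 4.0, 6.0])
print(round(mean(normalized), 6), round(pstdev(normalized), 6))  # 0.0 1.0
```

In practice the channel statistics would be computed once over the whole training set and then reused for the validation and test images.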

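**ImageNet:** The ILSVRC 2012 classification dataset consists of 1.2 million training images and 50,000 validation images from 1,000 classes. We apply data augmentation to the training images, and use a single crop or 10 crops of size 224 $ \times $ 224 at test time. Following [11][12][13], we report classification errors on the validation set.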

Training 训练

All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train using batch size 64 for 300 and 40 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train models for 90 epochs with a batch size of 256. The learning rate is set to 0.1 initially, and is divided by 10 at epochs 30 and 60. Note that a naive implementation of DenseNet may contain memory inefficiencies. To reduce the memory consumption on GPUs, please refer to our technical report on the memory-efficient implementation of DenseNets.

Following [8], we use a weight decay of $ 10^{-4} $ and a Nesterov momentum [34] of 0.9 without dampening. We adopt the weight initialization introduced by [10]. For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer after each convolutional layer (except the first one) and set the dropout rate to 0.2. The test errors were only evaluated once for each task and model setting.
所有网络都使用随机梯度下降(SGD)进行训练。在 CIFAR 和 SVHN 上,我们训练批量为 64,分别训练 300 和 40 个周期。初始学习率设置为 0.1,在训练周期数达到 50%和 75%时除以 10。在 ImageNet 上,训练批量为 256,训练 90 个周期。学习速率最初设置为 0.1,并在训练周期数达到 30 和 60 时除以 10。由于 GPU 内存限制,我们最大的模型(DenseNet-161)以小批量 128 进行训练。为了补偿较小的批量,我们训练该模型的周期数为 100,并在训练周期数达到 90 时将学习率除以 10。

注意,Densenet 的原生实现可能包含内存低效率。为降低 GPU 上的内存消耗,请参阅我们关于 DENSENETS 的内存高效实施的技术报告。

根据[8],我们使用的权重衰减为$ 10^{-4} $,Nesterov 动量[34]为 0.9 且没有衰减。我们采用[10]中提出的权重初始化。对于没有数据增强的三个数据集,即 C10,C100 和 SVHN,我们在每个卷积层之后(除第一个层之外)添加一个 Dropout 层,并将 Dropout 率设置为 0.2。只对每个任务和超参数设置评估一次测试误差。
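The step schedule used on CIFAR and SVHN can be written as a small function. This is a sketch, not the authors' training code: the function name is hypothetical, epochs are assumed to be 0-based, and the drop is assumed to take effect at the first epoch that reaches 50% or 75% of the total.

```python
def learning_rate(epoch: int, total_epochs: int, base_lr: float = 0.1) -> float:
    """Step schedule: divide by 10 at 50% and again at 75% of training."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10
    if epoch >= 0.75 * total_epochs:
        lr /= 10
    return lr


# CIFAR: 300 epochs, so the drops fall at epochs 150 and 225.
for epoch in (0, 149, 150, 224, 225, 299):
    print(epoch, round(learning_rate(epoch, total_epochs=300), 6))
# 0 0.1
# 149 0.1
# 150 0.01
# 224 0.01
# 225 0.001
# 299 0.001
```

With 40 epochs on SVHN the same function places the drops at epochs 20 and 30. The ImageNet schedule (drops at epochs 30 and 60 of 90) uses fixed epochs rather than these percentages, so it is not covered by this helper.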

Classification Results on CIFAR and SVHN CIFAR 和 SVHN 上的分类结果

We train DenseNets with different depths, $ L $ , and growth rates, $ k $ . The main results on CIFAR and SVHN are shown in Table 2. To highlight general trends, we mark all results that outperform the existing state-of-the-art in boldface and the overall best result in blue.

Possibly the most noticeable trend may originate from the bottom row of Table 2, which shows that DenseNet-BC with $ L=190 $ and $ k=40 $ outperforms the existing state-of-the-art consistently on all the CIFAR datasets. Its error rates of 3.46% on C10+ and 17.18% on C100+ are significantly lower than the error rates achieved by wide ResNet architecture [41]. Our best results on C10 and C100 (without data augmentation) are even more encouraging: both are close to 30% lower than FractalNet with drop-path regularization [17]. On SVHN, with dropout, the DenseNet with $ L=100 $ and $ k=24 $ also surpasses the current best result achieved by wide ResNet. However, the 250-layer DenseNet-BC doesn't further improve the performance over its shorter counterpart. This may be explained by the fact that SVHN is a relatively easy task, and extremely deep models may overfit to the training set.

Without compression or bottleneck layers, there is a general trend that DenseNets perform better as $ L $ and $ k $ increase. We attribute this primarily to the corresponding growth in model capacity. This is best demonstrated by the columns of C10+ and C100+. On C10+, the error drops from 5.24% to 4.10% and finally to 3.74% as the number of parameters increases from 1.0M, over 7.0M to 27.2M. On C100+, we observe a similar trend. This suggests that DenseNets can utilize the increased representational power of bigger and deeper models. It also indicates that they do not suffer from overfitting or the optimization difficulties of residual networks.

The results in Table 2 indicate that DenseNets utilize parameters more efficiently than alternative architectures (in particular, ResNets). The DenseNet-BC with bottleneck structure and dimension reduction at transition layers is particularly parameter-efficient.

我们训练不同深度$ L $和增长率 k 的 DenseNet。 CIFAR 和 SVHN 的主要结果如下表(表 2)所示:(CIFAR 和 SVHN 数据集的错误率(%)。 kk 表示网络的增长率。超越所有竞争方法的结果以粗体表示,整体最佳效果标记为蓝色。 “+”表示标准数据增强(翻转和/或镜像)。“*”表示我们的运行结果。没有数据增强(C10,C100,SVHN)的 DenseNets 测试都使用 Dropout。使用比 ResNet 少的参数,DenseNets 可以实现更低的错误率。没有数据增强,DenseNet 的表现更好。)

准确率 可能最明显的趋势在表 2 的底行,L=190,k=40 的 DenseNet-BC 优于所有 CIFAR 数据集上现有的一流技术。它的 C10+ 错误率为 3.46%,C100+ 的错误率为 17.18%,明显低于 Wide ResNet 架构的错误率。我们在 C10 和 C100(没有数据增强)上取得的最佳成绩更令人鼓舞:两者的错误率均比带有下降路径正则化的 FractalNet 下降了接近 30%。在 SVHN 上,在使用 Dropout 的情况下,L=100,k=24 的 DenseNet 也超过 Wide ResNet 成为了当前的最佳结果。然而,相对于层数较少的版本,250 层 DenseNet-BC 并没有进一步改善其性能。这可以解释为 SVHN 是一个相对容易的任务,极深的模型可能会过拟合训练集。

模型容量 在没有压缩或瓶颈层的情况下,总体趋势是 DenseNet 在 L 和 k 增加时表现更好。我们认为这主要是模型容量相应地增长。从表 2 中 C10+ 和 C100+ 列可以看出。在 C10+ 上,随着参数数量从 1.0M,增加到 7.0M,再到 27.2M,误差从 5.24%,下降到 4.10%,最终降至 3.74%。在 C100 + 上,我们也可以观察到类似的趋势。这表明 DenseNet 可以利用更大更深的模型提高其表达学习能力。也表明它们不会受到类似 ResNet 那样的过度拟合或优化困难的影响。

For example, our 250-layer model only has 15.3M parameters, but it consistently outperforms other models such as FractalNet and Wide ResNets that have more than 30M parameters. We also highlight that DenseNet-BC with $ L=100 $ and $ k=12 $ achieves comparable performance (e.g., 4.51% vs 4.62% error on C10+, 22.27% vs 22.71% error on C100+) to the 1001-layer pre-activation ResNet using 90% fewer parameters. Figure 4 (right panel) shows the training loss and test errors of these two networks on C10+. The 1001-layer deep ResNet converges to a lower training loss value but a similar test error. We analyze this effect in more detail below.

One positive side-effect of the more efficient use of parameters is a tendency of DenseNets to be less prone to overfitting. We observe that on the datasets without data augmentation, the improvements of DenseNet architectures over prior work are particularly pronounced. On C10, the improvement denotes a 29% relative reduction in error from 7.33% to 5.19%. On C100, the reduction is about 30% from 28.20% to 19.64%. In our experiments, we observed potential overfitting in a single setting: on C10, a 4 $ \times $ growth of parameters produced by increasing $ k=12 $ to $ k=24 $ leads to a modest increase in error from 5.77% to 5.83%. The DenseNet-BC bottleneck and compression layers appear to be an effective way to counter this trend.

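Parameter efficiency: The results in Table 2 indicate that DenseNets use parameters more efficiently than alternative architectures (in particular, ResNets). DenseNet-BC, with its bottleneck structure and compression at the transition layers, is the most parameter-efficient. For example, our 250-layer model has only 15.3M parameters, yet it consistently outperforms other models such as FractalNet and Wide ResNets that have more than 30M parameters. It is also worth noting that DenseNet-BC with $ L=100 $ and $ k=12 $ achieves performance comparable to the 1001-layer pre-activation ResNet (e.g., 4.51% vs 4.62% error on C10+, and 22.27% vs 22.71% error on C100+) with 90% fewer parameters. The right panel of Figure 4 shows the training loss and test error of these two networks on C10+. The 1001-layer ResNet converges to a lower training loss but a similar test error. We analyze this effect in more detail below.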

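Overfitting: One benefit of the more efficient use of parameters is that DenseNets are less prone to overfitting. We observe that on the datasets without data augmentation, the improvements of DenseNets over prior work are particularly pronounced. On C10, the error drops from 7.33% to 5.19%, a 29% relative reduction. On C100, the error drops from 28.20% to 19.64%, a relative reduction of about 30%. In our experiments, we observed potential overfitting in a single setting: on C10, the 4$ \times $ growth in parameters produced by increasing $ k=12 $ to $ k=24 $ leads to a modest increase in error from 5.77% to 5.83%. The bottleneck and compression layers of DenseNet-BC appear to be an effective way to counter this trend.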
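As a quick arithmetic check of the relative reductions quoted above, the following sketch recomputes them from the stated error rates. The parameter counts of 0.8M (DenseNet-BC) and 10.2M (1001-layer ResNet) are taken from the Discussion section of this document; the helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


c10 = relative_reduction(7.33, 5.19)      # C10 error, no data augmentation
c100 = relative_reduction(28.20, 19.64)   # C100 error, no data augmentation
params = relative_reduction(10.2, 0.8)    # parameters, in millions

print(f"C10: {c10:.1%}  C100: {c100:.1%}  parameters: {params:.1%}")
```

The two error reductions round to the 29% and 30% stated in the text, and the parameter saving is slightly above the quoted 90%.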

Classification Results on ImageNet

We evaluate DenseNet-BC with different depths and growth rates on the ImageNet classification task, and compare it with state-of-the-art ResNet architectures. To ensure a fair comparison between the two architectures, we eliminate all other factors such as differences in data preprocessing and optimization settings by adopting the publicly available Torch implementation for ResNet at https://github.com/facebook/fb.resnet.torch . We simply replace the ResNet model with the DenseNet-BC network, and keep all the experiment settings exactly the same as those used for ResNet. We report the single-crop and 10-crop validation errors of DenseNets on ImageNet in Table 3. Figure 3 shows the single-crop top-1 validation errors of DenseNets and ResNets as a function of the number of parameters (left) and FLOPs (right). The results presented in the figure reveal that DenseNets perform on par with the state-of-the-art ResNets, whilst requiring significantly fewer parameters and computation to achieve comparable performance. For example, a DenseNet-201 model with 20M parameters yields a similar validation error to a 101-layer ResNet with more than 40M parameters. Similar trends can be observed from the right panel, which plots the validation error as a function of the number of FLOPs: a DenseNet that requires as much computation as a ResNet-50 performs on par with a ResNet-101, which requires twice as much computation. It is worth noting that our experimental setup implies that we use hyperparameter settings that are optimized for ResNets but not for DenseNets. It is conceivable that more extensive hyper-parameter searches may further improve the performance of DenseNet on ImageNet.

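We evaluate DenseNet-BC with different depths and growth rates on the ImageNet classification task and compare it with state-of-the-art ResNet architectures. To ensure a fair comparison between the two architectures, we adopt the Torch implementation of ResNet provided by Facebook, which removes all other factors such as differences in data preprocessing and optimization settings. We simply replace the ResNet model with DenseNet-BC and keep all experiment settings exactly the same as those used for ResNet.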

The results of DenseNets on ImageNet are shown in Table 3 below (top-1 and top-5 error rates on the ImageNet validation set, with single-crop and 10-crop testing):

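| Model | top-1 (single-crop / 10-crop) | top-5 (single-crop / 10-crop) |
|---|---|---|
| DenseNet-121 | 25.02 / 23.61 | 7.71 / 6.66 |
| DenseNet-169 | 23.80 / 22.08 | 6.85 / 5.92 |
| DenseNet-201 | 22.58 / 21.46 | 6.34 / 5.54 |
| DenseNet-264 | 22.15 / 20.80 | 6.12 / 5.29 |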

The top-1 and top-5 error rates on the ImageNet validation set, with single-crop / 10-crop testing.

Figure 3 below plots the single-crop top-1 validation errors of DenseNets and ResNets as a function of the number of parameters (left) and of the amount of computation (right):

Figure 3: Comparison of the DenseNet and ResNet Top-1 (single model and single-crop) error rates on the ImageNet classification dataset as a function of learned parameters (left) and flops during test-time (right).

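The results in the figure show that DenseNets match the validation error of state-of-the-art ResNets while requiring significantly fewer parameters and less computation. For example, a DenseNet-201 with 20M parameters yields a validation error close to that of a 101-layer ResNet with more than 40M parameters. A similar trend can be seen in the right panel, which plots the validation error as a function of the amount of computation: a DenseNet that requires about as much computation as a ResNet-50 reaches a validation error close to that of a ResNet-101, which requires twice as much computation.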

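It is worth noting that our experimental setup uses hyperparameter settings that are optimized for ResNets but not for DenseNets. It is conceivable that a more extensive hyperparameter search could further improve the performance of DenseNet on ImageNet. (Our DenseNet implementation is not memory-efficient, which temporarily precludes experiments with more than 30M parameters.)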

Discussion

Superficially, DenseNets are quite similar to ResNets: Eq. (2) differs from Eq. (1) only in that the inputs to $ H_\ell(\cdot) $ are concatenated instead of summed. However, the implications of this seemingly small modification lead to substantially different behaviors of the two network architectures.

Model compactness. As a direct consequence of the input concatenation, the feature-maps learned by any of the DenseNet layers can be accessed by all subsequent layers. This encourages feature reuse throughout the network, and leads to more compact models. The left two plots in Figure 4 show the result of an experiment that aims to compare the parameter efficiency of all variants of DenseNets (left) and also a comparable ResNet architecture (middle). We train multiple small networks with varying depths on C10+ and plot their test accuracies as a function of network parameters. In comparison with other popular network architectures, such as AlexNet or VGG-net, ResNets with pre-activation use fewer parameters while typically achieving better results. Hence, we compare DenseNet ( $ k=12 $ ) against this architecture. The training setting for DenseNet is kept the same as in the previous section. The graph shows that DenseNet-BC is consistently the most parameter efficient variant of DenseNet. Further, to achieve the same level of accuracy, DenseNet-BC only requires around 1/3 of the parameters of ResNets (middle plot). This result is in line with the results on ImageNet we presented in Figure 3. The right plot in Figure 4 shows that a DenseNet-BC with only 0.8M trainable parameters is able to achieve comparable accuracy as the 1001-layer (pre-activation) ResNet with 10.2M parameters.

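Superficially, DenseNets are quite similar to ResNets: Eq. (2) differs from Eq. (1) only in that the inputs are concatenated instead of summed. However, this seemingly small modification leads to substantially different behaviors of the two network architectures.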
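The difference between summing and concatenating can be made concrete with a toy example. The sketch below is purely illustrative and uses plain Python lists in place of feature-maps; `make_layer` is a hypothetical stand-in for $ H_\ell(\cdot) $, not the composite function used in the paper.

```python
def residual_forward(x, layers):
    """ResNet-style: each layer's output is summed with its input."""
    for h in layers:
        x = [xi + hi for xi, hi in zip(x, h(x))]
    return x


def dense_forward(x, layers):
    """DenseNet-style: each layer reads the concatenation of all
    earlier feature-maps and appends its own output to it."""
    features = list(x)
    for h in layers:
        features = features + h(features)
    return features


def make_layer(num_outputs):
    """Toy stand-in for H: emits `num_outputs` copies of the input mean."""
    def h(inputs):
        mean = sum(inputs) / len(inputs)
        return [mean] * num_outputs
    return h


x0 = [1.0, 2.0, 3.0, 4.0]
k = 2  # growth rate of the toy dense network
res_out = residual_forward(x0, [make_layer(len(x0)) for _ in range(3)])
dense_out = dense_forward(x0, [make_layer(k) for _ in range(3)])
print(len(res_out), len(dense_out))
```

With summation the width stays at 4 and earlier features are overwritten, whereas with concatenation the width grows by $ k $ per layer (here to 4 + 3·2 = 10) and the original input remains directly accessible to every later layer.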

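**Model compactness:** As a direct consequence of the input concatenation, the feature-maps learned by any DenseNet layer can be accessed by all subsequent layers. This encourages feature reuse throughout the network and leads to more compact models.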
Figure 4

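In Figure 4 above, the left two plots show the result of an experiment that compares the parameter efficiency of all variants of DenseNet (left) and of a comparable ResNet architecture (middle). We train multiple small networks with varying depths on C10+ and plot their test accuracies as a function of the number of network parameters. Compared with other popular network architectures such as AlexNet or VGG, ResNets with pre-activation use fewer parameters while typically achieving better results. Hence, we compare DenseNet ($ k=12 $) against this architecture. The training setting for DenseNet is kept the same as in the previous section.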

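The graph shows that DenseNet-BC is consistently the most parameter-efficient variant of DenseNet. Further, to achieve the same level of accuracy, DenseNet-BC requires only around 1/3 of the parameters of ResNets (middle plot). This result is in line with the ImageNet results shown in Figure 3. The right plot in Figure 4 shows that a DenseNet-BC with only 0.8M trainable parameters achieves accuracy comparable to the 1001-layer (pre-activation) ResNet with 10.2M parameters.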

Implicit Deep Supervision. One explanation for the improved accuracy of DenseNets may be that individual layers receive additional supervision from the loss function through the shorter connections. One can interpret DenseNets to perform a kind of "deep supervision". The benefits of deep supervision have previously been shown in deeply-supervised nets (DSN; [20]), which have classifiers attached to every hidden layer, enforcing the intermediate layers to learn discriminative features. DenseNets perform a similar deep supervision in an implicit fashion: a single classifier on top of the network provides direct supervision to all layers through at most two or three transition layers. However, the loss function and gradient of DenseNets are substantially less complicated, as the same loss function is shared between all layers.

Stochastic vs. deterministic connection. There is an interesting connection between DenseNets and stochastic depth regularization of residual networks [13]. In stochastic depth, layers in residual networks are randomly dropped, which creates direct connections between the surrounding layers. As the pooling layers are never dropped, the network results in a similar connectivity pattern as DenseNet: there is a small probability for any two layers, between the same pooling layers, to be directly connected, if all intermediate layers are randomly dropped. Although the methods are ultimately quite different, the DenseNet interpretation of stochastic depth may provide insights into the success of this regularizer.

Feature Reuse. By design, DenseNets allow layers access to feature-maps from all of their preceding layers (although sometimes through transition layers). We conduct an experiment to investigate if a trained network takes advantage of this opportunity. We first train a DenseNet on C10+ with $ L=40 $ and $ k=12 $ . For each convolutional layer $ \ell $ within a block, we compute the average (absolute) weight assigned to connections with layer $ s $ . Figure 5 shows a heat-map for all three dense blocks. The average absolute weight serves as a surrogate for the dependency of a convolutional layer on its preceding layers. A red dot in position ( $ \ell,s $ ) indicates that the layer $ \ell $ makes, on average, strong use of feature-maps produced $ s $ -layers before. Several observations can be made from the plot:

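Implicit deep supervision: One possible explanation for the improved accuracy of dense convolutional networks is that individual layers receive additional supervision from the loss function through the shorter connections. DenseNets can be interpreted as performing a kind of "deep supervision". The benefits of deep supervision have previously been shown in deeply-supervised nets (DSN), which attach a classifier to every hidden layer and thereby force the intermediate layers to learn discriminative features.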

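DenseNets perform a similar deep supervision in an implicit fashion: a single classifier on top of the network provides direct supervision to all layers through at most two or three transition layers. However, the loss function and gradient of DenseNets are substantially less complicated, as the same loss function is shared between all layers.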

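Stochastic vs. deterministic connection: There is an interesting connection between dense convolutional networks and stochastic depth regularization of residual networks. In stochastic depth, layers in residual networks are randomly dropped, which creates direct connections between the surrounding layers. As the pooling layers are never dropped, the network ends up with a connectivity pattern similar to DenseNet: there is a small probability that any two layers between the same pooling layers are directly connected, namely when all intermediate layers are randomly dropped. Although the methods are ultimately quite different, the DenseNet interpretation of stochastic depth may provide insights into the success of this regularizer.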
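To give a sense of how small this probability is, the sketch below computes the chance that two layers become directly connected under stochastic depth. It assumes each intermediate layer is dropped independently with its own drop probability; the example value 0.2 and the function name are illustrative assumptions, not settings from the paper.

```python
import math


def direct_connection_probability(drop_probs):
    """Probability that every intermediate layer is dropped, assuming
    independent drops. `drop_probs` lists the drop probability of each
    layer lying between the two layers of interest."""
    return math.prod(drop_probs)


# Two layers separated by three intermediate layers, each dropped with p = 0.2.
p = direct_connection_probability([0.2, 0.2, 0.2])
print(f"{p:.4f}")
```

The probability shrinks geometrically with the number of intermediate layers, whereas in a DenseNet the corresponding connection is always present.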

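Feature reuse: By design, DenseNets allow layers access to feature-maps from all of their preceding layers (although sometimes through transition layers). We conduct an experiment to investigate whether a trained network takes advantage of this opportunity. We first train a DenseNet on C10+ with $ L=40 $ and $ k=12 $. For each convolutional layer $ \ell $ within a block, we compute the average (absolute) weight assigned to connections with layer $ s $. The result is shown in Figure 5 below. (The average absolute filter weights of convolutional layers in a trained DenseNet. Pixel $ (s,\ell) $ encodes the average L1 norm, normalized by the number of input feature-maps, of the weights connecting convolutional layer $ s $ to $ \ell $ within the same dense block. The three columns highlighted by black rectangles correspond to the two transition layers and the classification layer. The first row encodes the weights connected to the input layer of the dense block.)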

Figure 5

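Figure 5 shows a heat-map for all three dense blocks. The average absolute weight serves as a surrogate for the dependency of a convolutional layer on its preceding layers. A red dot in position $ (\ell,s) $ indicates that layer $ \ell $ makes, on average, strong use of the feature-maps produced $ s $ layers before. Several observations can be made from the plot: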

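  1. All layers spread their weights over many inputs within the same dense block. This indicates that features extracted by earlier layers are indeed used directly by later layers throughout the same dense block.
  2. The transition layers also spread their weights across all layers within the preceding dense block, indicating that information flows from the first to the last layers of the DenseNet through few indirections.
  3. The layers within the second and third dense blocks consistently assign the least weight to the outputs of the transition layer (the top row of the triangles), indicating that the transition layer outputs many redundant features (with low weight on average). This is consistent with the strong results of DenseNet-BC, where exactly these outputs are compressed.
  4. Although the final classification layer, shown on the very right, also uses weights across the entire dense block, there appears to be a concentration on the final feature-maps, suggesting that more high-level features may be produced late in the network.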
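The statistic behind each heat-map entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the weights of one layer are given as a plain list indexed by input channel, and the input channels are grouped by the source that produced them (block input first, then each earlier layer). The function name and the toy weights are hypothetical.

```python
def mean_abs_weight_per_source(filters, source_sizes):
    """Average absolute weight a layer assigns to each of its sources.

    filters[c] holds all weights that read input channel c of the layer.
    source_sizes[i] is the number of input channels contributed by
    source i (block input first, then each earlier layer in order)."""
    assert len(filters) == sum(source_sizes)
    averages, start = [], 0
    for size in source_sizes:
        weights = [w for channel in filters[start:start + size] for w in channel]
        # L1 norm of the group, normalized by its number of weights
        averages.append(sum(abs(w) for w in weights) / len(weights))
        start += size
    return averages


# Toy layer: 4 block-input channels, then two earlier layers with k = 2 each.
filters = [[0.5, -0.5]] * 4 + [[1.0, -1.0]] * 2 + [[-2.0, 2.0]] * 2
row = mean_abs_weight_per_source(filters, [4, 2, 2])
print(row)  # [0.5, 1.0, 2.0]
```

Computing one such row per layer $ \ell $ in a block yields the triangular heat-map described above; a large entry means that layer $ \ell $ relies heavily on the corresponding source.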

Conclusion

We proposed a new convolutional network architecture, which we refer to as Dense Convolutional Network (DenseNet). It introduces direct connections between any two layers with the same feature-map size. We showed that DenseNets scale naturally to hundreds of layers, while exhibiting no optimization difficulties. In our experiments, DenseNets tend to yield consistent improvement in accuracy with growing number of parameters, without any signs of performance degradation or overfitting. Under multiple settings, it achieved state-of-the-art results across several highly competitive datasets. Moreover, DenseNets require substantially fewer parameters and less computation to achieve state-of-the-art performances. Because we adopted hyperparameter settings optimized for residual networks in our study, we believe that further gains in accuracy of DenseNets may be obtained by more detailed tuning of hyperparameters and learning rate schedules.

Whilst following a simple connectivity rule, DenseNets naturally integrate the properties of identity mappings, deep supervision, and diversified depth. They allow feature reuse throughout the networks and can consequently learn more compact and, according to our experiments, more accurate models. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4, 5]. We plan to study such feature transfer with DenseNets in future work.

Acknowledgements. The authors are supported in part by the NSF III-1618134, III-1526012, IIS-1149882, the Office of Naval Research Grant N00014-17-1-2175 and the Bill and Melinda Gates foundation. GH is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No.20150015). ZL is supported by the National Basic Research Program of China Grants 2011CBA00300, 2011CBA00301, the NSFC 61361136003. We also thank Daniel Sedra, Geoff Pleiss and Yu Sun for many insightful discussions.

  • [1.] In the current version, we introduce an improvement to DenseNet (DenseNet-BC), which has an additional 1$ \times $1 convolution in each composite function (cf. Section 3) and compressive transition layers. Experimental results with this architecture are added (cf. Section 4.3 and Section 4.4).
  • [2.] Results of DenseNets on the ImageNet classification task are added (cf. Section 4.4).
  • [3.] A brief discussion on the connection between DenseNets and Stochastic Depth Networks is added (cf. Section 5).
  • [4.] We replaced the heat map (Fig. 4) in the previous version which was not clear enough.
  • [5.] Due to length limit, we removed the comparison between DenseNets and Partial DenseNets in the previous version (Fig. 6).

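We proposed a new convolutional network architecture, which we call the Dense Convolutional Network (DenseNet). It introduces direct connections between any two layers with the same feature-map size. We found that DenseNets scale easily to hundreds of layers without optimization difficulties. In our experiments, DenseNets tend to improve in accuracy as the number of parameters grows, without any signs of performance degradation or overfitting. Under multiple settings, they achieved state-of-the-art results on several highly competitive datasets. Moreover, DenseNets require fewer parameters and less computation to reach state-of-the-art performance. Because we adopted hyperparameter settings optimized for residual networks in our study, we believe that further gains in the accuracy of DenseNets may be obtained by more detailed tuning of hyperparameters and learning rate schedules.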

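While following a simple connectivity rule, DenseNets naturally integrate the properties of identity mappings, deep supervision, and diversified depth. They allow feature reuse throughout the network and can consequently learn more compact and, according to our experiments, more accurate models. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4, 5]. We plan to study such feature transfer with DenseNets in future work.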

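**Acknowledgements:** The authors are supported in part by the NSF grants III-1618134, III-1526012 and IIS-1149882, the Office of Naval Research Grant N00014-17-1-2175, and the Bill and Melinda Gates Foundation. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No.20150015). Zhuang Liu is supported by the National Basic Research Program of China Grants 2011CBA00300 and 2011CBA00301 and the NSFC grant 61361136003. We also thank Daniel Sedra, Geoff Pleiss and Yu Sun for many insightful discussions.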

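  • [1.] In the current version, we introduce an improvement to DenseNet (DenseNet-BC), which has an additional 1$\times$1 convolution in each composite function (cf. Section 3) and compressive transition layers. Experimental results with this architecture are added (cf. Section 4.3 and Section 4.4).
  • [2.] Results of DenseNets on the ImageNet classification task are added (cf. Section 4.4).
  • [3.] A brief discussion of the connection between DenseNets and Stochastic Depth Networks is added (cf. Section 5).
  • [4.] We replaced the heat map (Fig. 4) from the previous version, which was not clear enough.
  • [5.] Due to the length limit, we removed the comparison between DenseNets and Partial DenseNets that appeared in the previous version (Fig. 6).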
  • [1] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. Adanet: Adaptive structural learning of artificial neural networks. arXiv preprint arXiv:1607.01097, 2016.
  • [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [3] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In NIPS, 1989.
  • [4] J. R. Gardner, M. J. Kusner, Y. Li, P. Upchurch, K. Q. Weinberger, and J. E. Hopcroft. Deep manifold traversal: Changing labels with convolutional features. arXiv preprint arXiv:1511.06421, 2015.
  • [5] L. Gatys, A. Ecker, and M. Bethge. A neural algorithm of artistic style. Nature Communications, 2015.
  • [6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
  • [7] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
  • [8] S. Gross and M. Wilber. Training and investigating residual nets, 2016.
  • [9] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [13] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
  • [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [17] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
  • [18] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [20] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
  • [21] Q. Liao and T. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
  • [22] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
  • [23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [24] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [25] M. Pezeshki, L. Fan, P. Brakel, A. Courville, and Y. Bengio. Deconstructing the ladder network architecture. In ICML, 2016.
  • [26] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
  • [27] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
  • [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [29] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3288–3291. IEEE, 2012.
  • [30] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
  • [31] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • [32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [33] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.
  • [34] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [37] S. Targ, D. Almeida, and K. Lyman. Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.
  • [38] J. Wang, Z. Wei, T. Zhang, and W. Zeng. Deeply-fused nets. arXiv preprint arXiv:1605.07716, 2016.
  • [39] B. M. Wilamowski and H. Yu. Neural network learning without backpropagation. IEEE Transactions on Neural Networks, 21(11):1793–1803, 2010.
  • [40] S. Yang and D. Ramanan. Multi-scale recognition with dag-cnns. In ICCV, 2015.
  • [41] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [42] Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In ICML, 2016.