# Densely Connected Convolutional Networks

## Abstract

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with $L$ layers have $L$ connections---one between each layer and its subsequent layer---our network has $\frac{L(L+1)}{2}$ direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.

## Introduction

Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [28], and only last year Highway Networks [33] and Residual Networks (ResNets) [11] surpassed the 100-layer barrier.

As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and “wash out” by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [33] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.

In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. Figure 1 illustrates this layout schematically. Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the $\ell^{th}$ layer has $\ell$ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all $L-\ell$ subsequent layers. This introduces $\frac{L(L+1)}{2}$ connections in an $L$-layer network, instead of just $L$, as in traditional architectures. Because of its dense connectivity pattern, we refer to our approach as Dense Convolutional Network (DenseNet).
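This connectivity pattern is straightforward to express in code. Below is a minimal PyTorch sketch, our own illustration rather than the released implementation at https://github.com/liuzhuang13/DenseNet; the class and variable names are hypothetical:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer in a dense block: it consumes ALL preceding feature-maps."""

    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        # Each layer contributes only `growth_rate` new feature-maps.
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)

    def forward(self, features):
        # Concatenate along the channel axis -- not summation, as in ResNets.
        return self.conv(torch.cat(features, dim=1))

# A 5-layer dense block with growth rate k = 4 (cf. Figure 1):
k, c0 = 4, 3
layers = [DenseLayer(c0 + i * k, k) for i in range(5)]
features = [torch.randn(1, c0, 32, 32)]   # x_0: the input image
for layer in layers:
    features.append(layer(features))      # x_l = H_l([x_0, ..., x_{l-1}])
```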

Figure 1: A 5-layer dense block with a growth rate of k=4. Each layer takes all preceding feature-maps as input.

A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to re-learn redundant feature-maps. Traditional feed-forward architectures can be viewed as algorithms with a state, which is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations. Recent variations of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks [21], but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 feature-maps per layer), adding only a small set of feature-maps to the “collective knowledge” of the network and keeping the remaining feature-maps unchanged; the final classifier makes a decision based on all feature-maps in the network.

Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision [20]. This helps training of deeper network architectures. Further, we also observe that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.

We evaluate DenseNets on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Our models tend to require far fewer parameters than existing algorithms with comparable accuracy. Further, we significantly outperform the current state-of-the-art results on most of the benchmark tasks.

Figure 2: A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature map sizes via convolution and pooling.

## Related Work

The exploration of network architectures has been a part of neural network research since their initial discovery. The recent resurgence in popularity of neural networks has also revived this research domain. The increasing number of layers in modern networks amplifies the differences between architectures and motivates the exploration of different connectivity patterns and the revisiting of old research ideas. A layout similar to our proposed dense network layout has already been studied in the neural networks literature in the 1980s [3]. Their pioneering work focuses on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks to be trained with batch gradient descent were proposed [39]. Although effective on small datasets, this approach only scales to networks with a few hundred parameters. In [9, 23, 30, 40], utilizing multi-level features in CNNs through skip-connections has been found to be effective for various vision tasks. Parallel to our work, [1] derived a purely theoretical framework for networks with cross-layer connections similar to ours.

Highway Networks [33] were amongst the first architectures that provided a means to effectively train end-to-end networks with more than 100 layers. Using bypassing paths along with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypassing paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets [11], in which pure identity mappings are used as bypassing paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks, such as ImageNet and COCO object detection. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper was partly inspired by that observation. ResNets with pre-activation also facilitate the training of state-of-the-art networks with more than 1000 layers [12].


An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. The GoogLeNet [35, 36] uses an Inception module which concatenates feature-maps produced by filters of different sizes. In [37], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of ResNets can improve its performance provided the depth is sufficient [41]. FractalNets also achieve competitive results on several datasets using a wide network structure [17].

Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient. Concatenating feature-maps learned by different layers increases variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNets and ResNets. Compared to Inception networks [35, 36], which also concatenate features from different layers, DenseNets are simpler and more efficient.


There are other notable network architecture innovations which have yielded competitive results. The Network in Network (NIN) [22] structure includes micro multi-layer perceptrons in the filters of convolutional layers to extract more complicated features. In Deeply Supervised Network (DSN) [20], internal layers are directly supervised by auxiliary classifiers, which can strengthen the gradients received by earlier layers. Ladder Networks [26, 25] introduce lateral connections into autoencoders, producing impressive accuracies on semi-supervised learning tasks. In [38], Deeply-Fused Nets (DFNs) were proposed to improve information flow by combining intermediate layers of different base networks. The augmentation of networks with pathways that minimize reconstruction losses was also shown to improve image classification models [42].

Table 1: DenseNet architectures for ImageNet. The growth rate for the first 3 networks is k=32, and k=48 for DenseNet-161. Note that each “conv” layer shown in the table corresponds to the sequence BN-ReLU-Conv.

## DenseNets

Consider a single image $x_0$ that is passed through a convolutional network. The network comprises $L$ layers, each of which implements a non-linear transformation $H_\ell(\cdot)$, where $\ell$ indexes the layer. $H_\ell(\cdot)$ can be a composite function of operations such as Batch Normalization (BN) [14], rectified linear units (ReLU), Pooling, or Convolution (Conv). We denote the output of the $\ell^{th}$ layer as $x_{\ell}$. Traditional convolutional feed-forward networks connect the output of the $\ell^{th}$ layer as input to the $(\ell+1)^{th}$ layer, which gives rise to the following layer transition: $x_\ell=H_\ell(x_{\ell-1})$. ResNets [11] add a skip-connection that bypasses the non-linear transformations with an identity function:


$$x_\ell=H_\ell(x_{\ell-1})+x_{\ell-1}. \qquad (1)$$

An advantage of ResNets is that the gradient can flow directly through the identity function from later layers to the earlier layers. However, the identity function and the output of $H_\ell$ are combined by summation, which may impede the information flow in the network.

To further improve the information flow between layers we propose a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers. Figure 1 illustrates the layout of the resulting DenseNet schematically. Consequently, the $\ell^{th}$ layer receives the feature-maps of all preceding layers, $x_0,\dots,x_{\ell-1}$, as input:


$$x_{\ell} = H_\ell([x_0, x_1,\ldots, x_{\ell-1}]), \qquad (2)$$

where $[x_0, x_1,\ldots, x_{\ell-1}]$ refers to the concatenation of the feature-maps produced in layers $0,\dots,\ell-1$. Because of its dense connectivity we refer to this network architecture as Dense Convolutional Network (DenseNet). For ease of implementation, we concatenate the multiple inputs of $H_\ell(\cdot)$ in Eq. (2) into a single tensor. Motivated by [12], we define $H_\ell(\cdot)$ as a composite function of three consecutive operations: batch normalization (BN) [14], followed by a rectified linear unit (ReLU) and a $3\times 3$ convolution (Conv). The concatenation operation used in Eq. (2) is not viable when the size of feature-maps changes. However, an essential part of convolutional networks is down-sampling layers that change the size of feature-maps. To facilitate down-sampling in our architecture we divide the network into multiple densely connected dense blocks; see Figure 2.
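Under these definitions, $H_\ell$ is compact to write down. The following is a hypothetical PyTorch rendering of the composite function; `bias=False` is our assumption (a common choice when a convolution follows batch normalization), not a detail stated above:

```python
import torch.nn as nn

def make_H(in_channels: int, growth_rate: int) -> nn.Sequential:
    """Composite function H_l: BN -> ReLU -> 3x3 Conv, emitting k feature-maps."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
    )
```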

We refer to layers between blocks as transition layers, which do convolution and pooling. The transition layers used in our experiments consist of a batch normalization layer and a 1$\times$1 convolutional layer followed by a 2$\times$2 average pooling layer.

If each function $H_\ell$ produces $k$ feature-maps, it follows that the $\ell^{th}$ layer has $k_0+k\times(\ell-1)$ input feature-maps, where $k_0$ is the number of channels in the input layer. An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., $k=12$. We refer to the hyper-parameter $k$ as the growth rate of the network. We show in the Experiments section that a relatively small growth rate is sufficient to obtain state-of-the-art results on the datasets that we tested on. One explanation for this is that each layer has access to all the preceding feature-maps in its block and, therefore, to the network's collective knowledge. One can view the feature-maps as the global state of the network. Each layer adds $k$ feature-maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state. The global state, once written, can be accessed from everywhere within the network and, unlike in traditional network architectures, there is no need to replicate it from layer to layer.

Although each layer only produces $k$ output feature-maps, it typically has many more inputs. It has been noted (e.g., in ResNets [11]) that a 1$\times$1 convolution can be introduced as a bottleneck layer before each 3$\times$3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1$\times$1)-BN-ReLU-Conv(3$\times$3) version of $H_\ell$, as DenseNet-B.
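A hedged sketch of this bottleneck version of $H_\ell$, again in illustrative PyTorch; the $4k$ width of the 1$\times$1 convolution anticipates the setting described in the next paragraph:

```python
import torch.nn as nn

def make_bottleneck_H(in_channels: int, k: int) -> nn.Sequential:
    """DenseNet-B layer: BN-ReLU-Conv(1x1, 4k maps)-BN-ReLU-Conv(3x3, k maps)."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, 4 * k, kernel_size=1, bias=False),
        nn.BatchNorm2d(4 * k),
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),
    )
```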

In our experiments, we let each 1$\times$1 convolution produce $4k$ feature-maps. To further improve model compactness, we can reduce the number of feature-maps at transition layers. If a dense block contains $m$ feature-maps, we let the following transition layer generate $\lfloor\theta m\rfloor$ output feature-maps, where $0<\theta\le 1$ is referred to as the compression factor. When $\theta=1$, the number of feature-maps across transition layers remains unchanged. We refer to the DenseNet with $\theta<1$ as DenseNet-C, and we set $\theta=0.5$ in our experiments. When both the bottleneck and transition layers with $\theta<1$ are used, we refer to our model as DenseNet-BC.

On all datasets except ImageNet, the DenseNet used in our experiments has three dense blocks that each have an equal number of layers. Before entering the first dense block, a convolution with 16 (or twice the growth rate for DenseNet-BC) output channels is performed on the input images. For convolutional layers with kernel size 3$\times$3, each side of the inputs is zero-padded by one pixel to keep the feature-map size fixed. We use 1$\times$1 convolution followed by 2$\times$2 average pooling as transition layers between two contiguous dense blocks. At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached. The feature-map sizes in the three dense blocks are 32$\times$32, 16$\times$16, and 8$\times$8, respectively. We experiment with the basic structure with configurations $\{L=40, k=12\}$, $\{L=100, k=12\}$ and $\{L=100, k=24\}$.
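The compressed transition layer admits an equally short sketch (BN, 1$\times$1 convolution, and 2$\times$2 average pooling as described above; the function name is illustrative):

```python
import torch.nn as nn

def make_transition(in_channels: int, theta: float = 0.5) -> nn.Sequential:
    """Transition layer with compression: BN -> 1x1 Conv (floor(theta*m) maps)
    -> 2x2 average pooling, which halves the spatial resolution."""
    out_channels = int(theta * in_channels)  # floor(theta * m) for theta <= 1
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

# Example: a block ending with m = 88 feature-maps and theta = 0.5 yields a
# transition emitting 44 feature-maps (DenseNet-C / DenseNet-BC); theta = 1
# leaves the number of feature-maps unchanged, as in the basic DenseNet.
```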

For DenseNet-BC, the networks with configurations $\{L=100, k=12\}$, $\{L=250, k=24\}$ and $\{L=190, k=40\}$ are evaluated. In our experiments on ImageNet, we use a DenseNet-BC structure with 4 dense blocks on 224$\times$224 input images. The initial convolution layer comprises $2k$ convolutions of size 7$\times$7 with stride 2; the number of feature-maps in all other layers also follows from setting $k$. The exact network configurations we used on ImageNet are shown in Table 1. In short, the dense connectivity pattern is designed to (a code sketch of the full network follows this list):

- avoid information loss during forward propagation;
- ensure that earlier layers receive effective gradients;
- make the inputs of convolutional layers more diverse.
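To make the architecture description concrete, the sketch below assembles the pieces into the basic CIFAR network, without bottleneck or compression. It is our own unofficial rendering under the configuration stated above; all names are illustrative:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels: int, k: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * k),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * k, k, kernel_size=3,
                          padding=1, bias=False),
            )
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # append k new feature-maps
        return x

class CifarDenseNet(nn.Module):
    """Basic DenseNet for CIFAR: 3 dense blocks at 32x32, 16x16 and 8x8."""

    def __init__(self, L: int = 40, k: int = 12, num_classes: int = 10):
        super().__init__()
        n = (L - 4) // 3                      # layers per dense block
        c = 16                                # channels after the initial conv
        stages = [nn.Conv2d(3, c, kernel_size=3, padding=1, bias=False)]
        for block_idx in range(3):
            stages.append(DenseBlock(c, k, n))
            c += n * k
            if block_idx < 2:                 # transition (theta = 1 here)
                stages.append(nn.Sequential(
                    nn.BatchNorm2d(c),
                    nn.Conv2d(c, c, kernel_size=1, bias=False),
                    nn.AvgPool2d(2),
                ))
        self.features = nn.Sequential(*stages)
        self.classifier = nn.Sequential(
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c, num_classes),        # softmax is applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CifarDenseNet(L=40, k=12)             # the {L=40, k=12} configuration
logits = model(torch.randn(2, 3, 32, 32))     # -> shape (2, 10)
```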

## Experiments

We empirically demonstrate DenseNet's effectiveness on several benchmark datasets and compare with state-of-the-art architectures, especially with ResNet and its variants.

| Method | Depth | Params | C10 | C10+ | C100 | C100+ | SVHN |
|---|---|---|---|---|---|---|---|
| Network in Network [22] | - | - | 10.41 | 8.81 | 35.68 | - | 2.35 |
| All-CNN [31] | - | - | 9.08 | 7.25 | - | 33.71 | - |
| Deeply Supervised Net [20] | - | - | 9.69 | 7.97 | - | 34.57 | 1.92 |
| Highway Network [33] | - | - | - | 7.72 | - | 32.39 | - |
| FractalNet [17] | 21 | 38.6M | 10.18 | 5.22 | 35.34 | 23.30 | 2.01 |
| FractalNet with Dropout/Drop-path [17] | 21 | 38.6M | 7.33 | 4.60 | 28.20 | 23.73 | 1.87 |
| ResNet [11] | 110 | 1.7M | - | 6.61 | - | - | - |
| ResNet (reported by [13]) | 110 | 1.7M | 13.63 | 6.41 | 44.74 | 27.22 | 2.01 |
| ResNet with Stochastic Depth [13] | 110 | 1.7M | 11.66 | 5.23 | 37.80 | 24.58 | 1.75 |
| ResNet with Stochastic Depth [13] | 1202 | 10.2M | - | 4.91 | - | - | - |
| Wide ResNet [41] | 16 | 11.0M | - | 4.81 | - | 22.07 | - |
| Wide ResNet [41] | 28 | 36.5M | - | 4.17 | - | 20.50 | - |
| Wide ResNet with Dropout [41] | 16 | 2.7M | - | - | - | - | 1.64 |
| ResNet (pre-activation) [12] | 164 | 1.7M | 11.26∗ | 5.46 | 35.58∗ | 24.33 | - |
| ResNet (pre-activation) [12] | 1001 | 10.2M | 10.56∗ | 4.62 | 33.47∗ | 22.71 | - |
| DenseNet (k=12) | 40 | 1.0M | **7.00** | 5.24 | **27.55** | 24.42 | 1.79 |
| DenseNet (k=12) | 100 | 7.0M | **5.77** | **4.10** | **23.79** | **20.20** | 1.67 |
| DenseNet (k=24) | 100 | 27.2M | **5.83** | **3.74** | **23.42** | **19.25** | **1.59** |
| DenseNet-BC (k=12) | 100 | 0.8M | **5.92** | 4.51 | **24.15** | 22.27 | 1.76 |
| DenseNet-BC (k=24) | 250 | 15.3M | **5.19** | **3.62** | **19.64** | **17.60** | 1.74 |
| DenseNet-BC (k=40) | 190 | 25.6M | - | **3.46** | - | **17.18** | - |

Table 2: Error rates (%) on CIFAR and SVHN datasets. L denotes the network depth and k its growth rate. Results that surpass all competing methods are bold and the overall best results are blue. “+” indicates standard data augmentation (translation and/or mirroring). ∗ indicates results run by ourselves. All the results of DenseNets without data augmentation (C10, C100, SVHN) are obtained using Dropout. DenseNets achieve lower error rates while using fewer parameters than ResNet. Without data augmentation, DenseNet performs better by a large margin.

### Datasets

**CIFAR.** The two CIFAR datasets [15] consist of colored natural images with 32$\times$32 pixels. CIFAR-10 (C10) consists of images drawn from 10 classes and CIFAR-100 (C100) from 100 classes. The training and test sets contain 50,000 and 10,000 images respectively, and we hold out 5,000 training images as a validation set. We adopt a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets. We denote this data augmentation scheme by a “+” mark at the end of the dataset name (e.g., C10+). For preprocessing, we normalize the data using the channel means and standard deviations. For the final run we use all 50,000 training images and report the final test error at the end of training.

**SVHN.** The Street View House Numbers (SVHN) dataset [24] contains 32$\times$32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images for additional training. Following common practice, we use all the training data without any data augmentation, and a validation set with 6,000 images is split from the training set. We select the model with the lowest validation error during training and report the test error. We divide the pixel values by 255 so they are in the $[0, 1]$ range.



**ImageNet.** The ILSVRC 2012 classification dataset contains 1,000 classes, with 1.2 million images for training and 50,000 for validation. We adopt the standard data augmentation scheme for the training images, and apply a single-crop and a 10-crop with size 224$\times$224 at test time. Following [11, 12, 13], we report classification errors on the validation set.

### Training

All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train using batch size 64 for 300 and 40 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train models for 90 epochs with a batch size of 256. The learning rate is set to 0.1 initially, and is divided by 10 at epochs 30 and 60. Note that a naive implementation of DenseNet may contain memory inefficiencies; to reduce the memory consumption on GPUs, please refer to our technical report on the memory-efficient implementation of DenseNets. Following [8], we use a weight decay of $10^{-4}$ and a Nesterov momentum [34] of 0.9 without dampening. We adopt the weight initialization introduced by [10]. For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2. The test errors were only evaluated once for each task and model setting.
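In PyTorch terms, this optimization setup corresponds roughly to the following sketch (hypothetical: `model` stands in for any of the networks above, and `epochs` for the per-dataset budget):

```python
import torch

model = CifarDenseNet(L=40, k=12)    # from the architecture sketch above
epochs = 300                         # CIFAR budget (40 for SVHN, 90 for ImageNet)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                          # initial learning rate
    momentum=0.9, nesterov=True,     # Nesterov momentum without dampening
    weight_decay=1e-4,
)
# Divide the learning rate by 10 at 50% and 75% of the total epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1
)
```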

### Classification Results on CIFAR and SVHN

We train DenseNets with different depths, $L$, and growth rates, $k$. The main results on CIFAR and SVHN are shown in Table 2. To highlight general trends, we mark all results that outperform the existing state-of-the-art in boldface and the overall best result in blue.

**Accuracy.** The most noticeable trend originates from the bottom row of Table 2, which shows that DenseNet-BC with $L=190$ and $k=40$ outperforms the existing state-of-the-art consistently on all the CIFAR datasets. Its error rates of 3.46% on C10+ and 17.18% on C100+ are significantly lower than the error rates achieved by the wide ResNet architecture. Our best results on C10 and C100 (without data augmentation) are even more encouraging: both are close to 30% lower than FractalNet with drop-path regularization. On SVHN, with dropout, the DenseNet with $L=100$ and $k=24$ also surpasses the current best result achieved by wide ResNet. However, the 250-layer DenseNet-BC does not further improve the performance over its shorter counterpart. This may be explained by the fact that SVHN is a relatively easy task, and extremely deep models may overfit to the training set.

**Capacity.** Without compression or bottleneck layers, there is a general trend that DenseNets perform better as $L$ and $k$ increase. We attribute this primarily to the corresponding growth in model capacity. This is best demonstrated by the columns for C10+ and C100+. On C10+, the error drops from 5.24% to 4.10% and finally to 3.74% as the number of parameters increases from 1.0M, over 7.0M, to 27.2M. On C100+, we observe a similar trend. This suggests that DenseNets can utilize the increased representational power of bigger and deeper models. It also indicates that they do not suffer from overfitting or the optimization difficulties of residual networks.

**Parameter efficiency.** The results in Table 2 indicate that DenseNets utilize parameters more efficiently than alternative architectures (in particular, ResNets). The DenseNet-BC with bottleneck structure and dimension reduction at transition layers is particularly parameter-efficient.

For example, our 250-layer model only has 15.3M parameters, but it consistently outperforms other models such as FractalNet and Wide ResNets that have more than 30M parameters. We also highlight that DenseNet-BC with $L=100$ and $k=12$ achieves comparable performance (e.g., 4.51% vs 4.62% error on C10+, 22.27% vs 22.71% error on C100+) to the 1001-layer pre-activation ResNet, using 90% fewer parameters. Figure 4 (right) shows the training loss and test errors of these two networks on C10+. The 1001-layer deep ResNet converges to a lower training loss value but a similar test error. We analyze this effect in more detail below.

**Overfitting.** One positive side-effect of the more efficient use of parameters is a tendency of DenseNets to be less prone to overfitting. We observe that on the datasets without data augmentation, the improvements of DenseNet architectures over prior work are particularly pronounced. On C10, the improvement denotes a 29% relative reduction in error from 7.33% to 5.19%. On C100, the reduction is about 30%, from 28.20% to 19.64%. In our experiments, we observed potential overfitting in a single setting: on C10, a 4$\times$ growth of parameters produced by increasing $k=12$ to $k=24$ leads to a modest increase in error from 5.77% to 5.83%. The DenseNet-BC bottleneck and compression layers appear to be an effective way to counter this trend.

• [1] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. Adanet: Adaptive structural learning of artificial neural networks. arXiv preprint arXiv:1607.01097, 2016.
• [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
• [3] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In NIPS, 1989.
• [4] J. R. Gardner, M. J. Kusner, Y. Li, P. Upchurch, K. Q. Weinberger, and J. E. Hopcroft. Deep manifold traversal: Changing labels with convolutional features. arXiv preprint arXiv:1511.06421, 2015.
• [5] L. Gatys, A. Ecker, and M. Bethge. A neural algorithm of artistic style. Nature Communications, 2015.
• [6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
• [7] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
• [8] S. Gross and M. Wilber. Training and investigating residual nets, 2016.
• [9] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
• [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
• [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• [12] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
• [13] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
• [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
• [15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
• [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
• [17] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
• [18] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
• [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• [20] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
• [21] Q. Liao and T. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
• [22] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
• [23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• [24] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
• [25] M. Pezeshki, L. Fan, P. Brakel, A. Courville, and Y. Bengio. Deconstructing the ladder network architecture. In ICML, 2016.
• [26] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
• [27] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
• [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
• [29] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3288–3291. IEEE, 2012.
• [30] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
• [31] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
• [32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
• [33] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.
• [34] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
• [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
• [36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
• [37] S. Targ, D. Almeida, and K. Lyman. Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.
• [38] J. Wang, Z. Wei, T. Zhang, and W. Zeng. Deeply-fused nets. arXiv preprint arXiv:1605.07716, 2016.
• [39] B. M. Wilamowski and H. Yu. Neural network learning without backpropagation. IEEE Transactions on Neural Networks, 21(11):1793–1803, 2010.
• [40] S. Yang and D. Ramanan. Multi-scale recognition with dag-cnns. In ICCV, 2015.
• [41] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
• [42] Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In ICML, 2016.