# NASNet Search Space: Learning Transferable Architectures for Scalable Image Recognition

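Neural Architecture Search with Reinforcement Learning (NAS), published at ICLR 2017, uses reinforcement learning to search a search space for the best network structure. Its experiments were run on CIFAR-10, where this kind of search finishes within an acceptable amount of time, but it does not work well when applied to ImageNet. The main contribution of this paper is therefore the design of a new search space, chosen so that the best architecture found on CIFAR-10 transfers easily to datasets with larger, higher-resolution images such as ImageNet. The paper can be seen as an upgrade of the ICLR 2017 work, with a search that is roughly 7 times faster. The search space is named the NASNet search space, after NASNet, the best architecture found in it.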

## Abstract

Developing state-of-the-art image classification models often requires significant architecture engineering and tuning. In this paper, we attempt to reduce the amount of architecture engineering by using Neural Architecture Search to learn an architectural building block on a small dataset that can be transferred to a large dataset. This approach is similar to learning the structure of a recurrent cell within a recurrent network. In our experiments, we search for the best convolutional cell on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more of this cell. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5 on ImageNet, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. This cell can also be scaled down two orders of magnitude: a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than the equivalently-sized, state-of-the-art models for mobile platforms.

## Introduction

ImageNet classification (Deng et al., 2009) is a historically important benchmark in computer vision. The seminal work of Krizhevsky et al. (2012) on using convolutional architectures (Fukushima, 1980; LeCun et al., 1998) for ImageNet classification represents one of the most important breakthroughs in deep learning. Successive advancements on this benchmark based on convolutional neural networks (CNNs) have achieved impressive results through significant architecture engineering (Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Szegedy et al., 2016b, a; Xie et al., 2016).

In this paper, we consider learning the convolutional architectures directly from data with application to ImageNet classification. We focus on ImageNet classification because the features derived from this network are of great importance in computer vision. For example, features from networks that perform well on ImageNet classification provide state-of-the-art performance when transferred to other computer vision tasks where labeled data is limited (Donahue et al., 2014).

Our approach derives from the recently proposed Neural Architecture Search (NAS) framework (Zoph and Le, 2017), which uses a policy gradient algorithm to optimize architecture configurations. Running NAS directly on the ImageNet dataset is computationally expensive given the size of the dataset. We therefore use NAS to search for a good architecture on the far smaller CIFAR-10 dataset, and transfer the architecture to ImageNet. We achieve this transferability by designing the search space so that the complexity of the architecture is independent of the depth of the network and the size of input images. More concretely, all convolutional networks in our search space are composed of convolutional cells with identical structures but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structures. Searching for convolutional cells in this manner is much faster and the architecture itself is more likely to generalize to other problems. In particular, this approach significantly accelerates the search for the best architectures using CIFAR-10 (e.g., 4 weeks to 4 days) and learns architectures that successfully transfer to ImageNet.

Our primary result is that the best architecture found on CIFAR-10 achieves state-of-the-art accuracy on ImageNet classification without much modification. On ImageNet, an architecture constructed from the best cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. On CIFAR-10 itself, the architecture achieves 96.59% accuracy, while having fewer parameters than models with comparable performance. A small version of the state-of-the-art Shake-Shake model (Gastaldi, 2017) with 2.9M parameters achieves 96.45% accuracy, while our 3.3M parameters model achieves a 96.59% accuracy. Not only does our model have better accuracy, it also needs only 600 epochs to train, which is one third of the number of epochs for the Shake-Shake model.

Finally, by simply varying the number of the convolutional cells and number of filters in the convolutional cells, we can create convolutional architectures with different computational demands. In particular, we can generate a family of models that achieve accuracies superior to all human-invented models at equivalent or smaller computational budgets (Szegedy et al., 2016b; Ioffe and Szegedy, 2015). Notably, the smallest version of the learned model achieves 74.0% top-1 accuracy on ImageNet, which is 3.1% better than previously engineered architectures targeted towards mobile and embedded vision tasks (Howard et al., 2017; Zhang et al., 2017).

### 2.1 NEURAL ARCHITECTURE SEARCH FOR LARGE SCALE IMAGE CLASSIFICATION

Our work extends the Neural Architecture Search (NAS) framework proposed by Zoph and Le (2017). To briefly summarize the training procedure of NAS, a controller recurrent neural network (RNN) samples child networks with different architectures. The child networks are trained to convergence to obtain some accuracy on a held-out validation set. The resulting accuracies are used to update the controller so that the controller will generate better architectures over time. The controller weights are updated using a policy gradient method (Figure 1).
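The update rule described above can be illustrated with a minimal sketch. It assumes a plain REINFORCE-style update on a single categorical choice, which is a simplification: the real controller is an RNN with many softmax outputs. The names `reinforce_step`, `run_controller`, and the `TOY_ACCURACY` table are hypothetical stand-ins, with the table replacing the expensive step of training a child network and measuring its validation accuracy.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One policy-gradient update: scale grad of log p(action) by the reward."""
    probs = softmax(logits)
    # d log p(action) / d logit_i = 1[i == action] - p_i
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical rewards: stands in for "train child network, read accuracy R".
TOY_ACCURACY = [0.2, 0.9, 0.5]


def run_controller(steps=500, seed=0):
    """Sample choices, observe toy rewards, and update the policy logits."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        action = rng.choices(range(3), weights=softmax(logits))[0]
        logits = reinforce_step(logits, action, TOY_ACCURACY[action])
    return softmax(logits)
```

Calling `run_controller()` returns the final choice probabilities. Because sampling is stochastic and no baseline is subtracted from the reward, the outcome of a single run should be read as illustrative only.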

Figure 1: Overview of Neural Architecture Search (Zoph and Le, 2017). A controller RNN predicts architecture A from search space S with probability p. A child network with architecture A is trained to convergence achieving accuracy R. Scale the gradients of p by R to update the RNN controller.

A key element of NAS is to design the search space S to generalize across problems of varying complexity and spatial scales. We observed that applying NAS directly on the ImageNet dataset would be very expensive and require months to complete an experiment. However, if the search space is properly constructed, architectural elements can transfer across datasets (Zoph and Le, 2017).

The focus of this work is to design a search space, such that the best architecture found on the CIFAR-10 dataset would scale to larger, higher-resolution image datasets across a range of computational settings. One inspiration for this search space is the recognition that architecture engineering with CNNs often identifies repeated motifs consisting of combinations of convolutional filter banks, nonlinearities and a prudent selection of connections to achieve state-of-the-art results (Szegedy et al., 2015; He et al., 2016; Szegedy et al., 2016b, a). These observations suggest that it may be possible for the controller RNN to predict a generic convolutional cell expressed in terms of these motifs. This cell can then be stacked in series to handle inputs of arbitrary spatial dimensions and filter depth.

To construct a complete model for image classification, we take an architecture for a convolutional cell and simply repeat it many times. Each convolutional cell would have the same architecture, but have different weights. To easily build scalable architectures for images of any size, we need two types of convolutional cells to serve two main functions when taking in a feature map as input: (1) convolutional cells that return a feature map of the same dimension, and (2) convolutional cells that return a feature map where the feature map height and width are reduced by a factor of two. We name the first type and second type of convolutional cells Normal Cell and Reduction Cell respectively. For the Reduction Cell, to reduce the height and width by a factor of two, we make the initial operation applied to the cell's inputs have a stride of two. All of the operations that we consider for building our convolutional cells have an option of striding.

Figure 2 shows our placement of Normal and Reduction Cells for CIFAR and ImageNet. Note that on ImageNet we have more Reduction Cells, since the incoming image size is 299x299 compared to 32x32 for CIFAR. The Reduction and Normal Cell could be the same architecture, but we empirically found it was beneficial to learn two separate architectures. We employ a common heuristic to double the number of filters in the output whenever the spatial activation size is reduced in order to maintain roughly constant hidden state dimension (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014). Importantly, we consider the number of motif repetitions N and the number of initial convolutional filters as free parameters that we tailor to the scale of an image classification problem.

Figure 2: Scalable architecture for image classification consists of two repeated motifs termed Normal Cell and Reduction Cell. This diagram highlights the model architecture for CIFAR-10 and ImageNet. The number of times the Normal Cells are stacked between Reduction Cells, N, can vary in our experiments.
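To make the stacking rule concrete, the sketch below traces feature-map shapes through a stack of cells. It is an illustration under stated assumptions, not the exact layout of Figure 2: the arrangement produced by `build_layout` (N Normal Cells between consecutive Reduction Cells), the initial filter count of 32, and the rounding-up when halving odd sizes are all assumptions, and the function names are hypothetical.

```python
def build_layout(n_repeats, n_reductions=2):
    """Assumed layout: N Normal Cells, then a Reduction Cell, repeated."""
    layout = []
    for _ in range(n_reductions):
        layout += ["normal"] * n_repeats + ["reduction"]
    layout += ["normal"] * n_repeats
    return layout


def trace_shapes(height, width, filters, layout):
    """Return (cell type, height, width, filters) after each cell."""
    shapes = []
    for cell in layout:
        if cell == "reduction":
            # Stride-2 cell: halve the spatial size (rounding up is an
            # assumption) and double the filters, per the heuristic above.
            height = (height + 1) // 2
            width = (width + 1) // 2
            filters *= 2
        shapes.append((cell, height, width, filters))
    return shapes


if __name__ == "__main__":
    for row in trace_shapes(32, 32, 32, build_layout(n_repeats=2)):
        print(row)
```

Changing `n_repeats` or the initial `filters` value scales the model without touching the cell definition, which is the point of treating both as free parameters.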

### 2.2 SEARCH SPACE FOR CONVOLUTIONAL CELLS

Our search space differs from (Zoph and Le, 2017) where the controller needs to predict the entire architecture of the neural networks. In our method, the controller needs to predict the structures of the two convolutional cells (Normal Cell and Reduction Cell), which can be then stacked many times to create the eventual architecture shown in Figure 2-Left. The convolutional cell is inspired by the concept of recurrent cells, where the structure of the cells is independent of the number of time steps in the recurrent network. This is an effective way to decouple the complexity of the cells from the depth of the neural network so that the controller RNN only needs to focus on predicting the structure of the cell.

Each cell receives as input two initial hidden states h_i and h_(i−1) which are the outputs of two cells in previous two lower layers or the input image. The job of the controller RNN is to recursively predict the rest of the structure of the convolutional cell (Figure 3). The predictions of the controller for each cell are grouped into B blocks, where each block has 5 prediction steps made by 5 distinct softmax classifiers corresponding to discrete choices of the elements of a block:

1. Select a hidden state from h_i, h_(i−1) or from the set of hidden states created in previous blocks.
2. Select a second hidden state from the same options as in Step 1.
3. Select an operation to apply to the hidden state selected in Step 1.
4. Select an operation to apply to the hidden state selected in Step 2.
5. Select a method to combine the outputs of Step 3 and 4 to create a new hidden state.

Figure 3: Controller model architecture for recursively constructing one block of a convolutional cell. Each block requires selecting 5 discrete parameters, each of which corresponds to the output of a softmax layer. Example constructed block shown on right. A convolutional cell contains B blocks, hence the controller contains 5B softmax layers for predicting the architecture of a convolutional cell. In our experiments, the number of blocks B is 5.

The algorithm appends the newly-created hidden state to the set of existing hidden states as a potential input in subsequent blocks. The controller RNN repeats the above 5 prediction steps B times corresponding to the B blocks in a convolutional cell. In our experiments, selecting B=5 provides good results, although we have not exhaustively searched this space due to computational limitations.

In steps 3 and 4, the controller RNN selects an operation to apply to the hidden states. We collected the following set of operations based on their prevalence in the CNN literature:

- identity
- 1x3 then 3x1 convolution
- 1x7 then 7x1 convolution
- 3x3 dilated convolution
- 3x3 average pooling
- 3x3 max pooling
- 5x5 max pooling
- 7x7 max pooling
- 1x1 convolution
- 3x3 convolution
- 3x3 separable convolution
- 5x5 separable convolution
- 7x7 separable convolution

In our experiments, we apply each separable operation twice during the execution of the child model, once that operation is selected by the controller.
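The five prediction steps and the growing set of hidden states can be sketched as follows. This is a simplified illustration: it assumes uniform random sampling in place of the controller RNN's softmax outputs, and the function name `sample_cell` and the tuple layout are hypothetical. The operation names are taken from the list above.

```python
import random

OPS = [
    "identity", "1x3 then 3x1 convolution", "1x7 then 7x1 convolution",
    "3x3 dilated convolution", "3x3 average pooling", "3x3 max pooling",
    "5x5 max pooling", "7x7 max pooling", "1x1 convolution",
    "3x3 convolution", "3x3 separable convolution",
    "5x5 separable convolution", "7x7 separable convolution",
]
COMBINERS = ["add", "concat"]


def sample_cell(num_blocks=5, seed=None):
    """Sample one cell as a list of (in1, in2, op1, op2, combiner) blocks."""
    rng = random.Random(seed)
    hidden = ["h_i", "h_(i-1)"]  # the two cell inputs
    blocks = []
    for b in range(num_blocks):
        in1 = rng.randrange(len(hidden))  # step 1: first hidden state
        in2 = rng.randrange(len(hidden))  # step 2: second hidden state
        op1 = rng.choice(OPS)             # step 3: operation for in1
        op2 = rng.choice(OPS)             # step 4: operation for in2
        comb = rng.choice(COMBINERS)      # step 5: how to combine
        blocks.append((in1, in2, op1, op2, comb))
        # The new hidden state becomes a candidate input for later blocks.
        hidden.append(f"block_{b}")
    return blocks


if __name__ == "__main__":
    for block in sample_cell(seed=0):
        print(block)
```

Indices 0 and 1 refer to the two cell inputs, and index 2 + b refers to the output of block b. The sketch only records the choices; it does not model applying a selected separable operation twice.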

In step 5 the controller RNN selects a method to combine the two hidden states, either (1) elementwise addition between two hidden states or (2) concatenation between two hidden states along the filter dimension. Finally, all of the unused hidden states generated in the convolutional cell are concatenated together in depth to provide the final cell output.
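As a small illustration of the final step, the sketch below finds which generated hidden states are never consumed by a later block. It assumes a hypothetical encoding in which each block is a tuple `(in1, in2, op1, op2, combiner)`, indices 0 and 1 are the two cell inputs, and block b produces hidden state 2 + b. The example cell is made up for illustration.

```python
def unused_hidden_states(blocks, num_inputs=2):
    """Return indices of block outputs that no block selects as an input."""
    used = set()
    for in1, in2, *_ in blocks:
        used.add(in1)
        used.add(in2)
    return [
        num_inputs + b
        for b in range(len(blocks))
        if num_inputs + b not in used
    ]


# Hypothetical 3-block cell: block 0 feeds block 1; blocks 1 and 2 are leaves.
EXAMPLE_CELL = [
    (0, 1, "3x3 separable convolution", "identity", "add"),
    (2, 0, "3x3 average pooling", "3x3 separable convolution", "add"),
    (1, 1, "5x5 separable convolution", "3x3 max pooling", "add"),
]

if __name__ == "__main__":
    print(unused_hidden_states(EXAMPLE_CELL))
```

The hidden states returned here are the ones that would be concatenated in depth to form the cell output.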

To have the controller RNN predict both the Normal and Reduction cell we simply make the controller have 2×5B predictions in total, where the first 5B predictions are for the Normal Cell and the second 5B predictions are for the Reduction Cell.
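A rough size estimate of this search space follows from the choices listed above. The sketch assumes that block b can pick each of its two inputs from 2 + b hidden states, that both operations come from the 13 listed, and that there are 2 combination methods. It counts raw decision sequences, so equivalent cells (for example, the same block with its two branches swapped) are counted more than once. The function names are hypothetical.

```python
def num_predictions(num_blocks=5):
    """Softmax predictions for both cells: 2 cells x 5 steps x B blocks."""
    return 2 * 5 * num_blocks


def count_cell_configs(num_blocks=5, num_ops=13, num_combiners=2,
                       num_inputs=2):
    """Count raw decision sequences for one cell (no symmetry removal)."""
    total = 1
    for b in range(num_blocks):
        n_hidden = num_inputs + b  # inputs plus outputs of earlier blocks
        total *= n_hidden ** 2 * num_ops ** 2 * num_combiners
    return total


if __name__ == "__main__":
    per_cell = count_cell_configs()
    print("predictions per search step:", num_predictions())
    print("decision sequences per cell:", per_cell)
    print("for Normal + Reduction cells:", per_cell ** 2)
```

With B = 5 the per-cell count is on the order of 10^18 under these assumptions, which is why the space is searched with a learned controller rather than enumerated.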

## 3 EXPERIMENTS AND RESULTS

In this section, we describe our experiments with Neural Architecture Search using the search space described above to learn a convolutional cell. In summary, all architecture searches are performed using the CIFAR-10 classification task (Krizhevsky, 2009). The controller RNN was trained using Proximal Policy Optimization (PPO) (Schulman et al., 2015) by employing a global workqueue system for generating a pool of child networks controlled by the RNN. In our experiments, the pool of workers in the workqueue consisted of 450 GPUs. Please see Appendix A for complete details of the architecture learning and controller system.

The result of this search process yields several candidate convolutional cells. Figure 4 shows a diagram of the top performing cell. Note the prevalence of separable convolutions and the number of branches as compared with competing architectures (Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Szegedy et al., 2016b, a). Subsequent experiments focus on this convolutional cell architecture, although we examine the efficacy of other, top-ranked convolutional cells in ImageNet experiments (described in Appendix B) and report their results as well. We call the three networks constructed from the best three cells NASNet-A, NASNet-B and NASNet-C.

Figure 4: Architecture of the best convolutional cells (NASNet-A) with B=5 blocks identified with CIFAR-10. The input (white) is the hidden state from previous activations (or input image). The output (pink) is the result of a concatenation operation across all resulting branches. Each convolutional cell is the result of B blocks. A single block corresponds to two primitive operations (yellow) and a combination operation (green). Note that colors correspond to operations in Figure 3.

We demonstrate the utility of the convolutional cell by employing this learned architecture on CIFAR-10 and a family of ImageNet classification tasks. The latter family of tasks is explored across a few orders of magnitude in computational budget. After having learned the convolutional cell, several hyper-parameters may be explored to build a final network for a given task: (1) the number of cell repeats N and (2) the number of filters in the initial convolutional cell. We employ a common heuristic to double the number of filters whenever the stride is 2.

### 3.1 GENERAL TRAINING STRATEGIES

We found that adding Batch Normalization and/or a ReLU between the depthwise and pointwise operations in the separable convolution operations did not help performance. L1 regularization was tried with the NASNet models, but this was found to hurt performance. We also tried ELU (Clevert et al., 2015) instead of ReLUs and found that performance was about the same. Dropout (Srivastava et al., 2014) was also tried on the convolutional filters, but this was found to degrade performance.

#### Operation Ordering:

All convolution operations that could be predicted are applied within the convolutional cells in the following order: ReLU, convolution operation, Batch Normalization. We found this order to improve performance over a different ordering: convolution operation, Batch Normalization and ReLU. This result is in line with findings from other papers where using the pre-ReLU activation works better. To ensure that shapes always match in the convolutional cells, 1x1 convolutions are inserted as necessary.

#### Cell Path Dropout:

When training our NASNet models, we found that stochastically dropping out each path (edge with a yellow box) in the cell with some fixed probability is an extremely good regularizer. This is similar to Huang et al. (2016c) and Zhang et al. (2016), where they drop out full parts of their model during training and then at test time scale the path by the probability of keeping that path during training. Interestingly, we found that linearly increasing the probability of dropping out a path over the course of training significantly improves the final performance for both CIFAR and ImageNet experiments.
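A minimal sketch of path dropout with a linearly increasing drop probability is given below. It follows the description above (drop the whole path during training, scale by the keep probability at test time), but the function names, the schedule starting from zero, and the use of a flat list of floats to stand in for a feature map are assumptions made for illustration.

```python
import random


def drop_prob_schedule(step, total_steps, final_drop_prob):
    """Linearly increase the drop probability from 0 to final_drop_prob."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_drop_prob * frac


def drop_path(values, drop_prob, training, rng=random):
    """Drop an entire path while training; scale by keep_prob at test time."""
    keep_prob = 1.0 - drop_prob
    if training:
        # Keep or zero the whole path with a single random draw.
        keep = rng.random() < keep_prob
        return list(values) if keep else [0.0] * len(values)
    # Test time: scale the path by the probability of keeping it.
    return [v * keep_prob for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        p = drop_prob_schedule(step, total_steps=100, final_drop_prob=0.4)
        print(step, p, drop_path([1.0, 2.0], p, training=True, rng=rng))
```

Which keep probability to use at test time when the training-time probability follows a schedule is not specified in the text above, so the sketch simply takes it as an argument.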

### 3.2 RESULTS ON CIFAR-10 IMAGE CLASSIFICATION

For the task of image classification with CIFAR-10, we set N=4 or 6 (Figure 2). The networks are trained on CIFAR-10 for 600 epochs. The test accuracies of the best architectures are reported in Table 1 along with other state-of-the-art models; the best architectures found by the controller RNN are better than the previous state-of-the-art Shake-Shake model when comparing with the same number of parameters. Additionally, we train for only a third as many epochs as Shake-Shake, which trains for 1800 epochs. See Appendix A for more details on CIFAR training.

| Model | Depth | # parameters | Error rate (%) |
|---|---|---|---|
| DenseNet (L=40, k=12) (Huang et al., 2016a) | 40 | 1.0M | 5.24 |
| DenseNet (L=100, k=12) (Huang et al., 2016a) | 100 | 7.0M | 4.10 |
| DenseNet (L=100, k=24) (Huang et al., 2016a) | 100 | 27.2M | 3.74 |
| DenseNet-BC (L=100, k=40) (Huang et al., 2016b) | 190 | 25.6M | 3.46 |
| Shake-Shake 26 2x32d (Gastaldi, 2017) | 26 | 2.9M | 3.55 |
| Shake-Shake 26 2x96d (Gastaldi, 2017) | 26 | 26.2M | 2.86 |
| NAS v1 no stride or pooling (Zoph and Le, 2017) | 15 | 4.2M | 5.50 |
| NAS v2 predicting strides (Zoph and Le, 2017) | 20 | 2.5M | 6.01 |
| NAS v3 max pooling (Zoph and Le, 2017) | 39 | 7.1M | 4.47 |
| NAS v3 max pooling + more filters (Zoph and Le, 2017) | 39 | 37.4M | 3.65 |
| NASNet-A (N=6) | | 3.3M | 3.41 |
| NASNet-B (N=4) | | 2.6M | 3.73 |
| NASNet-C (N=4) | | 3.1M | 3.59 |

Table 1: Performance of Neural Architecture Search and other state-of-the-art models on CIFAR-10.
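The accuracy figures quoted in the introduction can be recovered from the error rates in Table 1. The short check below uses only numbers stated in this document (error rates, parameter counts, and epoch counts); the dictionary layout and function name are hypothetical.

```python
def accuracy_from_error(error_rate_percent):
    """Convert a test error rate (%) into accuracy (%)."""
    return 100.0 - error_rate_percent


# Values copied from Table 1 and the surrounding text.
NASNET_A = {"params_m": 3.3, "error": 3.41, "epochs": 600}
SHAKE_SHAKE_SMALL = {"params_m": 2.9, "error": 3.55, "epochs": 1800}

if __name__ == "__main__":
    for name, row in (("NASNet-A", NASNET_A),
                      ("Shake-Shake 26 2x32d", SHAKE_SHAKE_SMALL)):
        acc = accuracy_from_error(row["error"])
        print(f"{name}: {acc:.2f}% accuracy, {row['params_m']}M parameters")
    ratio = NASNET_A["epochs"] / SHAKE_SHAKE_SMALL["epochs"]
    print(f"epoch ratio: {ratio:.3f}")
```

This reproduces the 96.59% and 96.45% accuracies and the one-third epoch ratio cited earlier.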

### 3.3 RESULTS ON IMAGENET IMAGE CLASSIFICATION

We performed several sets of experiments on ImageNet with the best convolutional cells learned from CIFAR-10. Results are summarized in Tables 2 and 3 and Figure 5. In the first set of experiments, we train several image classification systems operating on 299x299 or 331x331 resolution images scaled in computational demand on par with Inception-v2 (Ioffe and Szegedy, 2015), Inception-v3 (Szegedy et al., 2016b) and PolyNet (Zhang et al., 2016). We demonstrate that this family of models achieves state-of-the-art performance with fewer floating point operations and parameters than comparable architectures. Second, we demonstrate that by adjusting the scale of the model we can achieve state-of-the-art performance at smaller computational budgets, exceeding streamlined CNNs hand-designed for this operating regime (Howard et al., 2017; Zhang et al., 2017).

Note that we do not have residual connections around convolutional cells, as the models learn skip connections on their own. We empirically found that manually inserting residual connections between cells does not help performance. Our training setup on ImageNet is similar to Szegedy et al. (2016b), but please see Appendix A for details.

Figure 5: Accuracy versus computational demand (left) and number of parameters (right) across top performing CNN architectures on ImageNet 2012 ILSVRC challenge prediction task (compiled as of July 2017). Computational demand is measured in the number of floating-point multiply-add operations to process a single image. Black circles indicate previously published work and red squares highlight our proposed models. Vertical dashed line indicates 1 billion multiply-add operations. Horizontal dashed line indicates 80% precision@1 prediction accuracy.

| Model | Image size | # parameters | Mult-Adds | Top 1 Acc. (%) | Top 5 Acc. (%) |
|---|---|---|---|---|---|
| Inception V2 (Ioffe and Szegedy, 2015) | 224×224 | 11.2 M | 1.94 B | 74.8 | 92.2 |
| NASNet-A (N = 5) | 299×299 | 10.9 M | 2.35 B | 78.6 | 94.2 |
| Inception V3 (Szegedy et al., 2016b) | 299×299 | 23.8 M | 5.72 B | 78.0 | 93.9 |
| Xception (Chollet, 2016) | 299×299 | 22.8 M | 8.38 B | 79.0 | 94.5 |
| Inception ResNet V2 (Szegedy et al., 2016a) | 299×299 | 55.8 M | 13.2 B | 80.4 | 95.3 |
| NASNet-A (N = 7) | 299×299 | 22.6 M | 4.93 B | 80.8 | 95.3 |
| ResNeXt-101 (64 x 4d) (Xie et al., 2016) | 320×320 | 83.6 M | 31.5 B | 80.9 | 95.6 |
| PolyNet (Zhang et al., 2016) | 331×331 | 92 M | 34.7 B | 81.3 | 95.8 |
| DPN-131 (Chen et al., 2017) | 320×320 | 79.5 M | 32.0 B | 81.5 | 95.8 |
| NASNet-A (N = 7) | 331×331 | 84.9 M | 23.2 B | 82.3 | 96.0 |

Table 2: Performance of architecture search and other state-of-the-art models on ImageNet classification. Mult-Adds indicate the number of composite multiply-accumulate operations for a single image.

| Model | # parameters | Mult-Adds | Top 1 Acc. (%) | Top 5 Acc. (%) |
|---|---|---|---|---|
| Inception V1 (Szegedy et al., 2015) | 6.6 M | 1,448 M | 69.8 | 89.9 |
| MobileNet-224 (Howard et al., 2017) | 4.2 M | 569 M | 70.6 | 89.5 |
| ShuffleNet (2x) (Zhang et al., 2017) | ∼5 M | 524 M | 70.9 | 89.8 |
| NASNet-A (N=4) | 5.3 M | 564 M | 74.0 | 91.6 |
| NASNet-B (N=4) | 5.3 M | 488 M | 72.8 | 91.3 |
| NASNet-C (N=3) | 4.9 M | 558 M | 72.5 | 91.0 |

Table 3: Performance on ImageNet classification on a subset of models operating in a constrained computational setting, i.e., <1.5 B multiply-accumulate operations per image. All models employ 224x224 images.

Table 2 shows that the convolutional cells discovered with CIFAR-10 generalize well to ImageNet problems. In particular, each model based on the convolutional cell exceeds the predictive performance of the corresponding hand-designed model. Importantly, the largest model achieves a new state-of-the-art performance for ImageNet (82.3%) based on single, non-ensembled predictions, surpassing previous state-of-the-art by 0.8% (Chen et al., 2017). Figure 5 shows a complete summary of these results. Note the family of models based on convolutional cells provides an envelope over a broad class of human-invented architectures.
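The headline margins can be checked directly against Table 2. The sketch below assumes that the comparison point is the DPN-131 row (Chen et al., 2017), which is the work cited for the previous state of the art; the dictionary layout and function name are hypothetical, and the numbers are copied from the table.

```python
# Rows copied from Table 2 (Mult-Adds in billions, top-1 accuracy in %).
DPN_131 = {"mult_adds_b": 32.0, "top1": 81.5}
NASNET_A_331 = {"mult_adds_b": 23.2, "top1": 82.3}


def margin(ours, baseline):
    """Return (top-1 gain in points, Mult-Adds saved in billions)."""
    top1_gain = ours["top1"] - baseline["top1"]
    mult_adds_saved = baseline["mult_adds_b"] - ours["mult_adds_b"]
    return top1_gain, mult_adds_saved


if __name__ == "__main__":
    gain, saved = margin(NASNET_A_331, DPN_131)
    print(f"top-1 gain: {gain:.1f} points, Mult-Adds saved: {saved:.1f} B")
```

Under this assumption the gain is 0.8 points with about 8.8 billion fewer multiply-adds, which is roughly consistent with the "9 billion fewer" figure quoted in the abstract.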

Finally, we test how well the best convolutional cells may perform in a resource-constrained setting, e.g., mobile devices (Table 3). In these settings, the number of floating point operations is severely constrained and predictive performance must be weighed against latency requirements on a device with limited computational resources. MobileNet (Howard et al., 2017) and ShuffleNet (Zhang et al., 2017) provide state-of-the-art results, predicting 70.6% and 70.9% accuracy, respectively, on 224x224 images using ∼550M multiply-add operations. An architecture constructed from the best convolutional cells achieves superior predictive performance (74.0% accuracy), surpassing previous models but with comparable computational demand. In summary, we find that the learned convolutional cells are flexible across model scales, achieving state-of-the-art performance across almost 2 orders of magnitude in computational budget.

## 4 RELATED WORK

The proposed method is related to previous work in hyperparameter optimization (Pinto et al., 2009; Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012, 2015; Bergstra et al., 2013; Mendoza et al., 2016) – especially recent approaches in designing architectures such as Neural Fabrics (Saxena and Verbeek, 2016), DiffRNN (Miconi, 2016), MetaQNN (Baker et al., 2016) and DeepArchitect (Negrinho and Gordon, 2017). A more flexible class of methods for designing architecture is evolutionary algorithms (Wierstra et al., 2005; Floreano et al., 2008; Stanley et al., 2009; Jozefowicz et al., 2015; Real et al., 2017; Miikkulainen et al., 2017; Xie and Yuille, 2017), yet they have not had as much success at large scale. Xie and Yuille (2017) also transferred learned architectures from CIFAR-10 to ImageNet, but the performance of these models (top-1 accuracy 72.1%) is notably below previous state-of-the-art (Table 2).

The concept of having one neural network interact with a second neural network to aid the learning process, or learning to learn or meta-learning (Hochreiter et al., 2001; Schaul and Schmidhuber, 2010) has attracted much attention in recent years (Andrychowicz et al., 2016; Wang et al., 2016; Duan et al., 2016; Ha et al., 2017; Li and Malik, 2017; Ravi and Larochelle, 2017; Finn et al., 2017). Most of these approaches have not been scaled to large problems like ImageNet. A notable exception is recent work focused on learning an optimizer for ImageNet classification that achieved notable improvements (Wichrowska et al., 2017).

The design of our search space took much inspiration from LSTMs (Hochreiter and Schmidhuber, 1997), and NASCell (Zoph and Le, 2017). The modular structure of the convolutional cell is also related to previous methods on ImageNet such as VGG (Simonyan and Zisserman, 2014), Inception (Szegedy et al., 2015, 2016b, 2016a), ResNet/ResNext (He et al., 2016; Xie et al., 2016), and Xception/MobileNet (Chollet, 2016; Howard et al., 2017).

## 5 CONCLUSION

In this work, we demonstrate how to learn scalable, convolutional cells from data that transfer to multiple image classification tasks. The key insight to this approach is to design a search space that decouples the complexity of an architecture from the depth of a network. This resulting search space permits identifying good architectures on a small dataset (i.e., CIFAR-10) and transferring the learned architecture to image classifications across a range of data and computational scales.

The resulting architectures approach or exceed state-of-the-art performance on CIFAR-10 and ImageNet classification with less computational demand than human-designed architectures (Szegedy et al., 2016b; Ioffe and Szegedy, 2015; Zhang et al., 2016). The ImageNet results are particularly important because many state-of-the-art computer vision problems (e.g., object detection (Huang et al., 2016d), face detection (Schroff et al., 2015), image localization (Weyand et al., 2016)) derive image features or architectures from ImageNet classification models. Finally, we demonstrate that we can employ the resulting learned architecture to perform ImageNet classification with reduced computational budgets that outperform streamlined architectures targeted to mobile and embedded platforms (Howard et al., 2017; Zhang et al., 2017).

Our results have strong implications for transfer learning and meta-learning as this is the first work to demonstrate state-of-the-art results using meta-learning on a large scale problem. This work also highlights that learned elements of network architectures, beyond model weights, can transfer across datasets.

## ACKNOWLEDGEMENTS

We thank Jeff Dean, Yifeng Lu, Jonathan Shen, Vishy Tirumalashetty, Xiaoqiang Zheng, and the Google Brain team for the help with the project. We additionally thank Christian Sigg for performance improvements to depthwise separable convolutions.

## REFERENCES

Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2016.
Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.
Bergstra et al. (2011) James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Neural Information Processing Systems, 2011.
Bergstra et al. (2013) James Bergstra, Daniel Yamins, and David D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. International Conference on Machine Learning, 2013.
Chen et al. (2016) Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous sgd. In International Conference on Learning Representations Workshop Track, 2016.
Chen et al. (2017) Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. arXiv preprint arXiv:1707.01083, 2017.
Chollet (2016) François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, volume 32, pages 647–655, 2014.
Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
Floreano et al. (2008) Dario Floreano, Peter Dürr, and Claudio Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 2008.
Fukushima (1980) Kunihiko Fukushima. A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, page 93–202, 1980.
Gastaldi (2017) Xavier Gastaldi. Shake-shake regularization of 3-branch residual networks. In International Conference on Learning Representations Workshop Track, 2017.
Ha et al. (2017) David Ha, Andrew Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
Hochreiter et al. (2001) Sepp Hochreiter, A Younger, and Peter Conwell. Learning to learn using gradient descent. Artificial Neural Networks, pages 87–94, 2001.
Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Huang et al. (2016a) Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016a.
Huang et al. (2016b) Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016b.
Huang et al. (2016c) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016c.
Huang et al. (2016d) Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016d.
Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Jozefowicz et al. (2015) Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015.
Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
Li and Malik (2017) Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
Mendoza et al. (2016) Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards automatically-tuned neural networks. In Proceedings of the 2016 Workshop on Automatic Machine Learning, pages 58–65, 2016.
Miconi (2016) Thomas Miconi. Neural networks with differentiable structure. arXiv preprint arXiv:1606.06216, 2016.
Miikkulainen et al. (2017) Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
Negrinho and Gordon (2017) Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
Pinto et al. (2009) Nicolas Pinto, David Doukhan, James J DiCarlo, and David D Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, 2009.
Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
Real et al. (2017) Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
Saxena and Verbeek (2016) Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, 2016.
Schaul and Schmidhuber (2010) Tom Schaul and Juergen Schmidhuber. Metalearning. Scholarpedia, 2010.
Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mostofa Ali, Ryan P. Adams, et al. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Stanley et al. (2009) Kenneth O. Stanley, David B. D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 2009.
Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Szegedy et al. (2016a) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016a.
Szegedy et al. (2016b) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016b.
Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
Weyand et al. (2016) Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet-photo geolocation with convolutional neural networks. arXiv preprint arXiv:1602.05314, 2016.
Wichrowska et al. (2017) Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. arXiv preprint arXiv:1703.04813, 2017.
Wierstra et al. (2005) Daan Wierstra, Faustino J. Gomez, and Jürgen Schmidhuber. Modeling systems with internal state using evolino. In The Genetic and Evolutionary Computation Conference, 2005.
Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
Xie and Yuille (2017) Lingxi Xie and Alan Yuille. Genetic CNN. arXiv preprint arXiv:1703.01513, 2017.
Xie et al. (2016) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
Zhang et al. (2017) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
Zhang et al. (2016) Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. Polynet: A pursuit of structural diversity in very deep networks. arXiv preprint arXiv:1611.05725, 2016.
Zoph and Le (2017) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

## APPENDIX A EXPERIMENTAL DETAILS

### A.1 DATASET FOR ARCHITECTURE SEARCH

The CIFAR-10 dataset [Krizhevsky, 2009] consists of 60,000 32x32 RGB images across 10 classes (50,000 train and 10,000 test images). We partition a random subset of 5,000 images from the training set to use as a validation set for the controller RNN. All images are whitened and then undergo several data augmentation steps: we randomly crop 32x32 patches from upsampled images of size 40x40 and apply random horizontal flips. This data augmentation procedure is common among related work.
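The crop-and-flip augmentation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `upsample_nearest` and `random_crop_flip` are hypothetical, nearest-neighbour interpolation is assumed because the text does not say how images are upsampled to 40x40, and the whitening step is omitted.

```python
import random

def upsample_nearest(image, out_size):
    """Nearest-neighbour resize of a square image (list of rows) to out_size x out_size.
    The interpolation method is an assumption; the text only says 'upsampled'."""
    in_size = len(image)
    return [
        [image[r * in_size // out_size][c * in_size // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]

def random_crop_flip(image, crop_size=32, up_size=40, rng=random):
    """Upsample, take a random crop_size x crop_size patch, flip half the time."""
    big = upsample_nearest(image, up_size)
    top = rng.randint(0, up_size - crop_size)    # randint bounds are inclusive
    left = rng.randint(0, up_size - crop_size)
    patch = [row[left:left + crop_size] for row in big[top:top + crop_size]]
    if rng.random() < 0.5:                       # random horizontal flip
        patch = [row[::-1] for row in patch]
    return patch

# Toy 32x32 "image": each pixel is a (row, col, 0) tuple standing in for an RGB value.
image = [[(r, c, 0) for c in range(32)] for r in range(32)]
augmented = random_crop_flip(image, rng=random.Random(0))
```

In practice an array library would replace the nested lists; the list version is used here only to keep the sketch dependency-free.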


### A.2 CONTROLLER ARCHITECTURE

The controller RNN is a one-layer LSTM [Hochreiter and Schmidhuber, 1997] with 100 hidden units and 2×5B softmax predictions for the two convolutional cells (where B is typically 5), one prediction per architecture decision. Each of the 10B predictions of the controller RNN is associated with a probability. The joint probability of a child network is the product of all probabilities at these 10B softmaxes. This joint probability is used to compute the gradient for the controller RNN. The gradient is scaled by the validation accuracy of the child network to update the controller RNN such that the controller assigns low probabilities to bad child networks and high probabilities to good child networks.
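To make the counting and the joint probability concrete, here is a small sketch. The helper names and the numeric values (a probability of 0.5 per decision, a validation accuracy of 0.9) are hypothetical. The factor 5 is simply read off the "2×5B" expression above. The last line illustrates only the accuracy scaling described in this paragraph, not the full policy-optimization update.

```python
import math

def num_predictions(B=5, cells=2, decisions_per_block=5):
    """Number of softmax predictions: 2 cells x 5 decisions per block x B blocks."""
    return cells * decisions_per_block * B

def joint_log_prob(step_probs):
    """Log of the joint probability, i.e. the sum of per-decision log-probabilities.
    Summing logs avoids underflow when multiplying many small probabilities."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities the controller assigned to each sampled decision.
step_probs = [0.5] * num_predictions()   # 50 decisions when B = 5
log_p = joint_log_prob(step_probs)
joint_p = math.exp(log_p)                # product of all 50 probabilities

# Scaling by the child's validation accuracy (hypothetical value): the gradient
# of this quantity with respect to the controller weights is the scaled gradient.
validation_accuracy = 0.9
surrogate = validation_accuracy * log_p
```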

Unlike Zoph and Le [2017], who used the REINFORCE rule [Williams, 1992] to update the controller, we employ Trust Region Policy Optimization (TRPO) [Schulman et al., 2015] with learning rate 0.00035 because it made training more robust and increased convergence speed. To encourage exploration we also use an entropy penalty with a weight of 0.00001. We also use a baseline function, which we set to be an exponential moving average of previous rewards, with a weight of 0.95. The weights of the controller are initialized uniformly between -0.1 and 0.1.
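The moving-average baseline can be illustrated as follows. This sketch assumes that the weight 0.95 multiplies the previous average and that the baseline starts at the first observed reward; neither detail is stated in the text, and the reward values are made up.

```python
def update_baseline(baseline, reward, decay=0.95):
    """Exponential moving average of rewards (decay applied to the old average)."""
    return decay * baseline + (1.0 - decay) * reward

rewards = [0.60, 0.70, 0.80]   # hypothetical validation accuracies
baseline = rewards[0]          # assumption: initialise with the first reward
advantages = []
for r in rewards:
    advantages.append(r - baseline)   # reward relative to the running baseline
    baseline = update_baseline(baseline, r)
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is the usual reason for including one.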

### A.3 TRAINING OF THE CONTROLLER

For distributed training, we use a work queue system where all the samples generated from the controller RNN are added to a global work queue. A free "child" worker in a distributed worker pool asks the controller for new work from the global work queue. Once the training of the child network is complete, the accuracy on a held-out validation set is computed and reported to the controller RNN. In our experiments we use a child worker pool size of 450, which means there are 450 networks being trained on 450 GPUs concurrently at any time. Upon receiving enough child model training results, the controller RNN will perform a gradient update on its weights using TRPO and then sample another batch of architectures that go into the global work queue. This process continues until a predetermined number of architectures have been sampled. In our experiments, this predetermined number of architectures is 20,000, which means the search process is terminated after 20,000 child models have been trained. Additionally, we update the controller RNN with minibatches of 20 architectures. Once the search is over, the top 250 architectures are then chosen to train until convergence on CIFAR-10 to determine the very best architecture.
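The bookkeeping of this loop can be sketched with a sequential simulation. This is not the distributed system itself: the 450 asynchronous workers are collapsed into a single loop, child training is replaced by a random number, the controller update is only counted, and `run_search` is a hypothetical name.

```python
import random
from collections import deque

def run_search(total_architectures=20_000, minibatch_size=20, top_k=250, seed=0):
    rng = random.Random(seed)
    work_queue = deque()          # stands in for the global work queue
    pending, finished = [], []
    num_sampled = 0
    num_updates = 0

    def sample_batch():
        """Controller samples one minibatch of architectures into the queue."""
        nonlocal num_sampled
        for _ in range(minibatch_size):
            work_queue.append(num_sampled)   # an id stands in for a sampled cell
            num_sampled += 1

    sample_batch()
    while work_queue:
        arch_id = work_queue.popleft()       # a free child worker takes a job
        accuracy = rng.random()              # placeholder for child training
        pending.append((arch_id, accuracy))
        if len(pending) == minibatch_size:   # enough results: update controller
            num_updates += 1
            finished.extend(pending)
            pending = []
            if num_sampled < total_architectures:
                sample_batch()
    # Keep the best candidates for full training until convergence.
    top = sorted(finished, key=lambda item: item[1], reverse=True)[:top_k]
    return num_updates, num_sampled, top

num_updates, num_sampled, top = run_search()
```

With 20,000 architectures and minibatches of 20, the controller performs 1,000 updates over the whole search.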

### A.4 TRAINING OF CIFAR MODELS

All of our CIFAR models use a single-period cosine learning rate decay as in Gastaldi [2017]. All models use the momentum optimizer with momentum rate set to 0.9. All models also use L2 weight decay. Each architecture is trained for a fixed 20 epochs on CIFAR-10 during the architecture search process. Additionally, we found it beneficial to use the cosine learning rate decay during the 20 epochs the CIFAR models were trained for, as this helped to further differentiate good architectures. We also found that having the CIFAR models use a small N=2 during the architecture search process allowed models to train quite quickly, while still finding cells that work well once more of them were stacked.
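A single-period cosine decay can be written in a few lines. The base learning rate of 0.1 below is a placeholder, since the text does not give the value, and the schedule is evaluated per epoch here only for brevity.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Single-period cosine decay: base_lr at step 0, falling to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start of each of the 20 search epochs (base_lr is hypothetical).
schedule = [cosine_lr(epoch, 20, base_lr=0.1) for epoch in range(21)]
```

The rate changes slowly near the start and end of training and fastest in the middle, with no restarts.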

### A.5 TRAINING OF IMAGENET MODELS

We use ImageNet 2012 ILSVRC challenge data for large scale image classification. The dataset consists of ∼1.2M images labeled across 1000 classes [Deng et al., 2009]. Overall our training and testing procedures are almost identical to Szegedy et al. [2016b]. ImageNet models are trained and evaluated on 299x299 or 331x331 images using the same data augmentation procedures as described previously [Szegedy et al., 2016b]. We use distributed synchronous SGD to train the ImageNet model with 50 workers (and 3 backup workers), each with a Tesla K40 GPU [Chen et al., 2016]. We use RMSProp with a decay of 0.9 and epsilon of 1.0. Evaluations are calculated using a running average of parameters over time with a decay rate of 0.9999. We use label smoothing with a value of 0.1 for all ImageNet models as done in Szegedy et al. [2016b]. Additionally, all models use an auxiliary classifier located at 2/3 of the way up the network. The loss of the auxiliary classifier is weighted by 0.4 as done in Szegedy et al. [2016b]. We empirically found our network to be insensitive to the number of parameters associated with this auxiliary classifier along with the weight associated with the loss. All models also use L2 regularization. The learning rate decay scheme is the exponential decay scheme used in Szegedy et al. [2016b]. Dropout is applied to the final softmax matrix with probability 0.5.
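Three of the ingredients above (label smoothing of 0.1, the auxiliary loss weight of 0.4, and the parameter average with decay 0.9999) reduce to short formulas. The sketch below uses hypothetical helper names and toy numbers. The smoothing formula follows the uniform-prior form of Szegedy et al. [2016b], and the L2 term is left out of the combined loss for brevity.

```python
import math

def smooth_labels(true_class, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [uniform + (1.0 - smoothing) * (1.0 if k == true_class else 0.0)
            for k in range(num_classes)]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs))

def total_loss(main_loss, aux_loss, aux_weight=0.4):
    """Main classifier loss plus the down-weighted auxiliary classifier loss."""
    return main_loss + aux_weight * aux_loss

def ema_update(shadow, params, decay=0.9999):
    """One step of the running parameter average used at evaluation time."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

# Toy example with 4 classes instead of 1000.
targets = smooth_labels(true_class=2, num_classes=4)
loss = total_loss(
    cross_entropy(targets, [0.1, 0.1, 0.7, 0.1]),   # hypothetical main head output
    cross_entropy(targets, [0.2, 0.2, 0.4, 0.2]),   # hypothetical auxiliary head output
)
shadow = ema_update(shadow=[1.0], params=[0.0])
```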