Multi-Objective Evolutionary Design of Deep Convolutional Neural Networks for Image Classification
Abstract—Convolutional neural networks (CNNs) are the backbones of deep learning paradigms for numerous vision tasks. Early advancements in CNN architectures are primarily driven by human expertise and elaborate design processes. Recently, neural architecture search was proposed with the aim of automating the network design process and generating task-dependent architectures. While existing approaches have achieved competitive performance in image classification, they are not well suited to problems where the computational budget is limited, for two reasons: (1) the obtained architectures are either solely optimized for classification performance or for only one deployment scenario; (2) the search process requires vast computational resources in most approaches. To overcome these limitations, we propose an evolutionary algorithm for searching neural architectures under multiple objectives, such as classification performance and floating-point operations (FLOPs). The proposed method addresses the first shortcoming by populating a set of architectures to approximate the entire Pareto frontier through genetic operations that recombine and modify architectural components progressively. Our approach improves computational efficiency by carefully down-scaling the architectures during the search as well as reinforcing the patterns commonly shared among past successful architectures through Bayesian model learning. The integration of these two main contributions allows an efficient design of architectures that are competitive and, in most cases, outperform both manually and automatically designed architectures on benchmark image classification datasets: CIFAR, ImageNet and human chest X-ray. The flexibility of simultaneously obtaining multiple architecture choices for different compute requirements further differentiates our approach from other methods in the literature. Code is available at https://github.com/mikelzc1990/nsganetv1.
Index Terms—Neural architecture search (NAS), evolutionary deep learning, convolutional neural networks (CNNs), genetic algorithms (GAs)
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have been overwhelmingly successful in a variety of computer-vision-related tasks like object classification, detection, and segmentation. One of the main driving forces behind this success is the introduction of many CNN architectures, including GoogLeNet [1], ResNet [2], DenseNet [3], etc., in the context of object classification. Concurrently, architecture designs, such as ShuffleNet [4], MobileNet [5], LBCNN [6], etc., have been developed with the goal of enabling real-world deployment of high-performance models on resource-constrained devices.
These developments are the fruits of years of painstaking efforts and human ingenuity.
Neural architecture search (NAS), on the other hand, presents a promising path to alleviate this painful process by posing the design of CNN architectures as an optimization problem. By altering the architectural components in an algorithmic fashion, novel CNNs can be discovered that exhibit improved performance metrics on representative datasets. The huge surge in research and applications of NAS indicates the tremendous academic and industrial interest NAS has attracted, as teams seek to stake out some of this territory. It is now well recognized that designing bespoke neural network architectures for various tasks is one of the most challenging and practically beneficial components of the entire Deep Neural Network (DNN) development process, and is a fundamental step toward automated machine learning.
Early methods for NAS relied on Reinforcement Learning (RL) to navigate and search for architectures with high performance. A major limitation of these approaches [7], [8] is the steep computational requirement for the search process itself, often requiring weeks of wall clock time on hundreds of Graphics Processing Unit (GPU) cards. Recent relaxation-based methods [9]–[12] seek to improve the computational efficiency of NAS approaches by approximating the connectivity between different layers in the CNN architectures by real-valued variables that are learned (optimized) through gradient descent together with the weights. However, such relaxation-based NAS methods suffer from excessive GPU memory requirements during search, resulting in constraints on the size of the search space (e.g., reduced layer operation choices).
In addition to being accurate in prediction, real-world applications demand that NAS methods find network architectures that are also efficient in computation—e.g., have low power consumption in mobile applications and low latency in autonomous driving applications. It has been a common observation that the predictive performance continuously improves as the complexity (i.e., # of layers, channels, etc.) of the network architectures increases [2], [3], [8], [13]. This alludes to the competing nature of trying to simultaneously maximize predictive performance and minimize network complexity, thereby necessitating multi-objective optimization. Despite recent advances in RL and relaxation-based NAS methods, they are still not readily applicable for multi-objective NAS.
Among the many different NAS methods being continually proposed, Evolutionary Algorithms (EAs) are receiving considerable attention, due to their population-based nature and flexibility in encoding. They offer a viable alternative to conventional machine learning (ML)-oriented approaches, especially under the scope of multi-objective NAS. An EA, in general, is an iterative process in which individuals in a population are made gradually better by applying variations to selected individuals and/or recombining parts of multiple individuals. Despite the ease of extending them to handle multiple objectives, most existing EA-based NAS methods [14]–[19] are still single-objective driven.
In this paper, we present NSGANetV1, a multi-objective evolutionary algorithm for NAS that extends an earlier proof-of-principle method [20], to address the aforementioned limitations of current approaches. The key contributions, followed by the extensions made in this paper, are summarized below:
- NSGANetV1 populates a set of architectures to approximate the entire Pareto front in one run through customized genetic operations that recombine and modify architectural components progressively. NSGANetV1 improves computational efficiency by carefully downscaling the architectures during the search as well as reinforcing the emerging patterns shared among past successful architectures through a Bayesian Network based distribution estimation operator. Empirically, the obtained architectures, in most cases, outperform both manually and other automatically designed architectures on various datasets.
- By obtaining a set of architectures in one run, NSGANetV1 allows designers to choose a suitable network a-posteriori, as opposed to committing to a pre-defined preference weighting of objectives prior to the search. Further post-optimal analysis of the set of non-dominated architectures often reveals valuable design principles, which is another benefit of posing NAS as a multi-objective optimization problem, as is done in NSGANetV1.
- From an algorithmic perspective, we extend our previous work [20] in a number of ways: (i) an expanded search space to include five more layer operations and one more option that controls the width of the network, (ii) improved encoding, mutation and crossover operators accompanying the modified search space, and (iii) a more thorough lower-level optimization process for weight learning, resulting in better and more reliable performance.
- From an evaluation perspective, we extend our previous work [20] in two different ways: (i) adding three more tasks, including medical imaging, robustness to adversarial attacks, and car key-point estimation; and (ii) evaluating the searched architectures on five new datasets: ImageNet, ImageNet-V2, CIFAR-10.1, corrupted CIFAR-10 and corrupted CIFAR-100.
The remainder of this paper is organized as follows. Section II introduces and summarizes related literature. In Section III, we provide a detailed description of the main components of our approach. We describe the experimental setup to validate our approach along with a discussion of the results in Section IV, followed by further analysis and an application study in Sections V and VI, respectively. Finally, we conclude with a summary of our findings and comment on possible future directions in Section VII.
II. RELATED WORK
Recent years have witnessed growing interest in NAS. The promise of being able to automatically and efficiently search for task-dependent network architectures is particularly appealing as deep neural networks are widely deployed in diverse applications and computational environments. Early methods [21], [22] made efforts to simultaneously evolve the topology of neural networks along with weights and hyperparameters. These methods perform competitively with hand-crafted networks on control tasks with shallow fully connected networks. In the following, we present studies related to deep convolutional neural networks for image classification. Readers are referred to the supplementary materials for a more detailed review of the topic.
Evolutionary NAS: Designing neural networks through evolution has been a topic of interest for a long time. Recent evolutionary approaches focus on evolving solely the topology, while leaving the learning of weights to gradient descent algorithms and using hyper-parameter settings that are manually tuned. Xie and Yuille's Genetic CNN [14] is one of the early studies that shows the promise of using EAs for NAS. Real et al. [15] introduce perhaps the first truly large-scale application of a simple EA to NAS. The extension of this method presented in [17], called AmoebaNet, provides the first large-scale comparison of EA and RL methods. Their EA, using an age-based selection similar to [23], has demonstrated faster convergence to an accurate network when compared to RL and random search. Concurrently, Liu et al. [16] evolve a hierarchical representation that allows non-modular layer structures to emerge. Despite the impressive improvements achieved on various datasets, these EA methods are extremely computationally inefficient; e.g., one run of the regularized evolution method [17] takes one week on 450 GPU cards.
Concurrently, another streamlining of EA methods for use in budgeted NAS has emerged. Suganuma et al. [24] use Cartesian genetic programming to assemble an architecture from existing modular blocks (e.g., Residual blocks). Sun et al. [19] use a random forest as an offline surrogate model to predict the performance of architectures, partially eliminating the lower-level optimization via gradient descent. The reported results yield 3x savings in wall clock time with similar classification performance when compared to their previous works [18], [25]. However, results reported from these budgeted EA methods are far from state-of-the-art and are only demonstrated on small-scale datasets, i.e., CIFAR-10 and CIFAR-100.
Multi-objective NAS: In this work, the term multi-objective NAS refers to methods that simultaneously approximate the entire set of efficient trade-off architectures in one run [26], [27]. Kim et al. [28] presented NEMO, one of the earliest evolutionary multi-objective approaches to evolve CNN architectures. NEMO uses NSGA-II [29] to maximize the classification performance and inference speed of a network, and searches over the space of the number of output channels from each layer within a restricted space of seven different architectures. DPP-Net [30], an extension of [31], progressively expands networks from simple structures and only trains the top-K (based on Pareto-optimality) networks that are predicted to be promising by a surrogate model. Elsken et al. [32] present the LEMONADE method, which is formulated to develop networks with high predictive performance and lower resource constraints. LEMONADE reduces compute requirements through custom-designed approximate network morphisms [33], which allow newly generated networks to share parameters with their forerunners, obviating the need to train new networks from scratch. However, LEMONADE still requires nearly 100 GPU-days to search on the CIFAR datasets [34].

Fig. 1: Overview: Given a dataset and objectives, NSGANetV1 designs a set of custom architectures spanning the trade-off front. NSGANetV1 estimates the performance of an architecture through its proxy model, optimized by Stochastic Gradient Descent (SGD) in the lower level. The search proceeds in exploration via genetic operations, followed by exploitation via distribution estimation. See Algorithm 1 for pseudocode; colors correspond.
Search Efficiency: The main computation bottleneck of NAS resides in the lower-level optimization of learning the weights for evaluating the performance of architectures. One such evaluation typically requires hours to finish. To improve the practical utility of search under a constrained computational budget, NAS methods commonly advocate for substitute measurements without a full-blown lower-level optimization. A widely-used approach proceeds as follows: it reduces the depth (number of layers) and the width (number of channels) of the intended architecture to create a small-scale network—i.e., a proxy model. Proxy models require an order of magnitude less computation time (typically, minutes) to perform lower-level optimization, and the performance of proxy models is then used as surrogate measurements to guide the search. However, most existing NAS work [8], [16], [17], [31], [35] follows simple heuristics to construct the proxy model, resulting in low correlation in prediction. For instance, NASNet [8] has an additional re-ranking stage that trains the top 250 architectures for 300 epochs each (takes more than a year on a single GPU card) before picking the best one, and the reported NASNet-A model was originally ranked 70th among the top 250 according to the performance measured at proxy model scale. Similarly, AmoebaNet [17] relies on evaluation of duplicate architectures to gauge representative performance, leading to 27K models being evaluated during search.
In this work, we focus on both the efficiency and the reliability aspects of the proxy model; through a series of systematic studies in a controlled setting, we empirically establish the trade-off between the correlation of proxy performance to true performance and the speed-up in estimation. We then implement a suitable setting that is specific to our search space and dataset.
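One way to quantify the reliability of a proxy model is to rank-correlate proxy-model accuracies against fully trained accuracies on a common set of architectures. A minimal Python sketch of such a check follows (Spearman correlation without tie handling; the argument names `proxy_acc` and `true_acc` are illustrative):

```python
def spearman_rho(proxy_acc, true_acc):
    """Spearman rank correlation between proxy-model accuracies and
    fully trained accuracies (assumes no ties, for brevity)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    ra, rb = ranks(proxy_acc), ranks(true_acc)
    n = len(proxy_acc)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```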
III. PROPOSED APPROACH
Practical applications of NAS can rarely be considered from the point of view of a single objective of maximizing performance; rather, they must be evaluated from at least one additional, conflicting objective that is specific to the deployment scenario. In this work, we approach the problem of designing high-performance architectures with diverse complexities for different deployment scenarios as a multi-objective bilevel optimization problem [36]. We mathematically formulate the problem as,
$$
\begin{array}{rl}
\mathrm{minimize} & F(\boldsymbol{x})=\big(f_{1}(\boldsymbol{x};\boldsymbol{w}^{*}(\boldsymbol{x})),\,f_{2}(\boldsymbol{x})\big)^{T},\\
\mathrm{subject\ to} & \boldsymbol{w}^{*}(\boldsymbol{x})\in\operatorname*{argmin}_{\boldsymbol{w}}\,\mathcal{L}(\boldsymbol{w};\boldsymbol{x}),\\
& \boldsymbol{x}\in\Omega_{x},\quad\boldsymbol{w}\in\Omega_{w},
\end{array}
$$
where $\Omega_{x}=\prod_{i=1}^{n}[a_{i},b_{i}]\subseteq\mathbb{Z}^{n}$ is the architecture decision space, $a_{i}$ and $b_{i}$ are the lower and upper bounds, $\boldsymbol{x}=(x_{1},\dots,x_{n})^{T}\in\Omega_{x}$ is a candidate architecture, and the lower-level variable $\boldsymbol{w}\in\Omega_{w}$ denotes its associated weights. The upper-level objective vector $\boldsymbol{F}$ comprises the classification error $(f_{1})$ on the validation data $\mathcal{D}_{vld}$ and the complexity $(f_{2})$ of the network architecture. The lower-level objective $\mathcal{L}(\boldsymbol{w};\boldsymbol{x})$ is the cross-entropy loss on the training data $\mathcal{D}_{trn}$.
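For concreteness, a minimal sketch of the bilevel evaluation of one candidate architecture is given below; the helper names (`train_proxy_weights`, `validation_error`, `count_flops`) are illustrative placeholders, not part of our released code:

```python
def evaluate(x, D_trn, D_vld):
    """Bilevel evaluation of a candidate architecture x.

    Lower level: learn w*(x) by minimizing the cross-entropy loss
    L(w; x) on D_trn via SGD (here on a down-scaled proxy model).
    Upper level: return the two objectives to be minimized.
    """
    w_star = train_proxy_weights(x, D_trn)   # w*(x) = argmin_w L(w; x)
    f1 = validation_error(x, w_star, D_vld)  # classification error
    f2 = count_flops(x)                      # architecture complexity
    return (f1, f2)
```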
Our proposed algorithm, NSGANetV1, is an iterative process in which initial architectures are made gradually better as a group, called a population. In every iteration, a group of offspring (i.e., new architectures) is created by applying variations through crossover and mutation to the more promising of the architectures already found, also known as parents, from the population. Every member of the population (including both parents and offspring) competes for survival and reproduction (becoming a parent) in each iteration. The initial population may be generated randomly or guided by prior knowledge, i.e., seeding past successful architectures directly into the initial population. Subsequent to initialization, NSGANetV1 conducts the search in two sequential stages: (i) exploration, with the goal of discovering diverse ways to construct architectures, and (ii) exploitation, which reinforces the emerging patterns commonly shared among the architectures successful during exploration. A set of architectures representing efficient trade-offs between network performance and complexity is obtained at the end of evolution, through genetic operators and a Bayesian-model-based learning procedure. A flowchart and pseudocode outlining the overall approach are shown in Fig. 1 and Algorithm 1, respectively. In the remainder of this section, we provide a detailed description of the aforementioned components in Sections III-A to III-C.
A. Search Space and Encoding
The search for optimal network architectures can be performed over many different search spaces. The generality of the chosen search space has a major influence on the quality of results that are even possible. Most existing evolutionary NAS approaches [14], [19], [24], [32] search only one aspect of the architecture space—e.g., the connections and/or hyper-parameters. In contrast, NSGANetV1 searches over both operations and connections—the search space is thus more comprehensive, including most of the previous successful architectures designed both by human experts and algorithmically.
Modern CNN architectures are often composed of an outer structure (network-level) design where the width (i.e., # of channels), the depth (i.e., # of layers) and the spatial resolution changes (i.e., locations of pooling layers) are decided; and an inner structure (block-level) design where the layer-wise connections and computations are specified, e.g., Inception block [1], ResNet block [2], and DenseNet block [3], etc. As seen in the CNN literature, the network-level decisions are mostly hand-tuned based on meta-heuristics from prior knowledge and the task at hand, as is the case in this work. For block-level design, we adopt the one used in [8], [9], [17], [31] to be consistent with previous work.
A block is a small convolutional module, typically repeated multiple times to form the entire neural network. To construct scalable architectures for images of different resolutions, we use two types of blocks to process intermediate information: (1) the Normal block, a block type that returns information of the same spatial resolution; and (2) the Reduction block, another block type that returns information with spatial resolution halved by using a stride of two. See Fig. 2a for a pictorial illustration.
We use directed acyclic graphs (DAGs) consisting of five nodes to construct both types of blocks (a Reduction block uses a stride of two). Each node is a two-branched structure, mapping two inputs to one output. For each node in block $i$, we need to pick two inputs from among the output of the previous block $h_{i-1}$, the output of the previous-previous block $h_{i-2}$, and the set of hidden states created in any previous nodes of block $i$. For each pair of inputs chosen, we choose a computation operation from among the following options, collected based on their prevalence in the CNN literature:
• identity
• 3x3 max pooling
• 3x3 average pooling
• 3x3 depthwise-separable conv
• 5x5 depthwise-separable conv
• 7x7 depthwise-separable conv
• 1x7 then 7x1 convolution
• 3x3 dilated convolution
• 5x5 dilated convolution
• squeeze-and-excitation [37]
• 3x3 local binary conv [6]
• 5x5 local binary conv [6]
The results computed from both branches are then added together to create a new hidden state, which is available for subsequent nodes in the same block. See Fig. 2b-2d for pictorial illustrations. The search space we consider in this paper is an expanded version of the micro search space used in our previous work [20]. Specifically, the current search space (i) searches for a factor that gradually increments the channel size of each block with depth (see Fig. 2b), as opposed to sharply doubling the channel size when down-sampling; and (ii) considers an expanded set of primitive operations that includes both more recent advanced layer primitives, such as squeeze-and-excitation [37], and more parameter-efficient layer primitives, like local binary convolution [6].

Fig. 2: Schematic of the NSGANetV1 search space, motivated by [8]: (a) An architecture is composed of stacked blocks. (b) The number of channels in each block is gradually increased with the depth of the network. (c) Each block is composed of five nodes, where each node is a two-branched computation applied to outputs from either previous blocks or previous nodes within the same block. (d) A graphical visualization of (c).
With the above-mentioned search space, there are in total 20 decisions to constitute a block structure—i.e., choose two pairs of input and operation for each node, and repeat for five nodes. The resulting number of combinations for a block structure is:

$$
\prod_{i=1}^{n}\big((i+1)\times n\_ops\big)^{2},
$$

where $n$ denotes the number of nodes and $n\_ops$ denotes the number of considered operations. Therefore, with one Normal block and one Reduction block with five nodes in each, the overall size of the encoded search space is approximately $10^{33}$.
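As a sanity check on the figure above, the count can be evaluated directly; with $n=5$ nodes per block and the 12 operations listed earlier, two independently designed blocks give roughly $10^{33}$ combinations:

```python
n, n_ops = 5, 12  # nodes per block, candidate layer operations

# Node i chooses two (input, operation) pairs; its inputs come from
# h_{i-1}, h_{i-2}, and the i-1 hidden states of earlier nodes in the
# block, i.e., i+1 input choices per branch.
one_block = 1
for i in range(1, n + 1):
    one_block *= ((i + 1) * n_ops) ** 2

search_space = one_block ** 2  # one Normal and one Reduction block
print(f"{search_space:.2e}")   # ~1.03e+33
```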
B. Performance Estimation Strategy
To guide NSGANetV1 towards finding more accurate and efficient architectures, we consider two metrics as objectives, namely, classification accuracy and architecture complexity. Assessing the classification accuracy of an architecture during search requires another optimization to first identify the optimal values of the associated weights via Stochastic Gradient Descent (SGD; Algorithm 2). Even though there exist wellestablished gradient descent algorithms to efficiently solve this optimization, repeatedly executing such an algorithm for every candidate architecture renders the overall process computationally very prohibitive. Therefore, to overcome this computational bottleneck, we carefully (using a series of ablation studies) down-scale the architectures to create their proxy models [8], [17], which can be optimized efficiently in the lower-level through SGD. Their performances become surrogate measurements to select architectures during search. Details are provided in Section V-C.
A number of metrics can serve as proxies for complexity, including: the number of active nodes, number of active connections between the nodes, number of parameters, inference time, and the number of floating-point operations (FLOPs) needed to execute the forward pass of a given architecture. Our initial experiments considered each of these metrics in turn. We concluded from extensive experimentation that inference time cannot be estimated reliably due to differences and inconsistencies in the computing environment, GPU manufacturer, ambient temperature, etc. Similarly, the number of parameters, active connections or active nodes only relate to one aspect of the complexity. In contrast, we found an estimate of FLOPs to be a more accurate and reliable proxy for network complexity. Therefore, classification accuracy and FLOPs serve as our choice of twin objectives to be traded off for selecting architectures. To simultaneously compare and select architectures based on these two objectives, we use the non-dominated ranking and "crowdedness" concepts proposed in [29].
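For reference, a minimal sketch of these two selection criteria from [29] is given below; `objs` holds one (classification error, FLOPs) pair per architecture, both to be minimized:

```python
def dominates(a, b):
    """a Pareto-dominates b: no worse in every objective, better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_rank(objs):
    """Rank 0 is the Pareto front; higher ranks are successively peeled fronts."""
    ranks, remaining = [None] * len(objs), set(range(len(objs)))
    front = 0
    while remaining:
        current = {i for i in remaining
                   if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)}
        for i in current:
            ranks[i] = front
        remaining -= current
        front += 1
    return ranks

def crowding_distance(objs):
    """Larger distance means a less crowded region of the front."""
    dist = [0.0] * len(objs)
    for k in range(len(objs[0])):
        order = sorted(range(len(objs)), key=lambda i: objs[i][k])
        span = objs[order[-1]][k] - objs[order[0]][k] or 1.0
        dist[order[0]] = dist[order[-1]] = float("inf")  # boundary points kept
        for lo, i, hi in zip(order, order[1:-1], order[2:]):
            dist[i] += (objs[hi][k] - objs[lo][k]) / span
    return dist
```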
C. Creation of New Generation
Exploration: Given a population of architectures, parents are selected from the population with a fitness bias. This choice is dictated by two observations, (1) offspring created around better parents are expected to have higher fitness on average than those created around worse parents, with the assumption of some level of gradualism in the solution space; (2) occasionally (although not usually), offspring perform better than their parents, through inheriting useful traits from both parents. Because of this, one might demand that the best architecture in the population should always be chosen as one of the parents. However, the deterministic and greedy nature of that approach would likely lead to premature convergence due to loss of diversity in the population [38]. To address this problem, we use binary tournament selection [39] to promote parent architectures in a stochastic fashion. At each iteration, binary tournament selection randomly picks two architectures from the population, then the one favored by the multi-objective selection criterion described in Section III-B becomes one of the parents. This process is repeated to select a second parent architecture; the two parent architectures then undergo a crossover operation.
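A corresponding sketch of the binary tournament comparator follows, assuming each candidate carries `rank` (non-domination rank) and `crowd` (crowding distance) attributes computed as above:

```python
import random

def binary_tournament(population):
    """Pick two candidates at random; prefer the lower non-domination
    rank and break ties by the larger crowding distance."""
    a, b = random.sample(population, 2)
    if a.rank != b.rank:
        return a if a.rank < b.rank else b
    return a if a.crowd >= b.crowd else b
```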

Fig. 3: Illustration of node-level crossover.
In NSGANetV1, we use two types of crossover (with equal probability of being chosen) to efficiently exchange substructures between two parent architectures. The first type is at the block level, in which the offspring architectures are created by recombining the Normal block from the first parent with the Reduction block from the other parent and vice versa. The second type is at the node level, where a node from one parent is randomly chosen and exchanged with another node at the same position from the other parent. We apply the node-level crossover to both Normal and Reduction blocks. Fig. 3 illustrates an example of node-level crossover. Note that two offspring architectures are generated after each crossover operation, and only one of them (randomly chosen) is added to the offspring population. In each generation, an offspring population of the same size as the parent population is generated.
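A minimal sketch of the two crossover types is shown below; for illustration only, a genome is assumed to be a dict with "normal" and "reduction" entries, each a list of five node genes:

```python
import copy
import random

def block_crossover(p1, p2):
    """Block-level: recombine the Normal block of one parent with the
    Reduction block of the other, and vice versa."""
    c1 = {"normal": copy.deepcopy(p1["normal"]),
          "reduction": copy.deepcopy(p2["reduction"])}
    c2 = {"normal": copy.deepcopy(p2["normal"]),
          "reduction": copy.deepcopy(p1["reduction"])}
    return c1, c2

def node_crossover(p1, p2):
    """Node-level: swap the node gene at one randomly chosen position,
    applied to both the Normal and the Reduction block."""
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    for block in ("normal", "reduction"):
        i = random.randrange(len(c1[block]))
        c1[block][i], c2[block][i] = c2[block][i], c1[block][i]
    return c1, c2
```

As noted above, only one of the two resulting offspring (chosen at random) is added to the offspring population.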

Fig. 4: Input and Operation Mutation: Dashed line boxes in red highlight the mutation. $h_{i-2}$ and $h_{i-1}$ are outputs from the previous-previous and previous blocks, respectively. $h_i^{(3)}$ indicates the output from node 3 of the current block.
To enhance the diversity of the population and the ability to escape from local attractors, we use a discretized version of the polynomial mutation (PM) operator [40] subsequent to crossover. We allow mutation to be applied to both the input hidden states and the choice of operations. Fig. 4 shows an example of each type of mutation using the parent-centric PM operator, in which offspring are intentionally created around the parents in the decision space. In association with PM, we sort our discrete encoding of input hidden states chronologically, and the choice of operations in ascending order of computational complexity. In the context of neural architecture, this step makes the mutated input hidden states in offspring architectures more likely to be chronologically close to the input hidden states in parent architectures. For example, $h_i^{(2)}$ is more likely to be mutated to $h_i^{(1)}$ than to $h_{i-2}$ by PM. A similar logic applies in the case of mutation on layer operations.
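A minimal sketch of the discretized PM operator on a single integer-coded gene (an input-state index or an operation index, ordered as described above) follows; this is a simplified variant, and `eta` controls how tightly offspring concentrate around the parent:

```python
import random

def pm_mutate(gene, low, high, eta=20.0):
    """Simplified discretized polynomial mutation: perturb an integer
    gene so that values near the parent are the most probable."""
    u = random.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    child = gene + delta * (high - low)
    return min(max(int(round(child)), low), high)
```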

Fig. 5: Illustrative example of BN-based exploitation step in NSGANetV1: given past successful architectures, we construct a BN relating the dependencies between the four nodes inside the Normal block. A new architecture is then sampled from this BN and proceeds forward for performance estimation.
Exploitation: After a sufficient number of architectures have been explored (consuming 2/3 of the total computational budget, i.e., $\tau$ in Algorithm 1), we start to enhance the exploitation aspect of the search. The key idea is to reinforce and reuse the patterns commonly shared among past successful architectures. We use a Bayesian Network (BN) [41] as the probabilistic model to estimate the distribution of the Pareto set (of architectures). In the context of our search space and encoding, this translates to learning the correlations among the operations and connections of nodes within a block. Our exploitation step uses a subset (the top-100 architectures selected based on domination rank and crowding distance [29]) of the past evaluated architectures to guide the final part of the search. More specifically, say we are designing a Normal block with three nodes, namely $\alpha_{n}^{(1)}$, $\alpha_{n}^{(2)}$, and $\alpha_{n}^{(3)}$. We would like to know the relationship among these three nodes. For this purpose, we construct a BN relating these variables, modeling the probability of Normal blocks beginning with a particular node $\alpha_{n}^{(1)}$, the probability that $\alpha_{n}^{(2)}$ follows $\alpha_{n}^{(1)}$, and the probability that $\alpha_{n}^{(3)}$ follows $\alpha_{n}^{(2)}$ and $\alpha_{n}^{(1)}$. In other words, we estimate the conditional distributions $p\left(\alpha_{n}^{(1)}\right)$, $p\left(\alpha_{n}^{(2)}|\alpha_{n}^{(1)}\right)$, and $p\left(\alpha_{n}^{(3)}|\alpha_{n}^{(2)},\alpha_{n}^{(1)}\right)$ using the population history, and update these estimates during the exploitation process. New offspring architectures are created by sampling from this BN. A pictorial illustration of this process is provided in Fig. 5. This BN-based exploitation strategy is used in addition to the genetic operators: initially (i.e., at the beginning of exploitation) we assign 25% of the offspring (line 34 in Algorithm 1) to be created by the BN, and we update this probability adaptively (line 36 in Algorithm 1). To be more specific, we calculate the probabilities of using genetic operators and sampling from the BN model at generation $t$ based on the survival rates of offspring created using them in the previous generation, following the softmax function:
$$
\rho_{t}^{(i)}=\frac{\exp\big(s_{t-1}^{(i)}\big)}{\sum_{j=1}^{2}\exp\big(s_{t-1}^{(j)}\big)},
$$
where $\rho_{t}^{(i)}$ are the probabilities of using genetic operators $(i=1)$ and sampling from the learned BN model $(i=2)$, and $s_{t-1}^{(i)}$ are the survival rates of the offspring created by the genetic operators and the learned BN model, respectively, at the previous generation $t-1$. Note that $\rho_{t}^{(i=1)}$ corresponds to $\rho$ in Algorithm 1.
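A minimal sketch of both exploitation mechanics follows: the conditional distributions are estimated by counting node-gene co-occurrences in the elite set, new blocks are drawn by ancestral sampling, and the operator probabilities are the softmax of the previous generation's survival rates. The name `elite` (a list of blocks, each a list of node genes) is illustrative:

```python
import math
import random
from collections import Counter, defaultdict

def fit_bn(elite):
    """Estimate p(a1), p(a2|a1), p(a3|a2,a1), ... from the blocks of the
    top-ranked architectures, as empirical count tables."""
    tables = []
    for k in range(len(elite[0])):       # one table per node position
        counts = defaultdict(Counter)    # context (a1..ak) -> counts of a_{k+1}
        for block in elite:
            counts[tuple(block[:k])][block[k]] += 1
        tables.append(counts)
    return tables

def sample_bn(tables):
    """Ancestral sampling: draw each node gene given the genes before it."""
    block = []
    for counts in tables:
        genes, weights = zip(*counts[tuple(block)].items())
        block.append(random.choices(genes, weights=weights)[0])
    return block

def operator_probs(survival_rates):
    """Softmax over last generation's offspring survival rates for
    (1) genetic operators and (2) BN sampling."""
    exps = [math.exp(s) for s in survival_rates]
    return [e / sum(exps) for e in exps]
```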
IV. EXPERIMENTAL SETUP AND RESULTS
In this section, we will evaluate the efficacy of NSGANetV1 on multiple benchmark image classification datasets.
A. Baselines
To demonstrate the effectiveness of the proposed algorithm, we compare the non-dominated architectures achieved at the conclusion of NSGANetV1's evolution with architectures reported by various peer methods published in top-tier venues. The chosen peer methods can be broadly categorized into three groups: architectures manually designed by human experts, non-EA-based (mainly RL or relaxation-based), and EA-based. Human-engineered architectures include ResNet [2], ResNeXt [42], and DenseNet [3], among others. The second and third groups range from earlier methods [7], [14], [15] that are oriented towards "proof-of-concept" for NAS, to more recent methods [8], [9], [17], [43], many of which improved state-of-the-art results on various computer vision benchmarks at the time they were published. The effectiveness of the different architectures is judged on both classification accuracy and computational complexity. For comparison on classification accuracy, three widely used natural object classification benchmark datasets are considered, namely CIFAR-10, CIFAR-100 and ImageNet. More details and a gallery of examples from these three datasets are provided in Fig. 2 in the supplementary materials under Section V.
B. Implementation Details
Motivated by efficiency and practicality considerations, most existing NAS methods, including [8], [17], [18], [44], carry out the search process on the CIFAR-10 dataset. However, as we demonstrate through ablation studies in Section V-C, CIFAR-100 provides a more reliable measure of an architecture's efficacy than CIFAR-10. Based on this observation, and in contrast to existing approaches, we use the more challenging CIFAR-100 dataset for the search process. Furthermore, we split the original CIFAR-100 training set (80%-20%) to create a training and validation set to prevent over-fitting to the training set and improve generalizability. We emphasize that the original testing set is never used to guide the selection of architectures in any form during the search.
TABLE I: Summary of Hyper-parameter Settings.
| Categories | Parameters | Settings |
|---|---|---|
| search space | # of initial channels ($Ch_{init}$) | 32 |
| | # of channel increments ($Ch_{inc}$) | 6 |
| | # of repetitions of Normal blocks ($V$) | 4/5/6 |
| gradient descent | batch size | 128 |
| | weight decay (L2 regularization) | 5.00E-04 |
| | epochs | 36/600 |
| | learning rate schedule | Cosine Annealing [48] |
| search strategy | population size | 40 |
| | # of generations | 30 |
| | crossover probability | 0.9 |
| | mutation probability | 0.1 |
The search itself is repeated five times with different initial random seeds. We select and report the performance of the median run as measured by hypervolume (HV). Such a procedure ensures the reproducibility of our NAS experiments and mitigates the concerns that have arisen in recent NAS studies [45], [46]. We use the standard SGD algorithm for learning the associated weights of each architecture. Other hyper-parameter settings related to the search space, the gradient descent training and the search strategy are summarized in Table I. We provide analysis aimed at justifying some of the hyper-parameter choices in Section V-C. All experiments are performed on 8 Nvidia 2080Ti GPU cards.
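In the two-objective case, the HV of a non-dominated front with respect to a reference point reduces to a sum of rectangle areas; a minimal sketch follows (both objectives minimized, `front` a list of (f1, f2) pairs):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-D minimization front w.r.t. reference point
    `ref`: sweep in ascending f1 and accumulate rectangle areas."""
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front):  # ascending f1, hence descending f2
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv
```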
Our post-search training settings largely follow [9]: we extend the number of epochs to 600 with a batch size of 96 to thoroughly re-train the selected models from scratch. We also incorporate a data pre-processing technique, cutout [47], and a regularization technique, scheduled path dropout, introduced in [8]. In addition, to further improve the training process, an auxiliary head classifier [1] is appended to the architecture at approximately 2/3 depth (right after the second resolution-reduction operation). The loss from this auxiliary head classifier, scaled by a constant factor of 0.4, is aggregated with the loss from the original architecture before back-propagation during training.
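The loss aggregation during re-training can be summarized in a short PyTorch-style sketch, assuming the model returns both the main logits and the auxiliary-head logits:

```python
import torch.nn.functional as F

def training_loss(model, images, targets, aux_weight=0.4):
    """Cross-entropy on the main output plus the down-weighted
    cross-entropy of the auxiliary head attached at ~2/3 depth."""
    logits, logits_aux = model(images)
    loss = F.cross_entropy(logits, targets)
    return loss + aux_weight * F.cross_entropy(logits_aux, targets)
```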
C. Effectiveness of NSGANetV1
We first present the objective space distribution of all architectures generated by NSGANetV1 during the course of evolution on CIFAR-100, in Fig. 6a. We include architectures generated by the original NSGA-II algorithm and uniform random sampling as references for comparison. Details of these two methods are provided in Section IV-D. From the set of nondominated solutions (outlined by red box markers in Fig. 6a), we select five architectures based on the ratio of the gain on accuracy over the sacrifice on FLOPs. For reference purposes, we name these five architectures as NSGANetV1-A0 to -A4 in ascending FLOPs order. See Fig. 1 in the supplementary materials for a visualization of the searched architectures.
For comparison with other peer methods, we follow the training procedure in [9] and re-train the weights of NSGANetV1-A0 to -A4 on CIFAR-100, following the steps outlined in Section IV-B. We would like to mention that since most existing approaches do not report the number of FLOPs for the architectures used on the CIFAR-100 dataset, we instead compare their computational complexity through the number of parameters to prevent potential discrepancies from re-implementation. Fig. 6b shows the post-search architecture comparisons: NSGANetV1-A0 to -A4, i.e., the architectures derived in this paper, jointly dominate all other considered peer methods with a clear margin. More specifically, NSGANetV1-A1 is more accurate than the peer EA method AE-CNN-E2EPP [19], while being 30x more efficient in network parameters; NSGANetV1-A2 achieves better performance than AmoebaNet [17] and NSGA-Net [20] with 3x fewer parameters. Furthermore, NSGANetV1-A4 exceeds the classification accuracy of Shake-Even 29 2x4x64d + SE [37] using 8x fewer parameters. More comparisons can be found in Table IIb.

Fig. 6: (a) Accuracy vs. FLOPs of all architectures generated by NSGANetV1 during the course of evolution on CIFAR-100. A subset of non-dominated architectures (see text), named NSGANetV1-A0 to A4, are re-trained thoroughly and compared with other peer methods in (b).
Following the practice adopted in most previous approaches [8], [9], [17], [31], [44], we measure the transferability of the obtained architectures by allowing the architectures evolved on one dataset (CIFAR-100 in this case) to be inherited and used on other datasets, by retraining the weights from scratch on the new dataset, in our case, on CIFAR-10 and ImageNet.
The effectiveness of NSGANetV1 is further validated by the transferred performance on the CIFAR-10 dataset. As shown in Fig. 7a, the trade-off frontier established by NSGANetV1-A0 to -A4 completely dominates the frontiers obtained by the peer EMO methods, both DPP-Net [30] and LEMONADE [32], as well as those obtained with other single-objective peer methods. More specifically, NSGANetV1-A0 uses 27x fewer parameters and achieves higher classification accuracy than Large-scale Evo. [15]. NSGANetV1-A1 outperforms Hierarchical NAS [16] and DenseNet [3] in classification, while saving 122x and 51x in parameters, respectively. NSGANetV1-A2 uses 4x fewer parameters to achieve similar performance as NSGA-Net [20]. Furthermore, NSGANetV1-A4 exceeds the previous state-of-the-art results reported by Proxyless NAS [43] while being 1.4x more compact. Refer to Table IIa for more comparisons.
For transfer performance comparison on the ImageNet dataset, we follow previous work [5], [8], [9], [44] and use the ImageNet-mobile setting, i.e., the setting where the number of FLOPs is less than 600M. NSGANetV1-A0 is too simple for the ImageNet dataset and NSGANetV1-A4 exceeds the 600M-FLOPs threshold of the mobile setting, so we provide results only for NSGANetV1-A1, -A2 and -A3. Fig. 7b compares the objective space with the other peer methods. Clearly, NSGANetV1 can achieve a better trade-off between the objectives. NSGANetV1-A2 dominates a wide range of peer methods, including ShuffleNet [4] by human experts, NASNet-A [8] by RL, DARTS [9] by relaxation-based methods, and AmoebaNet-A [17] by EA. Moreover, NSGANetV1-A3 surpasses the previous state-of-the-art performance reported by MobileNet-V2 [5] and AmoebaNet-C [17] in the mobile setting, with a marginal overhead in FLOPs (1%-3%).
D. Efficiency of NSGANetV1
Comparing the search phase contribution to the success of different NAS algorithms can be difficult and ambiguous due to substantial differences in search spaces and training procedures used during the search. Therefore, we use vanilla NSGA-II and uniform random sampling as comparisons to demonstrate the efficiency of the search phase in NSGANetV1. All three methods use the same search space and performance estimation strategy as described in Section III. The vanilla NSGA-II is implemented by discretizing the crossover and mutation operators in the original NSGA-II [29] algorithm with all hyper-parameters set to default values; and it does not utilize any additional enhancements—e.g., the Bayesian-Network-model-based exploitation. The uniform random sampling method is implemented by replacing the crossover and mutation operators in the original NSGA-II algorithm with an initialization method that uniformly samples the search space. We run each of the three methods five times and record the 25-percentile, median and 75-percentile of the normalized HV (NHV) that we obtain. The NHV measurements shown in Fig. 8a suggest that NSGANetV1 is capable of finding more accurate and simpler architectures than vanilla NSGA-II or uniform random sampling (even with an extended search budget), in a more efficient manner.
Apart from the HV metric, another important aspect of demonstrating the efficiency of NAS is the computational complexity of the methods. Since theoretical analysis of the computational complexity of different NAS methods is challenging, we compare the computation time spent on Graphics Processing Unit (GPU) cards, i.e., in GPU-days.

Fig. 7: Transferability of the NSGANetV1 architectures to (a) CIFAR-10, and (b) ImageNet. We compare Top-1 Accuracy vs. Computational Complexity. Architectures joined by dashed lines are from multi-objective algorithms.
TABLE II: Comparison between NSGANetV1 and other baseline methods. NSGANetV1 architectures are obtained by searching on CIFAR-100. NSGANetV1 results on CIFAR-10 and ImageNet are obtained by re-training the weights with images from their respective datasets. Ratio-to-NSGANetV1 indicates the resulting savings on #Params and #FLOPs. The search cost is compared in GPU-days, calculated by multiplying the number of GPU cards deployed with the execution time in days.
(a) CIFAR-10
| Architecture | Search Method | GPU-Days | Top-1 Acc. | #Params | Ratio-to-NSGANetV1 |
|---|---|---|---|---|---|
| NSGANetV1-A0 | EA | 27 | 95.33% | 0.2M | 1x |
| CGP-CNN [24] | EA | 27 | 94.02% | 1.7M | 8.5x |
| Large-scale Evo. [15] | EA | 2,750 | 94.60% | 5.4M | 27x |
| AE-CNN+E2EPP [19] | EA | 7 | 94.70% | 4.3M | 21x |
| ResNet [2] | manual | – | 95.39% | 1.7M | 8.5x |
| NSGANetV1-A1 | EA | 27 | 96.51% | 0.5M | 1x |
| Hierarchical NAS [16] | EA | 300 | 96.37% | 61.3M | 122x |
| PNAS [31] | SMBO | 150 | 96.37% | 3.2M | 6.4x |
| DenseNet [3] | manual | – | 96.54% | 25.6M | 51x |
| NSGANetV1-A2 | EA | 27 | 97.35% | 0.9M | 1x |
| CNN-GA [25] | EA | 35 | 96.78% | 2.9M | 3.2x |
| AmoebaNet-A [17] | EA | 3,150 | 96.88% | 3.1M | 3.4x |
| DARTS [9] | relaxation | 1 | 97.18% | 3.4M | 3.8x |
| NSGA-Net [20] | EA | 4 | 97.25% | 3.3M | 3.7x |
| NSGANetV1-A3 | EA | 27 | 97.78% | 2.2M | 1x |
| NASNet-A [8] | RL | 1,575 | 97.35% | 3.3M | 1.5x |
| LEMONADE [32] | EA | 90 | 97.42% | 13.1M | 6.0x |
| NSGANetV1-A4 | EA | 27 | 97.98% | 4.0M | 1x |
| AmoebaNet-B [17] | EA | 3,150 | 97.87% | 34.9M | 8.7x |
| Proxyless NAS [43] | RL | 1,500 | 97.92% | 5.7M | 1.4x |
(b) CIFAR-100
| Architecture | Search Method | GPU-Days | Top-1 Acc. | #Params | Ratio-to-NSGANetV1 |
|---|---|---|---|---|---|
| NSGANetV1-A0 | EA | 27 | 74.83% | 0.2M | 1x |
| Genetic CNN [14] | EA | 17 | 70.95% | - | - |
| MetaQNN [49] | RL | 90 | 72.86% | 11.2M | 56x |
| NSGANetV1-A1 | EA | 27 | 80.77% | 0.7M | 1x |
| Large-scale Evo. [15] | EA | 2,750 | 77.00% | 40.4M | 58x |
| ResNet [2] | manual | - | 77.90% | 20.9M | 30x |
| AE-CNN+E2EPP [19] | EA | 10 | 77.98% | 1.7M | 2.4x |
| NSGA-Net [20] | EA | 8 | 79.26% | 3.3M | 4.7x |
| CNN-GA [25] | EA | 40 | 79.47% | 4.1M | 5.9x |
| PNAS [31] | SMBO | 150 | 80.47% | 3.2M | 4.6x |
| ENAS [44] | RL | 0.5 | 80.57% | 4.6M | 6.6x |
| NSGANetV1-A2 | EA | 27 | 82.58% | 0.9M | 1x |
| AmoebaNet-A [17] | EA | 3,150 | 81.07% | 3.1M | 3.4x |
| GDAS [11] | relaxation | 0.2 | 81.62% | 3.4M | 3.8x |
| DARTS [9] | relaxation | 1 | 82.46% | 3.4M | 3.8x |
| NSGANetV1-A3 | EA | 27 | 82.77% | 2.2M | 1x |
| NSGANetV1-A4 | EA | 27 | 85.62% | 4.1M | 1x |
| DenseNet [3] | manual | - | 82.82% | 25.6M | 6.2x |
| SENet [37] | manual | - | 84.59% | 34.4M | 8.4x |
| Block-QNN [35] | RL | 32 | 85.17% | 33.3M | 8.1x |
(c) ImageNet
| Architecture | Search Method | GPU-Days | Top-1 Acc. | Top-5 Acc. | #Params | Ratio-to-NSGANetV1 | #FLOPs | Ratio-to-NSGANetV1 |
|---|---|---|---|---|---|---|---|---|
| NSGANetV1-A1 | EA | 27 | 70.9% | 90.0% | 3.0M | 1x | 270M | 1x |
| MobileNet-V2 [5] | manual | - | 72.0% | 91.0% | 3.4M | 1.1x | 300M | 1.1x |
| NSGANetV1-A2 | EA | 27 | 74.5% | 92.0% | 4.1M | 1x | 466M | 1x |
| ShuffleNet [4] | manual | - | 73.7% | - | 5.4M | 1.3x | 524M | 1.1x |
| NASNet-A [8] | RL | 1,575 | 74.0% | 91.3% | 5.3M | 1.3x | 564M | 1.2x |
| PNAS [31] | SMBO | 150 | 74.2% | 91.9% | 5.1M | 1.2x | 588M | 1.3x |
| AmoebaNet-A [17] | EA | 3,150 | 74.5% | 92.0% | 5.1M | 1.2x | 555M | 1.2x |
| DARTS [9] | relaxation | 1 | 73.1% | 91.0% | 4.9M | 1.2x | 595M | 1.3x |
| NSGANetV1-A3 | EA | 27 | 76.2% | 93.0% | 5.0M | 1x | 585M | 1x |
| MobileNetV2 (1.4) [5] | manual | - | 74.7% | 92.5% | 6.06M | 1.2x | 582M | 1x |
| AmoebaNet-C [17] | EA | 3,150 | 75.7% | 92.4% | 6.4M | 1.3x | 570M | 1x |
† SMBO stands for sequential model-based optimization. SENet is the abbreviation for Shake-Even 29 2x4x64d + SE. ‡ The CIFAR-100 accuracy and #Params for ENAS [44] and DARTS [9] are from [11]. #Params for AE-CNN+E2EPP are from [18].
The number of GPU-Days is calculated by multiplying the number of employed GPU cards by the execution time (in units of days).
One run of NSGANetV1 on the CIFAR-100 dataset takes approximately 27 GPU-Days to finish, averaged over five runs. The search costs of most of the peer methods are measured on the CIFAR-10 dataset, except for Block-QNN [35], which is measured on CIFAR-100. From the search cost comparison in Fig. 8b, we observe that our proposed algorithm is more efficient at identifying a set of architectures than a number of other approaches, and the set of architectures obtained has higher performance. More specifically, NSGANetV1 simultaneously finds multiple architectures while using 10x to 100x fewer GPU-Days in searching than most of the considered peer methods, including Hierarchical NAS [16], AmoebaNet [17], NASNet [8], and Proxyless NAS [43], all of which find a single architecture at a time. When compared to the peer multi-objective NAS method, LEMONADE [32], NSGANetV1 manages to obtain a better (in the Pareto dominance sense) set of architectures than LEMONADE with 3x fewer GPU-Days. Further experiments and comparisons are provided in the supplementary materials under Section VII-A.

Fig. 8: Search efficiency comparison between NSGANetV1 and other baselines in terms of (a) HV, and (b) the required compute time in GPU-Days. The search cost is measured on CIFAR-10 for most methods, except NSGANetV1 and Block-QNN [35], for which the CIFAR-100 dataset is used.
E. Observations on Evolved Architectures
Population-based approaches with multiple conflicting objectives often lead to a set of diverse solution candidates, which can be “mined” for commonly shared design principles [50]. In order to discover any patterns for more efficient architecture design, we analyzed the entire history of architectures generated by NSGANetV1. We make the following observations:
• Non-parametric operations, e.g., skip connections (identity) and average pooling (avg_pool_3x3), are effective in trading off performance for complexity. Empirically, we notice that three out of the four most frequently used operations in non-dominated architectures are non-parametric, as shown in Fig. 9a (see also supplementary materials under Section VII-B for our follow-up study; a minimal counting sketch is given after this list).
• Larger kernel sizes and parallel operations improve classification accuracy, as shown in Fig. 9a and Fig. 9b, respectively. In particular, the frequencies of convolutions with large kernel sizes (e.g., dil_conv_5x5, conv_7x1_1x7) are significantly higher in the top 20% most accurate architectures than in non-dominated architectures in general, which must also balance FLOPs. Similar findings are also reported in previous work [17], [42].
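The operation-frequency analysis behind Fig. 9a amounts to counting decoded operation labels across a set of genomes; a minimal sketch is given below, where the list of operation names is hypothetical and stands in for operations decoded from the non-dominated architectures.

```python
from collections import Counter

# Hypothetical operation labels decoded from non-dominated genomes.
non_dominated_ops = [
    "identity", "avg_pool_3x3", "identity", "dil_conv_5x5",
    "max_pool_3x3", "identity", "conv_7x1_1x7", "avg_pool_3x3",
]

freq = Counter(non_dominated_ops)
for op, count in freq.most_common():
    print(f"{op:>14s}: {count / len(non_dominated_ops):.0%}")
```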
The above common properties of the multiple final non-dominated solutions serve as important knowledge for future applications. It is noteworthy that such a post-optimal knowledge extraction process is possible only from a multi-objective optimization study, another benefit that we enjoy by posing NAS as a multi-objective optimization problem.

Fig. 9: Post-Search Analysis: (a) Frequency of each operation selected during the search. (b) Effect of the number of concatenated input channels on validation accuracy. More channels improve the predictive performance of the architectures, but adversely affect the computational efficiency.
V. FURTHER ANALYSIS
The overarching goal of NAS is to find architecture models that generalize to new instances of the task the models were trained on. We usually quantify generalization by measuring the performance of a model on a held-out testing set. Since many computer vision benchmark datasets, including the three datasets used in this paper, i.e., CIFAR-10, CIFAR-100, and ImageNet, have been the focus of intense research for almost a decade, does the steady stream of promising empirical results from NAS simply arise from overfitting of these excessively re-used testing sets? Does advancement on these testing sets imply better robustness to commonly observable corruptions in images and to adversarial images, to which the human visual system is far more robust? To answer these questions in a quantitative manner, in this section, we provide systematic studies on newly proposed testing sets from the CNN literature, followed by hyper-parameter analysis.

Fig. 10: Generalization: We evaluate the models on new and extended test sets for (a) CIFAR-10, and (b) ImageNet. Numbers in the boxes indicate the absolute drop in accuracy (%).
A. Generalization
By mimicking the documented curation process of the original CIFAR-10 and ImageNet datasets, Recht et al. [51] propose two new testing sets, CIFAR-10.1 and ImageNet-V2. Refer to the supplementary materials under Section V-A for details and examples of the new testing sets. Representative architectures are selected from each of the main categories (i.e., RL, EA, relaxation-based, and manual). The selected architectures are similar in number of parameters or FLOPs, except DenseNet-BC [3] and Inception-V1 [1]. All architectures are trained on the original CIFAR-10 and ImageNet training sets as in Section IV-C, then evaluated on CIFAR-10.1 and ImageNet-V2, respectively.
It is evident from the results summarized in Figs. 10a and 10b that there is a significant drop in accuracy of 3% to 7% on CIFAR-10.1 and 8% to 10% on ImageNet-V2 across architectures. However, the relative rank-order of accuracy on the original testing sets translates well to the new testing sets, i.e., the architecture with the highest accuracy on the original testing set (NSGANetV1 in this case) is also the architecture with the highest accuracy on the new testing sets. Additionally, we observe that accuracy gains on the original testing sets translate to larger gains on the new testing sets, especially in the case of CIFAR-10 (curvatures of red vs. blue markers in Fig. 10). These results provide evidence that extensive benchmarking on the original testing sets is an effective way to measure the progress of architectures.
B. Robustness
The vulnerability to small changes in query images may very likely prevent the deployment of deep learning vision systems in safety-critical applications at scale. Understanding the architectural advancements under the scope of robustness against various forms of corruption is still in its infancy. Hendrycks and Dietterich [52] recently introduced two new testing datasets, CIFAR-10-C and CIFAR-100-C, by applying commonly observable corruptions (e.g., noise, weather, compression, etc.) to the clean images from the original datasets.

Fig. 11: Robustness: Effect of commonly observable corruptions and adversarial attacks on (a) CIFAR-10, and (b) CIFAR-100. Higher values of $\epsilon$ indicate more severe adversarial attacks. We use prediction accuracy on the corrupted test images (from each dataset) as a measurement of robustness.
Each dataset contains images perturbed by 19 different types of corruption at five different levels of severity. More details and visualizations are provided in the supplementary materials under Section V-B. In addition, we include adversarial images as examples of worst-case corruption. We use the fast gradient sign method (FGSM) [53] to construct adversarial examples for both the CIFAR-10 and CIFAR-100 datasets. The severity of the attack is controlled via a hyper-parameter $\epsilon$ as shown below:
$$
\pmb{x}^{\prime} = \pmb{x} + \epsilon\,\mathrm{sign}\big(\nabla_{\pmb{x}}\mathcal{L}(\pmb{x}, y_{\mathrm{true}})\big),
$$
where $\pmb{x}$ is the original image, $\pmb{x}^{\prime}$ is the adversarial image, $y_{\mathrm{true}}$ is the true class label, and $\mathcal{L}$ is the cross-entropy loss. Following the previous section, we pick representative architectures of similar complexities from each of the main categories. Using the weights learned on the clean images from the original CIFAR-10/100 training sets, we evaluate each architecture's classification performance on the corrupted datasets as our measure of robustness.
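A minimal PyTorch sketch of the FGSM construction in the equation above is given below; the model, images, and labels are assumed inputs, and the clamp to [0, 1] assumes images normalized to that range.

```python
import torch
import torch.nn.functional as F


def fgsm_attack(model, x, y_true, epsilon):
    """Construct x' = x + epsilon * sign(grad_x L(x, y_true))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)  # cross-entropy loss L
    loss.backward()                           # populates x.grad
    x_adv = x + epsilon * x.grad.sign()       # one-step perturbation
    return x_adv.clamp(0.0, 1.0).detach()     # keep pixels in a valid range
```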
Our empirical findings summarized in Figs. 11a and 11b appear to suggest that a positive correlation exists between the generalization performance on clean data and data under commonly observable corruptions – i.e., we observe that NSGANetV1 architectures perform noticeably better than other peer methods’ architectures on corrupted datasets even though the robustness measurement was never a part of the architecture selection process in NSGANetV1. However, we emphasize that no architectures are considered robust to corruption, especially under adversarial attacks. We observe that the architectural advancements have translated to minuscule improvements in robustness against adversarial examples. The classification accuracy of all selected architectures deteriorates drastically with minor increments in adversarial attack severity ϵ, leading to the question of whether architecture is the “right” ingredient to investigate in pursuit of adversarial robustness. A further step towards answering this question is provided in supplementary materials under Section V-C.
C. Ablation Studies
Dataset for Search: As previously mentioned in Section III-B, our proposed method differs from most of the existing peer methods in the choice of datasets on which the search is carried out. Instead of directly following the current practice of using the CIFAR-10 dataset, we investigated the utility of searching on multiple benchmark datasets in terms of their ability to provide reliable estimates of classification accuracy and generalization. We carefully selected four datasets, SVHN [54], Fashion-MNIST [55], CIFAR-10, and CIFAR-100, for comparison. The choice was based on a number of factors, including the number of classes, the number of training examples, resolution, and required training time. We uniformly sampled 40 architectures from the search space (described in Section III) along with five architectures generated by other peer NAS methods. We trained every architecture three times with different initial random seeds and report the averaged classification accuracy on each of the four datasets in Fig. 12a. Empirically, we observe that the CIFAR-100 dataset is challenging enough for architectural differences to affect predicted performance. This can be observed in Fig. 12a, where the variation (blue boxes) in classification accuracy across architectures is noticeably larger on CIFAR-100 than on the other three datasets. In addition, we observe that mean differences in classification accuracy on CIFAR-100 between randomly generated architectures and architectures from principle-based methods have higher deviations, suggesting that it is less likely to find a good architecture on CIFAR-100 by chance.
Proxy Models: The main computational bottleneck of NAS approaches resides in evaluating the classification accuracy of the architectures by invoking the lower-level weight optimization. One such evaluation typically takes hours to finish, which limits the practical utility of the search under a constrained search budget. In our proposed algorithm, we adopt the concept of a proxy model [8], [17]. Proxy models are small-scale versions of the intended architectures, where the number of layers ($N$ in Fig. 2a) and the number of channels in each layer ($Ch_{init}$ in Fig. 2a) are reduced. Due to the down-scaling, proxy models typically require much less compute time to train†. However, there exists a trade-off between gains in computational efficiency and loss of prediction accuracy. Therefore, it is not necessarily the case that the performance of an architecture measured at proxy-model scale can serve as a reliable indicator of the architecture's performance measured at the desired scale.

Fig. 12: (a) Mean classification accuracy distribution of randomly generated architectures and architectures from peer NAS methods on four datasets. Correlation in performance (red lines) vs. savings in gradient descent wall time (blue boxes) obtained by reducing (b) the number of channels in each layer, (c) the number of layers (the proxy model's depth), and (d) the number of training epochs. Note that (b), (c) and (d) have two y-axis labels corresponding to the color of the lines.
To determine the smallest proxy model that can provide a reliable estimate of performance at a larger scale, we conducted parametric studies that gradually reduced the sizes of the proxy models of 100 randomly sampled architectures from our search space. Then, we measured the rank-order correlation and the savings in lower-level optimization compute time between the proxy models and the same architectures at the full scale. Figures 12b, 12c and 12d show the effect of numbers of channels, layers and epochs, respectively, on the training time, and the Spearman rank-order correlation between the proxy and full scale models. We make the following observations, (1) increasing the number of channels does not significantly affect the wall clock time, and (2) reducing the number of layers or training epochs significantly reduces the wall clock time but also reduces the rank-order correlation. Based on these observations and the exact trade-offs from the plots, for our proxy model, we set the number of channels to 36 (maximum desired), number of epochs to 36, and number of layers to 14. Empirically, we found that this choice of parameters offers a good trade-off between practicality of search and reliability of proxy models.
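The rank-order reliability of a proxy can be checked in a few lines with scipy; the accuracy arrays below are synthetic stand-ins for the 100 sampled architectures evaluated at proxy and full scale.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
full_acc = rng.uniform(0.90, 0.97, size=100)            # full-scale accuracies (synthetic)
proxy_acc = full_acc + rng.normal(0.0, 0.01, size=100)  # noisy proxy estimates (synthetic)

rho, _ = spearmanr(proxy_acc, full_acc)
print(f"Spearman rank-order correlation: {rho:.2f}")
```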
VI. AN APPLICATION TO CHEST X-RAY CLASSIFICATION
The ChestX-Ray14 benchmark was recently introduced in [56]. The dataset contains 112,120 high-resolution frontal-view chest X-ray images from 30,805 patients, and each image is labeled with one or multiple common thorax diseases, or "Normal" otherwise. More details are provided in the supplementary materials under Section V-C. Past approaches [56]–[58] typically extend from existing architectures, and the current state-of-the-art method [58] uses a variant of the DenseNet [3] architecture, which is designed manually by human experts. For reference purposes, we call the obtained architecture NSGANetV1-X, and we re-train the weights thoroughly from scratch with an extended number of epochs. The learning rate is gradually reduced when the AUROC on the validation set plateaus.
TABLE III: AUROC on ChestX-Ray14 testing set.
| Method | Type | #Params | Test AUROC (%) |
|---|---|---|---|
| Wang et al. (2017) [56] | manual | - | 73.8 |
| Yao et al. (2017) [57] | manual | - | 79.8 |
| CheXNet (2017) [58] | manual | 7.0M | 84.4 |
| Google AutoML (2018) [59] | RL | - | 79.7 |
| LEAF (2019) [60] | EA | - | 84.3 |
| NSGANetV1-A3 | EA | 5.0M | 84.7 |
| NSGANetV1-X | EA | 2.2M | 84.6 |
† Google AutoML result is from [60]. ‡ NSGANetV1-A3 represents results under the standard transfer learning setup.

Fig. 13: (a) NSGANetV1-X multi-label classification performance on ChestX-Ray14 and (b) the class-wise mean test AUROC comparison with peer methods.
Table III compares the performance of NSGANetV1-X with peer methods that are extended from existing manually designed architectures. This includes architectures used by the authors who originally introduced the ChestX-Ray14 dataset [56], and CheXNet [58], which is the current state-of-the-art on this dataset. We also include results from commercial AutoML systems, i.e., Google AutoML [59] and LEAF [60], as comparisons with NAS-based methods. The setup details of these two AutoML systems are available in [60]. Noticeably, the performance of NSGANetV1-X exceeds Google AutoML's by a large margin of nearly 5 AUROC points. In addition, NSGANetV1-X matches the state-of-the-art results from the human-engineered CheXNet while using 3.2x fewer parameters. For completeness, we also include the result from NSGANetV1-A3, which is evolved on CIFAR-100, to demonstrate the transfer learning capabilities of NSGANetV1.
More detailed results showing the disease-wise ROC curves of NSGANetV1-X and the disease-wise AUROC comparison with other peer methods are provided in Figs. 13a and 13b, respectively. To understand the patterns behind the disease classification decisions of NSGANetV1-X, we visualize the class activation map (CAM) [61], which is commonly adopted for localizing the discriminative regions for image classification. In the examples shown in Figs. 14a - 14f, stronger CAM areas are covered with warmer colors. We also outline the bounding boxes provided by the ChestX-Ray14 dataset [56] as references.
These results further validate the ability of our proposed algorithm to generate task-dependent architectures automatically. Conventional approaches, e.g., transfer learning from existing architectures, can be effective in yielding similar performance; however, as demonstrated by NSGANetV1, simultaneously considering complexity along with performance in an algorithmic fashion allows architectures to be practically deployed in resource-constrained environments. We observe this phenomenon in another application of NSGANetV1 to keypoint prediction on cars (see the supplementary materials under Section V-D).

Fig. 14: Examples of class activation maps [61] of NSGANetV1-X, highlighting the class-specific discriminative regions. The ground truth bounding boxes are plotted over the heatmaps.
VII. CONCLUSIONS
In this paper, we have presented NSGANetV1, an evolutionary multi-objective algorithm for neural architecture search. NSGANetV1 explores the design space of architectures through recombining and mutating architectural components. NSGANetV1 further improves the search efficiency by exploiting the patterns among the past successful architectures via distribution estimation through a Bayesian Network model. Experiments on CIFAR-10, CIFAR-100, and ImageNet datasets have demonstrated the effectiveness of NSGANetV1. Further analysis towards validating the generalization and robustness aspects of the obtained architectures is also provided along with an application to common thorax disease classification on human chest X-rays. We believe these results are encouraging and demonstrate the importance of customized and efficient evolutionary algorithms for neural architecture search in achieving superior performance compared to other contemporary machine learning methods.
ACKNOWLEDGEMENTS
This material is based in part upon work supported by the National Science Foundation under Cooperative Agreement No. DBI-0939454. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
APPENDIX
In this supplementary document, we provide additional details on (1) related works in Section A; (2) multi-objective related issues in NAS in Section B; (3) bilevel optimization in Section C; (4) layer operations in Section D; (5) datasets in Section E; (6) implementation in Section F; (7) other potential utilities of our proposed algorithm in Section G; and (8) hypervolume in Section H.
A. Related Work Continued
Existing NAS approaches can be broadly classified into evolutionary algorithm (EA), reinforcement learning (RL), and relaxation-based approaches – with a few additional methods falling outside these categories.
Reinforcement Learning (RL): $Q$-learning [62] is a widely used value iteration method for RL. The MetaQNN method [49] employs an $\epsilon$-greedy $Q$-learning strategy with experience replay to search connections between convolution, pooling, and fully connected layers, and the operations carried out inside the layers. Zhong et al. [35] extended this idea with the BlockQNN method. BlockQNN searches the design of a computational block with the same $Q$-learning approach. The block is then repeated to construct a network, resulting in a much more general network that achieves better results than its predecessor on CIFAR-10 [34]. A policy gradient method seeks to approximate non-differentiable reward functions in order to train a model that requires parameter gradients, like a neural network architecture. Zoph and Le [7] first applied this method in NAS to train a recurrent neural network controller that constructs networks. The original method in [7] uses the controller to generate the entire network at once. This contrasts with its successor, NASNet [8], which designs a convolutional and pooling block that is repeated to construct a network. NASNet outperforms its predecessor and produces a network achieving state-of-the-art performance on CIFAR-10 and ImageNet.
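As a point of reference, the $\epsilon$-greedy selection rule used by such $Q$-learning controllers can be sketched in a few lines; the q_values list (one entry per candidate action, e.g., a layer choice) is a hypothetical placeholder.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```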
Relaxation-based Approaches and Others: Approximating the connectivity between different layers in CNN architectures by real-valued variables weighting the importance of each layer is the common principle of relaxation-based NAS methods. Liu et al. first implement this idea in the DARTS algorithm [9]. DARTS seeks to improve search efficiency by fixing the weights while updating the architectures, showing convergence on both CIFAR-10 and Penn Treebank [63] within one day in wall clock time on a single GPU card. Subsequent approaches in this line of research include [10]–[12], [64]. The search efficiency of these approaches stems from weight sharing during the search process. This idea is complementary to our approach and can be incorporated into NSGANetV1 as well. However, it is beyond the scope of this paper and is a topic of future study.
Methods not covered by the EA-, RL- or relaxation-based paradigms have also shown success in architecture search. Liu et al. [31] proposed a method that progressively expands networks from simple cells and only trains the best $K$ networks that are predicted to be promising by an RNN meta-model of the encoding space. PPP-Net [65] extended this idea to use a multi-objective approach, selecting the $K$ networks based on their Pareto-optimality when compared to other networks. Li and Talwalkar [45] show that an augmented random search approach is an effective alternative to NAS. Kandasamy et al. [66] present a Gaussian-process-based approach to optimize network architectures, viewing the process through a Bayesian optimization lens.
Multi-objective NAS through Scalarization: A portfolio of works aiming to design hardware-specific network architectures has emerged. This includes Proxyless NAS [43], MnasNet [67], FBNet [12], and MobileNetV3 [68], which use a scalarized objective that encourages high accuracy and penalizes compute inefficiency at the same time, e.g., maximize $Acc \times (Latency/Target)^{-0.07}$. These methods require a pre-defined preference weighting of the importance of different objectives before the search, which in itself requires a number of trials.
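For reference, the scalarization quoted above can be written as a one-line reward; the argument names are ours, and the exponent $w = -0.07$ follows the formula in the text.

```python
def scalarized_reward(acc, latency, target, w=-0.07):
    """Soft-constraint objective: Acc * (Latency / Target)^w, to be maximized."""
    return acc * (latency / target) ** w
```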
Weight Sharing: Another recently proposed approach for improving the search efficiency of NAS is through weight sharing. Approaches in this category involve training a supernet that contains all searchable architectures as its subnets. They can be broadly classified into two categories depending on whether the supernet training is coupled with architecture search or decoupled into a two-stage process. Approaches of the former kind [9], [43], [44] are computationally efficient but return sub-optimal models. Numerous studies [45], [46], [69] allude to weak correlation between performance at the search and final evaluation stages. Methods of the latter kind [70], [71] use performance of subnets (obtained by sampling the trained supernet) as a metric to select architectures during search. However, training a supernet beforehand for each new task is computationally prohibitive.
B. Multi-objective Optimization in NAS
In addition to high predictive accuracy, real-world applications demand NAS algorithms to simultaneously balance a few other network complexity related objectives that are specific to the deployment scenarios. For instance, mobile or embedded devices often have restrictions in terms of model size, multiply-adds, latency, power consumption, and memory footprint.
It has been a common observation in the deep learning literature that classification performance is positively correlated with the complexity of the neural network. Since we want to maximize one objective (performance) while minimizing the other (FLOPs), they constitute a conflicting scenario. Optimizing a single composite objective obtained by weighting the two objectives into one will produce a neural architecture and weight combination that may be too complex (requiring more FLOPs) or too inaccurate (having lower accuracy). A generative approach of simply applying single-objective optimization repeatedly does not solve the issue: (i) many common scalarization methods do not work if the interesting optimal solutions lie on the non-convex part of the efficient frontier, and (ii) generative methods are more computationally expensive (due to the lack of any parallel search efforts) than simultaneous methods, such as the method used in this paper.
In particular, ResNet [2] showed that classification accuracy on ImageNet continuously improves as the number of layers increases from 18 (2G FLOPs) to 152 (11G FLOPs). Similar trends are also observed in DenseNet [3], NASNet [8], EfficientNet [13], etc. The aforementioned observation implies the competing nature of the objectives of simultaneously maximizing classification performance and minimizing complexity in terms of FLOPs. Additionally, posing NAS as a multi-objective problem is beneficial from the decision-making perspective, as it allows designers to choose a suitable network architecture a posteriori, as opposed to requiring a pre-defined preference weighting of each objective prior to the search. Empirically, we also observe that the type of diversity provided by multi-objective optimization contributes to its outperformance on the classification accuracy objective relative to single-objective optimization. (This can be seen, for example, by comparing NSGANetV1-Ax from the main paper and NSGANetV1-Bx from Table V in this supplementary material.)
C. Bilevel Optimization in NAS
Recall that we formulate the problem of designing custom architectures for different deployment scenarios as a bilevel multi-objective NAS problem, mathematically as below:
$$
\begin{aligned}
\underset{\pmb{x}}{\text{minimize}} \quad & F(\pmb{x}) = \big(f_{1}(\pmb{x}; \pmb{w}^{*}(\pmb{x})),\, f_{2}(\pmb{x})\big)^{T}, \\
\text{subject to} \quad & \pmb{w}^{*}(\pmb{x}) \in \operatorname*{argmin}_{\pmb{w}} \, \mathcal{L}(\pmb{w}; \pmb{x}), \\
& \pmb{x} \in \Omega_{x}, \quad \pmb{w} \in \Omega_{w},
\end{aligned}
$$
The bilevel formulation used above arises from the problem nature of NAS, where evaluating the upper-level objective function $f_{1}$ requires both an architecture $\pmb{x}$ and its weights $\pmb{w}$. $f_{1}$ is not meaningfully defined at any arbitrary $\pmb{w}$; rather, it requires $\pmb{w}$ to be a member of the set that minimizes the cross-entropy loss $\mathcal{L}$ (in the case of image classification) on the training data given $\pmb{x}$, mathematically $\pmb{w}^{*}(\pmb{x}) \in \operatorname*{argmin}_{\pmb{w}} \mathcal{L}(\pmb{w}; \pmb{x})$. For simplicity, we use $\pmb{w}^{*}(\pmb{x})$ to denote weights that satisfy the previously specified condition. However, $\pmb{w}^{*}(\pmb{x})$ is typically not analytically computable due to the non-linearities in the layers and activation functions encoded by $\pmb{x}$, requiring another (lower) level of optimization.
The principle of a bilevel optimization is that the upper-level objectives and constraints must be computed using the optimal lower-level variables $\pmb{w}^{*}(\pmb{x})$ for the corresponding upper-level variables $\pmb{x}$. Thus, in an implicit manner, $f_{1}$ becomes a function of $\pmb{x}$ alone, giving $f_{1}(\pmb{x}; \pmb{w}^{*}(\pmb{x}))$. Using a $(1,1)$ evolution strategy search as an illustration, when $\pmb{x}_{t+1} \gets \pmb{x}_{t} + \Delta\pmb{x}$ is altered at iteration $t$ in the upper level, we first optimize the corresponding lower-level problem of $\mathcal{L}(\pmb{w}; \pmb{x}_{t+1})$ to find $\pmb{w}^{*}(\pmb{x}_{t+1})$ (through SGD). We then compute $f_{1}(\pmb{x}_{t+1}, \pmb{w}^{*}(\pmb{x}_{t+1}))$, and update the current $\pmb{x}_{t} \gets \pmb{x}_{t+1}$ if $f_{1}(\pmb{x}_{t+1}, \pmb{w}^{*}(\pmb{x}_{t+1}))$ is better than $f_{1}(\pmb{x}_{t}, \pmb{w}^{*}(\pmb{x}_{t}))$, i.e., $f_{1}(\pmb{x}_{t+1}, \pmb{w}^{*}(\pmb{x}_{t+1})) < f_{1}(\pmb{x}_{t}, \pmb{w}^{*}(\pmb{x}_{t}))$. A bilevel problem allows a more convenient way to optimize a hierarchical problem having two distinct hierarchical variable sets (such that a weight vector only makes sense when an architecture is provided) than a single-level formulation in which both $\pmb{x}$ and $\pmb{w}$ are considered at the same level: first, the search space becomes huge, and second, a good $\pmb{x}$ may be deleted simply because the respective $\pmb{w}$ is not good.
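The $(1,1)$ evolution strategy illustration above can be restated as the following minimal sketch; mutate, train_weights (the lower-level SGD solver returning $\pmb{w}^{*}(\pmb{x})$), and f1 are supplied as callables and are placeholders for the components described in the text.

```python
def bilevel_es(x0, mutate, train_weights, f1, iterations=30):
    """(1,1)-ES at the upper level; each candidate is scored with optimized weights."""
    x_t = x0
    best = f1(x_t, train_weights(x_t))           # f1(x_t, w*(x_t))
    for _ in range(iterations):
        x_new = mutate(x_t)                      # upper-level move: x_t + dx
        f_new = f1(x_new, train_weights(x_new))  # lower-level optimization, then score
        if f_new < best:                         # accept only if f1 improves
            x_t, best = x_new, f_new
    return x_t, best
```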
Our NSGANetV1 algorithm puts two types of optimization problems together: a multi-objective architecture search at the upper level and a single-objective weight-learning problem at the lower level.
Thus, the final outcome of our approach is a set of trade-off architectures and their associated weight vectors, thereby completely specifying neural networks. For the upper level, we employ a customized and advanced NSGA-II-like evolutionary multi-objective algorithm, so that a set of non-dominated trade-off solutions is obtained at each iteration. The lower level uses the stochastic-gradient-based back-propagation algorithm for weight learning. Despite the fact that our approach also hybridizes a global search (the EMO algorithm at the upper level) and a local search (SGD) into a unified paradigm, as is done in memetic algorithms, our bilevel approach is conceptually very different from memetic computing. To be more specific, the local search used in memetic algorithms mainly improves an individual's fitness within its local neighborhood, while the local search (SGD) used in our approach finds the remaining set of variables (lower-level variables; weights), which jointly with the upper-level variables (architectures) compute the objective function (classification error).
Our bilevel approach is nested in nature, meaning that for each $\pmb{x}$ at the upper level, a respective optimized $\pmb{w}^{*}$ is found by using the back-propagation method. However, the NAS at the upper level is expedited by using a Bayesian learning method over already-found good solutions and by using customized coding and genetic operators. Although more sophisticated surrogate-assisted bilevel algorithms, such as BLEAQ or BLEAQ2 [72], [73], can be used, in this work we keep the methods relatively simple, use learning-assisted EMO, and use only 1,200 architecture evaluations to achieve the results.
D. Details of the Considered Layer Operations
As described in Section III-A, we form an operation pool consisting of 12 different choices of convolution, pooling, and other operations, based on their prevalence in the CNN literature. Most of these operations can be directly called from standard deep learning libraries, like PyTorch, TensorFlow, Caffe, etc. Here we provide demo PyTorch code for less commonly used‡ operations, including depth-wise separable convolutions, local binary convolutions, and 1x7 then 7x1 convolution.
E. Datasets Details
Examples from CIFAR-10, CIFAR-100, and ImageNet are provided in Fig. 16.
- CIFAR-10.1 and ImageNet-V2: In this work, we use the Matched Frequency version of the ImageNet-V2 dataset. The curation details, along with a discussion of the differences among the three versions, are available in [51]. Examples randomly sampled from these two new testing sets are provided in Figs. 17a and 17b, respectively. CIFAR-10.1 is available for download at https://github.com/modestyachts/CIFAR-10.1, and ImageNet-V2 is available at https://github.com/modestyachts/ImageNetV2.
- CIFAR-10-C and CIFAR-100-C: There are in total 19 different commonly observable corruption types considered in both CIFAR-10-C and CIFAR-100-C, including Gaussian noise, shot noise, impulse noise, defocus blur, frosted glass blur, motion blur, zoom blur, snow, frost, fog, brightness, contrast, elastic, pixelate, JPEG, speckle noise, Gaussian blur, spatter, and saturate. Fig. 18a provides examples for visualization. For every corruption type, there are five different levels of severity; see Fig. 18b for visualization. Both datasets are available from the original authors' GitHub page at https://github.com/hendrycks/robustness. A demo visualization of adversarial examples created by applying FGSM [53] on the MNIST dataset is provided in Fig. 18c.
- ChestX-Ray14: ChestX-Ray14 is a hospital-scale chest X-ray database containing 112,120 frontal-view X-ray images of size 1,024x1,024 pixels from 30,805 unique patients. The database is labeled using natural language processing techniques applied to the associated radiological reports stored in hospitals' Picture Archiving and Communication Systems (PACS). Each image can have one or multiple common thoracic diseases, or "Normal" otherwise. Visualizations of example X-ray images from the database are provided in Fig. 19. The dataset is publicly available from the NIH at https://nihcc.app.box.com/v/ChestXray-NIHCC. We follow the train_val_list.txt and test_list.txt provided along with the X-ray images to split the database for training, validation, and testing.

Fig. 15: Visualization of block-level structures for different architectures. The Normal and Reduction blocks are shown in the first and second rows, respectively for NSGANetV1 architectures. Examples of blocks that are designed manually by experts [2], [5] and from other peer methods [8] are also included in (d) - (f) for comparison.

Fig. 16: Examples from CIFAR-10, CIFAR-100 and ImageNet datasets. Images in each row belong to the same class with label names shown to the left.
F. Implementation Details Continued
Evaluating a neural network architecture's performance is computationally expensive, e.g., one evaluation on the CIFAR-10 dataset takes more than 30 minutes. In general, our (GA-related) hyper-parameter choices represent the minimal number of function evaluations required to reproduce the claimed performance. To be more specific, in our proposed algorithm, each architecture is encoded with a 40-position, integer-valued string. Our choice of a population size of 40 corresponds to one individual per variable dimension, which follows one of the common suggestions in the GA literature on the minimal required population size. Empirically, we observed that the hypervolume stabilized by generation 30 (see Fig. 9a in the revised main paper); hence, we chose to terminate the proposed algorithm at generation 30. Other hyper-parameter choices are discussed in Section V.C of the revised main paper.
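As a concrete illustration of these choices, the snippet below initializes a population of 40 genomes, each a 40-position integer string; the per-position alphabet size is a hypothetical placeholder, since different positions encode different architectural decisions.

```python
import numpy as np

GENOME_LEN = 40   # integer-valued string encoding one architecture
POP_SIZE = 40     # one individual per variable dimension
NUM_CHOICES = 12  # hypothetical alphabet size per position

rng = np.random.default_rng(0)
population = rng.integers(0, NUM_CHOICES, size=(POP_SIZE, GENOME_LEN))
print(population.shape)  # (40, 40)
```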
Fig. 17: Visualization of CIFAR-10.1 (a) and ImageNet-V2 (b). Examples are randomly sampled from the datasets.
G. Follow-up Studies
- Single-Objective NSGANetV1: Despite the superior effectiveness and efficiency of the proposed algorithm, the computational overhead of 27 GPU-Days for NSGANetV1 can be infeasible for users with few GPU cards. Towards understanding the overall search wall-time limit of NSGANetV1, as well as comparing to peer methods that use fewer GPU-Days to execute the search, the following experiment was performed. We minimized the differences in search setup by dropping the second objective of minimizing FLOPs and changing the search dataset to CIFAR-10. We also reduced the population size by half and performed early termination at one and four GPU-Days. The obtained architectures are named NSGANetV1-B0 and NSGANetV1-B1, respectively.
(a) Separable Convolution
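For sub-part (a), a representative sketch of a depth-wise separable convolution block in the style popularized by NASNet/DARTS is given below; the exact stacking (applied twice) and the affine flag mirror common practice and are assumptions rather than our exact implementation.

```python
import torch.nn as nn

class SepConv(nn.Module):
    """Depth-wise separable convolution, applied twice (NASNet/DARTS style)."""
    def __init__(self, C, kernel_size, stride, affine=True):
        super(SepConv, self).__init__()
        pad = kernel_size // 2
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C, C, kernel_size, stride=stride, padding=pad, groups=C, bias=False),
            nn.Conv2d(C, C, 1, bias=False),  # point-wise mixing
            nn.BatchNorm2d(C, affine=affine),
            nn.ReLU(inplace=False),
            nn.Conv2d(C, C, kernel_size, stride=1, padding=pad, groups=C, bias=False),
            nn.Conv2d(C, C, 1, bias=False),
            nn.BatchNorm2d(C, affine=affine),
        )

    def forward(self, x):
        return self.op(x)
```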
TABLE IV: Demo PyTorch implementation of separable convolution (a), local binary convolution (b), and 1x7 then 7x1 convolution (c) used in NSGANetV1.
```python
import torch.nn as nn

class Conv1x7Then7x1(nn.Module):
    def __init__(self, C, stride, affine=True):
        super(Conv1x7Then7x1, self).__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C, C, (1, 7), stride=(1, stride), padding=(0, 3), bias=False),
            nn.Conv2d(C, C, (7, 1), stride=(stride, 1), padding=(3, 0), bias=False),
            nn.BatchNorm2d(C, affine=affine),
        )

    def forward(self, x):
        return self.op(x)
```
(b) Local Binary Convolution
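For sub-part (b), a sketch of a local binary convolution layer following LBCNN [6] is given below; the anchor-filter sparsity, the number of anchor channels, and the 3x3 kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LocalBinaryConv(nn.Module):
    """Fixed sparse {-1, 0, +1} anchor filters followed by a learnable 1x1 conv."""
    def __init__(self, C, stride, num_anchors=None, sparsity=0.5, affine=True):
        super(LocalBinaryConv, self).__init__()
        anchors = num_anchors or 4 * C
        self.anchor = nn.Conv2d(C, anchors, 3, stride=stride, padding=1, bias=False)
        with torch.no_grad():
            weight = torch.sign(torch.randn_like(self.anchor.weight))
            mask = (torch.rand_like(self.anchor.weight) < sparsity).float()
            self.anchor.weight.copy_(weight * mask)  # sparse binary anchors
        self.anchor.weight.requires_grad = False     # anchors stay fixed
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(anchors, C, 1, bias=False),    # learnable linear combination
            nn.BatchNorm2d(C, affine=affine),
        )

    def forward(self, x):
        return self.op(self.anchor(x))
```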
(c) 1x7 convolution then 7x1 convolution

TABLE V: NSGANetV1 with a single objective of maximizing classification accuracy on CIFAR-10 and early termination.
| Method | Type | #Params | Top-1 Acc. | GPU-Days |
|---|---|---|---|---|
| Genetic CNN [14] | EA | - | 92.90% | 17 |
| AE-CNN+E2EPP [19] | EA | 4.3M | 94.70% | 7 |
| ENAS [44] | RL | 4.6M | 97.11% | 0.5 |
| DARTS [9] | differential | 3.3M | 97.24% | 1 |
| NSGANetV1-B0 | EA | 3.3M | 96.15% | 1 |
| NSGANetV1-B1 | EA | 3.3M | 97.25% | 4 |


(c) Adversarial examples from FGSM [53] on MNIST.

Fig. 18: Visualization of different types of corruptions at different levels of severity. Examples are from [52]. Both CIFAR-10-C and CIFAR-100-C are constructed by applying corruptions to the original test sets. A demo visualization of adversarial examples from FGSM on MNIST is shown in (c).
Fig. 19: Visualization of the ChestX-ray14 dataset. Examples showing eight common thoracic diseases are from [56].
Results in Table V confirm that our proposed algorithm can be more efficient in GPU-days than the other two EA-based peer methods, Genetic CNN [14] and AE-CNN+E2EPP [19]. Specifically, NSGANetV1 obtains the B1 architecture in three fewer GPU-days than AE-CNN+E2EPP, and the B1 architecture is both more accurate on CIFAR-10 classification and less complex in its number of parameters. Owing to weight sharing, which partially eliminates the back-propagation weight-learning process, ENAS [44] and DARTS [9] remain more efficient in GPU-days than our proposed method. The weight-sharing method could in principle be applied to NSGANetV1 as well; however, this is beyond the scope of this paper.
- Effectiveness of Non-learnable Operations: Our post-optimization analysis of the evolved architectures, presented in Section IV-E, revealed some interesting findings, one of which is the effectiveness of non-parametric operations, e.g., identity mapping and average/max pooling, in trading off classification performance for architectural complexity. To further validate this observation, we consider an expanded range of operations including both non-parametric and weight-fixed operations, which we collectively call non-learnable operations in this paper. We manually construct such layers by concatenating multiple non-learnable operations in parallel, as sketched below. The obtained results are shown in Figs. 20a - 20c.
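A minimal sketch of one such manually constructed layer is given below. The specific choice of parallel operations and the 1x1 channel-reduction at the end are our assumptions for illustration; the configurations actually evaluated are those in Fig. 20.

```python
import torch
import torch.nn as nn

class NonLearnableLayer(nn.Module):
    # Sketch: concatenate several non-learnable operations in parallel
    # (identity, average pooling, max pooling), then reduce channels.
    def __init__(self, C):
        super(NonLearnableLayer, self).__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # 1x1 projection back to C channels; in a fully non-learnable
        # variant this projection could itself be weight-fixed.
        self.proj = nn.Conv2d(3 * C, C, 1, bias=False)

    def forward(self, x):
        return self.proj(torch.cat([op(x) for op in self.ops], dim=1))
```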
Our preliminary results on manually constructed non-learnable layers are very encouraging. In addition to performance comparable to regular, fully learned layers, non-learnable layers offer a unique advantage for multi-tasking network architectures in the form of reusable weights, since the weights are task-agnostic (not specifically learned on a particular task). We believe that designing a dedicated search algorithm to shape the construction of these non-learnable layers is a promising direction for NAS towards the automatic design of multi-tasking architectures.
- Robustness Against Adversarial Attacks: Based on our analysis in Section V-B, years of architectural advancements have translated into minuscule improvements in robustness against adversarial examples. A simple one-iteration attack strategy like FGSM [53] is enough to construct examples that reduce many modern DNN classifiers to random guessing (see Fig. 12 for examples). In this section, we make an effort to improve adversarial robustness from the architectural perspective. The search space used in the main paper covers both layer operations and layer connections (see Section III-A). To isolate the effect of these two aspects on adversarial robustness, we fix the layer operation to the basic residual module [2] and search over the connections among these modules to improve both classification accuracy on clean images and robustness against adversarial examples.
Designing a measure/objective for robustness against adversarial attacks is an area of active research (e.g., [74]). For our purposes, we present a possible measure here, illustrated in Fig. 21. Using the FGSM attack presented in [53], this robustness objective progressively increases the noise produced by FGSM. The $\epsilon$ axis in Fig. 21 refers to the hyper-parameter in the FGSM equation,
$$
\pmb{x}^{\prime}=\pmb{x}+\epsilon\,\mathrm{sign}(\nabla_{\pmb{x}}\mathcal{L}(\pmb{x},y_{\mathrm{true}})),
$$

Fig. 20: Preliminary experiment on constructing DNN architectures using layers with non-learnable weights. Each layer is composed of several non-learnable operations in parallel. We manually constructed a handful of such layers and evaluated them on CIFAR-10 (a). An example configuration, chosen based on the trade-off between accuracy and the number of parameters, is shown in (b). We validate the effectiveness of non-learnable layers by replacing the original $3\times3$ convolution in different ResNet models with the chosen configuration on both CIFAR-100 and ImageNet (c). Evidently, layers with non-learnable weights are capable of yielding competitive classification performance while being computationally efficient compared to conventional learnable convolutions.
where $\pmb{x}$ is the original image, $\pmb{x}^{\prime}$ is the adversarial image, $y_{\mathrm{true}}$ is the true class label, and $\mathcal{L}$ is the cross-entropy loss. Therefore, for this experiment, we seek to maximize two objectives, namely, classification accuracy and the robustness objective defined above.
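To make the attack concrete, a minimal PyTorch sketch of the one-step FGSM perturbation is given below; the `fgsm_attack` helper and the assumption that inputs are normalized to $[0, 1]$ are ours, not from the original implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y_true, epsilon):
    # One-step FGSM: perturb x along the sign of the loss gradient.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # assumes inputs in [0, 1]
```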
The setup for the robustness experiment is as follows. We use the official CIFAR-10 training data, with 10,000 images reserved for validation and the remaining 40,000 used for training. Each network is encoded with three blocks using the macro-space encoding from our previous work [20]. In each phase, a maximum of six nodes may be active, where the computation at each node is a 3x3 convolution followed by ReLU and batch normalization. Each network is trained for 20 epochs with SGD on a cosine-annealed learning-rate schedule. The epsilon values used in the FGSM robustness calculation are [0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15]. As before, NSGANetV1 initiates the search with 40 randomly created network architectures, and 40 new network architectures are created at each generation (iteration) via genetic operations (see the main paper for details). The search is terminated at 30 generations.
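For reference, the setup above can be summarized in a single configuration sketch; the dictionary and its key names are our own consolidation, not from the released code.

```python
# Consolidated view of the robustness-search setup described above.
ROBUSTNESS_SEARCH_CONFIG = {
    "dataset": "CIFAR-10",
    "num_train": 40_000,
    "num_val": 10_000,
    "num_blocks": 3,                 # macro search space from [20]
    "max_nodes_per_block": 6,        # 3x3 conv -> ReLU -> BN at each node
    "train_epochs": 20,
    "optimizer": "SGD",
    "lr_schedule": "cosine",
    "fgsm_epsilons": [0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15],
    "population_size": 40,
    "num_generations": 30,
}
```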

Fig. 21: Robustness Objective: We define a robustness objective under the FGSM [53] attack as follows: 1) obtain the classification accuracy on adversarial images generated by FGSM as we vary $\epsilon$; 2) compute the area under the curve (blue line), approximated by the area of the green region; 3) normalize the robustness value by the rectangular area formed by the ideal point and the nadir point, where the ideal point is defined as $100\%$ accuracy at the pre-defined maximum $\epsilon$ value, and the nadir point as the accuracy of random guessing at $\epsilon=0$ (clean images).
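The computation described in the caption can be sketched as follows; this is one plausible implementation of the measure, with `num_classes` assumed to be 10 for CIFAR-10.

```python
import numpy as np

def robustness_objective(accuracies, epsilons, num_classes=10):
    # accuracies[i] in [0, 1]: accuracy on FGSM images at epsilons[i].
    nadir = 1.0 / num_classes                      # random guessing
    eps_range = epsilons[-1] - epsilons[0]
    # area under the accuracy-vs-epsilon curve, above the nadir level
    auc = np.trapz(np.maximum(np.asarray(accuracies) - nadir, 0.0), epsilons)
    # normalize by the ideal-to-nadir rectangle (100% accuracy over eps range)
    return auc / ((1.0 - nadir) * eps_range)
```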

Fig. 22: Trade-off frontier of the robustness experiments. Color indicates the generation (iteration) at which a network architecture is eliminated from the surviving parent population. The size of each point is proportional to the network architecture’s number of trainable parameters. We note that networks from later generations form the Pareto front (dark blue points).
Empirically, we observe a clear trade-off between accuracy and robustness, as shown in Fig. 22. Visualizations of the non-dominated architectures are provided in Fig. 23c. In our opinion, NSGANetV1 is useful for capturing the patterns that differentiate architectures suited to competing objectives. We find that “wide” networks (like ResNeXt [42] or Inception blocks [1]) appear to provide good accuracy on standard benchmark images but are fragile to the FGSM attack. On the other hand, “deep” networks (akin to ResNet [2] or VGG [75]) are more robust to the FGSM attack while having lower accuracy. This phenomenon is illustrated with examples in Figs. 23a and 23b, respectively. Furthermore, the skip connection that bypasses the entire block’s computation appears to be critical for obtaining a network that is robust to adversarial attacks; see Figs. 24a and 24b.

Fig. 23: (a) Examples of the discovered computational blocks with high classification accuracy; for these networks, the mean accuracy and robustness objectives are 0.8543 and 0.0535, respectively. (b) Examples of the discovered computational blocks with high robustness against the FGSM attack; the mean accuracy and robustness objectives are 0.8415 and 0.1036, respectively. (c) Examples of the computational blocks discovered along the Pareto front that provide an efficient trade-off between classification accuracy and adversarial robustness. They are arranged in order of descending accuracy and ascending robustness.
- An Application to Multi-view Car Alignment: In addition to object classification, dense image prediction (e.g., object alignment, human body pose estimation, semantic segmentation, etc.) is another class of problems of great importance to computer vision. Dense image prediction assigns a class label to each pixel in the query image, as opposed to a single label for the entire image in the case of classification. In this section, we apply NSGANetV1 to the problem of multi-view car keypoint alignment.
We use the CMU-Car dataset originally introduced in [76]. The dataset contains around 10,000 car images in different orientations, environments, and occlusion conditions. In this case, we search for the path of image-resolution changes, similar to [64]. The node-level structure is kept fixed, using the basic residual unit [2]. The performance of an architecture is measured by the root mean square (RMS) error between the predicted heatmap and the ground truth for each keypoint; more details are available in [76]. We use FLOPs as the second objective to measure architectural complexity. The obtained architectures are named NSGANetV1-C0 and NSGANetV1-C1. The results are provided in Table VI, and a visualization of the architectures is provided in Fig. 25.
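As a rough sketch of this evaluation measure (the exact protocol is described in [76]), the per-keypoint heatmap RMS error can be computed as follows; the tensor layout is an assumption.

```python
import torch

def keypoint_rms_error(pred_heatmaps, gt_heatmaps):
    # Both tensors assumed shape (batch, num_keypoints, H, W).
    per_keypoint_mse = ((pred_heatmaps - gt_heatmaps) ** 2).mean(dim=(0, 2, 3))
    return per_keypoint_mse.sqrt().mean()  # average RMS over keypoints
```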

Fig. 24: Parallel-coordinate plots of the 1,200 network architectures sampled by NSGANetV1. Each line represents a network architecture, and each vertical axis is an attribute associated with the network. (a) Networks with the skip-connection bit inactive; none of them score well on robustness against adversarial attacks. (b) Networks with the skip-connection bit active. This skip-connection bit refers to the connection that goes past all computation within a phase, as a normal residual connection would. When the skip connection is active, the networks cover the full range of adversarial robustness.
TABLE VI: Preliminary results on CMU-Car alignment [76]. Notably, our proposed algorithm is able to find architectures with competitive performance while using less than half the parameters of the human-designed architecture [77].
| Architectures | Params. (M) | FLOPs (M) | Regression Error (%) |
|---|---|---|---|
| Hourglass [77] | 3.38 | 3613 | 7.80 |
| NSGANetV1-C0 | 1.53 | 2584 | 8.66 |
| NSGANetV1-C1 | 1.61 | 2663 | 8.64 |

Fig. 25: Spatial-resolution-change path of the NSGANetV1-C0 architecture. Each circle encodes a residual module [2]. Circles colored white are always executed at the beginning. The arrows and blue circles are part of the search decisions. The total number of layers (L) is set to nine.
- Ablation Study on the Exploitation Operator: Recall that we use a Bayesian network (BN) as a probabilistic model to estimate the distribution of the Pareto set (of architectures). In this section, we first explain the connection of our proposed
BN-based distribution-estimation operator to existing works [78]–[80] on large-scale multi-objective optimization algorithms for general numerical problems. The common theme behind these works is to learn the correlation among decision variables to reduce the dimensionality, either through grouping (optimizing a subset of variables at a time) or embedding (projecting to a lower-dimensional space). In our work, we exploit the problem information (i.e., network architectures are variants of directed acyclic graphs) explicitly in the form of a BN to learn the correlations (i.e., BN edge weights) among architectural variables. The learned BN is then used (as a probabilistic model) to generate the remaining variables given the observed variables, as a form of dimensionality reduction. Thus, in this work, we take advantage of learning algorithms that capture the properties of good solutions to deal with the large dimensionality of the problem. In short, our approach in the NAS application and general-purpose EMO algorithms share a similar concept of dimensionality reduction for handling large-scale problems.
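To illustrate the mechanism (not the exact model structure used in NSGANetV1), the sketch below fits a BN to an archive of past Pareto-set encodings and samples a new architecture string; the chain structure and Laplace smoothing are simplifying assumptions made here for brevity.

```python
import numpy as np

def sample_from_bn(archive, num_choices, rng, eps=1.0):
    # archive: (N, D) integer array of past Pareto-optimal encodings,
    # entries in [0, num_choices). Chain BN: x1 -> x2 -> ... -> xD.
    N, D = archive.shape
    counts = np.bincount(archive[:, 0], minlength=num_choices) + eps
    x = [rng.choice(num_choices, p=counts / counts.sum())]
    for d in range(1, D):
        # conditional P(x_d | x_{d-1}) estimated from the archive
        mask = archive[:, d - 1] == x[-1]
        counts = np.bincount(archive[mask, d], minlength=num_choices) + eps
        x.append(rng.choice(num_choices, p=counts / counts.sum()))
    return np.array(x)

# e.g., offspring = sample_from_bn(archive, 10, np.random.default_rng(0))
```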
Secondly, we study the effectiveness of the proposed BN-based exploitation operator. The experimental setup follows a two-objective NAS optimization that maximizes top-1 validation accuracy on the Fashion-MNIST dataset [55] while simultaneously minimizing #FLOPs. We study five different settings of the proposed exploitation operator, which differ in when the operator is activated during the search.
We use the same population size of 40 and a maximum of 30 generations for each of the considered settings, and we repeat each setting 11 times with different random seeds to capture the variance arising from different initial populations. The obtained results are provided in Fig. 26.

Fig. 26: Ablation study on the effectiveness of our proposed exploitation operator under different settings.
Empirically, we observe that our proposed exploitation operator provides a noticeable improvement to the overall algorithm’s performance, as measured by hypervolume. However, the margin of improvement quickly diminishes if we activate the operator too early (i.e., before 1/3 of the computation budget is spent) or too close to the total budget (i.e., after 3/4 of the computation budget is spent).
H. Hypervolume Calculation
For the two-objective experiments presented in Section IV of the main paper, the reference point used in computing the hypervolume metric is $(100, 1{,}000)$, where 100 is the worst error rate in percentage and $1{,}000$ is the highest #FLOPs, in millions, of any architecture that our search space can encode. We then normalize the hypervolume by the rectangular area formed by the reference point and the ideal point, i.e., $(0, 0)$.
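For concreteness, a minimal sketch of this normalized two-objective hypervolume computation is given below (both objectives minimized; the rectangle-decomposition sweep is a standard choice, not necessarily the exact routine we used).

```python
import numpy as np

def normalized_hypervolume_2d(points, ref=(100.0, 1000.0), ideal=(0.0, 0.0)):
    # points: (N, 2) array of (error rate %, #FLOPs in millions).
    pts = np.asarray(points, dtype=float)
    # keep points that dominate the reference point, sort by first objective
    pts = pts[(pts[:, 0] < ref[0]) & (pts[:, 1] < ref[1])]
    pts = pts[np.argsort(pts[:, 0])]
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                           # skip dominated points
            hv += (ref[0] - f1) * (prev_f2 - f2)   # add one rectangle slab
            prev_f2 = f2
    # normalize by the reference-to-ideal rectangle
    return hv / ((ref[0] - ideal[0]) * (ref[1] - ideal[1]))
```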
