Multi-Objective Evolutionary Design of Deep Convolutional Neural Networks for Image Classification
Abstract—Convolutional neural networks (CNNs) are the backbones of deep learning paradigms for numerous vision tasks. Early advancements in CNN architectures are primarily driven by human expertise and by elaborate design processes. Recently, neural architecture search was proposed with the aim of automating the network design process and generating task-dependent architectures. While existing approaches have achieved competitive performance in image classification, they are not well suited to problems where the computational budget is limited, for two reasons: (1) the obtained architectures are either solely optimized for classification performance, or only for one deployment scenario; (2) the search process requires vast computational resources in most approaches. To overcome these limitations, we propose an evolutionary algorithm for searching neural architectures under multiple objectives, such as classification performance and floating point operations (FLOPs). The proposed method addresses the first shortcoming by populating a set of architectures to approximate the entire Pareto frontier through genetic operations that recombine and modify architectural components progressively. Our approach improves computational efficiency by carefully down-scaling the architectures during the search as well as reinforcing the patterns commonly shared among past successful architectures through Bayesian model learning. The integration of these two main contributions allows an efficient design of architectures that are competitive and in most cases outperform both manually and automatically designed architectures on benchmark image classification datasets: CIFAR, ImageNet and human chest X-ray. The flexibility of simultaneously obtaining multiple architecture choices for different compute requirements further differentiates our approach from other methods in the literature. Code is available at https://github.com/mikelzc1990/nsganetv1.
Index Terms—Neural architecture search (NAS), evolutionary deep learning, convolutional neural networks (CNNs), genetic algorithms (GAs)
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have been overwhelmingly successful in a variety of computer-vision-related tasks like object classification, detection, and segmentation. One of the main driving forces behind this success is the introduction of many CNN architectures, including GoogLeNet [1], ResNet [2], DenseNet [3], etc., in the context of object classification. Concurrently, architecture designs, such as ShuffleNet [4], MobileNet [5], LBCNN [6], etc., have been developed with the goal of enabling real-world deployment of high-performance models on resource-constrained devices.
These developments are the fruits of years of painstaking efforts and human ingenuity.
Neural architecture search (NAS), on the other hand, presents a promising path to alleviate this painful process by posing the design of CNN architectures as an optimization problem. By altering the architectural components in an algorithmic fashion, novel CNNs can be discovered that exhibit improved performance metrics on representative datasets. The huge surge in research and applications of NAS indicates the tremendous academic and industrial interest NAS has attracted, as teams seek to stake out some of this territory. It is now well recognized that designing bespoke neural network architectures for various tasks is one of the most challenging and practically beneficial components of the entire Deep Neural Network (DNN) development process, and is a fundamental step toward automated machine learning.
Early methods for NAS relied on Reinforcement Learning (RL) to navigate and search for architectures with high performance. A major limitation of these approaches [7], [8] is the steep computational requirement for the search process itself, often requiring weeks of wall clock time on hundreds of Graphics Processing Unit (GPU) cards. Recent relaxation-based methods [9]–[12] seek to improve the computational efficiency of NAS approaches by approximating the connectivity between different layers in the CNN architectures by real-valued variables that are learned (optimized) through gradient descent together with the weights. However, such relaxation-based NAS methods suffer from excessive GPU memory requirements during search, resulting in constraints on the size of the search space (e.g., reduced layer operation choices).
In addition to being accurate in prediction, real-world applications demand that NAS methods find network architectures that are also efficient in computation—e.g., have low power consumption in mobile applications and low latency in autonomous driving applications. It has been a common observation that the predictive performance continuously improves as the complexity (i.e., # of layers, channels, etc.) of the network architectures increases [2], [3], [8], [13]. This alludes to the competing nature of trying to simultaneously maximize predictive performance and minimize network complexity, thereby necessitating multi-objective optimization. Despite recent advances in RL and relaxation-based NAS methods, they are still not readily applicable for multi-objective NAS.
Among the many different NAS methods being continually proposed, Evolutionary Algorithms (EAs) are getting a plethora of attention, due to their population-based nature and flexibility in encoding. They offer a viable alternative to conventional machine learning (ML)-oriented approaches, especially under the scope of multi-objective NAS. An EA, in general, is an iterative process in which individuals in a population are made gradually better by applying variations to selected individuals and/or recombining parts of multiple individuals. Despite the ease of extending them to handle multiple objectives, most existing EA-based NAS methods [14]–[19] are still single-objective driven.
In this paper, we present NSGANetV1, a multi-objective evolutionary algorithm for NAS, extending on an earlier proof-of-principle method [20], to address the aforementioned limitations of current approaches. The key contributions followed by the extensions made in this paper are summarized below:
- NSGANetV1 populates a set of architectures to approximate the entire Pareto front in one run through customized genetic operations that recombine and modify architectural components progressively. NSGANetV1 improves computational efficiency by carefully downscaling the architectures during the search as well as reinforcing the emerging patterns shared among past successful architectures through a Bayesian Network based distribution estimation operator. Empirically, the obtained architectures, in most cases, outperform both manually and other automatically designed architectures on various datasets.
- By obtaining a set of architectures in one run, NSGANetV1 allows designers to choose a suitable network a-posteriori as opposed to a pre-defined preference weighting of objectives prior to the search. Further post-optimal analysis of the set of non-dominated architectures often reveals valuable design principles, which is another benefit of posing NAS as a multi-objective optimization problem, as is done in NSGANetV1.
- From an algorithmic perspective, we extend our previous work [20] in a number of ways: (i) an expanded search space to include five more layer operations and one more option that controls the width of the network, (ii) improved encoding, mutation and crossover operators accompanying the modified search space, and (iii) a more thorough lower-level optimization process for weight learning, resulting in better and more reliable performance.
- From an evaluation perspective, we extend our previous work [20] in two different ways: (i) adding three more tasks, including medical imaging, robustness to adversarial attacks, and car key-point estimation; and (ii) evaluating the searched architectures on five new datasets, including ImageNet, ImageNet-V2, CIFAR-10.1, corrupted CIFAR-10 and corrupted CIFAR-100.
The remainder of this paper is organized as follows. Section II introduces and summarizes related literature. In Section III, we provide a detailed description of the main components of our approach. We describe the experimental setup to validate our approach along with a discussion of the results in Section IV, followed by further analysis and an application study in Sections V and VI, respectively. Finally, we conclude with a summary of our findings and comment on possible future directions in Section VII.
II. RELATED WORK
Recent years have witnessed growing interest in NAS. The promise of being able to automatically and efficiently search for task-dependent network architectures is particularly appealing as deep neural networks are widely deployed in diverse applications and computational environments. Early methods [21], [22] made efforts to simultaneously evolve the topology of neural networks along with weights and hyper-parameters. These methods perform competitively with hand-crafted networks on control tasks with shallow fully connected networks. In the following, we present studies related to deep convolutional neural networks for image classification. Readers are referred to the supplementary materials for a more detailed review of the topic.
Evolutionary NAS: Designing neural networks through evolution has been a topic of interest for a long time. Recent evolutionary approaches focus on evolving solely the topology while leaving the learning of weights to gradient descent algorithms, and using hyper-parameter settings that are manually tuned. Xie and Yuille’s work of Genetic CNN [14] is one of the early studies that shows the promise of using EAs for NAS. Real et al. [15] introduce perhaps the first truly large scale application of a simple EA to NAS. The extension of this method presented in [17], called AmoebaNet, provides the first large scale comparison of EA and RL methods. Their EA, using an age-based selection similar to [23], has demonstrated faster convergence to an accurate network when compared to RL and random search. Concurrently, Liu et al. [16] evolve a hierarchical representation that allows non-modular layer structures to emerge. Despite the impressive improvements achieved on various datasets, these EA methods are extremely computationally inefficient, e.g., one run of the regularized evolution method [17] takes one week on 450 GPU cards.
Concurrently, another streamlining of EA methods for use in budgeted NAS has emerged. Suganuma et al. [24] use Cartesian genetic programming to assemble an architecture from existing modular blocks (e.g., Residual blocks). Sun et al. in [19] use a random forest as an offline surrogate model to predict the performance of architectures, partially eliminating the lower-level optimization via gradient descent. The reported results yield $3\times$ savings in wall clock time with similar classification performance when compared to their previous works [18], [25]. However, results reported from these budgeted EA methods are far from state-of-the-art and only demonstrated on small-scale datasets—i.e., CIFAR-10 and CIFAR-100.
Multi-objective NAS: In this work, the term multi-objective NAS refers to methods that simultaneously approximate the entire set of efficient trade-off architectures in one run [26], [27]. Kim et al. [28] presented NEMO, one of the earliest evolutionary multi-objective approaches to evolve CNN architectures. NEMO uses NSGA-II [29] to maximize the classification performance and inference speed of a network and searches over the space of the number of output channels from each layer within a restricted space of seven different architectures. DPP-Net [30], an extension of [31], progressively expands networks from simple structures and only trains the top-K (based on Pareto-optimality) networks that are predicted to be promising by a surrogate model. Elsken et al. [32] present the LEMONADE method, which is formulated to develop networks with high predictive performance and lower resource constraints. LEMONADE reduces compute requirements through custom-designed approximate network morphisms [33], which allow newly generated networks to share parameters with their forerunners, obviating the need to train new networks from scratch. However, LEMONADE still requires nearly 100 GPU-days to search on the CIFAR datasets [34].
Fig. 1: Overview: Given a dataset and objectives, NSGANetV1 designs a set of custom architectures spanning the trade-off front. NSGANetV1 estimates the performance of an architecture through its proxy model, optimized by Stochastic Gradient Descent (SGD) at the lower level. The search proceeds in exploration via genetic operations, followed by exploitation via distribution estimation. See Algorithm 1 for pseudocode; colors are in correspondence.
Search Efficiency: The main computation bottleneck of NAS resides in the lower-level optimization of learning the weights for evaluating the performance of architectures. One such evaluation typically requires hours to finish. To improve the practical utility of search under a constrained computational budget, NAS methods commonly advocate for substitute measurements without a full-blown lower-level optimization. A widely-used approach proceeds as follows: it reduces the depth (number of layers) and the width (number of channels) of the intended architecture to create a small-scale network—i.e., a proxy model. Proxy models require an order of magnitude less computation time (typically, minutes) to perform lower-level optimization, and the performance of proxy models is then used as surrogate measurements to guide the search. However, most existing NAS work [8], [16], [17], [31], [35] follows simple heuristics to construct the proxy model, resulting in low correlation in prediction. For instance, NASNet [8] has an additional re-ranking stage that trains the top 250 architectures for 300 epochs each (takes more than a year on a single GPU card) before picking the best one, and the reported NASNet-A model was originally ranked 70th among the top 250 according to the performance measured at proxy model scale. Similarly, AmoebaNet [17] relies on evaluation of duplicate architectures to gauge representative performance, leading to 27K models being evaluated during search.
In this work, we focus on both the efficiency and the reliability aspects of the proxy model; through a series of systematic studies in a controlled setting, we empirically establish the trade-off between the correlation of proxy performance to true performance and the speed-up in estimation. We then implement a suitable setting that is specific to our search space and dataset.
III. PROPOSED APPROACH
Practical applications of NAS can rarely be considered from the point of view of a single objective of maximizing performance; rather, they must be evaluated from at least one additional, conflicting objective that is specific to the deployment scenario. In this work, we approach the problem of designing high-performance architectures with diverse complexities for different deployment scenarios as a multi-objective bilevel optimization problem∗ [36]. We mathematically formulate the problem as,
$$
\begin{array}{rl}
\mathrm{minimize} & F(\boldsymbol{x})=\big(f_{1}(\boldsymbol{x};\boldsymbol{w}^{*}(\boldsymbol{x})),\,f_{2}(\boldsymbol{x})\big)^{T},\\
\mathrm{subject~to} & \boldsymbol{w}^{*}(\boldsymbol{x})\in\operatorname{argmin}\ \mathcal{L}(\boldsymbol{w};\boldsymbol{x}),\\
 & \boldsymbol{x}\in\Omega_{x},\quad\boldsymbol{w}\in\Omega_{w},
\end{array}
$$
where $\Omega_{x}=\Pi_{i=1}^{n}[a_{i},b_{i}]\subseteq\mathbb{Z}^{n}$ is the architecture decision space, where $a_{i},~b_{i}$ are the lower and upper bounds, $x=(x_{1},\dots,x_{n})^{T}\in\Omega_{x}$ is a candidate architecture, and the lower-level variable $\pmb{w}\in\Omega_{w}$ denotes its associated weights. The upper-level objective vector $\pmb{F}$ comprises the classification error $(f_{1})$ on the validation data $\mathcal{D}_{vld}$, and the complexity $(f_{2})$ of the network architecture. The lower-level objective $\mathcal{L}(\pmb{w};\pmb{x})$ is the cross-entropy loss on the training data $\mathcal{D}_{trn}$.
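To make the bilevel structure concrete, the following minimal PyTorch sketch (our illustration, not the authors' implementation; `evaluate`, the toy model and the synthetic data are all stand-ins) fits the weights of a candidate in the lower level via SGD and returns the two upper-level objectives, the validation error $f_1$ and a FLOPs estimate $f_2$:

```python
import torch
import torch.nn as nn

def evaluate(hidden_width, trn, vld, epochs=10):
    """Upper-level objectives F(x) for a toy 'architecture' x = hidden_width."""
    x_trn, y_trn = trn
    x_vld, y_vld = vld
    model = nn.Sequential(nn.Linear(16, hidden_width), nn.ReLU(),
                          nn.Linear(hidden_width, 4))
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                       # lower level: w*(x) = argmin_w L(w; x)
        opt.zero_grad()
        loss_fn(model(x_trn), y_trn).backward()
        opt.step()
    with torch.no_grad():                         # f1: classification error on the validation split
        err = (model(x_vld).argmax(dim=1) != y_vld).float().mean().item()
    flops = 2 * (16 * hidden_width + hidden_width * 4)   # f2: multiply-adds per image (toy count)
    return err, flops

trn = (torch.randn(256, 16), torch.randint(0, 4, (256,)))
vld = (torch.randn(64, 16), torch.randint(0, 4, (64,)))
print(evaluate(hidden_width=32, trn=trn, vld=vld))
```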
Our proposed algorithm, NSGANetV1, is an iterative process in which initial architectures are made gradually better as a group, called a population. In every iteration, a group of offspring (i.e., new architectures) is created by applying variations through crossover and mutation to the more promising of the architectures already found, also known as parents, from the population. Every member in the population (including both
parents and offspring) competes for survival and reproduction (becoming a parent) in each iteration. The initial population may be generated randomly or guided by prior knowledge, i.e., seeding past successful architectures directly into the initial population. Subsequent to initialization, NSGANetV1 conducts the search in two sequential stages: (i) exploration, with the goal of discovering diverse ways to construct architectures, and (ii) exploitation, which reinforces the emerging patterns commonly shared among the architectures successful during exploration. A set of architectures representing efficient trade-offs between network performance and complexity is obtained at the end of evolution, through genetic operators and a Bayesian-model-based learning procedure. A flowchart and a pseudocode outlining the overall approach are shown in Fig. 1 and Algorithm 1, respectively. In the remainder of this section, we provide a detailed description of the aforementioned components in Sections III-A - III-C.
A. Search Space and Encoding
The search for optimal network architectures can be performed over many different search spaces. The generality of the chosen search space has a major influence on the quality of results that are even possible. Most existing evolutionary NAS approaches [14], [19], [24], [32] search only one aspect of the architecture space—e.g., the connections and/or hyper-parameters. In contrast, NSGANetV1 searches over both operations and connections—the search space is thus more comprehensive, including most of the previous successful architectures designed both by human experts and algorithmically.
Modern CNN architectures are often composed of an outer structure (network-level) design where the width (i.e., # of channels), the depth (i.e., # of layers) and the spatial resolution changes (i.e., locations of pooling layers) are decided; and an inner structure (block-level) design where the layer-wise connections and computations are specified, e.g., Inception block [1], ResNet block [2], and DenseNet block [3], etc. As seen in the CNN literature, the network-level decisions are mostly hand-tuned based on meta-heuristics from prior knowledge and the task at hand, as is the case in this work. For block-level design, we adopt the one used in [8], [9], [17], [31] to be consistent with previous work.
A block is a small convolutional module, typically repeated multiple times to form the entire neural network. To construct scalable architectures for images of different resolutions, we use two types of blocks to process intermediate information: (1) the Normal block, a block type that returns information of the same spatial resolution; and (2) the Reduction block, another block type that returns information with spatial resolution halved by using a stride of two. See Fig. 2a for a pictorial illustration.
We use directed acyclic graphs (DAGs) consisting of five nodes to construct both types of blocks (a Reduction block uses a stride of two). Each node is a two-branched structure, mapping two inputs to one output. For each node in block $i$ , we need to pick two inputs from among the output of the previous block $h_{i-1}$ , the output of the previous-previous block $h_{i-2}$ and the set of hidden states created in any previous nodes of block $i$ . For pairs of inputs chosen, we choose a computation operation from among the following options, collected based on their prevalence in the CNN literature:
• identity
• 3x3 max pooling
• 3x3 average pooling
• squeeze-and-excitation [37]
• 3x3 local binary conv [6]
• 5x5 local binary conv [6]
• 3x3 dilated convolution
• 5x5 dilated convolution
• 3x3 depthwise-separable conv
• 5x5 depthwise-separable conv
• 7x7 depthwise-separable conv
• 1x7 then 7x1 convolution
The results computed from both branches are then added together to create a new hidden state, which is available for subsequent nodes in the same block. See Fig. 2b-2d for pictorial illustrations. The search space we consider in this paper is an expanded version of the micro search space used in our previous work [20]. Specifically, the current search space (i) searches for a factor that gradually increments the channel size of each block with depth (see Fig. 2b), as opposed to sharply doubling the channel size when down-sampling; and (ii) considers an expanded set of primitive operations that includes both more recent advanced layer primitives, such as squeeze-and-excitation [37], and more parameter-efficient layer primitives, like local binary convolution [6].
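As an illustration of the resulting genotype, the sketch below (our own encoding convention, not the authors' code) samples a random block as a list of per-node (input, operation) pairs, with the pool of candidate inputs growing as nodes are added:

```python
import random

OPS = ["identity", "3x3 max pool", "3x3 avg pool", "squeeze-and-excitation",
       "3x3 local binary conv", "5x5 local binary conv",
       "3x3 dilated conv", "5x5 dilated conv",
       "3x3 sep conv", "5x5 sep conv", "7x7 sep conv", "1x7-7x1 conv"]

def random_block(n_nodes=5):
    """Genotype: one (in1, op1, in2, op2) tuple per node.
    Input index 0 -> h_{i-2}, 1 -> h_{i-1}, 2+k -> output of node k of this block."""
    block = []
    for i in range(n_nodes):
        inputs = range(2 + i)                      # candidate inputs available to node i
        block.append((random.choice(inputs), random.randrange(len(OPS)),
                      random.choice(inputs), random.randrange(len(OPS))))
    return block

# an architecture is described by one Normal and one Reduction block
genotype = {"normal": random_block(), "reduction": random_block()}
print(genotype["normal"])
```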
Fig. 2: Schematic of the NSGANetV1 search space motivated from [8]: (a) An architecture is composed of stacked blocks. (b) The number of channels in each block is gradually increased with depth of the network. (c) Each block is composed of five nodes, where each node is a two-branched computation applied to outputs from either previous blocks or previous nodes within the same block. (d) A graphical visualization of (c).
With the above-mentioned search space, there are in total 20 decisions to constitute a block structure—i.e., choose two pairs of input and operation for each node, and repeat for five nodes. The resulting number of combinations for a block structure is determined by $n$, the number of nodes, and $n\_ops$, the number of considered operations. Therefore, with one Normal block and one Reduction block with five nodes in each, the overall size of the encoded search space is approximately $10^{33}$ (see the counting sketch below).
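As a sanity check, under one plausible counting (our reconstruction, not taken verbatim from the paper: each node makes two ordered (input, operation) picks, node $i$ has $i+1$ candidate inputs, and there are $n\_ops = 12$ candidate operations), the size of the space works out to roughly the figure quoted above:

$$
\prod_{i=1}^{5}\big((i+1)\cdot 12\big)^{2} \;=\; (6!)^{2}\cdot 12^{10} \;\approx\; 3.2\times 10^{16} \text{ per block},
\qquad
\big(3.2\times 10^{16}\big)^{2} \;\approx\; 1.0\times 10^{33}.
$$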
B. Performance Estimation Strategy
To guide NSGANetV1 towards finding more accurate and efficient architectures, we consider two metrics as objectives, namely, classification accuracy and architecture complexity. Assessing the classification accuracy of an architecture during search requires another optimization to first identify the optimal values of the associated weights via Stochastic Gradient Descent (SGD; Algorithm 2). Even though there exist well-established gradient descent algorithms to efficiently solve this optimization, repeatedly executing such an algorithm for every candidate architecture renders the overall process computationally very prohibitive. Therefore, to overcome this computational bottleneck, we carefully (using a series of ablation studies) down-scale the architectures to create their proxy models [8], [17], which can be optimized efficiently in the lower level through SGD. Their performances become surrogate measurements to select architectures during search. Details are provided in Section V-C.
A number of metrics can serve as proxies for complexity, including: the number of active nodes, number of active connections between the nodes, number of parameters, inference time and number of floating-point operations (FLOPs) needed to execute the forward pass of a given architecture. Our initial experiments considered each of these metrics in turn. We concluded from extensive experimentation that inference time cannot be estimated reliably due to differences
and inconsistencies in the computing environment, GPU manufacturer, ambient temperature, etc. Similarly, the number of parameters, active connections or active nodes only relate to one aspect of the complexity. In contrast, we found an estimate of FLOPs to be a more accurate and reliable proxy for network complexity. Therefore, classification accuracy and FLOPs serve as our choice of twin objectives to be traded off for selecting architectures. To simultaneously compare and select architectures based on these two objectives, we use the non-dominated ranking and the “crowded-ness" concepts proposed in [29].
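For reference, the non-dominated ranking and crowding-distance selection of [29] can be sketched compactly as follows (an illustrative re-implementation with both objectives minimized; the function names are ours):

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_sort(objs):
    fronts, remaining = [], list(range(len(objs)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(objs, front):
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][m])
        span = objs[order[-1]][m] - objs[order[0]][m] or 1.0
        dist[order[0]] = dist[order[-1]] = float("inf")     # boundary points are always kept
        for k in range(1, len(order) - 1):
            dist[order[k]] += (objs[order[k + 1]][m] - objs[order[k - 1]][m]) / span
    return dist

def survival(objs, n_survive):
    survivors = []
    for front in nondominated_sort(objs):
        if len(survivors) + len(front) <= n_survive:
            survivors += front
        else:
            d = crowding_distance(objs, front)
            survivors += sorted(front, key=lambda i: -d[i])[:n_survive - len(survivors)]
            break
    return survivors

# e.g. (validation error, MFLOPs) for six candidate architectures
objs = [(0.26, 120), (0.22, 300), (0.20, 520), (0.24, 180), (0.21, 700), (0.27, 90)]
print(survival(objs, n_survive=3))
```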
C. Creation of New Generation
Exploration: Given a population of architectures, parents are selected from the population with a fitness bias. This choice is dictated by two observations, (1) offspring created around better parents are expected to have higher fitness on average than those created around worse parents, with the assumption of some level of gradualism in the solution space; (2) occasionally (although not usually), offspring perform better than their parents, through inheriting useful traits from both parents. Because of this, one might demand that the best architecture in the population should always be chosen as one of the parents. However, the deterministic and greedy nature of that approach would likely lead to premature convergence due to loss of diversity in the population [38]. To address this problem, we use binary tournament selection [39] to promote parent architectures in a stochastic fashion. At each iteration, binary tournament selection randomly picks two architectures from the population, then the one favored by the multi-objective selection criterion described in Section III-B becomes one of the parents. This process is repeated to select a second parent architecture; the two parent architectures then undergo a crossover operation.
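A minimal sketch of this binary tournament is given below (ranks and crowding distances are assumed to come from the survival-selection step sketched earlier; the helper names are illustrative):

```python
import random

def binary_tournament(rank, crowd):
    i, j = random.sample(range(len(rank)), 2)
    if rank[i] != rank[j]:                       # prefer the better non-dominated rank
        return i if rank[i] < rank[j] else j
    return i if crowd[i] >= crowd[j] else j      # tie-break: prefer the less crowded one

rank  = [0, 0, 1, 0, 2, 1]                       # illustrative values only
crowd = [float("inf"), 0.4, 0.9, 0.7, 0.1, float("inf")]
parent_a, parent_b = binary_tournament(rank, crowd), binary_tournament(rank, crowd)
print(parent_a, parent_b)
```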
Fig. 3: Illustration of node-level crossover.
In NSGANetV1, we use two types of crossover (with equal probability of being chosen) to efficiently exchange substructures between two parent architectures. The first type is at the block level, in which the offspring architectures are created by recombining the Normal block from the first parent with the Reduction block from the other parent and vice versa. The second type is at the node level, where a node from one parent is randomly chosen and exchanged with another node at the same position from the other parent. We apply the node-level crossover to both Normal and Reduction blocks. Fig. 3 illustrates an example of node-level crossover. Note that two offspring architectures are generated after each crossover operation, and only one of them (randomly chosen) is added to the offspring population. In each generation, an offspring population of the same size as the parent population is generated.
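The two crossover types can be sketched as follows (an illustrative sketch over the genotype convention used earlier, not the authors' implementation):

```python
import random

def block_level_crossover(p1, p2):
    # each child keeps one parent's Normal block and the other parent's Reduction block
    return ({"normal": p1["normal"], "reduction": p2["reduction"]},
            {"normal": p2["normal"], "reduction": p1["reduction"]})

def node_level_crossover(p1, p2):
    c1 = {k: list(v) for k, v in p1.items()}
    c2 = {k: list(v) for k, v in p2.items()}
    for key in ("normal", "reduction"):
        pos = random.randrange(len(c1[key]))        # same node position in both parents
        c1[key][pos], c2[key][pos] = c2[key][pos], c1[key][pos]
    return c1, c2

def crossover(p1, p2):
    op = random.choice([block_level_crossover, node_level_crossover])
    children = op(p1, p2)
    return random.choice(children)                  # only one child enters the offspring pool

p1 = {"normal": [(0, 1, 1, 2)] * 5, "reduction": [(1, 3, 0, 0)] * 5}
p2 = {"normal": [(1, 5, 2, 4)] * 5, "reduction": [(0, 2, 1, 6)] * 5}
print(crossover(p1, p2)["normal"])
```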
Fig. 4: Input and Operation Mutation: Dashed line boxes with red color highlight the mutation. $h_{i-2}$ and $h_{i-1}$ are outputs from the previous-previous and previous blocks, respectively. $h_{i}^{(3)}$ indicates the output from node 3 of the current block.
To enhance the diversity of the population and the ability to escape from local attractors, we use a discretized version of the polynomial mutation (PM) operator [40] subsequent to crossover. We allow mutation to be applied on both the input hidden states and the choice of operations. Fig. 4 shows an example of each type of mutation using the parent-centric PM operator, in which the offspring are intentionally created around the parents in the decision space. In association with PM, we sort our discrete encoding of input hidden states chronologically and the choice of operations in ascending order of computational complexity. In the context of neural architectures, this step makes the mutated input hidden states in offspring architectures more likely to be chronologically close to the input hidden states in parent architectures. For example, $h_{i}^{(2)}$ is more likely to be mutated to $h_{i}^{(1)}$ than to $h_{i-2}$ by PM. A similar logic is applied in the case of mutation on layer operations.
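A simplified sketch of the discretized PM operator on a single integer gene is shown below (our simplification: it omits the boundary-corrected form of PM and simply rounds and clips the perturbed value back into the valid range):

```python
import random

def discrete_pm(gene, low, high, eta=20.0, prob=0.1):
    """Mutate one integer gene (e.g., an operation index sorted by complexity,
    or an input index sorted chronologically); larger eta keeps offspring closer
    to the parent value."""
    if random.random() > prob or high == low:
        return gene
    u = random.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    mutated = int(round(gene + delta * (high - low)))
    return min(max(mutated, low), high)

# e.g. mutate the operation choice (12 options, indexed 0..11) of one node
print([discrete_pm(6, 0, 11, prob=1.0) for _ in range(10)])
```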
Fig. 5: Illustrative example of BN-based exploitation step in NSGANetV1: given past successful architectures, we construct a BN relating the dependencies between the four nodes inside the Normal block. A new architecture is then sampled from this BN and proceeds forward for performance estimation.
Exploitation: After a sufficient number of architectures has been explored (consuming 2/3 of the total computational budget, i.e., $\tau$ in Algorithm 1), we start to enhance the exploitation aspect of the search. The key idea is to reinforce and reuse the patterns commonly shared among past successful architectures. We use a Bayesian Network (BN) [41] as the probabilistic model to estimate the distribution of the Pareto set (of architectures). In the context of our search space and encoding, this translates to learning the correlations among the operations and connections of nodes within a block. Our exploitation step uses a subset (top-100 architectures selected based on domination rank and crowding distance [29]) of the past evaluated architectures to guide the final part of the search. More specifically, say we are designing a Normal block with three nodes, namely $\alpha_{n}^{(1)}$, $\alpha_{n}^{(2)}$, and $\alpha_{n}^{(3)}$. We would like to know the relationship among these three nodes. For this purpose, we construct a BN relating these variables, modeling the probability of Normal blocks beginning with a particular node $\alpha_{n}^{(1)}$, the probability that $\alpha_{n}^{(2)}$ follows $\alpha_{n}^{(1)}$, and the probability that $\alpha_{n}^{(3)}$ follows $\alpha_{n}^{(2)}$ and $\alpha_{n}^{(1)}$. In other words, we estimate the conditional distributions $p\left(\alpha_{n}^{(1)}\right)$, $p\left(\alpha_{n}^{(2)}|\alpha_{n}^{(1)}\right)$, and $p\left(\alpha_{n}^{(3)}|\alpha_{n}^{(2)},\alpha_{n}^{(1)}\right)$ using the population history, and update these estimates during the exploitation process. New offspring architectures are created by sampling from this BN. A pictorial illustration of this process is provided in Fig. 5. This BN-based exploitation strategy is used in addition to the genetic operators, where we initially (i.e., at the beginning of exploitation) assign 25% of the offspring (line 34 in Algorithm 1) to be created by BN sampling, and we update this probability adaptively (line 36 in Algorithm 1). To be more specific, we calculate the probabilities of using genetic operators and of sampling from the BN model at generation $t$ based on the survival rates of the offspring created by each of them in the previous generation, following the softmax function:
$$
\rho_{t}^{(i)}=\frac{\exp(s_{t-1}^{(i)})}{\sum_{i=1}^{2}\exp(s_{t-1}^{(i)})}
$$
where $\rho_{t}^{(i)}$ are the probabilities of using genetic operators $(i=1)$ and sampling from the learned BN model $(i=2)$; and $s_{t-1}^{(i)}$ are the survival rates of the offspring created by the genetic operators $(i=1)$ and the learned BN model $(i=2)$ at the previous generation $t-1$. Note that $\rho_{t}^{(i=1)}$ corresponds to $\rho$ in Algorithm 1.
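The exploitation step can be sketched as follows (a self-contained illustration; the toy node encoding and the Laplace smoothing are our simplifications): a chain-structured BN over the node decisions is fitted to the archive of good architectures and sampled to create offspring, and the probability of using genetic operators versus BN sampling follows the softmax of the previous generation's offspring survival rates.

```python
import math
import random
from collections import Counter, defaultdict

def fit_chain_bn(archive, n_choices):
    """Fit a chain BN p(a1), p(a2|a1), p(a3|a2,a1), ... from an archive of blocks,
    where each block is a tuple of small-integer node codes."""
    first = Counter(b[0] for b in archive)
    cond = defaultdict(Counter)                    # counts of a_k given the prefix a_1..a_{k-1}
    for b in archive:
        for k in range(1, len(b)):
            cond[b[:k]][b[k]] += 1

    def sample(n_nodes):
        block = random.choices(range(n_choices),
                               [first[c] + 1 for c in range(n_choices)])   # +1: Laplace smoothing
        for _ in range(1, n_nodes):
            weights = [cond[tuple(block)][c] + 1 for c in range(n_choices)]
            block += random.choices(range(n_choices), weights)
        return tuple(block)
    return sample

def mixing_prob(survival_ga, survival_bn):
    """rho_t for the genetic operators: softmax of last generation's survival rates."""
    e_ga, e_bn = math.exp(survival_ga), math.exp(survival_bn)
    return e_ga / (e_ga + e_bn)

# toy archive of "successful" 3-node blocks with 4 possible node codes
archive = [(0, 1, 2), (0, 1, 3), (0, 2, 2), (1, 1, 2)]
sample_block = fit_chain_bn(archive, n_choices=4)
print([sample_block(3) for _ in range(5)])
print(mixing_prob(0.4, 0.6))                       # probability of using genetic operators
```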
IV. EXPERIMENTAL SETUP AND RESULTS
In this section, we will evaluate the efficacy of NSGANetV1 on multiple benchmark image classification datasets.
A. Baselines
To demonstrate the effectiveness of the proposed algorithm, we compare the non-dominated architectures achieved at the conclusion of NSGANetV1's evolution with architectures reported by various peer methods published in top-tier venues. The chosen peer methods can be broadly categorized into three groups: architectures manually designed by human experts, non-EA-based (mainly RL- or relaxation-based) methods, and EA-based methods. Human-engineered architectures include ResNet [2], ResNeXt [42], DenseNet [3], etc. The second and third groups range from earlier methods [7], [14], [15] that are oriented towards a "proof-of-concept" for NAS, to more recent methods [8], [9], [17], [43], many of which improved state-of-the-art results on various computer vision benchmarks at the time they were published. The effectiveness of the different architectures is judged on both classification accuracy and computational complexity. For comparison on classification accuracy, three widely used natural object classification benchmark datasets are considered, namely, CIFAR-10, CIFAR-100 and ImageNet. More details and a gallery of examples from these three datasets are provided in Fig. 2 in the supplementary materials under Section V.
B. Implementation Details
Motivated by efficiency and practicality considerations, most existing NAS methods, including [8], [17], [18], [44], carry out the search process on the CIFAR-10 dataset. However, as we demonstrate through ablation studies in Section V-C, CIFAR-100 provides a more reliable measure of an architecture's efficacy than CIFAR-10. Based on this observation, in contrast to existing approaches, we use the more challenging CIFAR-100 dataset for the search process. Furthermore, we split the original CIFAR-100 training set (80%-20%) to create a training and validation set to prevent over-fitting to the training set and improve generalizability. We emphasize that the original testing set is never used to guide the selection of architectures in any form during the search.
TABLE I: Summary of Hyper-parameter Settings.
Categories | Parameters | Settings
search space | # of initial channels (Ch_init) | 32
 | # of channel increments (Ch_inc) | 6
 | # of repetitions of Normal blocks (V) | 4/5/6
gradient descent | batch size | 128
 | weight decay (L2 regularization) | 5.00E-04
 | epochs | 36/600
 | learning rate schedule | Cosine Annealing [48]
search strategy | population size | 40
 | # of generations | 30
 | crossover probability | 0.9
 | mutation probability | 0.1
The search itself is repeated five times with different initial random seeds. We select and report the performance of the median run as measured by hypervolume (HV). Such a procedure ensures the reproducibility of our NAS experiments and mitigates the concerns that have arisen in recent NAS studies [45], [46]. We use the standard SGD algorithm for learning the associated weights of each architecture. Other hyper-parameter settings related to the search space, the gradient descent training and the search strategy are summarized in Table I. We provide analysis aimed at justifying some of the hyper-parameter choices in Section V-C. All experiments are performed on 8 Nvidia 2080Ti GPU cards.
Our post-search training settings largely follow [9]: we extend the number of epochs to 600 with a batch size of 96 to thoroughly re-train the selected models from scratch. We also incorporate a data pre-processing technique, cutout [47], and a regularization technique, scheduled path dropout, introduced in [8]. In addition, to further improve the training process, an auxiliary head classifier [1] is appended to the architecture at approximately 2/3 depth (right after the second resolution-reduction operation). The loss from this auxiliary head classifier, scaled by a constant factor of 0.4, is aggregated with the loss from the original architecture before back-propagation during training.
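The aggregation of the auxiliary-head loss can be sketched as follows (a minimal PyTorch illustration; the random tensors merely stand in for the two classifier outputs of one mini-batch):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def training_step(logits_main, logits_aux, target, aux_weight=0.4):
    # total loss = main classifier loss + 0.4 * auxiliary-head loss
    return criterion(logits_main, target) + aux_weight * criterion(logits_aux, target)

logits_main = torch.randn(96, 100, requires_grad=True)   # batch size 96, 100 classes
logits_aux  = torch.randn(96, 100, requires_grad=True)
target = torch.randint(0, 100, (96,))
training_step(logits_main, logits_aux, target).backward()
```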
C. Effectiveness of NSGANetV1
We first present the objective space distribution of all architectures generated by NSGANetV1 during the course of evolution on CIFAR-100, in Fig. 6a. We include architectures generated by the original NSGA-II algorithm and uniform random sampling as references for comparison. Details of these two methods are provided in Section IV-D. From the set of nondominated solutions (outlined by red box markers in Fig. 6a), we select five architectures based on the ratio of the gain on accuracy over the sacrifice on FLOPs. For reference purposes, we name these five architectures as NSGANetV1-A0 to -A4 in ascending FLOPs order. See Fig. 1 in the supplementary materials for a visualization of the searched architectures.
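One plausible reading of this selection rule (our assumption, not the authors' exact procedure) is to keep the least complex non-dominated architecture and then the points with the largest marginal accuracy gain per unit of additional FLOPs:

```python
def pick_by_marginal_ratio(front, k):
    """front: list of (flops, accuracy) pairs, assumed non-dominated; returns k picks."""
    front = sorted(front)                                   # ascending FLOPs
    ratios = [(front[i][1] - front[i - 1][1]) / (front[i][0] - front[i - 1][0])
              for i in range(1, len(front))]
    keep = sorted(range(1, len(front)), key=lambda i: -ratios[i - 1])[:k - 1]
    return [front[0]] + [front[i] for i in sorted(keep)]    # cheapest point is always kept

# illustrative (FLOPs in millions, accuracy) values, not measured results
front = [(60, 0.72), (120, 0.78), (250, 0.81), (400, 0.825), (700, 0.83), (1100, 0.832)]
print(pick_by_marginal_ratio(front, k=5))
```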
For comparison with other peer methods, we follow the training procedure in [9] and re-train the weights of NSGANetV1-A0 to -A4 on CIFAR-100, following the steps outlined in Section IV-B. We would like to mention that since most existing approaches do not report the number of FLOPs for the architectures used on the CIFAR-100 dataset, we instead compare their computational complexity through the number of parameters to prevent potential discrepancies from re-implementation. Fig. 6b shows the post-search architecture comparisons: NSGANetV1-A0 to -A4, i.e., the architectures obtained in this paper, jointly dominate all other considered peer methods by a clear margin. More specifically, NSGANetV1-A1 is more accurate than the peer EA method AE-CNN+E2EPP [19], while being 30x more efficient in network parameters; NSGANetV1-A2 achieves better performance than AmoebaNet [17] and NSGA-Net [20] with 3x fewer parameters. Furthermore, NSGANetV1-A4 exceeds the classification accuracy of Shake-Even 29 2x4x64d + SE [37] using 8x fewer parameters. More comparisons can be found in Table IIb.
Fig. 6: (a) Accuracy vs. FLOPs of all architectures generated by NSGANetV1 during the course of evolution on CIFAR-100. A subset of non-dominated architectures (see text), named NSGANetV1-A0 to A4, are re-trained thoroughly and compared with other peer methods in (b).
Following the practice adopted in most previous approaches [8], [9], [17], [31], [44], we measure the transferability of the obtained architectures by allowing the architectures evolved on one dataset (CIFAR-100 in this case) to be inherited and used on other datasets, by retraining the weights from scratch on the new dataset–in our case, on CIFAR-10 and ImageNet.
The effectiveness of NSGANetV1 is further validated by the transferred performance on the CIFAR-10 dataset. As we show in Fig. 7a, the trade-off frontier established by NSGANetV1-A0 to -A4 completely dominates the frontiers obtained by the peer EMO methods, both DPP-Net [30] and LEMONADE [32], as well as those obtained with other single-objective peer methods. More specifically, NSGANetV1-A0 uses 27x fewer parameters and achieves higher classification accuracy than Large-scale Evo. [15]. NSGANetV1-A1 outperforms Hierarchical NAS [16] and DenseNet [3] in classification, while saving 122x and 51x in parameters, respectively. NSGANetV1-A2 uses 4x fewer parameters to achieve similar performance as compared to NSGA-Net [20]. Furthermore, NSGANetV1-A4 exceeds previous state-of-the-art results reported by Proxyless NAS [43] while being 1.4x more compact. Refer to Table IIa for more comparisons.
For the transfer performance comparison on the ImageNet dataset, we follow previous work [5], [8], [9], [44] and use the ImageNet-mobile setting, i.e., the setting where the number of FLOPs is less than 600M. NSGANetV1-A0 is too simple for the ImageNet dataset and NSGANetV1-A4 exceeds the 600M FLOPs threshold for the mobile setting, so we provide results only for NSGANetV1-A1, -A2 and -A3. Fig. 7b compares the objective space with the other peer methods. Clearly, NSGANetV1 can achieve a better trade-off between the objectives. NSGANetV1-A2 dominates a wide range of peer methods, including ShuffleNet [4] by human experts, NASNet-A [8] by RL, DARTS [9] by relaxation-based methods, and AmoebaNet-A [17] by EA. Moreover, NSGANetV1-A3 surpasses the previous state-of-the-art performance reported by MobileNet-V2 [5] and AmoebaNet-C [17] in the mobile setting with a marginal overhead in FLOPs (1%-3%).
D. Efficiency of NSGANetV1
Comparing the search phase contribution to the success of different NAS algorithms can be difficult and ambiguous due to substantial differences in search spaces and training procedures used during the search. Therefore, we use vanilla NSGA-II and uniform random sampling as comparisons to demonstrate the efficiency of the search phase in NSGANetV1. All three methods use the same search space and performance estimation strategy as described in Section III. The vanilla NSGA-II is implemented by discretizing the crossover and mutation operators in the original NSGA-II [29] algorithm with all hyper-parameters set to default values; it does not utilize any additional enhancements—e.g., the Bayesian-Network-model-based exploitation. The uniform random sampling method is implemented by replacing the crossover and mutation operators in the original NSGA-II algorithm with an initialization method that uniformly samples the search space. We run each of the three methods five times and record the 25-percentile, median and 75-percentile of the normalized HV (NHV) that we obtain. The NHV measurements shown in Fig. 8a suggest that NSGANetV1 is capable of finding more accurate and simpler architectures than vanilla NSGA-II or uniform random sampling (even with an extended search budget), in a more efficient manner.
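For reference, the two-objective hypervolume underlying the NHV comparison can be computed with a simple sweep (a self-contained sketch; normalizing both minimized objectives to [0, 1] with reference point (1, 1) is our assumption):

```python
def hypervolume_2d(points, ref=(1.0, 1.0)):
    """2-D hypervolume for minimization: area dominated by the points w.r.t. ref."""
    # keep only the non-dominated points, sorted by the first objective
    pts = sorted(p for p in points
                 if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points))
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:                           # sweep left to right, stacking rectangles
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

run_a = [(0.20, 0.15), (0.28, 0.08), (0.40, 0.05)]   # illustrative (error, FLOPs) pairs
run_b = [(0.25, 0.20), (0.35, 0.12)]
print(hypervolume_2d(run_a), hypervolume_2d(run_b))
```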
Apart from the HV metric, another important aspect of demonstrating the efficiency of NAS is the computational complexity of the methods. Since theoretical analysis of the computational complexity of different NAS methods is challenging, we compare the computation time spent on Graphics
Fig. 7: Transfer ability of the NSGANetV1 architectures to (a) CIFAR-10, and (b) ImageNet. We compare Top-1 Accuracy vs. Computational Complexity. Architectures joined by dashed lines are from multi-objective algorithms.
TABLE II: Comparison between NSGANetV1 and other baseline methods. NSGANetV1 architectures are obtained by searching on CIFAR-100. NSGANetV1 results on CIFAR-10 and ImageNet are obtained by re-training the weights with images from their respective datasets. Ratio-to-NSGANetV1 indicates the resulting savings on #Params and #FLOPs. The search cost is compared in GPU-days, calculated by multiplying the number of GPU cards deployed with the execution time in days.
(a) CIFAR-10
Architecture | Search Method | GPU-Days | Top-1 Acc. | #Params | Ratio-to-NSGANetV1
NSGANetV1-A0 | EA | 27 | 95.33% | 0.2M | 1x
CGP-CNN [24] | EA | 27 | 94.02% | 1.7M | 8.5x
Large-scale Evo. [15] | EA | 2,750 | 94.60% | 5.4M | 27x
AE-CNN+E2EPP [19] | EA | 7 | 94.70% | 4.3M | 21x
ResNet [2] | manual | - | 95.39% | 1.7M | 8.5x
NSGANetV1-A1 | EA | 27 | 96.51% | 0.5M | 1x
Hierarchical NAS [16] | EA | 300 | 96.37% | 61.3M | 122x
PNAS [31] | SMBO | 150 | 96.37% | 3.2M | 6.4x
DenseNet [3] | manual | - | 96.54% | 25.6M | 51x
NSGANetV1-A2 | EA | 27 | 97.35% | 0.9M | 1x
CNN-GA [25] | EA | 35 | 96.78% | 2.9M | 3.2x
AmoebaNet-A [17] | EA | 3,150 | 96.88% | 3.1M | 3.4x
DARTS [9] | relaxation | 1 | 97.18% | 3.4M | 3.8x
NSGA-Net [20] | EA | 4 | 97.25% | 3.3M | 3.7x
NSGANetV1-A3 | EA | 27 | 97.78% | 2.2M | 1x
NASNet-A [8] | RL | 1,575 | 97.35% | 3.3M | 1.5x
LEMONADE [32] | EA | 90 | 97.42% | 13.1M | 6.0x
NSGANetV1-A4 | EA | 27 | 97.98% | 4.0M | 1x
AmoebaNet-B [17] | EA | 3,150 | 97.87% | 34.9M | 8.7x
Proxyless NAS [43] | RL | 1,500 | 97.92% | 5.7M | 1.4x
(b) CIFAR-100
Architecture | Search Method | GPU-Days | Top-1 Acc. | #Params | Ratio-to-NSGANetV1 |
---|---|---|---|---|---|
NSGANetV1-A0 | EA | 27 | 74.83% | 0.2M | 1x |
Genetic CNN [14] | EA | 17 | 70.95% | - | - |
MetaQNN [49] | RL | 90 | 72.86% | 11.2M | 56x |
NSGANetV1-A1 | EA | 27 | 80.77% | 0.7M | 1x |
Large-scale Evo. [15] | EA | 2,750 | 77.00% | 40.4M | 58x |
ResNet [2] | manual | - | 77.90% | 1.7M | 2.4x |
AE-CNN+E2EPP [19] | EA | 10 | 77.98% | 20.9M | 30x |
NSGA-Net [20] | EA | 8 | 79.26% | 3.3M | 4.7x |
CNN-GA [25] | EA | 40 | 79.47% | 4.1M | 5.9x |
PNAS [31] | SMBO | 150 | 80.47% | 3.2M | 4.6x |
ENAS [44] | RL | 0.5 | 80.57% | 4.6M | 6.6x |
NSGANetV1-A2 | EA | 27 | 82.58% | 0.9M | 1x |
AmoebaNet-A [17] | EA | 3,150 | 81.07% | 3.1M | 3.4x |
GDAS [11] | relaxation | 0.2 | 81.62% | 3.4M | 3.8x |
DARTS [9] | relaxation | 1 | 82.46% | 3.4M | 3.8x |
NSGANetV1-A3 | EA | 27 | 82.77% | 2.2M | 1x |
NSGANetV1-A4 | EA | 27 | 85.62% | 4.1M | 1x |
DenseNet [3] | manual | - | 82.82% | 25.6M | 6.2x |
SENet [37] | manual | - | 84.59% | 34.4M | 8.4x |
Block-QNN [35] | RL | 32 | 85.17% | 33.3M | 8.1x |
(b) CIFAR-100
架构 | 搜索方法 | GPU天数 | Top-1准确率 | 参数量 | 与NSGANetV1比率 |
---|---|---|---|---|---|
NSGANetV1-A0 | EA | 27 | 74.83% | 0.2M | 1x |
Genetic CNN[14] | EA | 17 | 70.95% | ||
MetaQNN [49] | RL | 90 | 72.86% | 11.2M | 56x |
NSGANetV1-A1 | EA | 27 | 80.77% | 0.7M | 1x |
Large-scale Evo.[15] | EA | 2,750 | 77.00% | 40.4M | 58x |
ResNet [2] | manual | - | 77.90% | 1.7M | 2.4x |
AE-CNN+E2EPP[19] | EA | 10 | 77.98% | 20.9M | 30x |
NSGA-Net [20] | EA | 8 | 79.26% | 3.3M | 4.7x |
CNN-GA [25] | EA | 40 | 79.47% | 4.1M | 5.9x |
PNAS [31] | SMBO | 150 | 80.47% | 3.2M | 4.6x |
ENAS [44] | RL | 0.5 | 80.57% | 4.6M | 6.6x |
NSGANetV1-A2 | EA | 27 | 82.58% | 0.9M | 1x |
AmoebaNet-A [17] | EA | 3,150 | 81.07% | 3.1M | 3.4x |
GDAS [11] | relaxation | 0.2 | 81.62% | 3.4M | 3.8x |
DARTS [9] | relaxation | 1 | 82.46% | 3.4M | 3.8x |
NSGANetV1-A3 | EA | 27 | 82.77% | 2.2M | 1x |
NSGANetV1-A4 | EA | 27 | 85.62% | 4.1M | 1x |
DenseNet [3] | manual | - | 82.82% | 25.6M | 6.2x |
SENet [37] | manual | - | 84.59% | 34.4M | 8.4x |
Block-QNN [35] | RL | 32 | 85.17% | 33.3M | 8.1x |
(c) ImageNet
Architecture | Search Method | GPU-Days | Top-1 Acc. | Top-5 Acc. | #Params | Ratio-to-NSGANetV1 | #FLOPs | Ratio-to-NSGANetV1 |
---|---|---|---|---|---|---|---|---|
NSGANetV1-A1 | EA | 27 | 70.9% | 90.0% | 3.0M | 1x | 270M | 1x |
MobileNet-V2 [5] | manual | - | 72.0% | 91.0% | 3.4M | 1.1x | 300M | 1.1x |
NSGANetV1-A2 | EA | 27 | 74.5% | 92.0% | 4.1M | 1x | 466M | 1x |
ShuffleNet [4] | manual | - | 73.7% | - | 5.4M | 1.3x | 524M | 1.1x |
NASNet-A [8] | RL | 1,575 | 74.0% | 91.3% | 5.3M | 1.3x | 564M | 1.2x |
PNAS [31] | SMBO | 150 | 74.2% | 91.9% | 5.1M | 1.2x | 588M | 1.3x |
AmoebaNet-A [17] | EA | 3,150 | 74.5% | 92.0% | 5.1M | 1.2x | 555M | 1.2x |
DARTS [9] | relaxation | 1 | 73.1% | 91.0% | 4.9M | 1.2x | 595M | 1.3x |
NSGANetV1-A3 | EA | 27 | 76.2% | 93.0% | 5.0M | 1x | 585M | 1x |
MobileNetV2 (1.4) [5] | manual | - | 74.7% | 92.5% | 6.06M | 1.2x | 582M | 1x |
AmoebaNet-C [17] | EA | 3,150 | 75.7% | 92.4% | 6.4M | 1.3x | 570M | 1x |
† SMBO stands for sequential model-based optimization. SENet is the abbreviation for Shake-Even 29 2x4x64d + SE. ‡ The CIFAR-100 accuracy and #Params for ENAS [44] and DARTS [9] are from [11]. #Params for AE-CNN+E2EPP are from [18].
架构 | 搜索方法 | GPU天数 | Top-1准确率 | Top-5准确率 | 参数量 | 与NSGANetV1比值 | FLOPs | 与NSGANetV1比值 |
---|---|---|---|---|---|---|---|---|
NSGANetV1-A1 | EA | 27 | 70.9% | 90.0% | 3.0M | 1x | 270M | 1x |
MobileNet-V2[5] | manual | - | 72.0% | 91.0% | 3.4M | 1.1x | 300M | 1.1x |
NSGANetV1-A2 | EA | 27 | 74.5% | 92.0% | 4.1M | 1x | 466M | 1x |
ShuffleNet[4] | manual | - | 73.7% | - | 5.4M | 1.3x | 524M | 1.1x |
NASNet-A [8] | RL | 1,575 | 74.0% | 91.3% | 5.3M | 1.3x | 564M | 1.2x |
PNAS [31] | SMBO | 150 | 74.2% | 91.9% | 5.1M | 1.2x | 588M | 1.3x |
AmoebaNet-A [17] | EA | 3,150 | 74.5% | 92.0% | 5.1M | 1.2x | 555M | 1.2x |
DARTS [9] | relaxation | 1 | 73.1% | 91.0% | 4.9M | 1.2x | 595M | 1.3x |
NSGANetV1-A3 | EA | 27 | 76.2% | 93.0% | 5.0M | 1x | 585M | 1x |
MobileNetV2 (1.4)[5] | manual | - | 74.7% | 92.5% | 6.06M | 1.2x | 582M | 1x |
AmoebaNet-C [17] | EA | 3,150 | 75.7% | 92.4% | 6.4M | 1.3x | 570M | 1x |
† SMBO代表序列化基于模型的优化。SENet是Shake-Even 29 2x4x64d + SE的缩写。‡ ENAS [44]和DARTS [9]的CIFAR-100准确率和参数量来自[11]。AE-CNN+E2EPP的参数量来自[18]。
Processing Units (GPUs), expressed in GPU-Days, by each approach to arrive at the reported architectures. The number of GPU-Days is calculated by multiplying the number of GPU cards employed by the execution time in days.
具体而言,我们以各方法为得到所报告架构所需的 GPU-天数来衡量搜索成本;GPU-天数的计算方式为使用的 GPU 卡数量乘以执行时间(以天为单位)。
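For clarity, the GPU-Days metric is simply the product of the two quantities above; the split used below (4 cards for 7 days) is an illustrative example rather than the configuration of any particular experiment.

$$
\text{GPU-Days} = (\#\,\text{GPU cards}) \times (\text{execution time in days}), \qquad \text{e.g.,}\ 4 \times 7 = 28\ \text{GPU-Days}.
$$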
One run of NSGANetV1 on the CIFAR-100 dataset takes approximately 27 GPU-Days to finish, averaged over five runs. The search costs of most of the peer methods are measured on the CIFAR-10 dataset, except for Block-QNN [35], which is measured on CIFAR-100. From the search-cost comparison in Fig. 8b, we observe that our proposed algorithm identifies a set of architectures more efficiently than a number of other approaches, and that the obtained set of architectures has higher performance. More specifically, NSGANetV1 simultaneously finds multiple architectures while using 10x to 100x fewer GPU-Days for searching than most of the considered peer methods, including Hierarchical NAS [16], AmoebaNet [17], NASNet [8], and Proxyless NAS [43], all of which find a single architecture at a time. When compared to the peer multi-objective NAS method LEMONADE [32], NSGANetV1 obtains a better (in the Pareto-dominance sense) set of architectures with 3x fewer GPU-Days. Further experiments and comparisons are provided in the supplementary materials under Section VII-A.
NSGANetV1在CIFAR-100数据集上单次运行平均耗时约27个GPU日(基于五次运行取均值)。除Block-QNN[35]的搜索成本基于CIFAR-100测算外,其他对比方法的搜索成本均在CIFAR-10数据集上测得。通过图8b的搜索成本对比可见,我们提出的算法在识别架构集合时较多数对比方法更为高效,且获得的架构集合具有更高性能。具体而言,NSGANetV1能同时发现多个架构,其搜索过程消耗的GPU日仅为Hierarchical NAS[16]、AmoebaNet[17]、NASNet[8]及Proxyless NAS[43]等对比方法的1/10至1/100(这些方法每次仅能找到一个架构)。与多目标NAS对比方法LEMONADE[32]相比,NSGANetV1以三分之一的GPU日消耗获得了帕累托占优性更优的架构集合。更多实验对比结果详见补充材料第七章节A部分。
Fig. 8: Search efficiency comparison between NSGANetV1 and other baselines in terms of (a) HV, and (b) the required compute time in GPU-Days. The search cost is measured on CIFAR-10 for most methods, except for NSGANetV1 and Block-QNN [35], for which the CIFAR-100 dataset is used.
图 8: NSGANetV1与其他基线方法在搜索效率上的对比:(a) HV指标,(b) GPU天数为单位的计算耗时。除NSGANetV1和Block-QNN [35]使用CIFAR-100数据集外,其余方法的搜索成本均在CIFAR-10上测得。
E. Observations on Evolved Architectures
E. 对进化架构的观察
Population-based approaches with multiple conflicting objectives often lead to a set of diverse solution candidates, which can be “mined” for commonly shared design principles [50]. In order to discover any patterns for more efficient architecture design, we analyzed the entire history of architectures generated by NSGANetV1. We make the following observations:
基于群体且具有多重冲突目标的方法通常会生成一组多样化的候选解决方案,这些方案可被"挖掘"出共有的设计原则 [50]。为了发现更高效架构设计的潜在模式,我们分析了NSGANetV1生成的全部架构历史,得出以下观察结果:
• Non-parametric operations, e.g., skip connections (identity) and average pooling (avg_pool_3x3), are effective in trading off performance for complexity. Empirically, we notice that three out of the four most frequently used operations in non-dominated architectures are non-parametric, as shown in Fig. 9a (see also the supplementary materials under Section VII-B for our follow-up study).
• Larger kernel sizes and parallel operations improve classification accuracy, as shown in Fig. 9a and Fig. 9b, respectively. In particular, the frequencies of convolutions with large kernel sizes (e.g., dil_conv_5x5, conv_7x1_1x7) are significantly higher in the top 20% most accurate architectures than in the non-dominated architectures in general, which must also balance FLOPs. Similar findings are reported in previous work [17], [42].
• 非参数化操作——例如跳跃连接 (identity) 和平均池化 (avg_pool_3x3)——能有效权衡性能与复杂度。实证数据显示,非支配架构中最常使用的四种操作中有三种是非参数化的 (如图 9a 所示,另见补充材料第VII-B节的后续研究)。
• 更大的卷积核尺寸和并行操作能提升分类准确率,分别如图 9a 和图 9b 所示。特别是,在准确率最高的前 20% 架构中,大卷积核操作 (如 dil_conv_5x5, conv_7x1_1x7) 的出现频率显著高于需要兼顾 FLOPs 的一般非支配架构。类似结论在先前研究 [17], [42] 中也有报道。
The above common properties of the multiple final non-dominated solutions serve as important knowledge for future applications. It is noteworthy that such a post-optimal knowledge-extraction process is possible only with a multi-objective optimization study, another benefit we gain from posing NAS as a multi-objective optimization problem.
多个最终非支配解的上述共同特性,将作为未来应用的重要知识。值得注意的是,这种优化后知识提取过程仅能通过多目标优化研究实现,这也是我们将NAS (Neural Architecture Search) 构建为多目标优化问题的另一优势。
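As an illustration of this post-optimal knowledge-extraction step, the short sketch below counts how often each candidate operation appears across a set of architecture encodings. The encoding as plain lists of operation names and the example architectures are hypothetical, not the actual NSGANetV1 genomes.

```python
from collections import Counter

def operation_frequencies(architectures):
    """Relative frequency of each operation across a set of architectures."""
    counts = Counter(op for arch in architectures for op in arch)
    total = sum(counts.values())
    return {op: count / total for op, count in counts.items()}

# Hypothetical non-dominated architectures, each encoded as a list of operations.
non_dominated = [
    ["identity", "avg_pool_3x3", "sep_conv_3x3", "identity"],
    ["avg_pool_3x3", "dil_conv_5x5", "identity", "max_pool_3x3"],
]
for op, freq in sorted(operation_frequencies(non_dominated).items(), key=lambda kv: -kv[1]):
    print(f"{op}: {freq:.2f}")
```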
Fig. 9: Post-search analysis: (a) Frequency with which each operation is selected during the search. (b) Effect of the number of concatenated input channels on validation accuracy. More channels improve the predictive performance of the architectures but adversely affect computational efficiency.
图 9: 搜索后分析: (a) 搜索过程中每个操作被选择的频率。(b) 输入通道数(拼接操作)对验证准确率的影响。更多通道能提升架构的预测性能,但会对计算效率产生负面影响。
V. FURTHER ANALYSIS
V. 进一步分析
The overarching goal of NAS is to find architecture models that generalize to new instances of the tasks the models were trained on. We usually quantify generalization by measuring the performance of a model on a held-out testing set. Since many computer-vision benchmark datasets, including the three datasets used in this paper, i.e., CIFAR-10, CIFAR-100, and ImageNet, have been the focus of intense research for almost a decade, does the steady stream of promising empirical results from NAS simply arise from overfitting of these excessively re-used testing sets? Does advancement on these testing sets imply better robustness to commonly observable image corruptions and to adversarial images, to which the human vision system is more robust? To answer these questions in a quantitative manner, in this section we provide systematic studies on newly proposed testing sets from the CNN literature, followed by a hyper-parameter analysis.
神经架构搜索(NAS)的终极目标是找到能够泛化到训练数据新实例的模型架构。我们通常通过在保留测试集上衡量模型性能来量化泛化能力。由于包括本文使用的CIFAR-10、CIFAR-100和ImageNet在内的许多计算机视觉基准数据集已被集中研究近十年,NAS持续涌现的优秀实证结果是否仅源于对这些过度重复使用的测试集的过拟合?在这些测试集上的进步是否意味着模型对图像常见损坏和对抗样本具有更强的鲁棒性(人类视觉系统对此更具抵抗力)?为了定量回答这些问题,本节我们将基于CNN文献中新提出的测试集开展系统研究,随后进行超参数分析。
Fig. 10: Generalization: We evaluate the models on new and extended test sets for (a) CIFAR-10, and (b) ImageNet. Numbers in the boxes indicate the absolute drop in accuracy (%).
图 10: 泛化能力评估: 我们在 (a) CIFAR-10 和 (b) ImageNet 的新扩展测试集上评估模型性能。方框内数字表示准确率绝对下降百分比 (%)。
A. Generalization
A. 泛化
By mimicking the documented curation process of the original CIFAR-10 and ImageNet datasets, Recht et al. [51] propose two new testing sets, CIFAR-10.1 and ImageNet-V2. Refer to the supplementary materials under Section V-A for details and examples of the new testing sets. Representative architectures are selected from each of the main categories (i.e., RL, EA, relaxation-based, and manual). The selected architectures are similar in number of parameters or FLOPs, except DenseNet-BC [3] and Inception-V1 [1]. All architectures are trained on the original CIFAR-10 and ImageNet training sets as in Section IV-C, then evaluated on CIFAR-10.1 and ImageNet-V2, respectively.
通过模仿原始CIFAR-10和ImageNet数据集记录的筛选过程,Recht等人[51]提出了两个新的测试集CIFAR-10.1和ImageNetV2。关于新测试集的详细信息和示例,请参阅第V-A节下的补充材料。从每个主要类别(即强化学习(RL)、进化算法(EA)、基于松弛的方法和人工设计)中选取代表性架构。除DenseNet-BC[3]和Inception-V1[1]外,所选架构在参数量或FLOPs上相近。所有架构均按照第IV-C节所述在原始CIFAR-10和ImageNet训练集上进行训练,然后分别在CIFAR-10.1和ImageNet-V2上进行评估。
It is evident from the results summarized in Figs. 10a and 10b that there is a significant drop in accuracy across architectures: 3% to 7% on CIFAR-10.1 and 8% to 10% on ImageNet-V2. However, the relative rank order of accuracy on the original testing sets translates well to the new testing sets, i.e., the architecture with the highest accuracy on the original testing set (NSGANetV1 in this case) also has the highest accuracy on the new testing sets. Additionally, we observe that accuracy gains on the original testing sets translate to larger gains on the new testing sets, especially in the case of CIFAR-10 (curvatures of red vs. blue markers in Fig. 10). These results provide evidence that extensive benchmarking on the original testing sets is an effective way to measure the progress of architectures.
从图10a和图10b汇总的结果可以明显看出,在CIFAR-10.1上准确率显著下降了3%-7%,在ImageNet-V2上各架构下降了8%到10%。然而,原始测试集上的准确率相对排名在新测试集上仍然保持良好,即原始测试集上准确率最高的架构(本例中为NSGANetV1)在新测试集上同样保持最高准确率。此外,我们观察到原始测试集上的准确率提升在新测试集上会转化为更大的增益,尤其是在CIFAR-10中(图10中红色与蓝色标记的曲率对比)。这些结果证明,对原始测试集进行广泛基准测试是衡量架构进展的有效方法。
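The two observations above, the absolute accuracy drop and the preservation of the rank order, can be checked with a few lines of code. The accuracy values below are placeholders for illustration only, not the measurements reported in Fig. 10.

```python
def ranking(acc):
    """Architecture names sorted from highest to lowest accuracy."""
    return sorted(acc, key=acc.get, reverse=True)

# Placeholder accuracies (%) on the original and the reproduced test sets.
original = {"arch_A": 97.5, "arch_B": 97.1, "arch_C": 96.6}
new_set  = {"arch_A": 93.4, "arch_B": 92.3, "arch_C": 91.5}

drops = {name: original[name] - new_set[name] for name in original}
rank_preserved = ranking(original) == ranking(new_set)
print(drops, rank_preserved)
```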
B. Robustness
B. 鲁棒性
The vulnerability to small changes in query images may very likely prevent the deployment of deep learning vision systems in safety-critical applications at scale. Understanding the architectural advancements under the scope of robustness against various forms of corruption is still in its infancy. Hendrycks and Dietterich [52] recently introduced two new testing datasets, CIFAR-10-C and CIFAR-100-C, by applying commonly observable corruptions (e.g., noise, weather, compression, etc.) to the clean images from the original datasets.
查询图像中微小变化带来的脆弱性很可能阻碍深度学习视觉系统在安全关键应用中的大规模部署。目前,针对各类损坏形式鲁棒性的架构进展研究仍处于起步阶段。Hendrycks和Dietterich [52] 近期通过向原始数据集的干净图像施加常见可观测损坏(如噪声、天气、压缩等),提出了两个新的测试数据集CIFAR-10-C和CIFAR-100-C。
Fig. 11: Robustness: Effect of commonly observable corruptions and adversarial attacks on (a) CIFAR-10, and (b) CIFAR-100. Higher values of $\epsilon$ indicate more severe adversarial attacks. We use prediction accuracy on the corrupted test images (from each dataset) as a measure of robustness.
图 11: 鲁棒性分析: 常见可观测噪声和对抗攻击对 (a) CIFAR-10 及 (b) CIFAR100 的影响。$\epsilon$ 值越高表示对抗攻击越强。我们使用各数据集中被干扰测试图像的预测准确率作为鲁棒性衡量指标。
Each dataset contains images perturbed by 19 different types of corruption at five levels of severity. More details and visualizations are provided in the supplementary materials under Section V-B. In addition, we include adversarial images as examples of worst-case corruption. We use the fast gradient sign method (FGSM) [53] to construct adversarial examples for both the CIFAR-10 and CIFAR-100 datasets. The severity of the attack is controlled via a hyper-parameter $\epsilon$ as shown below:
每个数据集包含经过19种不同类型损坏处理的图像,每种损坏分为五个严重等级。更多细节和可视化内容请参见补充材料第V-B节。此外,我们还加入了对抗样本作为最坏损坏情况的示例。我们使用快速梯度符号法 (FGSM) [53] 为CIFAR-10和$-100$数据集构建对抗样本,攻击强度通过超参数$\epsilon$控制如下:
$$
\pmb{x}^{\prime} = \pmb{x} + \epsilon \, \mathrm{sign}\!\left(\nabla_{\pmb{x}} \mathcal{L}(\pmb{x}, y_{\mathrm{true}})\right),
$$
$$
\pmb{x}^{\prime} = \pmb{x} + \epsilon \, \mathrm{sign}\!\left(\nabla_{\pmb{x}} \mathcal{L}(\pmb{x}, y_{\mathrm{true}})\right),
$$
where $\pmb{x}$ is the original image, $\pmb{x}^{\prime}$ is the adversarial image, $y_{\mathrm{true}}$ is the true class label, and $\mathcal{L}$ is the cross-entropy loss. Following the previous section, we pick representative architectures of similar complexity from each of the main categories. Using the weights learned on the clean images from the original CIFAR-10/100 training sets, we evaluate each architecture's classification performance on the corrupted datasets as our measure of robustness.
其中 $\pmb{x}$ 是原始图像,$\pmb{x}^{\prime}$ 是对抗图像,$y_{\mathrm{true}}$ 是真实类别标签,$\mathcal{L}$ 是交叉熵损失。延续上一节的思路,我们从每个主要类别中选取复杂度相近的代表性架构。利用在原始 CIFAR-10/100 训练集的干净图像上学习到的权重,我们通过评估各架构在污染数据集上的分类性能来衡量其鲁棒性。
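A minimal PyTorch-style sketch of the FGSM construction in the equation above is given below; the model, the data, and the assumption that pixel values lie in [0, 1] are placeholders for illustration, and this is not the evaluation code used in the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y_true, epsilon):
    """Return x' = x + epsilon * sign(grad_x L(x, y_true)) for a batch of images."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)   # cross-entropy loss L
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv.clamp_(0.0, 1.0)                     # assumes inputs scaled to [0, 1]
    return x_adv.detach()
```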
Our empirical findings summarized in Figs. 11a and 11b appear to suggest that a positive correlation exists between the generalization performance on clean data and data under commonly observable corruptions – i.e., we observe that NSGANetV1 architectures perform noticeably better than other peer methods’ architectures on corrupted datasets even though the robustness measurement was never a part of the architecture selection process in NSGANetV1. However, we emphasize that no architectures are considered robust to corruption, especially under adversarial attacks. We observe that the architectural advancements have translated to minuscule improvements in robustness against adversarial examples. The classification accuracy of all selected architectures deteriorates drastically with minor increments in adversarial attack severity ϵ, leading to the question of whether architecture is the “right” ingredient to investigate in pursuit of adversarial robustness. A further step towards answering this question is provided in supplementary materials under Section V-C.
我们在图11a和11b中总结的实证研究结果表明,干净数据与常见可观测损坏数据上的泛化性能存在正相关关系——即我们观察到,尽管NSGANetV1的架构选择过程从未包含鲁棒性测量指标,但其架构在损坏数据集上的表现明显优于其他同类方法。然而需要强调的是,这些架构均未展现出对数据损坏(尤其是对抗攻击)的鲁棒性。我们发现,架构进步仅带来对抗样本鲁棒性的微小提升。当对抗攻击强度ϵ轻微增加时,所有选定架构的分类准确率都会急剧下降,这引发了一个问题:架构是否是追求对抗鲁棒性研究中"正确"的探索方向。补充材料第V-C节对此问题提供了进一步分析。
C. Ablation Studies
C. 消融研究
Dataset for Search: As previously mentioned in Section III-B, our proposed method differs from most existing peer methods in the choice of dataset on which the search is carried out. Instead of directly following the current practice of searching on the CIFAR-10 dataset, we investigated the utility of searching on multiple benchmark datasets in terms of their ability to provide reliable estimates of classification accuracy and generalization. We carefully selected four datasets, SVHN [54], Fashion-MNIST [55], CIFAR-10, and CIFAR-100, for comparison. The choice was based on a number of factors, including the number of classes, the number of training examples, the resolution, and the required training time. We uniformly sampled 40 architectures from the search space (described in Section III), along with five architectures generated by other peer NAS methods. We trained every architecture three times with different initial random seeds and report the averaged classification accuracy on each of the four datasets in Fig. 12a. Empirically, we observe that the CIFAR-100 dataset is challenging enough for architectural differences to affect predicted performance. This can be observed in Fig. 12a, where the variation (blue boxes) in classification accuracy across architectures is noticeably larger on CIFAR-100 than on the other three datasets. In addition, we observe that the difference in mean classification accuracy on CIFAR-100 between randomly generated architectures and architectures from principle-based methods shows a larger deviation, suggesting that it is less likely to find a good architecture on CIFAR-100 by chance.
用于搜索的数据集:如第III-B节所述,我们提出的方法在搜索数据集的选择上与现有多数同类方法不同。我们没有直接沿用当前使用CIFAR-10数据集的惯例,而是研究了在多个基准数据集上进行搜索的效用,评估它们提供分类准确率和泛化能力可靠估计的能力。我们精心选择了四个数据集进行比较:SVHN [54]、Fashion MNIST [55]、CIFAR-10和CIFAR-100。选择依据包括类别数量、训练样本量、分辨率和所需训练时长等因素。我们从搜索空间(如第III节所述)中统一采样了40个架构,以及五个由其他NAS方法生成的架构。每个架构使用不同随机初始种子训练三次,图12a展示了四个数据集的平均分类准确率。实证表明,CIFAR-100数据集能充分体现架构差异对预测性能的影响,这体现在图12a中CIFAR-100的分类准确率方差(蓝色方框)明显大于其他三个数据集。此外,随机生成架构与基于原理方法生成的架构在CIFAR-100上的分类准确率均值差异具有更高偏差,表明在CIFAR-100上偶然发现优秀架构的可能性更低。
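The comparison behind Fig. 12a can be summarized as in the sketch below, which measures how widely the mean accuracies of the sampled architectures spread on each candidate dataset; a larger spread is read as the dataset being more discriminative of architectural differences. The numbers are placeholders, not the measured accuracies.

```python
import statistics

def spread(accuracies):
    """Inter-quartile range of per-architecture mean accuracies on one dataset."""
    q1, _, q3 = statistics.quantiles(accuracies, n=4)
    return q3 - q1

# Placeholder mean accuracies (%) of sampled architectures on each dataset.
per_dataset = {
    "SVHN":          [95.2, 95.6, 95.9, 96.1],
    "Fashion-MNIST": [93.8, 94.1, 94.5, 94.7],
    "CIFAR-10":      [92.0, 93.4, 94.6, 95.3],
    "CIFAR-100":     [68.5, 72.0, 75.8, 78.9],
}
for name, accs in sorted(per_dataset.items(), key=lambda kv: -spread(kv[1])):
    print(f"{name}: IQR = {spread(accs):.2f}")
```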
Proxy Models: The main computational bottleneck of NAS approaches resides in evaluating the classification accuracy of candidate architectures by invoking the lower-level weight optimization. One such evaluation typically takes hours to finish, which limits the practical utility of the search under a constrained search budget. In our proposed algorithm, we adopt the concept of a proxy model [8], [17]. Proxy models are small-scale versions of the intended architectures, in which the number of layers ($N$ in Fig. 2a) and the number of channels in each layer ($Ch_{init}$ in Fig. 2a) are reduced. Due to this down-scaling, proxy models typically require much less compute time to train†. However, there exists a trade-off between the gain in computational efficiency and the loss of prediction accuracy. Therefore, the performance of an architecture measured at the proxy-model scale does not necessarily serve as a reliable indicator of its performance at the desired scale.
代理模型 (Proxy Models):NAS方法的主要计算瓶颈在于通过调用底层权重优化来评估架构的分类准确率。每次这样的评估通常需要数小时才能完成,这在有限的搜索预算下限制了搜索的实际效用。在我们提出的算法中,我们采用了代理模型的概念 [8][17]。代理模型是目标架构的小规模版本,其中层数 ($N$,见图2a) 和每层通道数 ($Ch_{init}$,见图2a) 都减少了。由于规模缩小,代理模型通常需要的训练计算时间要少得多†。然而,在计算效率的提升和预测准确率的损失之间存在权衡。因此,在代理模型规模下测量的架构性能不一定能作为该架构在目标规模下性能的可靠指标。
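The proxy-model idea can be sketched as a simple down-scaling of the two quantities mentioned above. The configuration class, field names, and scaling factors below are illustrative assumptions rather than the released implementation.

```python
from dataclasses import dataclass, replace

@dataclass
class ArchConfig:
    n_layers: int   # N: number of repeated layers/cells (Fig. 2a)
    ch_init: int    # Ch_init: number of channels in the first layer (Fig. 2a)

def make_proxy(full: ArchConfig, depth_scale: float = 0.5, width_scale: float = 0.5) -> ArchConfig:
    """Down-scale depth and width to obtain a cheaper proxy model for search-time training."""
    return replace(
        full,
        n_layers=max(1, int(full.n_layers * depth_scale)),
        ch_init=max(8, int(full.ch_init * width_scale)),
    )

# Hypothetical full-scale configuration and the proxy used during the search.
full_scale = ArchConfig(n_layers=20, ch_init=36)
proxy = make_proxy(full_scale)
```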
Fig. 12: (a) Mean classification accuracy distribution of randomly generated architectures and architectures from peer NAS methods on four datasets. (b)-(d) Correlation in performance (red lines) vs. savings in gradient descent wall time (blue boxes) obtained by reducing (b) the number of channels in each layer, (c) the proxy model's depth, and (d) the number of training epochs.