[Paper Translation] NetAdaptV2: Efficient Neural Architecture Search with Fast Super-Network Training and Architecture Optimization


Original paper: https://arxiv.org/pdf/2104.00031v1.pdf


NetAdaptV2: Efficient Neural Architecture Search with Fast Super-Network Training and Architecture Optimization


Abstract

Neural architecture search (NAS) typically consists of three main steps: training a super-network, training and evaluating sampled deep neural networks (DNNs), and training the discovered DNN. Most of the existing efforts speed up some steps at the cost of a significant slowdown of other steps or sacrificing the support of non-differentiable search metrics. The unbalanced reduction in the time spent per step limits the total search time reduction, and the inability to support non-differentiable search metrics limits the performance of discovered DNNs. In this paper, we present NetAdaptV2 with three innovations to better balance the time spent for each step while supporting non-differentiable search metrics. First, we propose channel-level bypass connections that merge network depth and layer width into a single search dimension to reduce the time for training and evaluating sampled DNNs. Second, ordered dropout is proposed to train multiple DNNs in a single forward-backward pass to decrease the time for training a super-network. Third, we propose the multi-layer coordinate descent optimizer that considers the interplay of multiple layers in each iteration of optimization to improve the performance of discovered DNNs while supporting non-differentiable search metrics. With these innovations, NetAdaptV2 reduces the total search time by up to 5.8× on ImageNet and 2.4× on NYU Depth V2, respectively, and discovers DNNs with better accuracy-latency/accuracy-MAC trade-offs than state-of-the-art NAS works. Moreover, the discovered DNN outperforms NAS-discovered MobileNetV3 by 1.8% higher top-1 accuracy with the same latency. (Project website: http://netadapt.mit.edu)



Figure 1: The comparison between NetAdaptV2 and related works. The number above a marker is the corresponding total search time measured on NVIDIA V100 GPUs.

Introduction

Neural architecture search (NAS) applies machine learning to automatically discover deep neural networks (DNNs) with better performance (e.g., better accuracy-latency trade-offs) by sampling the search space, which is the union of all discoverable DNNs. The search time is one key metric for NAS algorithms, which accounts for three steps: 1) training a super-network, whose weights are shared by all the DNNs in the search space and trained by minimizing the loss across them, 2) training and evaluating sampled DNNs (referred to as samples), and 3) training the discovered DNN. Another important metric for NAS is whether it supports non-differentiable search metrics such as hardware metrics (e.g., latency and energy). Incorporating hardware metrics into NAS is the key to improving the performance of the discovered DNNs [1, 2, 3, 4, 5].

There is usually a trade-off between the time spent for the three steps and the support of non-differentiable search metrics. For example, early reinforcement-learning-based NAS methods [6, 7, 2] suffer from the long time for training and evaluating samples. Using a super-network [8, 9, 10, 11, 12, 13, 14, 15, 16] solves this problem, but super-network training is typically time-consuming and becomes the new time bottleneck. The gradient-based methods [17, 18, 19, 20, 3, 21, 22, 23, 24] reduce the time for training a super-network and training and evaluating samples at the cost of sacrificing the support of non-differentiable search metrics. In summary, many existing works either have an unbalanced reduction in the time spent per step (i.e., optimizing some steps at the cost of a significant increase in the time for other steps), which still leads to a long total search time, or are unable to support non-differentiable search metrics, which limits the performance of the discovered DNNs.


In this paper, we propose an efficient NAS algorithm, NetAdaptV2, to significantly reduce the total search time by introducing three innovations to better balance the reduction in the time spent per step while supporting non-differentiable search metrics:
Channel-level bypass connections (mainly reduce the time for training and evaluating samples, Sec. 2.2): Early NAS works only search for DNNs with different numbers of filters (referred to as layer widths). To improve the performance of the discovered DNN, more recent works search for DNNs with different numbers of layers (referred to as network depths) in addition to different layer widths at the cost of training and evaluating more samples because network depths and layer widths are usually considered independently. In NetAdaptV2, we propose channel-level bypass connections to merge network depth and layer width into a single search dimension, which requires only searching for layer width and hence reduces the number of samples.
Ordered dropout (mainly reduces the time for training a super-network, Sec. 2.3): We adopt the idea of super-network to reduce the time for training and evaluating samples. In previous works, each DNN in the search space requires one forward-backward pass to train. As a result, training multiple DNNs in the search space requires multiple forward-backward passes, which results in a long training time. To address the problem, we propose ordered dropout to jointly train multiple DNNs in a single forward-backward pass, which decreases the required number of forward-backward passes for a given number of DNNs and hence the time for training a super-network.

Multi-layer coordinate descent optimizer (mainly reduces the time for training and evaluating samples and supports non-differentiable search metrics, Sec. 2.4): NetAdaptV1 [1] and MobileNetV3 [25], which utilizes NetAdaptV1, have demonstrated the effectiveness of the single-layer coordinate descent (SCD) optimizer [26] in discovering high-performance DNN architectures. The SCD optimizer supports both differentiable and non-differentiable search metrics and has only a few interpretable hyper-parameters that need to be tuned, such as the per-iteration resource reduction. However, there are two shortcomings of the SCD optimizer. First, it only considers one layer per optimization iteration. Failing to consider the joint effect of multiple layers may lead to a worse decision and hence sub-optimal performance. Second, the per-iteration resource reduction (e.g., latency reduction) is limited by the layer with the smallest resource consumption (e.g., latency). It may take a large number of iterations to search for a very deep network because the per-iteration resource reduction is relatively small compared with the network resource consumption. To address these shortcomings, we propose the multi-layer coordinate descent (MCD) optimizer that considers multiple layers per optimization iteration to improve performance while reducing search time and preserving the support of non-differentiable search metrics.
Fig. 1 (and Table 1) compares NetAdaptV2 with related works. NetAdaptV2 can reduce the search time by up to 5.8× and 2.4× on ImageNet [27] and NYU Depth V2 [28] respectively and discover DNNs with better performance than state-of-the-art NAS works. Moreover, compared to NAS-discovered MobileNetV3 [25], the discovered DNN has 1.8% higher accuracy with the same latency.

Methodology: NetAdaptV2


Algorithm Overview

NetAdaptV2 searches for DNNs with different network depths, layer widths, and kernel sizes. The proposed channel-level bypass connections (CBCs, Sec. 2.2) enable NetAdaptV2 to discover DNNs with different network depths and layer widths by only searching layer widths because different network depths become the natural results of setting the widths of some layers to zero. To search kernel sizes, NetAdaptV2 uses the superkernel method [21, 22, 12].

Fig. 2 illustrates the algorithm flow of NetAdaptV2. It takes an initial network and uses its sub-networks, which can be obtained by shrinking some layers in the initial network, to construct the search space. In other words, a sample in NetAdaptV2 is a sub-network of the initial network. Because the optimizer needs the accuracy of samples for comparing their performance, the samples need to be trained. NetAdaptV2 adopts the concept of jointly training all sub-networks with shared weights by training a super-network, which has the same architecture as the initial network and contains these shared weights. We use CBCs, the proposed ordered dropout (Sec. 2.3), and superkernel [21, 22, 12] to efficiently train the super-network that contains sub-networks with different layer widths, network depths, and kernel sizes. After training the super-network, the proposed multi-layer coordinate descent optimizer (Sec. 2.4) is used to discover the architectures of DNNs with optimal performance. The optimizer iteratively samples the search space to generate a bunch of samples and determines the next set of samples based on the performance of the current ones. This process continues until the given stop criteria are met (e.g., the latency is smaller than 30ms), and the discovered DNN is then trained until convergence. Because of the trained super-network, the accuracy of samples can be directly evaluated by using the shared weights without any further training.

Figure 2: The algorithm flow of the proposed NetAdaptV2.


Channel-Level Bypass Connections

Previous NAS algorithms generally treat network depth and layer width as two different search dimensions. The reason is evident in the following example. If we remove a filter from a layer, we reduce the number of output channels by one. As a result, if we remove all the filters, there are no output channels for the next layer, which breaks the DNN into two disconnected parts. Hence, reducing layer widths typically cannot be used to reduce network depths. To address this, we need an approach that keeps the network connectivity while removing filters; this is achieved by our proposed channel-level bypass connections (CBCs). The high-level concept of CBCs is: when a filter is removed, an input channel is bypassed to maintain the same number of output channels. In this way, we can preserve the network connectivity when all filters are removed from a layer. Assume the target layer in the initial network has $ C $ input channels, $ T $ filters, and $ Z $ output channels (if we do not use CBCs, $ Z $ is equal to $ T $), and we gradually remove filters from the layer so that $ M $ filters remain. Fig. 3 illustrates how CBCs handle three cases in this process based on the relationship between the number of input channels ($ C $) and the initial number of filters ($ T $) (only $ M $ changes; $ C $ and $ T $ are fixed):

- Case 1, $ C = T $ (Fig. 3, case 1): When the $ i $-th filter is removed, we bypass the $ i $-th input channel, so the number of output channels ($ Z $) stays the same. When all the filters are removed ($ M = 0 $), all the input channels are bypassed, which is the same as removing the layer.
- Case 2, $ C < T $ (Fig. 3, case 2): We do not bypass input channels at the beginning of filter removal because there are more filters than input channels (i.e., $ M > C $) and there are no corresponding input channels to bypass. The bypass process starts when there are fewer filters than input channels ($ M < C $), which reduces to case 1.
- Case 3, $ C > T $ (Fig. 3, case 3): When the $ i $-th filter is removed, we bypass the $ i $-th input channel. The extra ($ C-T $) input channels are not used for the bypass.



Figure 3: An illustration of how CBCs handle different cases based on the relationship between the number of input channels (C) and the initial number of filters (T) (only the number of filters remaining (M) changes, and C and T are fixed). For each case, it shows how the architecture changes with more filters removed from top to bottom. The numbers above lines correspond to the letters below lines. Please note that the number of output channels (Z) will never become zero.

These three cases can be summarized in one rule: when the $ i $-th filter is removed, the corresponding $ i $-th input channel is bypassed if that input channel exists. Therefore, the number of output channels ($ Z $) when using CBCs can be computed by $ Z = max(min(C, T), M) $. The proposed CBCs can be efficiently trained when combined with the proposed ordered dropout, as discussed in Sec. 2.3. As a more advanced usage, we can treat $ T $ as a hyper-parameter. Please note that we only change $ M $; $ C $ and $ T $ are fixed. From the formulation $ Z = max(min(C, T), M) $, we can observe that the function of $ T $ is to limit the number of bypassed input channels and hence the minimum number of output channels ($ Z $). If we set $ T \geq C $ to allow all $ C $ input channels to be bypassed, the formulation becomes $ Z = max(C, M) $, and the minimum number of output channels is $ C $. If we set $ T < C $ to only allow $ T $ input channels to be bypassed, the formulation becomes $ Z = max(T, M) $, and the minimum number of output channels is $ T $. Setting $ T < C $ enables generating a bottleneck, where there are fewer output channels than input channels ($ Z < C $). The bottleneck has been shown to be effective in improving the accuracy-efficiency (e.g., accuracy-latency) trade-offs in MobileNetV2/V3. Take case 1 as an example: in case 1 of Fig. 3, the number of output channels is always $ 4 $, which is the same as the number of input channels ($ Z=C=4 $) no matter how many filters are removed. Therefore, the bottleneck cannot be generated. In contrast, if we set $ T $ to 2 (case 4), no input channels are bypassed until we remove the first two filters because $ Z = max(min(4, 2), 2) = 2 $. After that, it becomes case 3 of Fig. 3, which forms a bottleneck.

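To make the rule concrete, the following minimal sketch (our illustration, not the authors' code; the function name is ours) computes the number of output channels under CBCs and checks the cases and the bottleneck example discussed above.

```python
# A minimal sketch of the CBC output-channel rule Z = max(min(C, T), M).
# C: number of input channels, T: initial number of filters (hyper-parameter),
# M: number of filters remaining after shrinking.

def cbc_output_channels(C: int, T: int, M: int) -> int:
    """Number of output channels of a layer when CBCs are used."""
    return max(min(C, T), M)

# Case 1 (C = T = 4): even with all filters removed, 4 channels are bypassed.
assert cbc_output_channels(C=4, T=4, M=0) == 4
# Case 2 (C < T): no bypassing until M drops below C.
assert cbc_output_channels(C=4, T=8, M=6) == 6
assert cbc_output_channels(C=4, T=8, M=2) == 4
# Case 3 / bottleneck (C=4, T=2): at most T=2 channels are bypassed, so Z can
# drop below C, forming a bottleneck as in the example above.
assert cbc_output_channels(C=4, T=2, M=0) == 2
```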

Ordered Dropout

Training the super-network involves joint training of multiple sub-networks with shared weights. After the super-network is trained, comparing sub-networks of the super-network (i.e., samples) only requires their relative accuracy (e.g., sub-network A has higher accuracy than sub-network B). Generally speaking, the more sub-networks are trained, the better the relative accuracy of sub-networks will be. However, previous works usually require one forward-backward pass for training one sub-network. As a result, training more sub-networks requires more forward-backward passes and hence increases the training time.

To address this problem, we propose ordered dropout (OD) to enable training N sub-networks in a single forward-backward pass with a batch of N images. OD is inserted after each convolutional layer in the super-network and zeros out different output channels for different images in a batch. As shown in Fig. 4, OD simulates different layer widths with a constant number of output channels. Unlike the standard dropout [30] that zeros out a random subset of channels regardless of their positions, OD always zeros out the last channels to simulate removing the last filters. As a result, while sampling the search space, we can simply drop the last filters from the super-network to evaluate samples without other operations like sorting and avoid a mismatch between training and evaluation.
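As a concrete illustration, below is a minimal PyTorch-style sketch of ordered dropout (an assumption about one possible implementation, not the authors' released code): for each image in a batch, only the first few output channels are kept, so a single forward-backward pass trains several layer widths at once.

```python
# A minimal PyTorch sketch of ordered dropout.
import torch
import torch.nn as nn

class OrderedDropout(nn.Module):
    def __init__(self, candidate_widths):
        super().__init__()
        self.candidate_widths = candidate_widths   # e.g., [2, 4, 6, 8]

    def forward(self, x):                          # x: (N, C, H, W)
        if not self.training:
            return x                               # full width at evaluation
        n, c = x.shape[0], x.shape[1]
        # Assign widths round-robin so every width is sampled about equally
        # often within a batch.
        widths = torch.tensor(
            [self.candidate_widths[i % len(self.candidate_widths)] for i in range(n)],
            device=x.device)
        channel_idx = torch.arange(c, device=x.device).view(1, c, 1, 1)
        mask = (channel_idx < widths.view(n, 1, 1, 1)).to(x.dtype)
        return x * mask                            # zero out the last channels

# Usage: insert after a convolution so the shared weights are trained for
# several sub-network widths in one pass.
layer = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), OrderedDropout([2, 4, 6, 8]))
out = layer(torch.randn(4, 3, 32, 32))             # each image sees a different width
```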


Figure 4: An illustration of how NetAdaptV2 uses the proposed ordered dropout to train two different sub-networks in a single forward-backward pass. The ordered dropout is inserted after each convolutional layer to simulate different layer widths by zeroing out some channels of activations. Note that all the sub-networks share the same set of weights.

When combined with the proposed CBCs, OD can train sub-networks with different network depths by zeroing out all output channels of some layers to simulate layer removal. As shown in Fig. 5, to simulate CBCs, there is another OD in the bypass path (upper) during training, which zeros out the complement set of the channels zeroed by the OD in the corresponding convolution path (lower).



Figure 5: An illustration of how NetAdaptV2 uses the proposed channel-level bypass connections and ordered dropout to train a super-network that supports searching different layer widths and network depths.

Because NAS only requires the relative accuracy of samples, we can decrease the number of training iterations to further reduce the super-network training time. Moreover, for each layer, we sample each layer width almost the same number of times in a forward-backward pass to avoid biasing towards any specific layer widths.

Multi-Layer Coordinate Descent Optimizer

The single-layer coordinate descent (SCD) optimizer, used in NetAdaptV1, is a simple-yet-effective optimizer with advantages such as supporting both differentiable and non-differentiable search metrics and having only a few interpretable hyper-parameters that need to be tuned. The SCD optimizer runs an iterative optimization. It starts from the super-network and gradually reduces its latency (or other search metrics such as multiply-accumulate operations and energy). In each iteration, the SCD optimizer generates $ K $ samples if the super-network has $ K $ layers. The $ k $-th sample is generated by shrinking (e.g., removing filters) the $ k $-th layer in the best sample from the previous iteration to reduce its latency by a given amount. This amount is referred to as the per-iteration resource reduction and may change from one iteration to another. Then, the sample with the best performance (e.g., accuracy-latency trade-off) will be chosen and used for the next iteration. The optimization terminates when the target latency is met, and the sample with the best performance in the last iteration is the discovered DNN. The shortcoming of the SCD optimizer is that it generates samples by shrinking only one layer per iteration. This property causes two problems. First, it does not consider the interplay of multiple layers when generating samples in an iteration, which may lead to sub-optimal performance of discovered DNNs. Second, the per-iteration resource reduction (e.g., latency reduction) is limited by the layer with the smallest resource consumption (e.g., latency). It may take a large number of iterations to search for a very deep network because the per-iteration resource reduction is relatively small compared with the network resource consumption. To address these problems, we propose the multi-layer coordinate descent (MCD) optimizer. It generates $ J $ samples per iteration, where each sample is obtained by randomly shrinking $ L $ layers from the previous best sample. In NetAdaptV2, shrinking a layer involves removing filters, reducing the kernel size, or both. Compared with the SCD optimizer, the MCD optimizer considers the interplay of $ L $ layers in each iteration so that the performance of the discovered DNN can be improved. Moreover, it enables using a larger per-iteration resource reduction (i.e., up to the total latency of $ L $ layers) to reduce the number of iterations and hence the search time.

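To make the loop concrete, here is a minimal, runnable sketch of the MCD optimization loop. The "architecture" is reduced to a list of layer widths, and `latency`/`accuracy` are toy proxy functions standing in for the latency model and for evaluation with the trained super-network's shared weights; all names and defaults here are illustrative assumptions, not the authors' implementation.

```python
# A toy sketch of the multi-layer coordinate descent (MCD) loop.
import random

def latency(widths):                  # toy latency model: proportional to total width
    return sum(widths)

def accuracy(widths):                 # toy accuracy proxy: favors wider layers
    return sum(w ** 0.5 for w in widths)

def mcd_search(init_widths, target_latency, J=200, L=3,
               init_reduction=0.03, decay=0.98):
    best = list(init_widths)
    reduction = init_reduction * latency(init_widths)   # per-iteration latency budget
    while latency(best) > target_latency:
        samples = []
        for _ in range(J):
            cand = list(best)
            # Shrink L randomly chosen layers; each gives up a share of the budget.
            for i in random.sample(range(len(cand)), L):
                cand[i] = max(0.0, cand[i] - reduction / L)
            samples.append(cand)
        # Keep the best-performing sample among those that meet the budget.
        feasible = [s for s in samples if latency(s) <= latency(best) - 0.5 * reduction]
        best = max(feasible or samples, key=accuracy)
        reduction *= decay                               # decay the per-iteration reduction
    return best

print(mcd_search([64, 128, 256, 256], target_latency=500))
```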

Related Work

Reinforcement-learning-based methods [6, 7, 2, 4, 31] demonstrate the ability of neural architecture search to design high-performance DNNs. However, their search time is longer than that of the following works due to the long time for training samples individually. Gradient-based methods [17, 18, 19, 3, 21, 22, 23, 24] successfully discover high-performance DNNs with a much shorter search time, but they can only support differentiable search metrics. NetAdaptV1 [1] proposes a single-layer coordinate descent optimizer that supports both differentiable and non-differentiable search metrics and was used to design the state-of-the-art MobileNetV3 [25]. However, shrinking only one layer when generating each sample and training samples individually become its search-time bottlenecks. The idea of super-network training [10, 11, 12], which jointly trains all the sub-networks in the search space, is proposed to reduce the time for training and evaluating samples and training the discovered DNN at the cost of a significant increase in the time for training a super-network. Moreover, network depth and layer width are usually considered separately in related works. The proposed NetAdaptV2 addresses all these problems at the same time by reducing the time for training a super-network, training and evaluating samples, and training the discovered DNN in balance while supporting non-differentiable search metrics.

The algorithm flow of NetAdaptV2 is most similar to NetAdaptV1 [1], as shown in Fig. 2. Compared with NetAdaptV2, NetAdaptV1 does not train a super-network but trains each sample individually. Moreover, NetAdaptV1 considers only one layer per optimization iteration and searches only different layer widths, whereas NetAdaptV2 considers multiple layers per optimization iteration and searches different layer widths, network depths, and kernel sizes. Therefore, NetAdaptV2 is both faster and more effective than NetAdaptV1, as shown in Sec. 4.1.4 and 4.2.

For the methodology, the proposed ordered dropout is most similar to partial channel connections [24]. However, they differ in their purpose and in their ability to expand the search space. Partial channel connections aim to reduce memory consumption while training a DNN with multiple parallel paths by removing some channels. The number of channels removed is constant during training. Moreover, this number is manually chosen. As a result, partial channel connections do not expand the search space. In contrast, the proposed ordered dropout is designed for jointly training multiple sub-networks and expanding the search space. The number of channels removed (i.e., zeroed out) varies from image to image and from one training iteration to another during training to simulate different sub-networks. Moreover, the final number of channels removed (i.e., the discovered architecture) is searched. Therefore, the proposed ordered dropout expands the search space in terms of layer width as well as network depth when the proposed channel-level bypass connections are used.


Experiment Results

We apply NetAdaptV2 on two applications (image classification and depth estimation) and two search metrics (latency and multiply-accumulate operations (MACs)) to demonstrate the effectiveness and versatility of NetAdaptV2 across different operating conditions. We also perform an ablation study to show the impact of each of the proposed techniques and the associated hyper-parameters.


Image Classification

Experiment Setup

We use latency or MACs to guide NetAdaptV2. The latency is measured on a Google Pixel 1 CPU. The search time is reported in GPU-hours and measured on V100 GPUs. The dataset is ImageNet. We reserve 10K images in the training set for comparing the accuracy of samples and train the super-network with the rest of the training images. The accuracy of the discovered DNN is reported on the validation set, which was not seen during the search. The initial network is based on MobileNetV3. The maximum learning rate is 0.064 decayed by 0.963 every 3 epochs when the batch size is 1024. The learning rate scales linearly with the batch size. The optimizer is RMSProp with an $ \ell_2 $ weight decay of $ 10^{-5} $. The dropout rate is 0.2. The decay rate of the exponential moving average is 0.9998. The batch size is 1024 for training the super-network, 2048 for training the latency-guided discovered DNN, and 1536 for training the MAC-guided discovered DNN. The multi-layer coordinate descent (MCD) optimizer generates 200 samples per iteration ($ J=200 $). For the latency-guided experiment, each sample is obtained by randomly shrinking 10 layers ($ L=10 $) from the best sample in the previous iteration. We reduce the latency by 3% in the first iteration (i.e., initial resource reduction) and decay the resource reduction by 0.98 every iteration. For the MAC-guided experiment, each sample is obtained by randomly shrinking 15 layers ($ L=15 $) from the best sample in the previous iteration. We reduce the MACs by 2.5% in the first iteration and decay the resource reduction by 0.98 every iteration. More details are included in the appendix.
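For clarity, the learning-rate schedule described above can be read as the small sketch below (our illustration, not the released training code): the maximum learning rate of 0.064 at batch size 1024 is scaled linearly with the batch size and decayed by 0.963 every 3 epochs.

```python
# A small sketch of the learning-rate schedule described in the setup.
def learning_rate(epoch: int, batch_size: int, base_lr: float = 0.064,
                  base_batch: int = 1024, decay: float = 0.963,
                  decay_every: int = 3) -> float:
    lr = base_lr * batch_size / base_batch         # linear scaling with batch size
    return lr * decay ** (epoch // decay_every)    # step decay every 3 epochs

# e.g., the schedule at batch size 2048 (used for the latency-guided DNN)
print(learning_rate(epoch=9, batch_size=2048))
```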


Latency-Guided Search Result

The results of NetAdaptV2 guided by latency and related works are summarized in Table 1. Compared with the state-of-the-art (SOTA) NAS algorithms, NetAdaptV2 reduces the search time by up to 5.8× and discovers DNNs with better accuracy-latency/accuracy-MAC trade-offs. The reduced search time is the result of the much more balanced time spent per step. Compared with the NAS algorithms in the class of hundreds of GPU-hours, ProxylessNAS and Single-Path NAS, NetAdaptV2 outperforms them without sacrificing the support of non-differentiable search metrics: it achieves either 2.4% higher accuracy with 1.5× lower latency or 1.4% higher accuracy with 1.6× lower latency. Compared with SOTA NAS-discovered MobileNetV3, NetAdaptV2 achieves 1.8% higher accuracy with the same latency in around 50 hours on eight GPUs. We estimate the $ CO_2 $ emission of NetAdaptV2 based on Strubell et al., as detailed in the appendix; NetAdaptV2 discovers DNNs with better accuracy-latency/accuracy-MAC trade-offs with low $ CO_2 $ emission.


MAC-Guided Search Result

We present the result of NetAdaptV2 guided by MACs and compare it with related works in Table 2. For a fair comparison, AutoAugment and stochastic depth with a survival probability of 0.8 are used for training the discovered network, which results in a longer time for training the discovered DNN. NetAdaptV2 achieves comparable accuracy-MAC trade-offs to NSGANetV2-m while the search time is 2.6× lower. Moreover, the discovered DNN also outperforms EfficientNet-B0 and MixNet-M by up to 1.5% higher top-1 accuracy with fewer MACs.

| Method | Top-1 Accuracy (%) | MAC (M) | Search Time (GPU-Hours) |
| - | - | - | - |
| NSGANetV2-m [38] | 78.3 | 312 | 1674 (1200, 24, 450) |
| EfficientNet-B0 [31] | 77.3 | 390 | - |
| MixNet-M [39] | 77.0 | 360 | - |
| NetAdaptV2 (Guided by MAC) | 78.5 | 314 | 656 (204, 35, 417) |

Table 2: The comparison between NetAdaptV2 guided by MACs and related works. The numbers between parentheses show the breakdown of the search time in terms of training a super-network, training and evaluating samples, and training the discovered DNN, from left to right.


Ablation Study

The ablation study employs the experiment setup outlined in Sec. 4.1.1 unless otherwise stated. To speed up training the discovered networks, the distillation model is smaller. To study the impact of the proposed ordered dropout (OD), we do not use channel-level bypass connections (CBCs) and the multi-layer coordinate descent (MCD) optimizer in this experiment. When we further remove the usage of OD, NetAdaptV2 becomes the same as NetAdaptV1, where each sample needs to be trained for four epochs by following the setting of NetAdaptV1. To speed up the execution of NetAdaptV1, we use a shallower network, MobileNetV1, in this experiment instead. Table 3 shows that using OD reduces the search time by $ 3.3\times $ while achieving the same accuracy-latency trade-off. If we only consider the time for training a super-network and training and evaluating samples, which are affected by OD, the time reduction is $ 10.4\times $.

| OD | Top-1 Accuracy (%) | Latency (ms) | Search Time (GPU-Hours) |
| - | - | - | - |
| | 71.0 (+0) | 43.9 (100%) | 721 (100%) (0, 543, 178) |
| ✓ | 71.1 (+0.1) | 44.4 (101%) | 221 (31%) (50, 2, 169) |

Table 3: The ablation study of the proposed ordered dropout (OD) on MobileNetV1 [40] and ImageNet. The numbers between parentheses show the breakdown of the search time in terms of training a super-network, training and evaluating samples, and training the discovered DNN, from left to right.

The proposed channel-level bypass connections (CBCs) enable NetAdaptV2 to search for different network depths. Table 4 shows that CBCs can improve the accuracy by 0.3%. The difference is more significant when we target lower latency, as shown in the ablation study on MobileNetV1 in the appendix, because the ability to remove layers becomes more critical for maintaining accuracy.

| CBC | MCD | Top-1 Accuracy (%) |
| - | - | - |
| | | 75.9 (+0) |
| ✓ | | 76.2 (+0.3) |
| ✓ | ✓ | 76.6 (+0.7) |

Table 4: The ablation study of the channel-level bypass connections (CBCs) and the multi-layer coordinate descent optimizer (MCD) on ImageNet. The latency of the discovered networks is around 51ms, and ordered dropout is used.

The proposed multi-layer coordinate descent (MCD) optimizer improves the performance of the discovered DNN by considering the joint effect of multiple layers per optimization iteration. Table 4 shows that using the MCD optimizer further improves the accuracy by 0.4%. The two main hyper-parameters of the MCD optimizer are the per-iteration resource reduction, which is defined by an initial resource reduction and a decay rate, and the number of samples per iteration ($ J $). They influence the accuracy of the discovered networks and the search time. Table 5 summarizes the accuracy of the 51ms discovered networks when using different initial latency reductions (with a fixed decay of 0.98 per iteration) and different numbers of samples.


The first experiment fixes the number of samples per iteration and increases the initial latency reduction from 1.5% to 6.0%, which gradually reduces the time for evaluating samples. The result shows that as long as the latency reduction is small enough, specifically below 3% in this experiment, the accuracy of the discovered networks does not change with the latency reduction. The second experiment fixes the time for evaluating samples by scaling both the initial latency reduction and the number of samples per iteration at the same rate. As shown in Table 5, as long as the latency reduction is small enough, more samples will result in better discovered networks. However, if the initial latency reduction is too large, increasing the number of samples per iteration cannot prevent the accuracy from degrading.

| Setting | Initial Latency Reduction | Number of Samples (J) | Top-1 Accuracy (%) |
| - | - | - | - |
| Fixed number of samples per iteration | 1.5% | 100 | 76.4 |
| | 3.0% | 100 | 76.4 |
| | 6.0% | 100 | 75.9 |
| Fixed time for evaluating samples | 1.5% | 100 | 76.4 |
| | 3.0% | 200 | 76.6 |
| | 6.0% | 400 | 75.7 |

Table 5: The accuracy of the 51ms discovered networks when using different initial latency reductions (with a fixed decay of 0.98 per iteration) and different numbers of samples per iteration (J).

To know the accuracy variation of each step in NetAdaptV2, we execute different steps three times and summarize the resultant accuracy of the discovered networks in Table 6. The initial latency reduction is 1.5%, and the number of samples per iteration is 100 ($ J=100 $). The latency of discovered networks is around 51ms. According to the last row of Table 6, which corresponds to executing the entire algorithm flow of NetAdaptV2 three times, the accuracy variation is 0.3%. The variation is fairly small because simply training the same discovered network three times results in an accuracy variation of 0.1% as shown in the first row. Moreover, when we fix the super-network and execute the MCD optimizer three times as shown in the second row, the accuracy variation is the same as that of executing the entire NetAdaptV2 three times. The result suggests that the randomness in training a super-network does not increase the overall accuracy variation, which is preferable since we only need to perform this relatively costly step one time.

| Training Super-Network | Evaluating Samples | Training Discovered DNN | Top-1 Accuracy of Executions (%) |
| - | - | - | - |
| | | ✓ | ≈76.1 (variation 0.1) |
| | ✓ | ✓ | ≈76.1 (variation 0.3) |
| ✓ | ✓ | ✓ | ≈76.1 (variation 0.3) |

Table 6: The accuracy variation of NetAdaptV2. The ✓ denotes the step is executed three times, and the others are executed once. For example, the last row corresponds to executing the entire algorithm flow of NetAdaptV2 three times. For the MCD optimizer, the initial latency reduction is 1.5%, and the number of samples per iteration is 100 (J=100). The latency of all discovered networks is around 51ms, and the accuracy values are sorted in ascending order.


Depth Estimation


Experiment Setup

NYU Depth V2 is used for depth estimation. We reserve 2K training images for evaluating the performance of samples and train the super-network with the rest of the training images. The initial network is FastDepth. Following FastDepth, we pre-train the encoder of the super-network on ImageNet. The batch size is 256, and the learning rate is 0.9 decayed by 0.963 every epoch. After pre-training the encoder, we train the super-network on NYU Depth V2 for 50 epochs with a batch size of 16 and an initial learning rate of 0.025 decayed by 0.9 every epoch. For the MCD optimizer, we generate 150 ( $ J=150 $ ) samples per iteration. We search with latency measured on a Google Pixel 1 CPU. The latency reduction is 1.5% in the first iteration and is decayed by 0.98 every iteration. For training the discovered network, we use the same setup as training the super-network, except that the initial learning rate is 0.05.


Search Result

| Method | RMSE (m) | Delta-1 Accuracy (%) | Latency (ms) | Search Time on ImageNet (GPU-Hours) | Search Time on NYU Depth V2 (GPU-Hours) |
| - | - | - | - | - | - |
| NetAdaptV1 [1] | 0.583 | 77.4 | 87.6 | 96 | 65 |
| NetAdaptV2 | 0.576 | 77.9 | 86.7 | 96 | 27 |

Table 7: The comparison between NetAdaptV2 and NetAdaptV1 on depth estimation and NYU Depth V2.


The comparison between NetAdaptV2 and NetAdaptV1, which is used in FastDepth, is summarized in Table 7. NetAdaptV2 reduces the search time by 2.4× on NYU Depth V2, and the discovered DNN outperforms that of NetAdaptV1 by 0.5% higher delta-1 accuracy with comparable latency. Because NYU Depth V2 is much smaller than ImageNet, the reduction in the total search time is smaller than that of applying NetAdaptV2 on ImageNet. The search time spent on ImageNet is for pre-training the encoder, which is a common practice for training DNNs and is indispensable for depth estimation on NYU Depth V2.

Conclusion

In this paper, we propose NetAdaptV2, an efficient neural architecture search algorithm, which significantly reduces the total search time and discovers DNNs with state-of-the-art accuracy-latency/accuracy-MAC trade-offs. NetAdaptV2 better balances the time spent per step and supports non-differentiable search metrics. This is realized by the proposed methods: channel-level bypass connections, ordered dropout, and multi-layer coordinate descent optimizer. The experiments demonstrate that NetAdaptV2 can reduce the total search time by up to $ 5.8\times $ on image classification and $ 2.4\times $ on depth estimation and discover DNNs with better performance than state-of-the-art works. This research was funded by the National Science Foundation, Real-Time Machine Learning (RTML) program, through grant No. 1937501, a Google Research Award, and gifts from Intel and Facebook.


Additional Information about Experiment Setup

This section provides additional information about the experiment setup for image classification (Sec. 4.1) and depth estimation (Sec. 4.2).


Image Classification

For the latency-guided experiment, we design the initial network based on MobileNetV3 with an input resolution of $ 224\times224 $ , which is widely used to construct the search space. Starting from MobileNetV3-Large, we round up all layer widths to the power of 2 and add two MobileNetV3 blocks, each with the input resolution of $ 28\times28 $ and $ 14\times14 $ . The kernel sizes of depthwise layers are increased by 2. For sampling sub-networks, we allow 9 uniformly-spaced layer widths from 0 to the full layer width and different odd kernel sizes from 3 to the full kernel size for each layer. We use the initial network of Once-for-All for knowledge distillation when training the discovered network for a fair comparison with Once-for-All and BigNAS. For the MAC-guided experiment, following the practice of Once-for-All and NSGANetV2, we increase layer widths of the initial network used in the latency-guided experiment by $ 1.25 \times $ and add one MobileNetV3 block with an input resolution of $ 7 \times 7 $ to support a large-MAC operating condition. For sampling sub-networks, we allow 11 uniformly-spaced layer widths from 0 to the full layer width and different odd kernel sizes from 3 to the full kernel size for each layer. When training the discovered DNN, we use the initial network of Once-for-All for knowledge distillation for a fair comparison with NSGANetV2.
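As a concrete reading of the per-layer search space described above, the snippet below (our illustration; the helper names are ours) enumerates the 9 uniformly-spaced layer widths and the odd kernel sizes for an example layer.

```python
# A small sketch of the per-layer search-space enumeration: 9 uniformly-spaced
# layer widths from 0 to the full width, and odd kernel sizes from 3 up to the
# full kernel size.
def candidate_widths(full_width: int, num_choices: int = 9):
    return [round(full_width * i / (num_choices - 1)) for i in range(num_choices)]

def candidate_kernel_sizes(full_kernel_size: int):
    return list(range(3, full_kernel_size + 1, 2))

print(candidate_widths(64))        # [0, 8, 16, 24, 32, 40, 48, 56, 64]
print(candidate_kernel_sizes(7))   # [3, 5, 7]
```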


Depth Estimation

The initial network is FastDepth. FastDepth consists of an encoder and a decoder. The encoder uses MobileNetV1 as a feature extractor, and the decoder uses depthwise separable convolution and nearest neighbor upsampling. Please refer to FastDepth for more details. For sampling sub-networks, we allow 9 uniformly-spaced layer widths from 0 to the full layer width and different odd kernel sizes from 3 to the full kernel size for each layer. Following a common practice of training DNNs on the NYU Depth V2 dataset, we pre-train the encoder of the initial network on ImageNet. However, as shown in Table 7, this step is relatively expensive and takes 96 GPU-hours. To avoid pre-training the encoder of the discovered DNN on ImageNet again, we transfer the knowledge learned by the encoder of the initial network to that of the discovered DNN, which is achieved by the following method. We log the architecture of the best sample in each iteration of the MCD optimizer, which forms an architecture trajectory. The starting point of this trajectory is the initial DNN architecture, and the end point is the discovered DNN architecture. For training the discovered network, we start from the starting point of the trajectory with the pre-trained weights of the initial DNN and follow this trajectory to gradually shrink the architecture. In each step of shrinking the architecture, we reuse the overlapped weights from the previous architecture and train the new architecture for two epochs. This process continues until we get to the end point of the trajectory, which is the discovered DNN architecture. Then, we train it until convergence. This knowledge transfer method enables high accuracy of the discovered DNN without pre-training its encoder on ImageNet.
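The trajectory-based knowledge transfer can be sketched as follows (our paraphrase of the procedure above; `build`, `copy_overlapping_weights`, and `train` are assumed helper functions, not the authors' API).

```python
# A minimal sketch of training the discovered DNN along the architecture
# trajectory logged by the MCD optimizer. `trajectory[0]` is the initial
# architecture and `trajectory[-1]` is the discovered one.
def train_along_trajectory(trajectory, pretrained_encoder_weights,
                           build, copy_overlapping_weights, train,
                           epochs_per_step=2, final_epochs=50):
    model = build(trajectory[0])
    # Start from the initial DNN with its ImageNet-pre-trained encoder weights.
    model.load_state_dict(pretrained_encoder_weights, strict=False)
    for arch in trajectory[1:]:
        next_model = build(arch)
        # Reuse the weights that overlap with the previous (larger) architecture.
        copy_overlapping_weights(src=model, dst=next_model)
        train(next_model, epochs=epochs_per_step)   # short fine-tuning per step
        model = next_model
    train(model, epochs=final_epochs)               # train the end point until convergence
    return model
```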


Discovered DNN Architectures

For the latency-guided experiment on ImageNet (Table 1), Table 8 shows the discovered 51ms DNN architecture of NetAdaptV2. To make the numbers of MACs of all layers as similar as possible, modern DNN design usually doubles the number of filters and channels when the resolution of activations is reduced by 2×. Similarly, to fix the ratio of $ T $ (Sec. 2.2) to the number of input channels, we use one value of $ T $ for each resolution of input activations and set $ T $ inversely proportional to the resolution. The $ T $ values of all depthwise layers are set to infinity to allow bypassing all the input channels. We observe that channel-level bypass connections (CBCs) are widely used in the discovered DNN. Moreover, block 12 is removed, which demonstrates the ability of CBCs to remove a layer.

Index Type T Kernel Size Stride BN Act Exp DW PW SE
1 conv 8 3 2 HS 16 - - -
2 mnv3 block 8 3 1 RE 8 8 16 -
3 mnv3 block 16 5 2 RE 48 48 20 16
4 mnv3 block 16 3 1 RE 48 48 32 -
5 mnv3 block 32 7 2 RE 80 80 32 32
6 mnv3 block 32 3 1 RE 112 80 40 32
7 mnv3 block 32 3 1 RE 64 32 16 32
8 mnv3 block 32 3 1 RE 96 96 8 32
9 mnv3 block 64 5 2 HS 192 192 128 64
10 mnv3 block 64 5 1 HS 224 192 128 -
11 mnv3 block 64 3 1 HS 128 32 48 64
12 mnv3 block 64 0 1 HS 0 0 0 0
13 mnv3 block 64 3 1 HS 512 256 80 256
14 mnv3 block 64 3 1 HS 256 256 112 256
15 mnv3 block 64 5 1 HS 512 512 64 256
16 mnv3 block 128 7 2 HS 640 640 224 256
17 mnv3 block 128 7 1 HS 640 384 224 256
18 mnv3 block 128 5 1 HS 896 512 224 256
19 conv 128 1 1 HS 1024 - - -
20 global avg pool - - - - - - - - -
21 conv 1024 1 1 HS 1792 - - -
22 fc - 1 1 - - 1000 - - -

Table 8: The discovered 51ms DNN architecture of NetAdaptV2 on ImageNet presented in Table 1. Type: type of the layer or block. BN: using batch normalization. Act: activation type (HS: Hard-Swish, RE: ReLU). Exp: number of filters in the expansion layer or number of filters in the conv layer. DW: number of filters in the depthwise layer. PW: number of filters in the pointwise layer. SE: number of filters in the squeeze-and-excitation operation. All MobileNetV3 blocks (mnv3 block) with a stride of 1 have residual connections.

For the MAC-guided experiment on ImageNet (Table 2), Table 9 shows the discovered 314M-MAC DNN architecture. We apply the same rule for setting the $ T $ values as in the latency-guided experiment. We observe that CBCs are widely used in the discovered DNN. Moreover, MobileNetV3 blocks 7, 8, 12, and 15 are removed.

Index Type T Kernel Size Stride BN Act Exp DW PW SE
1 conv 8 3 2 HS 24 - - -
2 mnv3 block 8 3 1 RE - 24 24 -
3 mnv3 block 16 5 2 RE 64 48 32 24
4 mnv3 block 16 3 1 RE 128 64 32 -
5 mnv3 block 32 5 2 RE 96 96 48 40
6 mnv3 block 32 3 1 RE 128 80 56 40
7 mnv3 block 32 0 1 RE 0 0 0 0
8 mnv3 block 32 0 1 RE 0 0 0 0
9 mnv3 block 64 5 2 HS 224 224 96 80
10 mnv3 block 64 3 1 HS 224 96 96 -
11 mnv3 block 64 3 1 HS 256 256 96 80
12 mnv3 block 64 0 1 HS 0 0 0 0
13 mnv3 block 64 5 1 HS 640 640 144 320
14 mnv3 block 64 3 1 HS 640 512 144 320
15 mnv3 block 64 0 1 HS 0 0 0 0
16 mnv3 block 128 5 2 HS 768 640 192 320
17 mnv3 block 128 5 1 HS 768 256 192 320
18 mnv3 block 128 7 1 HS 896 768 192 320
19 mnv3 block 128 7 1 HS 1152 1152 192 320
20 conv 128 1 1 HS 1152 - - -
21 global avg pool - - - - - - - - -
22 conv 1024 1 1 HS 2048 - - -
23 fc - 1 1 - - 1000 - - -

Table 9: The discovered 314M-MAC DNN architecture of NetAdaptV2 on ImageNet presented in Table 2. Type: type of the layer or block. BN: using batch normalization. Act: activation type (HS: Hard-Swish, RE: ReLU). Exp: number of filters in the expansion layer or number of filters in the conv layer. DW: number of filters in the depthwise layer. PW: number of filters in the pointwise layer. SE: number of filters in the squeeze-and-excitation operation. All MobileNetV3 blocks (mnv3 block) with a stride of 1 have residual connections.

For the depth estimation experiment on the NYU Depth V2 dataset (Table 7), Table 10 shows the discovered 87ms DNN architecture of NetAdaptV2. We apply the same rule for setting the $ T $ values as in the image classification experiments. We observe that NetAdaptV2 reduces the kernel sizes of the $ 37 $-th and $ 40 $-th depthwise convolutional layers from 5 to 3, which demonstrates that the ability to search kernel sizes may improve the performance of the discovered DNN.

Table 10: The discovered 87ms DNN architecture of NetAdaptV2 on NYU Depth V2 presented in Table 7. Type: type of the layer, which can be standard convolution (conv), depthwise convolution (dw), pointwise convolution (pw), or nearest neighbor upsampling (upsample). Filter: number of filters. All layers except for upsampling layers are followed by a batch normalization layer and a ReLU activation layer.


Formulation of Channel-Level Bypass Connections

The formulation of channel-level bypass connections (CBCs), $ Z = max(min(C, T), M) $, can be derived by considering cases 1 to 3 in Sec. 2.2 and Fig. 3. For cases 1 ($ C = T $) and 2 ($ C < T $), CBCs start bypassing input channels when $ M $ becomes smaller than $ C $ ($ M < C $) to maintain the number of output channels $ Z = max(C, M) = C $. For case 3 ($ C > T $), CBCs start bypassing input channels when $ M $ becomes smaller than $ T $ ($ M < T $) instead of $ C $, which requires replacing the $ C $ in $ Z = max(C, M) $ with $ min(C, T) $ and gives the formulation of CBCs.


Ablation Study on MobileNetV1

This section provides the ablation study on MobileNetV1. This ablation study employs the experiment setup outlined in Sec. 4.1.1 unless otherwise stated. The initial network is the largest MobileNetV1 (1.0 MobileNet-224).


Impact of Channel-Level Bypass Connections

The proposed channel-level bypass connections (CBCs) enable NetAdaptV2 to search for different network depths with marginal overhead. Table 11 shows that supporting CBCs only increases the training time of the super-network by $ 1.2\times $. Moreover, the ability to search network depth allows discovering DNNs with better performance. As shown in Table 11, CBCs improve the accuracy of the discovered DNN by 6.5% with the same latency.

| CBC | MCD | Top-1 Accuracy (%) | # of Layers | Super-Network Training Speed (min/epoch) | # of Samples |
| - | - | - | - | - | - |
| | | 40.0 (+0) | 28 (-0) | 3.2 (100%) | 1064 (100%) |
| ✓ | | 46.5 (+6.5) | 19 (-9) | 3.8 (119%) | 1092 (103%) |
| ✓ | ✓ | 49.3 (+9.3) | 17 (-11) | 3.8 (119%) | 567 (53%) |

Table 11: The ablation study of the channel-level bypass connections (CBCs) and the multi-layer coordinate descent (MCD) optimizer on ImageNet and MobileNetV1. The latency of the discovered networks is around 6.5ms.


Impact of Multi-Layer Coordinate Descent Optimizer

The proposed multi-layer coordinate descent (MCD) optimizer improves the performance of the discovered DNN while reducing the number of samples and hence the search time. In this experiment, the MCD optimizer generates 27 samples ($ J=27 $) in each iteration, where $ J $ is equal to the number of layers, and each sample is obtained by randomly shrinking 4 layers ($ L=4 $). Table 11 shows that the MCD optimizer reduces the time for evaluating samples by $ 1.9\times $ and improves the accuracy by 2.8%.


Estimation of $ CO_2 $ Emission

We estimate $ CO_2 $ emission based on Strubell et al. According to Table 3 in that paper, when BERT$ _{base} $ is trained on 64 V100 GPUs for 79 hours, the $ CO_2 $ emission is 1438 lbs. Therefore, the ratio of $ CO_2 $ emission to GPU-hours is $ \frac{1438}{64 \times 79} = 0.2844 $. For each NAS method, we multiply its search time by this ratio to estimate its corresponding $ CO_2 $ emission.
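For clarity, the arithmetic can be written out as the small snippet below; the 1438 lbs, 64 GPUs, and 79 hours figures come from Strubell et al., and the example search time is taken from Table 2.

```python
# The CO2-estimation arithmetic described above, written out for clarity.
BERT_BASE_CO2_LBS = 1438.0
BERT_BASE_GPU_HOURS = 64 * 79                 # 64 V100 GPUs for 79 hours

LBS_PER_GPU_HOUR = BERT_BASE_CO2_LBS / BERT_BASE_GPU_HOURS   # ~0.2844

def co2_emission_lbs(search_time_gpu_hours: float) -> float:
    """Estimated CO2 emission (lbs) for a NAS run of the given GPU-hours."""
    return search_time_gpu_hours * LBS_PER_GPU_HOUR

# Example: the MAC-guided NetAdaptV2 search (656 GPU-hours, Table 2).
print(round(co2_emission_lbs(656), 1))        # ~186.6 lbs
```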
