NetAdaptV2: Efficient Neural Architecture Search with Fast Super-Network Training and Architecture Optimization
Abstract
Neural architecture search (NAS) typically consists of three main steps: training a super-network, training and evaluating sampled deep neural networks (DNNs), and training the discovered DNN. Most of the existing efforts speed up some steps at the cost of a significant slowdown of other steps or sacrificing the support of non-differentiable search metrics. The unbalanced reduction in the time spent per step limits the total search time reduction, and the inability to support non-differentiable search metrics limits the performance of discovered DNNs. In this paper, we present NetAdaptV2 with three innovations to better balance the time spent for each step while supporting non-differentiable search metrics. First, we propose channel-level bypass connections that merge network depth and layer width into a single search dimension to reduce the time for training and evaluating sampled DNNs. Second, ordered dropout is proposed to train multiple DNNs in a single forward-backward pass to decrease the time for training a super-network. Third, we propose the multi-layer coordinate descent optimizer that considers the interplay of multiple layers in each iteration of optimization to improve the performance of discovered DNNs while supporting non-differentiable search metrics. With these innovations, NetAdaptV2 reduces the total search time by up to $5.8\times$ on ImageNet and $2.4\times$ on NYU Depth V2, and discovers DNNs with better accuracy-latency/accuracy-MAC trade-offs than state-of-the-art NAS works. Moreover, the discovered DNN achieves 1.8% higher top-1 accuracy than NAS-discovered MobileNetV3 at the same latency. (Project website: http://netadapt.mit.edu)
Figure 1: The comparison between NetAdaptV2 and related works. The number above a marker is the corresponding total search time measured on NVIDIA V100 GPUs.
Introduction
Neural architecture search (NAS) applies machine learning to automatically discover deep neural networks (DNNs) with better performance (e.g., better accuracy-latency trade-offs) by sampling the search space, which is the union of all discoverable DNNs. The search time, a key metric for NAS algorithms, accounts for three steps: 1) training a super-network, whose weights are shared by all the DNNs in the search space and trained by minimizing the loss across them, 2) training and evaluating sampled DNNs (referred to as samples), and 3) training the discovered DNN. Another important metric for NAS is whether it supports non-differentiable search metrics such as hardware metrics (e.g., latency and energy). Incorporating hardware metrics into NAS is the key to improving the performance of the discovered DNNs [1, 2, 3, 4, 5].
There is usually a trade-off between the time spent for the three steps and the support of non-differentiable search metrics. For example, early reinforcement-learning-based NAS methods [6, 7, 2] suffer from the long time for training and evaluating samples. Using a super-network [8, 9, 10, 11, 12, 13, 14, 15, 16] solves this problem, but super-network training is typically time-consuming and becomes the new time bottleneck. The gradient-based methods [17, 18, 19, 20, 3, 21, 22, 23, 24] reduce the time for training a super-network and training and evaluating samples at the cost of sacrificing the support of non-differentiable search metrics. In summary, many existing works either have an unbalanced reduction in the time spent per step (i.e., optimizing some steps at the cost of a significant increase in the time for other steps), which still leads to a long total search time, or are unable to support non-differentiable search metrics, which limits the performance of the discovered DNNs.
In this paper, we propose an efficient NAS algorithm, NetAdaptV2, to significantly reduce the total search time by introducing three innovations to better balance the reduction in the time spent per step while supporting non-differentiable search metrics:
Channel-level bypass connections (mainly reduce the time for training and evaluating samples, Sec. 2.2): Early NAS works only search for DNNs with different numbers of filters (referred to as layer widths). To improve the performance of the discovered DNN, more recent works search for DNNs with different numbers of layers (referred to as network depths) in addition to different layer widths at the cost of training and evaluating more samples because network depths and layer widths are usually considered independently. In NetAdaptV2, we propose channel-level bypass connections to merge network depth and layer width into a single search dimension, which requires only searching for layer width and hence reduces the number of samples.
Ordered dropout (mainly reduces the time for training a super-network, Sec. 2.3): We adopt the idea of super-network to reduce the time for training and evaluating samples. In previous works, each DNN in the search space requires one forward-backward pass to train. As a result, training multiple DNNs in the search space requires multiple forward-backward passes, which results in a long training time. To address the problem, we propose ordered dropout to jointly train multiple DNNs in a single forward-backward pass, which decreases the required number of forward-backward passes for a given number of DNNs and hence the time for training a super-network.
Multi-layer coordinate descent optimizer (mainly reduces the time for training and evaluating samples and supports non-differentiable search metrics, Sec. 2.4): NetAdaptV1 [1] and MobileNetV3 [25], which utilizes NetAdaptV1, have demonstrated the effectiveness of the single-layer coordinate descent (SCD) optimizer [26] in discovering high-performance DNN architectures. The SCD optimizer supports both differentiable and non-differentiable search metrics and has only a few interpretable hyper-parameters that need to be tuned, such as the per-iteration resource reduction. However, there are two shortcomings of the SCD optimizer. First, it only considers one layer per optimization iteration. Failing to consider the joint effect of multiple layers may lead to a worse decision and hence sub-optimal performance. Second, the per-iteration resource reduction (e.g., latency reduction) is limited by the layer with the smallest resource consumption (e.g., latency). It may take a large number of iterations to search for a very deep network because the per-iteration resource reduction is relatively small compared with the network resource consumption. To address these shortcomings, we propose the multi-layer coordinate descent (MCD) optimizer that considers multiple layers per optimization iteration to improve performance while reducing search time and preserving the support of non-differentiable search metrics.
Fig. 1 (and Table 1) compares NetAdaptV2 with related works. NetAdaptV2 can reduce the search time by up to 5.8× and 2.4× on ImageNet [27] and NYU Depth V2 [28] respectively and discover DNNs with better performance than state-of-the-art NAS works. Moreover, compared to NAS-discovered MobileNetV3 [25], the discovered DNN has 1.8% higher accuracy with the same latency.
Methodology: NetAdaptV2
Algorithm Overview
NetAdaptV2 searches for DNNs with different network depths, layer widths, and kernel sizes. The proposed channel-level bypass connections (CBCs, Sec. 2.2) enable NetAdaptV2 to discover DNNs with different network depths and layer widths by only searching layer widths because different network depths become the natural result of setting the widths of some layers to zero. To search kernel sizes, NetAdaptV2 uses the superkernel method [21, 22, 12].
Fig. 2 illustrates the algorithm flow of NetAdaptV2. It takes an initial network and uses its sub-networks, which can be obtained by shrinking some layers in the initial network, to construct the search space. In other words, a sample in NetAdaptV2 is a sub-network of the initial network. Because the optimizer needs the accuracy of samples to compare their performance, the samples need to be trained. NetAdaptV2 adopts the concept of jointly training all sub-networks with shared weights by training a super-network, which has the same architecture as the initial network and contains these shared weights. We use CBCs, the proposed ordered dropout (Sec. 2.3), and the superkernel [21, 22, 12] to efficiently train the super-network that contains sub-networks with different layer widths, network depths, and kernel sizes. After training the super-network, the proposed multi-layer coordinate descent optimizer (Sec. 2.4) is used to discover DNN architectures with optimal performance. The optimizer iteratively samples the search space to generate a set of samples and determines the next set of samples based on the performance of the current ones. This process continues until the given stop criteria are met (e.g., the latency is smaller than 30 ms), and the discovered DNN is then trained until convergence. Because of the trained super-network, the accuracy of samples can be directly evaluated by using the shared weights without any further training.
Figure 2: The algorithm flow of the proposed NetAdaptV2.
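To make the flow above concrete, the following pseudocode-style sketch summarizes the three stages; every function name (`train_supernet`, `shrink_randomly`, `accuracy_with_shared_weights`, `latency`, `pick_best_tradeoff`, `train_until_convergence`) is an illustrative placeholder of ours, not an actual NetAdaptV2 API.

```python
# Pseudocode-style sketch of the NetAdaptV2 flow (placeholder functions).

def netadaptv2_search(initial_network, stop_criterion, samples_per_iteration):
    # 1) Train the super-network once; its shared weights cover all
    #    sub-networks (different layer widths, network depths, kernel sizes).
    supernet = train_supernet(initial_network)

    best = supernet  # start from the full network and gradually shrink it
    while not stop_criterion(best):  # e.g., latency still above 30 ms
        # 2) Sample sub-networks by shrinking the current best one; their
        #    accuracy is read off directly from the shared weights, so no
        #    per-sample training is needed.
        samples = [shrink_randomly(best) for _ in range(samples_per_iteration)]
        scores = [(accuracy_with_shared_weights(s, supernet), latency(s))
                  for s in samples]
        best = pick_best_tradeoff(samples, scores)

    # 3) Train only the discovered architecture until convergence.
    return train_until_convergence(best)
```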
Channel-Level Bypass Connections
Previous NAS algorithms generally treat network depth and layer width as two different search dimensions. The reason is evident in the following example. If we remove a filter from a layer, we reduce the number of output channels by one. As a result, if we remove all the filters, there are no output channels for the next layer, which breaks the DNN into two disconnected parts. Hence, reducing layer widths typically cannot be used to reduce network depths. To address this, we need an approach that keeps the network connectivity while removing filters; this is achieved by our proposed channel-level bypass connections (CBCs). The high-level concept of CBCs is: when a filter is removed, an input channel is bypassed to maintain the same number of output channels. In this way, we can preserve the network connectivity when all filters are removed from a layer. Assume the target layer in the initial network has $C$ input channels, $T$ filters, and $Z$ output channels (if we do not use CBCs, $Z$ is equal to $T$), and we gradually remove filters from the layer so that $M$ filters remain. Fig. 3 illustrates how CBCs handle three cases in this process based on the relationship between the number of input channels ($C$) and the initial number of filters ($T$) (only $M$ changes; $C$ and $T$ are fixed):

- Case 1, $C = T$: When the $i$-th filter is removed, we bypass the $i$-th input channel, so the number of output channels ($Z$) stays the same. When all the filters are removed ($M = 0$), all the input channels are bypassed, which is the same as removing the layer.
- Case 2, $C < T$: We do not bypass input channels at the beginning of filter removal because there are more filters than input channels (i.e., $M > C$) and there are no corresponding input channels to bypass. The bypass process starts once there are fewer filters than input channels ($M < C$), which reduces to case 1.
- Case 3, $C > T$: When the $i$-th filter is removed, we bypass the $i$-th input channel. The extra ($C - T$) input channels are not used for the bypass.
Figure 3: An illustration of how CBCs handle different cases based on the relationship between the number of input channels ($C$) and the initial number of filters ($T$) (only the number of remaining filters ($M$) changes; $C$ and $T$ are fixed). For each case, it shows how the architecture changes as more filters are removed, from top to bottom. The numbers above lines correspond to the letters below lines. Please note that the number of output channels ($Z$) never becomes zero.
These three cases can be summarized in a single rule: when the $i$-th filter is removed, the corresponding $i$-th input channel is bypassed if that input channel exists. Therefore, the number of output channels ($Z$) when using CBCs can be computed by $Z = \max(\min(C, T), M)$. The proposed CBCs can be efficiently trained when combined with the proposed ordered dropout, as discussed in Sec. 2.3. As a more advanced usage of $T$, we can treat $T$ as a hyper-parameter. Please note that we only change $M$; $C$ and $T$ are fixed. From the formulation $Z = \max(\min(C, T), M)$, we can observe that the function of $T$ is to limit the number of bypassed input channels and hence the minimum number of output channels ($Z$). If we set $T \geq C$ to allow all $C$ input channels to be bypassed, the formulation becomes $Z = \max(C, M)$, and the minimum number of output channels is $C$. If we set $T < C$ to only allow $T$ input channels to be bypassed, the formulation becomes $Z = \max(T, M)$, and the minimum number of output channels is $T$. Setting $T < C$ enables generating a bottleneck, where we have fewer output channels than input channels ($Z < C$). The bottleneck has been shown to be effective in improving the accuracy-efficiency (e.g., accuracy-latency) trade-offs in MobileNetV2/V3. Take case 1 as an example: in Fig. 3, the number of output channels is always 4, which is the same as the number of input channels ($Z = C = 4$), no matter how many filters are removed. Therefore, the bottleneck cannot be generated. In contrast, if we set $T$ to 2 (case 4 in Fig. 3), no input channels are bypassed until we remove the first two filters because $Z = \max(\min(4, 2), 2) = 2$. After that, it becomes case 3, which forms a bottleneck.
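As a quick sanity check of the rule $Z = \max(\min(C, T), M)$, the short snippet below (our own illustration, not from the paper) reproduces the numbers for case 1 and for the bottleneck-enabling setting $T < C$ discussed above.

```python
# Numeric check of the CBC output-channel rule Z = max(min(C, T), M),
# using the example values from the text (C, T, M as defined above).

def cbc_output_channels(C, T, M):
    """Number of output channels of a CBC layer when M filters remain."""
    return max(min(C, T), M)

# Case 1 (C = T = 4): Z stays 4 regardless of how many filters remain,
# so no bottleneck can form.
print([cbc_output_channels(4, 4, m) for m in range(4, -1, -1)])  # [4, 4, 4, 4, 4]

# Setting T = 2 < C = 4 (case 4 in the text): once only 2 filters remain,
# Z drops to 2 and stays there, forming a bottleneck (Z < C).
print([cbc_output_channels(4, 2, m) for m in range(4, -1, -1)])  # [4, 3, 2, 2, 2]
```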
Ordered Dropout
Training the super-network involves joint training of multiple sub-networks with shared weights. After the super-network is trained, comparing sub-networks of the super-network (i.e., samples) only requires their relative accuracy (e.g., sub-network A has higher accuracy than sub-network B). Generally speaking, the more sub-networks are trained, the better the relative accuracy of sub-networks will be. However, previous works usually require one forward-backward pass for training one sub-network. As a result, training more sub-networks requires more forward-backward passes and hence increases the training time.
To address this problem, we propose ordered dropout (OD) to enable training $N$ sub-networks in a single forward-backward pass with a batch of $N$ images. OD is inserted after each convolutional layer in the super-network and zeros out different output channels for different images in a batch. As shown in Fig. 4, OD simulates different layer widths with a constant number of output channels. Unlike standard dropout [30], which zeros out a random subset of channels regardless of their positions, OD always zeros out the last channels to simulate removing the last filters. As a result, while sampling the search space, we can simply drop the last filters from the super-network to evaluate samples without extra operations such as sorting, and avoid a mismatch between training and evaluation.
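The following is a minimal sketch of the OD masking behavior, assuming a PyTorch-style NCHW activation layout; it is our own illustration, not the authors' implementation, and the sampling of per-image widths is left to the caller.

```python
import torch

# Ordered dropout (OD) sketch: for each image in the batch, keep the first
# `w` channels and zero out the *last* channels, so different images in one
# forward-backward pass train sub-networks of different layer widths.

def ordered_dropout(x, widths):
    """x: activations of shape (N, C, H, W); widths: kept channels per image."""
    n, c = x.shape[0], x.shape[1]
    mask = torch.zeros(n, c, 1, 1, dtype=x.dtype, device=x.device)
    for i, w in enumerate(widths):
        mask[i, :w] = 1.0  # keep the first w channels, zero the rest
    return x * mask

# Example: a batch of 2 images after a conv layer with 4 output channels;
# image 0 simulates the full width of 4, image 1 a width of 2.
x = torch.randn(2, 4, 8, 8)
out = ordered_dropout(x, widths=[4, 2])
print(out[1, 2:].abs().sum())  # tensor(0.) -- last two channels zeroed
```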
Figure 4: An illustration of how NetAdaptV2 uses the proposed ordered dropout to train two different sub-networks in a single forward-backward pass. The ordered dropout is inserted after each convolutional layer to simulate different layer widths by zeroing out some channels of activations. Note that all the sub-networks share the same set of weights.
When combined with the proposed CBCs, OD can train sub-networks with different network depths by zeroing out all output channels of some layers to simulate layer removal. As shown in Fig. 5, to simulate CBCs, there is another OD in the bypass path (upper) during training, which zeros out the complement of the set of channels zeroed by the OD in the corresponding convolution path (lower).
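A simplified sketch of this combined behavior is given below for the $C = T$ case: the convolution path keeps its first $M$ output channels (as in OD), and the bypass path carries the complementary input channels. The function name `cbc_layer` and the fixed `m` are our own simplifications; during training, the kept width would vary per image as in ordered dropout.

```python
import torch

# Sketch (our illustration, not the authors' code) of simulating a CBC layer
# with C = T during super-network training: conv path provides the first m
# output channels, the bypass path provides the complementary input channels.

def cbc_layer(x, conv, m):
    """x: (N, C, H, W) input; conv: a C->C convolution; m: filters kept."""
    y = conv(x)
    c = x.shape[1]
    keep = torch.zeros(1, c, 1, 1, dtype=x.dtype, device=x.device)
    keep[:, :m] = 1.0
    # Convolution path contributes channels [0, m); bypass contributes [m, C).
    return y * keep + x * (1.0 - keep)

conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
x = torch.randn(2, 4, 8, 8)
out = cbc_layer(x, conv, m=2)             # half conv output, half bypassed input
print(torch.equal(out[:, 2:], x[:, 2:]))  # True -- last channels are bypassed
```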
Figure 5: An illustration of how NetAdaptV2 uses the proposed channel-level bypass connections and ordered dropout to train a super-network that supports searching different layer widths and network depths.
Because NAS only requires the relative accuracy of samples, we can decrease the number of training iterations to further reduce the super-network training time. Moreover, for each layer, we sample each layer width almost the same number of times in a forward-backward pass to avoid biasing towards any specific layer widths.
Multi-Layer Coordinate Descent Optimizer
The single-layer coordinate descent (SCD) optimizer, used in NetAdaptV1, is a simple-yet-effective optimizer with advantages such as supporting both differentiable and non-differentiable search metrics and having only a few interpretable hyper-parameters that need to be tuned. The SCD optimizer runs an iterative optimization. It starts from the super-network and gradually reduces its latency (or other search metrics such as multiply-accumulate operations and energy). In each iteration, the SCD optimizer generates $K$ samples if the super-network has $K$ layers. The $k$-th sample is generated by shrinking (e.g., removing filters from) the $k$-th layer in the best sample from the previous iteration to reduce its latency by a given amount. This amount is referred to as the per-iteration resource reduction and may change from one iteration to another. Then, the sample with the best performance (e.g., accuracy-latency trade-off) is chosen and used for the next iteration. The optimization terminates when the target latency is met, and the sample with the best performance in the last iteration is the discovered DNN.

The shortcoming of the SCD optimizer is that it generates samples by shrinking only one layer per iteration. This property causes two problems. First, it does not consider the interplay of multiple layers when generating samples in an iteration, which may lead to sub-optimal performance of discovered DNNs. Second, it may take many iterations to search for very deep networks because the layer with the lowest latency limits the maximum value of the per-iteration resource reduction; the lowest latency of a layer becomes small when the super-network is deep.

To address these problems, we propose the multi-layer coordinate descent (MCD) optimizer. It generates $J$ samples per iteration, where each sample is obtained by randomly shrinking $L$ layers from the previous best sample. In NetAdaptV2, shrinking a layer involves removing filters, reducing the kernel size, or both. Compared with the SCD optimizer, the MCD optimizer considers the interplay of $L$ layers in each iteration so that the performance of the discovered DNN can be improved. Moreover, it enables using a larger per-iteration resource reduction (i.e., up to the total latency of $L$ layers) to reduce the number of iterations and hence the search time.
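The sketch below illustrates the MCD iteration structure; `shrink_layer`, `evaluate_accuracy`, and `measure_latency` are hypothetical hooks of ours that would query the trained super-network and a latency model, and the selection criterion shown is only a simple placeholder for the optimizer's accuracy-latency trade-off.

```python
import random

# Sketch of the multi-layer coordinate descent (MCD) optimizer. The hooks
# `shrink_layer`, `evaluate_accuracy`, and `measure_latency` are hypothetical
# stand-ins for querying the trained super-network and the latency model.

def mcd_search(best_arch, target_latency, shrink_layer, evaluate_accuracy,
               measure_latency, J=200, L=10):
    # best_arch: mapping layer name -> layer configuration
    while measure_latency(best_arch) > target_latency:
        candidates = []
        for _ in range(J):
            arch = dict(best_arch)
            # Shrink L randomly chosen layers jointly (remove filters and/or
            # reduce kernel size), so their interplay is considered within a
            # single optimization iteration.
            for name in random.sample(list(arch), min(L, len(arch))):
                arch[name] = shrink_layer(arch[name])
            candidates.append(arch)
        # Keep the candidate with the best accuracy-latency trade-off; both
        # metrics are only compared, so they may be non-differentiable.
        best_arch = max(candidates, key=lambda a: (evaluate_accuracy(a),
                                                   -measure_latency(a)))
    return best_arch
```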
Related Works
Reinforcement-learning-based methods [6, 7, 2, 4, 31] demonstrate the ability of neural architecture search to design high-performance DNNs. However, their search time is longer than that of the following works due to the long time required to train each sample individually. Gradient-based methods [17, 18, 19, 3, 21, 22, 23, 24] successfully discover high-performance DNNs with a much shorter search time, but they can only support differentiable search metrics. NetAdaptV1 [1] proposes a single-layer coordinate descent optimizer that can support both differentiable and non-differentiable search metrics and was used to design the state-of-the-art MobileNetV3 [25]. However, shrinking only one layer when generating each sample and the long time required to train each sample individually become its search-time bottlenecks. The idea of super-network training [10, 11, 12], which jointly trains all the sub-networks in the search space, is proposed to reduce the time for training and evaluating samples and training the discovered DNN at the cost of a significant increase in the time for training a super-network. Moreover, network depth and layer width are usually considered separately in related works. The proposed NetAdaptV2 addresses all these problems at the same time by reducing the time for training a super-network, training and evaluating samples, and training the discovered DNN in a balanced way, while supporting non-differentiable search metrics.
The algorithm flow of NetAdaptV2 is most similar to that of NetAdaptV1 [1], as shown in Fig. 2. Compared with NetAdaptV2, NetAdaptV1 does not train a super-network but trains each sample individually. Moreover, NetAdaptV1 considers only one layer per optimization iteration and searches only layer widths, whereas NetAdaptV2 considers multiple layers per optimization iteration and searches layer widths, network depths, and kernel sizes. Therefore, NetAdaptV2 is both faster and more effective than NetAdaptV1, as shown in Sec. 4.1.4 and 4.2.
For the methodology, the proposed ordered dropout is most similar to partial channel connections [24]. However, they differ in purpose and in the ability to expand the search space. Partial channel connections aim to reduce memory consumption while training a DNN with multiple parallel paths by removing some channels. The number of channels removed is constant during training. Moreover, this number is manually chosen. As a result, partial channel connections do not expand the search space. In contrast, the proposed ordered dropout is designed for jointly training multiple sub-networks and expanding the search space. The number of channels removed (i.e., zeroed out) varies from image to image and from one training iteration to another during training to simulate different sub-networks. Moreover, the final number of channels removed (i.e., the discovered architecture) is searched. Therefore, the proposed ordered dropout expands the search space in terms of layer width as well as network depth when the proposed channel-level bypass connections are used.
Experiment Results
We apply NetAdaptV2 to two applications (image classification and depth estimation) with two search metrics (latency and multiply-accumulate operations (MACs)) to demonstrate the effectiveness and versatility of NetAdaptV2 under different operating conditions. We also perform an ablation study to show the impact of each of the proposed techniques and the associated hyper-parameters.
Image Classification
Experiment Setup
We use latency or MACs to guide NetAdaptV2. The latency is measured on a Google Pixel 1 CPU. The search time is reported in GPU-hours and measured on V100 GPUs. The dataset is ImageNet. We reserve 10K images in the training set for comparing the accuracy of samples and train the super-network with the rest of the training images. The accuracy of the discovered DNN is reported on the validation set, which was not seen during the search. The initial network is based on MobileNetV3. The maximum learning rate is 0.064, decayed by a factor of 0.963 every 3 epochs when the batch size is 1024; the learning rate scales linearly with the batch size. The optimizer is RMSProp with an $\ell_2$ weight decay of $10^{-5}$. The dropout rate is 0.2. The decay rate of the exponential moving average is 0.9998. The batch size is 1024 for training the super-network, 2048 for training the latency-guided discovered DNN, and 1536 for training the MAC-guided discovered DNN. The multi-layer coordinate descent (MCD) optimizer generates 200 samples per iteration ($J = 200$). For the latency-guided experiment, each sample is obtained by randomly shrinking 10 layers ($L = 10$) from the best sample in the previous iteration; we reduce the latency by 3% in the first iteration (i.e., the initial resource reduction) and decay the resource reduction by a factor of 0.98 every iteration. For the MAC-guided experiment, each sample is obtained by randomly shrinking 15 layers ($L = 15$) from the best sample in the previous iteration; we reduce the MACs by 2.5% in the first iteration and decay the resource reduction by a factor of 0.98 every iteration. More details are included in the appendix.
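For reference, the ImageNet search hyper-parameters stated above can be gathered into a single configuration; the dictionary keys below are our own naming, not taken from the paper's code.

```python
# ImageNet search hyper-parameters as stated in the text (key names are ours).
imagenet_search_config = {
    "max_learning_rate": 0.064,   # at batch size 1024; scales linearly with it
    "lr_decay": 0.963,            # applied every 3 epochs
    "lr_decay_epochs": 3,
    "optimizer": "RMSProp",
    "l2_weight_decay": 1e-5,
    "dropout_rate": 0.2,
    "ema_decay": 0.9998,
    "batch_size": {"supernet": 1024, "latency_guided_dnn": 2048,
                   "mac_guided_dnn": 1536},
    "mcd": {
        "samples_per_iteration": 200,                           # J
        "layers_shrunk_per_sample": {"latency": 10, "mac": 15},  # L
        "initial_resource_reduction": {"latency": 0.03, "mac": 0.025},
        "resource_reduction_decay": 0.98,                        # per iteration
    },
}
```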
Latency-Guided Search Result
The results of NetAdaptV2 guided by latency and related works are summarized in Table 1. Compared with the state-of-the-art (SOTA) NAS algorithms, NetAdaptV2 reduces the search time by up to $5.8\times$ and discovers DNNs with better accuracy-latency/accuracy-MAC trade-offs. The reduced search time is the result of the much more balanced time spent per step. Compared with the NAS algorithms in the class of hundreds of GPU-hours, ProxylessNAS and Single-Path NAS, NetAdaptV2 outperforms them without sacrificing the support of non-differentiable search metrics. NetAdaptV2 achieves either 2.4% higher accuracy with $1.5\times$ lower latency or 1.4% higher