[论文翻译]穿越拥挤山谷的下降——深度学习优化器基准测试


原文地址:https://arxiv.org/pdf/2007.01547v6


Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers

穿越拥挤山谷的下降——深度学习优化器基准测试

Robin M. Schmidt * 1 Frank Schneider * 1 Philipp Hennig 1

Robin M. Schmidt * 1 Frank Schneider * 1 Philipp Hennig 1

Abstract

摘要

1. Introduction

1. 引言

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: ADAM remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results1 are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

选择优化器被认为是深度学习中最关键的设计决策之一,且并非易事。当前不断增长的文献已列出数百种优化方法。在缺乏明确理论指导和决定性实证证据的情况下,决策往往基于经验之谈。本研究中,我们试图用证据支持的启发式方法(即便不是决定性排名)来替代这些经验性结论。为此,我们对15种特别流行的深度学习优化器进行了广泛、标准化的基准测试,同时对各类可选方案进行了简明概述。通过分析超过50,000次独立运行实验,我们得出以下三点结论:(i) 优化器性能在不同任务间差异显著;(ii) 使用默认参数评估多个优化器的效果,与调优单个固定优化器的超参数效果相当;(iii) 虽然未发现某种优化方法在所有测试任务中明显占优,但我们确定了一个显著精简的优化器子集及参数组合,这些选择在实验中普遍能产生有竞争力的结果:ADAM仍是强有力的竞争者,新方法未能显著且持续地超越它。我们的开源成果1可作为经过严格调优的基准,用于更有效评估新型优化方法,且无需额外计算开销。

Large-scale stochastic optimization drives a wide variety of machine learning tasks. Because choosing the right optimization method and effectively tuning its hyperparameters heavily influences the training speed and final performance of the learned model, it is an important, everyday challenge to practitioners. It is probably the task that requires the most time and resources in many applications. Hence, stochastic optimization has been a focal point of research, engendering an ever-growing list of methods (cf. Figure 1), many of them targeted at deep learning. The hypothetical machine learning practitioner who is able to keep up with the literature now has the choice among hundreds of methods (see Table 2 in the appendix), each with their own set of tunable hyperparameters, when deciding how to train a model.

大规模随机优化驱动着各类机器学习任务。由于选择合适的优化方法并有效调整其超参数会极大影响模型的训练速度和最终性能,这对从业者而言是日常面临的重要挑战。在许多应用中,这可能是最耗费时间和资源的任务。因此,随机优化一直是研究焦点,催生了数量不断增长的方法(见图1),其中许多专门针对深度学习。如今,假设某位机器学习从业者能跟上文献进展,在决定如何训练模型时,他将面临数百种方法的选择(见附录表2),每种方法都有各自可调的超参数集。

There is limited theoretical analysis that clearly favors one of these choices over the others. Some authors have offered empirical comparisons on comparably small sets of popular methods (e.g. Wilson et al., 2017; Choi et al., 2019; Sivaprasad et al., 2020); but for most optimizers, the only empirical evaluation is offered by the original work introducing the method. Many practitioners and researchers, meanwhile, rely on personal and anecdotal experience, and informal discussion with colleagues or on social media. The result is an often unclear, ever-changing “state of the art” occasionally driven by hype. The key obstacle for an objective benchmark is the combinatorial cost of such an endeavor posed by comparing a large number of methods on a large number of problems, with the high resource and time cost of tuning each method’s parameters and repeating each (stochastic) experiment multiple times for fidelity.

目前有限的理论分析尚无法明确支持某种选择优于其他方案。部分研究者对少量主流方法进行了实证比较 (如 Wilson 等人, 2017; Choi 等人, 2019; Sivaprasad 等人, 2020) ,但大多数优化器的实证评估仅见于其原始提出文献。与此同时,许多从业者和研究者依赖个人经验、轶事证据以及与同行或社交媒体的非正式讨论。这导致所谓的"最佳实践"往往模糊不清且频繁变化,有时甚至受到炒作影响。构建客观基准的核心障碍在于其组合爆炸成本:需要在大量问题上测试大量方法,同时为保障结果可靠性,还需耗费大量资源和时间进行参数调优及(随机性)实验的重复验证。

We conduct a large-scale benchmark of optimizers to ground the ongoing debate about deep learning optimizers on empirical evidence, and to help understand how the choice of optimization methods and hyperparameters influences the training performance. Specifically, we examine whether recently proposed methods show an improved performance compared to more established methods such as SGD or ADAM. Additionally, we assess whether there exist optimization methods with well-working default hyperparameters that are able to keep up with tuned optimizers. To this end, we evaluate fifteen optimization methods, selected for their perceived popularity, on a range of representative deep learning problems (see Figure 4) drawing conclusions from tens of thousands of individual training runs.

我们对优化器进行了大规模基准测试,旨在基于实证依据探讨当前深度学习优化器的争议,并帮助理解优化方法和超参数选择如何影响训练性能。具体而言,我们研究了新近提出的方法是否比SGD或ADAM等成熟方法表现出更好的性能。此外,我们还评估是否存在具有良好默认超参数的优化方法,能够与经过调优的优化器相媲美。为此,我们选取了十五种被认为广泛使用的优化方法,在一系列具有代表性的深度学习问题上进行了评估(见图4),并从数万次独立训练运行中得出结论。


Figure 1: Number of times ArXiv titles and abstracts mention specific optimizers per year. All non-selected optimizers from Table 2 in the appendix are grouped into Other. This figure illustrates not only the expected increase in both methods and mentions, but also that our selection covers the most popular methods. In 2020, the excluded methods accounted for $<4\%$ of the mentions (see Figure 9).

图 1: 每年ArXiv标题和摘要中提及特定优化器的次数。附录表2中未选中的优化器均归类为其他。该图不仅展示了方法和提及次数的预期增长,还表明我们的选择涵盖了最流行的方法。2020年,被排除的方法占比 $<4\%$ (见图9)。

Right up front, we want to state that it is impossible to include all optimizers (see Table 2 in the appendix), and to satisfy any and all expectations readers may have on tuning, initialization, or the choice of problems—not least because everyone has different expectations in this regard. In our personal opinion, what is needed is an empirical comparison by a third party not involved in the original works. As the target audience of our work, we assume a careful practitioner who does not have access to near-limitless resources, nor to a broad range of personal experiences. As such, the core contributions of our work are:

首先需要说明的是,我们不可能涵盖所有优化器(参见附录中的表2),也无法满足读者在调参、初始化或问题选择方面的各种期望——尤其是因为每个人在这方面的期望都不尽相同。在我们看来,真正需要的是由未参与原始研究的第三方进行的实证比较。作为本研究的目标读者群体,我们假设您是一位谨慎的实践者,既没有近乎无限的资源,也没有广泛多样的个人经验。因此,本研究的核心贡献在于:

1. Assessing the progress in deep learning optimization.

1. 评估深度学习优化的进展

A literature review provides a compact but extensive list of recent advances in stochastic optimization. We identify more than a hundred optimization methods (see Table 2 in the appendix) and more than 20 families of hyperparameter schedules (see Table 3 in the appendix) proposed for deep learning. We conduct a large-scale optimizer benchmark, specifically focusing on problems arising in deep learning. We evaluate fifteen optimizers on eight deep learning problems using four different schedules, tuning over dozens of hyperparameter settings. To our knowledge, this is the most comprehensive empirical evaluation of deep learning optimizers to date (see Section 1.1 on related work).

文献综述提供了随机优化领域最新进展的简明而全面的清单。我们识别出超过100种优化方法(见附录中的表2)和20多类超参数调度方案(见附录中的表3),这些方法均针对深度学习提出。我们开展了大规模优化器基准测试,特别关注深度学习中出现的问题,使用四种不同调度方案在八个深度学习问题上评估了十五种优化器,并对数十种超参数设置进行了调优。据我们所知,这是迄今为止对深度学习优化器最全面的实证评估(相关研究参见第1.1节)。

  2. Insights from more than 50,000 optimization runs. Our empirical experiments indicate that an optimizer’s performance highly depends on the problem (see Figure 4). But some high-level trends emerge, too: (1) Evaluating multiple optimizers with default hyperparameters works approximately as well as tuning the hyperparameters for a fixed optimizer. (2) Using an additional untuned learning rate schedule helps on average, but its effect varies greatly depending on the optimizer and the problem. (3) While there is no optimizer that clearly dominates across all tested workloads, some of the methods we tested exhibited highly variable performance. Others demonstrated decent performance consistently. We deliberately abstain from recommending a single one among them, because we could not find a clear winner with statistical confidence.
  2. 来自超过50,000次优化运行的洞察。我们的实验表明,优化器的性能高度依赖于具体问题 (见图4),但仍可总结出一些高层规律:(1) 用默认超参数评估多个优化器的效果,与为固定优化器调整超参数的效果相当。(2) 额外使用未经调优的学习率调度器平均能带来提升,但其效果因优化器和问题差异显著。(3) 虽然没有任何优化器在所有测试任务中占据绝对优势,但我们测试的部分方法表现出极大的性能波动,另一些则始终保持稳定表现。我们刻意避免推荐单一方案,因为无法在统计置信度下确定明确优胜者。
  3. An open-source baseline for future optimizer benchmarks and meta-learning approaches. Our results are available in an open and easily accessible form (see Footnote 1 on Page 1). This data set contains 53,760 unique runs, each consisting of thousands of individual data points, such as the mini-batch training losses of every iteration or epoch-wise performance measures, for example, the loss on the full validation set or the test set accuracy. These results can be used as competitive and well-tuned baselines for future benchmarks of new optimizers, drastically reducing the amount of computational budget required for a meaningful optimizer comparison. This collection of training curves could also be used for meta-learning novel optimization methods, hyperparameter search strategies, or hyperparameter adaptation strategies. To encourage researchers to contribute to this collection, we made our baselines easily expandable. 1
  3. 为未来优化器基准测试和元学习方法提供的开源基线。我们的结果以开放且易于获取的形式提供 (参见第1页脚注1)。该数据集包含53,760次独立实验,每次实验包含数千个数据点,例如每次迭代的小批量训练损失或基于周期的性能指标 (如完整验证集损失或测试集准确率)。这些结果可作为未来新优化器基准测试的竞争性强且经过充分调优的基线,大幅减少有意义的优化器比较所需的计算资源。这些训练曲线集合还可用于元学习新型优化方法、超参数搜索策略或超参数自适应策略。为鼓励研究者贡献数据,我们设计了可轻松扩展的基线架构。

The high-level result of our benchmark is, perhaps expectedly, not a clear winner. Instead, our comparison shows that, while some optimizers are frequently decent, they also generally perform similarly, often switching their positions in the ranking. This result is reminiscent, albeit not formally a rigorous consequence, of the No Free Lunch Theorem (Wolpert & Macready, 1997). A key insight of our comparison is that a practitioner with a new deep learning task can expect to do about equally well by taking almost any method from our benchmark and tuning it, as they would by investing the same computational resources into running a set of optimizers with their default settings and picking the winner.

我们基准测试的高层次结果或许在意料之中,并未出现明确的赢家。相反,我们的比较表明,虽然某些优化器经常表现尚可,但它们通常表现相似,在排名中经常互换位置。这一结果让人联想到(尽管并非严格证明)"没有免费午餐定理" (No Free Lunch Theorem) [Wolpert & Macready, 1997]。我们对比得出的一个关键见解是:面对新的深度学习任务时,从业者采用基准测试中几乎任何方法并进行调优的效果,与投入相同计算资源运行一组默认设置的优化器并选择最佳方案的效果大致相当。

Possibly the most important takeaway from our comparison is that “there are now enough optimizers”. Methods research in stochastic optimization should focus on significant (conceptual, functional, performance) improvements—such as methods specifically suited for certain problem types, inner-loop parameter tuning or structurally novel methods. We make this claim not to discourage research but, quite on the contrary, to offer a motivation for more meaningful, non-incremental research.

我们对比研究中最重要的启示或许是"现有的优化器已经足够多"。随机优化方法的研究应聚焦于实现显著(概念性、功能性、性能)的改进——例如专为特定问题类型设计的方法、内循环参数调优或结构新颖的方法。我们提出这个观点并非要阻碍研究,恰恰相反,是为了激励更有价值、非渐进式的研究。

1.1. Related work

1.1. 相关工作

Following the rapid increase in publications on optimizers, benchmarking these methods for deep learning applications has only recently attracted significant interest. Schneider et al. (2019) introduced a benchmarking framework called DEEPOBS, which includes a wide range of realistic deep learning problems together with standardized procedures for evaluating optimizers. Metz et al. (2020) presented TASKSET, another collection of optimization problems focusing on smaller but more numerous problems. For the empirical analysis presented here, we use DEEPOBS as it provides optimization problems closer to real-world deep learning tasks. In contrast to our evaluation of existing methods, TASKSET and its analysis focuses on meta-learning new optimizers or hyperparameters.

随着优化器相关论文数量的快速增长,针对深度学习应用的基准测试方法直到最近才引起广泛关注。Schneider等人(2019)提出了名为DEEPOBS的基准测试框架,该框架包含大量真实的深度学习问题以及标准化的优化器评估流程。Metz等人(2020)则推出了TASKSET,这是另一套专注于规模较小但数量更多的优化问题集合。在本文的实证分析中,我们选用DEEPOBS作为测试平台,因为它提供的优化问题更接近实际深度学习任务。与我们对现有方法的评估不同,TASKSET及其分析主要聚焦于元学习新型优化器或超参数。

Both Choi et al. (2019) and Sivaprasad et al. (2020) analyzed specific aspects of the benchmarking process. Sivaprasad et al. (2020) used DEEPOBS to illustrate that the relative performance of an optimizer depends significantly on the used hyperparameter tuning budget. The analysis by Choi et al. (2019) supports this point, stating that “the hyperparameter search space may be the single most important factor explaining the rankings”. They further stress a hierarchy among optimizers, demonstrating that, given sufficient hyperparameter tuning, more general optimizers can never be outperformed by special cases. In their study, however, they manually defined a hyperparameter search space per optimizer and problem, basing it either on previously published results, prior experience, or pre-tuning trials.

Choi等人 (2019) 和Sivaprasad等人 (2020) 分析了基准测试过程的特定方面。Sivaprasad等人 (2020) 使用DEEPOBS说明优化器的相对性能在很大程度上取决于所使用的超参数调整预算。Choi等人 (2019) 的分析支持了这一观点,指出"超参数搜索空间可能是解释排名的最重要因素"。他们进一步强调了优化器之间的层次结构,证明在给定足够的超参数调整的情况下,更通用的优化器永远不会被特殊情况超越。然而,在他们的研究中,他们根据先前发表的结果、先前的经验或预调整试验,手动为每个优化器和问题定义了超参数搜索空间。

Here, we instead aim to identify well-performing general-purpose optimizers for deep learning, especially when there is no prior knowledge about well-working hyperparameter values for each specific problem. We further elaborate on the influence of our chosen hyperparameter search strategy in Section 4, where we discuss the limitations of our empirical study.

在此,我们的目标是找出适用于深度学习的通用高性能优化器,尤其是在缺乏针对每个具体问题的有效超参数先验知识时。我们将在第4节详细讨论所选超参数搜索策略的影响,并分析本实证研究的局限性。

Our work is also related to empirical generalization studies of adaptive methods, such as that of Wilson et al. (2017), which sparked an extensive discussion of whether adaptive methods (e.g. ADAM) tend to generalize worse than standard first-order methods (i.e. SGD). By focusing on and reporting the test set accuracy, we implicitly include the generalization capabilities of different optimizers in our benchmark results, an important characteristic of deep learning optimization.

我们的工作也与自适应方法的实证泛化研究相关,例如 Wilson 等人 (2017) 的研究引发了关于自适应方法 (如 ADAM) 是否比标准一阶方法 (即 SGD) 泛化能力更差的广泛讨论。通过关注并报告测试集准确率,我们隐式地将不同优化器的泛化能力纳入基准结果中,这是深度学习优化的重要特性。

2. Benchmarking process

2. 基准测试流程

Any benchmarking effort requires tricky decisions on the experimental setup that influence the results. Evaluating on a specific task or picking a certain tuning budget may favor or disadvantage certain methods (Sivaprasad et al., 2020). It is impossible to avoid these decisions or to cover all possible choices. Aiming for generality, we evaluate the performance on eight diverse real-world deep learning problems from different disciplines (Section 2.1). From a collection of more than a hundred deep learning optimizers (Table 2 in the appendix) we select fifteen of the most popular choices (see Figure 1) for this benchmark (Section 2.2). For each problem and optimizer we evaluate all possible combinations of four different tuning budgets (Section 2.3) and four selected learning rate schedules (Section 2.4), covering the following combinatorial space:

任何基准测试工作都需要在实验设置上做出影响结果的关键决策。针对特定任务进行评估或选择特定的调优预算可能会偏袒或不利于某些方法 (Sivaprasad et al., 2020)。我们无法避免这些决策,也无法覆盖所有可能的选择。为了追求普适性,我们在八个来自不同领域的多样化现实世界深度学习问题上评估性能 (第2.1节)。从包含百余种深度学习优化器的集合 (附录中的表2) 中,我们为本次基准测试选取了十五种最受欢迎的选择 (见图1) (第2.2节)。针对每个问题和优化器,我们评估了四种不同调优预算 (第2.3节) 与四种选定学习率调度方案 (第2.4节) 的所有可能组合,覆盖以下组合空间。

Combining those options results in 1,920 configurations, where each of the fifteen optimizers is evaluated in 128 settings (i.e. on eight problems, with four budgets and four schedules). Including hyperparameter search and estimating the confidence interval, our main benchmark consists of 53,760 unique training curves.

结合这些选项后共产生1,920种配置方案,其中15种优化器各在128种设定下接受评估(即在8个问题场景中,搭配4种预算方案与4种调度方案)。包含超参数搜索及置信区间估算环节后,我们的核心基准测试共涵盖53,760条独立训练曲线。
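To make the size of this grid concrete, here is a minimal sketch that enumerates the configuration space described above; optimizer, problem, and schedule names other than those stated in the text are placeholders.

```python
import itertools

# The benchmark grid described above: 15 optimizers x 8 problems
# x 4 tuning budgets x 4 learning rate schedules = 1,920 configurations.
optimizers = [f"optimizer_{i}" for i in range(1, 16)]   # e.g. ADAM, SGD, NAG, ...
problems = [f"P{i}" for i in range(1, 9)]               # P1-P8 from Table 1
budgets = ["one-shot", "small", "medium", "large"]
schedules = ["constant", "schedule_2", "schedule_3", "schedule_4"]  # placeholder names

configurations = list(itertools.product(optimizers, problems, budgets, schedules))
print(len(configurations))                      # 1920
print(len(configurations) // len(optimizers))   # 128 settings per optimizer
```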

2.1. Problems

2.1. 问题

We consider the eight optimization tasks summarized in Table 1, available as the “small” (P1–P4) and “large” (P5– P8) problem sets in DEEPOBS. A detailed description of these problems, including architectures, training parameters, etc. can be found in the work of Schneider et al. (2019).2 DEEPOBS provides several performance metrics, including the training and test loss, and the validation accuracy. While these are all relevant, any comparative evaluation of optimizers requires picking only a few, if not just one particular performance metric. For our analysis (Section 3), we focus on the final test accuracy (or the final test loss, if accuracy is not defined for this problem). This metric captures the optimizer’s ability to generalize and is thus highly relevant for practical use. Our publicly released results include all metrics for completeness. An example of training loss performance is shown in Figure 17 in the appendix. Accordingly, the tuning (Section 2.3) is done with respect to the validation metric. We discuss possible limitations resulting from these choices in Section 4.

我们考虑表1中总结的八项优化任务,这些任务在DEEPOBS中分为"小型"(P1-P4)和"大型"(P5-P8)问题集。Schneider等人(2019)的工作详细描述了这些问题,包括架构、训练参数等。DEEPOBS提供了多个性能指标,包括训练和测试损失以及验证准确率。虽然这些指标都很重要,但对优化器进行比较评估时需要选择少数几个(甚至仅一个)特定性能指标。在我们的分析(第3节)中,我们重点关注最终测试准确率(如果问题未定义准确率,则关注最终测试损失)。该指标反映了优化器的泛化能力,因此与实际应用高度相关。我们公开的结果包含所有指标以确保完整性。附录中的图17展示了训练损失性能的示例。相应地,调优(第2.3节)是针对验证指标进行的。我们将在第4节讨论这些选择可能带来的局限性。
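As a small illustration of this evaluation protocol (tune with respect to the validation metric, report the test metric), consider the following sketch; the run records are hypothetical placeholders, not benchmark data.

```python
# Hypothetical tuning results for one problem: one record per hyperparameter setting.
runs = [
    {"lr": 1e-3, "valid_acc": 0.91, "test_acc": 0.90},
    {"lr": 1e-2, "valid_acc": 0.93, "test_acc": 0.92},
    {"lr": 1e-1, "valid_acc": 0.88, "test_acc": 0.89},
]

# Tuning is performed with respect to the validation metric ...
best = max(runs, key=lambda run: run["valid_acc"])

# ... while the reported benchmark score is the final test performance.
print(best["lr"], best["test_acc"])   # 0.01 0.92
```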

2.2. Optimizer

2.2. 优化器

In Table 2 in the appendix we collect over a hundred optimization methods introduced for or used in deep learning. This list was collected by multiple researchers trying to keep up with the field over recent years. It is thus necessarily incomplete, although it may well represent one of the most exhaustive of such collections. Even this incomplete list, though, contains too many entries for a benchmark with the degrees of freedom collected above. This is a serious problem for research: Even an author of a new optimizer, let alone a practitioner, cannot be expected to compare their work with every possible previous method.

在附录的表2中,我们收集了专为深度学习引入或使用的上百种优化方法。这份清单由多位研究人员近年来持续跟进该领域整理而成,因此虽不完整,却很可能代表了此类收集中最详尽的一批。即便如此,对于包含上述自由度的基准测试而言,这份不完整清单的条目仍显过多。这对研究构成了严峻挑战:即使是新优化器的作者(更不用说实践者),也无法预期其工作能与所有现有方法逐一对比。

Table 1: Summary of problems used in our experiments. Exact model configurations can be found in Schneider et al. (2019).

Data set Model Task Metric Batch size Budget in epochs Approx. run time
P1 Artificial Noisy quadratic Minimization Loss 128 100 < 1 min
P2 MNIST VAE Generative Loss 64 50 10 min
P3 Fashion-MNIST Simple CNN: 2c2d Classification Accuracy 128 100 20 min
P4 CIFAR-10 Simple CNN: 3c3d Classification Accuracy 128 100 35 min
P5 Fashion-MNIST VAE Generative Loss 64 100 20 min
P6 CIFAR-100 All-CNN-C Classification Accuracy 256 350 4 h 00 min
P7 SVHN WideResNet 16-4 Classification Accuracy 128 160 3 h 30 min
P8 War and Peace RNN Character prediction Accuracy 50 200 5 h 30 min

表 1: 实验所用问题汇总。具体模型配置详见 Schneider et al. (2019)

数据集 模型 任务 评估指标 批大小 预算周期数 预估运行时间
P1 人工数据 噪声二次模型 最小化 损失值 128 100 < 1分钟
P2 MNIST VAE 生成式 损失值 64 50 10分钟
P3 Fashion-MNIST 简单CNN: 2c2d 分类 准确率 128 100 20分钟
P4 CIFAR-10 简单CNN: 3c3d 分类 准确率 128 100 35分钟
P5 Fashion-MNIST VAE 生成式 损失值 64 100 20分钟
P6 CIFAR-100 All-CNN-C 分类 准确率 256 350 4小时00分钟
P7 SVHN WideResNet16-4 分类 准确率 128 160 3小时30分钟
P8 War and Peace RNN 字符预测 准确率 50 200 5小时30分钟

We thus select a subset of fifteen optimizers, which we consider to be currently the most popular choices in the community (see Table 4 in the appendix). These do not necessarily reflect the “best” methods, but are either commonly used by practitioners and researchers, or have recently generated attention. Our selection focuses on first-order optimization methods, both because of their prevalence for non-convex optimization problems in deep learning and to simplify the comparison. Whether there is a significant difference between these optimizers, or whether they are inherently redundant, is one of the questions this work investigates.

因此,我们选取了十五种优化器的子集,这些优化器目前被认为是社区中最受欢迎的选择(见附录中的表4)。它们并不一定代表"最佳"方法,但要么被从业者和研究人员广泛使用,要么最近引起了关注。我们的选择集中于一阶优化方法,这既是因为它们在深度学习非凸优化问题中的普遍性,也是为了简化比较。这些优化器之间是否存在显著差异,或者它们本质上是否冗余,是本研究探讨的问题之一。

Our list focuses on optimizers rather than optimization techniques, although the line between the two is admittedly blurry. Techniques such as averaging weights (e.g. Izmailov et al., 2018) or ensemble methods (e.g. Garipov et al., 2018) have been shown to be simple but effective at improving optimization performance. Those techniques, however, can be applied to all methods in our list, similar to regularization techniques, learning rate schedules, or tuning methods. We have, therefore, decided to omit them from Table 2.

我们的列表主要关注优化器而非优化技术,尽管两者之间的界限确实模糊。诸如权重平均 (如 Izmailov 等人 [2018]) 或集成方法 (如 Garipov 等人 [2018]) 等技术已被证明能简单有效地提升优化性能。然而这些方法可应用于我们列表中的所有方法,类似于正则化技术、学习率调度或调参方法。因此我们决定在表 2 中省略它们。

2.3. Tuning

2.3. 调优

Budget: Optimization methods for deep learning regularly expose hyperparameters to the user. The user either relies on the default suggestion, sets them using experience from previous experiments, or uses additional tuning runs to find the best-performing setting. All optimizers in our benchmark have tunable hyperparameters, and we consider four different tuning budgets.

预算: 深度学习的优化方法通常会向用户暴露超参数。用户要么依赖默认建议,要么根据先前实验的经验进行设置,或通过额外调优运行来寻找最佳性能配置。我们基准测试中的所有优化器均包含可调超参数,并考虑了四种不同的调优预算方案。

The first budget consists of just a single run. This one-shot budget uses the default values proposed by the original authors, where available (Table 4 in the appendix lists the default parameters). If an optimizer performs well in this setting, this has great practical value, as it drastically reduces the computational resources required for successful training.

首个预算方案仅包含单次运行。这种一次性预算采用原作者提出的默认值(附录中的表4列出了默认参数)。若优化器在此设置下表现优异,则具有极高的实用价值,因为能大幅降低成功训练所需的计算资源。

The small, medium and large budgets consist of 25, 50, and 75 tuning runs, respectively, where the parameters for each setting are sampled using random search. Tuning runs for the small and medium budgets were sampled using the distributions defined in Table 4. The additional 25 tuning runs of the large budget, however, were sampled using refined bounds: for each combination of optimizer, problem, and learning rate schedule we use the same distribution as before, but restrict the search space so that all hyperparameter configurations of the top-performing $20\%$ of tuning runs from the medium budget are included.

小、中、大预算分别包含25、50和75次调优运行,其中每个设置的参数均采用随机搜索进行采样。小预算和中预算的调优运行采样使用了表4中定义的分布。然而,大预算新增的25次调优运行采用了更精细的边界:对于每种优化器、问题和学习率调度的组合,我们使用与之前相同的分布,但将搜索空间限制为包含中预算中表现最佳的前20%调优运行的所有超参数配置。
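The following sketch shows one way such a search-space refinement could look for a single learning rate parameter; the sampling bounds and the placeholder training function are assumptions for illustration, not the benchmark's actual code.

```python
import math
import random

def sample_lr(lo=1e-4, hi=1.0):
    """One random-search draw of the learning rate from a log-uniform range."""
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def run_training(lr):
    """Placeholder for one tuning run; returns a validation score (higher is better)."""
    return -abs(math.log10(lr) + 2.0) + random.gauss(0.0, 0.1)  # fake optimum near 1e-2

# Medium budget: 50 tuning runs sampled from the broad search space.
medium = [(lr, run_training(lr)) for lr in (sample_lr() for _ in range(50))]

# Refined bounds for the large budget: the smallest interval that contains the
# configurations of the top-performing 20% of medium-budget runs.
top = sorted(medium, key=lambda pair: pair[1], reverse=True)[:len(medium) // 5]
lo, hi = min(lr for lr, _ in top), max(lr for lr, _ in top)

# Large budget: 25 additional runs drawn from the refined search space.
large_extra = [(lr, run_training(lr)) for lr in (sample_lr(lo, hi) for _ in range(25))]
```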

We use a single seed for tuning, but for all configurations repeat the best setting with ten different seeds. This allows us to report standard deviations in addition to means, assessing stability. Our tuning process can sometimes pick “lucky” seeds, which do not perform well when averaging over multiple runs. This is arguably a feature rather than a bug, since it reflects practical reality. If an optimizer is so unstable that ten random seeds are required for tuning (which would render this benchmark practically infeasible), it would be impractical for the end-user as well. Our scoring naturally prefers stable optimizers. Appendices C and D provide further analysis of these cases and the general stability of our benchmark, showing amongst other things that failing seeds occur in less than $0.5\%$ of the tuning runs.

我们使用单一随机种子进行调参,但会对所有配置的最佳设置用十个不同种子重复验证。这使得我们不仅能报告平均值,还能计算标准差以评估稳定性。调参过程有时会选中"幸运种子"——这些种子在多次运行的平均表现中并不理想。这实际上反映了现实情况,可视为特性而非缺陷。若某个优化器稳定性差到需要十个随机种子才能完成调参(这将使基准测试无法实施),那么它对终端用户同样不实用。我们的评分机制天然倾向于稳定的优化器。附录C和D对这些情况以及基准测试的整体稳定性进行了深入分析,数据显示调参过程中失败种子的出现概率低于 $0.5\%$。

Tuning method: We tune parameters by random search without early stopping for the small, medium and large budgets. Random search is a popular choice due to its efficiency over grid search (Bergstra & Bengio, 2012) and its ease of implementation and parallelization compared to Bayesian optimization (further discussed in Section 4). A minor complication of random search is that the sampling distribution affects the performance of the optimizer. The sampling distribution acts as a prior over good parameter settings, and bad priors consequently ruin performance. We followed the valid intervals and intuition provided by the optimizers’ authors for the relevant hyperparameters. The resulting sampling distributions can be found in Table 4 in the appendix. Even though a hyperparameter might have a similar name in different optimization methods (e.g. the learning rate $\alpha$), its appropriate search space can differ. However, without grounded heuristics guiding the practitioner on how the hyperparameters differ between optimizers, the most straightforward approach for any user is to use the same search space. Therefore, in case no prior knowledge was provided in the cited work, we chose similar distributions for similar hyperparameters across different optimizers.

调优方法
我们对小、中、大预算场景采用无早停的随机搜索进行参数调优。随机搜索因其比网格搜索更高效 (Bergstra & Bengio, 2012) ,且相比贝叶斯优化更易于实现和并行化 (详见第4节) 而广受欢迎。随机搜索的一个微小复杂性在于采样分布会影响优化器性能——采样分布相当于对优质参数设置的先验假设,不当的先验会导致性能下降。我们依据优化器作者提供的有效区间和直觉建议来确定相关超参数,具体采样分布见附录表4。

尽管不同优化方法中可能存在名称相似的超参数 (如学习率 $\alpha$ ) ,但其合理搜索范围可能不同。然而,若缺乏关于优化器间超参数差异的可靠启发式指导,用户最直接的做法是采用相同搜索空间。因此,当引用文献未提供先验知识时,我们对不同优化器中相似超参数选择了相近的分布。

What should be considered a hyperparameter? There is a fuzzy boundary between (tunable) hyperparameters and (fixed) design parameters. A recently contentious example is the $\varepsilon$ in adaptive methods like ADAM. It was originally introduced as a safeguard against division by zero, but has recently been re-interpreted as a problem-dependent hyperparameter (see Choi et al. (2019) for a discussion). Under this view, one can actually consider several optimizers called ADAM: from an easy-to-tune but potentially limited $\mathrm{ADAM}_ {\alpha}$, only tuning the learning rate, to the tricky-to-tune but all-powerful $\mathrm{ADAM}_ {\alpha,\beta_{1},\beta_{2},\varepsilon}$, which can approximate SGD in its hyperparameter space. While both share the update rule, we consider them to be different optimizers. For each update rule, we selected one popular choice of tunable parameters, e.g. $\mathrm{ADAM}_ {\alpha,\beta_{1},\beta_{2}}$ (see Table 4).

什么应该被视为超参数?在(可调)超参数和(固定)设计参数之间存在模糊的边界。最近一个有争议的例子是自适应方法(如ADAM)中的$\varepsilon$。它最初被引入是为了防止除以零的情况,但最近被重新解释为与问题相关的超参数(参见Choi等人(2019)的讨论)。在这种观点下,实际上可以考虑几种称为ADAM的优化器:从易于调整但可能受限的$\mathrm{ADAM}_ {\alpha}$(仅调整学习率),到难以调整但功能强大的$\mathbf{ADAM}_ {\alpha,\beta_{1},\beta_{2},\varepsilon}$(可以在其超参数空间中近似SGD)。虽然它们共享更新规则,但我们认为它们是不同的优化器。对于每个更新规则,我们选择了一个流行的可调参数组合,例如$\mathrm{ADAM}_ {\alpha,\beta_{1},\beta_{2}}$(见表4)。
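One way to make this distinction explicit is to treat the set of tunable parameters as part of the optimizer's specification, for example as a search-space description; the concrete bounds below are illustrative and only loosely mirror Table 4.

```python
# The same ADAM update rule, but two different "optimizers" in the sense used here:
# the set of exposed (tunable) hyperparameters is part of the optimizer's definition.
ADAM_ALPHA = {
    "update_rule": "adam",
    "search_space": {"alpha": ("log_uniform", 1e-4, 1.0)},   # tune only the learning rate
    "fixed": {"beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8},
}

ADAM_FULL = {
    "update_rule": "adam",
    "search_space": {
        "alpha": ("log_uniform", 1e-4, 1.0),
        "beta1": ("log_uniform", 0.5, 0.999),
        "beta2": ("log_uniform", 0.8, 0.999),
        "epsilon": ("log_uniform", 1e-10, 1e-2),              # illustrative bounds
    },
    "fixed": {},
}
```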

2.4. Schedules

2.4. 调度

The literature on learning rate schedules is now nearly as extensive as that on optimizers (see Table 3 in the appendix). In theory, schedules can be applied to all hyperparameters of an optimization method, but to keep our configuration space feasible, we only apply schedules to the learning rate, by far the most popular practical choice (Goodfellow et al., 2016; Zhang et al., 2020). We choose four different learning rate schedules, trying to cover all major types of schedules (see Appendix E):

关于学习率调度 (learning rate schedules) 的文献如今几乎与优化器 (optimizers) 一样丰富 (见附录中的表 3)。理论上,调度可以应用于优化方法的所有超参数,但为了使我们的配置空间可行,我们仅对学习率应用调度,这是目前最流行的实际选择 (Goodfellow et al., 2016; Zhang et al., 2020)。我们选择了四种不同的学习率调度,试图涵盖所有主要类型的调度 (见附录 E):
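The concrete schedules are listed in Appendix E. As an illustration of two schedule types referred to later in the text (the constant and the trapezoidal schedule), here is a minimal sketch in which the warm-up and decay fractions are assumptions rather than the benchmark's exact settings.

```python
def constant_schedule(base_lr):
    """The learning rate stays at its (default or tuned) value for the whole run."""
    return lambda epoch, total_epochs: base_lr

def trapezoidal_schedule(base_lr, warmup_frac=0.1, decay_frac=0.2):
    """Linear warm-up, constant plateau, then linear decay to zero."""
    def lr_at(epoch, total_epochs):
        warmup_end = warmup_frac * total_epochs
        decay_start = (1.0 - decay_frac) * total_epochs
        if epoch < warmup_end:
            return base_lr * epoch / warmup_end
        if epoch > decay_start:
            return base_lr * max(0.0, (total_epochs - epoch) / (total_epochs - decay_start))
        return base_lr
    return lr_at

schedule = trapezoidal_schedule(base_lr=0.1)
print([round(schedule(epoch, 100), 3) for epoch in (0, 5, 50, 90, 100)])
# [0.0, 0.05, 0.1, 0.05, 0.0]
```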

3. Results

3. 结果

How well do optimizers work out-of-the-box? By comparing each optimizer’s one-shot results against the tuned versions of all fifteen optimizers, we can construct a $15\times15$ matrix of performance gains. Figure 2 illustrates this on five problems, marking improvements with a positive sign and an orange cell. Detailed plots for all problems are in Figures 10 and 11 in the appendix. For example, the bottom left cell of the largest matrix in Figure 2 shows that AMSBOUND (1) tuned using a small budget performs $2.4\%$ better than SGD (15) with default parameters on this specific problem.

优化器在开箱即用时的表现如何?通过将每个优化器的一次性结果与所有十五种优化器的调优版本进行对比,我们可以构建一个 $15\times15$ 的性能增益矩阵。图 2 展示了五个问题的对比情况,其中正号标记和橙色单元格表示性能提升。所有问题的详细对比图见附录中的图 10 和图 11。例如,图 2 最大矩阵左下角的单元格显示,在该特定问题上,使用小预算调优的 AMSBOUND (1) 比默认参数的 SGD (15) 性能高出 $2.4\%$。
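Such an improvement matrix can be computed directly from mean test accuracies; in the sketch below, `one_shot` and `tuned` are hypothetical per-optimizer results for a single problem, not actual benchmark numbers.

```python
import numpy as np

# Hypothetical mean test accuracies (%) on one problem.
one_shot = {"ADAM": 90.5, "AMSBOUND": 89.7, "SGD": 88.1}   # default parameters
tuned = {"ADAM": 91.0, "AMSBOUND": 90.5, "SGD": 90.3}      # small tuning budget

names = sorted(one_shot)
# Entry [i, j]: gain from switching from untuned optimizer i to tuned optimizer j.
gains = np.array([[tuned[j] - one_shot[i] for j in names] for i in names])
print(names)
print(gains)   # a row of mostly positive entries flags a weak default setting
```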

An orange row in Figure 2 indicates that an optimizer’s default setting is performing badly, since it can be beaten by any well-tuned competitor. We can observe badly performing default settings for MOMENTUM, NAG and SGD, advocating the intuition that non-adaptive optimization methods require more tuning, but also for AMSGRAD and ADADELTA. This is just a statement about the default parameters suggested by the authors or the popular frameworks; well-working default parameters might well exist for those methods. Conversely, a white & blue row signals a well-performing default setting, since even tuned optimizers do not significantly outperform it. ADAM, NADAM and RADAM, as well as AMSBOUND, ADABOUND and ADABELIEF all have white or blue rows on several (but not all!) problems, supporting the rule of thumb that adaptive methods have well-working default parameters. Conversely, orange (or blue) columns highlight optimizers that, when tuned, perform better (or worse) than all untuned optimization methods. We do not observe such columns consistently across tasks. This supports the conclusion that an optimizer’s performance is heavily problem-dependent and that there is no single best optimizer across workloads.

图2中的橙色行表示优化器的默认设置表现不佳,因为它可以被任何调优良好的竞争对手击败。我们可以观察到MOMENTUM、NAG和SGD的默认设置表现不佳,这支持了非自适应优化方法需要更多调优的直觉,但AMSGRAD和ADADELTA也是如此。这只是对作者或流行框架建议的默认参数的说明;这些方法可能存在有效的默认参数。相反,白色和蓝色行表示表现良好的默认设置,因为即使是调优后的优化器也无法显著超越它。ADAM、NADAM和RADAM,以及AMSBOUND、ADABOUND和ADABELIEF在几个(但不是全部!)问题上都有白色或蓝色行,支持了自适应方法具有有效默认参数的经验法则。相反,橙色(或蓝色)列突出显示了调优后表现优于(或劣于)所有未调优优化方法的优化器。我们没有在任务中一致地观察到这样的列。这支持了优化器的性能严重依赖于问题,并且没有一种优化器在所有工作负载中都是最好的结论。

Figures 10 to 13 in the appendix suggest an interesting alternative approach for machine learning practitioners: Instead of picking a single optimizer and tuning its hyperparameters extensively, trying out a few optimizers with default settings and picking the best one yields competitive results with less computational and tuning choice efforts. However, this might not hold for more complicated, structurally different tasks such as GANs (Goodfellow et al., 2014) or Transformer models (Vaswani et al., 2017). The similarity of those two approaches might be due to the fact that optimizers have implicit learning rate schedules (Agarwal et al., 2020) and trying out different optimizers is similar to trying out different (well-tested) schedules.

附录中的图 10 至图 13 为机器学习从业者提出了一个有趣的替代方案:与其选择单一优化器并耗费大量精力调整其超参数 (hyper parameters),不如尝试几种采用默认设置的优化器并从中选出最佳方案,这样能以更少的计算和调参成本获得具有竞争力的结果。不过,对于生成对抗网络 (GANs) [Goodfellow et al., 2014] 或 Transformer 模型 [Vaswani et al., 2017] 等结构更复杂、任务差异更大的场景,这种方法可能并不适用。这两种方案的相似性可能源于优化器本身具有隐式学习率调度机制 [Agarwal et al., 2020],尝试不同优化器本质上类似于尝试不同的(经过充分测试的)调度策略。

How much do tuning and schedules help? We consider the final performance achieved by varying budgets and schedules to quantify the usefulness of tuning and applying parameter-free schedules (Figure 3). While there is no clear trend for any individual setting (gray lines), in the median we observe that increasing the budget improves performance, albeit with diminishing returns. For example, using the medium budget without any schedule leads to a median relative improvement of roughly $3.4\%$ compared to the default parameters (without schedule).

调优和调度的作用有多大?我们通过不同的调优预算与调度方案所达到的最终性能,来量化调优和应用无参数调度的实际效果(图 3)。虽然单一配置(灰色线)未呈现明显趋势,但中位数据显示,增加预算能提升性能,尽管边际效益递减。例如,在中等预算下不使用任何调度时,相比默认参数(无调度)的中位相对提升约为 $3.4\%$。
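The relative-improvement statistic summarized by the orange median line in Figure 3 can be written down compactly; the accuracies below are placeholders.

```python
import statistics

# Hypothetical final test accuracies (%) for one optimizer on several problems:
# baseline = one-shot run (default parameters, no schedule),
# tuned    = medium tuning budget, no schedule.
baseline = [85.0, 60.2, 91.3, 43.0]
tuned = [88.1, 61.5, 93.8, 45.2]

relative_improvement = [100.0 * (t - b) / abs(b) for b, t in zip(baseline, tuned)]
print(statistics.median(relative_improvement))   # median relative improvement in %
```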


Figure 2: The test set performance improvement after switching from any untuned optimizer ( $y$ -axis, one-shot) to any tuned optimizer ( $x$ -axis, small budget) as an average over 10 random seeds for the constant schedule. For example, the bottom left cell of the largest matrix indicates that the tuned version of AMSBOUND (1) reaches a $2.4\%$ higher test accuracy than untuned SGD (15). We discuss the unintuitive occurrence of negative diagonal entries in Appendix G. The colormap is capped at $\pm3$ to improve presentation, although larger values occur.

图 2: 从任意未调优优化器 (y轴,单次) 切换到任意调优优化器 (x轴,小预算) 后测试集性能提升的平均值 (基于10个随机种子的恒定学习率调度)。例如,最大矩阵左下角单元格显示调优版AMSBOUND (1) 比未调优SGD (15) 获得高2.4%的测试准确率。负对角线条目的反直觉现象将在附录G讨论。为提升可视化效果,色标范围限制在±3,但实际存在更大数值。

Applying an untuned schedule improves median performance as well. For example, the large tuning budget coupled with a trapezoidal learning rate schedule leads to a median relative performance improvement of roughly $5.2\%$ compared to the default parameters. However, while these trends hold in the median, their individual effect varies wildly among optimizers and problems, as is apparent from the noisy structure of the individual lines shown in Figure 3.

应用未经调整的调度方案也能提升中位性能。例如,采用大调优预算搭配梯形学习率调度时,相比默认参数,性能中位数相对提升约 $5.2\%$。但这些趋势仅在中位数层面成立,如图 3 中散乱的线条所示,不同优化器和问题间的个体效果差异极大。

Which optimizers work well after tuning? Figure 4 compares the optimizers’ performance across all eight problems. There is no single optimizer that dominates its competitors across all tasks. Nevertheless, some optimizers generally perform well, while others can vary greatly in their behavior, most notably performing poorly on VAEs. Further supporting the hypothesis of previous sections, we note that taking the best out of a small set of untuned optimizers — for example, ADAM and ADABOUND — frequently results in competitive performance. Except for the two VAE problems, the best of those two untuned optimizers generally falls within the distribution of the well-tuned methods. Combining these runs with a tuned version of ADAM (or a variant thereof) provides stable and slightly improved results across many problems in our benchmark. To further increase the performance, our results suggest trying a different optimizer next, such as RMSPROP or NAG. Across multiple budgets and schedules, both optimizers show a consistently good performance on the RNN and ALL-CNN-C model, respectively.

哪些优化器在调优后表现良好?图4比较了所有八个问题中优化器的性能。没有一种优化器能在所有任务中全面超越竞争对手。然而,某些优化器通常表现良好,而其他优化器的行为可能差异很大,尤其是在VAE上表现不佳。进一步支持前几节的假设,我们注意到,从少量未经调优的优化器(例如ADAM和ADABOUND)中选出最佳方案,往往能获得具有竞争力的性能。除了两个VAE问题外,这两种未经调优的优化器的最佳表现通常落在经过良好调优方法的分布范围内。将这些运行结果与调优版的ADAM(或其变体)结合,可以在我们的基准测试中为许多问题提供稳定且略有改进的结果。为了进一步提高性能,我们的结果表明可以尝试其他优化器,例如RMSPROP或NAG。在多种预算和调度方案下,这两种优化器分别在RNN和ALL-CNN-C模型上表现出一致的良好性能。

Nevertheless, achieving (or getting close to) the absolute best performance still requires testing numerous optimizers. Which optimizer wins in the end is problem-dependent: optimizers that achieve top scores on one problem can perform poorly on other tasks. We note in passing that the individual optimizer rankings change when considering e.g. a smaller budget or an additional learning rate schedule (see Figures 14 to 16 in the appendix). However, the overall trends described here are consistent.

然而,要达成(或接近)绝对最佳性能,仍需测试大量优化器。最终哪种优化器胜出取决于具体问题:在某个任务上表现顶尖的优化器,在其他任务中可能表现不佳。我们注意到,当考虑更小的预算或额外的学习率调度时,单个优化器的排名会发生变化(参见附录中的图14至图16)。不过,本文描述的整体趋势保持一致。

The idea that optimizers perform consistently better or worse for specific model architectures or tasks has been regularly theorized and mentioned in the literature. Indeed, our results support this hypothesis, with NAG often beating ADAM on image classification tasks, and RMSPROP being consistently on top for the natural language modeling task (see Tables 6 to 9). Understanding whether and why certain optimizers favor specific problem types presents an interesting research avenue and might lead to more sophisticated optimizers that utilize the problem characteristics.

关于优化器在特定模型架构或任务上表现持续更优或更差的观点,已在文献中多次被理论化提及。我们的实验结果支持这一假设:NAG在图像分类任务中常优于ADAM,而RMSPROP在自然语言建模任务中始终领先(见表6至表9)。探究特定优化器是否及为何偏好某类问题类型,是一个有趣的研究方向,可能催生出能利用问题特征的更复杂优化器。


Figure 3: Lines in gray (—, smoothed by cubic splines for visual guidance only) show the relative improvement for a certain tuning budget and schedule (compared to the one-shot tuning without schedule) for all fifteen optimizers on all eight problems. The median over all lines is plotted in orange (—) with the shaded area indicating the area between the 25th and 75th percentile. With an increased budget and a schedule, one can expect a performance increase on average (orange lines), but not automatically for individual settings (i.e. gray lines can be unaffected or even decrease).

图 3: 灰色线条 (—,仅通过三次样条平滑处理作为视觉参考) 展示了所有十五种优化器在八个问题上,采用特定调优预算和调度方案时相对于单次无调度调优的相对改进。橙色实线 (—) 表示所有线条的中位数,阴影区域表示第25至75百分位之间的范围。增加预算并采用调度方案时,平均性能会有所提升 (橙色线条),但个别设置可能不受影响甚至出现下降 (即灰色线条可能持平或下行)。

4. Limitations

4. 局限性

Any empirical benchmark has constraints and limitations. Here we highlight some of ours and characterize the context within which our results should be considered.

任何实证基准都存在约束和局限性。在此我们重点说明本研究的若干限制,并界定结果适用的考量背景。

Generalization of the results By using the problems from DEEPOBS, which span models and data sets of varying complexity, size, and different domains, we aim for generalization. Our results are, despite our best efforts, reflective of not just these setups, but also of the chosen training parameters, the software framework, and further unavoidable choices. The design of our comparisons aims to be close to what an informed practitioner would encounter for a relatively novel problem in practice. It goes without saying that even a carefully curated range of problems cannot cover all challenges of machine learning or even just deep learning. In particular, our conclusions may not generalize to other workloads such as GANs, reinforcement learning, or applications where e.g. memory usage is crucial.

结果的泛化性
通过使用DEEPOBS中的问题集(涵盖不同复杂度、规模及领域的数据集和模型),我们力求结果的广泛适用性。尽管我们已竭尽全力,但实验结果不仅反映了这些设定,还受限于所选训练参数、软件框架及其他不可避免的选择。我们的对比设计旨在贴近实践中遇到相对新颖问题时,有经验的从业者可能面临的情境。显然,即使精心挑选的问题范围也无法覆盖机器学习(乃至深度学习)的所有挑战。特别是,我们的结论可能无法推广至其他工作负载(如GANs、强化学习)或对内存使用等关键指标敏感的应用场景。

Similarly, our benchmark does not cover more large-scale problems such as ImageNet (Deng et al., 2009) or transformer models (Vaswani et al., 2017). While there is oft-mentioned anecdotal evidence that the characteristics of deep learning problems change for larger models, it would simply be impossible to perform the kind of combinatorial exploration of choices covered in our benchmark, even with significant hardware resources. The inclusion of larger models would require reducing the number of tested optimizers, schedules or tuning methods and would thus shift the focus of the benchmark. Studying whether there are systematic differences between different types of deep learning problems presents an interesting avenue for further research.

同样,我们的基准测试并未涵盖更大规模的问题,例如 ImageNet (Deng et al., 2009) 或 Transformer 模型 (Vaswani et al., 2017)。尽管常有传闻表明深度学习问题的特性会随着模型规模增大而改变,但即使拥有充足的硬件资源,要对基准测试中涉及的选择进行组合式探索也是不现实的。纳入更大模型将需要减少测试的优化器、调度器或调参方法数量,从而改变基准测试的重点。研究不同类型深度学习问题之间是否存在系统性差异,是一个值得深入探索的方向。

We do not consider this study the definitive work on benchmarking deep learning optimizers, but rather an important and significant step in the right direction. While our comparison includes many “dimensions” of deep learning optimization, e.g. by considering different problems, tuning budgets, and learning rate schedules, there are certainly many more. To keep the benchmark feasible, we chose to use the fixed $L_{2}$ regularization and batch size that DEEPOBS suggests for each problem. We also did not include optimization techniques such as weight averaging or ensemble methods as they can be combined with all evaluated optimizers and hence would increase the computational cost further. Future works could study how these techniques interact with different optimization methods. However, to keep our benchmark feasible, we have selected what we believe to be the most important aspects affecting an optimizer comparison. We hope that our study lays the groundwork so that other works can build on it and analyze these questions.

我们并不认为这项研究是深度学习优化器基准测试的最终定论,而只是朝着正确方向迈出的重要一步。虽然我们的比较涵盖了深度学习优化的多个"维度"(例如通过考虑不同问题、调参预算和学习率调度方案),但显然还存在更多维度。为保持基准测试的可行性,我们选择对每个问题采用DEEPOBS建议的固定$L_{2}$正则化和批量大小。我们也没有纳入权重平均或集成方法等优化技术,因为这些技术可与所有被评估的优化器结合使用,会进一步增加计算成本。未来工作可以研究这些技术如何与不同优化方法产生交互。但为确保基准测试的可行性,我们选择了影响优化器比较的最关键因素。希望本研究能奠定基础,供后续工作在此基础上深入分析这些问题。

Influence of the hyperparameter search strategy: As noted by, e.g., Choi et al. (2019) and Sivaprasad et al. (2020), the hyperparameter tuning method, its budget, and its search domain can significantly affect performance. By reporting results from four different hyperparameter optimization budgets (including the tuning-free one-shot setting) we try to quantify the effect of tuning. We argue that our random search process presents a realistic setting for many, but certainly not all, deep learning practitioners. One may criticize our approach as simplistic, but note that more elaborate schemes, in particular Bayesian optimization, would multiply the number of design decisions (kernels, search utilities, priors, etc.) and thus significantly complicate the analysis.

超参数搜索策略的影响
如 Choi 等人 (2019) 和 Sivaprasad 等人 (2020) 所述,超参数调优方法、其预算及搜索域会显著影响性能。通过报告四种不同超参数优化预算(包括免调优的单次设置)的结果,我们试图量化调优的影响。我们认为,随机搜索过程为多数(但非全部)深度学习实践者提供了现实场景。有人可能批评我们的方法过于简单,但需注意更复杂的方案(尤其是贝叶斯优化)会大幅增加设计决策(核函数、搜索工具、先验等)的数量,从而使分析显著复杂化。


Figure 4: Mean test set performance over 10 random seeds of all tested optimizers on all eight optimization problems using the large budget for tuning and no learning rate schedule. One standard deviation for the tuned ADAM optimizer is shown as a red error bar (error bars for other methods are omitted for legibility). The performance of untuned ADAM and ADABOUND is marked for reference. The upper bound of each axis represents the best performance achieved in the benchmark, while the lower bound is chosen in relation to the performance of ADAM with default parameters. A tabular version is available in the appendix as Table 6.

图 4: 所有测试优化器在八种优化问题上使用大预算调参且无学习率调度时的平均测试集性能(10次随机种子均值)。红色误差条显示调参后ADAM优化器的一个标准差(其他方法的误差条为清晰起见省略)。未调参ADAM和ADABOUND的性能作为参考标出。各坐标轴上限代表基准测试中的最佳性能,下限则与默认参数ADAM的性能相关联。表格版本详见附录表6。

The individual hyperparameter sampling distributions significantly affect the relative rankings of the optimizers. A poorly chosen search space can make successful tuning next to impossible. In our benchmark, we use relatively broad initial search spaces, dozens of tuning runs, and a refinement of those search spaces for the large budget. Note, though, that the problem of finding appropriate search spaces is inherited by practitioners. Arguably, it is an implicit flaw of an optimization method if it expects hyperparameter tuning but does not come with well-identified search spaces for those parameters, and this should thus be reflected in a benchmark.

个体超参数采样分布会显著影响优化器的相对排名。若搜索空间选择不当,几乎不可能成功完成调优。在本基准测试中,我们采用了相对宽泛的初始搜索空间、数十次调优运行,并针对大预算情况优化了这些搜索空间。但需注意,寻找合适搜索空间的问题仍需由实践者自行解决。可以说,若某种优化方法预期超参数调优时不提供明确界定的参数搜索空间,这本身就是该方法的隐性缺陷,基准测试应当反映这一特性。

5. Conclusion

5. 结论

Faced with an avalanche of research developing new stochastic optimization methods, practitioners are left with the near-impossible task not just of picking a method from this ever-growing list, but also of guessing or tuning hyperparameters for it, and even of continuously tuning them during training. Despite efforts by the community, there is currently no method that clearly dominates the competition.

面对层出不穷的新型随机优化方法研究,从业者不仅要在这不断增长的列表中挑选方法,还需猜测或调整其超参数,甚至要在训练过程中持续调参,这几乎是不可能完成的任务。尽管学界付出了诸多努力,目前仍没有任何方法能明显胜出。

We have provided an extensive empirical benchmark of optimization methods for deep learning. It reveals structure in the crowded field of training methods for deep learning: First, although many methods perform competitively, a subset of methods tends to come up near the top across the spectrum of problems. Despite years of new research by many committed authors, ADAM remains a viable (but also not clearly superior) choice for many problems, with NAG or RMSPROP being interesting alternatives that were able to boost performance on individual problems. Secondly, tuning helps about as much as trying other optimizers. Our open and extendable data set allows for many more technical observations, for example, that stability under re-runs is an often overlooked challenge.

我们提供了深度学习优化方法的广泛实证基准测试。研究揭示了这一拥挤领域中的方法结构特征:首先,尽管许多方法表现相当,但部分方法在各类问题中始终位居前列。尽管多年来众多研究者持续投入新方法开发,ADAM 仍是多数场景下可行(但并非绝对优势)的选择,而 NAG 或 RMSPROP 作为替代方案能在特定问题上提升性能。其次,参数调优的收益与更换优化器相当。我们开放可扩展的数据集支持更多技术洞察,例如多次运行的稳定性是常被忽视的挑战。

Perhaps the most important takeaway from our study is hidden in plain sight: the field is in danger of being drowned by noise. Different optimizers exhibit a surprisingly similar performance distribution compared to a single method that is re-tuned or simply re-run with different random seeds. It is thus questionable how much insight the development of new methods yields, at least if they are conceptually and functionally close to the existing population. We hope that benchmarks like ours can help the community to move beyond inventing yet another optimizer and to focus on key challenges, such as automatic, inner-loop tuning for truly robust and efficient optimization. We are releasing our data to allow future authors to ensure that their method contributes to such ends.

我们研究中最重要的启示或许显而易见:这个领域正面临被噪音淹没的危险。与使用不同随机种子重新调参或简单重跑的单一方法相比,各种优化器展现出惊人相似的性能分布。因此,新方法的开发能带来多少洞见值得商榷——至少在概念和功能上与现有方法相近的情况下。我们希望像我们这样的基准测试能帮助学界超越"发明又一个优化器"的循环,转而聚焦关键挑战,例如实现真正鲁棒高效优化的自动内循环调参。我们将公开数据,以便后续研究者能确保其方法对此目标有所贡献。

ACKNOWLEDGMENTS

致谢

The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence “Machine Learning - New Perspectives for Science”, EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg. Moreover, the authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Frank Schneider. We would like to thank Aaron Bahde for providing his analysis on the robustness to random seeds. Further, we are grateful to Lukas Balles, Frederik Künstner, and Felix Dangel for, among other things, helping to create the list of optimizers and providing feedback to the manuscript. Lastly, we want to thank Agustinus Kristiadi, Jonathan Wenger, Marius Hobbhahn, and Lukas Tatzel for their additional feedback.

作者衷心感谢欧洲研究理事会通过ERC StG Action 757275/PANAMA、德国研究基金会(DFG)卓越集群"机器学习-科学新视角"(EXC 2064/1,项目编号390727645)、德国联邦教育与研究部(BMBF)通过图宾根人工智能中心(资助号:01IS18039A),以及巴登-符腾堡州科学、研究与艺术部的资金支持。此外,作者感谢国际马克斯·普朗克智能系统研究院(IMPRS-IS)对Frank Schneider的支持。我们要感谢Aaron Bahde提供的关于随机种子鲁棒性的分析。同时感谢Lukas Balles、Frederik Künstner和Felix Dangel在优化器列表编制和文稿反馈方面的帮助。最后,我们要感谢Agustinus Kristiadi、Jonathan Wenger、Marius Hobbhahn和Lukas Tatzel提供的补充意见。

References

参考文献

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Va- sudevan, V., Viégas, F., Vinyals, O., Warden, P., Watten- berg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL http://tensorflow.org/.

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Va- sudevan, V., Viégas, F., Vinyals, O., Warden, P., Watten- berg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: 异构系统上的大规模机器学习, 2015. URL http://tensorflow.org/.

Agarwal, N., Anil, R., Hazan, E., Koren, T., and Zhang, C. Disentangling Adaptive Gradient Methods from Learning Rates. arXiv preprint: 2002.11803, 2020.

Agarwal, N., Anil, R., Hazan, E., Koren, T., and Zhang, C. 解耦自适应梯度方法与学习率。arXiv预印本: 2002.11803, 2020。

Bergstra, J. and Bengio, Y. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13:281–305, 2012.

Bergstra, J. 和 Bengio, Y. 超参数优化的随机搜索. Journal of Machine Learning Research, 13:281–305, 2012.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, NIPS, 2017.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Advances in Neural Information Processing Systems 30, NIPS, 2017.
Wolpert, D. H. and Macready, W. G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput., 1(1):67–82, 1997.
Xing, C., Arpit, D., Tsirigotis, C., and Bengio, Y. A Walk with SGD. arXiv preprint: 1802.08770, 2018.
Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. Dive into Deep Learning. 2020. https://d2l.ai.

A. List of optimizers and schedules considered

A. 考虑的优化器和调度器列表

Table 2: List of optimizers considered for our benchmark. This is only a subset of all existing methods for deep learning

表 2: 我们基准测试考虑的优化器列表。这只是现有深度学习方法的一个子集

名称 参考文献 名称 参考文献
AcceleGrad (Levy et al., 2018) HyperAdam (Wang et al., 2019b)
ACClip (Zhang et al., 2020) K-BFGS/K-BFGS(L) (Goldfarb et al., 2020)
AdaAlter (Xie et al., 2019) KF-QN-CNN (Ren & Goldfarb, 2021)
AdaBatch (Devarakonda et al., 2017) KFAC (Martens & Grosse, 2015)
AdaBayes/AdaBayes-SS (Aitchison, 2020) KFLR/KFRA (Botev et al., 2017)
AdaBelief (Zhuang et al., 2020) L4Adam/L4Momentum (Rolinek & Martius, 2018)
AdaBlock (Yun et al., 2019) LAMB (You et al., 2020)
AdaBound (Luo et al., 2019) LaProp (Ziyin et al., 2020)
AdaComp (Chen et al., 2018) LARS (You et al., 2017)
Adadelta (Zeiler, 2012) LHOPT (Almeida et al., 2021)
Adafactor (Shazeer & Stern, 2018) LookAhead (Zhang et al., 2019)
AdaFix (Bae et al., 2019) M-SVAG (Balles & Hennig, 2018)
AdaFom (Chen et al., 2019a) MADGRAD (Defazio & Jelassi, 2021)
AdaFTRL (Orabona & Pal, 2015) MAS (Landro et al., 2020)
Adagrad (Duchi et al., 2011) MEKA (Chen et al., 2020b)
ADAHESSIAN (Yao et al., 2020) MTAdam (Malkiel & Wolf, 2020)
Adai (Xie et al., 2020) MVRC-1/MVRC-2 (Chen & Zhou, 2020)
AdaLoss (Teixeira et al., 2019) Nadam (Dozat, 2016)
Adam (Kingma & Ba, 2015) NAMSB/NAMSG (Chen et al., 2019b)
Adam + (Liu et al., 2020b) ND-Adam (Zhang et al., 2017a)
AdamAL (Tao et al., 2019) Nero (Liu et al., 2021b)
AdaMax (Kingma & Ba, 2015) Nesterov (Nesterov, 1983)
AdamBS (Liu et al., 2020c) Noisy Adam/Noisy K-FAC (Zhang et al., 2018)
AdamNC (Reddi et al., 2018) NosAdam (Huang et al., 2019)
AdaMod (Ding et al., 2019) Novograd (Ginsburg et al., 2019)
AdamP/SGDP (Heo et al., 2021) NT-SGD (Zhou et al., 2021b)
AdamT (Zhou et al., 2020) Padam (Chen et al., 2020a)
AdamW (Loshchilov & Hutter, 2019) PAGE (Li et al., 2020b)
AdamX (Tran & Phong, 2019) PAL (Mutschler & Zell, 2020)
ADAS (Eliyahu, 2020) PolyAdam (Orvieto et al., 2019)
AdaS (Hosseini & Plataniotis, 2020) Polyak (Polyak, 1964)
AdaScale (Johnson et al., 2020) PowerSGD/PowerSGDM (Vogels et al., 2019)
AdaSGD (Wang & Wiens, 2020) Probabilistic Polyak (de Roos et al., 2021)
AdaShift (Zhou et al., 2019) ProbLS (Mahsereci & Hennig, 2017)
AdaSqrt (Hu et al., 2019) PStorm (Xu, 2020)
Adathm (Sun et al., 2019) QHAdam/QHM (Ma & Yarats, 2019)
AdaX/AdaX-W (Li et al., 2020a) RAdam (Liu et al., 2020a)
AEGD (Liu & Tian, 2020) Ranger (Wright, 2020b)
ALI-G (Berrada et al., 2020) RangerLars (Grankin, 2020)
AMSBound (Luo et al., 2019) RMSProp (Tieleman & Hinton, 2012)
AMSGrad (Reddi et al., 2018) RMSterov (Choi et al., 2019)
AngularGrad (Roy et al., 2021) S-SGD (Sung et al., 2020)
ArmijoLS (Vaswani et al., 2019) SAdam (Wang et al., 2020b)
ARSG (Chen et al., 2019b) Sadam/SAMSGrad (Tong et al., 2019)
ASAM (Kwon et al., 2021) SALR (Yue et al., 2020)
AutoLRS (Jin et al., 2021) SAM (Foret et al., 2021)
AvaGrad (Savarese et al., 2019) SC-Adagrad/SC-RMSProp (Mukkamala & Hein, 2017)
BAdam (Salas et al., 2018) SDProp (Ida et al., 2017)
BGAdam (Bai & Zhang, 2019) SGD (Robbins & Monro, 1951)
BPGrad (Zhang et al., 2017b) SGD-BB (Tan et al., 2016)
BRMSProp (Aitchison, 2020) SGD-G2 (Ayadi & Turinici, 2020)
BSGD (Hu et al., 2020) SGDEM (Ramezani-Kebrya et al., 2021)
C-ADAM (Tutunov et al., 2020) SGDHess (Tran & Cutkosky, 2021)
CADA (Chen et al., 2021) SGDM (Liu & Luo, 2020)
Cool Momentum (Borysenko & Byshkin, 2020) SGDR (Loshchilov & Hutter, 2017)
CProp (Preechakul & Kijsirikul, 2019) SHAdagrad (Huang et al., 2020)
Curveball (Henriques et al., 2019) Shampoo (Anil et al., 2020; Gupta et al., 2018)
Dadam (Nazari et al., 2019) SignAdam++ (Wang et al., 2019a)
DeepMemory (Wright, 2020a) SignSGD (Bernstein et al., 2018)
DGNOpt (Liu et al., 2021a) SKQN/S4QN (Yang et al., 2020)
DiffGrad (Dubey et al., 2020) SM3 (Anil et al., 2019)
EAdam (Yuan & Gao, 2020) SMG (Tran et al., 2020)
EKFAC (George et al., 2018) SNGM (Zhao et al., 2020)
Eve (Hayashi et al., 2018) SoftAdam (Fetterman et al., 2019)
Expectigrad (Daley & Amato, 2020) SRSGD (Wang et al., 2020a)
FastAdaBelief (Zhou et al., 2021a) Step-Tuned SGD (Castera et al., 2021)
FRSGD (Wang & Ye, 2020) SWATS (Keskar & Socher, 2017)
G-AdaGrad (Chakrabarti & Chopra, 2021) SWNTS (Chen et al., 2019c)
GADAM (Zhang & Gouza, 2018) TAdam (Ilboudo et al., 2020)
Gadam (Granziol et al., 2020) TEKFAC (Gao et al., 2020)
GOALS (Chae et al., 2021) VAdam (Khan et al., 2018)
GOLS-I (Kafka & Wilke, 2019) VR-SGD (Shang et al., 2020)
Grad-Avg (Purkayastha & Purkayastha, 2020) vSGD-b/vSGD-g/vSGD-l (Schaul et al., 2013)
GRAPES (Dellaferrera et al., 2021) vSGD-fd (Schaul & LeCun, 2013)
Gravilon (Kelterborn et al., 2020) WNGrad (Wu et al., 2018)
Gravity (Bahrami & Zadeh, 2021) YellowFin Yogi (Zhang & Mitliagkas, 2019) (Zaheer et al., 2018)

B. List of optimizers selected

B. 所选优化器列表

Table 4: Selected optimizers for our benchmarking process with their respective color, hyperparameters, default values, tuning distributions and scheduled hyperparameters. Here, $\mathcal{L}\mathcal{U}(\cdot,\cdot)$ denotes the log-uniform distribution while $\mathcal{U}\{\cdot,\cdot\}$ denotes the discrete uniform distribution.

表 4: 基准测试过程中选用的优化器及其对应颜色、超参数、默认值、调优分布和计划超参数。其中,$\mathcal{L}\mathcal{U}(\cdot,\cdot)$表示对数均匀分布,$\mathcal{U}\{\cdot,\cdot\}$表示离散均匀分布。

优化器 参考文献 参数 默认值 调优分布 调度
AMSBOUND (Luo et al., 2019) 10-3 CU(10-4,1)
10 0.1 CU(10-3,0.5)
β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
10-3 CU(10-4,10-1)
E 10-8
AMSGRAD (Reddi et al., 2018) α 10-2 CU(10-4,1)
β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
E 10-8 x
ADABELIEF
(Zhuang et al., 2020) & 10-3 CU(10-4,1)
β1 0.9 CU(0.5,0.999)
β2 0.999 LU(0.8,0.999)
E 10-14 x
ADABOUND (Luo et al., 2019) 10-3 CU(10-4,1)
0.1 CU(10-3,0.5)
01 β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
10-3 cU(10-4,10-1)
ADADELTA 10-3 CU(10-4,1)
(Zeiler, 2012) E 10-8
1- p 0.95 CU(10-4,1)
ADAGRAD 10-2 CU(10-4,1)
(Duchi et al., 2011) 10-7 x
E
ADAM (Kingma & Ba, 2015) & 10-3 CU(10-4,1)
β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
E 10-8 x
LOOKAHEAD (Zhang et al., 2019) 0.5 CU(10-4,1)
MOMENTUM 10-2 CU(10-4,1)
缩写 LA(MOM.) fo k 5 U{1,20}
1-p 0.99 CU(10-4,1)
LOOKAHEAD (Zhang et al., 2019) 0.5 CU(10-4,1)
RADAM 缩写 LA(RADAM) αf 10-3 CU(1e - 4,1)
β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
E 10-7
k 5 u{1,20}
MOMENTUM (Polyak, 1964) 10~2 CU(10-4,1)
1-p 0.99 CU(10-4,1)
NAG 10-2 CU(10-4,1)
(Nesterov, 1983) 0.99 CU(10-4,1)
1-p
NADAM (Dozat, 2016) α 10-3 CU(10-4,1)
β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
E 10-7
RADAM (Liu et al., 2020a) 10-3 CU(10-4,1)
β1 0.9 CU(0.5,0.999)
β2 0.999 CU(0.8,0.999)
E 10-7
RMSPROP (Tieleman & Hinton, 2012) 10-3 CU(10-4,1)
E 10-10
1-p 0.9 CU(10-4,1)
SGD (Robbins & Monro,1951) 10-2 CU(10-4,1)

C. Robustness to random seeds

C. 随机种子鲁棒性

Data subsampling, random weight initialization, dropout and other aspects of deep learning introduce stochasticity to the training process. As such, judging the performance of an optimizer on a single run may be misleading due to random fluctuations. In our benchmark we use 10 different seeds of the final setting for each budget in order to judge the stability of the optimizer and the results. However, to keep the magnitude of this benchmark feasible, we only use a single seed while tuning, analogously to how a single user would progress. This means that our tuning process can sometimes choose hyperparameter settings which might not even converge for seeds other than the one used for tuning.

数据子采样、随机权重初始化、dropout等深度学习技术会在训练过程中引入随机性。因此,仅通过单次运行评估优化器性能可能因随机波动而产生误导。在本基准测试中,我们为每个预算配置使用10组不同的随机种子来评判优化器的稳定性和结果。但为控制基准测试规模,调参过程中仅使用单一随机种子(模拟用户常规操作方式)。这意味着我们的调参过程有时会选择仅在调参种子下收敛的超参数配置。

Figure 5 illustrates this behavior on an example problem where we used 10 seeds throughout a tuning process using grid search. The figure shows that in the beginning performance increases when i