Revisiting Adversarial Training under Long-Tailed Distributions
Xinli Yue, Ningping Mou, Qian Wang, Lingchen Zhao*
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
{xinliyue,ningpingmou,qianwang,lczhaocs}@whu.edu.cn
Abstract
Deep neural networks are vulnerable to adversarial attacks, often leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However, existing adversarial training techniques have predominantly been tested on balanced datasets, whereas real-world data often exhibit a long-tailed distribution, casting doubt on the efficacy of these methods in practical scenarios.
In this paper, we delve into adversarial training under long-tailed distributions. Through an analysis of the previous work "RoBal", we discover that utilizing Balanced Softmax Loss alone can achieve performance comparable to the complete RoBal approach while significantly reducing training overheads. Additionally, we reveal that, similar to uniform distributions, adversarial training under long-tailed distributions also suffers from robust overfitting. To address this, we explore data augmentation as a solution and unexpectedly discover that, unlike results obtained with balanced data, data augmentation not only effectively alleviates robust overfitting but also significantly improves robustness. We further investigate the reasons behind this improvement and identify that it is attributable to the increased diversity of examples. Extensive experiments further corroborate that data augmentation alone can significantly improve robustness. Finally, building on these findings, we demonstrate that compared to RoBal, the combination of BSL and data augmentation leads to a +6.66 improvement in model robustness under AutoAttack on CIFAR-10-LT. Our code is available at: https://github.com/NISPLab/AT-BSL.
1. Introduction
It is well-known that deep neural networks (DNNs) are vulnerable to adversarial attacks, where attackers can induce errors in DNNs’ recognition results by adding perturbations that are imperceptible to the human eye [12, 39]. Many researchers have focused on defending against such attacks. Among the various defense methods proposed, adversarial training is recognized as one of the most effective approaches. It involves integrating adversarial examples into the training set to enhance the model’s generalization capability against these examples [20, 31, 42, 45, 53, 54]. In recent years, significant progress has been made in the field of adversarial training. However, we note that almost all studies on adversarial training utilize balanced datasets like CIFAR-10, CIFAR-100 [23], and Tiny-ImageNet [24] for performance evaluation. In contrast, real-world datasets often exhibit an imbalanced, typically long-tailed distribution. Hence, the efficacy of adversarial training in practical systems should be reassessed using long-tailed datasets [14, 40].
Figure 1. The clean accuracy and robustness under AutoAttack (AA) [5] of various adversarial training methods using WideResNet-34-10 [51] on CIFAR-10-LT [23]. Our method, building upon AT [31] and BSL [36], leverages data augmentation to improve robustness, achieving a +6.66 improvement over the SOTA method RoBal [46]. REAT [26] is a concurrent work with ours, yet to be published.
To the best of our knowledge, RoBal [46] is the sole published work that investigates adversarial robustness under the long-tailed distribution. However, due to its complex design, RoBal demands extensive training time and GPU memory, which somewhat limits its practicality. Upon revisiting the design and principles of RoBal, we find that its most critical component is the Balanced Softmax Loss (BSL) [36]. We observe that combining AT [31] with BSL to form AT-BSL can match RoBal's effectiveness while significantly reducing training overhead. Following Occam's Razor principle, whereby entities should not be multiplied without necessity [19], we advocate using AT-BSL as a substitute for RoBal. In this paper, we base our studies on AT-BSL.
In the course of our study on the robustness of models under long-tailed distributions, we encounter another significant finding: adversarial training with long-tailed data, similar to training on balanced datasets, also leads to the issue of robust overfitting [37]. Previous works on balanced datasets often employed data augmentation to mitigate robust overfitting [4, 13, 35, 37, 45]. Hence, a straightforward approach is to attempt the use of data augmentation to alleviate the robust overfitting issue in adversarial training under long-tailed distributions. Our results align with findings on balanced datasets, indicating that data augmentation can mitigate robust overfitting. However, contrary to findings on balanced datasets, where it was concluded that data augmentation alone cannot improve robustness [35, 37, 45], we find that data augmentation techniques, including MixUp [52], Cutout [9], CutMix [50], AugMix [17], AutoAugment (AuA) [6], RandAugment (RA) [7], and TrivialAugment (TA) [33], can significantly improve robustness. This raises the question: why does data augmentation improve robustness? We hypothesize that data augmentation increases example diversity, enabling the model to learn richer representations and thereby improving its robustness. Subsequently, we validate our hypothesis through ablation studies on RA.
Our contributions are summarized as follows:
2. Related Works
Long-Tailed Recognition. Long-tailed distributions refer to a common imbalance in training sets where a small portion of classes (head) have massive numbers of examples, while the remaining classes (tail) have very few [14, 40]. Models trained under such distributions tend to exhibit a bias towards the head classes, resulting in poor performance on the tail classes. Traditional rebalancing techniques aimed at addressing the long-tailed recognition problem include re-sampling [21, 38, 41, 56] and cost-sensitive learning [8, 29], which often improve the performance of tail classes at the expense of head classes. To mitigate these adverse effects, some methods handle class-specific attributes through perspectives such as margins [43] and biases [36]. Recently, more advanced techniques like class-conditional sharpness-aware minimization [57], feature clusters compression [27], and global-local mixture consistency cumulative learning [10] have been introduced, further improving the performance of long-tailed recognition. However, these works have been devoted to improving clean accuracy, and investigations into the adversarial robustness of long-tailed recognition remain scant.
Adversarial Training. The philosophy of adversarial training involves integrating adversarial examples into the training set, thereby improving the model's generalizability to such examples. Adversarial training addresses a min-max problem, with the inner maximization dedicated to generating the strongest adversarial examples and the outer minimization aimed at optimizing the model parameters. The quintessential method of adversarial training is AT [31], which can be mathematically represented as follows:
$$\min_{\theta_m} \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathcal{L}_{\min}\big(x', y; \theta_m\big) \Big], \qquad x' = \arg\max_{\|x' - x\|_p \leq \epsilon} \mathcal{L}_{\max}\big(x', y; \theta_m\big)$$

where $x'$ is an adversarial example constrained by the $\ell_p$ norm around the clean example $x$, $y$ is the label of $x$, $\theta_m$ denotes the parameters of the model $m$, $\epsilon$ is the perturbation size, $\mathcal{L}_{\max}$ is the inner maximization loss, and $\mathcal{L}_{\min}$ is the outer minimization loss.
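To make the inner maximization concrete, here is a minimal PyTorch sketch of the PGD solver commonly used for it; the function name is ours, and the use of cross-entropy for $\mathcal{L}_{\max}$ is an illustrative assumption rather than the fixed choice of the works cited above.

```python
import torch
import torch.nn.functional as F

def pgd_inner_max(model, x, y, eps=8/255, step_size=2/255, steps=10):
    """Approximate the inner maximization with PGD (L_max = cross-entropy here)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```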
Building upon the foundation of AT [31], subsequent works developed advanced adversarial training techniques such as TRADES [53], MART [42], AWP [45], GAIRAT [54], and LAS-AT [20]. However, these adversarial training methods were predominantly experimented with on balanced datasets like CIFAR-10 and CIFAR-100. Robustness under Long-Tailed Distribution. Previous adversarial training works were concentrated mainly on balanced datasets. However, data in the real world are seldom balanced; they are more commonly characterized by long-tailed distributions [14, 40]. Therefore, a critical criterion for assessing the practical utility of adversarial training should be its performance on long-tailed distributions. To our knowledge, RoBal [46] is the only work that investigates adversarial training on long-tailed datasets. In Section 3, we delve into the components of RoBal, improv- ing the efficacy of long-tailed adversarial training based on our findings. Moreover, some works [30, 44, 47, 49] have already indicated that adversarial training on balanced datasets can lead to significant robustness disparities across classes. Whether this disparity is exacerbated on long-tailed datasets remains an open question for further exploration.
Data Augmentation. In standard training regimes, data augmentation has been validated as an effective tool to mitigate overfitting and improve model generalization, regardless of whether the data distribution is balanced or long-tailed [2, 10, 48, 55]. The most commonly utilized augmentation techniques for image classification tasks include random flips, rotations, and crops [15]. More sophisticated augmentation methods like MixUp [52], Cutout [9], and CutMix [50] have been shown to yield superior results in standard training contexts. Furthermore, augmentation strategies such as AugMix [17], AuA [6], RA [7], and TA [33], which employ a learned or random combination of multiple augmentations, have elevated the efficacy of data augmentation to new heights, heralding the advent of the era of automated augmentation.
3. Analysis of RoBal
3.1. Preliminaries
RoBal [46], in comparison to AT [31], incorporates four additional components: 1) a cosine classifier; 2) Balanced Softmax Loss [36]; 3) a class-aware margin; and 4) TRADES regularization [53].
Cosine Classifier. In basic classification tasks employing a standard linear classifier, the predicted logit for class i can be represented as:
$$g(x)_i = W_i^\top f(x) + b_i = \|W_i\|\,\|f(x)\|\cos\theta_i + b_i$$

where $g(\cdot)$ is the linear classifier. In this formulation, the prediction depends on three factors: 1) the magnitudes of the weight vector $\|W_i\|$ and the feature vector $\|f(x)\|$; 2) the angle between them, expressed as $\cos\theta_i$; and 3) the bias of the classifier $b_i$.
The above decomposition illustrates that simply scaling the norm of examples in feature space can alter the predictions for those examples. In linear classifiers, the scale of the weight vector $\|W_i\|$ often diminishes for tail classes, thereby impairing their recognition performance. Consequently, [46] employs a cosine classifier [34] to mitigate the scale effects of features and weights. In the cosine classifier, the predicted logit for class $i$ can be represented as:
$$h(x)_i = s \cdot \frac{W_i^\top f(x)}{\|W_i\|\,\|f(x)\|} = s\cos\theta_i$$

where $h(\cdot)$ is the cosine classifier, $\|\cdot\|$ denotes the $\ell_2$ norm of a vector, and $s$ is the scaling factor.
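A minimal sketch of such a cosine classifier head, assuming features come from an upstream backbone; the module name and initialization are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Predicts z_i = s * cos(theta_i): weight and feature norms are divided out."""
    def __init__(self, feat_dim, num_classes, s=10.0):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))
        self.s = s

    def forward(self, features):
        # l2-normalize both sides so only the angle between them remains.
        return self.s * F.linear(F.normalize(features, dim=1),
                                 F.normalize(self.weight, dim=1))
```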
Balanced Softmax Loss. An intuitive and widely adopted approach to addressing class imbalance is to assign class-specific biases to the cross-entropy (CE) loss during training. [46] employs the formulation of [32, 36], denoted as $b_i = \tau_b \log(n_i)$, where the modified cross-entropy loss, namely Balanced Softmax Loss (BSL), becomes:
$$\mathcal{L}_0(x, y) = -\log \frac{e^{z_y + b_y}}{\sum_{i} e^{z_i + b_i}}, \qquad b_i = \tau_b \log(n_i)$$

where $z_i$ is the predicted logit for class $i$, $n_i$ is the number of examples in the $i$-th class, and $\tau_b$ is a hyperparameter controlling the calculation of the bias. BSL adapts to the label distribution shift between training and testing by adding a specific bias to each class based on its number of examples, thereby improving long-tailed recognition performance [36].
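Because BSL is simply a bias-adjusted cross-entropy, it amounts to a one-line change in practice. A minimal sketch, with variable names of our choosing:

```python
import torch
import torch.nn.functional as F

def balanced_softmax_loss(logits, targets, class_counts, tau_b=1.0):
    """BSL: shift each class logit by tau_b * log(n_i), then apply cross-entropy.
    With tau_b = 0 the bias vanishes and this reduces to the vanilla CE loss."""
    bias = tau_b * torch.log(class_counts.float())  # shape: [num_classes]
    return F.cross_entropy(logits + bias, targets)
```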
Class-Aware Margin. However, when considering the margin representation, the margin from the true class $y$ to class $i$, denoted by $\tau_b \log(n_i / n_y)$, becomes negative when $n_y > n_i$, leading to poorer discriminative representations and classifier learning for head classes. To address this, [46] introduces a class-aware margin term [34], which assigns a larger margin value to head classes as compensation:
$$m_i = \tau_m \log\!\left(\frac{n_i}{n_{\min}}\right) + m_0$$

The first term increases with $n_i$ and reaches its minimum of zero when $n_i = n_{\min}$, with $\tau_m$ as the hyperparameter controlling the trend. The second term, $m_0 > 0$, is a uniform boundary for all classes, a common strategy in networks based on cosine classifiers. Adding this class-aware margin $m_i$ to $\mathcal{L}_0$ yields $\mathcal{L}_1$:

$$\mathcal{L}_1(x, y) = -\log \frac{e^{s(\cos\theta_y - m_y) + b_y}}{e^{s(\cos\theta_y - m_y) + b_y} + \sum_{i \neq y} e^{s\cos\theta_i + b_i}}$$
TRADES Regularization. [46] incorporates a KL regularization term following TRADES [53], thereby modifying the overall loss function to:
$$\mathcal{L}_2 = \mathcal{L}_1(x', y) + \beta \cdot \mathrm{KL}\big(p(\cdot \mid x) \,\|\, p(\cdot \mid x')\big)$$

where $\beta$ serves as a hyperparameter controlling the intensity of the TRADES regularization.
Table 1. The clean accuracy, robustness, time (average per epoch), and memory (GPU) using ResNet-18 [15] on CIFAR-10-LT following the integration of components from RoBal [46] into AT [31]. The best results are bolded. The second best results are underlined. Cos: Cosine Classifier; BSL: Balanced Softmax Loss [36]; CM: Class-aware Margin [46]; TRADES: TRADES Regularization [53].
| Method | Cos | BSL | CM | TRADES |
|---|---|---|---|---|
| AT [31] | | | | |
| AT-BSL | | ✓ | | |
| AT-BSL-Cos | ✓ | ✓ | | |
| AT-BSL-Cos-TRADES | ✓ | ✓ | | ✓ |
| RoBal [46] | ✓ | ✓ | ✓ | ✓ |

*(The accuracy and efficiency entries of Table 1 are not recoverable from the extraction.)*
3.2. Ablation Studies of RoBal
To investigate the role of each component in RoBal [46], we conduct ablation studies on it. Specifically, we incrementally add each component of RoBal to AT [31] and then evaluate the method's clean accuracy, robustness, training time per epoch, and memory usage. The results are summarized in Table 1. Note that the parameters utilized in Table 1 adhere strictly to the default settings of [46], and the details about adversarial attacks are in Section 5.1. We observe that the AT-BSL method outperforms AT [31] in terms of clean accuracy and adversarial robustness. However, upon integrating a cosine classifier with AT-BSL, while the robustness under PGD [31] significantly improves, robustness under adaptive attacks like CW [3], LSA [18], and AA [5] notably decreases. This aligns with observations in REAT [26], suggesting that the cosine classifier (a scale-invariant classifier) used in RoBal may lead to gradient vanishing when generating adversarial examples with the cross-entropy loss. This is attributed to the normalization of weights and features in the classification layer, which substantially reduces the gradient scale, impeding the generation of potent adversarial examples [26]. Further additions of TRADES regularization [53] and the class-aware margin do not yield substantial improvements in robustness under AA, yet markedly increase training time and memory consumption. In fact, AT-BSL alone can match the complete RoBal in terms of clean accuracy and robustness under AA. Therefore, in line with Occam's Razor [19], we advocate using AT-BSL, which renders adversarial training more efficient without sacrificing significant performance. The $\mathcal{L}_{\min}$ formula of AT-BSL is as follows:

$$\mathcal{L}_{\min}(x', y) = -\log \frac{e^{z_y + \tau_b \log(n_y)}}{\sum_{i} e^{z_i + \tau_b \log(n_i)}}$$
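Combining the BSL loss with the PGD inner maximization sketched earlier, a single AT-BSL update might look as follows; this is our own illustrative sketch, not the released implementation:

```python
def at_bsl_step(model, optimizer, x, y, class_counts, tau_b=1.0):
    """One AT-BSL update: PGD-10 inner maximization, BSL outer minimization."""
    model.eval()  # freeze BN statistics while crafting adversarial examples
    x_adv = pgd_inner_max(model, x, y, eps=8/255, step_size=2/255, steps=10)
    model.train()
    optimizer.zero_grad()
    loss = balanced_softmax_loss(model(x_adv), y, class_counts, tau_b)
    loss.backward()
    optimizer.step()
    return loss.item()
```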
Figure 2. Learning rate scheduling analysis of RoBal [46]. (a) Comparison of the learning rate schedules: 'RoBal Code Schedule' from the source code and 'RoBal Paper Schedule' as described in the publication. (b) The evolution of test robustness under PGD-20 [31] using ResNet-18 on CIFAR-10-LT across training epochs.
3.3. Robust Overfitting and an Unexpected Discovery
Discrepancy in Learning Rate Scheduling: Paper Description vs. Code Implementation. RoBal [46] asserts that early stopping is not employed and that the reported results are from the final epoch, the 80th. The declared learning rate schedule is an initial rate of 0.1, with decays at the 60th and 70th epochs, each by a factor of 0.1. After executing the source code of RoBal, we observe, as depicted by the blue line in Fig. 2(b), that test robustness remains essentially unchanged after the first learning rate decay (60th epoch), indicating an absence of robust overfitting. It is well-known that adversarial training on CIFAR-10 exhibits significant robust overfitting [37], and given that CIFAR-10-LT has less data than CIFAR-10, the absence of robust overfitting on CIFAR-10-LT contradicts the assertion in [35] that additional data can alleviate robust overfitting.
Upon a meticulous examination of the official code provided by RoBal [46], we discover inconsistencies between the implemented learning rate schedule and what is claimed in the paper. The official code uses a learning rate schedule starting at 0.1, with a decay of 0.1 per epoch after the 60th epoch and 0.01 per epoch after the 75th epoch (the blue line in Fig. 2(a)). This leads to a learning rate as low as 1e-26 by the 80th epoch, potentially limiting learning after the 60th epoch and contributing to the similar performance of models at the 60th and 80th epochs shown in Fig. 2(b).
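The 1e-26 figure follows from compounding the per-epoch decays; a quick sanity check, assuming one decay per epoch past each threshold:

```python
lr = 0.1
for epoch in range(1, 81):
    if epoch > 75:
        lr *= 0.01  # code schedule: x0.01 per epoch after the 75th epoch
    elif epoch > 60:
        lr *= 0.1   # code schedule: x0.1 per epoch after the 60th epoch
print(f"{lr:.0e}")  # 1e-26 = 0.1 * 0.1**15 * 0.01**5
```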
Subsequently, we adjust the learning rate schedule to what is declared in [46] (the orange line in Fig. 2(a)) and redraw the robustness curve, represented by the orange line in Fig. 2(b). Post-adjustment, a continuous decline in test robustness following the first learning rate decay is observed, aligning with the robust overfitting phenomenon typically seen on CIFAR-10.
Therefore, adversarial training under long-tailed distributions exhibits robust overfitting, similar to balanced distributions. So, how might we resolve this problem? Several works [4, 13, 35, 37, 45] have attempted to use data augmentation to alleviate robust overfitting on balanced datasets.
Testing MixUp. [35, 37, 45] suggest that on CIFAR-10, MixUp [52] can alleviate robust overfitting. Therefore, we posit that on the long-tailed version of CIFAR-10, CIFAR-10-LT, MixUp would also mitigate robust overfitting. In Fig. 3(a), it is evident that AT-BSL-MixUp, which utilizes MixUp, significantly alleviates robust overfitting compared to AT-BSL. Furthermore, we unexpectedly discover that MixUp markedly improves robustness. This observation is inconsistent with previous findings on balanced datasets [35, 37, 45], where it was concluded that data augmentation alone does not improve robustness.
Exploring Data Augmentation. Following the validation of the MixUp hypothesis, our investigation expands to assess whether other augmentation techniques can alleviate robust overfitting and improve robustness. This includes augmentations like Cutout [9], CutMix [50], AugMix [17], TA [33], AuA [6], and RA [7]. Analogous to our analysis of MixUp, we report the robustness achieved by these augmentation techniques during training in Fig. 3. Firstly, our findings indicate that each augmentation technique mitigates robust overfitting, with CutMix, AuA, RA, and TA exhibiting almost negligible instances of this phenomenon. Furthermore, we observe that the robustness attained by each augmentation surpasses that of the vanilla AT-BSL, further corroborating that data augmentation alone can improve robustness.
4. Why Data Augmentation Can Improve Robustness
Formulating Hypothesis. We postulate that data augmentation improves robustness by increasing example diversity, thereby allowing models to learn richer representations. Taking RA [7] as an illustrative example: for each training image, RA randomly selects a series of augmentations from a search space consisting of 14 augmentations, namely Identity, ShearX, ShearY, TranslateX, TranslateY, Rotate, Brightness, Color, Contrast, Sharpness, Posterize, Solarize, AutoContrast, and Equalize, to apply to the image. We initiate an ablation study on RA, testing the impact of each augmentation individually. Specifically, we narrow the search space of RA to a single augmentation, meaning RA is restricted to using only this one augmentation for all training examples. From Fig. 4(a), it can be observed that except for Contrast, none of the augmentations alone improves robustness; in fact, augmentations such as Solarize, AutoContrast, and Equalize significantly underperform compared to AT-BSL. We surmise that this is due to the limited example diversity provided by a single augmentation, resulting in no substantial improvement in robustness.
Figure 3. The evolution of test robustness under PGD-20 using ResNet-18 on CIFAR-10-LT for AT-BSL using different data augmentation strategies across training epochs. For reference, the red dashed lines in each panel represent the robustness of the best checkpoint of AT-BSL. Due to the density of the illustrations, the results have been compartmentalized into four distinct panels: (a), (b), (c), and (d).
Validating Hypothesis. Subsequently, we explore the impact of the number of augmentation types on robustness. Specifically, for each trial, we randomly select $n$ types of augmentations to constitute the search space of RA, with $n \in \{2, \ldots, 14\}$. Each experiment is repeated five times. As shown in Fig. 4(b), robustness progressively improves as more augmentation methods are added to the search space of RA. This indicates that as the number of augmentation types in the search space increases, the variety of augmentations available to examples also grows, leading to greater example diversity. Consequently, the representations learned by the model become more comprehensive, thereby improving robustness. This validates our hypothesis.
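A minimal sketch of this restricted-search-space ablation; the op list below is a hand-rolled stand-in built on torchvision's functional transforms rather than RA's exact implementation, and the magnitude ranges are illustrative:

```python
import random
import torchvision.transforms.functional as TF

# A subset of RA-style ops; each takes and returns a PIL image.
OPS = {
    "Identity":     lambda img: img,
    "Rotate":       lambda img: TF.rotate(img, angle=random.uniform(-30, 30)),
    "Brightness":   lambda img: TF.adjust_brightness(img, 1 + random.uniform(-0.5, 0.5)),
    "Contrast":     lambda img: TF.adjust_contrast(img, 1 + random.uniform(-0.5, 0.5)),
    "Sharpness":    lambda img: TF.adjust_sharpness(img, 1 + random.uniform(-0.5, 0.5)),
    "Posterize":    lambda img: TF.posterize(img, bits=random.randint(4, 8)),
    "Solarize":     lambda img: TF.solarize(img, threshold=random.randint(128, 255)),
    "AutoContrast": lambda img: TF.autocontrast(img),
    "Equalize":     lambda img: TF.equalize(img),
}

def restricted_randaugment(img, op_names, num_ops=2):
    """Apply num_ops random ops drawn only from the restricted space op_names."""
    for name in random.choices(op_names, k=num_ops):
        img = OPS[name](img)
    return img

# Example usage (img: a PIL image):
#   restricted_randaugment(img, ["Solarize"])      # single-op ablation
#   subset = random.sample(list(OPS), 5)           # random n-subset per trial
#   restricted_randaugment(img, subset)
```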
Moreover, to further substantiate our hypothesis, we conduct an ablation study on the three augmentations (Solarize, AutoContrast, and Equalize) which, when used individually, impair robustness. Specifically, we eliminate these three and employ the remaining 11 augmentations as the baseline, RA-11. We then incrementally add
Figure 4. The robustness under AA for AT-BSL with different augmentations using ResNet-18 on CIFAR-10-LT. (a) Change the augmentation space of RA [7] to a single augmentation, and the horizontal axis represents the name of the single augmentation. (b) The horizontal axis represents the number of types of augmentations in the search space of RA.
Table 2. The clean accuracy and robustness under AA for AT-BSL with different augmentations using ResNet-18 on CIFAR-10-LT. The best results are bolded. RA-11 means only using the first 11 augmentations in the search space of RA. The lines below RA-11 indicate additional augmentations based on RA-11, and the last line uses the complete search space of RA. SO: Solarize; AC: AutoContrast; EQ: Equalize.
| Method | Clean | FGSM | PGD | CW | LSA | AA |
|---|---|---|---|---|---|---|
| RA-11 | 67.80 | 40.68 | 35.88 | 34.01 | 33.89 | 32.12 |
| +SO | 67.60 | 41.43 | 37.04 | 34.52 | 34.05 | 32.76 |
| +AC | 68.57 | 41.20 | 36.60 | 34.24 | 34.07 | 32.51 |
| +EQ | 68.33 | 41.64 | 36.80 | 34.33 | 34.17 | 32.59 |
| +SO+AC | 68.43 | 42.10 | 37.23 | 34.62 | 34.37 | 33.02 |
| +SO+EQ | 68.53 | 41.89 | 37.42 | 35.07 | 34.83 | 33.49 |
| +AC+EQ | 68.36 | 41.88 | 37.42 | 34.91 | 34.49 | 33.15 |
| +SO+AC+EQ | **70.86** | **43.06** | **37.94** | **36.24** | **36.04** | **34.24** |
one to three of these negative augmentations, with the results outlined in Table 2. It is discovered that the more augmentation types are added, the more significant the improvement in robustness. Despite the negative impact of these three augmentations when used in isolation, their inclusion in the search space of RA still contributes to robustness improvement, further validating our hypothesis that data augmentation increases example diversity and thereby improves robustness.
5. Experiments
5.1. Settings
Datasets. Following [46], we conduct experiments on CIFAR-10-LT and CIFAR-100-LT [23]. Due to space constraints, partial results for CIFAR-100-LT are included in the appendix. In our main experiments, the imbalance ratio (IR) of CIFAR-10-LT is set to 50. Table 6 also provides results for various IRs.
Evaluation Metrics. When assessing model robustness, the $\ell_\infty$ norm-bounded perturbation is $\epsilon = 8/255$. The attacks carried out include the single-step attack FGSM [12] and several iterative attacks, such as PGD [31], CW [3], and LSA [18], performed over 20 steps with a step size of 2/255. We also employ AutoAttack (AA) [5], considered the strongest attack so far. For all methods, the evaluations are based on both the best checkpoint (selected based on robustness under PGD-20) and the final checkpoint.
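For reference, robustness under AA is typically measured with the auto-attack package of [5]; a minimal evaluation sketch under the same $\ell_\infty$ budget (the model and test tensors are assumed given):

```python
from autoattack import AutoAttack  # https://github.com/fra31/auto-attack

def eval_aa(model, x_test, y_test, eps=8/255, batch_size=128):
    """Robust accuracy under the standard AutoAttack suite (l_inf, radius eps)."""
    model.eval()
    adversary = AutoAttack(model, norm='Linf', eps=eps, version='standard')
    return adversary.run_standard_evaluation(x_test, y_test, bs=batch_size)
```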
Comparison Methods. We consider adversarial training methods under long-tailed distributions: RoBal [46] and REAT [26], as well as defenses under balanced distributions: AT [31], TRADES [53], MART [42], AWP [45], GAIRAT [54], and LAS-AT [20].
Training Details. We train the models using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 5e-4. We set the batch size to 128. We set the total number of training epochs to 100, and the learning rate is divided by 10 at the 75th and 90th epochs following [53]. When generating adversarial examples, we enforce a maximum perturbation of 8/255 and a step size of 2/255. The number of iterations for internal maximization is fixed at 10, denoting PGD-10, and the impact of PGD steps on robustness is investigated in Table 15. For all experiments related to AT-BSL, we adopt $\tau_b = 1$, and the results for different $\tau_b$ are provided in Fig. 7. Note that the AT-BSL presented in Tables 3 and 4 represents our own implementation, which differs in training parameters from RoBal [46]. Detailed discussions regarding these discrepancies are provided in the appendix.
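The optimization setup above corresponds to a standard PyTorch configuration; a sketch:

```python
import torch

def make_optimizer(model):
    """SGD with the settings above; lr drops by 10x at epochs 75 and 90 of 100."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[75, 90], gamma=0.1)
    return optimizer, scheduler
```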
5.2. Main Results
As evident from Tables 3 and 4, on CIFAR-10-LT, AT-BSL with data augmentation achieves the highest clean accuracy and adversarial robustness on both ResNet-18 and WideResNet-34-10. Note that on WideResNet-34-10, our method, AT-BSL-AuA, demonstrates a significant improvement of +6.66 robustness under AA compared to the SOTA method RoBal. Moreover, in terms of robustness at the final checkpoint, our method significantly outperforms the others, demonstrating that data augmentation mitigates robust overfitting.
We present the robustness of the different methods across each class in Fig. 5. It is observable that, except for a few classes, our method improves robustness in almost every class, particularly in the tail classes (classes 5 to 9), where the improvements are more pronounced. Furthermore, consistent with observations on balanced datasets [30, 44, 47, 49], there is a significant disparity in class-wise robustness. Class 3 remains the least robust despite its example numbers far exceeding those of subsequent classes, which may be attributable to the intrinsic properties of class 3 [46].
5.3. Further Analysis
Effect of Augmentation Strategies and Parameters. We present in Table 5 and Fig. 6 the impact of different augmentation strategies and parameters on robustness. Specifically, we conduct experiments using ResNet-18 on CIFAR-10-LT, comparing robustness at the best checkpoint. In Table 5, we use the best hyperparameters: mixing rate $\alpha = 0.3$ for MixUp, window length 17 for Cutout, mixing rate $\alpha = 0.1$ for CutMix, and magnitude 8 for RA. As shown in Table 5, various augmentation strategies improve robustness compared to vanilla AT-BSL, with AuA and RA also achieving gains in clean accuracy. Fig. 6 indicates that for MixUp and CutMix, smaller values of $\alpha$ yield better robustness; for Cutout, longer window lengths generally correlate with better robustness; for RA, a moderate transformation magnitude improves robustness, peaking at magnitude 8, highlighting that excessive augmentation is not always beneficial.
Table 3. The clean accuracy and robustness for various algorithms using ResNet-18 on CIFAR-10-LT. The best results are bolded.
| Method | Best: Clean | FGSM | PGD | CW | LSA | AA | Last: Clean | FGSM | PGD | CW | LSA | AA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AT [31] | 49.35 | 30.09 | 27.30 | 26.93 | 27.08 | 25.76 | 52.91 | 29.29 | 25.15 | 25.58 | 27.13 | 24.23 |
| TRADES [53] | 43.61 | 29.18 | 27.81 | 26.73 | 26.58 | 26.41 | 43.75 | 29.06 | 27.05 | 26.10 | 25.93 | 25.78 |
| MART [42] | 48.61 | 32.75 | 30.29 | 28.82 | 28.46 | 27.73 | 48.80 | 32.60 | 29.78 | 28.45 | 28.12 | 27.30 |
| AWP [45] | 49.29 | 33.78 | 31.20 | 30.53 | 30.36 | 29.53 | 47.75 | 32.77 | 30.83 | 30.01 | 29.68 | 29.12 |
| GAIRAT [54] | 50.83 | 30.20 | 27.46 | 21.65 | 21.23 | 20.41 | 50.66 | 28.44 | 25.60 | 19.68 | 19.22 | 18.26 |
| LAS-AT [20] | 52.81 | 33.35 | 30.32 | 29.57 | 29.15 | 28.53 | 53.50 | 33.14 | 30.09 | 29.13 | 28.84 | 28.30 |
| RoBal [46] | 70.34 | 40.50 | 35.93 | 31.05 | 31.10 | 29.54 | 70.00 | 36.18 | 29.00 | 27.67 | 26.98 | 25.63 |
| REAT [26] | 67.38 | 40.13 | 35.83 | 33.88 | 33.66 | 32.20 | 67.58 | 36.99 | 30.93 | 30.83 | 31.62 | 28.61 |
| AT-BSL | 68.89 | 40.08 | 35.27 | 33.47 | 33.46 | 31.78 | 67.63 | 35.20 | 28.65 | 28.91 | 31.35 | 26.97 |
| AT-BSL-RA | **70.86** | **43.06** | **37.94** | **36.24** | **36.04** | **34.24** | **71.83** | **42.62** | **37.15** | **35.37** | **35.50** | **33.44** |
Table 4. The clean accuracy and robustness for various algorithms using WideResNet-34-10 on CIFAR-10-LT. The best results are bolded.
| Method | Best: Clean | FGSM | PGD | CW | LSA | AA | Last: Clean | FGSM | PGD | CW | LSA | AA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AT [31] | 59.21 | 31.88 | 27.88 | 28.19 | 29.81 | 27.07 | 58.25 | 29.77 | 25.29 | 25.71 | 29.83 | 24.94 |
| TRADES [53] | 51.28 | 31.58 | 28.70 | 28.45 | 28.36 | 27.72 | 53.85 | 30.44 | 26.23 | 26.57 | 26.77 | 25.59 |
| MART [42] | 49.13 | 34.33 | 32.32 | 30.73 | 30.13 | 29.60 | 52.48 | 33.95 | 31.09 | 29.64 | 29.43 | 28.67 |
| AWP [45] | 50.91 | 34.28 | 31.85 | 31.23 | 31.01 | 30.06 | 48.65 | 33.21 | 31.07 | 30.33 | 30.14 | 29.40 |
| GAIRAT [54] | 59.89 | 33.47 | 30.40 | 26.69 | 26.71 | 25.38 | 56.37 | 29.41 | 27.25 | 23.94 | 23.95 | 23.15 |
| LAS-AT [20] | 57.52 | 33.66 | 29.86 | 29.60 | 29.44 | 28.84 | 58.19 | 32.98 | 28.89 | 28.75 | 28.58 | 27.90 |
| RoBal [46] | 72.82 | 41.34 | 36.42 | 32.48 | 31.95 | 30.49 | 70.85 | 35.95 | 27.74 | 27.59 | 26.76 | 25.71 |
| REAT [26] | 73.16 | 41.32 | 35.94 | 35.28 | 35.67 | 33.20 | 67.76 | 34.51 | 27.75 | 28.17 | 31.82 | 26.66 |
| AT-BSL | 73.19 | 41.84 | 35.60 | 34.86 | 35.99 | 32.80 | 65.95 | 33.29 | 27.23 | 27.87 | 31.00 | 26.45 |
| AT-BSL-AuA | **75.17** | **46.18** | **40.84** | **38.82** | **39.23** | **37.15** | **77.27** | **44.73** | **38.06** | **37.14** | **39.05** | **35.11** |
Figure 5. The class-wise example number and robustness under AA for various algorithms on CIFAR-10-LT at the best checkpoint. (a) ResNet-18; (b) WideResNet-34-10.
Table 5. The clean accuracy and robustness for AT-BSL with different augmentations using ResNet-18 on CIFAR-10-LT. The best results are bolded.
| Method | Clean | FGSM | PGD | CW | LSA | AA |
|---|---|---|---|---|---|---|
| Vanilla | 68.89 | 40.08 | 35.27 | 33.47 | 33.46 | 31.78 |
| MixUp [52] | 65.82 | 41.33 | 38.05 | 34.29 | 33.63 | 32.92 |
| Cutout [9] | 65.12 | 40.25 | 36.68 | 34.81 | 34.51 | 33.35 |
| CutMix [50] | 64.54 | 41.13 | 37.86 | 34.10 | 33.46 | 32.83 |
| AugMix [17] | 67.12 | 40.31 | 35.95 | 34.19 | 34.02 | 32.51 |
| TA [33] | 67.14 | 41.56 | 37.75 | 34.34 | 33.90 | 32.62 |
Effect of Hyperparameter $\tau_b$. To investigate the sensitivity of AT-BSL to $\tau_b$, we evaluate the performance of AT-BSL under varying $\tau_b$ values. Specifically, we utilize
Figure 6. The robustness under AA using ResNet-18 on CIFAR10-LT as we vary (a) the mixing rate α for MixUp, (b) the window length for Cutout, (c) the mixing rate α for CutMix, and (d) the magnitude of transformations for RA.
Figure 7. The robustness under AA for various algorithms with different $\tau_b$ using ResNet-18. (a) CIFAR-10-LT; (b) CIFAR-100-LT.
ResNet-18 with $\tau_b$ ranging from 0 to 20. Note that at $\tau_b = 0$, the bias $b_i = \tau_b \log(n_i)$ added by AT-BSL becomes zero, and the BSL loss reverts to the vanilla CE loss, transforming AT-BSL into vanilla AT [31]. The results, depicted in Fig. 7, reveal that on CIFAR-10-LT, AT-BSL is quite sensitive to $\tau_b$, achieving optimal robustness at $\tau_b = 1$. Conversely, on CIFAR-100-LT, AT-BSL shows less sensitivity to $\tau_b$. Additionally, across the tested datasets and $\tau_b$ values, AT-BSL with additional data augmentation consistently exhibits significantly higher robustness than vanilla AT-BSL, underscoring the substantial benefits of data augmentation in adversarial training under long-tailed distributions.
Effect of Imbalance Ratio. We further construct long-tailed datasets with varying IRs following the protocol of [8, 46] to evaluate the performance of our method. Table 6 illustrates that RA consistently improves the robustness of AT-BSL across various IR settings, further substantiating the finding that data augmentation can improve robustness.

Effect of PGD Step Size. To delve into the impact of the PGD step size on robustness, we reduce the PGD step size from 2/255 to 1/255 and 0.5/255, while correspondingly increasing the number of PGD steps from 10 to 20 and 40. As depicted in Table 7, it is evident that RA consistently improves the robustness of AT-BSL
Table 6. The clean accuracy and robustness for various algorithms using ResNet-18 on CIFAR-10-LT with different imbalance ratios. Better results are bolded.
| IR | Method | Clean | FGSM | PGD | CW | LSA | AA |
|---|---|---|---|---|---|---|---|
| 10 | AT-BSL | 73.29 | 47.33 | 42.04 | 40.77 | 41.05 | 39.12 |
| 10 | AT-BSL-RA | **79.00** | **50.98** | **44.19** | **42.82** | **43.10** | **40.56** |
| 20 | AT-BSL | 71.89 | 44.76 | 39.40 | 38.47 | 38.68 | 36.74 |
| 20 | AT-BSL-RA | **75.84** | **47.62** | **41.68** | **39.92** | **39.82** | **37.78** |
| 50 | AT-BSL | 68.89 | 40.08 | 35.27 | 33.47 | 33.46 | 31.78 |
| 50 | AT-BSL-RA | **70.86** | **43.06** | **37.94** | **36.24** | **36.04** | **34.24** |
| 100 | AT-BSL | 62.03 | 35.06 | 30.95 | 29.41 | 29.56 | 28.01 |
| 100 | AT-BSL-RA | **66.85** | **38.75** | **33.69** | **31.77** | **31.50** | **30.00** |
Table 7. The clean accuracy and robustness for various algorithms using ResNet-18 on CIFAR-10-LT, trained with different PGD step sizes. Better results are bolded.
| Size | Method | Clean | FGSM | PGD | CW | LSA | AA |
|---|---|---|---|---|---|---|---|
| 0.5 | AT-BSL | 68.57 | 39.65 | 35.10 | 32.92 | 32.97 | - |
| 0.5 | AT-BSL-RA | **68.68** | **41.97** | **37.60** | **34.81** | **34.36** | - |
| 1 | AT-BSL | 68.63 | 39.98 | 35.09 | 33.02 | 33.00 | - |
| 1 | AT-BSL-RA | **68.93** | **42.71** | **37.85** | **35.30** | **34.79** | - |
| 2 | AT-BSL | 68.89 | 40.08 | 35.27 | 33.47 | 33.46 | 31.78 |
| 2 | AT-BSL-RA | **70.86** | **43.06** | **37.94** | **36.24** | **36.04** | **34.24** |
regardless of the PGD step size. However, we also note a decrease in robustness compared to the baseline robustness at a PGD step size of 2/255.
6. Conclusion
In this paper, we first dissect the components of RoBal, identifying BSL as its critical component. We then address the issue of robust overfitting in adversarial training under long-tailed distributions and attempt to mitigate it using data augmentation. Surprisingly, we find that data augmentation not only mitigates robust overfitting but also significantly improves robustness. We hypothesize that the improved robustness is due to the increased example diversity brought about by data augmentation, and we validate this hypothesis through experiments. Finally, we conduct extensive experiments with different data augmentation strategies, model architectures, and datasets, affirming the generalizability of our findings. Our findings advance adversarial training a step further towards real-world scenarios.
Acknowledgements
This work was partially supported by the NSFC under Grants U20B2049, U21B2018, and 62302344, and the Fundamental Research Funds for the Central Universities, 2042023kf0122.
References
[57] Zhipeng Zhou, Lanqing Li, Peilin Zhao, Pheng-Ann Heng, and Wei Gong. Class-conditional sharpness-aware minimization for deep long-tailed recognition. In CVPR, 2023.
Revisiting Adversarial Training under Long-Tailed Distributions
Supplementary Material
A. Implementation Details of Experiments
A.1. Details of Table 1
All parameters in the experiments presented in Table 1 are consistent with those used in RoBal [46]. Specifically, the initial learning rate is set at 0.1, with a decay factor of 10 applied at the 60th and 75th epochs, for a total training duration of 80 epochs. An SGD momentum optimizer is employed with a weight decay of $2 \times 10^{-4}$. The batch size is maintained at 64. For adversarial training, we adopt a maximum perturbation of 8/255 and a step size of 2/255, with the number of iterations for internal maximization set at 5, corresponding to PGD-5. For CIFAR-10-LT, we utilize $m_0 = 0.1$, $s = 10$, $\tau_b = 1.5$, and $\tau_m = 0.3$; for CIFAR-100-LT, the parameters are set as $m_0 = 0.3$, $s = 10$, $\tau_b = 1.5$, and $\tau_m = 0.3$. The specific hyperparameters for each experiment are detailed in Table 8.
A.2. Details of Data Augmentations
Data augmentation techniques such as MixUp [52], Cutout [9], CutMix [50], AugMix [17], AutoAugment (AuA) [6], RandAugment (RA) [7], and TrivialAugment (TA) [33] are employed using the implementations provided in torchvision 0.16.0. Regarding the integration of data augmentation into the adversarial training pipeline, we follow the approach outlined in [35], whereby data augmentation precedes the generation of adversarial examples through adversarial attacks. It is observed that reversing this order, i.e., performing data augmentation after adversarial attacks, disrupts the adversarial perturbations, significantly diminishing the effectiveness of the adversarial attacks.
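A sketch of this ordering, reusing the illustrative `pgd_inner_max` helper from earlier; the point is simply that augmentation must precede the attack so the crafted perturbation survives intact:

```python
def make_training_batch(model, x, y, augment):
    """Augment first, then attack: perturbations are crafted on augmented inputs.
    Attacking first and augmenting afterwards would distort the perturbation."""
    x_aug = augment(x)  # e.g., RA / MixUp / Cutout, applied to the clean batch
    x_adv = pgd_inner_max(model, x_aug, y, eps=8/255, step_size=2/255, steps=10)
    return x_adv, y
```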
A.3. Code References
For the defense methods compared in our paper, we utilize the official code releases, including AT [31], TRADES [53], MART [42], AWP [45], GAIRAT [54], LAS-AT [20], RoBal [46], and REAT [26]. Regarding the attacks used for evaluation, we implement them by referring to several official code repositories and the original papers, encompassing FGSM [12], PGD [31], CW [3], and AutoAttack [5].
Figure 8. The class-wise robustness under AA for various algorithms on CIFAR-100-LT at the best checkpoint. (a) ResNet-18; (b) WideResNet-34-10.
B. Extensive Experiments
B.1. More Ablation Studies of RoBal
In addition to the experiments conducted with ResNet-18 and CIFAR-10-LT as presented in Table 1, we extend our ablation studies to WideResNet-34-10 and CIFAR-100-LT, as illustrated in Tables 9, 10, and 11. The data acquired from these additional experiments align with the conclusions drawn from Table 1, demonstrating that AT-BSL alone achieves performance comparable to the complete RoBal framework in terms of clean accuracy and robustness. Moreover, a significant advantage is observed regarding training time and memory consumption.
B.2. Experiments on CIFAR-100-LT
Tables 12 and 13 reveal that on CIFAR-100-LT, AT-BSL with data augmentation achieves the highest clean accuracy and adversarial robustness on both ResNet-18 and WideResNet-34-10. Compared to the improvement observed on CIFAR-10-LT, the improvements on CIFAR-100-LT are less pronounced, likely due to CIFAR-100-LT's larger number of classes and fewer examples per class, which make advancements more challenging.
In Fig. 8, we illustrate the robustness of different methods across each class. Given the extremely low robustness in most classes on CIFAR-100-LT and the presence of only 50 images per class in the test set, we report the average values for every 10 classes. Notably, AuA universally improves the robustness across all class groups.
B.3. Experiments on Tiny-ImageNet-LT
To see whether BSL and data augmentation are as important for higher-resolution datasets as they are for low-resolution datasets (such as CIFAR-10-LT and CIFAR-100-LT), we conduct experiments on Tiny-ImageNet [24]. Tiny-ImageNet is a dataset consisting of 200 classes, with images sized $64 \times 64$ pixels, four times the resolution of CIFAR-10/100. We derive Tiny-ImageNet-LT from Tiny-ImageNet using an IR of 0.1. Following [20, 25], we employ the PreActResNet-18 model [16]. Apart from the model, the experimental setup for Tiny-ImageNet-LT remains largely similar to that of CIFAR-10-LT. As observed from Table 14, both BSL and data augmentation prove to be crucial for Tiny-ImageNet-LT.
Table 8. The specific hyperparameters for each experiment following the integration of components from RoBal [46] into AT [31]. Cos: Cosine Classifier; BSL: Balanced Softmax Loss [36]; CM: Class-aware Margin [46]; TRADES: TRADES Regularization [53].
| Method | Cos | BSL | CM | TRADES | $m_0$, $s$ (CIFAR-10-LT) | $m_0$, $s$ (CIFAR-100-LT) |
|---|---|---|---|---|---|---|
| AT [31] | | | | | 0, 1 | 0, 1 |
| AT-BSL | | ✓ | | | 0, 1 | 0, 1 |
| AT-BSL-Cos | ✓ | ✓ | | | 0.1, 10 | 0.3, 10 |
| AT-BSL-Cos-TRADES | ✓ | ✓ | | ✓ | 0.1, 10 | 0.3, 10 |
| RoBal [46] | ✓ | ✓ | ✓ | ✓ | 0.1, 10 | 0.3, 10 |
Table 9. The clean accuracy, robustness, time (average per epoch), and memory (GPU) using WideResNet-34-10 [51] on CIFAR-10-LT following the integration of components from RoBal [46] into AT [31]. The best results are bolded. The second best results are underlined. Cos: Cosine Classifier; BSL: Balanced Softmax Loss [36]; CM: Class-aware Margin [46]; TRADES: TRADES Regularization [53].
| Method | Cos | BSL | CM | TRADES |
|---|---|---|---|---|
| AT [31] | | | | |
| AT-BSL | | ✓ | | |
| AT-BSL-Cos | ✓ | ✓ | | |
| AT-BSL-Cos-TRADES | ✓ | ✓ | | ✓ |
| RoBal [46] | ✓ | ✓ | ✓ | ✓ |

*(The accuracy and efficiency entries of Table 9 are not recoverable from the extraction.)*
B.4. Different PGD Steps
To investigate the impact of PGD steps on robustness, we assess the robustness achieved using different PGD steps following [46]. Table 15 indicates that RA consistently improves the robustness of AT-BSL regardless of the number of PGD steps, and the clean accuracy also experiences improvement. Moreover, there is a trade-off between clean accuracy and robustness: as the number of PGD steps increases, clean accuracy decreases while robustness improves. The optimal trade-off is attained at PGD-10. Hence, we employ PGD-10 in our experiments involving AT-BSL.