[论文翻译]FMix: 增强混合样本数据增强


原文地址:https://arxiv.org/pdf/2002.12047v3


FMix: Enhancing Mixed Sample Data Augmentation

FMix: 增强混合样本数据增强

Ethan Harris, Antonia Marcu, Matthew Painter, Mahesan Niranjan, Adam Prugel-Bennett, Jonathon Hare


Abstract—Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years, with many successful variants such as MixUp and CutMix. By studying the mutual information between the function learned by a VAE on the original data and on the augmented data we show that MixUp distorts learned functions in a way that CutMix does not. We further demonstrate this by showing that MixUp acts as a form of adversarial training, increasing robustness to attacks such as Deep Fool and Uniform Noise which produce examples similar to those generated by MixUp. We argue that this distortion prevents models from learning about sample specific features in the data, aiding generalisation performance. In contrast, we suggest that CutMix works more like a traditional augmentation, improving performance by preventing memorisation without distorting the data distribution. However, we argue that an MSDA which builds on CutMix to include masks of arbitrary shape, rather than just square, could further prevent memorisation whilst preserving the data distribution in the same way. To this end, we propose FMix, an MSDA that uses random binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. These random masks can take on a wide range of shapes and can be generated for use with one, two, and three dimensional data. FMix improves performance over MixUp and CutMix, without an increase in training time, for a number of models across a range of data sets and problem settings, obtaining a new single model state-of-the-art result on CIFAR-10 without external data. We show that FMix can outperform MixUp in sentiment classification tasks with one dimensional data, and provides an improvement over the baseline in three dimensional point cloud classification. Finally, we show that a consequence of the difference between interpolating MSDA such as MixUp and masking MSDA such as FMix is that the two can be combined to improve performance even further. Code for all experiments is provided at https://github.com/ecs-vlc/FMix.

摘要—混合样本数据增强(MSDA)近年来受到越来越多的关注,涌现了MixUp和CutMix等成功变体。通过研究VAE在原始数据和增强数据上学得函数的互信息,我们发现MixUp会扭曲学得函数,而CutMix则不会。进一步研究表明,MixUp实际是一种对抗训练形式,能提升模型对Deep Fool和Uniform Noise等攻击的鲁棒性——这些攻击生成的样本与MixUp生成样本相似。我们认为这种扭曲效应会阻止模型学习数据中的样本特异性特征,从而提升泛化性能。相比之下,CutMix更类似传统数据增强,通过防止过拟合来提升性能,同时不会扭曲数据分布。但我们指出,基于CutMix开发能生成任意形状(而非仅限于方形)掩码的MSDA方法,可以在保持数据分布的同时更好地防止过拟合。为此,我们提出FMix——这种MSDA方法通过对傅里叶空间采样的低频图像进行阈值处理来获取随机二值掩码。这些随机掩码可呈现多种形状,适用于一维、二维和三维数据。FMix在多个模型、数据集和问题设置中均优于MixUp和CutMix,且不增加训练时间,在不使用外部数据的情况下取得了CIFAR-10上新的单模型最优结果。我们证明FMix在一维情感分类任务中能超越MixUp,并在三维点云分类中优于基线方法。最后研究表明,插值类MSDA(如MixUp)与掩码类MSDA(如FMix)存在本质差异,二者结合可进一步提升性能。所有实验代码详见https://github.com/ecs-vlc/FMix

I. INTRODUCTION

I. 引言

Recently, a plethora of approaches to Mixed Sample Data Augmentation (MSDA) have been proposed which obtain state-of-the-art results, particularly in classification tasks [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. MSDA involves combining data samples according to some policy to create an augmented data set on which to train the model. The policies so far proposed can be broadly categorised as either combining samples with interpolation (e.g. MixUp) or masking (e.g. CutMix). Traditionally, augmentation is viewed through the framework of statistical learning as Vicinal Risk Minimisation (VRM) [12, 13]. Given some notion of the vicinity of a data point, VRM trains with vicinal samples in addition to the data points themselves. This is the motivation for MixUp [2]; to provide a new notion of vicinity based on mixing data samples. In the classical theory, validity of this technique relies on the strong assumption that the vicinal distribution precisely matches the true distribution of the data. As a result, the classical goal of augmentation is to maximally increase the data space, without changing the data distribution. Clearly, for all but the most simple augmentation strategies, the data distribution is in some way distorted. Furthermore, there may be practical implications to correcting this, as is demonstrated in Touvron et al. [14]. In light of this, three important questions arise regarding MSDA: What is a good measure of the similarity between the augmented and the original data? Why is MixUp so effective when the augmented data looks so different? If the data is distorted, what impact does this have on trained models?

近年来,涌现出大量混合样本数据增强(MSDA)方法并取得了最先进的结果,尤其在分类任务中表现突出[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]。MSDA通过特定策略组合数据样本来构建增强数据集用于模型训练。现有策略主要分为样本插值混合(如MixUp)和掩码混合(如CutMix)两类。传统增强框架基于统计学习中的邻域风险最小化(VRM)[12, 13],通过数据点及其邻域样本进行训练。MixUp[2]的提出正是为了建立基于样本混合的新型邻域关系。经典理论中,该技术有效性依赖于"邻域分布与真实数据分布完全匹配"这一强假设,因此传统增强目标是在不改变数据分布的前提下最大化扩展数据空间。显然,除最基础的增强策略外,多数方法都会以某种形式扭曲数据分布。如Touvron等人[14]所示,这种扭曲可能带来实际影响。由此引发关于MSDA的三个核心问题:如何有效衡量增强数据与原始数据的相似性?为何MixUp在生成明显失真数据时仍能高效?数据扭曲对训练模型究竟产生何种影响?

To construct a good measure of similarity, we note that the data only need be ‘perceived’ similar by the model. As such, we measure the mutual information between representations learned from the real and augmented data, thus characterising how well learning from the augmented data simulates learning from the real data. This measure shows the data-level distortion of MixUp by demonstrating that learned representations are compressed in comparison to those learned from the unaugmented data. To address the efficacy of MixUp, we look to the information bottleneck theory of deep learning [15]. By the data processing inequality, summarised as ‘post-processing cannot increase information’, deep networks can only discard information about the input with depth whilst preserving information about the targets. Tishby and Zaslavsky [15] assert that more efficient generalisation is achieved when each layer maximises the information it has about the target and minimises the information it has about the previous layer. Consequently, we posit that the distortion and subsequent compression induced by MixUp promotes generalisation. Another way to view this is that compression prevents the network from learning about highly sample-specific features in the data. Regarding the impact on trained models, and again armed with the knowledge that MixUp distorts learned functions, we show that MixUp acts as a kind of adversarial training [16], promoting robustness to additive noise. This accords with the theoretical result of Perrault-Archambault et al. [17] and the robustness results of Zhang et al. [2]. However, we further show that MSDA does not generally improve adversarial robustness when measured as a worst case accuracy following multiple attacks as suggested by Carlini et al. [18]. Ultimately, our adversarial robustness experiments show that the distortion in the data observed by our mutual information analysis corresponds to practical differences in learned function.

要构建一个良好的相似性度量,我们注意到数据只需被模型"感知"为相似。因此,我们测量从真实数据和增强数据学习到的表征之间的互信息,从而表征从增强数据学习对真实数据学习的模拟程度。该度量通过展示学习到的表征相较于未增强数据学习到的表征被压缩,揭示了MixUp在数据层面的失真。针对MixUp的有效性,我们参考深度学习的信息瓶颈理论[15]。根据数据处理不等式(可概括为"后处理无法增加信息"),深度网络在保留目标信息的同时只能随着深度增加丢弃输入信息。Tishby和Zaslavsky[15]断言,当每一层最大化其拥有的目标信息并最小化其拥有的前一层信息时,能实现更高效的泛化。因此,我们认为MixUp引起的失真和后续压缩促进了泛化。另一种理解是,压缩阻止网络学习数据中高度样本特定的特征。关于对训练模型的影响,并再次基于MixUp会扭曲学习函数这一认知,我们证明MixUp起到某种对抗训练[16]的作用,提升了对加性噪声的鲁棒性。这与Perrault-Archambault等[17]的理论结果及Zhang等[2]的鲁棒性结果一致。但我们进一步表明,当按照Carlini等[18]建议的采用多次攻击后最差情况准确率来衡量时,MSDA通常不会提升对抗鲁棒性。最终,我们的对抗鲁棒性实验表明,通过互信息分析观察到的数据失真与学习函数的实际差异相对应。

In contrast to our findings regarding MixUp, our mutual information analysis shows that CutMix causes learned models to retain a good knowledge of the real data, which we argue derives from the fact that individual features extracted by a convolutional model generally only derive from one of the mixed data points. This is further shown by our adversarial robustness results, where CutMix is not found to promote robustness in the same way. We therefore suggest that CutMix limits the ability of the model to over-fit by dramatically increasing the number of observable data points without distorting the data distribution, in keeping with the original intent of VRM. However, by restricting to only masking a square region, CutMix imposes some unnecessary limitations. First, the number of possible masks could be much greater if more mask shapes could be used. Second, it is likely that there is still some distortion since all of the images used during training will involve a square edge. It should be possible to construct an MSDA which uses masking similar to CutMix whilst increasing the data space much more dramatically. Motivated by this, we introduce FMix, a masking MSDA that uses binary masks obtained by applying a threshold to low frequency images sampled randomly from Fourier space. Using our mutual information measure, we show that learning with FMix simulates learning from the real data even better than CutMix. We subsequently demonstrate performance of FMix for a range of models and tasks against a series of augmented baselines and other MSDA approaches. FMix obtains a new single model state-of-the-art performance on CIFAR-10 [19] without external data and improves the performance of several state-of-the-art models (ResNet, SE-ResNeXt, DenseNet, WideResNet, PyramidNet, LSTM, and Bert) on a range of problems and modalities.

与我们在MixUp上的发现相反,互信息分析表明CutMix能使训练后的模型保持对真实数据的良好认知。我们认为这源于卷积模型提取的单个特征通常仅来自混合数据点中的一个。对抗鲁棒性结果进一步验证了这一点——CutMix并未以相同方式提升模型鲁棒性。因此我们提出,CutMix通过大幅增加可观测数据点数量(同时保持数据分布不变)来限制模型过拟合能力,这与VRM的原始意图一致。但仅采用方形区域掩码会带来不必要的限制:首先,若使用更多掩码形状,可能的掩码组合将大幅增加;其次,由于所有训练图像都包含方形边缘,仍可能存在一定失真。理论上可以构建一种类似CutMix掩码机制、但能更显著扩展数据空间的MSDA方法。基于此,我们提出FMix——通过对傅里叶空间随机采样的低频图像进行阈值处理来生成二值掩码。互信息测量显示,FMix比CutMix更能模拟真实数据的学习效果。后续实验证明,FMix在各类模型和任务中均优于增强基线及其他MSDA方法。在未使用外部数据的情况下,FMix在CIFAR-10[19]上创造了单模型最新性能记录,并在多种先进模型(ResNet、SE-ResNeXt、DenseNet、WideResNet、PyramidNet、LSTM和Bert)的多模态任务中提升了性能。

In light of our experimental results, we go on to suggest that the compressing qualities of MixUp are most desirable when data is limited and learning from individual examples is easier. In contrast, masking MSDAs such as FMix are most valuable when data is abundant. We suggest that there is no reason to see the desirable properties of masking and interpolation as mutually exclusive. In light of these observations, we plot the performance of MixUp, FMix, a baseline, and a hybrid policy where we alternate between batches of MixUp and FMix, as the number of CIFAR-10 training examples is reduced. This experiment confirms our above suggestions and shows that the hybrid policy can outperform both MixUp and FMix.

根据实验结果,我们进一步提出:当数据有限且更容易从单个样本中学习时,MixUp的压缩特性最为理想;相反,当数据充足时,FMix等掩码型混合样本数据增强(MSDA)方法最具价值。我们认为,没有必要将掩码和插值的优势特性视为互斥关系。基于这些观察,我们绘制了MixUp、FMix、基线方法以及混合策略(在MixUp和FMix批次间交替使用)在CIFAR-10训练样本数量递减时的性能曲线。该实验验证了上述观点,并表明混合策略能同时超越MixUp和FMix的表现。
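As a rough illustration of the hybrid policy described above, the sketch below alternates between the two MSDAs from batch to batch. `mixup_batch` and `fmix_batch` are hypothetical helpers (simple versions of the underlying mixing operations are sketched later in this article), each assumed to return the mixed inputs together with both targets and the mixing coefficient.

```python
def hybrid_msda_batch(batch_idx, x, y, alpha=1.0):
    """Alternate between interpolative and masking MSDA on a per-batch basis.

    `mixup_batch` and `fmix_batch` are hypothetical helpers, assumed to return
    (x_mixed, y1, y2, lam) for use with an interpolated cross-entropy loss.
    """
    if batch_idx % 2 == 0:
        return mixup_batch(x, y, alpha)   # even batches: MixUp (interpolation)
    return fmix_batch(x, y, alpha)        # odd batches: FMix (Fourier-sampled mask)
```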

II. RELATED WORK

II. 相关工作

In this section, we review the fundamentals of MSDA. Let $p_{X}(x)$ denote the input data distribution. In general, we can define MSDA for a given mixing function, $\operatorname*{mix}(X_{1},X_{2},\Lambda)$, where $X_{1}$ and $X_{2}$ are independent random variables on the data domain and $\Lambda$ is the mixing coefficient. Synthetic minority over-sampling [1], a predecessor to modern MSDA approaches, can be seen as a special case of the above where $X_{1}$ and $X_{2}$ are dependent, jointly sampled as nearest neighbours in feature space. These synthetic samples are drawn only from the minority class to be used in conjunction with the original data, addressing the problem of imbalanced data. The mixing function is linear interpolation, $\operatorname*{mix}(x_{1},x_{2},\lambda)=\lambda x_{1}+(1-\lambda)x_{2}$, and $p_{\Lambda}=\mathcal{U}(0,1)$. More recently, Zhang et al. [2], Tokozume et al. [3], and Inoue [5] concurrently proposed using this formulation (as MixUp, Between-Class (BC) learning, and sample pairing respectively) on the whole data set, although the choice of distribution for the mixing coefficients varies for each approach. Tokozume et al. [4] subsequently proposed $\mathrm{BC+}$, which uses a normalised variant of the mixing function. We refer to these approaches as interpolative MSDA, where, following Zhang et al. [2], we use the symmetric Beta distribution, that is $p_{\Lambda}=\mathrm{Beta}(\alpha,\alpha)$.

在本节中,我们回顾MSDA的基础知识。设$p_{X}(x)$表示输入数据分布。通常,我们可以为给定的混合函数$\operatorname*{mix}(X_{1},X_{2},\Lambda)$定义MSDA,其中$X_{1}$和$X_{2}$是数据域上的独立随机变量,$\Lambda$是混合系数。现代MSDA方法的前身——合成少数类过采样[1]可视为上述情况的特例,其中$X_{1}$和$X_{2}$是依赖的,作为特征空间中的最近邻联合采样。这些合成样本仅从少数类中抽取,与原始数据结合使用,以解决数据不平衡问题。混合函数是线性插值$\operatorname*{mix}(x_{1},x_{2},\lambda)=\lambda x_{1}+(1-\lambda)x_{2}$,且$p_{\Lambda}=\mathcal{U}(0,1)$。最近,Zhang等人[2]、Tokozume等人[3]和Inoue[5]同时提出在整个数据集上使用此公式(分别称为MixUp、类间(BC)学习和样本配对),尽管每种方法对混合系数分布的选择各不相同。Tokozume等人[4]随后提出$\mathrm{BC+}$,它使用混合函数的归一化变体。我们将这些方法称为插值MSDA,其中按照Zhang等人[2]的做法,我们使用对称Beta分布,即$p_{\Lambda}=\mathrm{Beta}(\alpha,\alpha)$。
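To make the interpolative formulation concrete, here is a minimal NumPy sketch of MixUp applied to a batch. The batch-pairing strategy (mixing each sample with a shuffled copy of the batch) and the handling of the two targets are our assumptions for illustration, not prescribed by the formulation above.

```python
import numpy as np

def mixup_batch(x, y, alpha=1.0):
    """Minimal MixUp sketch: mix a batch with a shuffled copy of itself.

    x: inputs of shape (batch, ...); y: integer class labels of shape (batch,).
    Returns the mixed inputs and both target arrays with the mixing coefficient.
    """
    lam = np.random.beta(alpha, alpha)            # lambda ~ Beta(alpha, alpha)
    perm = np.random.permutation(len(x))          # pair each sample with another
    x_mixed = lam * x + (1.0 - lam) * x[perm]     # mix(x1, x2, lambda)
    return x_mixed, y, y[perm], lam
```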

Recent variants adopt a binary masking approach [6, 7, 8]. Let $M=\mathrm{mask}(\Lambda)$ be a random variable with $\mathrm{mask}(\lambda)\in\{0,1\}^{n}$ and $\mu(\mathrm{mask}(\lambda))=\lambda$, that is, generated masks are binary with average value equal to the mixing coefficient. The mask mixing function is $\operatorname*{mix}(\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{m})=\mathbf{m}\odot\mathbf{x}_{1}+(1-\mathbf{m})\odot\mathbf{x}_{2}$, where $\odot$ denotes point-wise multiplication. A notable masking MSDA which motivates our approach is CutMix [6]. CutMix is designed for two dimensional data, with $\mathrm{mask}(\lambda)\in\{0,1\}^{w\times h}$, and uses $\mathrm{mask}(\lambda)=\mathrm{rand\_rect}(w\sqrt{1-\lambda},h\sqrt{1-\lambda})$, where $\mathrm{rand\_rect}(r_{w},r_{h})\in\{0,1\}^{w\times h}$ yields a binary mask with a shaded rectangular region of size $r_{w}\times r_{h}$ at a uniform random coordinate. CutMix improves upon the performance of MixUp on a range of experiments.

最近的变体采用了二值掩码方法 [6, 7, 8]。设 $M=\mathrm{mask}(\Lambda)$ 为随机变量,其中 $\mathrm{mask}(\lambda)\in\{0,1\}^{n}$ 且 $\mu(\mathrm{mask}(\lambda))=\lambda$,即生成的掩码为二值,其平均值等于混合系数。掩码混合函数为 $\operatorname*{mix}(\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{m})=\mathbf{m}\odot\mathbf{x}_{1}+(1-\mathbf{m})\odot\mathbf{x}_{2}$,其中 $\odot$ 表示逐点乘法。CutMix [6] 是一种值得注意的掩码 MSDA (混合样本数据增强) 方法,启发了我们的研究。CutMix 专为二维数据设计,其 $\mathrm{mask}(\lambda)\in\{0,1\}^{w\times h}$,并使用 $\mathrm{mask}(\lambda)=\mathrm{rand\_rect}(w\sqrt{1-\lambda},h\sqrt{1-\lambda})$,其中 $\mathrm{rand\_rect}(r_{w},r_{h})\in\{0,1\}^{w\times h}$ 生成一个二值掩码,其阴影矩形区域大小为 $r_{w}\times r_{h}$,位于均匀随机坐标。CutMix 在一系列实验中提升了 MixUp 的性能。
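The following is a simplified sketch of the CutMix-style mask. For simplicity we keep the rectangle entirely inside the image, whereas the original CutMix samples a centre uniformly and clips at the border, so the realised mixing ratio can differ slightly from $\lambda$.

```python
import numpy as np

def rand_rect_mask(w, h, lam):
    """Binary mask of shape (h, w) with mean approximately lam: ones everywhere
    except a rectangle of zeros covering a (1 - lam) fraction of the area."""
    rw = int(round(w * np.sqrt(1.0 - lam)))
    rh = int(round(h * np.sqrt(1.0 - lam)))
    x0 = np.random.randint(0, w - rw + 1)
    y0 = np.random.randint(0, h - rh + 1)
    mask = np.ones((h, w), dtype=np.float32)
    mask[y0:y0 + rh, x0:x0 + rw] = 0.0            # region filled from the second image
    return mask                                    # mixed = mask * x1 + (1 - mask) * x2
```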

In all MSDA approaches the targets are mixed in some fashion, typically to reflect the mixing of the inputs. For the typical case of classification with a cross entropy loss (and for all of the experiments in this work), the objective function is simply the interpolation between the cross entropy against each of the ground truth targets. It could be suggested that by mixing the targets differently, one might obtain better results. However, there are key observations from prior art which give us cause to doubt this supposition; in particular, Liang et al. [20] performed a number of experiments on the importance of the mixing ratio of the labels in MixUp. They concluded that when the targets are not mixed in the same proportion as the inputs the model can be regularised to the point of underfitting. However, despite this conclusion their results show only a mild performance change even in the extreme event that targets are mixed randomly, independent of the inputs. For these reasons, we focus only on the development of a better input mixing function for the remainder of the paper.

在所有MSDA方法中,目标会以某种方式混合,通常是为了反映输入的混合情况。对于使用交叉熵损失进行分类的典型情况(以及本文所有实验),目标函数只是针对每个真实目标交叉熵的插值。有人可能建议通过不同的目标混合方式获得更好结果,但现有技术的关键发现让我们对此假设产生怀疑:特别是Liang等人[20]对MixUp中标签混合比例的重要性进行了大量实验。他们得出结论,当目标未按与输入相同比例混合时,模型可能会被正则化到欠拟合的程度。但尽管如此,他们的结果显示即使目标完全随机混合(与输入无关)这种极端情况下,性能变化也较为轻微。基于这些原因,本文后续将仅专注于开发更好的输入混合函数。
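For reference, the interpolated cross-entropy objective described above amounts to the following PyTorch sketch (the function and argument names are ours):

```python
import torch.nn.functional as F

def mixed_cross_entropy(logits, y1, y2, lam):
    """Interpolation between the cross entropy against each ground-truth target."""
    return lam * F.cross_entropy(logits, y1) + (1.0 - lam) * F.cross_entropy(logits, y2)
```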

Attempts to explain the success of MSDAs were not only made when they were introduced, but also through subsequent empirical and theoretical studies. In addition to their experimentation with the targets, Liang et al. [20] argue that linear interpolation of inputs limits the memorisation ability of the network. Gontijo-Lopes et al. [21] propose two measures to explain the impact of augmentation on generalisation when jointly optimised: affinity and diversity. While the former captures the shift in the data distribution as perceived by the baseline model, the latter measures the training loss when learning with augmented data. A more mathematical view on MSDA was adopted by Guo et al. [22], who argue that MixUp regularises the model by constraining it outside the data manifold. They point out that this could lead to reducing the space of possible hypotheses, but could also lead to generated examples contradicting original ones, degrading quality. Upon Taylor-expanding the objective, Carratino et al.

试图解释MSDA成功的研究不仅在其提出时进行,也贯穿后续的实证与理论研究。除目标域实验外,Liang等人[20]指出输入线性插值会限制网络的记忆能力。Gontijo-Lopes等人[21]提出亲和度(affinity)与多样性(diversity)两个指标,用于解释联合优化时数据增强对泛化性能的影响:前者反映基线模型感知的数据分布偏移,后者衡量使用增强数据训练时的损失值。Guo等人[22]则从数学角度分析MSDA,认为MixUp通过将模型约束在数据流形之外实现正则化效果。他们指出这虽能缩减假设空间,但也可能导致生成样本与原始样本矛盾而降低质量。Carratino等人通过对目标函数进行泰勒展开...

TABLE I: Mutual information between VAE latent spaces $(Z_{A})$ and the CIFAR-10 test set $(I(Z_{A};X))$, and the CIFAR-10 test set as reconstructed by a baseline VAE $(I(Z_{A};\hat{X}))$, for VAEs trained with a range of MSDAs. MixUp prevents the model from learning about specific features in the data. Uncertainty estimates are the standard deviation following 5 trials.

表 1: VAE潜在空间 $(Z_{A})$ 与CIFAR-10测试集之间的互信息 $(I(Z_{A};X))$,以及与基线VAE重建的CIFAR-10测试集之间的互信息 $(I(Z_{A};\hat{X}))$(使用多种MSDA方法训练的VAE)。MixUp阻止模型学习数据中的特定特征。不确定度估计为5次试验后的标准差。

             $I(Z_A;X)$      $I(Z_A;\hat{X})$   MSE
Baseline     78.05±0.53      74.40±0.45         0.256±0.002
MixUp        70.38±0.90      68.58±1.12         0.288±0.003
CutMix       83.17±0.72      79.46±0.75         0.254±0.003

[23] motivate the success of MixUp by the co-action of four different regularisation factors. A similar analysis is carried out in parallel by Zhang et al. [24].

[23] 将 MixUp 的成功归因于四种不同正则化因素的协同作用。Zhang 等人 [24] 也进行了类似的分析。

Following Zhang et al. [2], He et al. [25] take a statistical learning view of MSDA, basing their study on the observation that MSDA distorts the data distribution and thus does not perform VRM in the traditional sense. They subsequently propose separating features into ‘minor’ and ‘major’, where a feature is referred to as ‘minor’ if it is highly sample-specific. Augmentations that significantly affect the distribution are said to make the model predominantly learn from ‘major’ features. From an information theoretic perspective, ignoring these ‘minor’ features corresponds to increased compression of the input by the model. Although He et al. [25] noted the importance of characterising the effect of data augmentation from an information perspective, they did not explore any measures that do so. Instead, He et al. [25] analysed the variance in the learned representations. This is analogous to the entropy of the representation since entropy can be estimated via the pairwise distances between samples, with higher distances corresponding to both greater entropy and variance [26]. In proposing Manifold MixUp, Verma et al. [27] additionally suggest that MixUp works by increasing compression. The authors compute the singular values of the representations in early layers of trained networks, with smaller singular values again corresponding to lower entropy. A potential issue with these approaches is that the entropy of the representation is only an upper bound on the information that the representation has about the input.

遵循Zhang等人[2]的研究思路,He等人[25]从统计学习视角分析混合样本数据增强(MSDA),基于MSDA会扭曲数据分布、因而无法进行传统意义上的邻域风险最小化(VRM)这一观察展开研究。他们随后提出将特征划分为"次要"和"主要"两类——若某特征具有高度样本特异性则被归为"次要特征"。那些显著影响数据分布的增强操作会使模型主要从"主要特征"中学习。从信息论角度看,忽略这些"次要特征"对应着模型对输入信息的更高程度压缩。虽然He等人[25]指出了从信息角度量化数据增强效果的重要性,但并未探索具体测量方法,而是分析了学习表征的方差。这与表征熵的概念相通,因为熵可以通过样本间成对距离来估算,距离越大对应熵和方差也越高[26]。Verma等人在提出流形混合(Manifold MixUp)时[27]也指出MixUp通过增强压缩发挥作用,作者计算了训练网络浅层表征的奇异值,较小的奇异值同样对应较低熵值。这些方法存在的潜在问题是:表征熵仅是表征所含输入信息量的上限。

III. ANALYSIS

III. 分析

We now analyse both interpolative and masking MSDAs with a view to distinguishing their impact on learned representations and finding answers to the questions established in our introduction. We first desire a measure which captures the extent to which learning about the augmented data simulates learning about the original data. We propose training unsupervised models on real data and augmented data and then measuring the mutual information, the reduction in uncertainty about one variable given knowledge of another, between the representations they learn. To achieve this, we use Variational Auto-Encoders (VAEs) [29], which provide a rich depiction of the salient or compressible information in the data [30]. Note that we do not expect these representations to directly relate to those of trained classifiers. Our requirement is a probabilistic model of the data, for which a VAE is well suited.

我们现在分析插值(interpolative)和掩码(masking)两类混合样本数据增强方法(MSDA),旨在区分它们对学习表征的影响,并回答引言中提出的问题。首先需要一种衡量标准,用于评估增强数据的学习过程在多大程度上模拟了原始数据的学习。我们提出在真实数据和增强数据上训练无监督模型,然后测量它们学习到的表征之间的互信息(mutual information)——即已知一个变量后对另一个变量不确定性的减少量。为此,我们使用变分自编码器(VAEs)[29],它能丰富地刻画数据中的显著或可压缩信息[30]。需要说明的是,这些表征与训练分类器的表征未必直接相关,我们只需要数据的概率模型,而VAE正适合此用途。


Fig. 1: Robustness of PreAct-ResNet18 models trained on CIFAR-10 with standard augmentations (Baseline) and the addition of MixUp and CutMix to the Fast Gradient Sign Method (FGSM), simple gradient descent, projected gradient descent, uniform noise, DeepFool [28], and the worst case performance after all attacks. MixUp improves robustness to adversarial examples with similar properties to images generated with MixUp (acting as adversarial training), but MSDA does not improve robustness in general. Shaded region indicates the standard deviation following 5 repeats.

图 1: 在CIFAR-10上训练的PreAct-ResNet18模型使用标准增强(Baseline)及添加MixUp和CutMix后对Fast Gradient Sign Method(FGSM)、简单梯度下降、投影梯度下降、均匀噪声、DeepFool [28]以及所有攻击后最差情况的鲁棒性表现。MixUp提升了对抗样本的鲁棒性(这些样本与MixUp生成的图像具有相似特性,起到对抗训练的作用),但MSDA总体上并未提升鲁棒性。阴影区域表示5次重复实验的标准差。

We wish to estimate the mutual information between the representation learned by a VAE from the original data set, $Z_{X}$, and the representation learned from some augmented data set, $Z_{A}$, written $I\left(Z_{X};Z_{A}\right)=\mathbb{E}_{Z_{X}}\left[D\left(p_{(Z_{A}\mid Z_{X})}\,\|\,p_{Z_{A}}\right)\right]$. This quantity acts as a good measure of the similarity between the augmented and the original data since it captures only the similarity between learnable or salient features. VAEs comprise an encoder, $p_{(Z\mid X)}$, and a decoder, $p_{(X\mid Z)}$. We impose a Normal prior on $Z$, and train the model to maximise the Evidence Lower Bound (ELBO) objective

我们希望估计变分自编码器(VAE)从原始数据集学习到的表示$Z_{X}$与从某些增强数据集学习到的表示$Z_{A}$之间的互信息,记为$I\left(Z_{X};Z_{A}\right)=\mathbb{E}_{Z_{X}}\left[D\left(p_{(Z_{A}\mid Z_{X})}\,\|\,p_{Z_{A}}\right)\right]$。该量能很好地衡量增强数据与原始数据之间的相似性,因为它仅捕获可学习或显著特征之间的相似性。VAE包含一个编码器$p_{(Z\mid X)}$和一个解码器$p_{(X\mid Z)}$。我们对$Z$施加正态先验,并训练模型以最大化证据下界(ELBO)目标:

$$
\mathcal{L}=\mathbb{E}_{X}\Big[\mathbb{E}_{Z\mid X}\big[\log p_{(X\mid Z)}\big]-D\big(p_{(Z\mid X)}\,\big\|\,\mathcal{N}(\mathbf{0},I)\big)\Big].
$$


Denoting the outputs of the decoder of the VAE trained on the augmentation as $\hat{X}=\mathrm{decode}(Z_{X})$, and by the data processing inequality, we have $I(Z_{A};\hat{X})\leq I(Z_{A};Z_{X})$ with equality when the decoder retains all of the information in $Z$. Now, we need only observe that we already have a model of $p_{(Z_{A}\mid X)}$, the encoder trained on the augmented data. Estimating the marginal $p_{Z_{A}}$ presents a challenge as it is a Gaussian mixture. However, we can measure an alternative form of the mutual information that is equivalent up to an additive constant, and for which the divergence has a closed form solution, with

将经过增强训练的VAE解码器输出记为 $\hat{X}=\mathrm{decode}(Z_{X})$,根据数据处理不等式可得 $I(Z_{A};\hat{X})\leq I(Z_{A};Z_{X})$,当解码器保留 $Z$ 中全部信息时取等号。此时只需注意到我们已拥有增强数据训练的编码器 $p_{(Z_{A}\mid X)}$ 模型。估计边缘分布 $p_{Z_{A}}$ 存在挑战,因其属于高斯混合分布。不过我们可以测量另一种形式的互信息,该形式在加法常数范围内等价,且其散度具有闭式解:

$$
\mathbb{E}_{\hat{X}}\big[D\big(p_{(Z_{A}\mid\hat{X})}\,\big\|\,p_{Z_{A}}\big)\big]=\mathbb{E}_{\hat{X}}\big[D\big(p_{(Z_{A}\mid\hat{X})}\,\big\|\,\mathcal{N}(\mathbf{0},I)\big)\big]-D\big(p_{Z_{A}}\,\big\|\,\mathcal{N}(\mathbf{0},I)\big).
$$


The above holds for any choice of distribution that does not depend on $X$. Conceptually, this states that we will always lose more information on average if we approximate $p_{(Z_{A}\mid\hat{X})}$ with any constant distribution other than the marginal $p_{Z_{A}}$. Additionally note that we implicitly minimise $D\big(p_{Z_{A}}\,\|\,\mathcal{N}(\mathbf{0},I)\big)$ during training of the VAE [31]. In light of this fact, we can write $I(Z_{A};\hat{X})\approx\mathbb{E}_{\hat{X}}\big[D\big(p_{(Z_{A}\mid\hat{X})}\,\|\,\mathcal{N}(\mathbf{0},I)\big)\big]$. We can now easily obtain a helpful upper bound of $I(Z_{A};Z_{X})$ such that it is bounded on both sides. Since $Z_{A}$ is just a function of $X$, again by the data processing inequality, we have $I(Z_{A};X)\ge I(Z_{A};Z_{X})$. This is easy to compute since it is just the relative entropy term from the ELBO objective.

上述结论适用于任何不依赖于$X$的分布选择。从概念上讲,这表明如果我们用任何不同于边缘分布$p_{Z_{A}}$的恒定分布来近似$p_{(Z_{A}\mid\hat{X})}$,平均而言总会损失更多信息。此外需注意,在训练VAE[31]时,我们隐式地最小化了$D\big(p_{Z_{A}}\,\|\,\mathcal{N}(\mathbf{0},I)\big)$。基于这一事实,可以写出$I(Z_{A};\hat{X})\approx\mathbb{E}_{\hat{X}}\big[D\big(p_{(Z_{A}\mid\hat{X})}\,\|\,\mathcal{N}(\mathbf{0},I)\big)\big]$。现在我们可以轻松得到一个实用的$I(Z_{A};Z_{X})$上界,使其具有双边约束。由于$Z_{A}$仅是$X$的函数,再次应用数据处理不等式可得$I(Z_{A};X)\ge I(Z_{A};Z_{X})$。这个值很容易计算,因为它就是ELBO目标函数中的相对熵项。

To summarise, we can compute our measure by first training two VAEs, one on the original data and one on the augmented data. We then generate reconstructions of data points in the original data with one VAE and encode them in the other. We now compute the expected value of the relative entropy between the encoded distribution and an estimate of the marginal to obtain an estimate of a lower bound of the mutual information between the representations. We then recompute this using real data points instead of reconstructions to obtain an upper bound. Table I gives these quantities for MixUp, CutMix, and a baseline. The results show that MixUp consistently reduces the amount of information that is learned about the original data. In contrast, CutMix manages to induce greater mutual information with the data than is obtained from just training on the un-augmented data. Crucially, the results present concrete evidence that interpolative MSDA differs fundamentally from masking MSDA in how it impacts learned representations.

总结来说,我们可以通过先训练两个VAE(一个在原始数据上,一个在增强数据上)来计算我们的度量。然后,我们用一个VAE生成原始数据点的重建,并在另一个VAE中对它们进行编码。现在我们计算编码分布与边缘估计之间的相对熵的期望值,以获得表示之间互信息下界的估计。接着,我们使用真实数据点而非重建重新计算,以获得上界。表1给出了MixUp、CutMix和基线的这些量值。结果表明,MixUp持续减少了关于原始数据的学习信息量。相比之下,CutMix能够诱导出比仅使用未增强数据训练获得的更大的互信息。关键的是,这些结果提供了具体证据,表明插值式MSDA与掩码式MSDA在学习表示的影响方式上存在根本差异。
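A sketch of how the two bounds could be computed in practice is given below. It assumes two trained VAEs exposing a hypothetical `encode(x) -> (mu, logvar)` / `decode(z)` interface with diagonal-Gaussian posteriors, so the KL term from the ELBO is available in closed form; following the Table I caption, reconstructions are produced by the VAE trained on the original data.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1)

@torch.no_grad()
def mi_bounds(vae_orig, vae_aug, loader, device="cpu"):
    """Estimate the lower bound I(Z_A; X_hat) and the upper bound I(Z_A; X)."""
    lower, upper, n = 0.0, 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        mu_x, _ = vae_orig.encode(x)            # Z_X from the VAE trained on real data
        x_hat = vae_orig.decode(mu_x)           # X_hat = decode(Z_X)
        mu_l, lv_l = vae_aug.encode(x_hat)      # encode reconstructions with the other VAE
        mu_u, lv_u = vae_aug.encode(x)          # encode the real data points
        lower += kl_to_standard_normal(mu_l, lv_l).sum().item()
        upper += kl_to_standard_normal(mu_u, lv_u).sum().item()
        n += x.size(0)
    return lower / n, upper / n                 # approx. I(Z_A; X_hat) and I(Z_A; X)
```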

Having shown this is true for VAEs, we now wish to understand whether the finding also holds for trained classifiers. To this end, we analysed the decisions made by a classifier using Gradient-weighted Class Activation Maps (Grad-CAMs) [32]. Grad-CAM finds the regions in an image that contribute the most to the network’s prediction by taking the derivative of the model’s output with respect to the activation maps and weighting them according to their contribution. If MixUp prevents the network from learning about highly specific features in the data we would expect more of the early features to contribute to the network output. It would be difficult to ascertain qualitatively whether this is the case. Instead, we compute the average sum of Grad-CAM heatmaps over the CIFAR-10 test set for 5 repeats (independently trained PreAct-ResNet18 models). We obtain the following scores: baseline - $146{\scriptstyle\pm5}$, MixUp - $162{\scriptstyle\pm3}$, CutMix - $131{\scriptstyle\pm6}$. The result suggests that more of the early features contribute to the decisions made by MixUp trained models and that this result is consistent across independent runs.

在证明了变分自编码器(VAE)中存在这一现象后,我们进一步探究该发现是否也适用于经过训练的分类器。为此,我们使用梯度加权类激活映射(Grad-CAMs) [32] 分析了分类器的决策过程。Grad-CAM通过计算模型输出对激活映射的导数,并按其贡献度加权,从而找出图像中对网络预测影响最大的区域。若MixUp能阻止网络学习数据中高度特定的特征,我们预期会有更多早期特征参与网络输出。定性判断这一现象存在难度,因此我们针对CIFAR-10测试集计算了5次独立训练的PreAct-ResNet18模型的Grad-CAM热力图平均总和,获得以下数值:基线模型 - $146{\scriptstyle\pm5}$,MixUp - $162{\scriptstyle\pm3}$,CutMix - $131{\scriptstyle\pm6}$。结果表明,经过MixUp训练的模型决策过程中有更多早期特征参与贡献,且该结果在不同独立实验间具有一致性。
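The Grad-CAM statistic reported above could be computed along the following lines (a sketch using plain PyTorch hooks; the choice of target layer, the use of the predicted class, and the absence of upsampling or normalisation are our assumptions, not details given in the text):

```python
import torch
import torch.nn.functional as F

def gradcam_heatmap_sums(model, target_layer, images):
    """For each image, the sum over the (ReLU'd) Grad-CAM heatmap of the predicted class."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    logits = model(images)
    score = logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum()
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # average-pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))        # weighted sum of activation maps
    return cam.flatten(1).sum(dim=1)                      # one scalar per image
```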

Having established that MixUp distorts learned functions, we now seek to answer the third question from our introduction by determining the impact of data distortion on trained classifiers. Since it is our assessment that models trained with MixUp have an altered ‘perception’ of the data distribution, we suggest an analysis based on adversarial attacks, which involve perturbing images outside of the perceived data distribution to alter the given classification. We perform fast gradient sign method, standard gradient descent, projected gradient descent, additive uniform noise, and DeepFool [28] attacks over the whole CIFAR-10 test set on PreAct-ResNet18 models subject to $\ell_{\infty}$ constraints using the Foolbox library [33, 34]. The plots for the additive uniform noise and DeepFool attacks, given in Figure 1, show that MixUp provides an improvement over CutMix and the augmented baseline in this setting. This is because MixUp acts as a form of adversarial training [16], equipping the models with valid classifications for images of a similar nature to those generated by the additive noise and DeepFool attacks. In Figure 1, we additionally plot the worst case robustness following all attacks as suggested by Carlini et al. [18]. These results show that the adversarial training effect of MixUp is limited and does not correspond to a general increase in robustness. The key observation regarding these results is that there may be practical consequences to training with MixUp that are present but to a lesser degree when training with CutMix. There may be value to creating a new MSDA that goes even further than CutMix to minimise these practical consequences.

在确认MixUp会扭曲学习到的函数后,我们试图通过研究数据扭曲对训练分类器的影响来回答引言中的第三个问题。基于"使用MixUp训练的模型对数据分布的'感知'发生了改变"这一判断,我们建议采用对抗性攻击进行分析——这类攻击通过扰动感知分布之外的图像来改变既定分类。我们使用Foolbox库[33,34]对PreAct-ResNet18模型进行全CIFAR-10测试集的快速梯度符号法、标准梯度下降、投影梯度下降、加性均匀噪声和DeepFool[28]攻击,并施加$\ell_{\infty}$约束。图1所示的加性均匀噪声和DeepFool攻击曲线表明,在此设置下MixUp优于CutMix和增强基线。这是因为MixUp作为一种对抗训练形式[16],使模型能对加性噪声和DeepFool攻击生成的同类图像进行有效分类。图1中还按照Carlini等人[18]的建议绘制了所有攻击后的最差鲁棒性曲线。这些结果表明MixUp的对抗训练效果有限,并未带来鲁棒性的普遍提升。关键发现在于:使用MixUp训练可能产生实际影响,而CutMix训练时这些影响程度较轻。开发比CutMix更进一步的新型混合样本数据增强(MSDA)方法来最小化这些实际影响可能具有价值。

IV. FMIX: IMPROVED MASKING

IV. FMIX: 改进的掩码策略

Our principal finding is that the masking MSDA approach works because it effectively preserves the data distribution in a way that interpolative MSDAs do not, particularly in the perceptual space of a Convolutional Neural Network (CNN). This derives from the fact that each convolutional neuron at a particular spatial position generally encodes information from only one of the inputs at a time. This could also be viewed as local consistency in the sense that elements that are close to each other in space typically derive from the same data point. To the detriment of CutMix, the number of possible examples is limited by only using square masks. In this section we propose FMix, a masking MSDA which maximises the number of possible masks whilst preserving local consistency. For local consistency, we require masks that are predominantly made up of a single shape or contiguous region. We might think of this as trying to minimise the number of times the binary mask transitions from ‘0’ to ‘1’ or vice-versa. For our approach, we begin by sampling a low frequency greyscale mask from Fourier space which can then be converted to binary with a threshold. We will first detail our approach for obtaining the low frequency image before discussing our approach for choosing the threshold. Let $Z$ denote a complex random variable with values on the domain $\mathcal{Z}=\mathbb{C}^{w\times h}$, with density $p_{\Re(Z)}=\mathcal{N}(\mathbf{0},I_{wh})$ and $p_{\mathfrak{I}(Z)}=\mathcal{N}(\mathbf{0},I_{wh})$, where $\Re$ and $\mathfrak{I}$ return the real and imaginary parts of their input respectively. Let $\operatorname{freq}(w,h)[i,j]$ denote the magnitude of the sample frequency corresponding to the $i,j$'th bin of the $w\times h$ discrete Fourier transform. We can apply a low pass filter to $Z$ by decaying its high frequency components. Specifically, for a given decay power $\delta$, we use

我们的核心发现是:掩码式混合样本数据增强(MSDA)方法之所以有效,是因为它能以插值式MSDA无法实现的方式有效保持数据分布特性,尤其在卷积神经网络(CNN)的感知空间中。这是因为特定空间位置的每个卷积神经元通常每次仅编码一个输入源的信息。从局部一致性的角度来看,空间相邻的元素通常源自同一数据点。CutMix的局限性在于仅使用方形掩码限制了可能的样本数量。本节提出FMix——一种在保持局部一致性的同时最大化掩码可能性的掩码式MSDA方法。对于局部一致性,我们要求掩码主要由单一形状或连续区域构成,这可理解为尽量减少二值掩码在'0'与'1'之间切换的次数。我们的方法首先生成傅里叶空间的低频灰度掩码,再通过阈值二值化。在讨论阈值选择前,先详述低频图像生成方法:设$Z$为定义在$\mathcal{Z}=\mathbb{C}^{w\times h}$上的复随机变量,其实部密度为$p_{\Re(Z)}=\mathcal{N}(\mathbf{0},I_{wh})$,虚部密度为$p_{\mathfrak{I}(Z)}=\mathcal{N}(\mathbf{0},I_{wh})$,其中$\Re$与$\mathfrak{I}$分别返回输入的实部和虚部。设$\operatorname{freq}(w,h)[i,j]$表示对应$w\times h$离散傅里叶变换第$i,j$个频段的频率幅值。通过衰减高频分量可实现低通滤波。具体而言,对给定的衰减功率$\delta$,我们采用


Fig. 2: Example masks and mixed images from CIFAR-10 for FMix with $\delta=3$ and $\lambda=0.5$ .

图 2: 使用 $\delta=3$ 和 $\lambda=0.5$ 参数的 FMix 方法在 CIFAR-10 数据集上生成的掩膜示例与混合图像。

$$
\mathrm{filter}({\bf z},\delta)[i,j]=\frac{{\bf z}[i,j]}{\mathrm{freq}(w,h)\left[i,j\right]^{\delta}}.
$$


Defining ${\mathcal{F}}^{-1}$ as the inverse discrete Fourier transform, we can obtain a grey-scale image with

定义 ${\mathcal{F}}^{-1}$ 为离散傅里叶逆变换,我们可以得到一个灰度图像

$$
G=\Re\big(\mathcal{F}^{-1}\big(\operatorname{filter}\big(Z,\delta\big)\big)\big)~.
$$


All that now remains is to convert the grey-scale image to a binary mask such that the mean value is some given $\lambda$. Let $\mathrm{top}(n,\mathbf{x})$ return a set containing the top $n$ elements of the input $\mathbf{x}$. Setting the top $\lambda w h$ elements of some grey-scale image $\mathbf{g}$ to have value ‘1’ and all others to have value ‘0’, we obtain a binary mask with mean $\lambda$. Specifically, we have the thresholding rule written out below.

现在剩下的就是将灰度图像转换为二值掩码 (binary mask),使其平均值为给定的 $\lambda$。设 $\mathrm{top}(n,\mathbf{x})$ 返回输入 $\mathbf{x}$ 中前 $n$ 个元素的集合。将灰度图像 $\mathbf{g}$ 中取值最大的 $\lambda w h$ 个元素设为 '1',其余元素设为 '0',即可得到平均值为 $\lambda$ 的二值掩码。具体的阈值化规则如下式所示。
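One way to write the thresholding rule implied above (our reconstruction from the definitions of $\mathrm{top}$, $\lambda$, $w$, and $h$; ties at the threshold are handled informally) is:

$$
\mathrm{mask}(\lambda,\mathbf{g})[i,j]=\begin{cases}1, & \text{if } \mathbf{g}[i,j]\in\mathrm{top}(\lambda w h,\,\mathbf{g}),\\ 0, & \text{otherwise.}\end{cases}
$$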

To recap, we first sample a random complex tensor for which both the real and imaginary part are independent and Gaussian. We then scale each component according to its frequency via the parameter $\delta$ such that higher values of $\delta$ correspond to increased decay of high frequency information. Next, we perform an inverse Fourier transform on the complex tensor and take the real part to obtain a grey-scale image. Finally, we set the top proportion of the image to have value ‘1’ and the rest to have value ‘0’ to obtain our binary mask. Although we have only considered two dimensional data here it is generally possible to create masks with any number of dimensions. We provide some example two dimensional masks and mixed images (with $\delta=3$ and $\lambda=0.5$) in Figure 2. We can see that the space of artefacts is significantly increased; furthermore, FMix achieves $I(Z_{A};X)=83.67\pm0.89$, $I(Z_{A};\hat{X})=80.28\pm0.75$, and $\mathrm{MSE}=0.255\pm0.003$, showing that learning from FMix simulates learning from the un-augmented data to an even greater extent than CutMix.

总结来说,我们首先对一个随机复数张量进行采样,其实部和虚部均为独立高斯分布。随后通过参数 $\delta$ 根据频率对各分量进行缩放,使得 $\delta$ 值越大对应高频信息的衰减越强。接着,我们对复数张量执行逆傅里叶变换并取实部得到灰度图像。最后,我们将取值最大的一部分(比例为 $\lambda$)像素设为'1',其余设为'0',从而获得二值掩码。虽然这里仅考虑二维数据,但通常可以创建任意维度的掩码。图2展示了一些二维掩码和混合图像示例(参数为 $\delta=3$ 和 $\lambda=0.5$)。可以看出,伪影空间显著扩大;此外FMix实现了 $I(Z_{A};X)=83.67\pm0.89$、$I(Z_{A};\hat{X})=80.28\pm0.75$ 和 $\mathrm{MSE}=0.255\pm0.003$,表明从FMix学习比CutMix更能模拟未增强数据的学习效果。
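Putting the steps together, a minimal NumPy sketch of 2D FMix mask sampling might look as follows. This is our reading of the recipe above; the official implementation (linked in the abstract) differs in details such as how the sample frequencies, the DC component, and ties at the threshold are handled.

```python
import numpy as np

def fmix_mask(w, h, lam, delta=3.0):
    """Sample a binary (h, w) mask with mean ~lam by thresholding a low-frequency image."""
    # complex Gaussian noise in Fourier space: independent real and imaginary parts
    z = np.random.randn(h, w) + 1j * np.random.randn(h, w)
    # magnitude of the sample frequency for every bin of the h x w DFT
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    freq[0, 0] = freq[freq > 0].min()          # avoid dividing by zero at the DC bin
    # decay high frequencies, invert the transform, keep the real part
    g = np.real(np.fft.ifft2(z / freq ** delta))
    # binarise: the lam * w * h largest-valued pixels become '1', the rest '0'
    k = int(round(lam * w * h))
    if k == 0:
        return np.zeros((h, w), dtype=np.float32)
    threshold = np.sort(g, axis=None)[-k]
    return (g >= threshold).astype(np.float32)
```

A mixed image is then obtained exactly as for CutMix, `mask * x1 + (1 - mask) * x2`, with the targets interpolated by $\lambda$.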

V. EXPERIMENTS

V. 实验

We now perform a series of experiments to compare the performance of FMix with that of MixUp, CutMix, and augmented baselines. For each problem setting and data set, we provide exposition on the results and any relevant caveats.

我们现在进行一系列实验来比较FMix与MixUp、CutMix以及增强基线的性能。针对每个问题设置和数据集,我们将展示结果并说明相关注意事项。


Fig. 3: Ablation study showing the performance of a PreAct-ResNet18 trained on CIFAR-10 with FMix. Performance increases with decay power u