Self-training with Noisy Student improves ImageNet classification
Qizhe Xie∗1, Minh-Thang Luong1, Eduard Hovy2, Quoc V. Le1
1Google Research, Brain Team, 2Carnegie Mellon University
{qizhex, thangluong, qvl}@google.com, hovy@cmu.edu
Abstract
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.
Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher.
1. Introduction
Deep learning has shown remarkable successes in image recognition in recent years [45, 80, 75, 30, 83]. However, state-of-the-art (SOTA) vision models are still trained with supervised learning, which requires a large corpus of labeled images to work well. By showing the models only labeled images, we limit ourselves from making use of unlabeled images, available in much larger quantities, to improve the accuracy and robustness of SOTA models.
Here, we use unlabeled images to improve the SOTA ImageNet accuracy and show that the accuracy gain has an outsized impact on robustness (out-of-distribution generalization). For this purpose, we use a much larger corpus of unlabeled images, where a large fraction of images do not belong to the ImageNet training set distribution (i.e., they do not belong to any category in ImageNet). We train our model with Noisy Student Training, a semi-supervised learning approach, which has three main steps: (1) train a teacher model on labeled images, (2) use the teacher to generate pseudo labels on unlabeled images, and (3) train a student model on the combination of labeled images and pseudo labeled images. We iterate this algorithm a few times by treating the student as a teacher to relabel the unlabeled data and training a new student.
Noisy Student Training improves self-training and distillation in two ways. First, it makes the student larger than, or at least equal in size to, the teacher so the student can better learn from a larger dataset. Second, it adds noise to the student so the noised student is forced to learn harder from the pseudo labels. To noise the student, we use input noise such as RandAugment data augmentation [18] and model noise such as dropout [76] and stochastic depth [37] during training.
Using Noisy Student Training, together with 300M unlabeled images, we improve EfficientNet's [83] ImageNet top-1 accuracy to 88.4%. This accuracy is 2.0% better than the previous SOTA result, which requires 3.5B weakly labeled Instagram images. Not only does our method improve standard ImageNet accuracy, it also improves classification robustness on much harder test sets by large margins: ImageNet-A [32] top-1 accuracy from 61.0% to 83.7%, ImageNet-C [31] mean corruption error (mCE) from 45.7 to 28.3, and ImageNet-P [31] mean flip rate (mFR) from 27.8 to 12.2. Our main results are shown in Table 1.
| | ImageNet top-1 acc. | ImageNet-A top-1 acc. | ImageNet-C mCE | ImageNet-P mFR |
|---|---|---|---|---|
| Prev. SOTA | 86.4% | 61.0% | 45.7 | 27.8 |
| Ours | 88.4% | 83.7% | 28.3 | 12.2 |

Table 1: Summary of key results compared to previous state-of-the-art models [86, 55]. Lower is better for mean corruption error (mCE) and mean flip rate (mFR).
2. Noisy Student Training
Algorithm 1 gives an overview of Noisy Student Training. The inputs to the algorithm are both labeled and unlabeled images. We use the labeled images to train a teacher model using the standard cross entropy loss. We then use the teacher model to generate pseudo labels on unlabeled images. The pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution). We then train a student model which minimizes the combined cross entropy loss on both labeled images and unlabeled images. Finally, we iterate the process by putting back the student as a teacher to generate new pseudo labels and train a new student. The algorithm is also illustrated in Figure 1.
1: Learn a teacher model $\theta_{*}^{t}$ which minimizes the cross entropy loss on labeled images

$$
\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i},f^{noised}(x_{i},\theta^{t}))
$$
2: Use a normal (i.e., not noised) teacher model to generate soft or hard pseudo labels for clean (i.e., not distorted) unlabeled images
$$
\tilde{y}_{i}=f(\tilde{x}_{i},\theta_{*}^{t}),\ \forall i=1,\cdots,m
$$
3: Learn an equal-or-larger student model $\theta_ {* }^{s}$ which minimizes the cross entropy loss on labeled images and unlabeled images with noise added to the student model
$$
\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i},f^{noised}(x_{i},\theta^{s}))+\frac{1}{m}\sum_{i=1}^{m}\ell(\tilde{y}_{i},f^{noised}(\tilde{x}_{i},\theta^{s}))
$$
4: Iterative training: Use the student as a teacher and go back to step 2.
Algorithm 1: Noisy Student Training.
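One round of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual EfficientNet/TPU pipeline: `teacher_predict` and `train_student` are placeholder callables we introduce for exposition, and the loss helpers mirror the two cross entropy terms above.

```python
import numpy as np

def cross_entropy(targets, probs):
    # Mean cross entropy over a batch; `targets` may be soft
    # (full distributions) or hard (one-hot vectors).
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))

def combined_loss(y_labeled, p_labeled, y_pseudo, p_pseudo):
    # Step 3 objective: sum of the two batch-averaged cross entropy terms,
    # one on labeled images and one on pseudo-labeled images.
    return cross_entropy(y_labeled, p_labeled) + cross_entropy(y_pseudo, p_pseudo)

def noisy_student_round(teacher_predict, train_student,
                        x_labeled, y_labeled, x_unlabeled, hard_labels=False):
    # Step 2: a clean (not noised) teacher produces pseudo labels.
    pseudo = teacher_predict(x_unlabeled)
    if hard_labels:
        # Hard pseudo labels: one-hot at the teacher's argmax class.
        pseudo = np.eye(pseudo.shape[1])[pseudo.argmax(axis=1)]
    # Step 3: train a noised, equal-or-larger student on both sets.
    # Step 4 (iteration) is simply calling this function again with the
    # returned student acting as the new teacher.
    return train_student(x_labeled, y_labeled, x_unlabeled, pseudo)
```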

Figure 1: Illustration of Noisy Student Training. (All shown images are from ImageNet.)
The algorithm is an improved version of self-training, a method in semi-supervised learning (e.g., [71, 96]), and distillation [33]. More discussions on how our method is related to prior works are included in Section 5.
Our key improvements lie in adding noise to the student and using student models that are not smaller than the teacher. This makes our method different from Knowledge Distillation [33] where 1) noise is often not used and 2) a smaller student model is often used to be faster than the teacher. One can think of our method as knowledge expansion in which we want the student to be better than the teacher by giving the student model enough capacity and difficult environments in terms of noise to learn through.
Noising Student – When the student is deliberately noised, it is trained to be consistent with the teacher, which is not noised when it generates pseudo labels. In our experiments, we use two types of noise: input noise and model noise. For input noise, we use data augmentation with RandAugment [18]. For model noise, we use dropout [76] and stochastic depth [37].
When applied to unlabeled data, noise has an important benefit of enforcing invariances in the decision function on both labeled and unlabeled data. First, data augmentation is an important noising method in Noisy Student Training because it forces the student to ensure prediction consistency across augmented versions of an image (similar to UDA [91]). Specifically, in our method, the teacher produces high-quality pseudo labels by reading in clean images, while the student is required to reproduce those labels with augmented images as input. For example, the student must ensure that a translated version of an image has the same category as the original image. Second, when dropout and stochastic depth are used as noise, the teacher behaves like an ensemble at inference time (when it generates pseudo labels), whereas the student behaves like a single model. In other words, the student is forced to mimic a more powerful ensemble model. We present an ablation study on the effects of noise in Section 4.1.
Other Techniques – Noisy Student Training also works better with an additional trick: data filtering and balancing, similar to [91, 93]. Specifically, we filter out images on which the teacher model has low confidence, since they are usually out-of-domain images. To ensure that the distribution of the unlabeled images matches that of the training set, we also need to balance the number of unlabeled images for each class, as all classes in ImageNet have a similar number of labeled images. For this purpose, we duplicate images in classes where there are not enough images. For classes where we have too many images, we take the images with the highest confidence.
Finally, we emphasize that our method can be used with soft or hard pseudo labels as both work well in our experiments. Soft pseudo labels, in particular, work slightly better for out-of-domain unlabeled data. Thus in the following, for consistency, we report results with soft pseudo labels unless otherwise indicated.
Comparisons with Existing SSL Methods. Apart from self-training, another important line of work in semi-supervised learning [12, 103] is based on consistency training [5, 64, 47, 84, 56, 91, 8] and pseudo labeling [48, 39, 73, 1]. Although they have produced promising results, in our preliminary experiments, methods based on consistency regularization and pseudo labeling work less well on ImageNet. Instead of using a teacher model trained on labeled data to generate pseudo labels, these methods do not have a separate teacher model and use the model being trained to generate pseudo labels. In the early phase of training, the model being trained has low accuracy and high entropy, hence consistency training regularizes the model towards high entropy predictions and prevents it from achieving good accuracy. A common workaround is to use entropy minimization, to filter examples with low confidence, or to ramp up the consistency loss. However, the additional hyperparameters introduced by the ramping-up schedule, confidence-based filtering, and entropy minimization make these methods more difficult to use at scale. The self-training / teacher-student framework is better suited for ImageNet because we can train a good teacher on ImageNet using labeled data.
3. Experiments
In this section, we will first describe our experiment details. We will then present our ImageNet results compared with those of state-of-the-art models. Lastly, we demonstrate the surprising improvements of our models on robustness datasets (such as ImageNet-A, C and P) as well as under adversarial attacks.
3.1. Experiment Details
Labeled dataset. We conduct experiments on the ImageNet 2012 ILSVRC challenge prediction task since it is one of the most heavily benchmarked datasets in computer vision and improvements on ImageNet transfer to other datasets [44, 66].
Unlabeled dataset. We obtain unlabeled images from the JFT dataset [33, 15], which has around 300M images. Although the images in the dataset have labels, we ignore the labels and treat them as unlabeled data. We filter the ImageNet validation set images from the dataset (see [58]).
We then perform data filtering and balancing on this corpus. First, we run an EfficientNet-B0 trained on ImageNet [83] over the JFT dataset [33, 15] to predict a label for each image. We then select images whose label confidence is higher than 0.3. For each class, we select at most 130K images that have the highest confidence. Finally, for classes that have less than 130K images, we duplicate some images at random so that each class has 130K images. Hence the total number of images that we use for training a student model is 130M (with some duplicated images). Due to duplication, there are only 81M unique images among these 130M images. We do not tune these hyperparameters extensively since our method is highly robust to them.
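The filtering and balancing procedure above can be sketched as follows. The function name, its array inputs, and the tiny cap used in testing are our own illustrative choices; we assume the teacher's predicted label and confidence are available for every unlabeled image.

```python
import numpy as np

def filter_and_balance(confidences, labels, threshold=0.3,
                       per_class_cap=130_000, rng=None):
    """Return indices into the unlabeled corpus, possibly with duplicates,
    so that every retained class contributes `per_class_cap` examples."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Drop low-confidence images, which are usually out-of-domain.
    keep = confidences >= threshold
    selected = []
    for c in np.unique(labels[keep]):
        idx = np.where(keep & (labels == c))[0]
        # Too many images in this class: keep only the most confident ones.
        idx = idx[np.argsort(-confidences[idx])][:per_class_cap]
        # Too few images: duplicate some at random up to the cap.
        if len(idx) < per_class_cap:
            idx = np.concatenate(
                [idx, rng.choice(idx, size=per_class_cap - len(idx), replace=True)])
        selected.append(idx)
    return np.concatenate(selected)
```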
To enable fair comparisons with our results, we also experiment with a public dataset YFCC100M [85] and show the results in Appendix A.4.
Architecture. We use EfficientNets [83] as our baseline models because they provide better capacity for more data. In our experiments, we also further scale up EfficientNet-B7 and obtain EfficientNet-L2. EfficientNet-L2 is wider and deeper than EfficientNet-B7 but uses a lower resolution, which gives it more parameters to fit a large number of unlabeled images. Due to the large model size, the training time of EfficientNet-L2 is approximately five times the training time of EfficientNet-B7. For more information about EfficientNet-L2, please refer to Table 8 in Appendix A.1.
Training details. For labeled images, we use a batch size of 2048 by default and reduce the batch size when we cannot fit the model into memory. We find that using a batch size of 512, 1024, or 2048 leads to the same performance. We determine the number of training steps and the learning rate schedule by the batch size for labeled images. Specifically, we train the student model for 350 epochs for models larger than EfficientNet-B4, including EfficientNet-L2, and train smaller student models for 700 epochs. The learning rate starts at 0.128 for a labeled batch size of 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs.
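The stated schedule amounts to step-wise exponential decay. A minimal sketch, where the helper name and the assumption that decay is applied in discrete 2.4-epoch (or 4.8-epoch) steps are ours:

```python
def learning_rate(epoch, total_epochs=350, base_lr=0.128):
    """Exponential decay: multiply by 0.97 every 2.4 epochs for a
    350-epoch run, or every 4.8 epochs for a 700-epoch run."""
    decay_epochs = 2.4 if total_epochs == 350 else 4.8
    return base_lr * 0.97 ** int(epoch / decay_epochs)
```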
We use a large batch size for unlabeled images, especially for large models, to make full use of large quantities of unlabeled images. Labeled images and unlabeled images are concatenated together to compute the average cross entropy loss. We apply the recently proposed technique of fixing the train-test resolution discrepancy [86] to EfficientNet-L2: we first perform normal training with a smaller resolution for 350 epochs, then finetune the model with a larger resolution for 1.5 epochs on unaugmented labeled images. Similar to [86], we fix the shallow layers during finetuning.
Our largest model, EfficientNet-L2, needs to be trained for 6 days on a Cloud TPU v3 Pod, which has 2048 cores, if the unlabeled batch size is 14x the labeled batch size.
Table 2: Top-1 and Top-5 accuracy of Noisy Student Training and previous state-of-the-art methods on ImageNet. EfficientNet-L2 with Noisy Student Training is the result of iterative training for multiple iterations by putting back the student model as the new teacher. It has a better tradeoff in terms of accuracy and model size compared to previous state-of-the-art models. †: Big Transfer is a concurrent work that performs transfer learning from the JFT dataset.
| Method | # Params | Extra Data | Top-1 Acc. | Top-5 Acc. |
|---|---|---|---|---|
| ResNet-50 [30] | 26M | | 76.0% | 93.0% |
| ResNet-152 [30] | 60M | | 77.8% | 93.8% |
| DenseNet-264 [36] | 34M | | 77.9% | 93.9% |
| Inception-v3 [81] | 24M | | 78.8% | 94.4% |
| Xception [15] | 23M | | 79.0% | 94.5% |
| Inception-v4 [79] | 48M | | 80.0% | 95.0% |
| Inception-ResNet-v2 [79] | 56M | | 80.1% | 95.1% |
| ResNeXt-101 [92] | 84M | | 80.9% | 95.6% |
| PolyNet [100] | 92M | | 81.3% | 95.8% |
| SENet [35] | 146M | | 82.7% | 96.2% |
| NASNet-A [104] | 89M | | 82.7% | 96.2% |
| AmoebaNet-A [65] | 87M | | 82.8% | 96.1% |
| PNASNet [50] | 86M | | 82.9% | 96.2% |
| AmoebaNet-C [17] | 155M | | 83.5% | 96.5% |
| GPipe [38] | 557M | | 84.3% | 97.0% |
| EfficientNet-B7 [83] | 66M | | 85.0% | 97.2% |
| EfficientNet-L2 [83] | 480M | | 85.5% | 97.5% |
| ResNet-50 Billion-scale [93] | 26M | 3.5B images labeled with tags | 81.2% | 96.0% |
| ResNeXt-101 Billion-scale [93] | 193M | 3.5B images labeled with tags | 84.8% | |
| ResNeXt-101 WSL [55] | 829M | 3.5B images labeled with tags | 85.4% | 97.6% |
| FixRes ResNeXt-101 WSL [86] | 829M | 3.5B images labeled with tags | 86.4% | 98.0% |
| Big Transfer (BiT-L) [43]† | 928M | 300M weakly labeled images from JFT | 87.5% | 98.5% |
| Noisy Student Training (EfficientNet-L2) | 480M | 300M unlabeled images from JFT | 88.4% | 98.7% |
Noise. We use stochastic depth [37], dropout [76], and RandAugment [18] to noise the student. The hyperparameters for these noise functions are the same for EfficientNet-B7 and L2. In particular, we set the survival probability in stochastic depth to 0.8 for the final layer and follow the linear decay rule for other layers. We apply dropout to the final layer with a dropout rate of 0.5. For RandAugment, we apply two random operations with magnitude set to 27.
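The linear decay rule for stochastic depth can be sketched as below: the final layer survives with probability 0.8 and earlier layers interpolate linearly toward 1.0 at the input. The helper name and the exact interpolation form are our reading of the standard rule from [37], not code from the paper.

```python
def survival_probabilities(num_layers, final_survival=0.8):
    # Layer i (1-indexed, of L layers) survives with probability
    # 1 - (i / L) * (1 - p_L), so the first layers are almost never
    # dropped and the final layer survives with probability p_L.
    return [1.0 - (i / num_layers) * (1.0 - final_survival)
            for i in range(1, num_layers + 1)]
```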
Iterative training. The best model in our experiments is a result of three iterations of putting back the student as the new teacher. We first trained an EfficientNet-B7 on ImageNet as the teacher model. Then, using the B7 model as the teacher, we trained an EfficientNet-L2 model with the unlabeled batch size set to 14 times the labeled batch size. Then, we trained a new EfficientNet-L2 model with the EfficientNet-L2 model as the teacher. Lastly, we iterated again and used an unlabeled batch size of 28 times the labeled batch size. The detailed results of the three iterations are available in Section 4.2.
3.2. ImageNet Results
We first report the validation set accuracy on the ImageNet 2012 ILSVRC challenge prediction task, as commonly done in the literature [45, 80, 30, 83] (see also [66]). As shown in Table 2, EfficientNet-L2 with Noisy Student Training achieves 88.4% top-1 accuracy, which is significantly better than the best reported accuracy on EfficientNet of 85.0%. The total gain of 3.4% comes from two sources: making the model larger (+0.5%) and Noisy Student Training (+2.9%). In other words, Noisy Student Training makes a much larger impact on the accuracy than changing the architecture.
Further, Noisy Student Training outperforms the state-of-the-art accuracy of 86.4% by FixRes ResNeXt-101 WSL [55, 86], which requires 3.5 billion Instagram images labeled with tags. As a comparison, our method only requires 300M unlabeled images, which are perhaps easier to collect. Our model also has approximately half as many parameters as FixRes ResNeXt-101 WSL.
Model size study: Noisy Student Training for EfficientNet B0-B7 without iterative training. In addition to improving state-of-the-art results, we conduct experiments to verify whether Noisy Student Training can benefit other EfficientNet models. In previous experiments, iterative training was used to optimize the accuracy of EfficientNet-L2, but here we skip it as it is difficult to use iterative training for many experiments. We vary the model size from EfficientNet-B0 to EfficientNet-B7 [83] and use the same model as both the teacher and the student. We apply RandAugment to all EfficientNet baselines, leading to more competitive baselines. We set the unlabeled batch size to be three times the batch size of labeled images for all model sizes except EfficientNet-B0, for which we set the unlabeled batch size to be the same as the batch size of labeled images. As shown in Figure 2, Noisy Student Training leads to a consistent improvement of around 0.8% for all model sizes. Overall, EfficientNets with Noisy Student Training provide a much better tradeoff between model size and accuracy than prior works. The results also confirm that vision models can benefit from Noisy Student Training even without iterative training.

Figure 2: Noisy Student Training leads to significant improvements across all model sizes. We use the same architecture for the teacher and the student and do not perform iterative training.
3.3. Robustness Results on ImageNet-A, ImageNet-C and ImageNet-P
We evaluate the best model, which achieves 88.4% top-1 accuracy, on three robustness test sets: ImageNet-A, ImageNet-C and ImageNet-P. The ImageNet-C and ImageNet-P test sets [31] include images with common corruptions and perturbations such as blurring, fogging, rotation and scaling. The ImageNet-A test set [32] consists of difficult images that cause significant drops in accuracy for state-of-the-art models. These test sets are considered "robustness" benchmarks because the test images are either much harder (for ImageNet-A) or different from the training images (for ImageNet-C and P).
Table 3: Robustness results on ImageNet-A.
| Method | Top-1 Acc. | Top-5 Acc. |
|---|---|---|
| ResNet-101 [32] | 4.7% | |
| ResNeXt-101 [32] (32x4d) | 5.9% | |
| ResNet-152 [32] | 6.1% | |
| ResNeXt-101 [32] (64x4d) | 7.3% | |
| DPN-98 [32] | 9.4% | |
| ResNeXt-101+SE [32] (32x4d) | 14.2% | |
| ResNeXt-101 WSL [55, 59] | 61.0% | |
| EfficientNet-L2 | 49.6% | 78.6% |
| Noisy Student Training (L2) | 83.7% | 95.2% |
Table 4: Robustness results on ImageNet-C. mCE is the weighted average of error rate on different corruptions, with AlexNet’s error rate as a baseline (lower is better).
| Method | Res. | Top-1 Acc. | mCE |
|---|---|---|---|
| ResNet-50 [31] | 224 | 39.0% | 76.7 |
| SIN [23] | 224 | 45.2% | 69.3 |
| Patch Gaussian [51] | 299 | 52.3% | 60.4 |
| ResNeXt-101 WSL [55, 59] | 224 | | 45.7 |
| EfficientNet-L2 | 224 | 62.6% | 47.5 |
| Noisy Student Training (L2) | 224 | 76.5% | 30.0 |
| EfficientNet-L2 | 299 | 66.6% | 42.5 |
| Noisy Student Training (L2) | 299 | 77.8% | 28.3 |
Table 5: Robustness results on ImageNet-P, where images are generated with a sequence of perturbations. mFR measures the model’s probability of flipping predictions under perturbations with AlexNet as a baseline (lower is better).
| Method | Res. | Top-1 Acc. | mFR |
|---|---|---|---|
| ResNet-50 [31] | 224 | | 58.0 |
| Low Pass Filter Pooling [99] | 224 | | 51.2 |
| ResNeXt-101 WSL [55, 59] | 224 | | 27.8 |
| EfficientNet-L2 | 224 | 80.4% | 27.2 |
| Noisy Student Training (L2) | 224 | 85.2% | 14.2 |
| EfficientNet-L2 | 299 | 81.6% | 23.7 |
| Noisy Student Training (L2) | 299 | 86.4% | 12.2 |
For ImageNet-C and ImageNet-P, we evaluate models on the two released versions with resolutions 224x224 and 299x299 and resize images to the resolution EfficientNet was trained on.

Figure 3: Selected images from robustness benchmarks ImageNet-A, C and P. Test images from ImageNet-C underwent artificial transformations (also known as common corruptions) that cannot be found on the ImageNet training set. Test images from ImageNet-P underwent different scales of perturbations. On ImageNet-A and C, EfficientNet with Noisy Student Training produces correct top-1 predictions (shown in bold black text) and EfficientNet without Noisy Student Training produces incorrect top-1 predictions (shown in red text). On ImageNet-P, EfficientNet without Noisy Student Training flips predictions frequently.
As shown in Tables 3, 4 and 5, Noisy Student Training yields substantial gains on robustness datasets compared to the previous state-of-the-art model ResNeXt-101 WSL [55, 59] trained on 3.5B weakly labeled images. On ImageNet-A, it improves the top-1 accuracy from $61.0%$ to $83.7%$. On ImageNet-C, it reduces mean corruption error (mCE) from 45.7 to 28.3. On ImageNet-P, it leads to a mean flip rate (mFR) of 14.2 if we use a resolution of $224\times224$ (direct comparison) and 12.2 if we use a resolution of $299\times299$. These significant gains in robustness on ImageNet-C and ImageNet-P are surprising because our method was not deliberately optimized for robustness.
Qualitative Analysis. To intuitively understand the significant improvements on the three robustness benchmarks, we show several images in Figure 3 where the predictions of the standard model are incorrect while the predictions of the model with Noisy Student Training are correct.
Figure 3a shows example images from ImageNet-A and the predictions of our models. The model with Noisy Student Training can successfully predict the correct labels for these highly difficult images. For example, without Noisy Student Training, the model predicts bullfrog for the image shown on the left of the second row, which might result from the black lotus leaf on the water. With Noisy Student Training, the model correctly predicts dragonfly. In the top-left image, the model without Noisy Student Training ignores the sea lions and mistakenly recognizes a buoy as a lighthouse, while the model with Noisy Student Training recognizes the sea lions.
Figure 3b shows images from ImageNet-C and the corresponding predictions. As can be seen from the figure, our model with Noisy Student Training makes correct predictions for images under severe corruptions and perturbations such as snow, motion blur and fog, while the model without
Noisy Student Training suffers greatly under these conditions. The most interesting image is shown on the right of the first row. The swing in the picture is barely recognizable by humans, yet the model with Noisy Student Training still makes the correct prediction.
Figure 3c shows images from ImageNet-P and the corresponding predictions. As can be seen, our model with Noisy Student Training makes correct and consistent predictions as images undergo different perturbations, while the model without Noisy Student Training flips predictions frequently.
3.4. Adversarial Robustness Results
After testing our model’s robustness to common corruptions and perturbations, we also study its performance on adversarial perturbations. We evaluate our EfficientNet-L2 models with and without Noisy Student Training against an FGSM attack. This attack performs one gradient step on the input image [25], with the update on each pixel set to $\epsilon$. As shown in Figure 4, Noisy Student Training leads to very significant improvements in accuracy even though the model is not optimized for adversarial robustness. Under the stronger PGD attack with 10 iterations [54], at $\epsilon=16$, Noisy Student Training improves EfficientNet-L2’s accuracy from $1.1%$ to $4.4%$.
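The one-step attack described above can be sketched on a toy model. The NumPy code below applies FGSM to a hypothetical logistic classifier: one gradient step on the input, moving each pixel by exactly $\epsilon$ in the loss-increasing direction [25]. All names and the toy data are illustrative, not the paper's setup.

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One FGSM step on a toy logistic model p = sigmoid(w.x + b).

    Each pixel moves by +/- eps, following the sign of the input gradient
    of the cross-entropy loss.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_x = (p - y) * w  # d(cross-entropy)/dx for a sigmoid output
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=16)  # toy "image" with 16 pixels in [0, 1]
w = rng.normal(size=16)
b = 0.0
x_adv = fgsm(x, y=1.0, w=w, b=b, eps=8 / 255)
# Every pixel changed by at most eps, so the attack stays in an L-inf ball.
print(np.max(np.abs(x_adv - x)) <= 8 / 255 + 1e-9)  # True
```

For this linear-logit model the perturbation provably lowers the probability of the true class, i.e. increases the loss; real attacks apply the same step to a deep network's input gradient.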
Note that these adversarial robustness results are not directly comparable to prior works since we use a large input resolution of $800\times800$ and adversarial vulnerability can scale with the input dimension [22, 25, 24, 74].

Figure 4: Noisy Student Training improves adversarial robustness against an FGSM attack though the model is not optimized for adversarial robustness. The accuracy is improved by $11%$ at $\epsilon=2$ and gets better as $\epsilon$ gets larger.
4. Ablation Study
In this section, we study the importance of noise and iterative training and summarize the ablations for other components of our method.
4.1. The Importance of Noise in Self-training
Since we use soft pseudo labels generated from the teacher model, when the student is trained to be exactly the same as the teacher model, the cross entropy loss on unlabeled data would be zero and the training signal would vanish. Hence, a question that naturally arises is why the student can outperform the teacher with soft pseudo labels. As stated earlier, we hypothesize that noising the student is needed so that it does not merely learn the teacher’s knowledge. We investigate the importance of noising in two scenarios with different amounts of unlabeled data and different teacher model accuracies. In both cases, we gradually remove augmentation, stochastic depth and dropout for unlabeled images when training the student model, while keeping them for labeled images. This way, we can isolate the influence of noising on unlabeled images from the influence of preventing overfitting on labeled images. In addition, we compare using a noised teacher and an unnoised teacher to study whether it is necessary to disable noise when generating pseudo labels.
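The teacher-student recipe being ablated here can be sketched end to end. In the toy NumPy version below, a nearest-centroid classifier stands in for EfficientNet, Gaussian input jitter stands in for RandAugment, dropout and stochastic depth, and all names and data are hypothetical; the point is only the data flow: unnoised teacher → soft pseudo labels → noised student trained on the union.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(x, soft_y):
    """'Train' a toy model: per-class centroids weighted by soft labels
    (a stand-in for fitting a CNN on soft targets)."""
    return (soft_y.T @ x) / soft_y.sum(axis=0)[:, None]

def soft_predict(centroids, x):
    """Soft class distribution via a softmax over negative squared distances."""
    logits = -((x[:, None, :] - centroids[None]) ** 2).sum(-1)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Toy labeled / unlabeled data: two Gaussian blobs in 2-D.
x_lab = np.concatenate([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y_lab = np.eye(2)[[0] * 20 + [1] * 20]  # one-hot ground-truth labels
x_unlab = np.concatenate([rng.normal(0, 0.5, (200, 2)), rng.normal(3, 0.5, (200, 2))])

# Step 1: teacher trained on labeled data; no noise when generating pseudo labels.
teacher = fit_centroids(x_lab, y_lab)
pseudo = soft_predict(teacher, x_unlab)  # soft pseudo labels

# Step 2: noised student on labeled + pseudo-labeled data; the input jitter
# is the toy analogue of the noise that is removed in this ablation.
x_noised = x_unlab + rng.normal(0, 0.3, x_unlab.shape)
student = fit_centroids(np.concatenate([x_lab, x_noised]),
                        np.concatenate([y_lab, pseudo]))
print(student.shape)  # (2, 2): one 2-D centroid per class
```

Removing the jitter in Step 2 is the toy analogue of the "w/o Aug, SD, Dropout" rows in Table 6.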
Table 6 shows the evidence: noise such as stochastic depth, dropout and data augmentation plays an important role in enabling the student model to perform better than the teacher. The performance consistently drops as noise is removed. However, in the case with 130M unlabeled images, even with the noise removed, the performance still improves over the supervised baseline, from $84.0%$ to $84.3%$. We hypothesize that this improvement can be attributed to SGD, which introduces stochasticity into the training process.
One might argue that the improvements from using noise could result from preventing overfitting to the pseudo labels on the unlabeled images. We verify that this is not the case when we use 130M unlabeled images, since, judging from the training loss, the model does not overfit the unlabeled set. While removing noise leads to a much lower training loss for labeled images, we observe that, for unlabeled images, removing noise leads to a smaller drop in training loss. This is probably because it is harder to overfit the large unlabeled dataset.
Lastly, adding noise to the teacher model that generates pseudo labels leads to lower accuracy, which shows the importance of having a powerful unnoised teacher model.
4.2. A Study of Iterative Training
Here, we show the detailed effects of iterative training. As mentioned in Section 3.1, we first train an EfficientNet-B7 model on labeled data and then use it as the teacher to
| Model / Unlabeled Set Size | 1.3M | 130M |
|---|---|---|
| EfficientNet-B5 | 83.3% | 84.0% |
| Noisy Student Training (B5) | 83.9% | 85.1% |
| student w/o Aug | 83.6% | 84.6% |
| student w/o Aug, SD, Dropout | 83.2% | 84.3% |
| teacher w. Aug, SD, Dropout | 83.7% | 84.4% |
Table 6: Ablation study of noising. We use EfficientNet-B5 as the teacher model and study two cases with different numbers of unlabeled images and different augmentations. For the experiment with 1.3M unlabeled images, we use the standard augmentation, including random translation and flipping, for both the teacher and the student. For the experiment with 130M unlabeled images, we use RandAugment. Aug and SD denote data augmentation and stochastic depth respectively. We remove the noise for unlabeled images while keeping it for labeled images. Here, iterative training is not used and the unlabeled batch size is set to be the same as the labeled batch size to save training time.
train an EfficientNet-L2 student model. Then, we iterate this process by putting back the new student model as the teacher model.
As shown in Table 7, the model performance improves to $87.6%$ in the first iteration and then to $88.1%$ in the second iteration with the same hyperparameters (except using a teacher model with better performance). These results indicate that iterative training is effective in producing increasingly better models. For the last iteration, we use a larger ratio between unlabeled batch size and labeled batch size to boost the final performance to $88.4%$.
Table 7: Iterative training improves the accuracy, where batch size ratio denotes the ratio between unlabeled data and labeled data.
| Iteration | Model | Batch Size Ratio | Top-1 Acc. |
|---|---|---|---|
| 1 | EfficientNet-L2 | 14:1 | 87.6% |
| 2 | EfficientNet-L2 | 14:1 | 88.1% |
| 3 | EfficientNet-L2 | 28:1 | 88.4% |
4.3. Additional Ablation Study Summarization
We also study the importance of various design choices of Noisy Student Training, hopefully offering a practical guide for readers. To this end, we conduct 8 ablation studies in Appendix A.2. The findings are summarized as follows:
• Finding #1: Using a large teacher model with better performance leads to better results.
• Finding #2: A large amount of unlabeled data is necessary.
5. Related works
Self-training. Our work is based on self-training (e.g., [71, 96, 68, 67]). Self-training first uses labeled data to train a good teacher model, then uses the teacher model to label unlabeled data, and finally uses the labeled and unlabeled data jointly to train a student model. In typical self-training with the teacher-student framework, noise injection to the student is not used by default, or the role of noise is not fully understood or justified. The main difference between our work and prior works is that we identify the importance of noise, and aggressively inject noise to make the student better.
Self-training was previously used to improve ResNet-50 from $76.4%$ to $81.2%$ top-1 accuracy [93], which is still far from the state-of-the-art accuracy. Yalniz et al. [93] also did not show significant improvements in terms of robustness on ImageNet-A, C and P as we did. In terms of methodology, they proposed to first train only on unlabeled images and then finetune their model on labeled images as the final stage. In Noisy Student Training, we combine these two steps into one because it simplifies the algorithm and leads to better performance in our experiments.
Data Distillation [63], which ensembles predictions for an image with different transformations to strengthen the teacher, is the opposite of our approach of weakening the student. Parthasarathi et al. [61] find a small and fast speech recognition model for deployment via knowledge distillation on unlabeled data. As noise is not used and the student is also small, it is difficult to make the student better than the teacher. The domain adaptation framework in [69] is related but highly optimized for videos, e.g., predicting which frame to use in a video. The method in [101] ensembles predictions from multiple teacher models, which is more expensive than our method.
Co-training [9] divides features into two disjoint partitions and trains two models with the two sets of features using labeled data. Their source of “noise” is the feature partitioning such that two models do not always agree on unlabeled data. Our method of injecting noise to the student model also enables the teacher and the student to make different predictions and is more suitable for ImageNet than partitioning features.
Self-training / co-training has also been shown to work well for a variety of other tasks, including leveraging noisy data [87], semantic segmentation [4], and text classification [40, 78]. Back translation and self-training have led to significant improvements in machine translation [72, 20, 28, 14, 90, 29].
Semi-supervised Learning. Apart from self-training, another important line of work in semi-supervised learning [12, 103] is based on consistency training [5, 64, 47, 84, 56, 52, 62, 13, 16, 60, 2, 49, 88, 91, 8, 98, 46, 7]. These methods constrain model predictions to be invariant to noise injected to the input, hidden states or model parameters. As discussed in Section 2, consistency regularization works less well on ImageNet because consistency regularization uses a model being trained to generate the pseudo-labels. In the early phase of training, they regularize the model towards high-entropy predictions, which prevents it from achieving good accuracy.
Works based on pseudo labels [48, 39, 73, 1] are similar to self-training, but also suffer the same problem as consistency training, since they rely on a model being trained, instead of a converged model with high accuracy, to generate pseudo labels. Finally, frameworks in semi-supervised learning also include graph-based methods [102, 89, 94, 42], methods that make use of latent variables as target variables [41, 53, 95] and methods based on low-density separation [26, 70, 19], which might provide complementary benefits to our method.
Knowledge Distillation. Our work is also related to methods in Knowledge Distillation [10, 3, 33, 21, 6] via the use of soft targets. The main use of knowledge distillation is model compression by making the student model smaller. The main difference between our method and knowledge distillation is that knowledge distillation does not consider unlabeled data and does not aim to improve the student model.
Robustness. A number of studies, e.g. [82, 31, 66, 27], have shown that vision models lack robustness. Addressing the lack of robustness has become an important research direction in machine learning and computer vision in recent years. Our study shows that using unlabeled data improves accuracy and general robustness. Our finding is consistent with arguments that using unlabeled data can improve adversarial robustness [11, 77, 57, 97]. The main difference between our work and these works is that they directly optimize adversarial robustness on unlabeled data, whereas we show that Noisy Student Training improves robustness greatly even without directly optimizing robustness.
6. Conclusion
Prior works on weakly-supervised learning required billions of weakly labeled data to improve state-of-the-art ImageNet models. In this work, we showed that it is possible to use unlabeled images to significantly advance both accuracy and robustness of state-of-the-art ImageNet models. We found that self-training is a simple and effective algorithm to leverage unlabeled data at scale. We improved it by adding noise to the student, hence the name Noisy Student Training, to learn beyond the teacher’s knowledge.
Our experiments showed that Noisy Student Training and EfficientNet can achieve an accuracy of $88.4%$, which is $2.9%$ higher than without Noisy Student Training. This result is also a new state-of-the-art and $2.0%$ better than the previous best method that used an order of magnitude more weakly labeled data [55, 86].
An important contribution of our work was to show that Noisy Student Training boosts robustness in computer vision models. Our experiments showed that our model significantly improves performances on ImageNet-A, C and P.
Acknowledgement
We thank the Google Brain team, Zihang Dai, Jeff Dean, Hieu Pham, Colin Raffel, Ilya Sutskever and Mingxing Tan for insightful discussions, Cihang Xie, Dan Hendrycks and A. Emin Orhan for robustness evaluation, Sergey Ioffe, Guokun Lai, Jiquan Ngiam, Jiateng Xie and Adams Wei Yu for feedback on the draft, Yanping Huang, Pankaj Kanwar, Naveen Kumar, Sameer Kumar and Zak Stone for great help with TPUs, Ekin Dogus Cubuk and Barret Zoph for help with RandAugment, Tom Duerig, Victor Gomes, Paul Haahr, Pandu Nayak, David Price, Janel Thamkul, Elizabeth Trumbull, Jake Walker and Wenlei Zhou for help with model releases, Yanan Bao, Zheyun Feng and Daiyi Peng for help with the JFT dataset, and Ola Spyra and Olga Wichrowska for help with infrastructure.
References
[1] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. arXiv preprint arXiv:1908.02983, 2019. 3, 9
A. Experiments
A.1. Architecture Details
The architecture specifications of EfficientNet-L2 are listed in Table 8. We also list EfficientNet-B7 as a reference. Scaling width and resolution by $c$ leads to an increase factor of $c^{2}$ in training time, and scaling depth by $c$ leads to an increase factor of $c$. The training time of EfficientNet-L2 is around 5 times the training time of EfficientNet-B7.
| Architecture Name | w | d | Train Res. | Test Res. | #Params |
|---|---|---|---|---|---|
| EfficientNet-B7 | 2.0 | 3.1 | 600 | 600 | 66M |
| EfficientNet-L2 | 4.3 | 5.3 | 475 | 800 | 480M |
Table 8: Architecture specifications for the EfficientNets used in the paper. The width $w$ and depth $d$ are the scaling factors that need to be contextualized in EfficientNet [83]. Train Res. and Test Res. denote training and testing resolutions respectively.
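The stated scaling rule can be checked against the numbers in Table 8. The helper below (a hypothetical name) multiplies the quadratic width and resolution factors with the linear depth factor; plugging in the B7 and L2 entries recovers roughly the "around 5 times" figure.

```python
def relative_train_time(w1, d1, r1, w2, d2, r2):
    """Rough training-time ratio under the rule above: width and resolution
    scale training time quadratically, depth scales it linearly."""
    return (w2 / w1) ** 2 * (d2 / d1) * (r2 / r1) ** 2

# EfficientNet-B7 -> EfficientNet-L2, using Table 8's (w, d, train res.).
ratio = relative_train_time(2.0, 3.1, 600, 4.3, 5.3, 475)
print(round(ratio, 1))  # 5.0, consistent with "around 5 times"
```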
A.2. Ablation Studies
In this section, we provide comprehensive studies of various components of our method. Since iterative training results in longer training time, we conduct ablations without it. To further save training time, we reduce the training epochs for small models from 700 to 350, starting from Study #4. We also set the unlabeled batch size to be the same as the labeled batch size for models smaller than EfficientNet-B7, starting from Study #2.
Study #1: Teacher Model’s Capacity. Here, we study whether using a larger and better teacher model would lead to better results. We use our best model, Noisy Student Training with EfficientNet-L2, which achieves a top-1 accuracy of $88.4%$, to teach student models with sizes ranging from EfficientNet-B0 to EfficientNet-B7. We use the standard augmentation instead of RandAugment on unlabeled data in this experiment to give the student model more capacity. This setting is in principle similar to distillation on unlabeled data.
The comparison is shown in Table 9. Using Noisy Student Training (EfficientNet-L2) as the teacher leads to another $0.5%$ to $1.6%$ improvement on top of the improved results obtained by using the same model as the teacher. For example, we can train a medium-sized model, EfficientNet-B4, which has fewer parameters than ResNet-50, to an accuracy of $85.3%$. Therefore, using a large teacher model with better performance leads to better results.
Study #2: Unlabeled Data Size. Next, we conduct experiments to understand the effects of using different amounts of unlabeled data. We start with the 130M unlabeled images and gradually reduce the unlabeled set. We experiment
| Model | #Params | Top-1 Acc. | Top-5 Acc. |
|---|---|---|---|
| EfficientNet-B0 | 5.3M | 77.3% | 93.4% |
| Noisy Student Training (B0) | 5.3M | 78.1% | 94.2% |
| Noisy Student Training (B0, L2) | 5.3M | 78.8% | 94.5% |
| EfficientNet-B1 | 7.8M | 79.2% | 94.4% |
| Noisy Student Training (B1) | 7.8M | 80.2% | 95.2% |
| Noisy Student Training (B1, L2) | 7.8M | 81.5% | 95.8% |
| EfficientNet-B2 | 9.2M | 80.0% | 94.9% |
| Noisy Student Training (B2) | 9.2M | 81.1% | 95.5% |
| Noisy Student Training (B2, L2) | 9.2M | 82.4% | 96.3% |
| EfficientNet-B3 | 12M | 81.7% | 95.7% |
| Noisy Student Training (B3) | 12M | 82.5% | 96.4% |
| Noisy Student Training (B3, L2) | 12M | 84.1% | 96.9% |
| EfficientNet-B4 | 19M | 83.2% | 96.4% |
| Noisy Student Training (B4) | 19M | 84.4% | 97.0% |
| Noisy Student Training (B4, L2) | 19M | 85.3% | 97.5% |
| EfficientNet-B5 | 30M | 84.0% | 96.8% |
| Noisy Student Training (B5) | 30M | 85.1% | 97.3% |
| Noisy Student Training (B5, L2) | 30M | 86.1% | 97.8% |
| EfficientNet-B6 | 43M | 84.5% | 97.0% |
| Noisy Student Training (B6) | 43M | 85.9% | 97.6% |
| Noisy Student Training (B6, L2) | 43M | 86.4% | 97.9% |
| EfficientNet-B7 | 66M | 85.0% | 97.2% |
| Noisy Student Training (B7) | 66M | 86.4% | 97.9% |
| Noisy Student Training (B7, L2) | 66M | 86.9% | 98.1% |
Table 9: Using our best model with $88.4%$ accuracy as the teacher (denoted as Noisy Student Training (X, L2)) leads to more improvements than using the same model as the teacher (denoted as Noisy Student Training (X)). Models smaller than EfficientNet-B5 are trained for 700 epochs (better than training for 350 epochs as used in Study #4 to Study #8). Models other than EfficientNet-B0 use an unlabeled batch size three times the labeled batch size, while other ablation studies set the unlabeled batch size to be the same as the labeled batch size by default for models smaller than B7.
with using $\textstyle{\frac{1}{128}},{\frac{1}{64}},{\frac{1}{32}},{\frac{1}{16}},{\frac{1}{4}}$ of the whole data by uniformly sampling images from the unlabeled set for simplicity, though taking images with the highest confidence may lead to better results. We use EfficientNet-B4 as both the teacher and the student.
As can be seen from Table 10, the performance stays similar when we reduce the data to $\textstyle{\frac{1}{16}}$ of the whole data, which amounts to 8.1M images after duplicating. The performance drops when we further reduce it. Hence, using a large amount of unlabeled data leads to better performance.
Study #3: Hard Pseudo-Label vs. Soft Pseudo-Label on Out-of-domain Data. Unlike previous studies in semi-supervised learning that use in-domain unlabeled data (e.g.,
Table 10: Noisy Student Training’s performance improves with more unlabeled data. Models are trained for 700 epochs without iterative training. The baseline model achieves an accuracy of $83.2%$ .
| Data | 1/128 | 1/64 | 1/32 | 1/16 | 1/4 | 1 |
|---|---|---|---|---|---|---|
| Top-1 Acc. | 83.4% | 83.3% | 83.7% | 83.9% | 83.8% | 84.0% |
CIFAR-10 images as unlabeled data for a small CIFAR-10 training set), to improve ImageNet we must use out-of-domain unlabeled data. Here we compare hard pseudo-labels and soft pseudo-labels for out-of-domain data. Since a teacher model’s confidence on an image can be a good indicator of whether it is an out-of-domain image, we consider the high-confidence images as in-domain images and the low-confidence images as out-of-domain images. We sample 1.3M images in each confidence interval $[0.0,0.1],[0.1,0.2],\cdot\cdot\cdot,[0.9,1.0]$.
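The bucketing step can be sketched as follows; `bucket_by_confidence` and the random toy predictions are hypothetical, assuming confidence is the teacher's max class probability.

```python
import numpy as np

def bucket_by_confidence(probs, n_bins=10):
    """Group example indices into confidence intervals [0, 0.1), ..., [0.9, 1.0]
    using the model's max class probability."""
    conf = probs.max(axis=-1)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    return {b: np.flatnonzero(bins == b) for b in range(n_bins)}

# Toy teacher outputs: softmax over random logits for 1000 "images", 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
e = np.exp(logits - logits.max(-1, keepdims=True))
probs = e / e.sum(-1, keepdims=True)

buckets = bucket_by_confidence(probs)
print(sum(len(v) for v in buckets.values()))  # 1000: every image lands in one bin
```

In the study above, each such interval is then subsampled to the same size (1.3M images) so that the comparison across confidence levels is controlled.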

Figure 5: Soft pseudo labels lead to better performance for low-confidence data (out-of-domain data). Each dot at $p$ represents a Noisy Student Training model trained with 1.3M ImageNet labeled images and 1.3M unlabeled images with confidence scores in $[p,p+0.1]$.
图 5: 软伪标签 (soft pseudo labels) 能提升低置信度数据 (域外数据) 的表现。每个位于 $p$ 的点代表一个使用 130 万张 ImageNet 标注图像和 130 万张置信度分数在 $[p,p+0.1]$ 区间的未标注图像训练的 Noisy Student Training 模型。
We use EfficientNet-B0 as both the teacher model and the student model, and compare Noisy Student Training with soft pseudo labels against hard pseudo labels. The results are shown in Figure 5, with the following observations: (1) soft pseudo labels and hard pseudo labels both lead to significant improvements with in-domain unlabeled images, i.e., high-confidence images; (2) with out-of-domain unlabeled images, hard pseudo labels can hurt performance while soft pseudo labels lead to robust performance.
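The difference between the two targets can be illustrated with a toy cross-entropy computation. This is a sketch with three classes; the actual training uses the teacher's full softmax output over all ImageNet classes.

```python
import math

def soft_pseudo_label_loss(teacher_probs, student_probs):
    """Soft pseudo label: cross-entropy against the teacher's full distribution."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def hard_pseudo_label_loss(teacher_probs, student_probs):
    """Hard pseudo label: cross-entropy against a one-hot target at the
    teacher's argmax class, discarding the teacher's uncertainty."""
    target = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return -math.log(student_probs[target])

# A low-confidence teacher prediction, typical of out-of-domain images:
teacher = [0.4, 0.35, 0.25]
# The hard label trains the student toward [1, 0, 0] even though the
# teacher itself is uncertain; the soft label preserves that uncertainty.
```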
Note that we have also observed that hard pseudo labels can achieve comparable or slightly better results when a larger teacher is employed. Hence, whether soft pseudo labels or hard pseudo labels work better may need to be determined on a case-by-case basis.
Study #4: Student Model's Capacity. Next, we investigate the effect of student models with different capacities. For teacher models, we use EfficientNet-B0, B2, and B4 trained on labeled data, and EfficientNet-B7 trained with Noisy Student Training. For each teacher, we compare a student of the same size with a larger one. The comparison is shown in Table 11. With the same teacher, using a larger student model leads to consistently better performance, showing that sufficient student capacity is important for the student to learn a more powerful model than the teacher.
| Teacher | Teacher Acc. | Student | Student Acc. |
|---|---|---|---|
| B0 | 77.3% | B0 | 77.9% |
| | | B1 | 79.5% |
| B2 | 80.0% | B2 | 80.7% |
| | | B3 | 82.0% |
| B4 | 83.2% | B4 | 84.0% |
| | | B5 | 84.7% |
| B7 | 86.9% | B7 | 86.9% |
| | | L2 | 87.2% |
Table 11: Using a larger student model leads to better performance. Student models are trained for 350 epochs instead of 700 epochs, without iterative training. The B7 teacher with an accuracy of 86.9% is trained by Noisy Student Training with multiple iterations using B7. The comparison between B7 and L2 as student models is not completely fair to L2, since we use an unlabeled batch size of 3× the labeled batch size for training L2, which is not as good as the unlabeled batch size of 7× the labeled batch size used when training B7 (see Study #7 for more details).
Study #5: Data Balancing. Here, we study the necessity of keeping the unlabeled data balanced across categories. As a comparison, we use all unlabeled data with a confidence score higher than 0.3. We present results with EfficientNet-B0 to B3 as the backbone models in Table 12. Using data balancing leads to better performance for the small models EfficientNet-B0 and B1. Interestingly, the gap becomes smaller for larger models such as EfficientNet-B2 and B3, which shows that more powerful models can learn from unbalanced data effectively. To enable Noisy Student Training to work well for all model sizes, we use data balancing by default.
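A minimal sketch of such per-class balancing, assuming the convention of keeping the highest-confidence images for over-represented classes and duplicating images for under-represented ones (the paper mentions duplication in Study #2); `per_class` and `target` are illustrative names.

```python
def balance_classes(per_class, target):
    """per_class: {class: [(image_id, confidence), ...]} of pseudo-labeled
    images.  Keep the `target` highest-confidence images per class and
    duplicate images in classes that have fewer than `target`."""
    balanced = {}
    for cls, items in per_class.items():
        ranked = sorted(items, key=lambda x: x[1], reverse=True)
        if len(ranked) >= target:
            balanced[cls] = ranked[:target]
        else:
            # Duplicate the whole class, then top up with the best images.
            reps, rem = divmod(target, len(ranked))
            balanced[cls] = ranked * reps + ranked[:rem]
    return balanced
```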
Study #6: Joint Training. In our algorithm, we train the model on labeled images and pseudo-labeled images jointly. Here, we also compare with an alternative approach used by Yalniz et al. [93], which first pretrains the model on pseudo-labeled images and then finetunes it on labeled images. For finetuning, we experiment with different numbers of steps and take the best result. The comparison is shown in Table 13.
Table 12: Data balancing leads to better results for small models. Models are trained for 350 epochs instead of 700 epochs, without iterative training.

| Model | B0 | B1 | B2 | B3 |
|---|---|---|---|---|
| Supervised Learning | 77.3% | 79.2% | 80.0% | 81.7% |
| Noisy Student Training | 77.9% | 79.9% | 80.7% | 82.1% |
| Noisy Student Training w/o Data Balancing | 77.6% | 79.6% | 80.6% | 82.1% |
It is clear that joint training significantly outperforms pretraining + finetuning. Note that pretraining only on pseudo-labeled images leads to a much lower accuracy than supervised learning only on labeled data, which suggests that the distribution of unlabeled data is very different from that of labeled data. In this case, joint training leads to a better solution that fits both types of data.
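The distinction between the two schedules can be sketched as below: the generator mixes both data sources in every step, whereas pretraining + finetuning would consume them sequentially. Names and the list-based data representation are illustrative.

```python
def joint_batches(labeled, pseudo, labeled_bs, ratio):
    """Yield (labeled_batch, pseudo_batch) pairs so that every training
    step sees both data sources, with `ratio` unlabeled images per
    labeled image (the ratio studied in Study #7)."""
    unlabeled_bs = labeled_bs * ratio
    steps = min(len(labeled) // labeled_bs, len(pseudo) // unlabeled_bs)
    for i in range(steps):
        yield (labeled[i * labeled_bs:(i + 1) * labeled_bs],
               pseudo[i * unlabeled_bs:(i + 1) * unlabeled_bs])
```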
Table 13: Joint training works better than pretraining and finetuning. We vary the number of finetuning steps and report the best results. Models are trained for 350 epochs instead of 700 epochs, without iterative training.

| Model | B0 | B1 | B2 | B3 |
|---|---|---|---|---|
| Supervised Learning | 77.3% | 79.2% | 80.0% | 81.7% |
| Pretraining | 72.6% | 75.1% | 75.9% | 76.5% |
| Pretraining + Finetuning | 77.5% | 79.4% | 80.3% | 81.7% |
| Joint Training | 77.9% | 79.9% | 80.7% | 82.1% |
Study #7: Ratio between Unlabeled Batch Size and Labeled Batch Size. Since we use 130M unlabeled images and 1.3M labeled images, if the batch sizes for unlabeled and labeled data are the same, the model trains on the unlabeled data for only one epoch for every hundred epochs on the labeled data. Ideally, we would like the model to be trained on the unlabeled data for more epochs by using a larger unlabeled batch size, so that it can fit the unlabeled data better. Hence we study the importance of the ratio between the unlabeled batch size and the labeled batch size.
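The arithmetic behind this can be made explicit. A sketch, where `ratio` is the unlabeled-to-labeled batch size ratio and epochs are counted as full passes over each set:

```python
def unlabeled_epochs(labeled_epochs, n_labeled, n_unlabeled, ratio):
    """Passes over the unlabeled set when every training step consumes
    `ratio` unlabeled images per labeled image."""
    return labeled_epochs * ratio * n_labeled / n_unlabeled

# With 130M unlabeled and 1.3M labeled images at a 1:1 ratio,
# 100 labeled epochs correspond to a single pass over the unlabeled set;
# raising the ratio raises the number of unlabeled passes proportionally.
```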
In this study, we try a medium-sized model, EfficientNet-B4, as well as a larger model, EfficientNet-L2. We use models of the same size as both the teacher and the student. As shown in Table 14, the larger model EfficientNet-L2 benefits from a large ratio while the smaller model EfficientNet-B4 does not. For a large model, using a larger ratio between the unlabeled batch size and the labeled batch size leads to substantially better performance.
Table 14: With a fixed labeled batch size, a larger unlabeled batch size leads to better performance for EfficientNet-L2. The batch size ratio denotes the ratio between the unlabeled batch size and the labeled batch size.

| Teacher (Acc.) | Batch Size Ratio | Top-1 Acc. |
|---|---|---|
| B4 (83.2) | 1:1 | 84.0% |
| B4 (83.2) | 3:1 | 84.0% |
| L2 (87.0) | 1:1 | 86.7% |
| L2 (87.0) | 3:1 | 87.4% |
| L2 (87.4) | 3:1 | 87.4% |
| L2 (87.4) | 6:1 | 87.9% |
Study #8: Warm-starting the Student Model. Lastly, one might wonder whether we should train the student model from scratch when it can be initialized with a converged teacher model of good accuracy. In this ablation, we first train an EfficientNet-B0 model on ImageNet and use it to initialize the student model. We vary the number of epochs for training the student and use the same exponential-decay learning rate schedule. Training starts at a different learning rate in each experiment so that the learning rate is decayed to the same final value in all experiments. As shown in Table 15, the accuracy drops significantly when we reduce the number of training epochs from 350 to 70, and drops only slightly when it is reduced to 280 or 140. Hence, the student still needs to be trained for a large number of epochs even with warm-starting.
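The schedule alignment can be sketched as follows. The decay factor and interval below are illustrative values, not necessarily the ones used in the paper; the point is only that shorter runs start at a lower learning rate so that all runs decay to the same final value.

```python
def initial_lr(final_lr, decay_rate, decay_every, total_epochs):
    """Choose the starting learning rate so that exponential decay
    (multiply by `decay_rate` every `decay_every` epochs) ends at the
    same `final_lr` regardless of the total number of epochs."""
    n_decays = total_epochs / decay_every
    return final_lr / (decay_rate ** n_decays)

# Self-check: every run length decays to the same final learning rate.
for epochs in (70, 140, 280, 350):
    lr0 = initial_lr(1e-4, 0.97, 2.4, epochs)
    assert abs(lr0 * 0.97 ** (epochs / 2.4) - 1e-4) < 1e-12
```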
Further, we also observe that a student initialized with the teacher can sometimes get stuck in a local optimum. For example, when we use an EfficientNet-B7 with an accuracy of 86.4% as the teacher, the student model initialized with the teacher reaches an accuracy of 86.4% halfway through training but gets stuck there when trained for 210 epochs, while a model trained from scratch achieves an accuracy of 86.9%. Hence, though we can save training time by warm-starting, we train our models from scratch to ensure the best performance.
| Initialization | Teacher | Teacher | Teacher | Teacher | From scratch |
|---|---|---|---|---|---|
| Epochs | 35 | 70 | 140 | 280 | 350 |
| Top-1 Acc. | 77.4% | 77.5% | 77.7% | 77.8% | 77.9% |
Table 15: A student initialized with the teacher still requires at least 140 training epochs to perform well. The baseline model, trained with labeled data only, has an accuracy of 77.3%.
A.3. Results with a Different Architecture and Dataset
Results with ResNet-50. To study whether other architectures can benefit from Noisy Student Training, we conduct experiments with ResNet-50 [30]. We use the full ImageNet as the labeled data and the 130M images from JFT as the unlabeled data. We train a ResNet-50 model on ImageNet and use it as our teacher model. We use RandAugment with the magnitude set to 9 as the noise.
The results are shown in Table 16. Noisy Student Training improves the baseline model by 1.3%, which shows that Noisy Student Training is effective for architectures other than EfficientNet.
Table 16: Experiments on ResNet-50.

| Method | Top-1 Acc. | Top-5 Acc. |
|---|---|---|
| ResNet-50 | 77.6% | 93.8% |
| Noisy Student Training (ResNet-50) | 78.9% | 94.3% |
Results on SVHN. We also evaluate Noisy Student Training on a smaller dataset SVHN. We use the core set with 73K images as the training set and the validation set. The extra set with 531K images are used as the unlabeled set. We use Efficient Net-B0 with strides of the second and the third blocks set to 1 so that the final feature map is $4\mathrm{x4}$ when the input image size is $32\mathbf{x}32$ .
SVHN上的结果。我们还在较小数据集SVHN上评估了噪声学生训练(Noisy Student Training)方法。使用包含7.3万张图像的核心集作为训练集和验证集,额外包含53.1万张图像的扩展集作为未标注集。采用EfficientNet-B0架构,并将第二、第三模块的步长(stride)设为1,使得输入图像尺寸为$32\mathbf{x}32$时最终特征图尺寸为$4\mathrm{x4}$。
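The spatial arithmetic can be checked with a quick sketch. The stride layout below is our reading of EfficientNet-B0 (a stride-2 stem followed by seven blocks) and should be verified against a reference implementation.

```python
import math
from functools import reduce

# Stem stride followed by the first-layer stride of each of the seven
# EfficientNet-B0 blocks (assumed layout; verify against the reference).
B0_STRIDES = [2, 1, 2, 2, 2, 1, 2, 1]

def feature_map_size(input_size, strides):
    """Final spatial size with 'same' padding: each stride-s stage maps
    n -> ceil(n / s)."""
    return reduce(lambda n, s: math.ceil(n / s), strides, input_size)

# SVHN variant: set the strides of the second and third blocks to 1,
# so a 32x32 input yields a 4x4 final feature map instead of 1x1.
svhn_strides = list(B0_STRIDES)
svhn_strides[2] = svhn_strides[3] = 1
```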
As shown in Table 17, Noisy Student Training improves the baseline accuracy from 98.1% to 98.6% and outperforms the previous state-of-the-art results achieved by RandAugment with Wide-ResNet-28-10.
Table 17: Results on SVHN.

| Method | Accuracy |
|---|---|
| RandAugment (WRN-28-10) | 98.3% |
| RandAugment (EfficientNet-B0) | 98.1% |
| Noisy Student Training (EfficientNet-B0) | 98.6% |
A.4. Results on YFCC100M
Since JFT is not a public dataset, we also experiment with a public unlabeled dataset, YFCC100M [85], so that researchers can make fair comparisons with our results. Similar to the setting in Section 3.2, we experiment with different model sizes, without iterative training. We use the same model for both the teacher and the student, and the same hyperparameters for JFT and YFCC100M. As in the JFT case, we first filter out images that appear in the ImageNet validation set. We then filter out low-confidence images according to B0's predictions and keep only the top 130K images for each class according to the top-1 predicted class. The resulting set has 34M images, since there are not enough images for most classes. We then balance the dataset and increase it to 130M images. As a comparison, before the data balancing stage, there are 81M images in JFT.
As shown in Table 18, Noisy Student Training also leads to significant improvements with YFCC100M, though it achieves better performance with JFT. The performance difference is probably due to the difference in dataset size.
Table 18: Results using YFCC100M and JFT as the unlabeled dataset.

| Model | Dataset | Top-1 Acc. | Top-5 Acc. |
|---|---|---|---|
| EfficientNet-B0 | | 77.3% | 93.4% |
| Noisy Student Training (B0) | YFCC | 79.9% | 95.0% |
| Noisy Student Training (B0) | JFT | 78.1% | 94.2% |
| EfficientNet-B1 | | 79.2% | 94.4% |
| Noisy Student Training (B1) | YFCC | 79.9% | 95.0% |
| Noisy Student Training (B1) | JFT | 80.2% | 95.2% |
| EfficientNet-B2 | | 80.0% | 94.9% |
| Noisy Student Training (B2) | YFCC | 81.0% | 95.6% |
| Noisy Student Training (B2) | JFT | 81.1% | 95.5% |
| EfficientNet-B3 | | 81.7% | 95.7% |
| Noisy Student Training (B3) | YFCC | 82.3% | 96.2% |
| Noisy Student Training (B3) | JFT | 82.5% | 96.4% |
| EfficientNet-B4 | | 83.2% | 96.4% |
| Noisy Student Training (B4) | YFCC | 84.2% | 96.9% |
| Noisy Student Training (B4) | JFT | 84.4% | 97.0% |
| EfficientNet-B5 | | 84.0% | 96.8% |
| Noisy Student Training (B5) | YFCC | 85.0% | 97.2% |
| Noisy Student Training (B5) | JFT | 85.1% | 97.3% |
| EfficientNet-B6 | | 84.5% | 97.0% |
| Noisy Student Training (B6) | YFCC | 85.4% | 97.5% |
| Noisy Student Training (B6) | JFT | 85.6% | 97.6% |
| EfficientNet-B7 | | 85.0% | 97.2% |
| Noisy Student Training (B7) | YFCC | 86.2% | 97.9% |
| Noisy Student Training (B7) | JFT | 86.4% | 97.9% |
A.5. Details of Robustness Benchmarks
Metrics. For completeness, we provide brief descriptions of the metrics used in the robustness benchmarks ImageNet-A, ImageNet-C, and ImageNet-P.
• ImageNet-A. The top-1 and top-5 accuracy are measured on the 200 classes that ImageNet-A includes. The mapping from these 200 classes to the original ImageNet classes is available online.
• ImageNet-C. mCE (mean corruption error) is the weighted average of the error rates on different corruptions, with AlexNet's error rate as a baseline. The score is normalized by AlexNet's error rate so that corruptions of different difficulties lead to scores on a similar scale. Please refer to [31] for details about mCE and AlexNet's error rate. The top-1 accuracy is simply the average top-1 accuracy over all corruptions and all severity degrees. The top-1 accuracies of prior methods are computed from their reported corruption errors on each corruption.
• ImageNet-P. The flip probability is the probability that the model changes its top-1 prediction under different perturbations. mFR (mean flip rate) is the weighted average of the flip probabilities on different perturbations, with AlexNet's flip probability as a baseline. Please refer to [31] for details about mFR and AlexNet's flip probability. The top-1 accuracy reported in this paper is the average accuracy over all images included in ImageNet-P.
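A sketch of the AlexNet normalization behind mCE and mFR, assuming per-corruption error rates summed over severities (see [31] for the exact definitions); the dictionaries and function names are illustrative.

```python
def mean_corruption_error(model_errs, alexnet_errs):
    """mCE: for each corruption type, sum the error rates over severities
    and normalize by AlexNet's summed error rate; then average the
    normalized scores over corruption types, expressed as a percentage."""
    ces = [sum(model_errs[c]) / sum(alexnet_errs[c]) for c in model_errs]
    return 100.0 * sum(ces) / len(ces)

def mean_flip_rate(model_flips, alexnet_flips):
    """mFR: flip probabilities normalized by AlexNet's, averaged over
    perturbation types, expressed as a percentage."""
    frs = [model_flips[p] / alexnet_flips[p] for p in model_flips]
    return 100.0 * sum(frs) / len(frs)
```

The normalization is what makes easy and hard corruptions contribute on a comparable scale: a model matching AlexNet everywhere scores exactly 100.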
On Using RandAugment for ImageNet-C and ImageNet-P. Since Noisy Student Training leads to significant improvements on ImageNet-C and ImageNet-P, we briefly discuss the influence of RandAugment on the robustness results. First, note that our supervised baseline EfficientNet-L2 also uses RandAugment, and Noisy Student Training leads to significant improvements over this baseline, as shown in Table 4 and Table 5.
Second, the overlap between the transformations in RandAugment and the corruptions and perturbations in ImageNet-C and ImageNet-P is small. For completeness, we list the transformations in RandAugment and the corruptions and perturbations in ImageNet-C and ImageNet-P here:
The main overlaps between RandAugment and ImageNet-C are Contrast, Brightness, and Sharpness. Among them, the Contrast and Brightness augmentations are also used in ResNeXt-101 WSL [55] and in vision models that use the Inception preprocessing [34, 80]. The overlap between RandAugment and ImageNet-P includes Brightness, Translate, and Rotate.
