[Paper Translation] Self-training with Noisy Student improves ImageNet classification


Original paper: https://arxiv.org/pdf/1911.04252v4


Self-training with Noisy Student improves ImageNet classification


Qizhe Xie∗1, Minh-Thang Luong1, Eduard Hovy2, Quoc V. Le1
1Google Research, Brain Team, 2Carnegie Mellon University
{qizhex, thangluong, qvl}@google.com, hovy@cmu.edu


Abstract


We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.

Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment so that the student generalizes better than the teacher.

1. Introduction


Deep learning has shown remarkable successes in image recognition in recent years [45, 80, 75, 30, 83]. However, state-of-the-art (SOTA) vision models are still trained with supervised learning, which requires a large corpus of labeled images to work well. By showing the models only labeled images, we limit ourselves from making use of unlabeled images, available in much larger quantities, to improve the accuracy and robustness of SOTA models.

Here, we use unlabeled images to improve the SOTA ImageNet accuracy and show that the accuracy gain has an outsized impact on robustness (out-of-distribution generalization). For this purpose, we use a much larger corpus of unlabeled images, where a large fraction of images do not belong to ImageNet training set distribution (i.e., they do not belong to any category in ImageNet). We train our model with Noisy Student Training, a semi-supervised learning approach, which has three main steps: (1) train a teacher model on labeled images, (2) use the teacher to generate pseudo labels on unlabeled images, and (3) train a student model on the combination of labeled images and pseudo labeled images. We iterate this algorithm a few times by treating the student as a teacher to relabel the unlabeled data and training a new student.


Noisy Student Training improves self-training and distillation in two ways. First, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset. Second, it adds noise to the student so the noised student is forced to learn harder from the pseudo labels. To noise the student, we use input noise such as RandAugment data augmentation [18] and model noise such as dropout [76] and stochastic depth [37] during training.


Using Noisy Student Training, together with 300M unlabeled images, we improve EfficientNet's [83] ImageNet top-1 accuracy to 88.4%. This accuracy is 2.0% better than the previous SOTA result, which requires 3.5B weakly labeled Instagram images. Not only does our method improve standard ImageNet accuracy, it also improves classification robustness on much harder test sets by large margins: ImageNet-A [32] top-1 accuracy from 61.0% to 83.7%, ImageNet-C [31] mean corruption error (mCE) from 45.7 to 28.3 and ImageNet-P [31] mean flip rate (mFR) from 27.8 to 12.2. Our main results are shown in Table 1.

| | ImageNet top-1 acc. | ImageNet-A top-1 acc. | ImageNet-C mCE | ImageNet-P mFR |
| --- | --- | --- | --- | --- |
| Prev. SOTA | 86.4% | 61.0% | 45.7 | 27.8 |
| Ours | 88.4% | 83.7% | 28.3 | 12.2 |

Table 1: Summary of key results compared to previous state-of-the-art models [86, 55]. Lower is better for mean corruption error (mCE) and mean flip rate (mFR).


2. Noisy Student Training


Algorithm 1 gives an overview of Noisy Student Training. The inputs to the algorithm are both labeled and unlabeled images. We use the labeled images to train a teacher model using the standard cross entropy loss. We then use the teacher model to generate pseudo labels on unlabeled images. The pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution). We then train a student model which minimizes the combined cross entropy loss on both labeled images and unlabeled images. Finally, we iterate the process by putting back the student as a teacher to generate new pseudo labels and train a new student. The algorithm is also illustrated in Figure 1.


Require: Labeled images $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and unlabeled images $\{\tilde{x}_1, \ldots, \tilde{x}_m\}$.

1: Learn a teacher model $\theta_*^t$ which minimizes the cross entropy loss on labeled images

$$
\frac{1}{n}\sum_{i=1}^{n}\ell(y_i, f^{\text{noised}}(x_i, \theta^t))
$$

2: Use a normal (i.e., not noised) teacher model to generate soft or hard pseudo labels for clean (i.e., not distorted) unlabeled images

$$
\tilde{y}_i = f(\tilde{x}_i, \theta_*^t), \quad \forall i = 1, \cdots, m
$$

3: Learn an equal-or-larger student model $\theta_*^s$ which minimizes the cross entropy loss on labeled images and unlabeled images with noise added to the student model

$$
\frac{1}{n}\sum_{i=1}^{n}\ell(y_i, f^{\text{noised}}(x_i, \theta^s)) + \frac{1}{m}\sum_{i=1}^{m}\ell(\tilde{y}_i, f^{\text{noised}}(\tilde{x}_i, \theta^s))
$$

4: Iterative training: Use the student as a teacher and go back to step 2.

Algorithm 1: Noisy Student Training.
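
To make the control flow of Algorithm 1 concrete, here is a minimal runnable sketch in PyTorch. This is our illustration, not the authors' implementation: tiny stand-in models and random tensors replace the EfficientNets and the ImageNet/JFT data, a Gaussian perturbation stands in for RandAugment, and dropout supplies the model noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width):   # stand-in for an EfficientNet; Dropout = model noise
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, width),
                         nn.ReLU(), nn.Dropout(0.5), nn.Linear(width, 10))

def input_noise(x):      # stand-in for RandAugment-style input noise
    return x + 0.1 * torch.randn_like(x)

def train_teacher(model, x, y, steps=200):
    """Step 1: standard cross entropy on labeled images."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(input_noise(x)), y).backward()
        opt.step()
    return model

def pseudo_label(teacher, x_u):
    """Step 2: un-noised teacher, clean inputs, soft pseudo labels."""
    teacher.eval()                        # disables dropout (no model noise)
    with torch.no_grad():
        return F.softmax(teacher(x_u), dim=-1)

def train_student(student, x, y, x_u, q_u, steps=200):
    """Step 3: noised student, CE on labeled plus soft-label CE on unlabeled."""
    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    student.train()                       # dropout active = model noise
    for _ in range(steps):
        loss = F.cross_entropy(student(input_noise(x)), y)
        logp = F.log_softmax(student(input_noise(x_u)), dim=-1)
        loss = loss - (q_u * logp).sum(-1).mean()   # soft pseudo-label CE
        opt.zero_grad(); loss.backward(); opt.step()
    return student

x, y = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
x_u = torch.randn(256, 3, 32, 32)        # "unlabeled" images

teacher = train_teacher(make_model(64), x, y)
for _ in range(3):                       # Step 4: iterate
    q_u = pseudo_label(teacher, x_u)
    teacher = train_student(make_model(128), x, y, x_u, q_u)  # larger student
```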


Figure 1: Illustration of Noisy Student Training. (All shown images are from ImageNet.)

The algorithm is an improved version of self-training, a method in semi-supervised learning (e.g., [71, 96]), and distillation [33]. More discussions on how our method is related to prior works are included in Section 5.


Our key improvements lie in adding noise to the student and using student models that are not smaller than the teacher. This makes our method different from Knowledge Distillation [33] where 1) noise is often not used and 2) a smaller student model is often used to be faster than the teacher. One can think of our method as knowledge expansion in which we want the student to be better than the teacher by giving the student model enough capacity and difficult environments in terms of noise to learn through.


Noising Student – When the student is deliberately noised, it is trained to be consistent with the teacher, which is not noised when it generates pseudo labels. In our experiments, we use two types of noise: input noise and model noise. For input noise, we use data augmentation with RandAugment [18]. For model noise, we use dropout [76] and stochastic depth [37].

When applied to unlabeled data, noise has the important benefit of enforcing invariances in the decision function on both labeled and unlabeled data. First, data augmentation is an important noising method in Noisy Student Training because it forces the student to ensure prediction consistency across augmented versions of an image (similar to UDA [91]). Specifically, in our method, the teacher produces high-quality pseudo labels by reading in clean images, while the student is required to reproduce those labels with augmented images as input. For example, the student must ensure that a translated version of an image has the same category as the original image. Second, when dropout and stochastic depth are used as noise, the teacher behaves like an ensemble at inference time (when it generates pseudo labels), whereas the student behaves like a single model. In other words, the student is forced to mimic a more powerful ensemble model. We present an ablation study on the effects of noise in Section 4.1.

Other Techniques – Noisy Student Training also works better with an additional trick: data filtering and balancing, similar to [91, 93]. Specifically, we filter images on which the teacher model has low confidence, since they are usually out-of-domain images. To ensure that the distribution of the unlabeled images matches that of the training set, we also need to balance the number of unlabeled images for each class, as all classes in ImageNet have a similar number of labeled images. For this purpose, we duplicate images in classes where there are not enough images. For classes where we have too many images, we take the images with the highest confidence.

Finally, we emphasize that our method can be used with soft or hard pseudo labels as both work well in our experiments. Soft pseudo labels, in particular, work slightly better for out-of-domain unlabeled data. Thus in the following, for consistency, we report results with soft pseudo labels unless otherwise indicated.

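
As a small illustration of the two label types, the sketch below derives both from the same teacher outputs. This is our example with random stand-in logits, not the paper's code:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 1000)            # stand-in for teacher outputs on a batch

soft_labels = F.softmax(logits, dim=-1)  # soft: the full continuous distribution
hard_labels = F.one_hot(logits.argmax(dim=-1),
                        num_classes=1000).float()  # hard: one-hot of the argmax

# Either target plugs into the same soft-label cross entropy for the student:
# loss = -(targets * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
```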

Comparisons with Existing SSL Methods. Apart from self-training, another important line of work in semi-supervised learning [12, 103] is based on consistency training [5, 64, 47, 84, 56, 91, 8] and pseudo labeling [48, 39, 73, 1]. Although they have produced promising results, in our preliminary experiments, methods based on consistency regularization and pseudo labeling work less well on ImageNet. Instead of using a teacher model trained on labeled data to generate pseudo labels, these methods do not have a separate teacher model and use the model being trained to generate pseudo labels. In the early phase of training, the model being trained has low accuracy and high entropy, hence consistency training regularizes the model towards high entropy predictions and prevents it from achieving good accuracy. A common workaround is to use entropy minimization, to filter examples with low confidence, or to ramp up the consistency loss. However, the additional hyperparameters introduced by the ramp-up schedule, confidence-based filtering and entropy minimization make these methods more difficult to use at scale. The self-training / teacher-student framework is better suited for ImageNet because we can train a good teacher on ImageNet using labeled data.

3. Experiments


In this section, we will first describe our experiment details. We will then present our ImageNet results compared with those of state-of-the-art models. Lastly, we demonstrate the surprising improvements of our models on robustness datasets (such as ImageNet-A, C and P) as well as under adversarial attacks.


3.1. Experiment Details


Labeled dataset. We conduct experiments on the ImageNet 2012 ILSVRC challenge prediction task since it is considered one of the most heavily benchmarked datasets in computer vision, and improvements on ImageNet transfer to other datasets [44, 66].

Unlabeled dataset. We obtain unlabeled images from the JFT dataset [33, 15], which has around 300M images. Although the images in the dataset have labels, we ignore the labels and treat them as unlabeled data. We filter the ImageNet validation set images from the dataset (see [58]).


We then perform data filtering and balancing on this corpus. First, we run an EfficientNet-B0 trained on ImageNet [83] over the JFT dataset [33, 15] to predict a label for each image. We then select images for which the confidence of the label is higher than 0.3. For each class, we select at most 130K images that have the highest confidence. Finally, for classes that have less than 130K images, we duplicate some images at random so that each class can have 130K images. Hence the total number of images that we use for training a student model is 130M (with some duplicated images). Due to duplications, there are only 81M unique images among these 130M images. We do not tune these hyperparameters extensively since our method is highly robust to them.
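
The filtering and balancing steps can be summarized in a short sketch. This is our illustration, not the released pipeline; `image_ids` and `probs` are assumed to hold precomputed teacher predictions:

```python
import numpy as np

def filter_and_balance(image_ids, probs, threshold=0.3, per_class=130_000):
    """Select confident images, cap each class at `per_class` keeping the most
    confident ones, and duplicate random images in under-filled classes."""
    labels, conf = probs.argmax(-1), probs.max(-1)
    selected = {}
    for c in range(probs.shape[-1]):
        idx = np.where((labels == c) & (conf > threshold))[0]
        idx = idx[np.argsort(-conf[idx])][:per_class]     # most confident first
        if 0 < len(idx) < per_class:                      # duplicate to fill up
            idx = np.concatenate([idx, np.random.choice(idx, per_class - len(idx))])
        selected[c] = image_ids[idx]
    return selected

# toy usage: 1,000 images, 10 classes, a cap of 50 images per class
ids = np.arange(1_000)
probs = np.random.dirichlet(np.ones(10), size=1_000)
balanced = filter_and_balance(ids, probs, per_class=50)
```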

To enable fair comparisons with our results, we also experiment with a public dataset YFCC100M [85] and show the results in Appendix A.4.


Architecture. We use EfficientNets [83] as our baseline models because they provide better capacity for more data. In our experiments, we also further scale up EfficientNet-B7 and obtain EfficientNet-L2. EfficientNet-L2 is wider and deeper than EfficientNet-B7 but uses a lower resolution, which gives it more parameters to fit a large number of unlabeled images. Due to the large model size, the training time of EfficientNet-L2 is approximately five times the training time of EfficientNet-B7. For more information about EfficientNet-L2, please refer to Table 8 in Appendix A.1.

Training details. For labeled images, we use a batch size of 2048 by default and reduce the batch size when we cannot fit the model into memory. We find that using a batch size of 512, 1024, or 2048 leads to the same performance. We determine the number of training steps and the learning rate schedule by the batch size for labeled images. Specifically, we train student models larger than EfficientNet-B4, including EfficientNet-L2, for 350 epochs, and train smaller student models for 700 epochs. The learning rate starts at 0.128 for a labeled batch size of 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs.
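
The schedule described above is a staircase exponential decay; a small sketch of it (ours, not the released code):

```python
def learning_rate(epoch, total_epochs, base_lr=0.128):
    """Start at 0.128 and multiply by 0.97 every 2.4 epochs (350-epoch runs)
    or every 4.8 epochs (700-epoch runs), as described in the text above."""
    decay_every = 2.4 if total_epochs == 350 else 4.8
    return base_lr * 0.97 ** (epoch // decay_every)

print(learning_rate(0, 350))    # 0.128
print(learning_rate(12, 350))   # decayed 5 times
print(learning_rate(12, 700))   # decayed only 2 times
```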

We use a large batch size for unlabeled images, especially for large models, to make full use of the large quantity of unlabeled images. Labeled images and unlabeled images are concatenated together to compute the average cross entropy loss. We apply the recently proposed technique of fixing the train-test resolution discrepancy [86] for EfficientNet-L2: we first perform normal training with a smaller resolution for 350 epochs, then finetune the model with a larger resolution for 1.5 epochs on unaugmented labeled images. Similar to [86], we fix the shallow layers during finetuning.
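
A sketch of this concatenated-batch loss, matching the objective in Algorithm 1 (our illustration; `student` is any classifier returning logits):

```python
import torch
import torch.nn.functional as F

def combined_loss(student, x_l, y_l, x_u, q_u):
    """Average cross entropy over the labeled batch plus the (typically much
    larger) pseudo-labeled batch, computed in one forward pass over the
    concatenated inputs."""
    n = x_l.size(0)
    targets = torch.cat([F.one_hot(y_l, q_u.size(-1)).float(), q_u])  # hard + soft
    logp = F.log_softmax(student(torch.cat([x_l, x_u])), dim=-1)
    ce = -(targets * logp).sum(-1)
    return ce[:n].mean() + ce[n:].mean()   # (1/n) sum labeled + (1/m) sum unlabeled
```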

Our largest model, EfficientNet-L2, needs to be trained for 6 days on a Cloud TPU v3 Pod, which has 2048 cores, if the unlabeled batch size is 14x the labeled batch size.

Table 2: Top-1 and Top-5 Accuracy of Noisy Student Training and previous state-of-the-art methods on ImageNet. EfficientNet-L2 with Noisy Student Training is the result of iterative training for multiple iterations by putting back the student model as the new teacher. It has a better tradeoff in terms of accuracy and model size compared to previous state-of-the-art models. †: Big Transfer is a concurrent work that performs transfer learning from the JFT dataset.

| Method | # Params | Extra Data | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- | --- |
| ResNet-50 [30] | 26M | – | 76.0% | 93.0% |
| ResNet-152 [30] | 60M | – | 77.8% | 93.8% |
| DenseNet-264 [36] | 34M | – | 77.9% | 93.9% |
| Inception-v3 [81] | 24M | – | 78.8% | 94.4% |
| Xception [15] | 23M | – | 79.0% | 94.5% |
| Inception-v4 [79] | 48M | – | 80.0% | 95.0% |
| Inception-resnet-v2 [79] | 56M | – | 80.1% | 95.1% |
| ResNeXt-101 [92] | 84M | – | 80.9% | 95.6% |
| PolyNet [100] | 92M | – | 81.3% | 95.8% |
| SENet [35] | 146M | – | 82.7% | 96.2% |
| NASNet-A [104] | 89M | – | 82.7% | 96.2% |
| AmoebaNet-A [65] | 87M | – | 82.8% | 96.1% |
| PNASNet [50] | 86M | – | 82.9% | 96.2% |
| AmoebaNet-C [17] | 155M | – | 83.5% | 96.5% |
| GPipe [38] | 557M | – | 84.3% | 97.0% |
| EfficientNet-B7 [83] | 66M | – | 85.0% | 97.2% |
| EfficientNet-L2 [83] | 480M | – | 85.5% | 97.5% |
| ResNet-50 Billion-scale [93] | 26M | 3.5B images labeled with tags | 81.2% | 96.0% |
| ResNeXt-101 Billion-scale [93] | 193M | 3.5B images labeled with tags | 84.8% | – |
| ResNeXt-101 WSL [55] | 829M | 3.5B images labeled with tags | 85.4% | 97.6% |
| FixRes ResNeXt-101 WSL [86] | 829M | 3.5B images labeled with tags | 86.4% | 98.0% |
| Big Transfer (BiT-L) [43]† | 928M | 300M weakly labeled images from JFT | 87.5% | 98.5% |
| Noisy Student Training (EfficientNet-L2) | 480M | 300M unlabeled images from JFT | 88.4% | 98.7% |

Noise. We use stochastic depth [37], dropout [76], and RandAugment [18] to noise the student. The hyperparameters for these noise functions are the same for EfficientNet-B7 and L2. In particular, we set the survival probability in stochastic depth to 0.8 for the final layer and follow the linear decay rule for other layers. We apply dropout to the final layer with a dropout rate of 0.5. For RandAugment, we apply two random operations with magnitude set to 27.
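
A sketch of the linear decay rule for the stochastic depth survival probabilities (our illustration of the rule from [37] with the 0.8 final-layer value above):

```python
def survival_probabilities(num_layers, final_survival=0.8):
    """Linear decay rule for stochastic depth [37]: the first layer always
    survives, the final layer survives with probability 0.8, and layers in
    between interpolate linearly."""
    return [1.0 - (i / (num_layers - 1)) * (1.0 - final_survival)
            for i in range(num_layers)]

print(survival_probabilities(5))  # approx. [1.0, 0.95, 0.9, 0.85, 0.8]

# Input noise, if using torchvision's RandAugment as a stand-in (the paper
# uses its own RandAugment [18], so magnitude scales may differ):
# transform = torchvision.transforms.RandAugment(num_ops=2, magnitude=27)
```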

Iterative training. The best model in our experiments is the result of three iterations of putting back the student as the new teacher. We first trained an EfficientNet-B7 on ImageNet as the teacher model. Then, using the B7 model as the teacher, we trained an EfficientNet-L2 model with the unlabeled batch size set to 14 times the labeled batch size. Then, we trained a new EfficientNet-L2 model with the EfficientNet-L2 model as the teacher. Lastly, we iterated again and used an unlabeled batch size of 28 times the labeled batch size. The detailed results of the three iterations are available in Section 4.2.

3.2. ImageNet Results


We first report the validation set accuracy on the ImageNet 2012 ILSVRC challenge prediction task as commonly done in the literature [45, 80, 30, 83] (see also [66]). As shown in Table 2, EfficientNet-L2 with Noisy Student Training achieves 88.4% top-1 accuracy, which is significantly better than the best reported accuracy on EfficientNet of 85.0%. The total gain of 3.4% comes from two sources: making the model larger (+0.5%) and Noisy Student Training (+2.9%). In other words, Noisy Student Training has a much larger impact on the accuracy than changing the architecture.

Further, Noisy Student Training outperforms the state-of-the-art accuracy of 86.4% by FixRes ResNeXt-101 WSL [55, 86], which requires 3.5 billion Instagram images labeled with tags. As a comparison, our method only requires 300M unlabeled images, which are perhaps easier to collect. Our model also has approximately half as many parameters as FixRes ResNeXt-101 WSL.

Model size study: Noisy Student Training for EfficientNet B0-B7 without iterative training. In addition to improving state-of-the-art results, we conduct experiments to verify whether Noisy Student Training can benefit other EfficientNet models. In previous experiments, iterative training was used to optimize the accuracy of EfficientNet-L2, but here we skip it as it is difficult to use iterative training for many experiments. We vary the model size from EfficientNet-B0 to EfficientNet-B7 [83] and use the same model as both the teacher and the student. We apply RandAugment to all EfficientNet baselines, leading to more competitive baselines. We set the unlabeled batch size to be three times the batch size of labeled images for all model sizes except EfficientNet-B0, for which we set the unlabeled batch size to be the same as the labeled batch size. As shown in Figure 2, Noisy Student Training leads to a consistent improvement of around 0.8% for all model sizes. Overall, EfficientNets with Noisy Student Training provide a much better tradeoff between model size and accuracy than prior works. The results also confirm that vision models can benefit from Noisy Student Training even without iterative training.


Figure 2: Noisy Student Training leads to significant improvements across all model sizes. We use the same architecture for the teacher and the student and do not perform iterative training.


3.3. Robustness Results on ImageNet-A, ImageNet-C and ImageNet-P

We evaluate our best model, which achieves 88.4% top-1 accuracy, on three robustness test sets: ImageNet-A, ImageNet-C and ImageNet-P. The ImageNet-C and ImageNet-P test sets [31] include images with common corruptions and perturbations such as blurring, fogging, rotation and scaling. The ImageNet-A test set [32] consists of difficult images that cause significant drops in accuracy for state-of-the-art models. These test sets are considered "robustness" benchmarks because the test images are either much harder (for ImageNet-A) or different from the training images (for ImageNet-C and P).

Table 3: Robustness results on ImageNet-A.

| Method | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- |
| ResNet-101 [32] | 4.7% | – |
| ResNeXt-101 [32] (32x4d) | 5.9% | – |
| ResNet-152 [32] | 6.1% | – |
| ResNeXt-101 [32] (64x4d) | 7.3% | – |
| DPN-98 [32] | 9.4% | – |
| ResNeXt-101+SE [32] (32x4d) | 14.2% | – |
| ResNeXt-101 WSL [55, 59] | 61.0% | – |
| EfficientNet-L2 | 49.6% | 78.6% |
| Noisy Student Training (L2) | 83.7% | 95.2% |

Table 4: Robustness results on ImageNet-C. mCE is the weighted average of error rate on different corruptions, with AlexNet’s error rate as a baseline (lower is better).

| Method | Res. | Top-1 Acc. | mCE |
| --- | --- | --- | --- |
| ResNet-50 [31] | 224 | 39.0% | 76.7 |
| SIN [23] | 224 | 45.2% | 69.3 |
| Patch Gaussian [51] | 299 | 52.3% | 60.4 |
| ResNeXt-101 WSL [55, 59] | 224 | – | 45.7 |
| EfficientNet-L2 | 224 | 62.6% | 47.5 |
| Noisy Student Training (L2) | 224 | 76.5% | 30.0 |
| EfficientNet-L2 | 299 | 66.6% | 42.5 |
| Noisy Student Training (L2) | 299 | 77.8% | 28.3 |

Table 5: Robustness results on ImageNet-P, where images are generated with a sequence of perturbations. mFR measures the model’s probability of flipping predictions under perturbations with AlexNet as a baseline (lower is better).

| Method | Res. | Top-1 Acc. | mFR |
| --- | --- | --- | --- |
| ResNet-50 [31] | 224 | – | 58.0 |
| Low Pass Filter Pooling [99] | 224 | – | 51.2 |
| ResNeXt-101 WSL [55, 59] | 224 | – | 27.8 |
| EfficientNet-L2 | 224 | 80.4% | 27.2 |
| Noisy Student Training (L2) | 224 | 85.2% | 14.2 |
| EfficientNet-L2 | 299 | 81.6% | 23.7 |
| Noisy Student Training (L2) | 299 | 86.4% | 12.2 |

For ImageNet-C and ImageNet-P, we evaluate models on the two released versions with resolutions 224x224 and 299x299 and resize images to the resolution the EfficientNet was trained on.


Figure 3: Selected images from the robustness benchmarks ImageNet-A, C and P. Test images from ImageNet-C underwent artificial transformations (also known as common corruptions) that cannot be found in the ImageNet training set. Test images from ImageNet-P underwent different scales of perturbations. On ImageNet-A and C, EfficientNet with Noisy Student Training produces correct top-1 predictions (shown in bold black text) while EfficientNet without Noisy Student Training produces incorrect top-1 predictions (shown in red text). On ImageNet-P, EfficientNet without Noisy Student Training flips predictions frequently.

As shown in Tables 3, 4 and 5, Noisy Student Training yields substantial gains on robustness datasets compared to the previous state-of-the-art model ResNeXt-101 WSL [55, 59] trained on 3.5B weakly labeled images. On ImageNet-A, it improves the top-1 accuracy from 61.0% to 83.7%. On ImageNet-C, it reduces the mean corruption error (mCE) from 45.7 to 28.3. On ImageNet-P, it leads to a mean flip rate (mFR) of 14.2 if we use a resolution of 224x224 (direct comparison) and 12.2 if we use a resolution of 299x299. These significant gains in robustness on ImageNet-C and ImageNet-P are surprising because our method was not deliberately optimized for robustness.

Qualitative Analysis. To intuitively understand the significant improvements on the three robustness benchmarks, we show several images in Figure 3 where the predictions of the standard model are incorrect while the predictions of the model with Noisy Student Training are correct.


Figure 3a shows example images from ImageNet-A and the predictions of our models. The model with Noisy Student Training can successfully predict the correct labels of these highly difficult images. For example, without Noisy Student Training, the model predicts bullfrog for the image shown on the left of the second row, which might result from the black lotus leaf on the water. With Noisy Student Training, the model correctly predicts dragonfly for the image. For the top-left image, the model without Noisy Student Training ignores the sea lions and mistakenly recognizes a buoy as a lighthouse, while the model with Noisy Student Training can recognize the sea lions.

Figure 3b shows images from ImageNet-C and the corresponding predictions. As can be seen from the figure, our model with Noisy Student Training makes correct predictions for images under severe corruptions and perturbations such as snow, motion blur and fog, while the model without Noisy Student Training suffers greatly under these conditions. The most interesting image is shown on the right of the first row: the swing in the picture is barely recognizable by a human, while the model with Noisy Student Training still makes the correct prediction.

Figure 3c shows images from ImageNet-P and the corresponding predictions. As can be seen, our model with Noisy Student Training makes correct and consistent predictions as images undergo different perturbations, while the model without Noisy Student Training flips predictions frequently.

3.4. Adversarial Robustness Results


After testing our model's robustness to common corruptions and perturbations, we also study its performance on adversarial perturbations. We evaluate our EfficientNet-L2 models with and without Noisy Student Training against an FGSM attack. This attack performs one gradient descent step on the input image [25] with the update on each pixel set to $\epsilon$. As shown in Figure 4, Noisy Student Training leads to very significant improvements in accuracy even though the model is not optimized for adversarial robustness. Under a stronger PGD attack with 10 iterations [54], at $\epsilon = 16$, Noisy Student Training improves EfficientNet-L2's accuracy from 1.1% to 4.4%.
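
A standard FGSM evaluation sketch per [25] (ours, not the authors' evaluation code; epsilon is assumed to be on the same scale as the input pixels):

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, x, y, epsilon):
    """One-step FGSM [25]: move each input pixel by epsilon in the direction
    of the sign of the loss gradient, then measure top-1 accuracy on the
    perturbed batch."""
    model.eval()
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    with torch.no_grad():
        return (model(x_adv).argmax(dim=-1) == y).float().mean().item()
```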

Note that these adversarial robustness results are not directly comparable to prior works since we use a large input resolution of 800x800 and adversarial vulnerability can scale with the input dimension [22, 25, 24, 74].


Figure 4: Noisy Student Training improves adversarial robustness against an FGSM attack even though the model is not optimized for adversarial robustness. The accuracy is improved by 11% at $\epsilon = 2$ and the improvement grows as $\epsilon$ gets larger.

4. Ablation Study


In this section, we study the importance of noise and iterative training and summarize the ablations for other components of our method.


4.1. The Importance of Noise in Self-training


Since we use soft pseudo labels generated from the teacher model, when the student is trained to be exactly the same as the teacher model, the cross entropy loss on unlabeled data would be zero and the training signal would vanish. Hence, a question that naturally arises is why the student can outperform the teacher with soft pseudo labels. As stated earlier, we hypothesize that noising the student is needed so that it does not merely learn the teacher's knowledge. We investigate the importance of noising in two scenarios with different amounts of unlabeled data and different teacher model accuracies. In both cases, we gradually remove augmentation, stochastic depth and dropout for unlabeled images when training the student model, while keeping them for labeled images. This way, we can isolate the influence of noising on unlabeled images from the influence of preventing overfitting on labeled images. In addition, we compare using a noised teacher and an unnoised teacher to study whether it is necessary to disable noise when generating pseudo labels.

As the evidence in Table 6 shows, noise such as stochastic depth, dropout and data augmentation plays an important role in enabling the student model to perform better than the teacher. The performance consistently drops as noise functions are removed. However, in the case with 130M unlabeled images, even with the noise functions removed, the performance is still improved from the supervised baseline of 84.0% to 84.3%. We hypothesize that this improvement can be attributed to SGD, which introduces stochasticity into the training process.

One might argue that the improvements from using noise result from preventing overfitting of the pseudo labels on the unlabeled images. We verify that this is not the case when we use 130M unlabeled images, since the model does not overfit the unlabeled set according to the training loss. While removing noise leads to a much lower training loss for labeled images, we observe that, for unlabeled images, removing noise leads to a smaller drop in training loss. This is probably because it is harder to overfit the large unlabeled dataset.

Lastly, adding noise to the teacher model that generates pseudo labels leads to lower accuracy, which shows the importance of having a powerful unnoised teacher model.


4.2. A Study of Iterative Training


Here, we show the detailed effects of iterative training. As mentioned in Section 3.1, we first train an EfficientNet-B7 model on labeled data and then use it as the teacher to train an EfficientNet-L2 student model. Then, we iterate this process by putting back the new student model as the teacher model.

| Model / Unlabeled Set Size | 1.3M | 130M |
| --- | --- | --- |
| EfficientNet-B5 | 83.3% | 84.0% |
| Noisy Student Training (B5) | 83.9% | 85.1% |
| student w/o Aug | 83.6% | 84.6% |
| student w/o Aug, SD, Dropout | 83.2% | 84.3% |
| teacher w. Aug, SD, Dropout | 83.7% | 84.4% |

Table 6: Ablation study of noising. We use EfficientNet-B5 as the teacher model and study two cases with different numbers of unlabeled images and different augmentations. For the experiment with 1.3M unlabeled images, we use the standard augmentation including random translation and flipping for both the teacher and the student. For the experiment with 130M unlabeled images, we use RandAugment. Aug and SD denote data augmentation and stochastic depth respectively. We remove the noise for unlabeled images while keeping it for labeled images. Here, iterative training is not used and the unlabeled batch size is set to be the same as the labeled batch size to save training time.


As shown in Table 7, the model performance improves to 87.6% in the first iteration and then to 88.1% in the second iteration with the same hyperparameters (except for using a teacher model with better performance). These results indicate that iterative training is effective in producing increasingly better models. For the last iteration, we use a larger ratio between the unlabeled batch size and the labeled batch size to boost the final performance to 88.4%.

Table 7: Iterative training improves the accuracy, where batch size ratio denotes the ratio between unlabeled data and labeled data.

| Iteration | Model | Batch Size Ratio | Top-1 Acc. |
| --- | --- | --- | --- |
| 1 | EfficientNet-L2 | 14:1 | 87.6% |
| 2 | EfficientNet-L2 | 14:1 | 88.1% |
| 3 | EfficientNet-L2 | 28:1 | 88.4% |

4.3. Additional Ablation Study Summarization

We also study the importance of various design choices of Noisy Student Training, hopefully offering a practical guide for readers. With this purpose, we conduct 8 ablation studies in Appendix A.2. The findings are summarized as follows:


• Finding #1: Using a large teacher model with better performance leads to better results.
• Finding #2: A large amount of unlabeled data is necessary for better performance.

5. Related works


Self-training. Our work is based on self-training (e.g., [71, 96, 68, 67]). Self-training first uses labeled data to train a good teacher model, then uses the teacher model to label unlabeled data, and finally uses the labeled data and unlabeled data to jointly train a student model. In typical self-training with the teacher-student framework, noise injection to the student is not used by default, or the role of noise is not fully understood or justified. The main difference between our work and prior works is that we identify the importance of noise, and aggressively inject noise to make the student better.

Self-training was previously used to improve ResNet-50 from 76.4% to 81.2% top-1 accuracy [93], which is still far from the state-of-the-art accuracy. Yalniz et al. [93] also did not show significant improvements in terms of robustness on ImageNet-A, C and P as we did. In terms of methodology, they proposed to first train only on unlabeled images and then finetune their model on labeled images as the final stage. In Noisy Student Training, we combine these two steps into one because doing so simplifies the algorithm and leads to better performance in our experiments.

Data Distillation [63], which ensembles predictions for an image with different transformations to strengthen the teacher, is the opposite of our approach of weakening the student. Parthasarathi et al. [61] find a small and fast speech recognition model for deployment via knowledge distillation on unlabeled data. As noise is not used and the student is also small, it is difficult to make the student better than the teacher. The domain adaptation framework in [69] is related but highly optimized for videos, e.g., predicting which frame to use in a video. The method in [101] ensembles predictions from multiple teacher models, which is more expensive than our method.

Co-training [9] divides features into two disjoint partitions and trains two models with the two sets of features using labeled data. Their source of “noise” is the feature partitioning such that two models do not always agree on unlabeled data. Our method of injecting noise to the student model also enables the teacher and the student to make different predictions and is more suitable for ImageNet than partitioning features.


Self-training / co-training has also been shown to work well for a variety of other tasks including leveraging noisy data [87], semantic segmentation [4], and text classification [40, 78]. Back-translation and self-training have led to significant improvements in machine translation [72, 20, 28, 14, 90, 29].

Semi-supervised Learning. Apart from self-training, another important line of work in semi-supervised learning [12, 103] is based on consistency training [5, 64, 47, 84, 56, 52, 62, 13, 16, 60, 2, 49, 88, 91, 8, 98, 46, 7]. These methods constrain model predictions to be invariant to noise injected into the input, hidden states or model parameters. As discussed in Section 2, consistency regularization works less well on ImageNet because it uses the model being trained to generate the pseudo labels. In the early phase of training, this regularizes the model towards high entropy predictions and prevents it from achieving good accuracy.

Works based on pseudo labels [48, 39, 73, 1] are similar to self-training, but they also suffer the same problem as consistency training, since they rely on a model being trained, instead of a converged model with high accuracy, to generate pseudo labels. Finally, frameworks in semi-supervised learning also include graph-based methods [102, 89, 94, 42], methods that make use of latent variables as target variables [41, 53, 95] and methods based on low-density separation [26, 70, 19], which might provide complementary benefits to our method.

Knowledge Distillation. Our work is also related to methods in Knowledge Distillation [10, 3, 33, 21, 6] via the use of soft targets. The main use of knowledge distillation is model compression by making the student model smaller. The main difference between our method and knowledge distillation is that knowledge distillation does not consider unlabeled data and does not aim to improve the student model.


Robustness. A number of studies, e.g. [82, 31, 66, 27], have shown that vision models lack robustness. Addressing the lack of robustness has become an important research direction in machine learning and computer vision in recent years. Our study shows that using unlabeled data improves accuracy and general robustness. Our finding is consistent with arguments that using unlabeled data can improve adversarial robustness [11, 77, 57, 97]. The main difference between our work and these works is that they directly optimize adversarial robustness on unlabeled data, whereas we show that Noisy Student Training improves robustness greatly even without directly optimizing robustness.


6. Conclusion


Prior works on weakly-supervised learning required billions of weakly labeled data to improve state-of-the-art ImageNet models. In this work, we showed that it is possible to use unlabeled images to significantly advance both accuracy and robustness of state-of-the-art ImageNet models. We found that self-training is a simple and effective algorithm to leverage unlabeled data at scale. We improved it by adding noise to the student, hence the name Noisy Student Training, to learn beyond the teacher’s knowledge.


Our experiments showed that Noisy Student Training and EfficientNet can achieve an accuracy of 88.4%, which is 2.9% higher than without Noisy Student Training. This result is also a new state-of-the-art and 2.0% better than the previous best method that used an order of magnitude more weakly labeled data [55, 86].