Meta Co-Training: Two Views are Better than One
Abstract
In many practical computer vision scenarios unlabeled data is plentiful, but labels are scarce and difficult to obtain. As a result, semi-supervised learning, which leverages unlabeled data to boost the performance of supervised classifiers, has received significant attention in recent literature. One major class of semi-supervised algorithms is co-training. In co-training two different models leverage different independent and sufficient “views” of the data to jointly make better predictions. During co-training each model creates pseudo labels on unlabeled points which are used to improve the other model. We show that in the common case when independent views are not available we can construct such views inexpensively using pre-trained models. Co-training on the constructed views yields a performance improvement over any of the individual views we construct and performance comparable with recent approaches in semi-supervised learning, but has some undesirable properties. To alleviate the issues present with co-training we present Meta Co-Training, which is an extension of the successful Meta Pseudo Labels approach to two views. Our method achieves new state-of-the-art performance on ImageNet $10\%$ with very few training resources, as well as outperforming prior semi-supervised work on several other fine-grained image classification datasets.
1. Introduction
In many practical machine learning scenarios we have access to a large amount of unlabeled data, and relatively few labeled data points for the task on which we would like to train. For generic computer vision (CV) tasks there are large, well-known, and open-source datasets with hundreds of millions or even billions of unlabeled images [62, 63]. In contrast, labeled data is usually scarce and expensive to obtain, requiring many human-hours to generate. Hence, in this context, semi-supervised learning (SSL) methods are becoming increasingly popular as they rely on training useful learning models by using small amounts of labeled data and large amounts of unlabeled data.
Not unrelated to, and not to be confused with the idea of semi-supervised learning, is self-supervised learning.1 In self-supervised learning a model is trained with an objective that does not require a label that is not evident from the data itself. Self-supervised learning was popularized by the BERT [32] model for generating word embeddings for natural language processing (NLP) tasks. BERT generates its embeddings by solving the pretext task of masked word prediction. Pretext tasks present unsupervised objectives that are not directly related to any particular supervised objective we might desire to solve, but rather are solved in the hope of learning a suitable representation to more easily solve a downstream task. Due to the foundational nature of these models to the success of many models in NLP, or perhaps the fact that they serve as a foundation for other models to be built upon, these pretext learners are often referred to as foundation models. Inspired by foundation models from NLP, several pretext tasks have been proposed for CV with associated foundation models that have been trained to solve them [8, 19, 20, 23, 24, 38, 42, 43, 49, 58]. Unlike NLP, the learned representations for images are often much smaller than the images themselves which yields the additional benefit of reduced computational effort when using the learned representations.
Typically, SSL requires the learner to make predictions on a large amount of unlabeled data as well as the labeled data during training. Furthermore, SSL often requires that the learner be re-trained [12, 17, 39, 40, 48, 50, 55, 70, 72, 74, 76] partially, or from scratch. Given the computational benefits of learning from a frozen representation, combining SSL and self-supervised learning is a natural idea. Additionally, the representation learner allows us to compress our data to a smaller size. This compression is beneficial not only because of the reduction in operations required that is associated with a smaller model, but also because it allows us to occupy less space on disk, in RAM, and in high bandwidth memory on hardware accelerators.
These benefits are particularly pronounced for co-training [12], in which two different “views” of the data must be obtained or constructed and then two different classifiers must be trained and re-trained iteratively. Seldom does a standard problem in machine learning present two independent and sufficient views of the problem to be leveraged for co-training. As a result, if one hopes to apply co-training one usually needs to construct two views from the view that they have. To take advantage of the computational advantages of representation learning, and the large body of self-supervised literature, we propose using the embeddings produced by foundation models trained on different pretext tasks as the views for co-training. To our knowledge we are the first to propose using frozen learned feature representations as views for co-training classification, though we think the idea is very natural. We show that views constructed in this way appear to satisfy the necessary conditions for the effectiveness of co-training.
Unfortunately, while we can boost performance by training different models on two views, co-training performs sub-optimally during its pseudo-labeling step. To more effectively leverage two views we propose a novel method for co-training based on Meta Pseudo Labels [59], called Meta Co-Training, in which each model is trained to provide better pseudo labels to the other given only its view of the labeled data and the performance of the other model on the labeled set. We show that our approach provides an improvement over both co-training and the current state-of-the-art (SotA) on the ImageNet $10\%$ dataset, and it establishes new SotA few-shot performance on several other fine-grained image classification tasks.
Summary of Contributions. Our contributions are therefore summarized as follows.
Outline of the Paper. In Section 2 we discuss related work. In Section 3 we establish notation and provide more information on co-training and meta pseudo labels. In Section 4 we describe our proposed methods of view construction and of meta co-training. In Section 5 we present and discuss our experimental findings on ImageNet [31], Flowers102 [57], Food101 [14], FGVCAircraft [54], iNaturalist [44], and iNaturalist 2021 [45]. In Section 6 we conclude with a summary and ideas for future work.
2. Related Work
While SSL [21, 67, 73, 78] has become very relevant in recent years with the influx of the data collected by various sensors, the paradigm and its usefulness have been studied at least since the 1960s; e.g., [29, 36, 41, 64]. Among the main approaches for SSL that are not directly related to our work, we find techniques that try to identify the maximum likelihood parameters for mixture models [7, 30], constrained clustering [3, 9, 18, 28, 68, 69], graph-based methods [6, 10, 11, 53], as well as purely theoretical probably approximately correct (PAC)-style results [4, 25, 37]. Below we discuss SSL approaches closer to our work.
Self-Training and Student-Teacher Frameworks. One of the earliest ideas for SSL is that of self-training [36, 64]. This idea is still very popular with different student-teacher frameworks [17, 39, 50, 55, 59, 66, 72, 76, 79]. This technique is also frequently referred to as pseudo-labeling [48] as the student tries to improve its performance by making predictions on the unlabeled data, then integrating the most confident predictions to the labeled set, and subsequently retraining using the broader labeled set. In addition, one can combine ideas; e.g., self-training with clustering [46], or create ensembles via self-training [47], potentially using graph-based information [53]. An issue with these approaches is that of confirmation bias [2], where the student cannot improve a lot due to incorrect labeling that occurs either because of self-training, or due to a fallible teacher.
Co-Training. Another paradigmatic family of SSL algorithms are co-training [12] algorithms, where instances have two different views for the same underlying phenomenon. Co-training has been very successful with wide applicability [34] and can outperform learning algorithms that use only single-view data [56]. Furthermore, the generalization ability of co-training is well-motivated theoretically [5, 12, 26, 27]. There are several natural extensions of the base idea, such as methods for constructing different views [15, 16, 22, 65, 71], including connections to active learning [33], methods that are specific to deep learning [60], and other natural extensions that exploit more than two views on data; e.g., [77].
3. Background
We briefly describe two important components of semi-supervised learning that are used jointly in order to devise our proposed new technique. However, before we do so, we also define basic notions so that we can eliminate ambiguity.
3.1. Notation
Supervised learning is based on the idea of learning a function, called a model, $f$ belonging to some model space $\mathcal{F}$, using training examples $(x,y)$ that exemplify some underlying phenomenon. Training examples are labeled instances; i.e., $x\in\mathcal{X}$ is an instance drawn from some instance space $\mathcal{X}$ and $y$ is a label from some label set $\mathcal{V}$. This relationship between $\mathcal{X}$ and $\mathcal{V}$ can be modeled with a probability distribution $D$ that governs $\mathcal{X}\times\mathcal{V}$. We write $(x,y)\sim D$ to indicate that a training example $(x,y)$ is drawn from $\mathcal{X}\times\mathcal{V}$ according to $D$. We use $\mathbb{1}\{A\}$ as an indicator function of the event $A$; that is, $\mathbb{1}\{A\}$ is equal to 1 when $A$ holds, otherwise 0. For vectors $\alpha,\beta\in\mathbb{R}^{d}$ we indicate their Hadamard (element-wise) product using $\otimes$; that is, $\gamma=\alpha\otimes\beta$, where $\gamma_{j}=\alpha_{j}\cdot\beta_{j}$ for $j\in\{1,\ldots,d\}$.
Semi-Supervised Learning. We are interested in situations where we have a large pool of unlabeled instances $U$ as well as a small portion $L$ of them that is labeled; that is, $L=\{(x_{i},y_{i})\}_{i=1}^{m}$, $U=\{x_{j}\}_{j=1}^{u}$, and $S=L\cup U$. In this setting, semi-supervised learning (SSL) methods attempt to first learn an initial model $f_{init}$ using the labeled data $L$ via a supervised learning algorithm, and subsequently to harness the information that is hidden in the unlabeled set $U$, so that a better model $f$ can be learnt.
3.2. Meta Pseudo Labels
In order to discuss Meta Pseudo Labels, first we need to discuss Pseudo Labels. Pseudo Labels (PL) [48] is a broad paradigm for SSL, which involves a teacher and a student in the learning process. The teacher, looking at the unlabeled set $U$, selects a few instances and assigns pseudo-labels to them. Subsequently these pseudo-examples are shown to the student as if they were legitimate examples so that the student can learn a better model. In this context, the teacher $f_{\theta_{T}}$ is fixed and the student is trying to learn a better model by aligning its predictions to pseudo-labeled batches $\left(X_{u},f_{\theta_{T}}(X_{u})\right)$. Thus, on a pseudo-labeled batch, the student optimizes its parameters using
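a cross-entropy objective against the teacher's pseudo-labels. In the notation above, this can be sketched (following the formulation in [59], with CE denoting cross-entropy) as

$$
\theta_{S}^{\mathrm{PL}}=\arg\min_{\theta_{S}}\;\mathcal{L}_{u}\left(\theta_{T},\theta_{S}\right),
\qquad
\mathcal{L}_{u}\left(\theta_{T},\theta_{S}\right)=\mathbb{E}_{X_{u}}\!\left[\mathrm{CE}\!\left(f_{\theta_{T}}\!\left(X_{u}\right),f_{\theta_{S}}\!\left(X_{u}\right)\right)\right].
$$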
However, because the dependency between $\theta_{T}$ and $\theta_{S}$ is complicated, a practical approximation [35, 51] is obtained via $\theta_{S}^{\mathrm{PL}}\approx\theta_{S}-\eta_{S}\cdot\nabla_{\theta_{S}}\mathcal{L}_{u}\left(\theta_{T},\theta_{S}\right)$, which leads to the practical teacher objective:
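With $\mathcal{L}_{l}$ denoting the student's loss on a labeled batch, this can be written (as a sketch of the formulation in [59], which also gives the gradient estimator used in practice) as

$$
\min_{\theta_{T}}\;\mathcal{L}_{l}\!\left(\theta_{S}-\eta_{S}\cdot\nabla_{\theta_{S}}\mathcal{L}_{u}\left(\theta_{T},\theta_{S}\right)\right).
$$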
For more information the interested reader can refer to [59].
3.3. Co-Training
Co-training relies on the existence of two different views of the instance space $\mathcal{X}$; that is, $\mathcal{X}=\mathcal{X}^{(1)}\times\mathcal{X}^{(2)}$. Hence, an instance $x\in\mathcal{X}$ has the form $x=(x^{(1)},x^{(2)})\in\mathcal{X}^{(1)}\times\mathcal{X}^{(2)}$. We say that the views $x^{(1)}$ and $x^{(2)}$ are complementary. Co-training requires two assumptions in order to work. First, it is assumed that, provided sufficiently many examples, each separate view is enough for learning a meaningful model. The second is the conditional independence assumption [12], which states that $x^{(1)}$ and $x^{(2)}$ are conditionally independent given the label of the instance $x=(x^{(1)},x^{(2)})$. Thus, with co-training, two models $f^{(1)}$ and $f^{(2)}$ are learnt, which help each other using different information that they capture from the different views of each instance. More information is presented below.
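In symbols, the second assumption is the standard notion of conditional independence: for every label $y\in\mathcal{V}$,

$$
\Pr\left[x^{(1)},x^{(2)}\mid y\right]=\Pr\left[x^{(1)}\mid y\right]\cdot\Pr\left[x^{(2)}\mid y\right].
$$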
Given a dataset $S=L\cup U$ (see Section 3.1 for notation), we can have access to two different views of the dataset due to the natural partitioning of the instance space $\mathcal{X}$; that is, $S^{(1)}=L^{(1)}\cup U^{(1)}$ and $S^{(2)}=L^{(2)}\cup U^{(2)}$. Initially two models $f^{(1)}$ and $f^{(2)}$ are learnt using a supervised learning algorithm based on the labeled sets $L^{(1)}$ and $L^{(2)}$ respectively. Learning proceeds in iterations, where in each iteration a number of unlabeled instances are integrated with the labeled set by assigning pseudo-labels to them that are predicted by $f^{(1)}$ and $f^{(2)}$, so that improved models $f^{(1)}$ and $f^{(2)}$ can be learnt by retraining on the augmented labeled sets. In particular, in each iteration, each model $f^{(v)}$ predicts a label for each view $x^{(v)}\in U^{(v)}$. The top $k$ most confident predictions by model $f^{(v)}$ are used to provide a pseudo label for the respective complementary instances, and then these pseudo-labeled examples are integrated into the labeled set for the complementary view. In the next iteration a new model is trained based on the augmented labeled set. Both views of the used instances are dropped from $U$, even if only one of the views may be used for augmenting the appropriate labeled set. This process repeats until the unlabeled sets $U^{(v)}$ are exhausted, yielding models $f^{(1)}$ and $f^{(2)}$ corresponding to the two views. Co-training is evaluated using the joint prediction of these models, which is the re-normalized element-wise product of the two predictions:
$$
\frac{f^{(1)}\left(x^{(1)}\right)\otimes f^{(2)}\left(x^{(2)}\right)}{\left\|f^{(1)}\left(x^{(1)}\right)\otimes f^{(2)}\left(x^{(2)}\right)\right\|_{1}}.
$$
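As a compact illustration of the procedure just described, the sketch below puts the iterative pseudo-labeling loop and the joint prediction together. It is illustrative only: the classifier interface (fit/predict_proba), integer class labels in 0..C-1, and the helper names are assumptions made for the sketch, not part of the original algorithm statement.

```python
import numpy as np

def co_train(clf1, clf2, L1, L2, y, U1, U2, k):
    """Minimal sketch of classic co-training [12] on two pre-computed views.
    clf1/clf2: classifiers exposing fit(X, y) and predict_proba(X)
    L1/L2:     the two views of the labeled instances, y their labels (0..C-1)
    U1/U2:     the two (row-aligned) views of the unlabeled pool
    k:         number of confident pseudo-labels each model transfers per iteration
    """
    y1, y2 = y.copy(), y.copy()                    # each view keeps its own growing label set
    while len(U1) > 0:
        clf1.fit(L1, y1)
        clf2.fit(L2, y2)
        p1, p2 = clf1.predict_proba(U1), clf2.predict_proba(U2)
        top1 = np.argsort(p1.max(axis=1))[-k:]     # most confident points of model 1
        top2 = np.argsort(p2.max(axis=1))[-k:]     # most confident points of model 2
        # confident predictions of each model augment the *other* view's labeled set
        L2, y2 = np.vstack([L2, U2[top1]]), np.concatenate([y2, p1[top1].argmax(1)])
        L1, y1 = np.vstack([L1, U1[top2]]), np.concatenate([y1, p2[top2].argmax(1)])
        # both views of every used instance are dropped from the unlabeled pool
        keep = np.setdiff1d(np.arange(len(U1)), np.union1d(top1, top2))
        U1, U2 = U1[keep], U2[keep]
    clf1.fit(L1, y1)                               # final refit on the fully augmented sets
    clf2.fit(L2, y2)
    return clf1, clf2

def joint_predict(clf1, clf2, x1, x2):
    """Joint prediction of the two trained models, as in the equation above."""
    p = clf1.predict_proba(x1) * clf2.predict_proba(x2)   # Hadamard product
    return p / p.sum(axis=1, keepdims=True)               # re-normalize to sum to 1
```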
Co-training was introduced by Blum and Mitchell [12], and their paradigmatic example was classifying web pages using two different views: (i) the bag of words based on the content of each page, and (ii) the bag of words formed by the words used in hyperlinks pointing to these web pages. In our context the two different views are obtained via the different embeddings that we obtain when we use images as inputs to different foundation models and observe the resulting embeddings.
4. Proposed Methods
4.1. View Construction
Past methods of view construction include manual feature subsetting [33], automatic feature subsetting [22, 65], random feature subsetting [15, 16], random subspace selection [16], and adversarial examples [60]. We propose a method of view construction which is motivated by modern approaches to SSL and representation learning in computer vision and natural language processing: pretext learning.
Figure 1. Using pretrained models for constructing views.
Given two pretext tasks we can train a model to accomplish each task, then use their learned embedding spaces as separate views for classification. If we have chosen our pretext tasks well, then the representations will be different, compressed, and capture sufficient information to perform the desired downstream task. Figure 1 demonstrates the idea of the method. Even if one does not have access to different views on a particular dataset, the use of different pre-trained models allows the creation of different views; as the transformation happens at the level of instances, it applies to both labeled and unlabeled batches alike as shown in Figure 1 with $L$ and $U$ respectively. However, if one has access to a dataset where the instances have two different views, then one can still obtain useful representations with this approach using even the same pre-trained model. Furthermore, one can also use multiple pre-trained models and concatenate the resulting embeddings so that two separate views with richer information content can be created and are still low-dimensional.
While our approach is more widely applicable, it is particularly well-suited to computer vision. There is a plethora [19, 20, 23, 24, 38, 42, 49, 58, 61] of competitive learned representations for images. Recently it has been popular to treat these representations as task-agnostic foundation models. We show that the existence of these models eliminates the need to train new models for pretext tasks to achieve state-of-the-art performance for SSL classification. Furthermore, it is comparatively very cheap to use the foundation models instead, as not only can we forego training on a pretext task, but we are able to completely bypass the need to train a large convolutional neural network (CNN) or vision transformer (ViT). In fact, we compute the embedding for each of the images in our dataset only once (as opposed to applying some form of data augmentation during training and making use of the embedding process during training). So, at training time the only weights we need to use at all are those in the model we train on top of the learned representation. This makes our approach incredibly lightweight compared to traditional CNN/ViT training procedures. A minimal sketch of this view-construction step is given below.
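In the sketch, `encoder_a` and `encoder_b` stand for any two frozen pretrained backbones (for instance, image encoders of two of the models used in Section 5), each assumed to map a batch of images to a (batch, dim) feature tensor; the dataloader and device handling are likewise illustrative.

```python
import torch

@torch.no_grad()
def build_views(encoder_a, encoder_b, dataloader, device="cuda"):
    """Embed every image exactly once with two frozen pretrained encoders;
    the two resulting embedding matrices serve as the two co-training views."""
    encoder_a.eval()
    encoder_b.eval()
    view1, view2 = [], []
    for images, _ in dataloader:   # labels are ignored; unlabeled batches work the same way
        images = images.to(device)
        view1.append(encoder_a(images).cpu())
        view2.append(encoder_b(images).cpu())
    return torch.cat(view1), torch.cat(view2)
```

Because the embeddings are computed once and cached, all subsequent (meta) co-training happens on small classifiers over these low-dimensional views.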
Figure 2. At each step $t\in\{1,\ldots,T\}$ of meta co-training the models that correspond to the so-far learnt parameters $\theta_{1;t}$ and $\theta_{2;t}$ play the role of the student and the teacher simultaneously using batches for their respective views. Pseudo-labeling occurs on complementary views so that the teacher can provide the student with labels on an unlabeled batch. Labeled batches may, or may not, use complementary views, as the purpose that they serve is to calculate the risk of the student model on the labeled batch and this result signals the teacher model to update its weights accordingly.
4.2. Meta Co-Training
In addition to our method of view construction we propose a novel algorithm for co-training which is based on Meta Pseudo Labels [59], called Meta Co-Training (MCT), which is described below.
4.2.1 Overview of Meta Co-Training (MCT)
An overview of MCT is provided in Figure 2. At each step $t\in\{1,\ldots,T\}$ of meta co-training the models that correspond to the so-far learnt parameters $\theta_{1;t}$ and $\theta_{2;t}$ play the role of the student and the teacher simultaneously on batches from their respective views. The (low-dimensional) embeddings obtained from the view construction process of Section 4.1 form different views that are used by MCT.
When a student model predicts labels on an unlabeled batch, it suffers loss based on the pseudo-labels that are provided by the teacher model (which uses the complementary view for the unlabeled instances), and the student weights are adjusted accordingly. Then the student model evaluates the performance of its new weights on separate labeled batches. This performance provides feedback to the teacher model. Geometrically, we can understand this as updating the parameters in the direction of increasing confidence on our pseudo labels if those labels helped the other model perform better on the labeled set, or in the opposite direction if they hurt the other model’s performance on the labeled set.
Finally, similar to MPL, we assume that at $t=1$ the models have been pre-trained using a supervised warm-up period so that their predictions are not random. This is analogous to the fully-supervised iteration of co-training.
4.2.2 Algorithmic Details
Algorithm 1 presents the details in pseudocode. The function SampleBatch samples a batch from the dataset. The function getOtherView returns the complementary view of the first argument; the second argument clarifies which view should be returned (“2” in line 5). At each step $t$ of the algorithm each model is first updated based on the pseudo labels the other has provided on a batch of unlabeled data (Lines 13-14). Then each model is updated to provide labels that encourage the other to predict more correctly on the labeled set, based on the performance of the other (Lines 24-25). We refer the reader to Section 3.2 (resp. [59]) for a brief (resp. complete) discussion of the MPL approach. Note that contrary to traditional co-training, no new pseudo-labeled data are added to the labeled set, nor are any data removed from the unlabeled set after each step.
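For intuition, the following is a minimal sketch of a single MCT step under simplifying assumptions: hard pseudo-labels, generic PyTorch optimizers, and the first-order teacher-feedback approximation discussed in Section 3.2. Algorithm 1 may differ in ordering and in such details.

```python
import torch
import torch.nn.functional as F

def mct_step(f1, f2, opt1, opt2, x_l1, x_l2, y_l, x_u1, x_u2):
    """One (sketched) Meta Co-Training step. f1/f2 act on views 1/2; (x_l*, y_l)
    is a labeled batch and x_u* an unlabeled batch, each given in both views."""
    # --- student phase: each model learns from the other's pseudo-labels ----------
    with torch.no_grad():
        pl_for_1 = f2(x_u2).argmax(dim=1)   # teacher f2 pseudo-labels the batch for student f1
        pl_for_2 = f1(x_u1).argmax(dim=1)   # teacher f1 pseudo-labels the batch for student f2
        before1 = F.cross_entropy(f1(x_l1), y_l).item()   # labeled loss of f1 before its update
        before2 = F.cross_entropy(f2(x_l2), y_l).item()   # labeled loss of f2 before its update
    opt1.zero_grad(); F.cross_entropy(f1(x_u1), pl_for_1).backward(); opt1.step()
    opt2.zero_grad(); F.cross_entropy(f2(x_u2), pl_for_2).backward(); opt2.step()
    # --- teacher phase: reward pseudo-labels that improved the other model --------
    with torch.no_grad():
        h1 = before1 - F.cross_entropy(f1(x_l1), y_l).item()   # improvement of f1 -> feedback for f2
        h2 = before2 - F.cross_entropy(f2(x_l2), y_l).item()   # improvement of f2 -> feedback for f1
    # positive h: become more confident in the pseudo-labels; negative h: move away from them
    opt2.zero_grad(); (h1 * F.cross_entropy(f2(x_u2), pl_for_1)).backward(); opt2.step()
    opt1.zero_grad(); (h2 * F.cross_entropy(f1(x_u1), pl_for_2)).backward(); opt1.step()
```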
Table 1. Linear probe evaluation for views. Top-1 accuracy on different subsets of the ImageNet data is shown.
| Model | 1% | 10% | 100% |
|---|---|---|---|
| MAE | 1.9 | 3.2 | 73.5 |
| DINOv2 | 78.1 | 82.9 | 86.3 |
| SwAV | 12.1 | 41.1 | 77.9 |
| EsViT | 69.1 | 74.4 | 81.3 |
| CLIP | 74.1 | 80.9 | 85.4 |
5. Experimental Evaluation
Towards evaluating the proposed method of view construction for co-training we first constructed views of the data utilizing five different learned representations. Our primary comparisons were on the widely-used ImageNet [31] classification task; however, Sections 5.4 and B provide experimental results on additional image classification tasks. To produce the views used to train the classifiers during co-training and meta co-training we used the embedding space of five representation learning architectures: the Masked Autoencoder (MAE) [42], DINOv2 [58], SwAV [19], EsViT [49], and CLIP [61]. We selected the models which produce the views as they have been learned in an unsupervised way, have been made available by the authors of their respective papers for use in PyTorch, and have been shown to produce representations that are appropriate for ImageNet classification.
5.1. Investigating the Co-Training Assumptions on the Constructed Views
Table 2. MLP performance on individual views. Top-1 accuracy on different subsets of the ImageNet data is shown.
| Model | 1% | 10% |
|---|---|---|
| MAE | 23.4 | 48.5 |
| DINOv2 | 78.4 | 82.7 |
| SwAV | 12.5 | 32.6 |
| EsViT | 71.3 | 75.8 |
| CLIP | 75.2 | 80.9 |
Motivated by the analysis and experiments of [34] we made an effort to verify the sufficiency and independence of the representations learned by the aforementioned models.
On the Sufficiency of the Views. Sufficiency is fairly easy to verify. If we choose a reasonable sufficiency threshold for the task, for ImageNet say close to or above $70\%$ top-1 accuracy, then we can train simple models on the views of the dataset we have constructed and provide examples of functions which demonstrate the satisfaction of the property. We tested a single linear layer with softmax output, as well as a simple 3-layer multi-layer perceptron (MLP) with 1024 neurons per layer and dense skip connections (previous layer outputs were concatenated to the input for subsequent layers), to provide a lower bound on the sufficiency of the views on the standard subsets of the ImageNet labels for semi-supervised classification. From these experiments, whose results are shown in Tables 1 and 2 respectively, we can see that if we were to pick a threshold for top-1 accuracy around $70\%$, which we believe indicates considerable skill without presenting too high of a bar, then EsViT, CLIP, and DINOv2 provide sufficient views for both the $1\%$ and $10\%$ subsets.
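As a concrete sketch of the probe just described (the activation function, whether the output layer counts among the three layers, and other training details are assumptions):

```python
import torch
import torch.nn as nn

class DenseSkipMLP(nn.Module):
    """Probe MLP sketch: 1024-unit hidden layers, each layer receiving the
    concatenation of the original embedding and all previous layer outputs."""
    def __init__(self, in_dim, num_classes, width=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, width)
        self.fc2 = nn.Linear(in_dim + width, width)             # input ++ layer-1 output
        self.fc3 = nn.Linear(in_dim + 2 * width, num_classes)   # input ++ layer-1 ++ layer-2
        self.act = nn.ReLU()

    def forward(self, x):
        h1 = self.act(self.fc1(x))
        h2 = self.act(self.fc2(torch.cat([x, h1], dim=1)))
        return self.fc3(torch.cat([x, h1, h2], dim=1))          # logits; softmax applied in the loss
```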
Table 3. Pairwise translation performance of linear probe on the ImageNet dataset. A linear classifier is trained on the output of an MLP which is trained to predict one view (columns) from another view (rows) by minimizing MSE. The top-1 accuracy (%) of the linear classifier is reported. Both the MLP and the linear classifier have access to the entire embedded ImageNet training set.
| | MAE | DINOv2 | SwAV | EsViT | CLIP |
|---|---|---|---|---|---|
| MAE | | 0.139 | 0.142 | 0.137 | 0.129 |
| DINOv2 | 0.110 | | 0.132 | 0.135 | 0.130 |
| SwAV | 0.116 | 0.147 | | 0.137 | 0.126 |
| EsViT | 0.112 | 0.140 | 0.116 | | 0.135 |
| CLIP | 0.110 | 0.146 | 0.136 | 0.156 | |
On the Independence of the Views. To test the independence of the views generated by the representation learning models, we trained an identical MLP architecture to predict each view from each other view; Table 3 presents our findings, explained below. Given as input the view to predict from (rows of Table 3), the MLPs were trained to reduce the mean squared error (MSE) between their output and the view to be predicted (columns of Table 3). We then trained a linear classifier on the outputs generated by the MLP given every training set embedding from each view. Had the model faithfully reconstructed the output view given only the input view, then we would expect the linear classifier to perform similarly to the last column of Table 1. The linear classifiers never did much better than a random guess on the ImageNet classes, achieving at most $0.156\%$ accuracy. Thus, while we cannot immediately conclude that the views are independent, it is clearly not trivial to predict one view given any other. We believe that this is compelling evidence for the independence of different representations.
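The following is a minimal sketch of this translation test under assumptions (the translator architecture, optimizer, full-batch updates, and epoch count are illustrative; the probe described above may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def translation_probe(view_a, view_b, labels, epochs=100):
    """Fit an MLP to predict view_b embeddings from view_a embeddings via MSE,
    then fit a linear classifier on the MLP outputs. Near-chance accuracy of the
    classifier suggests view_b is hard to recover from view_a."""
    translator = nn.Sequential(
        nn.Linear(view_a.shape[1], 1024), nn.ReLU(),
        nn.Linear(1024, view_b.shape[1]),
    )
    opt = torch.optim.Adam(translator.parameters())
    for _ in range(epochs):                       # full-batch updates, for brevity
        opt.zero_grad()
        F.mse_loss(translator(view_a), view_b).backward()
        opt.step()
    probe = nn.Linear(view_b.shape[1], int(labels.max()) + 1)
    opt = torch.optim.Adam(probe.parameters())
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(probe(translator(view_a).detach()), labels).backward()
        opt.step()
    return translator, probe
```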
5.2. Experimental Evaluation on ImageNet
Having constructed at least two strong views of the data, and suspecting that these views are independent, we hypothesize that co-training will work on this data. In Table 4a we show the results of applying the standard co-training algorithm [12].
(a) Co-Training
| | DINOv2 | SwAV | EsViT | CLIP |
|---|---|---|---|---|
| MAE | 81.8 (75.5) | 30.5 (11.4) | 77.1 (66.2) | 78.8 (66.8) |
| DINOv2 | | 78.5 (74.5) | 83.3 (78.9) | 85.1 (80.1) |
| SwAV | | | 70.6 (64.9) | 75.5 (67.4) |
| EsViT | | | | 82.3 (76.9) |
(b) Meta Co-Training
| | DINOv2 | SwAV | EsViT | CLIP |
|---|---|---|---|---|
| MAE | 81.2 (73.8) | 37.6 (8.1) | 73.6 (63.8) | 78.6 (63.5) |
| DINOv2 | | 77.3 (70.7) | 83.4 (79.2) | 85.2 (80.7) |
| SwAV | | | 74.4 (67.3) | 77.1 (67.4) |
| EsViT | | | | 82.4 (77.5) |
Table 4. Co-training and meta co-training top-1 accuracy of view combinations on $10\%$ ($1\%$) of ImageNet labels. As co-training does not depend on the order of the views, we show only upper-diagonal entries of the pairwise comparison. Diagonal entries would correspond to views which are trivially dependent. All $\binom{5}{2}=10$ different combinations of views are shown in each table.
Figure 3. Top-1 accuracy of co-training iterations on the CLIP and DINOv2 views for the ImageNet $10\%$ dataset.
Figure 4. Meta co-training using the CLIP and DINOv2 views as a function of the training step. Models are trained on $10\%$ of the ImageNet labels.
At each iteration of co-training, each model made a prediction for all of the instances in $U$, following (4). We assigned labels to the $10\%$ of the total original unlabeled data with the most confident predictions for both models. When confident predictions conflicted on examples, we returned those instances to the unlabeled set; otherwise they entered the labeled set of the other model with the assigned pseudo-label.
Predicting according to (4) yielded better accuracy than the predictions of each individual model. In addition, in our experiments on ImageNet, as co-training iterations proceeded, top-1 accuracy decreased. Later we show that this was not always the case (see Figure 11 in the appendix). The decrease was less pronounced when the views performed at a similar level; see Figure 3. However, meta co-training does not exhibit performance degradation; in fact, this is an ideal case where the method performs best; see Figure 4. While Figures 3 and 4 correspond to ImageNet $10\%$, similar observations also hold for ImageNet $1\%$; please see the appendix.
Comparing the co-training accuracy on ImageNet $10\%$ (resp. $1\%$) shown in Table 4a to the MLP accuracy shown in Table 2, we observe that co-training top-1 accur