[论文翻译] 从自然语言监督中学习可迁移的视觉模型 Learning Transferable Visual Models From Natural Language Supervision


下载PDF:2103.00020

下载代码:https://github.com/OpenAI/CLIP

Learning Transferable Visual Models From Natural Language Supervision 从自然语言监督中学习可迁移的视觉模型

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

ABSTRACT 摘要

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

最先进的计算机视觉系统被训练用于预测一组固定的、预先确定的对象类别。这种受限的监督形式限制了它们的通用性和可用性,因为要指定任何其他视觉概念都需要额外的标注数据。直接从描述图像的原始文本中学习是一种很有前景的替代方案,它能够利用更为广泛的监督来源。我们证明,"预测哪段文字与哪张图像相匹配"这一简单的预训练任务,是一种高效且可扩展的方式,能够在从互联网收集的 4 亿个(图像,文本)对数据集上从头学习最先进(SOTA)的图像表示。预训练完成后,可以用自然语言来引用已学到的视觉概念(或描述新的概念),从而实现模型向下游任务的零样本(zero-shot)迁移。我们在 30 多个不同的现有计算机视觉数据集上进行基准测试来研究这种方法的性能,任务涵盖 OCR、视频动作识别、地理定位以及多种细粒度对象分类。该模型能够不平凡地迁移到大多数任务,并且通常无需任何针对特定数据集的训练就能与完全监督的基线相竞争。例如,我们在零样本设置下达到了与原始 ResNet-50 在 ImageNet 上相当的准确率,而无需使用其训练所用的 128 万个训练样本中的任何一个。我们在 https://github.com/OpenAI/CLIP 发布了代码和预训练模型权重。

1 INTRODUCTION AND MOTIVATING WORK

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from Web text result in a similar breakthrough in computer vision? Prior work is encouraging.
直接从原始文本中学习的预训练方法在过去几年中给 NLP 带来了革命性的变化(Dai & Le,2015;Peters 等,2018;Howard & Ruder,2018;Radford 等,2018;Devlin 等,2018;Raffel 等,2019)。与任务无关的目标(例如自回归和掩码语言建模)已经在计算量、模型容量和数据上扩展了多个数量级,能力稳步提升。"文本到文本"作为标准化输入输出接口的发展(McCann 等,2018;Radford 等,2019;Raffel 等,2019)使得与任务无关的架构能够零样本迁移到下游数据集,从而不再需要专门的输出头或针对特定数据集的定制。像 GPT-3(Brown 等,2020)这样的旗舰系统如今在许多任务上都能与定制模型相竞争,而几乎不需要针对特定数据集的训练数据。

这些结果表明,在网络规模的文本集合中,现代预训练方法可获得的总体监督超过了高质量的众包标注 NLP 数据集。然而,在计算机视觉等其他领域,在众包标注的数据集(例如 ImageNet)上预训练模型仍然是标准做法(Deng 等,2009)。直接从网络文本中学习的可扩展预训练方法能否在计算机视觉领域带来类似的突破?先前的工作令人鼓舞。

Figure 1: Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.
图 1:我们方法的概述。标准图像模型联合训练图像特征提取器和线性分类器来预测某个标签,而 CLIP 联合训练图像编码器和文本编码器来预测一批(图像,文本)训练样本的正确配对。在测试时,学到的文本编码器通过嵌入目标数据集各类别的名称或描述来合成一个零样本线性分类器。

Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which preformed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.
二十多年前,Mori 等人(1999)通过训练模型来预测与图像配对的文本文档中的名词和形容词,探索了改进基于内容的图像检索。Quattoni 等人(2007)证明,在被训练用于预测图像相关字幕中单词的分类器的权重空间中进行流形学习,可以学到数据效率更高的图像表示。Srivastava 与 Salakhutdinov(2012)通过在低层图像和文本标签特征之上训练多模态深度玻尔兹曼机(Deep Boltzmann Machine),探索了深度表示学习。Joulin 等人(2016)将这一工作路线现代化,证明了训练 CNN 预测图像标题中的单词可以学到有用的图像表示。他们将 YFCC100M 数据集(Thomee 等,2016)中图像的标题、描述和话题标签元数据转换成词袋式多标签分类任务,并表明预训练 AlexNet(Krizhevsky 等,2012)来预测这些标签所学到的表示,在迁移任务上的表现与基于 ImageNet 的预训练相近。Li 等人(2017)随后将这种方法从单个单词扩展到预测短语 n-gram,并通过基于所学视觉 n-gram 词典对目标类别打分并预测得分最高的类别,展示了其系统向其他图像分类数据集零样本迁移的能力。采用更新的架构和预训练方法,VirTex(Desai & Johnson,2020)、ICMLM(Bulent Sariyildiz 等,2020)和 ConVIRT(Zhang 等,2020)最近展示了基于 Transformer 的语言建模、掩码语言建模和对比目标从文本中学习图像表示的潜力。

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.
尽管作为概念验证令人振奋,但使用自然语言监督进行图像表示学习仍然很少见。这很可能是因为其在常用基准上展示的性能远低于其他方法。例如,Li 等人(2017)在零样本设置下在 ImageNet 上仅达到 11.5% 的准确率。这远低于当前最先进水平的 88.4% 准确率(Xie 等,2020),甚至低于经典计算机视觉方法 50% 的准确率(Deng 等,2012)。相反,范围更窄但目标明确的弱监督使用提升了性能。Mahajan 等人(2018)表明,预测 Instagram 图像上与 ImageNet 相关的话题标签是一项有效的预训练任务。在对 ImageNet 进行微调后,这些预训练模型将准确率提高了 5% 以上,并改进了当时的整体最先进水平。Kolesnikov 等人(2019)和 Dosovitskiy 等人(2020)也通过预训练模型来预测带噪声标注的 JFT-300M 数据集的类别,在更广泛的迁移基准集合上展示了巨大的收益。

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.
这条工作路线代表了在从有限数量的有监督"金标签"中学习与从几乎无限量的原始文本中学习之间的当前务实折中。然而,它并非没有妥协。这两项工作都精心设计了各自的监督信号,并在此过程中将其分别限制为 1000 类和 18291 类。自然语言凭借其通用性,能够表达并监督范围广泛得多的视觉概念。这两种方法还都使用静态 softmax 分类器进行预测,缺乏动态输出的机制。这严重削弱了它们的灵活性,并限制了它们的"零样本"能力。

这些弱监督模型与近期直接从自然语言学习图像表示的探索之间的一个关键区别在于规模。Mahajan 等人(2018)和 Kolesnikov 等人(2019)在数百万至数十亿张图像上训练了数个加速器年,而 VirTex、ICMLM 和 ConVIRT 只在十万到二十万张图像上训练了数个加速器日。在这项工作中,我们弥合了这一差距,并大规模研究了在自然语言监督下训练的图像分类器的行为。得益于互联网上大量公开可用的这类数据,我们创建了一个包含 4 亿个(图像,文本)对的新数据集,并证明从头训练的 ConVIRT 简化版本(我们称之为 CLIP,即对比语言-图像预训练,Contrastive Language-Image Pre-training)是一种从自然语言监督中学习的高效方法。我们通过训练一系列横跨近两个数量级计算量的 8 个模型来研究 CLIP 的可扩展性,并观察到迁移性能是计算量的一个平滑可预测的函数(Hestness 等,2017;Kaplan 等,2020)。我们发现,CLIP 与 GPT 系列类似,在预训练期间学会执行一系列广泛的任务,包括 OCR、地理定位、动作识别等。我们通过在 30 多个现有数据集上对 CLIP 的零样本迁移性能进行基准测试来衡量这一点,发现它可以与先前针对特定任务的监督模型相竞争。我们还通过线性探针(linear-probe)表示学习分析证实了这些发现,并表明 CLIP 优于最好的公开可用 ImageNet 模型,同时计算效率也更高。此外,我们发现零样本 CLIP 模型比准确率相当的有监督 ImageNet 模型稳健得多,这表明对任务无关模型的零样本评估更能代表模型的能力。这些结果具有重要的政策和伦理意义,我们将在第 7 节中讨论。

Figure 2: CLIP is much more efficient at zero-shot transfer than our image caption baseline. Although highly expressive, we found that transformer-based language models are relatively weak at zero-shot ImageNet classification. Here, we see that it learns 3x slower than a baseline which predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP further improves efficiency another 4x.
图 2:CLIP 在零样本迁移方面比我们的图像字幕基线高效得多。尽管表达能力很强,我们发现基于 Transformer 的语言模型在零样本 ImageNet 分类上相对较弱。在这里可以看到,它的学习速度比预测文本词袋(BoW)编码的基线慢 3 倍(Joulin 等,2016)。将预测目标换成 CLIP 的对比目标可将效率再提高 4 倍。

2 APPROACH

2.1 NATURAL LANGUAGE SUPERVISION 自然语言监督

At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.

We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).

Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.
我们方法的核心,是从自然语言所包含的监督中学习感知。正如引言中所讨论的,这根本不是一个新想法,但用于描述该领域工作的术语五花八门,甚至看似相互矛盾,所陈述的动机也各不相同。Zhang 等(2020)、Gomez 等(2017)、Joulin 等(2016)和 Desai & Johnson(2020)都提出了从与图像配对的文本中学习视觉表示的方法,却分别将各自的方法描述为无监督、自监督、弱监督和有监督。

我们强调,这条工作路线的共同点并不在于所用具体方法的任何细节,而在于把自然语言当作训练信号来加以利用。所有这些方法都是在从自然语言监督中学习。尽管早期工作在使用主题模型和 n-gram 表示时受困于自然语言的复杂性,但深度上下文表示学习的进步表明,我们现在已经拥有了有效利用这一丰富监督来源的工具(McCann 等,2017)。

与其他训练方法相比,从自然语言中学习有若干潜在优势。与用于图像分类的标准众包标注相比,扩展自然语言监督要容易得多,因为它不要求标注采用经典的"机器学习兼容格式",例如规范的 N 选 1 多数投票"金标签"。相反,基于自然语言的方法可以从互联网上海量文本所包含的监督中被动地学习。从自然语言中学习相比大多数无监督或自监督学习方法还有一个重要优势:它不"仅仅"学习一种表示,还将这种表示与语言联系起来,从而实现灵活的零样本迁移。在接下来的小节中,我们详细介绍我们最终确定的具体方法。

2.2 CREATING A SUFFICIENTLY LARGE DATASET 创建足够大的数据集

Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries. (Footnote: The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet synsets not already in the query list are added.) We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.

现有工作主要使用了三个数据集:MS-COCO(Lin 等,2014)、Visual Genome(Krishna 等,2017)和 YFCC100M(Thomee 等,2016)。虽然 MS-COCO 和 Visual Genome 是高质量的众包标注数据集,但按照现代标准它们规模很小,各自只有约 10 万张训练图片。相比之下,其他计算机视觉系统是在多达 35 亿张 Instagram 照片上训练的(Mahajan 等,2018)。拥有 1 亿张照片的 YFCC100M 是一个可能的替代选择,但每张图像的元数据稀疏且质量参差不齐。许多图像使用自动生成的文件名(例如 20160716_113957.JPG)作为"标题",或者把相机曝光设置作为"描述"。经过过滤,只保留带有英文自然语言标题和/或描述的图像后,数据集缩小为原来的六分之一,只剩约 1500 万张照片,规模与 ImageNet 大致相同。

自然语言监督的一个主要动机是互联网上公开可得的大量这类数据。由于现有数据集不能充分反映这种可能性,仅考虑在它们上面的结果将低估这一研究方向的潜力。为了解决这个问题,我们构建了一个新的包含 4 亿个(图像,文本)对的数据集,这些数据对收集自互联网上各种公开可用的来源。为了尽可能覆盖广泛的视觉概念,我们在构建过程中搜索其文本包含 50 万个查询之一的(图像,文本)对。(脚注:基本查询列表是英文版维基百科中出现至少 100 次的所有单词,并补充了逐点互信息较高的二元词组(bi-gram)以及搜索量超过一定阈值的所有维基百科条目名称;最后加入所有尚未出现在查询列表中的 WordNet 同义词集。)通过每个查询最多包含 20,000 个(图像,文本)对,我们对结果进行了近似的类别平衡。所得数据集的总词数与用于训练 GPT-2 的 WebText 数据集相近。我们将此数据集称为 WIT,即 WebImageText。
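The approximate class balancing described above (keeping at most 20,000 pairs per query) can be illustrated with a short sketch. This is a hedged, hypothetical illustration: the (query, image_url, text) tuple layout and the helper function are assumptions for exposition, not the actual pipeline used to build WIT.

```python
# A hedged sketch of capping (image, text) pairs per query for approximate
# class balancing. The tuple layout is an assumed example format.
from collections import defaultdict

def balance_by_query(pairs, cap=20_000):
    kept, counts = [], defaultdict(int)
    for query, image_url, text in pairs:
        if counts[query] < cap:          # keep at most `cap` pairs per query
            counts[query] += 1
            kept.append((image_url, text))
    return kept

# Toy usage: with cap=2, only two of the three "dog" pairs are kept.
sample = [("dog", f"http://example.com/{i}.jpg", "a photo of a dog") for i in range(3)]
print(len(balance_by_query(sample, cap=2)))  # -> 2
```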

2.3 SELECTING AN EFFICIENT PRE-TRAINING METHOD 选择一种有效的预训练方法

State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N×N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N²−N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder's representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020) which samples a single sentence at uniform from the text since many of the (image, text) pairs in CLIP's pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
最先进的计算机视觉系统需要非常大量的计算。Mahajan 等人(2018)需要 19 个 GPU 年来训练他们的 ResNeXt101-32x48d,Xie 等人(2020)需要 33 个 TPUv3 核心年来训练他们的 Noisy Student EfficientNet-L2。考虑到这两个系统都只被训练来预测 1000 个 ImageNet 类别,从自然语言中学习一组开放的视觉概念似乎是一项艰巨的任务。在我们的工作过程中,我们发现训练效率是成功扩展自然语言监督的关键,因此我们依据这一指标选择了最终的预训练方法。

我们最初的方法与 VirTex 类似,从头联合训练一个图像 CNN 和一个文本 Transformer 来预测图像的标题。然而,我们在高效扩展这种方法时遇到了困难。在图 2 中我们展示了一个 6300 万参数的 Transformer 语言模型(其计算量已经是其 ResNet-50 图像编码器的两倍),其学会识别 ImageNet 类别的速度,比一个简单得多的、预测相同文本词袋编码的基线慢三倍。

这两种方法有一个关键的共同点:它们都试图预测每张图像所附文本的确切字词。由于与图像共同出现的描述、评论和相关文本种类繁多,这是一项困难的任务。近期关于图像对比表示学习的工作发现,对比目标能比等价的预测目标学到更好的表示(Tian 等,2019)。其他工作发现,尽管图像生成模型可以学到高质量的图像表示,但要达到同样的性能,它们需要比对比模型多一个数量级以上的计算量(Chen 等,2020a)。注意到这些发现,我们探索训练一个系统来解决一个可能更容易的代理任务:只预测哪段文本作为整体与哪张图像配对,而不预测该文本的确切字词。从相同的词袋编码基线出发,我们在图 2 中将预测目标换成对比目标,并观察到零样本迁移到 ImageNet 的效率进一步提高了 4 倍。

给定一批 N 个(图像,文本)对,CLIP 被训练来预测这一批中 N×N 种可能的(图像,文本)配对中哪些是实际出现的。为此,CLIP 通过联合训练图像编码器和文本编码器来学习一个多模态嵌入空间,使得批次中 N 个真实配对的图像嵌入与文本嵌入的余弦相似度最大化,同时使 N²−N 个错误配对的嵌入的余弦相似度最小化。我们在这些相似度分数上优化一个对称交叉熵损失。在图 3 中,我们给出了 CLIP 实现核心的伪代码。据我们所知,这种批量构建技术和目标最早作为多类 N 对损失(multi-class N-pair loss)被引入深度度量学习领域(Sohn,2016),后被 Oord 等人(2018)作为 InfoNCE 损失推广用于对比表示学习,最近又被 Zhang 等人(2020)用于医学影像领域的对比(文本,图像)表示学习。

由于我们的预训练数据集规模很大,过拟合并不是主要问题,与 Zhang 等人(2020)的实现相比,训练 CLIP 的细节得以简化。我们从头开始训练 CLIP,既不用 ImageNet 权重初始化图像编码器,也不用预训练权重初始化文本编码器。我们没有使用表示与对比嵌入空间之间的非线性投影(这一改动由 Bachman 等人(2019)引入,并由 Chen 等人(2020b)推广),而是仅使用线性投影把每个编码器的表示映射到多模态嵌入空间。我们没有观察到两个版本在训练效率上的差异,并推测非线性投影可能是与当前纯图像自监督表示学习方法的细节共同适应的。我们还移除了 Zhang 等人(2020)中的文本变换函数 t_u(它从文本中均匀采样一个句子),因为 CLIP 预训练数据集中的许多(图像,文本)对只有一个句子。我们也简化了图像变换函数 t_v:从调整尺寸后的图像中随机裁剪正方形是训练期间唯一使用的数据增强。最后,控制 softmax 中 logit 范围的温度参数 τ 在训练过程中作为一个对数参数化的乘法标量直接优化,从而避免将其作为超参数来调节。

Figure 3: Numpy-like pseudocode for the core of an implementation of CLIP.
图 3: CLIP 实现核心的类似 Numpy 的伪代码。
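Since the Figure 3 pseudocode itself is not reproduced in this post, below is a minimal, self-contained sketch of the symmetric contrastive loss it describes, written in PyTorch. The toy dimensions and the random tensors standing in for encoder outputs are illustrative assumptions, not the paper's actual encoders.

```python
# A minimal sketch of CLIP's symmetric cross-entropy contrastive loss
# (cf. Figure 3). Random tensors stand in for encoder outputs.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # L2-normalize, compute all pairwise cosine similarities, scale by the
    # learned temperature (stored as a log-parameterized scalar).
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale.exp() * image_features @ text_features.t()  # [N, N]
    labels = torch.arange(logits.shape[0])        # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2

# Toy usage with random "embeddings" in place of the image/text encoders.
N, d = 8, 512
image_features, text_features = torch.randn(N, d), torch.randn(N, d)
logit_scale = torch.tensor(1 / 0.07).log()
print(clip_loss(image_features, text_features, logit_scale))
```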

2.4 CHOOSING AND SCALING A MODEL 选择和缩放模型

We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNet-D improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.

The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.

While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms only allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP’s performance to be less sensitive to the capacity of the text encoder.
我们为图像编码器考虑了两种不同的架构。第一种,由于其广泛采用和经过验证的性能,我们使用 ResNet-50(He 等,2016a)作为图像编码器的基础架构。我们在原始版本上做了若干修改,采用了 He 等人(2019)的 ResNet-D 改进以及 Zhang(2019)的抗锯齿 rect-2 模糊池化。我们还用一种注意力池化机制取代了全局平均池化层。该注意力池化被实现为单层"Transformer 式"多头 QKV 注意力,其中查询(query)以图像的全局平均池化表示为条件。第二种架构,我们试验了最近提出的 Vision Transformer(ViT)(Dosovitskiy 等,2020)。我们基本沿用其实现,仅做了少量修改:在 Transformer 之前对组合后的 patch 嵌入与位置嵌入额外添加一层层归一化,并使用略有不同的初始化方案。

文本编码器是一个 Transformer(Vaswani 等,2017),并采用了 Radford 等人(2019)中描述的架构修改。作为基础规模,我们使用一个 6300 万参数、12 层、宽度 512、带 8 个注意力头的模型。该 Transformer 在文本的小写字节对编码(BPE)表示上运行,词表大小为 49,152(Sennrich 等,2015)。为了计算效率,最大序列长度限制为 76。文本序列用 [SOS] 和 [EOS] 标记包裹,Transformer 最高层在 [EOS] 标记处的激活经过层归一化后被线性投影到多模态嵌入空间,作为文本的特征表示。文本编码器中使用了带掩码的自注意力,以保留用预训练语言模型初始化或将语言建模作为辅助目标的可能性,不过对此的探索留待未来工作。

以往的计算机视觉研究通常通过单独增加宽度(Mahajan 等,2018)或深度(He 等,2016a)来缩放模型,而对于 ResNet 图像编码器,我们采用了 Tan & Le(2019)的方法,该方法发现把额外的计算量同时分配到宽度、深度和分辨率上,优于只分配给模型的某一个维度。Tan & Le(2019)为其 EfficientNet 架构调节了分配给每个维度的计算比例,而我们使用一个简单的基线:把额外计算量平均分配给增加模型的宽度、深度和分辨率。对于文本编码器,我们只把模型宽度按与 ResNet 宽度计算出的增幅成比例地缩放,完全不缩放深度,因为我们发现 CLIP 的性能对文本编码器的容量不太敏感。
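As a rough illustration of the attention pooling described above (a single layer of multi-head QKV attention whose query is the global average-pooled feature map), here is a hedged PyTorch sketch. The use of nn.MultiheadAttention and the chosen dimensions are simplifying assumptions, not the exact CLIP implementation.

```python
# A hedged sketch of attention pooling over a ResNet feature map: the query
# is the global average-pooled representation; keys/values are the spatial
# positions. Dimensions are illustrative.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=2048, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feature_map):                        # [B, H*W, dim]
        query = feature_map.mean(dim=1, keepdim=True)      # global average pool -> [B, 1, dim]
        pooled, _ = self.attn(query, feature_map, feature_map)
        return pooled.squeeze(1)                           # [B, dim]

x = torch.randn(2, 49, 2048)             # e.g. a flattened 7x7 feature map
print(AttentionPool()(x).shape)           # torch.Size([2, 2048])
```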

2.5 TRAINING 训练

We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
我们训练了一系列 5 个 ResNet 和 3 个 Vision Transformer。对于 ResNet,我们训练了 ResNet-50、ResNet-101,以及另外 3 个遵循 EfficientNet 式模型缩放、计算量分别约为 ResNet-50 的 4 倍、16 倍和 64 倍的模型,分别记为 RN50x4、RN50x16 和 RN50x64。对于 Vision Transformer,我们训练了 ViT-B/32、ViT-B/16 和 ViT-L/14。所有模型均训练 32 个 epoch。我们使用 Adam 优化器(Kingma & Ba,2014),对所有非增益(gain)或偏置(bias)的权重应用解耦权重衰减正则化(Loshchilov & Hutter,2017),并使用余弦调度衰减学习率(Loshchilov & Hutter,2016)。初始超参数是在基线 ResNet-50 模型训练 1 个 epoch 时,通过网格搜索、随机搜索和手动调整相结合设定的。由于计算限制,更大模型的超参数随后通过启发式方式调整。可学习的温度参数 τ 被初始化为相当于(Wu 等,2018)中的 0.07,并被裁剪以防止将 logit 缩放超过 100 倍,我们发现这对防止训练不稳定是必要的。我们使用 32,768 的非常大的 minibatch。混合精度(Micikevicius 等,2017)被用于加速训练并节省显存。为了进一步节省显存,我们使用了梯度检查点(Griewank & Walther,2000;Chen 等,2016)、半精度 Adam 统计量(Dhariwal 等,2020)以及半精度随机舍入的文本编码器权重。嵌入相似度的计算也被分片到各个 GPU 上,每个 GPU 只计算其本地嵌入批次所需的成对相似度子集。最大的 ResNet 模型 RN50x64 在 592 块 V100 GPU 上训练了 18 天,而最大的 Vision Transformer 在 256 块 V100 GPU 上训练了 12 天。对于 ViT-L/14,我们还在更高的 336 像素分辨率下额外预训练一个 epoch,以类似 FixRes(Touvron 等,2019)的方式提升性能。我们将该模型记为 ViT-L/14@336px。除非另有说明,本文中报告为"CLIP"的所有结果均使用这个我们发现表现最佳的模型。
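A minimal sketch of the optimization details above follows, under several assumptions: AdamW stands in for Adam with decoupled weight decay, the learning rate and weight decay values are placeholders rather than the paper's settings, weight decay is applied to all parameters for brevity (the paper excludes gains and biases), and the temperature is a learnable log-parameterized scalar initialized to log(1/0.07) and clamped so the logit scale never exceeds 100.

```python
# A hedged sketch of the optimizer, cosine schedule, and clipped learnable
# temperature described above. `model` and the hyperparameter values are
# placeholders, not the paper's actual configuration.
import math
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                   # stand-in for the real encoders
logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))

optimizer = torch.optim.AdamW(
    list(model.parameters()) + [logit_scale], lr=1e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=32)

for epoch in range(32):
    # (training loop elided) compute the contrastive loss, then:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()
    with torch.no_grad():
        logit_scale.clamp_(max=math.log(100.0))               # keep exp(logit_scale) <= 100
    scheduler.step()
```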

3 EXPERIMENTS 实验

3.1 ZERO-SHOT TRANSFER

Motivation

In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image classification (Lampert et al., 2009). We instead use the term in a broader sense and study generalization to unseen datasets. We motivate this as a proxy for performing unseen tasks, as aspired to in the zero-data learning paper of Larochelle et al. (2008). While much research in the field of unsupervised learning focuses on the representation learning capabilities of machine learning systems, we motivate studying zero-shot transfer as a way of measuring the task-learning capabilities of machine learning systems. In this view, a dataset evaluates performance on a task on a specific distribution. However, many popular computer vision datasets were created by the research community primarily as benchmarks to guide the development of generic image classification methods rather than measuring performance on a specific task. While it is reasonable to say that the SVHN dataset measures the task of street number transcription on the distribution of Google Street View photos, it is unclear what “real” task the CIFAR-10 dataset measures. It is clear, however, what distribution CIFAR-10 is drawn from - TinyImages (Torralba et al., 2008). On these kinds of datasets, zero-shot transfer is more an evaluation of CLIP’s robustness to distribution shift and domain generalization rather than task generalization. Please see Section 3.3 for analysis focused on this.

To our knowledge, Visual N-Grams (Li et al., 2017) first studied zero-shot transfer to existing image classification datasets in the manner described above. It is also the only other work we are aware of that has studied zero-shot transfer to standard image classification datasets using a generically pre-trained model and serves as the best reference point for contextualizing CLIP. Their approach learns the parameters of a dictionary of 142,806 visual n-grams (spanning 1- to 5- grams) and optimizes these n-grams using a differential version of Jelinek-Mercer smoothing to maximize the probability of all text n-grams for a given image. In order to perform zero-shot transfer, they first convert the text of each of the dataset’s class names into its n-gram representation and then compute its probability according to their model, predicting the one with the highest score.

Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages. While GPT-1 (Radford et al., 2018) focused on pre-training as a transfer learning method to improve supervised fine-tuning, it also included an ablation study demonstrating that the performance of four heuristic zero-shot transfer methods improved steadily over the course of pre-training, without any supervised adaption. This analysis served as the basis for GPT-2 (Radford et al., 2019) which focused exclusively on studying the task-learning capabilities of language models via zero-shot transfer.
在计算机视觉中,零样本学习通常指的是在图像分类中泛化到未见过的对象类别的研究(Lampert 等,2009)。我们则在更宽泛的意义上使用这一术语,研究向未见过的数据集的泛化。我们将其视为执行未见过任务的一种代理,正如 Larochelle 等人(2008)的零数据学习论文所期望的那样。尽管无监督学习领域的许多研究聚焦于机器学习系统的表示学习能力,我们则把研究零样本迁移作为衡量机器学习系统任务学习能力的一种方式。在这种视角下,一个数据集评估的是在某个特定分布上完成某个任务的性能。然而,许多流行的计算机视觉数据集是由研究社区创建的,主要作为指导通用图像分类方法开发的基准,而不是衡量某个特定任务的性能。虽然可以合理地说 SVHN 数据集衡量的是在 Google 街景照片分布上的门牌号转录任务,但 CIFAR-10 数据集衡量的是什么"真实"任务并不清楚。不过,CIFAR-10 来自什么分布是明确的:TinyImages(Torralba 等,2008)。在这类数据集上,零样本迁移更多的是评估 CLIP 对分布偏移和领域泛化的稳健性,而不是任务泛化。针对这一点的分析请参见第 3.3 节。

据我们所知,Visual N-Grams(Li 等,2017)首先以上述方式研究了向现有图像分类数据集的零样本迁移。这也是我们所知的唯一一项使用通用预训练模型研究向标准图像分类数据集零样本迁移的其他工作,是为 CLIP 提供参照背景的最佳参考点。他们的方法学习一个包含 142,806 个视觉 n-gram(从 1-gram 到 5-gram)的词典的参数,并使用 Jelinek-Mercer 平滑的可微版本优化这些 n-gram,以最大化给定图像下所有文本 n-gram 的概率。为了执行零样本迁移,他们首先把数据集中每个类名的文本转换为其 n-gram 表示,然后根据其模型计算概率,并预测得分最高的那个类别。

我们之所以把研究零样本迁移作为任务学习的一种评估,是受到 NLP 领域展示任务学习的工作的启发。据我们所知,Liu 等人(2018)最早把任务学习识别为一种"意外的副作用":一个被训练来生成维基百科文章的语言模型学会了在不同语言之间可靠地音译名字。GPT-1(Radford 等,2018)专注于把预训练作为一种迁移学习方法来改进有监督微调,但它也包含一项消融研究,表明四种启发式零样本迁移方法的性能在预训练过程中稳步提升,而不需要任何有监督的适配。这一分析成为 GPT-2(Radford 等,2019)的基础,后者专门通过零样本迁移来研究语言模型的任务学习能力。

Using CLIP for Zero-Shot Transfer

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork (Ha et al., 2016) which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent. Lei Ba et al. (2015) first introduced a zero-shot image classifier of this form while the idea of generating a classifier from natural language dates back to at least Elhoseiny et al. (2013). Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions. For zero-shot evaluation, we cache the zero-shot classifier once it has been computed by the text encoder and reuse it for all subsequent predictions. This allows the cost of generating it to be amortized across all the predictions in a dataset.
CLIP 经过预训练,用于预测一张图像和一段文本片段在其数据集中是否配对。为了执行零样本分类,我们复用这一能力。对于每个数据集,我们把数据集中所有类别的名称作为潜在文本配对的集合,并根据 CLIP 预测最可能的(图像,文本)对。更具体地说,我们首先通过各自的编码器计算图像的特征嵌入和一组可能文本的特征嵌入,然后计算这些嵌入的余弦相似度,用温度参数 τ 进行缩放,并通过 softmax 归一化为概率分布。注意,这个预测层是一个多项逻辑回归分类器,具有 L2 归一化的输入、L2 归一化的权重、无偏置项以及温度缩放。按这种方式理解,图像编码器是计算图像特征表示的计算机视觉主干网络,而文本编码器是一个超网络(hypernetwork)(Ha 等,2016),它根据指明类别所代表视觉概念的文本,生成线性分类器的权重。Lei Ba 等人(2015)最早提出了这种形式的零样本图像分类器,而从自然语言生成分类器的想法至少可以追溯到 Elhoseiny 等人(2013)。沿着这种理解,CLIP 预训练的每一步都可以看作是在优化一个随机创建的计算机视觉数据集代理上的性能,该数据集每个类别包含 1 个样本,并通过自然语言描述定义了总共 32,768 个类别。对于零样本评估,一旦文本编码器计算出零样本分类器,我们就将其缓存,并复用于所有后续预测。这使得生成它的成本可以在数据集中的所有预测上摊销。
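The zero-shot classification procedure above can be summarized in a few lines. In the sketch below, the image encoder, text encoder, and tokenizer are faked with random projections so the snippet runs on its own; they are stand-ins for the trained CLIP encoders, and the logit scale is an assumed constant.

```python
# A minimal sketch of zero-shot classification: embed one prompt per class,
# cache the text embeddings, and score an image by cosine similarity + softmax.
import torch
import torch.nn.functional as F

encode_image = torch.nn.Linear(1024, 512)          # stand-in image encoder
encode_text = torch.nn.Linear(76, 512)             # stand-in text encoder
tokenize = lambda s: torch.randn(76)               # stand-in tokenizer

class_names = ["dog", "cat", "car"]
prompts = [f"A photo of a {c}." for c in class_names]

with torch.no_grad():
    # Build the zero-shot classifier once: one normalized text embedding per class.
    text_feats = F.normalize(encode_text(torch.stack([tokenize(p) for p in prompts])), dim=-1)
    # Embed an image (random features here) and normalize it.
    image_feat = F.normalize(encode_image(torch.randn(1, 1024)), dim=-1)
    logit_scale = torch.tensor(100.0)              # exp of the learned temperature
    probs = (logit_scale * image_feat @ text_feats.t()).softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```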

Initial Comparison to Visual N-Grams

| | aYahoo | ImageNet | SUN |
| --- | --- | --- | --- |
| Visual N-Grams | 72.4 | 11.5 | 23.0 |
| CLIP | 98.4 | 76.2 | 58.5 |

Table 1: Comparing CLIP to prior zero-shot transfer image classification results. CLIP improves performance on all three datasets by a large amount.
This improvement reflects many differences in the 4 years since the development of Visual N-Grams (Li et al., 2017).

表 1:将 CLIP 与先前的零样本迁移图像分类结果进行比较。CLIP 在全部三个数据集上都大幅提升了性能。这一改进反映了自 Visual N-Grams(Li 等,2017)提出以来 4 年间的许多差异。

In Table 1 we compare Visual N-Grams to CLIP. The best CLIP model improves accuracy on ImageNet from a proof of concept 11.5% to 76.2% and matches the performance of the original ResNet-50 despite using none of the 1.28 million crowd-labeled training examples available for this dataset. Additionally, the top-5 accuracy of CLIP models is noticeably higher than their top-1, and this model has a 95% top-5 accuracy, matching Inception-V4 (Szegedy et al., 2016). The ability to match the performance of a strong, fully supervised baseline in a zero-shot setting suggests CLIP is a significant step towards flexible and practical zero-shot computer vision classifiers. As mentioned above, the comparison to Visual N-Grams is meant for contextualizing the performance of CLIP and should not be interpreted as a direct methods comparison between CLIP and Visual N-Grams as many performance relevant differences between the two systems were not controlled for. For instance, we train on a dataset that is 10x larger, use a vision model that requires nearly 100x more compute per prediction, likely used over 1000x their training compute, and use a transformer-based model which did not exist when Visual N-Grams was published. As a closer comparison, we trained a CLIP ResNet-50 on the same YFCC100M dataset that Visual N-Grams was trained on and found it matched their reported ImageNet performance within a V100 GPU day. This baseline was also trained from scratch instead of being initialized from pre-trained ImageNet weights as in Visual N-Grams.
在表 1 中,我们将 Visual N-Grams 与 CLIP 进行了比较。最佳的 CLIP 模型将 ImageNet 上的准确率从概念验证性的 11.5% 提高到 76.2%,并且在完全没有使用该数据集可用的 128 万个众包标注训练样本的情况下,达到了与原始 ResNet-50 相当的性能。此外,CLIP 模型的 top-5 准确率明显高于其 top-1 准确率,该模型的 top-5 准确率达到 95%,与 Inception-V4 相当(Szegedy 等,2016)。能够在零样本设置下匹配一个强大的、完全监督的基线的性能,表明 CLIP 是朝着灵活而实用的零样本计算机视觉分类器迈出的重要一步。如上所述,与 Visual N-Grams 的比较是为了给 CLIP 的性能提供参照背景,而不应被解读为 CLIP 与 Visual N-Grams 之间的直接方法比较,因为两个系统之间许多与性能相关的差异并未加以控制。例如,我们在一个大 10 倍的数据集上训练,使用每次预测所需计算量高出近 100 倍的视觉模型,训练计算量可能是其 1000 倍以上,并且使用了在 Visual N-Grams 发表时尚不存在的基于 Transformer 的模型。作为更接近的比较,我们在 Visual N-Grams 训练所用的同一 YFCC100M 数据集上训练了一个 CLIP ResNet-50,发现它在一个 V100 GPU 天内就达到了他们报告的 ImageNet 性能。这一基线也是从头训练的,而不是像 Visual N-Grams 那样用预训练的 ImageNet 权重初始化。

CLIP also outperforms Visual N-Grams on the other 2 reported datasets. On aYahoo, CLIP achieves a 95% reduction in the number of errors, and on SUN, CLIP more than doubles the accuracy of Visual N-Grams. To conduct a more comprehensive analysis and stress test, we implement a much larger evaluation suite detailed in Appendix A. In total we expand from the 3 datasets reported in Visual N-Grams to include over 30 datasets and compare to over 50 existing computer vision systems to contextualize results.

CLIP 在另外两个报告的数据集上也优于 Visual N-Grams。在 aYahoo 上,CLIP 将错误数量减少了 95%;在 SUN 上,CLIP 的准确率是 Visual N-Grams 的两倍以上。为了进行更全面的分析和压力测试,我们实现了附录 A 中详述的更大规模的评估套件。总的来说,我们从 Visual N-Grams 报告的 3 个数据集扩展到 30 多个数据集,并与 50 多个现有计算机视觉系统进行比较,以便为结果提供参照背景。

Prompt Engineering and Ensembling

Figure 4: Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4 times more compute with the baseline zero-shot method but is "free" when amortized over many predictions.
图 4:提示工程(prompt engineering)与集成(ensembling)提升了零样本性能。与使用无上下文类名的基线相比,在 36 个数据集上,提示工程与集成使零样本分类性能平均提高了近 5 个百分点。这一改进与在基线零样本方法上使用 4 倍计算量所带来的收益相近,但在对大量预测摊销时是"免费"的。

Most standard image classification datasets treat the information naming or describing classes which enables natural language based zero-shot transfer as an afterthought. The vast majority of datasets annotate images with just a numeric id of the label and contain a file mapping these ids back to their names in English. Some datasets, such as Flowers102 and GTSRB, don't appear to include this mapping at all in their released versions preventing zero-shot transfer entirely. (Footnote: Alec learned much more about flower species and German traffic signs over the course of this project than he originally anticipated.) For many datasets, we observed these labels may be chosen somewhat haphazardly and do not anticipate issues related to zero-shot transfer which relies on task description in order to transfer successfully.

A common issue is polysemy. When the name of a class is the only information provided to CLIP’s text encoder it is unable to differentiate which word sense is meant due to the lack of context. In some cases multiple meanings of the same word might be included as different classes in the same dataset! This happens in ImageNet which contains both construction cranes and cranes that fly. Another example is found in classes of the Oxford-IIIT Pet dataset where the word boxer is, from context, clearly referring to a breed of dog, but to a text encoder lacking context could just as likely refer to a type of athlete.

Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found that using the prompt template “A photo of a {label}.” to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text. For instance, just using this prompt improves accuracy on ImageNet by 1.3%.

Similar to the “prompt engineering” discussion around GPT-3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as "A photo of a big {label}" and "A photo of a small {label}". We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions. We've observed ensembling across many generated zero-shot classifiers to reliably improve performance and use it for the majority of datasets. On ImageNet, we ensemble 80 different context prompts and this improves performance by an additional 3.5% over the single default prompt discussed above. When considered together, prompt engineering and ensembling improve ImageNet accuracy by almost 5%. In Figure 4 we visualize how prompt engineering and ensembling change the performance of a set of CLIP models compared to the contextless baseline approach of directly embedding the class name as done in Li et al. (2017).

大多数标准图像分类数据集把命名或描述类别的信息(正是这些信息使得基于自然语言的零样本迁移成为可能)当作事后才考虑的事情。绝大多数数据集仅用标签的数字 ID 来标注图像,并包含一个把这些 ID 映射回英文名称的文件。某些数据集(例如 Flowers102 和 GTSRB)在其发布版本中似乎根本没有包含这种映射,从而完全阻碍了零样本迁移。(脚注:在这个项目的过程中,Alec 对花卉品种和德国交通标志的了解比他原本预期的要多得多。)对于许多数据集,我们观察到这些标签的选择可能有些随意,并且没有考虑到零样本迁移需要依赖任务描述才能成功迁移的问题。

一个常见的问题是一词多义。当类别名称是提供给 CLIP 文本编码器的唯一信息时,由于缺少上下文,它无法区分指的是哪个词义。在某些情况下,同一个词的多种含义可能作为不同的类别出现在同一个数据集中!这种情况发生在 ImageNet 中,它既包含建筑用起重机(crane),也包含会飞的鹤(crane)。另一个例子出现在 Oxford-IIIT Pet 数据集的类别中:boxer 一词从上下文看显然指的是一种狗的品种,但对缺乏上下文的文本编码器来说,它同样可能指的是一种运动员。

我们遇到的另一个问题是,在我们的预训练数据集中,与图像配对的文本只是单个词的情况相对罕见。通常文本是以某种方式描述图像的完整句子。为了弥合这种分布差距,我们发现使用提示模板"A photo of a {label}."(一张{label}的照片。)是一个很好的默认值,它有助于说明文本是关于图像内容的。相比仅使用标签文本的基线,这通常能提高性能。例如,仅使用这个提示就能将 ImageNet 上的准确率提高 1.3%。

与围绕 GPT-3 的"提示工程"讨论(Brown 等,2020;Gao 等,2020)类似,我们也观察到,通过为每个任务定制提示文本可以显著提升零样本性能。下面是一些并非详尽的例子。我们在几个细粒度图像分类数据集上发现,指明类别所属的范畴是有帮助的。例如在 Oxford-IIIT Pets 上,使用"A photo of a {label}, a type of pet."来提供上下文效果很好。同样,在 Food101 上指明是一种食物,在 FGVC Aircraft 上指明是一种飞机,也都有帮助。对于 OCR 数据集,我们发现给要识别的文本或数字加上引号可以提升性能。最后,我们发现在卫星图像分类数据集上,指明图像属于这种形式是有帮助的,我们使用"a satellite photo of a {label}."的各种变体。

我们还尝试了对多个零样本分类器进行集成,作为另一种提升性能的方式。这些分类器通过使用不同的上下文提示来计算,例如"A photo of a big {label}"和"A photo of a small {label}"。我们在嵌入空间而不是概率空间上构建集成。这使我们可以缓存一组平均后的文本嵌入,从而在对大量预测摊销时,集成的计算成本与使用单个分类器相同。我们观察到,对许多生成的零样本分类器进行集成能够可靠地提升性能,因此我们在大多数数据集上都使用了它。在 ImageNet 上,我们集成了 80 个不同的上下文提示,这比上面讨论的单个默认提示又提升了 3.5% 的性能。综合来看,提示工程和集成使 ImageNet 准确率提高了近 5%。在图 4 中,我们可视化了与直接嵌入类名的无上下文基线方法(如 Li 等人(2017)所做)相比,提示工程和集成如何改变一组 CLIP 模型的性能。
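Prompt ensembling in embedding space, as described above, amounts to averaging the normalized text embeddings of several templates per class and caching the result. Below is a hedged sketch; the templates combine the default prompt with two of the examples mentioned in the text, and the text encoder/tokenizer are random stand-ins so the snippet is self-contained.

```python
# A minimal sketch of prompt ensembling: average normalized text embeddings
# over several context prompts per class, then cache the per-class vectors.
import torch
import torch.nn.functional as F

encode_text = torch.nn.Linear(76, 512)      # stand-in text encoder
tokenize = lambda s: torch.randn(76)        # stand-in tokenizer

templates = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]
class_names = ["dog", "cat"]

class_embeddings = []
with torch.no_grad():
    for name in class_names:
        embeds = encode_text(torch.stack([tokenize(t.format(name)) for t in templates]))
        mean_embed = F.normalize(embeds, dim=-1).mean(dim=0)   # ensemble in embedding space
        class_embeddings.append(F.normalize(mean_embed, dim=0))
classifier = torch.stack(class_embeddings)   # [num_classes, dim], computed once and cached
print(classifier.shape)                       # torch.Size([2, 512])
```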

Analysis of Zero-Shot CLIP Performance Zero-Shot CLIP 性能分析

Figure 5: Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.

Since task-agnostic zero-shot classifiers for computer vision have been understudied, CLIP provides a promising opportunity to gain a better understanding of this type of model. In this section, we conduct a study of various properties of CLIP's zero-shot classifiers. As a first question, we look simply at how well zero-shot classifiers perform. To contextualize this, we compare to the performance of a simple off-the-shelf baseline: fitting a fully supervised, regularized, logistic regression classifier on the features of the canonical ResNet-50. In Figure 5 we show this comparison across 27 datasets. Please see Appendix A for details of datasets and setup.

Zero-shot CLIP outperforms this baseline slightly more often than not and wins on 16 of the 27 datasets. Looking at individual datasets reveals some interesting behavior. On fine-grained classification tasks, we observe a wide spread in performance. On two of these datasets, Stanford Cars and Food101, zero-shot CLIP outperforms logistic regression on ResNet-50 features by over 20% while on two others, Flowers102 and FGVCAircraft, zero-shot CLIP underperforms by over 10%. On OxfordPets and Birdsnap, performance is much closer. We suspect these difference are primarily due to varying amounts of per-task supervision between WIT and ImageNet. On “general” object classification datasets such as ImageNet, CIFAR10/100, STL10, and PascalVOC2007 performance is relatively similar with a slight advantage for zero-shot CLIP in all cases. On STL10, CLIP achieves 99.3% overall which appears to be a new state of the art despite not using any training examples. Zero-shot CLIP significantly outperforms a ResNet-50 on two datasets measuring action recognition in videos. On Kinetics700, CLIP outperforms a ResNet-50 by 14.5%. Zero-shot CLIP also outperforms a ResNet-50’s features by 7.7% on UCF101. We speculate this is due to natural language providing wider supervision for visual concepts involving verbs, compared to the noun-centric object supervision in ImageNet.

Looking at where zero-shot CLIP notably underperforms, we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), self-driving related tasks such as German traffic sign recognition (GTSRB), recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks. By contrast, non-expert humans can robustly perform several of these tasks, such as counting, satellite image classification, and traffic sign recognition, suggesting significant room for improvement. However, we caution that it is unclear whether measuring zero-shot transfer, as opposed to few-shot transfer, is a meaningful evaluation for difficult tasks that a learner has no prior experience with, such as lymph node tumor classification for almost all humans (and possibly CLIP).
由于针对计算机视觉的任务无关零样本分类器一直缺乏充分研究,CLIP 提供了一个更好地理解这类模型的良好机会。在本节中,我们研究 CLIP 零样本分类器的各种性质。作为第一个问题,我们先简单地看看零样本分类器的表现如何。为了给这一结果提供参照,我们与一个简单的现成基线进行比较:在经典 ResNet-50 的特征上拟合一个完全监督的、带正则化的逻辑回归分类器。在图 5 中,我们展示了在 27 个数据集上的这一比较。数据集和实验设置的细节请参见附录 A。

零样本 CLIP 在略多于一半的情况下优于该基线,在 27 个数据集中的 16 个上胜出。查看单个数据集可以发现一些有趣的现象。在细粒度分类任务上,我们观察到性能差异很大。在其中两个数据集(Stanford Cars 和 Food101)上,零样本 CLIP 比基于 ResNet-50 特征的逻辑回归高出 20% 以上;而在另外两个数据集(Flowers102 和 FGVCAircraft)上,零样本 CLIP 则落后 10% 以上。在 OxfordPets 和 Birdsnap 上,两者的性能则接近得多。我们怀疑这些差异主要源于 WIT 和 ImageNet 在各个任务上提供的监督量不同。在"通用"对象分类数据集(例如 ImageNet、CIFAR10/100、STL10 和 PascalVOC2007)上,性能相对接近,零样本 CLIP 在所有情况下都略占优势。在 STL10 上,CLIP 在不使用任何训练样本的情况下总体达到 99.3%,这似乎是一个新的最先进水平。在衡量视频动作识别的两个数据集上,零样本 CLIP 显著优于 ResNet-50。在 Kinetics700 上,CLIP 比 ResNet-50 高出 14.5%;在 UCF101 上,零样本 CLIP 也比 ResNet-50 的特征高出 7.7%。我们推测这是因为与 ImageNet 中以名词为中心的对象监督相比,自然语言为涉及动词的视觉概念提供了更广泛的监督。

再看零样本 CLIP 明显表现不佳的地方,我们发现零样本 CLIP 在一些专门的、复杂的或抽象的任务上相当薄弱,例如卫星图像分类(EuroSAT 和 RESISC45)、淋巴结肿瘤检测(PatchCamelyon)、合成场景中的物体计数(CLEVRCounts),以及与自动驾驶相关的任务,例如德国交通标志识别(GTSRB)和识别到最近车辆的距离(KITTI Distance)。这些结果凸显了零样本 CLIP 在更复杂任务上的能力欠佳。相比之下,非专业人类也能稳健地完成其中的一些任务,例如计数、卫星图像分类和交通标志识别,这表明仍有很大的改进空间。但是,我们要提醒的是,对于学习者此前毫无经验的困难任务(例如对几乎所有人类、可能也包括 CLIP 而言的淋巴结肿瘤分类),衡量零样本迁移(而非少样本迁移)是否是一种有意义的评估,目前尚不清楚。

Figure 6: Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis.

While comparing zero-shot performance to fully supervised models contextualizes the task-learning capabilities of CLIP, comparing to few-shot methods is a more direct comparison, since zero-shot is its limit. In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression on the features of many image models including the best publicly available ImageNet models, self-supervised learning methods, and CLIP itself. While it is intuitive to expect zero-shot to underperform one-shot, we instead find that zero-shot CLIP matches the performance of 4-shot logistic regression on the same feature space. This is likely due to an important difference between the zero-shot and few-shot approach. First, CLIP's zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified ("communicated"). By contrast, "normal" supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee.
图 6:零样本 CLIP 优于少样本线性探针。零样本 CLIP 达到了在相同特征空间上训练的 4 样本线性分类器的平均性能,并几乎追平了公开可用模型中 16 样本线性分类器的最佳结果。对 BiT-M 和 SimCLRv2,图中突出显示了表现最好的模型。浅灰色线条是评估套件中的其他模型。该分析使用了每个类别至少有 16 个样本的 20 个数据集。

将零样本性能与完全监督模型进行比较,可以为 CLIP 的任务学习能力提供参照,而与少样本方法进行比较则更为直接,因为零样本是少样本的极限。在图 6 中,我们比较了零样本 CLIP 与在许多图像模型(包括最好的公开可用 ImageNet 模型、自监督学习方法以及 CLIP 本身)的特征上进行少样本逻辑回归的表现。直观上人们会预期零样本不如单样本(one-shot),但我们反而发现零样本 CLIP 达到了在相同特征空间上 4 样本逻辑回归的性能。这很可能源于零样本与少样本方法之间的一个重要差异。首先,CLIP 的零样本分类器是通过自然语言生成的,这允许直接指定("传达")视觉概念。相比之下,"常规"监督学习必须从训练样本中间接推断概念。无上下文的基于样本的学习有一个缺点:许多不同的假设都可能与数据一致,在单样本的情况下尤其如此。一张图像通常包含许多不同的视觉概念。尽管有能力的学习者可以利用视觉线索和启发式方法(例如假设所演示的概念是图像中的主要对象),但这并没有保证。

A potential resolution of this discrepancy between zero-shot and few-shot performance is to use CLIP’s zero-shot classifier as a prior for the weights of the few-shot classifier. While adding an L2 penalty towards the generated weights is a straightforward implementation of this idea, we found that hyperparameter optimization would often select for such a large value of this regularizer that the resulting few-shot classifier was “just” the zero-shot classifier. Research into better methods of combining the strength of zero-shot transfer with flexibility of few-shot learning is a promising direction for future work.
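As a concrete illustration of the idea above, here is a minimal sketch (not the paper's code) of fitting a few-shot linear classifier whose weights are pulled toward the zero-shot classifier by an L2 penalty. The function name, optimizer settings, and toy data are our own assumptions.

```python
import torch
import torch.nn.functional as F

def fit_fewshot_with_zeroshot_prior(feats, labels, w_zeroshot, lam=1.0,
                                    lr=1e-2, steps=500):
    """Fit a linear classifier on frozen image features, penalizing the
    squared distance between its weights and the zero-shot (text-derived)
    classifier weights. A large `lam` collapses back to the zero-shot classifier.

    feats:       (n, d) L2-normalized image features
    labels:      (n,) integer class labels
    w_zeroshot:  (num_classes, d) zero-shot classifier weights
    """
    w = w_zeroshot.clone().requires_grad_(True)   # initialize at the prior
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = feats @ w.t()
        loss = F.cross_entropy(logits, labels)
        loss = loss + lam * (w - w_zeroshot).pow(2).sum()
        loss.backward()
        opt.step()
    return w.detach()

# toy usage with random data (16-shot, 10 classes, 512-dim features)
torch.manual_seed(0)
feats = F.normalize(torch.randn(160, 512), dim=-1)
labels = torch.arange(10).repeat_interleave(16)
w_zs = F.normalize(torch.randn(10, 512), dim=-1)
w_fs = fit_fewshot_with_zeroshot_prior(feats, labels, w_zs, lam=10.0)
```

As noted above, hyperparameter search in practice tended to favor such a large penalty that the result was effectively the zero-shot classifier again.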

When comparing zero-shot CLIP to few-shot logistic regression on the features of other models, zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in our evaluation suite, which uses the features of a BiT-M ResNet-152x2 trained on ImageNet-21K. We suspect a BiT-L model trained on JFT-300M would perform even better but these models have not been publicly released. That a BiT-M ResNet-152x2 performs best in a 16-shot setting is somewhat surprising since, as analyzed in Section 3.2, the Noisy Student EfficientNet-L2 outperforms it in a fully supervised setting by almost 5% on average across 27 datasets.


Figure 7: The data efficiency of zero-shot transfer varies widely. Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class.

In addition to studying the average performance of zero-shot CLIP and few-shot logistic regression, we also examine performance on individual datasets. In Figure 7, we show estimates for the number of labeled examples per class that a logistic regression classifier on the same feature space requires to match the performance of zero-shot CLIP. Since zero-shot CLIP is also a linear classifier, this estimates the effective data efficiency of zero-shot transfer in this setting. In order to avoid training thousands of linear classifiers, we estimate the effective data efficiency based on a log-linear interpolation of the performance of a 1, 2, 4, 8, 16-shot (when possible), and a fully supervised linear classifier trained on each dataset. We find that zero-shot transfer can have widely varying efficiency per dataset, from less than 1 labeled example per class to 184. Two datasets, Flowers102 and EuroSAT, underperform one-shot models. Half of the datasets require less than 5 examples per class, with a median of 5.4. However, the mean estimated data efficiency is 20.8 examples per class. This is due to the 20% of datasets where supervised classifiers require many labeled examples per class in order to match performance. On ImageNet, zero-shot CLIP matches the performance of a 16-shot linear classifier trained on the same feature space.
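The log-linear interpolation behind these estimates can be sketched as follows; the helper name and the example accuracies are illustrative, not the paper's numbers.

```python
import numpy as np

def examples_needed_to_match_zero_shot(zero_shot_acc, shots, shot_accs,
                                       full_n, full_acc):
    """Log-linear interpolation of a few-shot accuracy curve (1, 2, 4, 8, 16
    shots plus the fully supervised point) to estimate how many labeled
    examples per class a linear probe needs to match zero-shot accuracy.
    Assumes accuracy increases monotonically with the number of shots."""
    ns = np.array(list(shots) + [full_n], dtype=float)
    accs = np.array(list(shot_accs) + [full_acc], dtype=float)
    log_ns = np.log2(ns)
    # invert accuracy(log n) at the zero-shot accuracy; values outside the
    # measured range are clipped to the first / last point
    matched_log_n = np.interp(zero_shot_acc, accs, log_ns)
    return float(2.0 ** matched_log_n)

# hypothetical numbers for a single dataset
print(examples_needed_to_match_zero_shot(
    zero_shot_acc=72.3,
    shots=[1, 2, 4, 8, 16],
    shot_accs=[55.1, 61.0, 66.2, 70.4, 73.8],
    full_n=1000, full_acc=80.5))
```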

Figure 8: Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation, with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance (≤3 point difference).

If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve. In Figure 8 we compare CLIP’s zero-shot performance with fully supervised linear classifiers across datasets. The dashed y=x line represents an “optimal” zero-shot classifier that matches the performance of its fully supervised equivalent. For most datasets, zero-shot classifiers still underperform fully supervised classifiers by 10% to 25%, suggesting that there is still plenty of headroom for improving CLIP’s task-learning and zero-shot transfer capabilities.

There is a positive correlation of 0.82 (p-value < 10⁻⁶) between zero-shot performance and fully supervised performance, suggesting that CLIP is relatively consistent at connecting underlying representation and task learning to zero-shot transfer. However, zero-shot CLIP only approaches fully supervised performance on 5 datasets: STL10, CIFAR10, Food101, OxfordPets, and Caltech101. On all 5 datasets, both zero-shot accuracy and fully supervised accuracy are over 90%. This suggests that CLIP may be more effective at zero-shot transfer for tasks where its underlying representations are also high quality. The slope of a linear regression model predicting zero-shot performance as a function of fully supervised performance estimates that for every 1% improvement in fully supervised performance, zero-shot performance improves by 1.28%. However, the 95th-percentile confidence intervals still include values of less than 1 (0.93-1.79).
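For readers who want to reproduce this kind of analysis on their own results, a minimal sketch of the regression and correlation computation is below; the per-dataset accuracies are made up, not the paper's.

```python
import numpy as np
from scipy import stats

# hypothetical per-dataset accuracies (percentages); not the paper's numbers
supervised = np.array([95.0, 90.5, 84.2, 75.3, 66.8, 92.1, 80.0])
zero_shot  = np.array([93.2, 78.4, 70.1, 55.0, 41.2, 88.3, 62.7])

# slope of zero-shot accuracy regressed on fully supervised accuracy, plus the
# Pearson correlation and its p-value
result = stats.linregress(supervised, zero_shot)
print(f"slope={result.slope:.2f}  r={result.rvalue:.2f}  p={result.pvalue:.1e}")
```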

Figure 9: Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.


Figure 10: Linear probe performance of CLIP models in comparison with state-of-the-art computer vision models, including EfficientNet (Tan & Le, 2019; Xie et al., 2020), MoCo (Chen et al., 2020d), Instagram-pretrained ResNeXt models (Mahajan et al., 2018; Touvron et al., 2019), BiT (Kolesnikov et al., 2019), ViT (Dosovitskiy et al., 2020), SimCLRv2 (Chen et al., 2020c), BYOL (Grill et al., 2020), and the original ResNet models (He et al., 2016b). (Left) Scores are averaged over 12 datasets studied by Kornblith et al. (2019). (Right) Scores are averaged over 27 datasets that contain a wider variety of distributions. Dotted lines indicate models fine-tuned or evaluated on images at a higher resolution than pre-training. See Table 10 for individual scores and Figure 20 for plots for each dataset.

Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size (Hestness et al., 2017; Kaplan et al., 2020). The GPT family of models has so far demonstrated consistent improvements in zero-shot performance across a 1000x increase in training compute. In Figure 9, we check whether the zero-shot performance of CLIP follows a similar scaling pattern. We plot the average error rate of the 5 ResNet CLIP models across 39 evaluations on 36 different datasets and find that a similar log-log linear scaling trend holds for CLIP across a 44x increase in model compute. While the overall trend is smooth, we found that performance on individual evaluations can be much noisier. We are unsure whether this is caused by high variance between individual training runs on sub-tasks (as documented in D’Amour et al. (2020)) masking a steadily improving trend or whether performance is actually non-monotonic as a function of compute on some tasks.
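A log-log linear scaling fit of the kind shown in Figure 9 can be computed with a few lines of numpy; the compute and error values below are placeholders, not the paper's measurements.

```python
import numpy as np

# hypothetical (relative compute, average zero-shot error %) for five models
# spanning a 44x compute range; not the paper's numbers
compute = np.array([1.0, 4.0, 11.0, 22.0, 44.0])
error   = np.array([38.0, 33.5, 30.2, 28.1, 26.0])

# log-log linear trend: log(error) = a * log(compute) + b
a, b = np.polyfit(np.log(compute), np.log(error), deg=1)
predicted_error = np.exp(a * np.log(compute) + b)
print(f"exponent a={a:.3f}", predicted_error.round(1))
```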

3.2 REPRESENTATION LEARNING

While we have extensively analyzed the task-learning capabilities of CLIP through zero-shot transfer in the previous section, it is more common to study the representation learning capabilities of a model. There exist many ways to evaluate the quality of representations as well as disagreements over what properties an “ideal” representation should have (Locatello et al., 2020). Fitting a linear classifier on a representation extracted from the model and measuring its performance on various datasets is a common approach. An alternative is measuring the performance of end-to-end fine-tuning of the model. This increases flexibility, and prior work has convincingly demonstrated that fine-tuning outperforms linear classification on most image classification datasets (Kornblith et al., 2019; Zhai et al., 2019). While the high performance of fine-tuning motivates its study for practical reasons, we still opt for linear classifier based evaluation for several reasons. Our work is focused on developing a high-performing task and dataset-agnostic pre-training approach. Fine-tuning, because it adapts representations to each dataset during the fine-tuning phase, can compensate for and potentially mask failures to learn general and robust representations during the pre-training phase. Linear classifiers, because of their limited flexibility, instead highlight these failures and provide clear feedback during development. For CLIP, training supervised linear classifiers has the added benefit of being very similar to the approach used for its zero-shot classifiers which enables extensive comparisons and analysis in Section 3.1. Finally, we aim to compare CLIP to a comprehensive set of existing models across many tasks. Studying 66 different models on 27 different datasets requires tuning 1782 different evaluations. Fine-tuning opens up a much larger design and hyper-parameter space, which makes it difficult to fairly evaluate and computationally expensive to compare a diverse set of techniques as discussed in other large scale empirical studies (Lucic et al., 2018; Choi et al., 2019). By comparison, linear classifiers require minimal hyper-parameter tuning and have standardized implementations and evaluation procedures. Please see Appendix A for further details on evaluation.
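Below is a minimal sketch of this linear-probe protocol, assuming features have already been extracted from a frozen model; the regularization value and the use of scikit-learn here are our assumptions (see Appendix A of the paper for the actual evaluation setup).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels,
                          C=3.16):
    """Fit an L2-regularized multinomial logistic regression on frozen image
    features and report test accuracy. The inverse regularization strength C
    would normally be chosen by a sweep on a validation split."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# toy usage with random features (2 classes, 64-dim)
rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(200, 64)), rng.normal(size=(100, 64))
ytr, yte = rng.integers(0, 2, 200), rng.integers(0, 2, 100)
print(linear_probe_accuracy(Xtr, ytr, Xte, yte))
```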

Figure 10 summarizes our findings. To minimize selection effects that could raise concerns of confirmation or reporting bias, we first study performance on the 12 dataset evaluation suite from Kornblith et al. (2019). While small CLIP models such as a ResNet-50 and ResNet-101 outperform other ResNets trained on ImageNet-1K (BiT-S and the originals), they underperform ResNets trained on ImageNet-21K (BiT-M). These small CLIP models also underperform models in the EfficientNet family with similar compute requirements. However, models trained with CLIP scale very well and the largest model we trained (ResNet-50x64) slightly outperforms the best performing existing model (a Noisy Student EfficientNet-L2) on both overall score and compute efficiency. We also find that CLIP vision transformers are about 3x more compute efficient than CLIP ResNets, which allows us to reach higher overall performance within our compute budget. These results qualitatively replicate the findings of Dosovitskiy et al. (2020) which reported that vision transformers are more compute efficient than convnets when trained on sufficiently large datasets. Our best overall model is a ViT-L/14 that is fine-tuned at a higher resolution of 336 pixels on our dataset for 1 additional epoch. This model outperforms the best existing model across this evaluation suite by an average of 2.6%.

As Figure 21 qualitatively shows, CLIP models learn a wider set of tasks than has previously been demonstrated in a single computer vision model trained end-to-end from random initialization. These tasks include geo-localization, optical character recognition, facial emotion recognition, and action recognition. None of these tasks are measured in the evaluation suite of Kornblith et al. (2019). This could be argued to be a form of selection bias in Kornblith et al. (2019)’s study towards tasks that overlap with ImageNet. To address this, we also measure performance on a broader 27 dataset evaluation suite. This evaluation suite, detailed in Appendix A includes datasets representing the aforementioned tasks, German Traffic Signs Recognition Benchmark (Stallkamp et al., 2011), as well as several other datasets adapted from VTAB (Zhai et al., 2019).


Figure 11: CLIP’s features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP’s features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets.


Figure 12: CLIP’s features are more robust to task shift when compared to models pre-trained on ImageNet. For both dataset splits, the transfer scores of linear probes trained on the representations of CLIP models are higher than other models with similar ImageNet performance. This suggests that the representations of models trained on ImageNet are somewhat overfit to their task.

On this broader evaluation suite, the benefits of CLIP are more clear. All CLIP models, regardless of scale, outperform all evaluated systems in terms of compute efficiency. The improvement in average score of the best model over previous systems increases from 2.6% to 5%. We also find that self-supervised systems do noticeably better on our broader evaluation suite. For instance, while SimCLRv2 still underperforms BiT-M on average on the 12 datasets of Kornblith et al. (2019), SimCLRv2 outperforms BiT-M on our 27 dataset evaluation suite. These findings suggest continuing to expand task diversity and coverage in order to better understand the “general” performance of systems. We suspect additional evaluation efforts along the lines of VTAB to be valuable.

In addition to the aggregate analysis above, we visualize per-dataset differences in the performance of the best CLIP model and the best model in our evaluation suite across all 27 datasets in Figure 11. CLIP outperforms the Noisy Student EfficientNet-L2 on 21 of the 27 datasets. CLIP improves the most on tasks which require OCR (SST2 and HatefulMemes), geo-localization and scene recognition (Country211, SUN397), and activity recognition in videos (Kinetics700 and UCF101). In addition, CLIP also does much better on fine-grained car and traffic sign recognition (Stanford Cars and GTSRB). This may reflect a problem with overly narrow supervision in ImageNet. A result such as the 14.7% improvement on GTSRB could be indicative of an issue with ImageNet-1K, which has only a single label for all traffic and street signs. This could encourage a supervised representation to collapse intra-class details and hurt accuracy on a fine-grained downstream task. As mentioned, CLIP still underperforms the EfficientNet on several datasets. Unsurprisingly, the dataset that the EfficientNet does best relative to CLIP on is the one it was trained on: ImageNet. The EfficientNet also slightly outperforms CLIP on low-resolution datasets such as CIFAR10 and CIFAR100. We suspect this is at least partly due to the lack of scale-based data augmentation in CLIP. The EfficientNet also does slightly better on PatchCamelyon and CLEVRCounts, datasets where overall performance is still low for both approaches.

3.3 ROBUSTNESS TO NATURAL DISTRIBUTION SHIFT


Figure 13: Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. (Left) An ideal robust model (dashed line) performs equally well on the ImageNet distribution and on other natural image distributions. Zero-shot CLIP models shrink this “robustness gap” by up to 75%. Linear fits on logit transformed values are shown with bootstrap estimated 95% confidence intervals. (Right) Visualizing distribution shift for bananas, a class shared across 5 of the 7 natural distribution shift datasets. The performance of the best zero-shot CLIP model, ViT-L/14@336px, is compared with a model that has the same performance on the ImageNet validation set, ResNet-101.

In 2015, it was announced that a deep learning model exceeded human performance on the ImageNet test set (He et al., 2015). However, research in the subsequent years has repeatedly found that these models still make many simple mistakes (Dodge & Karam, 2017; Geirhos et al., 2018; Alcorn et al., 2019), and new benchmarks testing these systems have often found their performance to be much lower than both their ImageNet accuracy and human accuracy (Recht et al., 2019; Barbu et al., 2019). What explains this discrepancy? Various ideas have been suggested and studied (Ilyas et al., 2019; Geirhos et al., 2020). A common theme of proposed explanations is that deep learning models are exceedingly adept at finding correlations and patterns which hold across their training dataset and thus improve in-distribution performance. However, many of these correlations and patterns are actually spurious: they do not hold for other distributions and result in large drops in performance on other datasets.

We caution that, to date, most of these studies limit their evaluation to models trained on ImageNet. Recalling the topic of discussion, it may be a mistake to generalize too far from these initial findings. To what degree are these failures attributable to deep learning, ImageNet, or some combination of the two? CLIP models, which are trained via natural language supervision on a very large dataset and are capable of high zero-shot performance, are an opportunity to investigate this question from a different angle.

Taori et al. (2020) is a recent comprehensive study moving towards quantifying and understanding these behaviors for ImageNet models. Taori et al. (2020) study how the performance of ImageNet models changes when evaluated on natural distribution shifts. They measure performance on a set of 7 distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet Sketch (Wang et al., 2019), Youtube-BB and ImageNet-Vid (Shankar et al., 2019), ObjectNet (Barbu et al., 2019), ImageNet Adversarial (Hendrycks et al., 2019), and ImageNet Rendition (Hendrycks et al., 2020a). They distinguish these datasets, which all consist of novel images collected from a variety of sources, from synthetic distribution shifts such as ImageNet-C (Hendrycks & Dietterich, 2019), Stylized ImageNet (Geirhos et al., 2018), or adversarial attacks (Goodfellow et al., 2014) which are created by perturbing existing images in various ways. They propose this distinction in part because they find that while several techniques have been demonstrated to improve performance on synthetic distribution shifts, they often fail to yield consistent improvements on natural distributions. (We refer readers to Hendrycks et al. (2020a) for additional experiments and discussion on this claim.)

Figure 14: While supervised adaptation to ImageNet increases ImageNet accuracy by 9.2%, it slightly reduces average robustness. (Left) Customizing zero-shot CLIP to each dataset improves robustness compared to using a single static zero-shot ImageNet classifier and pooling predictions across similar classes as in Taori et al. (2020). CLIP models adapted to ImageNet have similar effective robustness as the best prior ImageNet models. (Right) Details of per dataset changes in accuracy for the two robustness interventions. Adapting to ImageNet increases accuracy on ImageNetV2 noticeably but trades off accuracy on several other distributions. Dataset specific zero-shot classifiers can improve accuracy by a large amount but are limited to only a few datasets that include classes which don’t perfectly align with ImageNet categories.

Across these collected datasets, the accuracy of ImageNet models drops well below the expectation set by the ImageNet validation set. For the following summary discussion we report average accuracy across all 7 natural distribution shift datasets and average accuracy across the corresponding class subsets of ImageNet unless otherwise specified. Additionally, for Youtube-BB and ImageNet-Vid, which have two different evaluation settings, we use the average of pm-0 and pm-10 accuracy.

A ResNet-101 makes 5 times as many mistakes when evaluated on these natural distribution shifts compared to the ImageNet validation set. Encouragingly however, Taori et al. (2020) find that accuracy under distribution shift increases predictably with ImageNet accuracy and is well modeled as a linear function of logit-transformed accuracy. Taori et al. (2020) use this finding to propose that robustness analysis should distinguish between effective and relative robustness. Effective robustness measures improvements in accuracy under distribution shift above what is predicted by the documented relationship between in-distribution and out-of-distribution accuracy. Relative robustness captures any improvement in out-of-distribution accuracy. Taori et al. (2020) argue that robustness techniques should aim to improve both effective robustness and relative robustness.
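A sketch of how effective robustness can be computed under these definitions, assuming a logit-logit linear baseline fit as in Taori et al. (2020); the baseline accuracies below are illustrative placeholders.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def effective_robustness(id_acc, ood_acc, baseline_id, baseline_ood):
    """Effective robustness in the sense of Taori et al. (2020): the gap between
    a model's out-of-distribution accuracy and the OOD accuracy predicted from
    its in-distribution accuracy by a logit-logit linear fit to a set of
    baseline (ImageNet-trained) models. Accuracies are fractions in (0, 1)."""
    a, b = np.polyfit(logit(np.asarray(baseline_id)),
                      logit(np.asarray(baseline_ood)), deg=1)
    predicted_ood = expit(a * logit(id_acc) + b)
    return ood_acc - predicted_ood

# hypothetical baseline models and one candidate model (not the paper's data)
baseline_id  = [0.70, 0.74, 0.76, 0.78, 0.80, 0.83]
baseline_ood = [0.42, 0.47, 0.50, 0.53, 0.56, 0.60]
print(effective_robustness(id_acc=0.76, ood_acc=0.58,
                           baseline_id=baseline_id, baseline_ood=baseline_ood))
```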

Almost all models studied in Taori et al. (2020) are trained or fine-tuned on the ImageNet dataset. Returning to the discussion in the introduction to this section: is training or adapting to the ImageNet dataset distribution the cause of the observed robustness gap? Intuitively, a zero-shot model should not be able to exploit spurious correlations or patterns that hold only on a specific distribution, since it is not trained on that distribution. (We caution that a zero-shot model can still exploit spurious correlations that are shared between the pre-training and evaluation distributions.) Thus it is reasonable to expect zero-shot models to have much higher effective robustness. In Figure 13, we compare the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. All zero-shot CLIP models improve effective robustness by a large amount and reduce the size of the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%.

While these results show that zero-shot models can be much more robust, they do not necessarily mean that supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or use of natural language supervision, could also result in much more robust models regardless of whether they are zero-shot or fine-tuned. As an initial experiment to potentially begin narrowing this down, we also measure how the performance of CLIP models changes after adapting to the ImageNet distribution via an L2-regularized logistic regression classifier fit to CLIP features on the ImageNet training set. We visualize how performance changes from the zero-shot classifier in Figure 14. Although adapting CLIP to the ImageNet distribution increases its ImageNet accuracy by 9.2% to 85.4% overall, and ties the accuracy of the 2018 SOTA from Mahajan et al. (2018), average accuracy under distribution shift slightly decreases.

It is surprising to see a 9.2% increase in accuracy, which corresponds to roughly 3 years of improvement in SOTA, fail to translate into any improvement in average performance under distribution shift. We also break down the differences between zero-shot accuracy and linear classifier accuracy per dataset in Figure 14 and find performance still increases significantly on one dataset, ImageNetV2. ImageNetV2 closely followed the creation process of the original ImageNet dataset which suggests that gains in accuracy from supervised adaptation are closely concentrated around the ImageNet distribution. Performance decreases by 4.7% on ImageNet-R, 3.8% on ObjectNet, 2.8% on ImageNet Sketch, and 1.9% on ImageNet-A. The change in accuracy on the two other datasets, Youtube-BB and ImageNet Vid, is insignificant.

How is it possible to improve accuracy by 9.2% on the ImageNet dataset with little to no increase in accuracy under distribution shift? Is the gain primarily from “exploiting spurious correlations”? Is this behavior unique to some combination of CLIP, the ImageNet datatset, and the distribution shifts studied, or a more general phenomena? Does it hold for end-to-end finetuning as well as linear classifiers? We do not have confident answers to these questions at this time. Prior work has also pre-trained models on distributions other than ImageNet, but it is common to study and release models only after they have been fine-tuned to ImageNet. As a step towards understanding whether pre-trained zero-shot models consistently have higher effective robustness than fine-tuned models, we encourage the authors of Mahajan et al. (2018), Kolesnikov et al. (2019), and Dosovitskiy et al. (2020) to, if possible, study these questions on their models as well.

We also investigate another robustness intervention enabled by flexible zero-shot natural-language-based image classifiers. The target classes across the 7 transfer datasets are not always perfectly aligned with those of ImageNet. Two datasets, Youtube-BB and ImageNet-Vid, consist of super-classes of ImageNet. This presents a problem when trying to use the fixed 1000-way classifier of an ImageNet model to make predictions. Taori et al. (2020) handle this by max-pooling predictions across all sub-classes according to the ImageNet class hierarchy. Sometimes this mapping is much less than perfect. For the person class in Youtube-BB, predictions are made by pooling over the ImageNet classes for a baseball player, a bridegroom, and a scuba diver. With CLIP we can instead generate a custom zero-shot classifier for each dataset directly based on its class names. In Figure 14 we see that this improves average effective robustness by 5% but is concentrated in large improvements on only a few datasets. Curiously, accuracy on ObjectNet also increases by 2.3%. Although the dataset was designed to closely overlap with ImageNet classes, using the names provided for each class by ObjectNet’s creators still helps a small amount compared to using ImageNet class names and pooling predictions when necessary.
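A minimal sketch of building such a dataset-specific zero-shot classifier directly from class names with the released clip package; the prompt template and the class list are illustrative, not the exact prompts used in the paper.

```python
import torch
import clip

def build_zero_shot_classifier(model, class_names, template="a photo of a {}.",
                               device="cpu"):
    """Build a zero-shot classifier from a dataset's own class names by
    embedding one prompt per class with CLIP's text encoder."""
    prompts = [template.format(name) for name in class_names]
    with torch.no_grad():
        text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # logits for an image are: image_features @ text_features.T
    return text_features

# sketch of usage: Youtube-BB's own "person" class instead of pooling over
# ImageNet's baseball player / bridegroom / scuba diver sub-classes
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
classifier = build_zero_shot_classifier(
    model, ["person", "car", "dog", "bird"], device=device)
```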

Figure 15: Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust.

While zero-shot CLIP improves effective robustness, Figure 14 shows that the benefit is almost entirely gone in a fully supervised setting. To better understand this difference, we investigate how effective robustness changes on the continuum from zero-shot to fully supervised. In Figure 15 we visualize the performance of 0-shot, 1-shot, 2-shot, 4-shot …, 128-shot, and fully supervised logistic regression classifiers on the best CLIP model’s features. We see that while few-shot models also show higher effective robustness than existing models, this benefit fades as in-distribution performance increases with more training data and is mostly, though not entirely, gone for the fully supervised model. Additionally, zero-shot CLIP is notably more robust than a few-shot model with equivalent ImageNet performance. Across our experiments, high effective robustness seems to result from minimizing the amount of distribution specific training data a model has access to, but this comes at a cost of reducing dataset-specific performance.

Taken together, these results suggest that the recent shift towards large-scale task and dataset agnostic pre-training combined with a reorientation towards zero-shot and few-shot benchmarking on broad evaluation suites (as advocated by Yogatama et al. (2019) and Linzen (2020)) promotes the development of more robust systems and provides a more accurate assessment of performance. We are curious to see if the same results hold for zero-shot models in the field of NLP such as the GPT family. While Hendrycks et al. (2020b) has reported that pre-training improves relative robustness on sentiment analysis, Miller et al. (2020)’s study of the robustness of question answering models under natural distribution shift finds, similar to Taori et al. (2020), little evidence of effective robustness improvements to date.

4 COMPARISON TO HUMAN PERFORMANCE

How does CLIP compare to human performance and human learning? To get a better understanding of how well humans perform in similar evaluation settings to CLIP, we evaluated humans on one of our tasks. We wanted to get a sense of how strong human zero-shot performance is at these tasks, and how much human performance is improved if they are shown one or two image samples. This can help us to compare task difficulty for humans and CLIP, and identify correlations and differences between them.

We had five different humans look at each of 3669 images in the test split of the Oxford IIT Pets dataset (Parkhi et al., 2012) and select which of the 37 cat or dog breeds best matched the image (or ‘I don’t know’ if they were completely uncertain). In the zero-shot case the humans were given no examples of the breeds and asked to label them to the best of their ability without an internet search. In the one-shot experiment the humans were given one sample image of each breed and in the two-shot experiment they were given two sample images of each breed. (There is not a perfect correspondence between the human few-shot tasks and the model’s few-shot performance since the model cannot refer to sample images in the way that the humans can.)

| | Accuracy | Majority Vote on Full Dataset | Accuracy on Guesses | Majority Vote Accuracy on Guesses |
| --- | --- | --- | --- | --- |
| Zero-shot human | 53.7 | 57.0 | 69.7 | 63.9 |
| Zero-shot CLIP | 93.5 | 93.5 | 93.5 | 93.5 |
| One-shot human | 75.7 | 80.3 | 78.5 | 81.2 |
| Two-shot human | 75.7 | 85.0 | 79.2 | 86.1 |

Table 2: Comparison of human performance on Oxford IIT Pets. As in Parkhi et al. (2012), the metric is average per-class classification accuracy. Most of the gain in performance when going from the human zero-shot case to the human one-shot case is on images that participants were highly uncertain on. “Guesses” refers to restricting the dataset to where participants selected an answer other than “I don’t know”; the “majority vote” is taking the most frequent (exclusive of ties) answer per image.

One possible concern was that the human workers were not sufficiently motivated in the zero-shot task. High human accuracy of 94% on the STL-10 dataset (Coates et al., 2011) and 97-100% accuracy on the subset of attention check images increased our trust in the human workers.
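To make the table's columns concrete, here is a small sketch of the metrics under our reading of their definitions (average per-class accuracy, majority vote, and restriction to guesses). The encoding of "I don't know" as -1 and the toy annotations are our own assumptions.

```python
import numpy as np

def per_class_accuracy(preds, labels, num_classes, unknown=-1):
    """Average per-class classification accuracy, the metric used in Table 2.
    Predictions equal to `unknown` ("I don't know") count as incorrect here;
    restricting to rows where preds != unknown gives "accuracy on guesses"."""
    accs = [np.mean(preds[labels == c] == c) for c in range(num_classes)]
    return float(np.mean(accs))

def majority_vote(pred_matrix, unknown=-1):
    """Most frequent non-"unknown" answer per image across annotators
    (ties broken by the lowest class index)."""
    votes = []
    for row in pred_matrix:
        row = row[row != unknown]
        if len(row) == 0:
            votes.append(unknown)
        else:
            vals, counts = np.unique(row, return_counts=True)
            votes.append(int(vals[np.argmax(counts)]))
    return np.array(votes)

# toy usage: 4 images, 5 annotators each
labels = np.array([0, 1, 2, 3])
pred_matrix = np.array([[0, 0, -1, 0, 5],
                        [1, 2, 1, 1, -1],
                        [2, 2, 2, 2, 2],
                        [4, -1, 3, 3, 3]])
print(per_class_accuracy(majority_vote(pred_matrix), labels, num_classes=4))
```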

Interestingly, humans went from a performance average of 54% to 76% with just one training example per class, and the marginal gain from an additional training example is minimal. The gain in accuracy going from zero to one shot is almost entirely on images that humans were uncertain about. This suggests that humans “know what they don’t know” and are able to update their priors on the images they are most uncertain in based on a single example. Given this, it seems that while CLIP is a promising training strategy for zero-shot performance (Figure 5) and does well on tests of natural distribution shift (Figure 13), there is a large difference between how humans learn from a few examples and the few-shot methods in this paper.

This suggests that there are still algorithmic improvements waiting to be made to decrease the gap between machine and human sample efficiency, as noted by Lake et al. (2016) and others. Because these few-shot evaluations of CLIP don’t make effective use of prior knowledge and the humans do, we speculate that finding a method to properly integrate prior knowledge into few-shot learning is an important step in algorithmic improvements to CLIP. To our knowledge, using a linear classifier on top of the features of a high-quality pre-trained model is near state-of-the-art for few-shot learning (Tian et al., 2020), which suggests that there is a gap between the best few-shot machine learning methods and human few-shot learning.

Figure 16: The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label.

If we plot human accuracy vs CLIP’s zero-shot accuracy (Figure 16), we see that the hardest problems for CLIP are also hard for humans. To the extent that errors are consistent, our hypothesis is that this is due to at least two factors: noise in the dataset (including mislabeled images) and out-of-distribution images being hard for both humans and models.
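The difficulty ranking used in Figure 16 can be sketched as follows, assuming access to the model's softmax probabilities; the function name and toy data are illustrative.

```python
import numpy as np

def rank_classes_by_difficulty(probs, labels, class_names):
    """Rank image categories by difficulty for a classifier, measured as the
    mean probability assigned to the correct label (lower = harder).

    probs:  (n, num_classes) softmax probabilities
    labels: (n,) integer ground-truth labels
    """
    correct_prob = probs[np.arange(len(labels)), labels]
    mean_per_class = np.array(
        [correct_prob[labels == c].mean() for c in range(probs.shape[1])])
    order = np.argsort(mean_per_class)  # hardest classes first
    return [(class_names[c], float(mean_per_class[c])) for c in order]

# toy usage with random predictions over 3 classes
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(3), size=30)
y = np.repeat(np.arange(3), 10)
print(rank_classes_by_difficulty(p, y, ["abyssinian", "beagle", "bengal"]))
```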


Figure 17: Few statistically significant improvements in accuracy due to detected data overlap. (Left) While several datasets have up to ±20% apparent differences in zero-shot accuracy on detected overlapping vs. clean examples, only 5 datasets out of 35 total have 99.5% Clopper-Pearson confidence intervals that exclude a 0% accuracy difference. 2 of these datasets do worse on overlapping data. (Right) Since the percentage of detected overlapping examples is almost always in the single digits, the overall test accuracy gain due to overlap is much smaller, with the largest estimated increase being only 0.6% on Birdsnap. Similarly, for only 6 datasets are the accuracy improvements statistically significant when calculated using a one-sided binomial test.

5 DATA OVERLAP ANALYSIS

A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. This is important to investigate since, in a worst-case scenario, a complete copy of an evaluation dataset could leak into the pre-training dataset and invalidate the evaluation as a meaningful test of generalization. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data which a model might be evaluated on ahead of time. This has the downside of limiting the scope of benchmarking and analysis. Adding a new evaluation would require an expensive re-train or risk reporting an un-quantified benefit due to overlap.

Instead, we document how much overlap occurs and how performance changes due to these overlaps. To do this, we use the following procedure:

  1. For each evaluation dataset, we run a duplicate detector (see Appendix C) on its examples. We then manually inspect the found nearest neighbors and set a per dataset threshold to keep high precision while maximizing recall. Using this threshold, we then create two new subsets, Overlap, which contains all examples which have a similarity to a training example above the threshold, and Clean, which contains all examples that are below this threshold. We denote the unaltered full dataset All for reference. From this we first record the degree of data contamination as the ratio of the number of examples in Overlap to the size of All.
  2. We then compute the zero-shot accuracy of CLIP RN50x64 on the three splits and report All - Clean as our main metric. This is the difference in accuracy due to contamination. When positive it is our estimate of how much the overall reported accuracy on the dataset was inflated by over-fitting to overlapping data.
  3. The amount of overlap is often small so we also run a binomial significance test where we use the accuracy on Clean as the null hypothesis and compute the one-tailed (greater) p-value for the Overlap subset. We also calculate 99.5% Clopper-Pearson confidence intervals on the Overlap subset as another check (a code sketch of this per-dataset test follows the list).
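Below is a sketch of the per-dataset statistical test described above, using scipy's binomial test and its exact (Clopper-Pearson) interval; the accuracies and counts are placeholders, not measured values.

```python
from scipy import stats

def overlap_analysis(acc_all, n_all, acc_clean, acc_overlap, n_overlap):
    """Summarize one dataset's overlap analysis: the contamination ratio, the
    All - Clean accuracy difference, a one-tailed binomial test of whether
    accuracy on Overlap exceeds the Clean accuracy, and a 99.5% Clopper-Pearson
    interval for accuracy on the Overlap subset."""
    contamination = n_overlap / n_all
    acc_shift = acc_all - acc_clean                  # main reported metric
    k = int(round(acc_overlap * n_overlap))          # correct on Overlap
    test = stats.binomtest(k, n_overlap, p=acc_clean, alternative="greater")
    ci = stats.binomtest(k, n_overlap).proportion_ci(confidence_level=0.995,
                                                     method="exact")
    return contamination, acc_shift, test.pvalue, (ci.low, ci.high)

# hypothetical numbers for a single dataset
print(overlap_analysis(acc_all=0.812, n_all=10000,
                       acc_clean=0.810, acc_overlap=0.877, n_overlap=300))
```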

A summary of this analysis is presented in Figure 17. Out of 35 datasets studied, 9 datasets have no detected overlap at all. Most of these datasets are synthetic or specialized, making them unlikely to be posted as normal images on the internet (for instance MNIST, CLEVR, and GTSRB), or are guaranteed to have no overlap due to containing novel data from after the date our dataset was created (ObjectNet and Hateful Memes). This demonstrates our detector has a low false-positive rate, which is important as false positives would under-estimate the effect of contamination in our analysis. There is a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1%, with only 7 datasets above this threshold. Of these, only 2 are statistically significant after Bonferroni correction. The max detected improvement is only 0.6% on Birdsnap, which has the second largest overlap at 12.1%. The largest overlap is for Country211 at 21.5%. This is due to it being constructed out of YFCC100M, which our pre-training dataset contains a filtered subset of. Despite this large overlap there is only a 0.2% increase in accuracy on Country211. This may be because the training text accompanying an example is often not related to the specific task a downstream eval measures. Country211 measures geo-localization ability, but inspecting the training text for these duplicates showed they often do not mention the location of the image.

We are aware of two potential concerns with our analysis. First our detector is not perfect. While it achieves near 100% accuracy on its proxy training task and manual inspection + threshold tuning results in very high precision with good recall among the found nearest-neighbors, we can not tractably check its recall across 400 million examples. Another potential confounder of our analysis is that the underlying data distribution may shift between the Overlap and Clean subsets. For example, on Kinetics-700 many “overlaps” are in fact all black transition frames. This explains why Kinetics-700 has an apparent 20% accuracy drop on Overlap. We suspect more subtle distribution shifts likely exist. One possibility we noticed on CIFAR-100 is that, due to the very low resolution of its images, many duplicates were false positives of small objects such as birds or planes. Changes in accuracy could instead be due to changes in the class distribution or difficulty of the duplicates. Unfortunately, these distribution and difficulty shifts could also mask the effects of over-fitting.

However, these results closely follow the findings of similar duplicate analysis in previous work on large scale pre-training. Mahajan et al. (2018) and Kolesnikov et al. (2019) detected similar overlap rates and found minimal changes in overall performance. Importantly, Kolesnikov et al. (2019) also compared the alternative de-duplication strategy discussed in the introduction to this section with the approach we settled on and observed little difference between the two approaches.

6 LIMITATIONS

There are still many limitations to CLIP. While several of these are discussed as part of analysis in various sections, we summarize and collect them here.

On datasets with training splits, the performance of zero-shot CLIP is on average competitive with the simple supervised baseline of a linear classifier on top of ResNet-50 features. On most of these datasets, the performance of this baseline is now well below the overall state of the art. Significant work is still needed to improve the task learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance and suggests a route for continued improvement, we estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall state-of-the-art performance. This is infeasible to train with current hardware. Further research into improving upon the computational and data efficiency of CLIP will be necessary.

Analysis in Section 3.1 found that CLIP’s zero-shot performance is still quite weak on several kinds of tasks. When compared to task-specific models, the performance of CLIP is poor on several types of fine-grained classification such as differentiating models of cars, species of flowers, and variants of aircraft. CLIP also struggles with more abstract and systematic tasks such as counting the number of objects in an image. Finally, for novel tasks which are unlikely to be included in CLIP’s pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP’s performance can be near random. We are confident that there are still many, many tasks where CLIP’s zero-shot performance is near chance level.

While zero-shot CLIP generalizes well to many natural image distributions as investigated in Section 3.3, we’ve observed that zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it. An illustrative example occurs for the task of OCR as reported in Appendix E. CLIP learns a high quality semantic OCR representation that performs well on digitally rendered text, which is common in its pre-training dataset, as evidenced by performance on Rendered SST2. However, CLIP only achieves 88% accuracy on the handwritten digits of MNIST. An embarrassingly simple baseline of logistic regression on raw pixels outperforms zero-shot CLIP. Both semantic and near-duplicate nearest-neighbor retrieval verify that there are almost no images that resemble MNIST digits in our pre-training dataset. This suggests CLIP does little to address the underlying problem of brittle generalization of deep learning models. Instead CLIP tries to circumvent the problem and hopes that by training on such a large and varied dataset all data will be effectively in-distribution. This is a naive assumption that, as MNIST demonstrates, is easy to violate.
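For reference, the “embarrassingly simple” raw-pixel baseline mentioned above can be reproduced in a few lines. The sketch below uses scikit-learn and torchvision’s MNIST loader; the specific preprocessing and hyperparameters are illustrative rather than the exact setup used in the paper.

```python
# Logistic regression on raw MNIST pixels, the simple baseline referenced above.
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import MNIST

train = MNIST("data", train=True, download=True)
test = MNIST("data", train=False, download=True)
X_train = train.data.numpy().reshape(len(train), -1) / 255.0  # 784 raw pixel values per image
X_test = test.data.numpy().reshape(len(test), -1) / 255.0
y_train, y_test = train.targets.numpy(), test.targets.numpy()

clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression on raw pixels
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # typically around 92%, above zero-shot CLIP's 88%
```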

Although CLIP can flexibly generate zero-shot classifiers for a wide variety of tasks and datasets, CLIP is still limited to choosing from only those concepts in a given zero-shot classifier. This is a significant restriction compared to a truly flexible approach like image captioning which could generate novel outputs. Unfortunately, as described in Section 2.3, we found the computational efficiency of the image caption baseline we tried to be much lower than CLIP. A simple idea worth trying is joint training of a contrastive and generative objective with the hope of combining the efficiency of CLIP with the flexibility of a caption model. As another alternative, search could be performed at inference time over many natural language explanations of a given image, similar to the approach proposed in Learning with Latent Language (Andreas et al., 2017).

CLIP also does not address the poor data efficiency of deep learning. Instead CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples. If every image seen during training of a CLIP model was presented at a rate of one per second, it would take 405 years to iterate through the 12.8 billion images seen over 32 training epochs. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee, ; Xie et al., 2020) methods is a promising direction given their demonstrated ability to improve data efficiency over standard supervised learning.
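The 405-year figure follows directly from the dataset size and epoch count; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the "405 years" figure quoted above.
images_seen = 400_000_000 * 32           # 400M (image, text) pairs x 32 epochs = 12.8B images
seconds_per_year = 60 * 60 * 24 * 365.25
print(images_seen / seconds_per_year)    # ~405.6 years at one image per second
```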

Our methodology has several significant limitations. Despite our focus on zero-shot transfer, we repeatedly queried performance on full validation sets to guide the development of CLIP. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. Similar concerns have been raised in the field of semi-supervised learning (Oliver et al., 2018). Another potential issue is our selection of evaluation datasets. While we have reported results on Kornblith et al. (2019)’s 12 dataset evaluation suite as a standardized collection, our main results use a somewhat haphazardly assembled collection of 27 datasets that is undeniably co-adapted with the development and capabilities of CLIP. Creating a new benchmark of tasks designed explicitly to evaluate broad zero-shot transfer capabilities, rather than re-using existing supervised datasets, would help address these issues.

CLIP is trained on text paired with images on the internet. These image-text pairs are unfiltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models (Bhargava & Forsyth, 2019). We refer readers to Section 7 for detailed analysis and quantification of these behaviors for CLIP as well as discussion of potential mitigation strategies.

While we have emphasized throughout this work that specifying image classifiers through natural language is a flexible and general interface, it has its own limitations. Many complex tasks and visual concepts can be difficult to specify just through text. Actual training examples are undeniably useful but CLIP does not optimize for few-shot performance directly. In our work, we fall back to fitting linear classifiers on top of CLIP’s features. This results in a counter-intuitive drop in performance when transitioning from a zero-shot to a few-shot setting. As discussed in Section 4, this is notably different from human performance which shows a large increase from a zero to a one shot setting. Future work is needed to develop methods that combine CLIP’s strong zero-shot performance with efficient few-shot learning.
CLIP 仍然有很多限制。尽管在各个部分中将其中几个作为分析的一部分进行了讨论,但我们在此处总结并收集了它们。

在具有训练划分的数据集上，零样本 CLIP 的性能平均而言可以与一个简单的有监督基线（在 ResNet-50 特征之上训练的线性分类器）相竞争。在这些数据集中的大多数上，该基线的性能如今已远低于整体最先进水平。要提升 CLIP 的任务学习和迁移能力仍需大量工作。虽然到目前为止扩大规模一直在稳步提高性能，并指出了一条持续改进的路线，但我们估计零样本 CLIP 还需要将计算量提高约 1000 倍才能达到整体最先进的性能，这在当前硬件上是不可行的。有必要进一步研究如何提高 CLIP 的计算效率和数据效率。

第 3.1 节中的分析发现，CLIP 的零样本性能在若干类任务上仍然相当薄弱。与特定任务模型相比，CLIP 在几类细粒度分类上表现不佳，例如区分汽车型号、花卉品种和飞机型号。CLIP 在更抽象、更系统化的任务上也很吃力，例如数出图像中物体的数量。最后，对于不太可能包含在 CLIP 预训练数据集中的新颖任务，例如判断照片中与最近一辆汽车的距离，CLIP 的性能可能接近随机。我们相信，仍有许许多多的任务上 CLIP 的零样本性能接近随机水平。

尽管如第 3.3 节所研究的，零样本 CLIP 能很好地泛化到许多自然图像分布，但我们观察到，零样本 CLIP 对于真正分布外（out-of-distribution）的数据仍然泛化得很差。附录 E 中报告的 OCR 任务就是一个说明性例子：CLIP 学到了高质量的语义 OCR 表示，在数字渲染文本上表现良好（这类文本在其预训练数据集中很常见），Rendered SST2 上的性能就是证明。但是，CLIP 在 MNIST 手写数字上仅达到 88% 的准确率，而在原始像素上做逻辑回归这个极其简单的基线就能超过零样本 CLIP。语义检索和近重复最近邻检索都验证了我们的预训练数据集中几乎没有与 MNIST 数字相似的图像。这表明 CLIP 几乎没有解决深度学习模型泛化脆弱这一根本问题；相反，CLIP 试图绕开该问题，寄希望于在如此庞大且多样的数据集上训练后，所有数据实际上都处于分布内。正如 MNIST 所示，这是一个很容易被违反的幼稚假设。

尽管 CLIP 可以灵活地为各种任务和数据集生成零样本分类器，但 CLIP 仍然只能在给定零样本分类器所包含的那些概念中进行选择。与可以生成全新输出的真正灵活的方法（例如图像字幕）相比，这是一个很大的限制。不幸的是，如第 2.3 节所述，我们发现所尝试的图像字幕基线的计算效率远低于 CLIP。一个值得尝试的简单想法是联合训练对比目标和生成目标，希望将 CLIP 的效率与字幕模型的灵活性结合起来。另一种替代方案是在推理时对给定图像的许多自然语言解释进行搜索，类似于 Learning with Latent Language (Andreas et al., 2017) 中提出的方法。

CLIP 也没有解决深度学习数据效率低的问题，而是通过使用一种可以扩展到数亿个训练示例的监督来源来加以弥补。如果 CLIP 模型训练期间看到的每幅图像都以每秒一幅的速度呈现，那么遍历 32 个训练轮次中看到的 128 亿幅图像需要 405 年。将 CLIP 与自监督（Henaff, 2020; Chen et al., 2020c）和自训练（Lee; Xie et al., 2020）方法相结合是一个有前途的方向，因为它们已被证明能够在标准监督学习之上提高数据效率。

我们的方法论有几个显著的局限。尽管我们的重点是零样本迁移，但我们在开发 CLIP 的过程中反复查询了完整验证集上的性能以指导开发。这些验证集通常包含数千个示例，对于真正的零样本场景而言是不现实的。半监督学习领域也提出过类似的担忧（Oliver et al., 2018）。另一个潜在问题是我们对评估数据集的选择。虽然我们报告了在 Kornblith 等人 (2019) 的 12 个数据集评估套件这一标准化集合上的结果，但我们的主要结果使用的是一个多少有些随意拼凑的 27 个数据集的集合，它无可否认地与 CLIP 的开发和能力相互适应。创建一个明确为评估广泛零样本迁移能力而设计的新任务基准，而不是重用现有的有监督数据集，将有助于解决这些问题。

CLIP 是在互联网上与图像配对的文本上训练的。这些图像-文本对未经过滤和整理，导致 CLIP 模型学到了许多社会偏见。此前已有工作在图像字幕模型上证明了这一点（Bhargava & Forsyth, 2019）。关于 CLIP 这些行为的详细分析与量化，以及对潜在缓解策略的讨论，请读者参阅第 7 节。

尽管我们在整个工作中都强调，通过自然语言指定图像分类器是一种灵活而通用的接口，但它也有自身的局限。许多复杂的任务和视觉概念很难仅通过文本来指定。实际的训练示例无疑是有用的，但 CLIP 并没有直接针对少样本（few-shot）性能进行优化。在我们的工作中，我们退而求其次，在 CLIP 的特征之上拟合线性分类器。这导致从零样本设置过渡到少样本设置时性能出现违反直觉的下降。如第 4 节所讨论的，这与人类的表现明显不同：人类从零样本到单样本会有很大的提升。未来需要研究将 CLIP 强大的零样本性能与高效的少样本学习相结合的方法。

7BROADER IMPACTS 更广泛的影响

CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks. One can give it images of cats and dogs and ask it to classify cats, or give it images taken in a department store and ask it to classify shoplifters–a task with significant social implications and for which AI may be unfit. Like any image classification system, CLIP’s performance and fitness for purpose need to be evaluated, and its broader impacts analyzed in context. CLIP also introduces a capability that will magnify and alter such issues: CLIP makes it possible to easily create your own classes for categorization (to ‘roll your own classifier’) without a need for re-training. This capability introduces challenges similar to those found in characterizing other, large-scale generative models like GPT-3 (Brown et al., 2020); models that exhibit non-trivial zero-shot (or few-shot) generalization can have a vast range of capabilities, many of which are made clear only after testing for them.
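As a concrete illustration of “rolling your own classifier”, the sketch below uses the released CLIP package (linked at the top of this document) to score an image against an arbitrary, user-chosen set of text classes. The class names and the image path are placeholders, not part of any experiment in the paper.

```python
# "Rolling your own classifier" with the released CLIP package: the class set is
# defined entirely by the text prompts, so no re-training is involved.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Arbitrary, user-chosen classes; changing this list changes the classifier.
class_names = ["a photo of a cat", "a photo of a dog"]
text = clip.tokenize(class_names).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({name: p.item() for name, p in zip(class_names, probs[0])})
```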

Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.

In addition to the more than 30 datasets studied in earlier sections of this paper, we evaluate CLIP’s performance on the FairFace benchmark and undertake exploratory bias probes. We then characterize the model’s performance in a downstream task, surveillance, and discuss its usefulness as compared with other available systems. Many of CLIP’s capabilities are omni-use in nature (e.g. OCR can be used to make scanned documents searchable, to power screen reading technologies, or to read license plates). Several of the capabilities measured, from action recognition, object classification, and geo-localization, to facial emotion recognition, can be used in surveillance. Given its social implications, we address this domain of use specifically in the Surveillance section.

We have also sought to characterize the social biases inherent to the model. Our bias tests represent our initial efforts to probe aspects of how the model responds in different scenarios, and are by nature limited in scope. CLIP and models like it will need to be analyzed in relation to their specific deployments to understand how bias manifests and identify potential interventions. Further community exploration will be required to develop broader, more contextual, and more robust testing schemes so that AI developers can better characterize biases in general purpose computer vision models.
CLIP 能够执行任意图像分类任务，因此具有广泛的能力。可以给它猫和狗的图像并要求它分类出猫，也可以给它在百货商店拍摄的图像并要求它识别入店行窃者，而后者是一项具有重大社会影响、AI 可能并不适合承担的任务。像任何图像分类系统一样，需要评估 CLIP 的性能及其是否适用于目标用途，并在具体情境中分析其更广泛的影响。CLIP 还引入了一种会放大并改变这些问题的能力：CLIP 使得无需重新训练即可轻松创建自己的分类类别（“自己动手搭建分类器”）。这一能力带来的挑战与表征 GPT-3（Brown 等, 2020）等其他大规模生成模型时遇到的挑战类似；展现出非平凡的零样本（或少样本）泛化能力的模型可以具有非常广泛的能力，其中许多能力只有在对其进行测试之后才会显现出来。

我们在零样本设置下对 CLIP 的研究表明，该模型对图像检索或搜索等广泛适用的任务展现出显著潜力。例如，给定文本，它可以在数据库中找到相关图像；给定图像，它也可以找到相关文本。此外，只需很少甚至不需要额外数据或训练就能将 CLIP 引向定制应用，这种相对的便利性可能会解锁各种我们今天难以想象的新应用，就像过去几年大型语言模型所经历的那样。

除了本文前面各节研究的 30 多个数据集之外，我们还在 FairFace 基准上评估了 CLIP 的性能，并进行了探索性的偏见探查。然后，我们刻画了模型在一个下游任务（监控）中的表现，并讨论了它与其他现有系统相比的可用性。CLIP 的许多能力本质上是多用途的（例如，OCR 可以用来使扫描文档可被搜索、支持屏幕阅读技术，或读取车牌）。所测量的多项能力，从动作识别、物体分类、地理定位到面部情绪识别，都可以用于监控。考虑到其社会影响，我们在“监控”一节中专门讨论这一应用领域。

我们还试图刻画模型固有的社会偏见。我们的偏见测试代表了我们为探查模型在不同情境下如何响应所做的初步努力，其范围本质上是有限的。CLIP 及类似模型需要结合其具体部署进行分析，以理解偏见如何表现并确定潜在的干预措施。还需要进一步的社区探索，以开发更广泛、更贴合具体情境、更稳健的测试方案，使 AI 开发者能够更好地刻画通用计算机视觉模型中的偏见。

| Model | Race | Gender | Age |
| - | - | - | - |
| FairFace Model | 93.7 | 94.2 | 59.7 |
| Linear Probe CLIP | 93.4 | 96.5 | 63.8 |
| Zero-Shot CLIP | 58.3 | 95.9 | 57.1 |
| Linear Probe Instagram | 90.8 | 93.2 | 54.2 |

Table 3: Percent accuracy on Race, Gender, and Age classification of images in FairFace category ‘White’

| Model | Race | Gender | Age |
| - | - | - | - |
| FairFace Model | 75.4 | 94.4 | 60.7 |
| Linear Probe CLIP | 92.8 | 97.7 | 63.1 |
| Zero-Shot CLIP | 91.3 | 97.2 | 54.3 |
| Linear Probe Instagram | 87.2 | 93.9 | 54.1 |

Table 4: Percent accuracy on Race, Gender, and Age classification of images in FairFace categories ‘Black,’ ‘Indian,’ ‘East Asian,’ ‘Southeast Asian,’ ‘Middle Eastern,’ and ‘Latino’ (grouped together as FairFace category ‘Non-White’)

Table 5: Percent accuracy on gender classification of images by FairFace race category

| Category | Black | White | Indian | Latino | Middle Eastern | Southeast Asian | East Asian |
| - | - | - | - | - | - | - | - |
| crime-related | 16.4 | 24.9 | 24.4 | 10.8 | 19.7 | 4.4 | 1.3 |
| non-human | 14.4 | 5.5 | 7.6 | 3.7 | 2.0 | 1.9 | 0.0 |

Table 6: Percent of images classified into crime-related and non-human categories by FairFace Race category. The label set included 7 FairFace race categories each for men and women (for a total of 14), as well as 3 crime-related categories and 4 non-human categories.

| Label set | 0-2 | 3-9 | 10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | over 70 |
| - | - | - | - | - | - | - | - | - | - |
| default label set | 30.3 | 35.0 | 29.5 | 16.3 | 13.9 | 18.5 | 19.1 | 16.2 | 10.4 |
| default label set + “child” | 2.3 | 4.3 | 14.7 | 15.0 | 13.4 | 18.2 | 18.6 | 15.5 | 9.4 |

Table 7: Percent of images classified into crime-related and non-human categories by FairFace Age category, showing comparison between results obtained using a default label set and a label set to which the label ’child’ has been added. The default label set included 7 FairFace race categories each for men and women (for a total of 14), 3 crime-related categories and 4 non-human categories.
表 7:按 FairFace 年龄类别划分为犯罪相关和非人类类别的图像百分比,显示了使用默认标签集和添加了标签“孩子”的标签集所获得的结果之间的比较。默认标签集包括 7 个 FairFace 种族类别,分别针对男性和女性(共 14 个),3 个与犯罪有关的类别和 4 个非人类类别。

7.1BIAS

Algorithmic decisions, training data, and choices about how classes are defined and taxonomized (which we refer to informally as “class design”) can all contribute to and amplify social biases and inequalities resulting from the use of AI systems (Noble, 2018; Bechmann & Bowker, 2019; Bowker & Star, 2000). Class design is particularly relevant to models like CLIP, since any developer can define a class and the model will provide some result.

In this section, we provide preliminary analysis of some of the biases in CLIP, using bias probes inspired by those outlined in Buolamwini & Gebru (2018) and Kärkkäinen & Joo (2019). We also conduct exploratory bias research intended to find specific examples of biases in the model, similar to that conducted by Solaiman et al. (2019).

We start by analyzing the performance of Zero-Shot CLIP on the face image dataset FairFace (Kärkkäinen & Joo, 2019) as an initial bias probe, then probe the model further to surface additional biases and sources of biases, including class design. (FairFace is a face image dataset designed to balance age, gender, and race, in order to reduce asymmetries common in previous face datasets. It categorizes gender into 2 groups: female and male, and race into 7 groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. There are inherent problems with race and gender classifications, as e.g. Bowker & Star (2000) and Keyes (2018) have shown. While FairFace’s dataset reduces the proportion of White faces, it still lacks representation of entire large demographic groups, effectively erasing such categories. We use the 2 gender categories and 7 race categories defined in the FairFace dataset in a number of our experiments not in order to reinforce or endorse the use of such reductive categories, but in order to enable us to make comparisons to prior work.)

We evaluated two versions of CLIP on the FairFace dataset: a zero-shot CLIP model (“ZS CLIP”), and a logistic regression classifier fitted to FairFace’s dataset on top of CLIP’s features (“LR CLIP”). We find that LR CLIP gets higher accuracy on the FairFace dataset than both the ResNext-101 32x48d Instagram model (“Linear Probe Instagram”) (Mahajan et al., 2018) and FairFace’s own model on most of the classification tests we ran. (One challenge with this comparison is that the FairFace model uses binary classes for race, “White” and “Non-White”, instead of breaking down races into finer-grained sub-groups.) ZS CLIP’s performance varies by category and is worse than that of FairFace’s model for a few categories, and better for others. (See Table 3 and Table 4.)

Additionally, we test the performance of the LR CLIP and ZS CLIP models across intersectional race and gender categories as they are defined in the FairFace dataset. We find that model performance on gender classification is above 95% for all race categories. Table 5 summarizes these results.

While LR CLIP achieves higher accuracy than the Linear Probe Instagram model on the FairFace benchmark dataset for gender, race and age classification of images by intersectional categories, accuracy on benchmarks offers only one approximation of algorithmic fairness, as Raji et al. (2020) have shown, and often fails as a meaningful measure of fairness in real world contexts. Even if a model has both higher accuracy and lower disparities in performance on different sub-groups, this does not mean it will have lower disparities in impact (Scheuerman et al., 2019). For example, higher performance on underrepresented groups might be used by a company to justify their use of facial recognition, and to then deploy it in ways that affect demographic groups disproportionately. Our use of facial classification benchmarks to probe for biases is not intended to imply that facial classification is an unproblematic task, nor to endorse the use of race, age, or gender classification in deployed contexts.

We also probed the model using classification terms with high potential to cause representational harm, focusing on denigration harms in particular (Crawford, 2017). We carried out an experiment in which the ZS CLIP model was required to classify 10,000 images from the FairFace dataset. In addition to the FairFace classes, we added in the following classes: ‘animal’, ‘gorilla’, ‘chimpanzee’, ‘orangutan’, ‘thief’, ‘criminal’ and ‘suspicious person’. The goal of this experiment was to check if harms of denigration disproportionately impact certain demographic subgroups.
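A minimal sketch of this probe, reusing the zero-shot classification recipe shown earlier: the prompt wording, the FairFace prompt set, and the data loading are illustrative assumptions, not the paper’s released tooling.

```python
# Sketch of the denigration-harm probe: zero-shot classification over the FairFace
# classes plus the added non-human and crime-related classes, followed by per-group
# misclassification tallies. Prompt wording and data loading are illustrative.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

fairface_classes = [f"a photo of a {race} {gender}"
                    for race in ["white", "black", "indian", "east asian",
                                 "southeast asian", "middle eastern", "latino"]
                    for gender in ["man", "woman"]]
added_classes = ["a photo of an animal", "a photo of a gorilla",
                 "a photo of a chimpanzee", "a photo of an orangutan",
                 "a photo of a thief", "a photo of a criminal",
                 "a photo of a suspicious person"]
prompts = fairface_classes + added_classes

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

def predicted_class(image_tensor):
    """image_tensor: a single preprocessed FairFace image."""
    with torch.no_grad():
        f = model.encode_image(image_tensor.unsqueeze(0).to(device))
        f = f / f.norm(dim=-1, keepdim=True)
        return prompts[(f @ text_features.T).argmax().item()]

# Given (image_tensor, race_label) pairs from FairFace, tally how often each group
# is assigned one of the added harmful classes, as summarized in Table 6.
```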

We found that 4.9% (confidence intervals between 4.6% and 5.4%) of the images were misclassified into one of the non-human classes we used in our probes (‘animal’, ‘chimpanzee’, ‘gorilla’, ‘orangutan’). Out of these, ‘Black’ images had the highest misclassification rate (approximately 14%; confidence intervals between 12.6% and 16.4%) while all other races had misclassification rates under 8%. People aged 0-20 years had the highest proportion being classified into this category, at 14%.

We also found that 16.5% of male images were misclassified into classes related to crime (‘thief’, ‘suspicious person’ and ‘criminal’) as compared to 9.8% of female images. Interestingly, we found that people aged 0-20 years old were more likely to fall under these crime-related classes (approximately 18%) compared to images of people in different age ranges (approximately 12% for people aged 20-60 and 0% for people over 70). We found significant disparities in classifications across races for crime related terms, which is captured in Table 6.

Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category ‘child’ added to the categories. Our goal here was to see if this category would significantly change the behaviour of the model and shift how the denigration harms are distributed by age. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behaviour the model may exhibit, while also raising overarching questions about the use of face images to automatically classify people along such lines (y Arcas et al., 2017).

The results of these probes can change based on the class categories one chooses to include as well as the specific language one uses to describe each class. Poor class design can lead to poor real world performance; this concern is particularly relevant to a model like CLIP, given how easily developers can design their own classes.
算法决策、训练数据以及关于如何定义和划分类别的选择（我们将其非正式地称为“类别设计”）都可能促成并放大因使用 AI 系统而产生的社会偏见和不平等（Noble, 2018; Bechmann & Bowker, 2019; Bowker & Star, 2000）。类别设计与 CLIP 这类模型尤其相关，因为任何开发者都可以定义一个类别，而模型都会给出某种结果。

在本节中，我们借鉴 Buolamwini & Gebru (2018) 和 Kärkkäinen & Joo (2019) 中概述的偏见探针，对 CLIP 中的部分偏见进行初步分析。我们还进行了旨在找出模型中偏见具体实例的探索性偏见研究，类似于 Solaiman 等人 (2019) 所做的工作。

我们首先分析 Zero-Shot CLIP 在人脸图像数据集 FairFace（Kärkkäinen & Joo, 2019）上的性能，将其作为初始的偏见探针，然后进一步探查模型，以揭示包括类别设计在内的其他偏见及偏见来源。（FairFace 是一个旨在平衡年龄、性别和种族的人脸图像数据集，以减少以往人脸数据集中常见的不对称性。它将性别分为两组：女性和男性；将种族分为 7 组：白人、黑人、印度裔、东亚、东南亚、中东和拉丁裔。正如 Bowker & Star (2000) 和 Keyes (2018) 等所表明的，种族和性别分类存在固有问题。尽管 FairFace 数据集降低了白人面孔的比例，但它仍然缺乏对一些庞大人口群体的代表，实际上抹去了这些类别。我们在许多实验中使用 FairFace 数据集定义的 2 个性别类别和 7 个种族类别，并不是为了强化或认可此类简化类别的使用，而是为了能够与先前工作进行比较。）

我们在 FairFace 数据集上评估了两个版本的 CLIP：零样本 CLIP 模型（“ZS CLIP”），以及在 CLIP 特征之上对 FairFace 数据集拟合的逻辑回归分类器（“LR CLIP”）。我们发现，在我们运行的大多数分类测试中，LR CLIP 在 FairFace 数据集上的准确率高于 ResNext-101 32x48d Instagram 模型（“Linear Probe Instagram”）（Mahajan 等, 2018）和 FairFace 自己的模型。（这种比较的一个挑战在于，FairFace 模型对种族使用二元类别（“白人”和“非白人”），而不是将种族细分为更细粒度的子组。）ZS CLIP 的性能因类别而异：在某些类别上比 FairFace 的模型差，在其他类别上则更好。（请参阅表 3 和表 4。）

此外，我们测试了 LR CLIP 和 ZS CLIP 模型在 FairFace 数据集所定义的种族与性别交叉类别上的性能。我们发现，在所有种族类别上，模型的性别分类性能均高于 95%。表 5 总结了这些结果。

尽管在 FairFace 基准数据集上，按交叉类别对图像进行性别、种族和年龄分类时，LR CLIP 的准确率高于 Linear Probe Instagram 模型，但正如 Raji 等人 (2020) 所表明的，基准上的准确率只是算法公平性的一种近似，在现实环境中往往无法作为有意义的公平性度量。即使一个模型在不同子群体上既有更高的准确率又有更小的性能差距，也不意味着它在影响上的差距更小（Scheuerman et al., 2019）。例如，公司可能以在代表性不足群体上的更高性能为由，为其使用人脸识别辩护，然后以不成比例地影响某些人口群体的方式部署它。我们使用人脸分类基准来探查偏见，并不意味着人脸分类是一项没有问题的任务，也不是为在实际部署环境中使用种族、年龄或性别分类背书。

我们还使用极有可能造成表征性伤害的分类词汇来探查模型，尤其关注诋毁性伤害（Crawford, 2017）。我们进行了一项实验，要求 ZS CLIP 模型对 FairFace 数据集中的 10,000 张图像进行分类。除 FairFace 的类别外，我们还添加了以下类别：“动物”、“大猩猩”、“黑猩猩”、“猩猩”、“小偷”、“罪犯”和“可疑人员”。该实验的目的是检查诋毁性伤害是否不成比例地影响某些人口子群体。

我们发现，有 4.9%（置信区间介于 4.6% 和 5.4% 之间）的图像被误分类为我们在探针中使用的非人类类别之一（“动物”、“黑猩猩”、“大猩猩”、“猩猩”）。其中，“黑人（Black）”图像的误分类率最高（约 14%；置信区间介于 12.6% 和 16.4% 之间），而所有其他种族的误分类率均低于 8%。0-20 岁人群被归入这一类别的比例最高，为 14%。

我们还发现，有 16.5% 的男性图像被误分类为与犯罪相关的类别（“小偷”、“可疑人员”和“罪犯”），而女性图像的这一比例为 9.8%。有趣的是，我们发现 0-20 岁的人更有可能被归入这些与犯罪相关的类别（约 18%），而其他年龄段的比例较低（20-60 岁约为 12%，70 岁以上为 0%）。我们发现，与犯罪相关词汇的分类在不同种族之间存在显著差异，如表 6 所示。

鉴于我们观察到 20 岁以下的人最有可能被归入与犯罪相关和非人类动物类别，我们对相同的图像使用相同的类别重新进行了分类，但在类别中额外添加了“儿童”这一类别。我们的目标是查看这一类别是否会显著改变模型的行为，并改变诋毁性伤害在不同年龄上的分布。我们发现，这大大减少了 20 岁以下人群被归入犯罪相关类别或非人类动物类别的图像数量（表 7）。这表明类别设计有可能成为决定模型性能以及模型可能表现出的不良偏见或行为的关键因素，同时也引出了关于使用人脸图像沿这些维度自动对人进行分类的更根本问题（y Arcas et al., 2017）。

这些探查的结果会随着所选择包含的类别以及描述每个类别所用的具体措辞而变化。糟糕的类别设计可能导致糟糕的现实表现；考虑到开发者可以如此轻松地设计自己的类别，这一担忧与 CLIP 这类模型尤其相关。


Figure 18: CLIP performance on Member of Congress images when given the combined returned label set for the images from Google Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision. The 20 most gendered labels for men and women were identified with χ2 tests with the threshold at 0.5%. Labels are sorted by absolute frequencies. Bars denote the percentage of images for a certain label by gender.

We also carried out experiments similar to those outlined by Schwemmer et al. (2020) to test how CLIP treated images of men and women differently using images of Members of Congress. As part of these experiments, we studied how certain additional design decisions such as deciding thresholds for labels can impact the labels output by CLIP and how biases manifest.

We carried out three experiments: we tested for accuracy on gender classification, and we tested how labels were differentially distributed across two different label sets. For our first label set, we used a label set of 300 occupations, and for our second label set we used a combined set of labels that Google Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision returned for all the images.

We first simply looked into the gender prediction performance of the model on the images of Members of Congress, in order to check whether the model correctly recognized men as men and women as women given the image of a person who appeared to be in an official setting/position of power. We found that the model got 100% accuracy on the images. This is slightly better performance than the model’s performance on the FairFace dataset. We hypothesize that one of the reasons for this is that all the images in the Members of Congress dataset were high-quality and clear, with the people clearly centered, unlike those in the FairFace dataset.

In order to study how the biases in returned labels depend on the thresholds set for label probability, we did an experiment in which we set threshold values at 0.5% and 4.0%. We found that the lower threshold led to lower quality of labels. However, even the differing distributions of labels under this threshold can hold signals for bias. For example, we find that under the 0.5% threshold labels such as ‘nanny’ and ‘housekeeper’ start appearing for women whereas labels such as ‘prisoner’ and ‘mobster’ start appearing for men. This points to gendered associations similar to those that have previously been found for occupations (Schwemmer et al., 2020) (Nosek et al., 2002) (Bolukbasi et al., 2016).
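The sketch below illustrates the kind of bookkeeping involved: apply a probability threshold to the labels returned for each image, then flag labels whose frequency differs significantly by gender with a chi-squared test. The data structures and the use of scipy’s chi2_contingency are illustrative assumptions, not the paper’s exact analysis code.

```python
# Threshold labels by probability, then test for gender-skewed labels with a
# chi-squared test. Inputs are illustrative: per-image dicts of label -> probability.
from collections import Counter
from scipy.stats import chi2_contingency

def labels_above_threshold(label_probs, threshold=0.005):
    """Keep labels whose probability meets the threshold (0.005 = the 0.5% setting)."""
    return [label for label, p in label_probs.items() if p >= threshold]

def gendered_labels(per_image_labels, genders, alpha=0.05):
    """per_image_labels: list of label lists; genders: parallel list of 'man'/'woman'."""
    counts = {"man": Counter(), "woman": Counter()}
    totals = Counter(genders)
    for labels, gender in zip(per_image_labels, genders):
        counts[gender].update(set(labels))
    significant = []
    for label in set(counts["man"]) | set(counts["woman"]):
        m, w = counts["man"][label], counts["woman"][label]
        table = [[m, totals["man"] - m], [w, totals["woman"] - w]]
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            significant.append((label, m / totals["man"], w / totals["woman"]))
    # Sorted by how different the per-gender frequencies are.
    return sorted(significant, key=lambda x: abs(x[1] - x[2]), reverse=True)
```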

At the higher 4% threshold, the labels with the highest probability across both genders include “lawmaker”, “legislator” and “congressman”. However, the presence of these biases amongst lower probability labels nonetheless points to larger questions about what ‘sufficiently’ safe behaviour may look like for deploying such systems.

When given the combined set of labels that Google Cloud Vision (GCV), Amazon Rekognition and Microsoft returned for all the images, similar to the biases Schwemmer et al. (2020) found in GCV systems, we found our system also disproportionately attached labels to do with hair and appearance in general to women more than men. For example, labels such as ‘brown hair’, ‘blonde’ and ‘blond’ appeared significantly more often for women. Additionally, CLIP attached some labels that described high status occupations disproportionately more often to men such as ‘executive’ and ‘doctor’. Out of the only four occupations that it attached more often to women, three were ‘newscaster’, ‘television presenter’ and ‘newsreader’ and the fourth was ‘Judge’. This is again similar to the biases found in GCV and points to historical gendered differences (Schwemmer et al., 2020).

Interestingly, when we lowered the threshold to 0.5% for this set of labels, we found that the labels disproportionately describing men also shifted to appearance oriented words such as ‘suit’, ‘tie’ and ‘necktie’ (Figure 18). Many occupation oriented words such as ‘military person’ and ‘executive’ - which were not used to describe images of women at the higher 4% threshold - were used for both men and women at the lower 0.5% threshold, which could have caused the change in labels for men. The reverse was not true. Descriptive words used to describe women were still uncommon amongst men.

Design decisions at every stage of building a model impact how biases manifest and this is especially true for CLIP given the flexibility it offers. In addition to choices about training data and model architecture, decisions about things like class designs and thresholding values can alter the labels a model outputs and as a result heighten or lower certain kinds of harm, such as those described by Crawford (2017). People designing and developing models and AI systems have considerable power. Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest.

These experiments are not comprehensive. They illustrate potential issues stemming from class design and other sources of bias, and are intended to spark inquiry.
图 18：在给定由 Google Cloud Vision、Amazon Rekognition 和 Microsoft Azure Computer Vision 为图像返回的合并标签集时，CLIP 在国会议员图像上的表现。使用阈值为 0.5% 的 χ2 检验确定了男性和女性各自性别倾向最强的 20 个标签。标签按绝对频率排序。条形表示按性别划分、带有某一标签的图像百分比。

我们还进行了类似于 Schwemmer 等人 (2020) 所述的实验，使用国会议员的图像来测试 CLIP 对男性和女性图像的不同处理方式。作为这些实验的一部分，我们研究了某些额外的设计决策（例如为标签设定阈值）会如何影响 CLIP 输出的标签以及偏见如何显现。

我们进行了三个实验：我们测试了性别分类的准确率，并测试了标签在两个不同标签集上的差异化分布方式。对于第一个标签集，我们使用了包含 300 种职业的标签集；对于第二个标签集，我们使用了 Google Cloud Vision、Amazon Rekognition 和 Microsoft Azure Computer Vision 为所有图像返回的合并标签集。

我们首先简单考察了模型在国会议员图像上的性别预测性能，以检查在给定一张看起来处于官方场合或权力位置的人物图像时，模型能否正确地将男性识别为男性、将女性识别为女性。我们发现该模型在这些图像上达到了 100% 的准确率，这比模型在 FairFace 数据集上的表现略好。我们推测其中一个原因是，与 FairFace 数据集不同，国会议员数据集中的所有图像都是高质量且清晰的，人物也明显位于画面中央。

为了研究返回标签中的偏见如何取决于为标签概率设定的阈值，我们进行了一项实验，将阈值分别设为 0.5% 和 4.0%。我们发现较低的阈值会导致标签质量下降。然而，即便是在这一阈值下标签分布的差异，也可能携带偏见的信号。例如，我们发现在 0.5% 阈值下，“保姆”和“管家”之类的标签开始出现在女性身上，而“囚犯”和“黑帮分子”之类的标签则开始出现在男性身上。这表明了与先前在职业方面发现的类似的性别关联（Schwemmer 等, 2020; Nosek 等, 2002; Bolukbasi 等, 2016）。

在较高的 4% 阈值下，两个性别中概率最高的标签都包括“lawmaker”（立法者）、“legislator”（立法委员）和“congressman”（国会议员）。但是，这些偏见出现在较低概率标签中的事实，仍然引出了更大的问题：对于部署此类系统而言，“足够”安全的行为究竟应该是什么样的。

当给定 Google Cloud Vision（GCV）、Amazon Rekognition 和 Microsoft 为所有图像返回的合并标签集时，与 Schwemmer 等人 (2020) 在 GCV 系统中发现的偏见类似，我们发现我们的系统也不成比例地更多地将与头发和外貌相关的标签附加给女性而非男性。例如，“棕发”“金发女郎”和“金发”之类的标签在女性身上出现得明显更频繁。此外，CLIP 将一些描述高地位职业的标签（例如“高管”和“医生”）不成比例地更多地附加给男性。在仅有的四个更常被附加给女性的职业标签中，三个是“新闻播报员”“电视主持人”和“新闻播音员”，第四个是“法官”。这再次与在 GCV 中发现的偏见类似，并指向历史上的性别差异（Schwemmer 等人, 2020）。

有趣的是，当我们将这组标签的阈值降低到 0.5% 时，我们发现不成比例地描述男性的标签也转向了面向外貌的词汇，例如“西装”“领带”和“领结”（图 18）。许多以职业为导向的词汇，例如“军人”和“高管”（这些词在较高的 4% 阈值下并未被用于描述女性图像），在较低的 0.5% 阈值下被同时用于男性和女性，这可能导致了男性标签的变化。反之则不然：用于描述女性的外貌性词汇在男性身上仍然不常见。

构建模型的每个阶段的设计决策都会影响偏见的显现方式，鉴于 CLIP 提供的灵活性，这一点尤其突出。除了关于训练数据和模型架构的选择之外，关于类别设计和阈值设定等事项的决策也会改变模型输出的标签，从而加剧或减轻某些类型的伤害，例如 Crawford (2017) 所描述的那些。设计和开发模型与 AI 系统的人掌握着相当大的权力。类别设计等决策不仅是模型性能的关键决定因素，也是模型偏见以何种方式、在何种情境下显现的关键决定因素。

这些实验并不全面。它们展示了源于类别设计和其他偏见来源的潜在问题，旨在激发进一步的探究。

7.2SURVEILLANCE

We next sought to characterize model performance in relation to a downstream task for which there is significant societal sensitivity: surveillance. Our analysis aims to better embody the characterization approach described above and to help orient the research community towards the potential future impacts of increasingly general purpose computer vision models and aid the development of norms and checks around such systems. Our inclusion of surveillance is not intended to indicate enthusiasm for this domain - rather, we think surveillance is an important domain to try to make predictions about given its societal implications (Zuboff, 2015; Browne, 2015).

We measure the model’s performance on classification of images from CCTV cameras and zero-shot celebrity identification. We first tested model performance on low-resolution images captured from surveillance cameras (e.g. CCTV cameras). We used the VIRAT dataset (Oh et al., 2011) and data captured by Varadarajan & Odobez (2009), which both consist of real world outdoor scenes with non-actors.

Given CLIP’s flexible class construction, we tested 515 surveillance images captured from 12 different video sequences on self-constructed general classes for coarse and fine grained classification. Coarse classification required the model to correctly identify the main subject of the image (i.e. determine if the image was a picture of an empty parking lot, school campus, etc.). For fine-grained classification, the model had to choose between two options constructed to determine if the model could identify the presence/absence of smaller features in the image such as a person standing in the corner.

For coarse classification, we constructed the classes by hand-captioning the images ourselves to describe the contents of the image and there were always at least 6 options for the model to choose from. Additionally, we carried out a ‘stress test’ where the class set included at least one more caption for something that was ‘close’ to the image (for example, ‘parking lot with white car’ vs. ‘parking lot with red car’). We found that the model had a top-1 accuracy of 91.8% on the CCTV images for the initial evaluation. The accuracy dropped significantly to 51.1% for the second evaluation, with the model incorrectly choosing the ‘close’ answer 40.7% of the time.

For fine-grained detection, the zero-shot model performed poorly, with results near random. Note that this experiment was targeted only towards detecting the presence or absence of small objects in image sequences.

We also tested CLIP’s zero-shot performance for ‘in the wild’ identity detection using the CelebA dataset (note: the CelebA dataset is more representative of faces with lighter skin tones; due to the nature of the dataset, we were not able to control for race, gender, age, etc.). We did this to evaluate the model’s performance for identity detection using just the publicly available data it was pre-trained on. While we tested this on a dataset of celebrities who have a larger number of images on the internet, we hypothesize that the number of images in the pre-training data needed for the model to associate faces with names will keep decreasing as models get more powerful (see Table 8), which has significant societal implications (Garvie, 2019). This mirrors recent developments in natural language processing, in which recent large language models trained on Internet data often exhibit a surprising ability to provide information related to relatively minor public figures (Brown et al., 2020).

We found that the model had 59.2% top-1 accuracy out of 100 possible classes for ‘in the wild’ 8k celebrity images. However, this performance dropped to 43.3% when we increased our class sizes to 1k celebrity names. This performance is not competitive when compared to production level models such as Google’s Celebrity Recognition (Google, ). However, what makes these results noteworthy is that this analysis was done using only zero-shot identification capabilities based on names inferred from pre-training data - we didn’t use any additional task-specific dataset, and so the (relatively) strong results further indicate that before deploying multimodal models, people will need to carefully study them for behaviors in a given context and domain.

CLIP offers significant benefit for tasks that have relatively little data given its zero-shot capabilities. However, large datasets and high performing supervised models exist for many in-demand surveillance tasks such as facial recognition. As a result, CLIP’s comparative appeal for such uses is low. Additionally, CLIP is not designed for common surveillance-relevant tasks like object detection and semantic segmentation. This means it has limited use for certain surveillance tasks when models that are designed with these uses in mind such as Detectron2 (Wu et al., 2019) are widely available.

However, CLIP does unlock a certain aspect of usability given how it removes the need for training data. Thus, CLIP and similar models could enable bespoke, niche surveillance use cases for which no well-tailored models or datasets exist, and could lower the skill requirements to build such applications. As our experiments show, ZS CLIP displays non-trivial, but not exceptional, performance on a few surveillance relevant tasks today.
接下来，我们试图刻画模型在一个具有高度社会敏感性的下游任务（监控）上的表现。我们的分析旨在更好地体现上文描述的表征方法，帮助研究社区关注日益通用的计算机视觉模型可能带来的未来影响，并协助围绕此类系统建立规范和制衡。我们将监控纳入分析并不表示我们对这一领域抱有热情；相反，我们认为鉴于其社会影响，监控是一个值得尝试做出预测的重要领域（Zuboff, 2015; Browne, 2015）。

我们衡量了模型在闭路电视（CCTV）摄像头图像分类和零样本名人识别上的表现。我们首先在监控摄像头（例如 CCTV 摄像头）拍摄的低分辨率图像上测试了模型性能。我们使用了 VIRAT 数据集（Oh 等, 2011）以及 Varadarajan & Odobez (2009) 采集的数据，两者都由包含非演员的真实室外场景组成。

鉴于 CLIP 可以灵活构建类别，我们在自行构建的通用类别上测试了从 12 个不同视频序列中截取的 515 张监控图像，进行粗粒度和细粒度分类。粗粒度分类要求模型正确识别图像的主要主题（即判断图像是空停车场、学校校园等）。对于细粒度分类，模型必须在构造的两个选项之间做出选择，以确定模型能否识别图像中是否存在较小的特征，例如站在角落里的人。

对于粗粒度分类，我们通过亲自为图像撰写描述其内容的字幕来构造类别，并且始终至少有 6 个选项供模型选择。此外，我们还进行了一次“压力测试”，其中类别集合至少包含一个与图像“接近”的描述（例如“停着白色汽车的停车场”与“停着红色汽车的停车场”）。我们发现，在初始评估中，该模型在 CCTV 图像上的 top-1 准确率为 91.8%。在第二次评估中，准确率显著下降到 51.1%，模型在 40.7% 的情况下错误地选择了“接近”的答案。

对于细粒度检测，零样本模型表现不佳，结果接近随机。请注意，该实验仅针对检测图像序列中是否存在小物体。

我们还使用 CelebA 数据集测试了 CLIP 在“真实环境（in the wild）”身份识别上的零样本性能（注：CelebA 数据集更多地代表肤色较浅的面孔；由于数据集的性质，我们无法控制种族、性别、年龄等因素）。我们这样做是为了仅使用模型预训练所用的公开可得数据来评估其身份识别性能。虽然我们是在互联网上图像数量较多的名人数据集上进行测试的，但我们推测，随着模型变得越来越强大，模型将人脸与姓名关联起来所需的预训练数据图像数量将持续减少（见表 8），这具有重大的社会影响（Garvie, 2019）。这与自然语言处理领域的近期进展相呼应：在互联网数据上训练的大型语言模型常常展现出令人惊讶的能力，可以提供与相对不知名的公众人物相关的信息（Brown 等人, 2020）。

我们发现，对于“真实环境”中的 8k 名人图像，该模型在 100 个可能类别中的 top-1 准确率为 59.2%。但是，当我们将类别数量增加到 1k 个名人姓名时，这一性能下降到 43.3%。与 Google 的 Celebrity Recognition（Google）等生产级模型相比，这一性能不具竞争力。然而，这些结果值得注意之处在于，该分析仅使用了基于从预训练数据中推断出的姓名的零样本识别能力：我们没有使用任何额外的任务特定数据集，因此这一（相对）强劲的结果进一步表明，在部署多模态模型之前，人们需要仔细研究它们在给定情境和领域中的行为。

鉴于其零样本能力，CLIP 对于数据相对较少的任务具有显著优势。然而，对于人脸识别等许多需求旺盛的监控任务，已经存在大型数据集和高性能的有监督模型。因此，CLIP 在此类用途上的相对吸引力较低。此外，CLIP 并非为目标检测和语义分割等常见的监控相关任务而设计。这意味着，当 Detectron2（Wu 等人, 2019）等专为这些用途设计的模型广泛可用时，CLIP 在某些监控任务中的用处有限。

然而，鉴于 CLIP 消除了对训练数据的需求，它确实在可用性方面解锁了某些可能。因此，CLIP 及类似模型可能使那些尚无量身定制的模型或数据集的小众监控用例成为可能，并可能降低构建此类应用的技能门槛。正如我们的实验所示，ZS CLIP 目前在一些与监控相关的任务上表现出不平凡但并不出众的性能。

| Model | 100 Classes | 1k Classes | 2k Classes |
| - | - | - | - |
| CLIP L/14 | 59.2 | 43.3 | 42.2 |
| CLIP RN50x64 | 56.4 | 39.5 | 38.4 |
| CLIP RN50x16 | 52.7 | 37.4 | 36.3 |
| CLIP RN50x4 | 52.8 | 38.1 | 37.3 |

Table 8: CelebA Zero-Shot Top-1 Identity Recognition Accuracy
表 8: CelebA 零样本 Top-1 身份识别准确率

7.3FUTURE WORK 未来的工作

This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models pose and to give a glimpse into their biases and impacts. We hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models, and we are excited to engage with the research community on such questions.

We believe one good step forward is community exploration to further characterize the capabilities of models like CLIP and - crucially - identify application areas where they have promising performance and areas where they may have reduced performance (a model could be unfit for use due to inadequate performance or due to the inappropriateness of AI use in the application area itself). This process of characterization can help researchers increase the likelihood models are used beneficially by:

  • Identifying potentially beneficial downstream uses of models early in the research process, enabling other researchers to think about applications.
  • Surfacing tasks with significant sensitivity and a large set of societal stakeholders, which may call for intervention by policymakers.
  • Better characterizing biases in models, alerting other researchers to areas of concern and areas for interventions.
  • Creating suites of tests to evaluate systems like CLIP on, so we can better characterize model capabilities earlier in the development cycle.
  • Identifying potential failure modes and areas for further work.

We plan to contribute to this work, and hope this analysis provides some motivating examples for subsequent research.
此初步分析旨在说明通用计算机视觉模型带来的一些挑战,并能一窥它们的偏见和影响。我们希望这项工作能够激发人们对此类模型的功能,缺点和偏见进行表征的未来研究,并且我们很高兴能与研究社区就此类问题进行互动。

我们认为，一个好的前进方向是由社区展开探索，进一步表征 CLIP 这类模型的能力，并且（至关重要地）确定它们表现有前景的应用领域以及性能可能下降的领域（模型可能因性能不足，或因在该应用领域使用 AI 本身并不恰当而不适合使用）。这一表征过程可以通过以下方式帮助研究人员提高模型被有益使用的可能性：

  • 在研究过程的早期识别模型潜在的有益下游用途，使其他研究人员能够思考应用方向。
  • 揭示具有高度敏感性且涉及大量社会利益相关者的任务，这些任务可能需要政策制定者的干预。
  • 更好地刻画模型中的偏见，提醒其他研究人员关注值得担忧的领域和需要干预的领域。
  • 创建测试套件来评估 CLIP 这类系统，以便我们能在开发周期更早阶段更好地表征模型能力。
  • 确定潜在的失效模式和需要进一步工作的领域。

我们计划为这项工作做出贡献,并希望这一分析为以后的研究提供一些启发性的例子。

8RELATED WORK

Any model that leverages written, spoken, signed or any other form of human language as part of its training signal is arguably using natural language as a source of supervision. This is an admittedly extremely broad area and covers most work in the field of distributional semantics including topic models (Blei et al., 2003), word, sentence, and paragraph vectors (Mikolov et al., 2013; Kiros et al., 2015; Le & Mikolov, 2014), and language models (Bengio et al., 2003). It also includes much of the broader field of NLP that deals with predicting or modeling sequences of natural language in some way. Work in NLP intentionally leveraging natural language supervision in the form of explanations, feedback, instructions, and advice for tasks such as classification (as opposed to the commonly used representation of supervision as a set of arbitrarily encoded discrete category labels) has been explored in many creative and advanced ways. Dialog based learning (Weston, 2016; Li et al., 2016; Hancock et al., 2019) develops techniques to learn from interactive natural language feedback in dialog. Several papers have leveraged semantic parsing to convert natural language explanations into features (Srivastava et al., 2017) or additional training labels (Hancock et al., 2018). More recently, ExpBERT (Murty et al., 2020) uses feature representations produced by conditioning a deep contextual language model on natural language explanations and descriptions of relations to improve performance on the task of relation extraction.

CLIP is an example of using natural language as a training signal for learning about a domain other than language. In this context, the earliest use of the term natural language supervision that we are aware of is the work of Ramanathan et al. (2013) which showed that natural language descriptions could be used along side other sources of supervision to improve performance on the task of video event understanding. However, as mentioned in the introduction and approach section, methods of leveraging natural language descriptions in computer vision well predate the use of this specific term, especially for image retrieval (Mori et al., 1999) and object classification (Wang et al., 2009). Other early work leveraged tags (but not natural language) associated with images for the task of semantic segmentation (Barnard et al., 2003). More recently, He & Peng (2017) and Liang et al. (2020) demonstrated using natural language descriptions and explanations to improve fine-grained visual classification of birds. Others have investigated how grounded language can be used to improve visual representations and classifiers on the ShapeWorld dataset (Kuhnle & Copestake, 2017; Andreas et al., 2017; Mu et al., 2019). Finally, techniques which combine natural language with reinforcement learning environments (Narasimhan et al., 2015) have demonstrated exciting emergent behaviors such as systematically accomplishing zero-shot tasks (Hill et al., 2019).

CLIP’s pre-training task optimizes for text-image retrieval. This area of research dates back to the mid-90s with the previously mentioned Mori et al. (1999) as representative of early work. While initial efforts focused primarily on predictive objectives, over time research shifted towards learning joint multi-modal embedding spaces with techniques like kernel Canonical Correlation Analysis and various ranking objectives (Weston et al., 2010; Socher & Fei-Fei, 2010; Hodosh et al., 2013). Over time, work explored many combinations of training objective, transfer, and more expressive models and steadily improved performance (Frome et al., 2013; Socher et al., 2014; Karpathy et al., 2014; Kiros et al., 2014; Faghri et al., 2017).

Other work has leveraged natural language supervision for domains other than images. Stroud et al. (2020) explores large scale representation learning by training a system to pair descriptive text with videos instead of images. Several works have explored using dense spoken natural language supervision for videos (Miech et al., 2019, 2020b). When considered together with CLIP, these works suggest that large scale natural language supervision is a promising way to learn high quality perceptual systems for many domains. Alayrac et al. (2020) extended this line of work to an additional modality by adding raw audio as an additional supervision source and demonstrated benefits from combining all three sources of supervision.

As part of our work on CLIP we also construct a new dataset of image-text pairs. Modern work on image-text retrieval has relied on a set of crowd-sourced sentence level image caption evaluation datasets like Pascal1K (Rashtchian et al., 2010), Flickr8K (Hodosh et al., 2013), and Flickr30K (Young et al., 2014). However, these datasets are still relatively small and limit achievable performance. Several methods have been proposed to create larger datasets automatically with Ordonez et al. (2011) as a notable early example. In the deep learning era, Mithun et al. (2018) demonstrated an additional set of (image, text) pairs collected from the internet could improve retrieval performance and several new automatically constructed datasets such as Conceptual Captions (Sharma et al., 2018), LAIT (Qi et al., 2020), and OCR-CC (Yang et al., 2020) have been created. However, these datasets still use significantly more aggressive filtering or are designed for a specific task such as OCR and as a result are still much smaller than WIT with between 1 and 10 million training examples.

A related idea to CLIP is webly supervised learning. This line of work queries image search engines to build image datasets by querying for terms and uses the queries as the labels for the returned images (Fergus et al., 2005). Classifiers trained on these large but noisily labeled datasets can be competitive with those trained on smaller carefully labeled datasets. These image-query pairs are also often used to improve performance on standard datasets as additional training data (Chen & Gupta, 2015). CLIP also uses search queries as part of its dataset creation process. However CLIP only uses full text sequences co-occurring with images as supervision rather than just the queries, which are often only a single word or short n-gram. We also restrict this step in CLIP to text only querying for sub-string matches while most webly supervised work uses standard image search engines which have their own complex retrieval and filtering pipelines that often involve computer vision systems. Of this line of work, Learning Everything about Anything: Webly-Supervised Visual Concept Learning (Divvala et al., 2014) has a notably similar ambition and goal as CLIP.

Finally, CLIP is related to a recent burst of activity on learning joint models of vision and language (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2019; Li et al., 2020b; Yu et al., 2020). This line of work focuses on richly connecting vision and language in order to solve complex downstream tasks such as visual question answering, visual commonsense reasoning, or multimodal entailment. These approaches leverage impressively engineered models which combine 3 (or more) pre-trained subsystems, typically an image feature model, a region proposal / object detection model, and a pre-trained masked language model such as BERT. These systems are then jointly fine-tuned via various training objectives on image-text pairs and applied to the aforementioned tasks and achieve impressive results. CLIP is instead focused on learning visual models from scratch via natural language supervision and does not densely connect the two domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work.
任何将书面、口语、手语或任何其他形式的人类语言作为其训练信号一部分的模型，都可以说是在使用自然语言作为监督来源。诚然，这是一个极其宽泛的领域，涵盖了分布语义学领域的大部分工作，包括主题模型（Blei 等, 2003），词、句和段落向量（Mikolov 等, 2013; Kiros 等, 2015; Le & Mikolov, 2014），以及语言模型（Bengio 等, 2003）。它还包括 NLP 更广阔领域中以某种方式预测或建模自然语言序列的大部分工作。NLP 中有许多富有创造性和前沿性的工作，有意利用以解释、反馈、指令和建议等形式呈现的自然语言监督来完成分类等任务（这与通常把监督表示为一组任意编码的离散类别标签的做法相对）。基于对话的学习（Weston, 2016; Li 等, 2016; Hancock 等, 2019）发展了从对话中的交互式自然语言反馈中学习的技术。若干论文利用语义解析将自然语言解释转换为特征（Srivastava 等, 2017）或额外的训练标签（Hancock 等, 2018）。最近，ExpBERT（Murty 等, 2020）利用以自然语言解释和关系描述为条件的深度上下文语言模型所产生的特征表示，来提升关系抽取任务的性能。

CLIP 是把自然语言用作训练信号、来学习语言之外另一个领域的例子。在这一语境下，我们所知最早使用“自然语言监督”这一术语的是 Ramanathan 等人 (2013) 的工作，该工作表明自然语言描述可以与其他监督来源一起用于提升视频事件理解任务的性能。然而，正如引言和方法部分所述，在计算机视觉中利用自然语言描述的方法远早于这一术语的使用，尤其是在图像检索（Mori 等, 1999）和物体分类（Wang 等, 2009）方面。其他早期工作则利用与图像关联的标签（而非自然语言）来完成语义分割任务（Barnard 等, 2003）。最近，He & Peng (2017) 和 Liang 等人 (2020) 展示了利用自然语言描述和解释来改进鸟类的细粒度视觉分类。其他人则研究了如何利用基于情境的语言（grounded language）来改进 ShapeWorld 数据集上的视觉表示和分类器（Kuhnle & Copestake, 2017; Andreas 等, 2017; Mu 等, 2019）。最后，将自然语言与强化学习环境相结合的技术（Narasimhan 等, 2015）已展示出令人兴奋的涌现行为，例如系统性地完成零样本任务（Hill 等, 2019）。

CLIP 的预训练任务针对文本-图像检索进行优化。这一研究领域可以追溯到 90 年代中期，前面提到的 Mori 等人 (1999) 是早期工作的代表。虽然最初的努力主要集中在预测式目标上，但随着时间推移，研究转向利用核典型相关分析（kernel CCA）和各种排序目标等技术来学习联合多模态嵌入空间（Weston 等, 2010; Socher & Fei-Fei, 2010; Hodosh 等, 2013）。随着时间推移，这类工作探索了训练目标、迁移和更具表达力模型的多种组合，并稳步提升了性能（Frome 等, 2013; Socher 等, 2014; Karpathy 等, 2014; Kiros 等, 2014; Faghri 等, 2017）。

其他工作则将自然语言监督用于图像以外的领域。Stroud 等人 (2020) 通过训练一个将描述性文本与视频（而非图像）配对的系统，探索了大规模表示学习。一些工作探索了对视频使用密集的口语自然语言监督（Miech 等, 2019, 2020b）。与 CLIP 放在一起看，这些工作表明大规模自然语言监督是为许多领域学习高质量感知系统的一条有前途的途径。Alayrac 等人 (2020) 通过加入原始音频作为额外的监督来源，将这一工作方向扩展到又一种模态，并展示了结合全部三种监督来源所带来的好处。

作为 CLIP 工作的一部分，我们还构建了一个新的图像-文本对数据集。现代的图像-文本检索工作依赖于一组众包的句子级图像字幕评估数据集，例如 Pascal1K（Rashtchian 等, 2010）、Flickr8K（Hodosh 等, 2013）和 Flickr30K（Young 等, 2014）。然而，这些数据集仍然相对较小，限制了可以达到的性能。已有多种自动创建更大数据集的方法被提出，其中 Ordonez 等人 (2011) 是一个著名的早期例子。在深度学习时代，Mithun 等人 (2018) 证明了从互联网收集的额外（图像, 文本）对可以提升检索性能，并且出现了若干新的自动构建的数据集，例如 Conceptual Captions（Sharma 等, 2018）、LAIT（Qi 等, 2020）和 OCR-CC（Yang 等, 2020）。然而，这些数据集仍然使用了明显更激进的过滤，或是为 OCR 等特定任务设计，因此其规模仍然比 WIT 小得多，只有 100 万到 1000 万个训练示例。

与 CLIP 相关的一个想法是网络监督（webly supervised）学习。这一系列工作通过查询图像搜索引擎来构建图像数据集，用查询词作为返回图像的标签（Fergus 等, 2005）。在这些规模大但标签噪声大的数据集上训练的分类器，可以与在更小但精心标注的数据集上训练的分类器相竞争。这些图像-查询对也经常被用作额外的训练数据，以提高在标准数据集上的性能（Chen & Gupta, 2015）。CLIP 同样在其数据集创建过程中使用了搜索查询。但 CLIP 只使用与图像共同出现的完整文本序列作为监督，而不仅仅是查询词（查询词通常只是单个词或简短的 n-gram）。我们还将 CLIP 中的这一步限制为仅对文本进行子串匹配查询，而大多数网络监督工作使用标准图像搜索引擎，这些引擎拥有自己复杂的检索和过滤管线，通常还涉及计算机视觉系统。在这一系列工作中，Learning Everything about Anything: Webly-Supervised Visual Concept Learning（Divvala 等人, 2014）与 CLIP 有着非常相似的抱负和目标。

最后，CLIP 与近期大量涌现的视觉-语言联合模型学习工作相关（Lu 等, 2019; Tan & Bansal, 2019; Chen 等, 2019; Li 等, 2020b; Yu 等, 2020）。这一系列工作侧重于将视觉和语言丰富地联系起来，以解决视觉问答、视觉常识推理或多模态蕴涵等复杂的下游任务。这些方法利用了工程上令人印象深刻的模型，它们组合了 3 个（或更多）预训练子系统，通常包括图像特征模型、区域提议/目标检测模型，以及 BERT 等预训练的掩码语言模型。然后，这些系统通过在图像-文本对上的各种训练目标进行联合微调，并被应用于上述任务，取得了令人印象深刻的结果。相比之下，CLIP 专注于通过自然语言监督从头学习视觉模型，并没有用联合注意力模型将两个领域紧密连接起来。在 CLIP 模型中，图像域和文本域之间唯一的交互是在学习到的联合嵌入空间中的一次点积。我们很期待看到 CLIP 与这一系列工作相结合。

9CONCLUSION

We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision and discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models although there is still room for much improvement.
我们研究了是否有可能把 NLP 中任务无关的、网络规模预训练的成功迁移到另一个领域。我们发现采用这一范式会使计算机视觉领域出现类似的行为，并讨论了这一研究方向的社会意义。为了优化其训练目标，CLIP 模型在预训练期间学会了执行各种各样的任务。随后，可以通过自然语言提示来利用这种任务学习，实现向许多现有数据集的零样本迁移。在足够大的规模下，这种方法的性能可以与特定任务的有监督模型相竞争，尽管仍有很大的改进空间。

Acknowledgments

We’d like to thank the millions of people involved in creating the data CLIP is trained on. We’d also like to thank Susan Zhang for her work on image conditional language models while at OpenAI, Ishaan Gulrajani for catching an error in the pseudocode, and Irene Solaiman, Miles Brundage, and Gillian Hadfield for their thoughtful feedback on the broader impacts section of the paper. We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on software and hardware infrastructure this project used. Finally, we’d also like to thank the developers of the many software packages used throughout this project including, but not limited, to Numpy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011).

REFERENCES

APPENDIX A LINEAR-PROBE EVALUATION

We provide additional details for linear probe experiments presented in this paper, including the list of the datasets and models used for evaluation.

A.1DATASETS

We use the 12 datasets from the well-studied evaluation suite introduced by (Kornblith et al., 2019) and add 15 additional datasets in order to assess the performance of models on a wider variety of distributions and tasks. These datasets include MNIST, the Facial Expression Recognition 2013 dataset (Goodfellow et al., 2015), STL-10 (Coates et al., 2011), EuroSAT (Helber et al., 2019), the NWPU-RESISC45 dataset (Cheng et al., 2017), the German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2011), the KITTI dataset (Geiger et al., 2012), PatchCamelyon (Veeling et al., 2018), the UCF101 action recognition dataset (Soomro et al., 2012), Kinetics 700 (Carreira et al., 2019), 2,500 random samples of the CLEVR dataset (Johnson et al., 2017), the Hateful Memes dataset (Kiela et al., 2020), and the ImageNet-1k dataset (Deng et al., 2012). For the two video datasets (UCF101 and Kinetics700), we use the middle frame of each video clip as the input image. STL-10 and UCF101 have multiple pre-defined train/validation/test splits, 10 and 3 respectively, and we report the average over all splits. Details on each dataset and the corresponding evaluation metrics are provided in Table 9.

Additionally, we created two datasets that we call Country211 and Rendered SST2. The Country211 dataset is designed to assess the geolocation capability of visual representations. We filtered the YFCC100m dataset (Thomee et al., 2016) to find 211 countries (defined as having an ISO-3166 country code) that have at least 300 photos with GPS coordinates, and we built a balanced dataset with 211 categories, by sampling 200 photos for training and 100 photos for testing, for each country.
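A sketch of the construction just described, assuming a hypothetical `iter_yfcc_photos()` iterator that yields (photo id, ISO-3166 country code) pairs derived from the YFCC100m GPS metadata; the sampling details shown here are illustrative.

```python
# Sketch of the Country211 construction: keep countries (ISO-3166 codes) with at
# least 300 GPS-tagged photos, then sample 200 train / 100 test photos per country.
import random
from collections import defaultdict

def build_country211(iter_yfcc_photos, min_photos=300, n_train=200, n_test=100, seed=0):
    by_country = defaultdict(list)
    for photo_id, country_code in iter_yfcc_photos:
        if country_code:  # only photos whose GPS coordinates map to an ISO-3166 code
            by_country[country_code].append(photo_id)

    rng = random.Random(seed)
    splits = {"train": {}, "test": {}}
    for country, photos in by_country.items():
        if len(photos) < min_photos:
            continue
        rng.shuffle(photos)
        splits["train"][country] = photos[:n_train]
        splits["test"][country] = photos[n_train:n_train + n_test]
    return splits  # 211 countries survive this filter in the setup described above
```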

The Rendered SST2 dataset is designed to measure the optical character recognition capability of visual representations. To do so, we used the sentences from the Stanford Sentiment Treebank dataset (Socher et al., 2013) and rendered them into images, with black texts on a white background, in a 448×448 resolution. Two example images from this dataset are shown in Figure 19.
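A minimal sketch of rendering a sentence in the style described above, using PIL; the font, margins, and line-wrapping width are illustrative assumptions rather than the exact rendering parameters used for the dataset.

```python
# Render a sentence as black text on a white 448x448 image, in the spirit of
# Rendered SST2. Font and layout details are illustrative assumptions.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_sentence(sentence, size=448, chars_per_line=40):
    image = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()  # a real pipeline would pick a specific TTF font
    wrapped = textwrap.fill(sentence, width=chars_per_line)
    draw.multiline_text((16, 16), wrapped, fill="black", font=font)
    return image

render_sentence("this film is a delight from start to finish").save("rendered_example.png")
```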


Figure 19: Two example images from the Rendered SST2 dataset

| Dataset | Classes | Train size | Test size | Evaluation metric |
| - | - | - | - | - |
| Food-101 | 102 | 75,750 | 25,250 | accuracy |
| CIFAR-10 | 10 | 50,000 | 10,000 | accuracy |
| CIFAR-100 | 100 | 50,000 | 10,000 | accuracy |
| Birdsnap | 500 | 42,283 | 2,149 | accuracy |
| SUN397 | 397 | 19,850 | 19,850 | accuracy |
| Stanford Cars | 196 | 8,144 | 8,041 | accuracy |
| FGVC Aircraft | 100 | 6,667 | 3,333 | mean per class |
| Pascal VOC 2007 Classification | 20 | 5,011 | 4,952 | 11-point mAP |
| Describable Textures | 47 | 3,760 | 1,880 | accuracy |
| Oxford-IIIT Pets | 37 | 3,680 | 3,669 | mean per class |
| Caltech-101 | 102 | 3,060 | 6,085 | mean-per-class |
| Oxford Flowers 102 | 102 | 2,040 | 6,149 | mean per class |
| MNIST | 10 | 60,000 | 10,000 | accuracy |
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1000 | 8000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |
| KITTI | 4 | 6,770 | 711 | accuracy |
| Country211 | 211 | 43,200 | 21,100 | accuracy |
| PatchCamelyon | 2 | 294,912 | 32,768 | accuracy |
| UCF101 | 101 | 9,537 | 1,794 | accuracy |
| Kinetics700 | 700 | 494,801 | 31,669 | mean(top1, top5) |
| CLEVR Counts | 8 | 2,000 | 500 | accuracy |
| Hateful Memes | 2 | 8,500 | 500 | ROC AUC |
| Rendered SST2 | 2 | 7,792 | 1,821 | accuracy |
| ImageNet | 1000 | 1,281,167 | 50,000 | accuracy |

Table 9: Datasets examined for linear probes. We note that, for the Birdsnap and Kinetics700 datasets, we used the resources that are available online at the time of this writing.

A.2MODELS

In combination with the datasets listed above, we evaluate the following series of models using linear probes.

LM RN50

This is a multimodal model that uses an autoregressive loss instead of a contrastive loss, while using the ResNet-50 architecture as in the smallest contrastive model. To do so, the output from the CNN is projected into four tokens, which are then fed as a prefix to a language model autoregressively predicting the text tokens. Apart from the training objective, the model was trained on the same dataset for the same number of epochs as other CLIP models.
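A schematic of the design just described, with assumed module names and sizes: the CNN output is projected into four prefix token embeddings, and a causal Transformer predicts the text tokens conditioned on that prefix. This is a sketch of the general idea, not the exact architecture or hyperparameters used for LM RN50.

```python
# Schematic of an image-prefix autoregressive language model. Dimensions, the
# Transformer configuration, and the .output_dim attribute on the image encoder
# are illustrative assumptions.
import torch
import torch.nn as nn

class ImagePrefixLM(nn.Module):
    def __init__(self, image_encoder, vocab_size, d_model=512, n_prefix=4):
        super().__init__()
        self.image_encoder = image_encoder            # e.g. a ResNet-50 trunk exposing .output_dim
        self.to_prefix = nn.Linear(image_encoder.output_dim, n_prefix * d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, vocab_size)
        self.n_prefix = n_prefix

    def forward(self, images, text_tokens):
        b = images.shape[0]
        prefix = self.to_prefix(self.image_encoder(images)).view(b, self.n_prefix, -1)
        x = torch.cat([prefix, self.token_emb(text_tokens)], dim=1)
        L = x.shape[1]
        causal = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.lm(x, mask=causal)
        # Autoregressive loss on the text positions only; the prefix is conditioning input.
        logits = self.head(h[:, self.n_prefix - 1:-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.shape[-1]), text_tokens.reshape(-1))
```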

CLIP-RN

Five ResNet-based contrastive CLIP models are included. As discussed in the paper, the first two models follow ResNet-50 and ResNet-101, and we use EfficientNet-style (Tan & Le, 2019) scaling for the next three models which simultaneously scale the model width, the number of layers, and the input resolution to obtain models with roughly 4x, 16x, and 64x computation.

CLIP-ViT

We include four CLIP models that use the Vision Transformer (Dosovitskiy et al., 2020) architecture as the image encoder: three models trained on 224-by-224 pixel images (ViT-B/32, ViT-B/16, and ViT-L/14) and the ViT-L/14 model fine-tuned on 336-by-336 pixel input images.

EfficientNet

We use the nine models (B0-B8) from the original EfficientNet paper (Tan & Le, 2019), as well as the Noisy Student variants (B0-B7, L2-475, and L2-800) (Xie et al., 2020). The largest models (L2-475 and L2-800) take input resolutions of 475x475 and 800x800 pixels, respectively.

Instagram-pretrained ResNeXt

We use the four models (32x8d, 32x16d, 32x32d, 32x48d) released by (Mahajan et al., 2018), as well as their two FixRes variants which use higher input resolutions (Touvron et al., 2019).

Big Transfer (BiT)

We use BiT-S and BiT-M models (Kolesnikov et al., 2019), trained on the ImageNet-1k and ImageNet-21k datasets. The model weights for BiT-L are not publicly available.

Vision Transformer (ViT)

We also include four ViT (Dosovitskiy et al., 2020) checkpoints pretrained on the ImageNet-21k dataset, namely ViT-B/32, ViT-B/16, ViT-L/16, and ViT-H/14. We note that their best-performing models, trained on the JFT-300M dataset, are not available publicly.

SimCLRv2

The SimCLRv2 (Chen et al., 2020c) project released pre-trained and fine-tuned models in various settings. We use the seven pretrain-only checkpoints with selective kernels.

BYOL

We use the recently released model weights of BYOL (Grill et al., 2020), specifically their 50x1 and 200x2 checkpoints.

Momentum Contrast (MoCo)

We include the MoCo-v1 (He et al., 2020) and the MoCo-v2 (Chen et al., 2020d) checkpoints.

VirTex

We use the pretrained model of VirTex (Desai & Johnson, 2020). We note that VirTex has a similar model design to CLIP-AR but is trained on a 1000x smaller dataset of high-quality captions from MSCOCO.

ResNet

We add the original ResNet checkpoints released by (He et al., 2016b), namely ResNet-50, ResNet-101, and ResNet-152.

A.3 EVALUATION

We use image features taken from the penultimate layer of each model, ignoring any classification layer provided. For CLIP-ViT models, we used the features before the linear projection to the embedding space, which corresponds to I_f in Figure 3. We train a logistic regression classifier using scikit-learn’s L-BFGS implementation, with a maximum of 1,000 iterations, and report the corresponding metric for each dataset. We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10^-6 and 10^6, with 96 logarithmically spaced steps. To save compute required for the sweeps, we perform a parametric binary search that starts with λ = [10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6] and iteratively halves the interval around the peak until it reaches a resolution of 8 steps per decade. The hyperparameter sweeps are performed on a validation split of each dataset. For the datasets that contain a validation split in addition to a test split, we use the provided validation set to perform the hyperparameter search, and for the datasets that do not provide a validation split or have not published labels for the test data, we split the training dataset to perform the hyperparameter search. For the final result, we combine the validation split back with the training split and report the performance on the unused split.
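The following sketch illustrates the core of this evaluation, an L-BFGS logistic regression with a sweep over the L2 strength λ on a held-out validation split; for brevity it uses a fixed logarithmic grid rather than the full parametric binary search, and dataset loading is assumed to happen elsewhere.

```python
# A minimal sketch of the linear-probe sweep: fit an L-BFGS logistic regression
# for each regularization strength and keep the one that does best on validation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sweep_linear_probe(train_x, train_y, val_x, val_y):
    best_acc, best_C = -1.0, None
    # scikit-learn parameterizes L2 regularization as C = 1 / lambda.
    for lam in np.logspace(-6, 6, num=13):  # coarse grid; the paper refines to 96 log-spaced steps
        clf = LogisticRegression(C=1.0 / lam, max_iter=1000, solver="lbfgs")
        clf.fit(train_x, train_y)
        acc = clf.score(val_x, val_y)
        if acc > best_acc:
            best_acc, best_C = acc, 1.0 / lam
    return best_C, best_acc
```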

A.4 RESULTS

The individual linear probe scores are provided in Table 10 and plotted in Figure 20. The best-performing CLIP model, using the ViT-L/14 architecture and 336-by-336 pixel images, achieved the state of the art on 21 of the 27 datasets, i.e. its score is within the Clopper-Pearson 99.5% confidence interval of each dataset’s top score. For many datasets, CLIP performs significantly better than other models, demonstrating the advantage of natural language supervision over traditional pre-training approaches based on image classification. See Section 3.2 for further discussion of the linear probe results.

\aboverulesep= 0.2em \belowrulesep= 0.2em Food101 CIFAR10 CIFAR100 Birdsnap SUN397 Cars Aircraft VOC2007 DTD Pets Caltech101 Flowers MNIST FER2013 STL10⋆ EuroSAT RESISC45 GTSRB KITTI Country211 PCAM UCF101 Kinetics700 CLEVR HatefulMemes SST ImageNet LM RN50 81.3 82.8 61.7 44.2 69.6 74.9 44.9 85.5 71.5 82.8 85.5 91.1 96.6 60.1 95.3 93.4 84.0 73.8 70.2 19.0 82.9 76.4 51.9 51.2 65.2 76.8 65.2 CLIP-RN 50 86.4 88.7 70.3 56.4 73.3 78.3 49.1 87.1 76.4 88.2 89.6 96.1 98.3 64.2 96.6 95.2 87.5 82.4 70.2 25.3 82.7 81.6 57.2 53.6 65.7 72.6 73.3 101 88.9 91.1 73.5 58.6 75.1 84.0 50.7 88.0 76.3 91.0 92.0 96.4 98.4 65.2 97.8 95.9 89.3 82.4 73.6 26.6 82.8 84.0 60.3 50.3 68.2 73.3 75.7 50x4 91.3 90.5 73.0 65.7 77.0 85.9 57.3 88.4 79.5 91.9 92.5 97.8 98.5 68.1 97.8 96.4 89.7 85.5 59.4 30.3 83.0 85.7 62.6 52.5 68.0 76.6 78.2 50x16 93.3 92.2 74.9 72.8 79.2 88.7 62.7 89.0 79.1 93.5 93.7 98.3 98.9 68.7 98.6 97.0 91.4 89.0 69.2 34.8 83.5 88.0 66.3 53.8 71.1 80.0 81.5 50x64 94.8 94.1 78.6 77.2 81.1 90.5 67.7 88.9 82.0 94.5 95.4 98.9 98.9 71.3 99.1 97.1 92.8 90.2 69.2 40.7 83.7 89.5 69.1 55.0 75.0 81.2 83.6 CLIP-ViT B/32 88.8 95.1 80.5 58.5 76.6 81.8 52.0 87.7 76.5 90.0 93.0 96.9 99.0 69.2 98.3 97.0 90.5 85.3 66.2 27.8 83.9 85.5 61.7 52.1 66.7 70.8 76.1 B/16 92.8 96.2 83.1 67.8 78.4 86.7 59.5 89.2 79.2 93.1 94.7 98.1 99.0 69.5 99.0 97.1 92.7 86.6 67.8 33.3 83.5 88.4 66.1 57.1 70.3 75.5 80.2 L/14 95.2 98.0 87.5 77.0 81.8 90.9 69.4 89.6 82.1 95.1 96.5 99.2 99.2 72.2 99.7 98.2 94.1 92.5 64.7 42.9 85.8 91.5 72.0 57.8 76.2 80.8 83.9 L/14-336px 95.9 97.9 87.4 79.9 82.2 91.5 71.6 89.9 83.0 95.1 96.0 99.2 99.2 72.9 99.7 98.1 94.9 92.4 69.2 46.4 85.6 92.0 73.0 60.3 77.3 80.5 85.4 EfficientNet B0 74.3 92.5 76.5 59.7 62.0 62.5 55.7 84.4 71.2 93.0 93.3 91.7 98.2 57.2 97.1 97.3 85.5 80.0 73.8 12.4 83.1 74.4 47.6 47.9 55.7 53.4 76.9 B1 74.2 93.2 77.2 61.3 62.6 62.5 56.1 84.7 74.2 93.4 93.6 92.4 98.3 57.0 97.5 96.8 84.5 75.9 75.5 12.5 82.7 74.7 48.5 44.3 54.5 54.4 78.6 B2 75.8 93.6 77.9 64.4 64.0 63.2 57.0 85.3 73.5 93.9 93.5 92.9 98.5 56.6 97.7 96.9 84.4 76.4 73.1 12.6 84.3 75.1 49.4 42.6 55.4 55.2 79.7 B3 77.4 94.0 78.0 66.5 64.4 66.0 59.3 85.8 73.1 94.1 93.7 93.3 98.5 57.1 98.2 97.3 85.0 75.8 76.1 13.4 83.3 78.1 50.9 45.1 53.8 54.8 81.0 B4 79.7 94.1 78.7 70.1 65.4 66.4 60.4 86.5 73.4 94.7 93.5 93.2 98.8 57.9 98.6 96.8 85.0 78.3 72.3 13.9 83.1 79.1 52.5 46.5 54.4 55.4 82.9 B5 81.5 93.6 77.9 72.4 67.1 72.7 68.9 86.7 73.9 95.0 94.7 94.5 98.4 58.5 98.7 96.8 86.0 78.5 69.6 14.9 84.7 80.9 54.5 46.6 53.3 56.3 83.7 B6 82.4 94.0 78.0 73.5 65.8 71.1 68.2 87.6 73.9 95.0 94.1 93.7 98.4 60.2 98.7 96.8 85.4 78.1 72.7 15.3 84.2 80.0 54.1 51.1 53.3 57.0 84.0 B7 84.5 94.9 80.1 74.7 69.0 77.1 72.3 87.2 76.8 95.2 94.7 95.9 98.6 61.3 99.1 96.3 86.8 80.8 75.8 16.4 85.2 81.9 56.8 51.9 54.4 57.8 84.8 B8 84.5 95.0 80.7 75.2 69.6 76.8 71.5 87.4 77.1 94.9 95.2 96.3 98.6 61.4 99.2 97.0 87.4 80.4 70.9 17.4 85.2 82.4 57.7 51.4 51.7 55.8 85.3 EfficientNet Noisy Student B0 78.1 94.0 78.6 63.5 65.5 57.2 53.7 85.6 75.6 93.8 93.1 94.5 98.1 55.6 98.2 97.0 84.3 74.0 71.6 14.0 83.1 76.7 51.7 47.3 55.7 55.0 78.5 B1 80.4 95.1 80.2 66.6 67.6 59.6 53.7 86.2 77.0 94.6 94.4 95.1 98.0 56.1 98.6 96.9 84.3 73.1 67.1 14.5 83.9 79.9 54.5 46.1 54.3 54.9 81.1 B2 80.9 95.3 81.3 67.6 67.9 60.9 55.2 86.3 77.7 95.0 94.7 94.4 98.0 55.5 98.8 97.3 84.6 71.7 70.0 14.6 82.9 80.1 55.1 46.1 54.1 55.3 82.2 B3 82.6 95.9 82.1 68.6 68.8 60.6 55.4 86.5 77.2 95.0 94.8 95.2 98.1 56.0 99.1 96.5 85.0 70.5 69.5 15.1 83.1 81.8 56.8 45.1 55.7 52.0 83.8 B4 85.2 95.6 81.0 72.5 69.7 56.1 52.6 87.0 
78.7 94.8 95.2 95.3 98.2 56.0 99.3 95.3 84.8 61.9 64.8 16.0 82.8 83.4 59.8 43.2 55.3 53.0 85.4 B5 87.6 96.3 82.4 75.3 71.6 64.7 64.8 87.8 79.6 95.5 95.6 96.6 98.8 60.9 99.4 96.1 87.0 68.5 73.7 16.4 83.5 86.4 61.6 46.3 53.4 55.8 85.8 B6 87.3 97.0 83.9 75.8 71.4 67.6 65.6 87.3 78.5 95.2 96.4 97.2 98.6 61.9 99.5 96.6 86.1 70.7 72.4 17.6 84.2 85.5 61.0 49.6 54.6 55.7 86.4 B7 88.4 96.0 82.0 76.9 72.6 72.2 71.2 88.1 80.5 95.5 95.5 96.6 98.5 62.7 99.4 96.2 88.5 73.4 73.0 18.5 83.8 86.6 63.2 50.5 57.2 56.7 87.0 L2-475 91.6 99.0 91.0 74.8 76.4 75.1 66.8 89.5 81.9 95.6 96.5 97.7 98.9 67.5 99.6 97.0 89.5 73.4 68.9 22.2 86.3 89.4 68.2 58.3 58.6 55.2 88.3 L2-800 92.0 98.7 89.0 78.5 75.7 75.5 68.4 89.4 82.5 95.6 94.7 97.9 98.5 68.4 99.7 97.2 89.9 77.7 66.9 23.7 86.8 88.9 66.7 62.7 58.4 56.9 88.4 Instagram 32x8d 84.8 95.9 80.9 63.8 69.0 74.2 56.0 88.0 75.4 95.4 93.9 91.7 97.4 60.7 99.1 95.7 82.1 72.3 69.2 16.7 82.3 80.1 56.8 42.2 53.3 55.2 83.3 32x16d 85.7 96.5 80.9 64.8 70.5 77.5 56.7 87.9 76.2 95.6 94.9 92.5 97.4 61.6 99.3 95.5 82.8 73.8 66.1 17.5 83.4 81.1 58.2 41.3 54.2 56.1 84.4 32x32d 86.7 96.8 82.7 67.1 71.5 77.5 55.4 88.3 78.5 95.8 95.3 94.4 97.9 62.4 99.3 95.7 85.4 71.2 66.8 18.0 83.7 82.1 58.8 39.7 55.3 56.7 85.0 32x48d 86.9 96.8 83.4 65.9 72.2 76.6 53.2 88.0 77.2 95.5 95.8 93.6 98.1 63.7 99.4 95.3 85.4 73.0 67.2 18.5 82.7 82.8 59.2 41.3 55.5 56.7 85.2 FixRes-v1 88.5 95.7 81.1 67.4 72.9 80.5 57.6 88.0 77.9 95.8 96.1 94.5 97.9 62.2 99.4 96.2 86.6 76.5 64.8 19.3 82.5 83.4 59.8 43.5 56.6 59.0 86.0 FixRes-v2 88.5 95.7 81.1 67.3 72.9 80.7 57.5 88.0 77.9 95.0 96.0 94.5 98.0 62.1 99.4 96.5 86.6 76.3 64.8 19.5 82.3 83.5 59.8 44.2 56.6 59.0 86.0 BiT-S R50x1 72.5 91.7 74.8 57.7 61.1 53.5 52.5 83.7 72.4 92.3 91.2 92.0 98.4 56.1 96.4 97.4 85.0 70.0 66.0 12.5 83.0 72.3 47.5 48.3 54.1 55.3 75.2 R50x3 75.1 93.7 79.0 61.1 63.7 55.2 54.1 84.8 74.6 92.5 91.6 92.8 98.8 58.7 97.0 97.8 86.4 73.1 73.8 14.0 84.2 76.4 50.0 49.2 54.7 54.2 77.2 R101x1 73.5 92.8 77.4 58.4 61.3 54.0 52.4 84.4 73.5 92.5 91.8 90.6 98.3 56.5 96.8 97.3 84.6 69.4 68.9 12.6 82.0 73.5 48.6 45.4 52.6 55.5 76.0 R101x3 74.7 93.9 79.8 57.8 62.9 54.7 53.3 84.7 75.5 92.3 91.2 92.6 98.8 59.7 97.3 98.0 85.5 71.8 60.2 14.1 83.1 75.9 50.4 49.7 54.1 54.6 77.4 R152x2 74.9 94.3 79.7 58.7 62.7 55.9 53.6 85.3 74.9 93.0 92.0 91.7 98.6 58.3 97.1 97.8 86.2 71.8 71.6 13.9 84.1 76.2 49.9 48.2 53.8 55.9 77.1 R152x4 74.7 94.2 79.2 57.8 62.9 51.2 50.8 85.4 75.4 93.1 91.2 91.4 98.9 61.4 97.2 98.0 85.5 72.8 67.9 14.9 83.1 76.0 50.3 42.9 53.6 56.0 78.5 BiT-M R50x1 83.3 94.9 82.2 70.9 69.9 59.0 55.6 86.8 77.3 91.5 93.9 99.4 98.0 60.6 98.4 97.5 87.4 68.6 68.2 16.6 82.5 79.4 53.2 49.4 54.5 53.4 76.7 R50x3 86.9 96.7 86.2 75.7 74.6 60.6 54.2 87.7 78.5 93.2 95.3 99.4 98.6 64.6 99.3 98.0 88.1 69.9 59.6 19.6 83.4 83.5 57.8 51.3 55.8 55.6 80.7 R101x1 85.5 95.7 84.4 73.0 72.5 59.8 55.0 87.3 78.1 92.2 95.0 99.5 98.1 62.5 99.0 97.6 87.8 68.7 67.7 18.0 84.0 82.3 55.9 53.4 54.8 53.1 79.4 R101x3 87.2 97.4 87.5 72.4 75.0 57.4 47.4 87.5 79.6 93.2 95.4 99.6 98.6 64.3 99.4 98.2 87.7 68.8 64.1 20.7 80.4 84.0 58.7 52.6 54.9 54.3 81.2 R152x2 88.0 97.5 87.8 75.8 75.9 61.5 55.3 88.1 79.8 93.6 95.9 99.5 98.5 64.3 99.5 97.9 89.0 70.0 70.3 20.7 82.6 85.5 59.6 50.8 54.9 55.1 81.9 R152x4 87.2 97.6 88.2 72.4 75.0 49.1 43.4 87.1 79.9 92.4 95.4 99.3 98.5 65.7 99.5 97.8 87.7 68.2 57.1 20.6 80.4 84.6 59.0 49.7 57.2 55.1 81.5 ViT B/32 81.8 96.7 86.3 65.2 70.7 49.1 42.7 85.3 73.1 90.4 94.5 98.7 97.8 59.0 99.0 96.3 83.0 68.1 65.1 15.7 82.6 79.1 51.7 38.9 57.1 54.6 76.6 B/16 86.7 96.9 86.4 74.0 74.2 54.7 
46.0 86.7 74.3 92.7 94.1 99.2 97.4 61.3 99.5 96.4 84.5 63.1 61.5 17.5 85.4 82.7 56.6 40.0 57.0 56.1 80.9 L/16 87.4 97.9 89.0 76.5 74.9 62.5 52.2 86.1 75.0 92.9 94.7 99.3 98.0 64.0 99.6 96.5 85.7 70.4 58.8 17.7 85.7 84.1 58.0 38.4 58.4 52.8 81.9 H/14 83.4 95.8 84.5 70.2 69.2 62.3 54.8 84.7 75.4 91.7 93.7 98.9 98.5 62.4 98.4 97.3 87.0 73.9 63.4 15.4 87.0 79.4 52.1 41.1 55.9 54.1 75.9 SimCLRv2 R50x1 76.4 93.2 77.9 48.6 64.1 56.3 51.7 84.4 77.0 88.3 91.8 92.9 97.6 59.7 96.7 97.5 85.8 71.1 69.1 15.8 84.8 78.4 51.0 56.2 53.9 53.8 73.8 R50x3 81.0 95.6 82.4 56.5 67.0 65.6 61.1 85.9 78.8 90.9 94.1 95.4 98.7 62.6 98.2 97.9 88.2 78.2 74.7 17.6 85.4 82.6 54.6 55.4 54.2 55.2 77.3 R101x1 77.9 94.8 79.9 51.9 65.2 57.1 52.0 85.4 77.2 90.0 91.6 92.7 97.2 59.4 97.6 96.8 84.6 65.7 70.6 16.1 84.3 78.8 52.4 53.6 55.1 55.7 76.1 R101x3 82.2 96.4 83.4 57.5 68.2 64.6 60.0 86.2 78.9 91.8 95.0 95.4 98.4 63.0 98.5 97.9 88.0 77.5 69.1 18.3 85.5 82.9 55.9 52.2 54.5 56.3 78.8 R152x1 78.6 95.0 79.9 50.3 65.6 55.6 52.2 85.8 77.3 90.1 92.5 91.8 97.6 59.8 98.1 96.6 84.3 64.8 70.3 16.6 83.9 79.4 53.1 57.2 55.8 54.8 76.9 R152x2 82.3 96.7 83.9 58.1 68.5 64.9 58.7 86.6 79.1 92.2 94.1 96.0 98.2 64.1 98.5 98.0 88.1 77.0 69.8 18.4 85.3 82.7 56.2 53.6 56.0 56.5 79.2 R152x3 83.6 96.8 84.5 60.3 69.1 68.5 63.1 86.7 80.5 92.6 94.9 96.3 98.7 65.4 98.8 98.1 89.5 78.4 68.5 19.4 85.2 83.5 57.0 54.4 54.6 54.2 80.0 BYOL 50x1 74.0 93.6 79.1 47.6 63.7 61.6 62.3 82.6 77.0 88.3 93.7 94.3 98.7 58.8 96.4 97.6 88.2 80.1 71.4 14.1 84.8 77.3 49.3 56.1 53.8 54.4 73.3 200x2 78.5 96.2 83.3 53.4 68.5 61.7 55.4 86.6 77.4 91.9 95.5 93.9 98.7 62.6 98.6 97.7 87.4 77.1 76.4 16.4 84.0 82.6 55.1 54.1 52.5 52.4 79.2 MoCo v1 65.9 85.0 63.1 27.5 52.6 35.9 43.5 75.7 70.0 70.4 78.1 85.4 97.6 54.3 85.6 97.1 82.9 62.6 60.2 12.6 85.7 64.2 40.7 54.7 55.6 53.5 57.2 v2 72.2 93.4 76.3 39.6 60.2 48.3 51.1 82.6 75.1 84.4 89.9 90.7 98.4 58.3 95.7 97.2 85.4 75.7 75.4 13.2 85.6 72.7 47.8 56.9 53.9 53.8 69.1 VirTex 57.9 83.9 57.5 17.0 49.8 22.4 34.5 83.8 58.2 53.6 70.6 74.7 98.1 56.5 86.7 94.8 74.1 69.5 71.3 8.7 83.1 61.5 39.9 45.5 53.5 55.8 50.7 ResNet 50 71.3 91.8 74.5 52.7 60.5 49.9 48.5 83.8 72.3 92.4 90.8 90.8 98.3 54.9 96.4 96.7 83.6 70.6 67.1 11.7 82.5 71.2 46.8 43.0 56.5 55.5 74.3 101 72.7 93.0 77.2 53.7 60.8 50.1 47.0 84.4 71.6 92.3 91.9 90.4 98.5 56.6 97.0 97.1 83.4 72.5 63.6 11.9 83.3 72.7 48.3 43.2 53.0 54.7 75.8 152 73.7 93.5 78.0 55.1 61.6 52.8 48.4 84.5 71.9 93.0 92.1 89.6 98.2 57.0 97.6 97.0 83.1 70.1 70.2 12.3 82.9 75.3 49.2 42.4 53.2 53.9 77.1

Table 10: Linear probe performance of various pre-trained models over 27 datasets. Scores within the 99.5% Clopper-Pearson confidence interval of each dataset’s top score are shown in bold.
⋆We updated the STL10 scores from the previous version of this paper after fixing a CUDA-related bug.

Figure 20: Linear probe performance plotted for each of the 27 datasets, using the data from Table 10.

Figure 21: Visualization of predictions from 36 CLIP zero-shot classifiers. All examples are random with the exception of reselecting Hateful Memes to avoid offensive content. The predicted probability of the top 5 classes is shown along with the text used to represent the class. When more than one template is used, the first template is shown. The ground truth label is colored green while an incorrect prediction is colored orange.

| | Model | Food101 | CIFAR10 | CIFAR100 | Birdsnap | SUN397 | Stanford Cars | FGVC Aircraft | VOC2007 | DTD | Oxford Pets | Caltech101 | Flowers102 | MNIST | FER2013 | STL10 | EuroSAT | RESISC45 | GTSRB | KITTI | Country211 | PCam | UCF101 | Kinetics700 | CLEVR | HatefulMemes | Rendered SST2 | ImageNet |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| CLIP-ResNet | RN50 | 81.1 | 75.6 | 41.6 | 32.6 | 59.6 | 55.8 | 19.3 | 82.1 | 41.7 | 85.4 | 82.1 | 65.9 | 66.6 | 42.2 | 94.3 | 41.1 | 54.2 | 35.2 | 42.2 | 16.1 | 57.6 | 63.6 | 43.5 | 20.3 | 59.7 | 56.9 | 59.6 |
| | RN101 | 83.9 | 81.0 | 49.0 | 37.2 | 59.9 | 62.3 | 19.5 | 82.4 | 43.9 | 86.2 | 85.1 | 65.7 | 59.3 | 45.6 | 96.7 | 33.1 | 58.5 | 38.3 | 33.3 | 16.9 | 55.2 | 62.2 | 46.7 | 28.1 | 61.1 | 64.2 | 62.2 |
| | RN50x4 | 86.8 | 79.2 | 48.9 | 41.6 | 62.7 | 67.9 | 24.6 | 83.0 | 49.3 | 88.1 | 86.0 | 68.0 | 75.2 | 51.1 | 96.4 | 35.0 | 59.2 | 35.7 | 26.0 | 20.2 | 57.5 | 65.5 | 49.0 | 17.0 | 58.3 | 66.6 | 65.8 |
| | RN50x16 | 90.5 | 82.2 | 54.2 | 45.9 | 65.0 | 72.3 | 30.3 | 82.9 | 52.8 | 89.7 | 87.6 | 71.9 | 80.0 | 56.0 | 97.8 | 40.3 | 64.4 | 39.6 | 33.9 | 24.0 | 62.5 | 68.7 | 53.4 | 17.6 | 58.9 | 67.6 | 70.5 |
| | RN50x64 | 91.8 | 86.8 | 61.3 | 48.9 | 66.9 | 76.0 | 35.6 | 83.8 | 53.4 | 93.4 | 90.6 | 77.3 | 90.8 | 61.0 | 98.3 | 59.4 | 69.7 | 47.9 | 33.2 | 29.6 | 65.0 | 74.1 | 56.8 | 27.5 | 62.1 | 70.7 | 73.6 |
| CLIP-ViT | B/32 | 84.4 | 91.3 | 65.1 | 37.8 | 63.2 | 59.4 | 21.2 | 83.1 | 44.5 | 87.0 | 87.9 | 66.7 | 51.9 | 47.3 | 97.2 | 49.4 | 60.3 | 32.2 | 39.4 | 17.8 | 58.4 | 64.5 | 47.8 | 24.8 | 57.6 | 59.6 | 63.2 |
| | B/16 | 89.2 | 91.6 | 68.7 | 39.1 | 65.2 | 65.6 | 27.1 | 83.9 | 46.0 | 88.9 | 89.3 | 70.4 | 56.0 | 52.7 | 98.2 | 54.1 | 65.5 | 43.3 | 44.0 | 23.3 | 48.1 | 69.8 | 52.4 | 23.4 | 61.7 | 59.8 | 68.6 |
| | L/14 | 92.9 | 96.2 | 77.9 | 48.3 | 67.7 | 77.3 | 36.1 | 84.1 | 55.3 | 93.5 | 92.6 | 78.7 | 87.2 | 57.5 | 99.3 | 59.9 | 71.6 | 50.3 | 23.1 | 32.7 | 58.8 | 76.2 | 60.3 | 24.3 | 63.3 | 64.0 | 75.3 |
| | L/14-336px | 93.8 | 95.7 | 77.5 | 49.5 | 68.4 | 78.8 | 37.2 | 84.3 | 55.7 | 93.5 | 92.8 | 78.3 | 88.3 | 57.7 | 99.4 | 59.6 | 71.7 | 52.3 | 21.9 | 34.9 | 63.0 | 76.9 | 61.3 | 24.8 | 63.3 | 67.9 | 76.2 |

Table 11: Zero-shot performance of CLIP models over 27 datasets.

Figure: CLIP’s zero-shot performance compared to linear-probe ResNet performance

APPENDIX B ZERO-SHOT PREDICTION

To provide a qualitative overview of CLIP’s zero-shot performance, we visualize a randomly selected prediction for 36 different zero-shot CLIP classifiers in Figure 21. In addition, Table 11 and Figure 11 show the individual zero-shot performance scores for each dataset.

APPENDIX C DUPLICATE DETECTOR

Our early attempts at duplicate detection and analysis used nearest neighbors in the model’s learned embedding space. While it is intuitive to use a model’s own notion of similarity, we encountered issues. We found the model’s feature space is weighted very heavily towards semantic similarity. Many false positives occurred due to distinct objects that would be described similarly (soccer balls, flowers of the same species, etc.) having almost perfect similarity. We also observed the model was quite poor at assigning certain kinds of near-duplicates high similarity scores. We noticed repeatedly that images with high-frequency textures (such as fur or stripe patterns) pre-processed by different resizing algorithms (nearest neighbor vs. bi-linear) could have surprisingly low similarity. This resulted in many false negatives.

We built our own near-duplicate detector to fix this issue. We created a synthetic data augmentation pipeline that combined a variety of common image manipulations. The augmentation pipeline combines random cropping and zooming, aspect ratio distortion, downsizing and upscaling to different resolutions, minor rotations, jpeg compression, and HSV color jitter. The pipeline also randomly selects from different interpolation algorithms for all relevant steps. We then trained a model to maximize the similarity of an image and its transformed variant while minimizing similarity to all other images in a training batch. We used the same n-pair / InfoNCE loss as CLIP but with a fixed temperature of 0.07.
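The sketch below gives one way such an augmentation pipeline could look using torchvision and PIL; the crop scales, rotation range, jitter strengths, and JPEG quality range are illustrative assumptions rather than the exact values used.

```python
# An illustrative augmentation pipeline combining cropping/zooming, aspect ratio
# distortion, rescaling with random interpolation, minor rotations, color jitter,
# and JPEG re-compression. Parameters are assumptions, not the paper's settings.
import io
import random
from PIL import Image
from torchvision import transforms
from torchvision.transforms import InterpolationMode, functional as TF

INTERPS = [InterpolationMode.NEAREST, InterpolationMode.BILINEAR, InterpolationMode.BICUBIC]

def random_rescale(img, sizes=(128, 224, 320)):
    # Downsize / upscale with a randomly chosen interpolation algorithm.
    return TF.resize(img, [random.choice(sizes)] * 2, interpolation=random.choice(INTERPS))

def random_jpeg(img, quality=(30, 95)):
    # Re-encode as JPEG at a random quality to simulate compression artifacts.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(*quality))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0), ratio=(0.75, 1.33)),       # crop/zoom + aspect distortion
    transforms.RandomRotation(degrees=5),                                           # minor rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05), # HSV-style jitter
    transforms.Lambda(random_rescale),
    transforms.Lambda(random_jpeg),
])
```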

We selected a ResNet-50 as the model architecture. We modified the base ResNet-50 with the anti-alias improvements from (Zhang, 2019) and used weight norm (Salimans & Kingma, 2016) instead of batch norm (Ioffe & Szegedy, 2015) to avoid leaking information about duplicates via batch statistics - a problem previously noted in (Henaff, 2020). We also found the GELU activation function (Hendrycks & Gimpel, 2016) to perform better for this task. We trained the model with a total batch size of 1,712 for approximately 30 million images sampled from our pre-training dataset. At the end of training it achieves nearly 100% accuracy on its proxy training task.
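A minimal sketch of the contrastive objective with a fixed temperature of 0.07 is shown below; the symmetric form of the loss and the encoder interface are assumptions for illustration.

```python
# Each image should be most similar to its own augmented variant within the batch,
# using an n-pair / InfoNCE loss with a fixed temperature. `encoder` stands in for
# the modified ResNet-50 described above.
import torch
import torch.nn.functional as F

def nce_duplicate_loss(encoder, images, augmented, temperature=0.07):
    z1 = F.normalize(encoder(images), dim=-1)      # (B, D) embeddings of originals
    z2 = F.normalize(encoder(augmented), dim=-1)   # (B, D) embeddings of transformed variants
    logits = z1 @ z2.t() / temperature             # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric cross-entropy: match originals to variants and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2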

APPENDIX D DATASET ABLATION ON YFCC100M

| Dataset | YFCC | WIT |
| - | - | - |
| Birdsnap | 47.4 | 35.3 |
| Country211 | 23.1 | 17.3 |
| Flowers102 | 94.4 | 89.8 |
| GTSRB | 66.8 | 72.5 |
| UCF101 | 69.2 | 74.9 |
| Stanford Cars | 31.4 | 50.3 |
| ImageNet | 62.0 | 60.8 |
| Dataset Average | 65.5 | 66.6 |
| Dataset “Wins” | 10 | 15 |

Table 12: CLIP performs similarly when trained on only YFCC100M. Comparing a ResNet-50 trained on only YFCC100M with a same-sized subset of WIT shows similar average performance and number of wins on zero shot and linear classifier evals. However, large differences in dataset specific performance occur. We include performance on the 3 datasets where YFCC does best and worst compared to WIT according to a linear probe in order to highlight this as well as aggregate performance across all linear and zero-shot evals and the canonical ImageNet dataset.

To study whether our custom dataset is critical to the performance of CLIP, we trained a model on a filtered subset of the YFCC100M dataset (details described in Section 2.2) and compared its performance to the same model trained on an equally sized subset of WIT. We train each model for 32 epochs at which point transfer performance begins to plateau due to overfitting. Results are shown in Table 12. Across our whole eval suite, YFCC and WIT perform similarly on average for both zero-shot and linear probe settings. However, performance on specific fine-grained classification datasets can vary widely - sometimes by over 10%. Our speculation is that these differences in performance reflect the relative density of relevant data in each pre-training dataset. For instance, pre-training on YFCC100M, which might contain many photos of birds and flowers (common subjects for photographers), results in better performance on Birdsnap and Flowers102, while pre-training on WIT results in better car and pet classifiers (which appear common in our dataset).

Overall, these results are encouraging as they suggest our approach can use any reasonably filtered collection of paired (text, image) data. This mirrors recent work which reported positive results using the same contrastive pre-training objective on the relatively different domain of medical imaging (Zhang et al., 2020). It is also similar to the findings of noisy student self-training which reported only slight improvements when using their JFT-300M dataset over YFCC100M (Xie et al., 2020). We suspect the major advantage of our dataset over the already existing YFCC100M is its much larger size.

Finally, we caution that WIT includes this filtered subset of YFCC100M. This could result in our ablation underestimating the size of performance differences between YFCC100M and the rest of WIT. We do not think this is likely as YFCC100M is only 3.7% of the overall WIT data blend and it did not noticeably change the performance of models when it was added to the existing data blend during the creation of WIT.

Text Retrieval Image Retrieval
Flickr30k MSCOCO
R@1 R@5
Finetune Unicoder-VLa 86.2 96.3
Uniterb 87.3 98.0 99.2
VILLAc 87.9 97.5 98.8
Oscard - - -
ERNIE-ViLe 88.7 98.0 99.2
Zero-Shot Visual N-Gramsf 15.4 35.7
ImageBERTg - - -
Unicoder-VLa 64.3 86.8 92.3
Uniterb 83.6 95.7 97.7
CLIP 88.0 98.7 99.4

Table 13: CLIP improves zero-shot retrieval and is competitive with the best fine-tuned result on Flickr30k text retrieval. Bold indicates best overall performance while an underline indicates best in category performance (zero-shot or fine-tuned). For all other models, best results from the paper are reported regardless of model size / variant. MSCOCO performance is reported on the 5k test set. a(Li et al., 2020a) b(Chen et al., 2019) c(Gan et al., 2020) d(Li et al., 2020b) e(Yu et al., 2020) f(Li et al., 2017) g(Qi et al., 2020)

APPENDIX E SELECTED TASK AND DATASET RESULTS

Due to the large variety of datasets and experiments considered in this work, the main body focuses on summarizing and analyzing overall results. In the following subsections we report details of performance for specific groups of tasks, datasets, and evaluation settings.

E.1 IMAGE AND TEXT RETRIEVAL

CLIP pre-trains for the task of image-text retrieval on our noisy web-scale dataset. Although the focus of this paper is on representation learning and task learning for the purpose of transfer to a wide variety of downstream datasets, validating that CLIP is able to achieve high transfer performance on exactly what it is pre-trained for is an important sanity check / proof of concept. In Table 13 we check the zero-shot transfer performance of CLIP for both text and image retrieval on the Flickr30k and MSCOCO datasets. Zero-shot CLIP matches or outperforms all prior zero-shot results on these two datasets. Zero-shot CLIP is also competitive with the current overall SOTA for the task of text retrieval on Flickr30k. On image retrieval, CLIP’s performance relative to the overall state of the art is noticeably lower. However, zero-shot CLIP is still competitive with a fine-tuned Unicoder-VL. On the larger MS-COCO dataset fine-tuning improves performance significantly and zero-shot CLIP is not competitive with the most recent work. For both of these datasets we prepend the prompt “a photo of” to the description of each image, which we found boosts CLIP’s zero-shot R@1 performance between 1 and 2 points.
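The sketch below shows one way to reproduce this prompt-augmented zero-shot retrieval setup with the released CLIP package; the file names and captions are placeholders.

```python
# A minimal sketch of zero-shot text-to-image retrieval with the released CLIP
# package, prepending "a photo of" to each caption as described above.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

captions = ["a dog playing in the snow", "a plate of sushi"]      # placeholder captions
image_paths = ["img0.jpg", "img1.jpg"]                            # placeholder image files

with torch.no_grad():
    text = clip.tokenize(["a photo of " + c for c in captions]).to(device)
    text_features = model.encode_text(text)
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    image_features = model.encode_image(images)

    text_features /= text_features.norm(dim=-1, keepdim=True)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    similarity = text_features @ image_features.t()        # rows: captions, cols: images
    ranks = similarity.argsort(dim=-1, descending=True)    # retrieval ranking per caption
```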

| | Model | MNIST | SVHN | IIIT5K 1k | Hateful Memes | SST-2 |
| - | - | - | - | - | - | - |
| Finetune | SOTA | 99.8a | 96.4b | 98.9c | 78.0d | 97.5e |
| | JOINTf | - | - | 89.6 | - | - |
| | CBoWg | - | - | - | - | 80.0 |
| Linear | Raw Pixels | 92.5 | - | - | - | - |
| | ES Best | 98.9h | - | - | 58.6h | 59.0i |
| | CLIP | 99.2 | - | - | 77.3 | 80.5 |
| ZS | CLIP | 88.4 | 51.0 | 90.0 | 63.3 | 67.9 |

Table 14: OCR performance on 5 datasets. All metrics are accuracy on the test set except for Hateful Memes which reports ROC AUC on the dev set. Single model SOTA reported to best of knowledge. ES Best reports the best performance across the 56 non-CLIP models in our evaluation suite. a(Assiri, 2020) b(Jaderberg et al., 2015) c(Wang et al., 2020) d(Lippe et al., 2020) f(Jaderberg et al., 2014) g(Wang et al., 2018) h(Xie et al., 2020) i(Mahajan et al., 2018)

E.2 OPTICAL CHARACTER RECOGNITION

Although visualizations have shown that ImageNet models contain features that respond to the presence of text in an image (Zeiler & Fergus, 2014), these representations are not sufficiently fine-grained to use for the task of optical character recognition (OCR). To compensate, models are augmented with the outputs of custom OCR engines and features to boost performance on tasks where this capability is required (Singh et al., 2019; Yang et al., 2020). Early during the development of CLIP, we noticed that CLIP began to learn primitive OCR capabilities which appeared to steadily improve over the course of the project. To quantify this qualitatively observed behavior, we measured performance on 5 datasets requiring the direct and indirect use of OCR. Three of these datasets, MNIST (LeCun), SVHN (Netzer et al., 2011), and IIIT5K (Mishra et al., 2012), directly check the ability of a model to perform low-level character and word recognition, while Hateful Memes (Kiela et al., 2020) and SST-2 (Socher et al., 2013) check the ability of a model to use OCR to perform a semantic task. Results are reported in Table 14.

CLIP’s performance is still highly variable and appears to be sensitive to some combination of the domain (rendered or natural images) and the type of text to be recognized (numbers or words). CLIP’s OCR performance is strongest on Hateful Memes and SST-2, datasets where the text is digitally rendered and consists mostly of words. On IIIT5K, which consists of natural images of individually cropped words, zero-shot CLIP performs a bit more respectably and its performance is similar to the early work of Jaderberg et al. (2014) combining deep learning and structured prediction to perform open-vocabulary OCR. However, performance is noticeably lower on the two datasets involving recognition of handwritten and street view numbers. CLIP’s 51% accuracy on full number SVHN is well below any published results. Inspection suggests CLIP struggles with repeated characters as well as the low resolution and blurry images of SVHN. CLIP’s zero-shot MNIST performance is also poor and is outperformed by supervised logistic regression on raw pixels, one of the simplest possible machine learning baselines.

SST-2 is a sentence-level NLP dataset which we render into images. We include SST-2 in order to check whether CLIP is able to convert low-level OCR capability into a higher-level representation. Fitting a linear classifier on CLIP’s representation of rendered sentences achieves 80.5% accuracy. This is on par with the 80% accuracy of a continuous bag of words baseline using GloVe word vectors pre-trained on 840 billion tokens (Pennington et al., 2014). While this is a simple NLP baseline by today’s standards, and well below the 97.5% of the current SOTA, it is encouraging to see that CLIP is able to turn an image of rendered text into a non-trivial sentence-level representation. Fully supervised CLIP is also surprisingly strong on Hateful Meme detection, where CLIP is only 0.7 points behind the current single model SOTA and several points above the best baseline from the original paper. Similar to SST-2, these other results on Hateful Memes use the ground truth text which CLIP does not have access to. Finally, we note that zero-shot CLIP outperforms the best results using fully supervised linear probes across all of the other 56 models included in our evaluation suite. This suggests CLIP’s OCR capability is at least somewhat unique compared to existing work on self-supervised and supervised representation learning.

UCF101 K700 RareAct
Top-1 AVG mWAP
Finetune R(2+1)D-BERTa 98.7 - -
NS ENet-L2b - 84.8 - -
HT100M S3Dd 91.3 - - -
Baseline I3De - 70.2 - -
Linear MMV FACf 91.8 - -
NS ENet-L2c 89.4c 68.2c - -
CLIP 92.0 73.0 - -
ZS HT100M S3Dd - - 30.5
CLIP 80.3 69.6 40.7 44.8

Table 15: Action recognition performance on 3 video datasets. Single model SOTA reported to best of knowledge. Note that linear CLIP and linear NS ENet-L2 are trained and evaluated on a single frame subsampled version of each dataset and not directly comparable to prior work. On Kinetics-700, we report the ActivityNet competition metric which is the average of top-1 and top-5 performance. a(Kalfaoglu et al., 2020) b(Lu et al., 2020) c(Xie et al., 2020) d(Miech et al., 2020b) e(Carreira et al., 2019) f(Alayrac et al., 2020)

E.3 ACTION RECOGNITION IN VIDEOS

For the purpose of learning, a potentially important aspect of natural language is its ability to express, and therefore supervise, an extremely wide set of concepts. A CLIP model, since it is trained to pair semi-arbitrary text with images, is likely to receive supervision for a wide range of visual concepts involving both common and proper nouns, verbs, and adjectives. ImageNet-1K, by contrast, only labels common nouns. Does the lack of broader supervision in ImageNet result in weaker transfer of ImageNet models to tasks involving the recognition of visual concepts that are not nouns?

To investigate this, we measure and compare the performance of CLIP and ImageNet models on several video action classification datasets which measure the ability of a model to recognize verbs. In Table 15 we report results on UCF-101 (Soomro et al., 2012) and Kinetics-700 (Carreira et al., 2019), two common datasets for the task. Unfortunately, our CPU-based linear classifier takes a prohibitively long time to evaluate on a video dataset due to the very large number of training frames. To deal with this, we aggressively sub-sample each video to only a single center frame, effectively turning it into an image classification dataset. As a result, our reported performance in a linear evaluation setting likely underestimates performance by a moderate amount.

| Model | IN Top-1 | IN-V2 Top-1 | IN-A Top-1 | IN-R Top-1 | ObjectNet Top-1 | IN-Sketch Top-1 | IN-Vid PM0 | YTBB PM10 |
| - | - | - | - | - | - | - | - | - |
| NS EfficientNet-L2a | 88.3 | 80.2 | 84.9 | 74.7 | 68.5 | 47.6 | 88.0 | 82.1 |
| FixResNeXt101-32x48d V2b | 86.4 | 78.0 | 68.4 | 80.0 | 57.8 | 59.1 | 85.8 | 72.2 |
| Linear Probe CLIP | 85.4 | 75.9 | 75.3 | 84.2 | 66.2 | 57.4 | 89.1 | 77.2 |
| Zero-Shot CLIP | 76.2 | 70.1 | 77.2 | 88.9 | 72.3 | 60.2 | 95.3 | 89.2 |

Table 16: Detailed ImageNet robustness performance. IN is used as an abbreviation for ImageNet. a(Xie et al., 2020) b(Touvron et al., 2019)

Despite this handicap, CLIP features transfer surprisingly well to this task. CLIP matches the best prior result on UCF-101 in a linear probe evaluation setting and also outperforms all other models in our evaluation suite. On Kinetics-700, CLIP also outperforms the fine-tuned I3D baseline from the original paper. Since it does not require a training stage, we report CLIP’s zero-shot performance when averaging predictions across all frames. CLIP also performs well in this setting and on Kinetics-700 its performance is within 1% of the fully supervised I3D baseline which is trained on 545,000 labeled videos. Encouraged by these results, we also measure CLIP’s performance on the recently introduced RareAct dataset (Miech et al., 2020a) which was designed to measure zero-shot recognition of unusual actions like “hammering a phone” and “drilling an egg”. CLIP improves over the prior state of the art, an S3D model trained on automatically extracted captions from 100 million instructional videos, by 10 points.
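A minimal sketch of this frame-averaged zero-shot evaluation is shown below; whether probabilities or logits are averaged, and the pre-computed class text embeddings, are assumptions for illustration.

```python
# Average per-frame CLIP predictions over a video clip. `frames` is a list of PIL
# images; `text_features` are pre-computed, L2-normalized class embeddings; `model`
# and `preprocess` come from clip.load(...). All of these are assumed inputs.
import torch

def zero_shot_video_scores(model, preprocess, frames, text_features, device="cpu"):
    with torch.no_grad():
        pixels = torch.stack([preprocess(f) for f in frames]).to(device)
        image_features = model.encode_image(pixels)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        per_frame = 100.0 * image_features @ text_features.t()  # (num_frames, num_classes)
        return per_frame.softmax(dim=-1).mean(dim=0)            # average prediction over frames
```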

While CLIP has encouragingly strong performance on the task of action recognition, we note that there are many differences between the models being compared beyond just their form of supervision such as model architecture, training data distribution, dataset size, and compute used. Further work is needed to more precisely determine what specific design decisions contribute to achieving high performance on this task.

| Model | 1km | 25km | 200km | 750km | 2500km |
| - | - | - | - | - | - |
| ISNsa | 16.9 | 43.0 | 51.9 | 66.7 | 80.2 |
| CPlaNetb | 16.5 | 37.1 | 46.4 | 62.0 | 78.5 |
| CLIP | 13.9 | 32.9 | 43.0 | 62.0 | 79.3 |
| Deep-Ret+c | 14.4 | 33.3 | 47.7 | 61.6 | 73.4 |
| PlaNetd | 8.4 | 24.5 | 37.6 | 53.6 | 71.3 |

Table 17: Geolocalization performance on the IM2GPS test set. Metric is percent of images localized within a given radius. Models are ordered by average performance. a(Muller-Budack et al., 2018) b(Hongsuck Seo et al., 2018) c(Vo et al., 2017) d(Weyand et al., 2016)

E.4 GEOLOCALIZATION

Another behavior we noticed during the development of CLIP was its ability to recognize many places and locations. To quantify this we created the Country211 dataset as described in Appendix A and report results on it throughout the paper. However, since Country211 is a new benchmark, we also report results on the IM2GPS test set from Hays & Efros (2008) in Table 17 in order to compare with prior work on geolocalization. Since IM2GPS is a regression benchmark, we guess the GPS coordinates of the nearest image in a set of reference images using CLIP’s embedding space. This is not a zero-shot result since it uses nearest-neighbor regression. Despite querying only 1 million images, which is much less than prior work, CLIP performs similarly to several task-specific models. It is not, however, competitive with the current state of the art.
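A minimal sketch of this nearest-neighbor lookup is shown below; the pre-computed, L2-normalized reference embeddings and their GPS coordinates are assumed inputs.

```python
# Nearest-neighbor GPS regression in an embedding space: each query inherits the
# coordinates of its most similar reference image. `reference_features` (N, D) is
# assumed to be L2-normalized and `reference_gps` (N, 2) holds (lat, lon) pairs.
import torch

def predict_gps(query_features, reference_features, reference_gps):
    query = query_features / query_features.norm(dim=-1, keepdim=True)
    sims = query @ reference_features.t()   # cosine similarity to the reference set
    nearest = sims.argmax(dim=-1)           # index of the most similar reference image
    return reference_gps[nearest]           # (B, 2) predicted coordinates
```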

E.5 ROBUSTNESS TO DISTRIBUTION SHIFT

Section 3.3 provides a high-level summary and analysis of ImageNet-related robustness results. We briefly provide some additional numerical details in this appendix. Performance results per dataset are provided in Table 16 and compared with the current state of the art results reported in Taori et al. (2020)’s evaluation suite. Zero-shot CLIP improves the state of the art on 5 of the 7 datasets: ImageNet-R, ObjectNet, ImageNet-Sketch, ImageNet-Vid, and Youtube-BB. CLIP’s improvements are largest on ImageNet-Vid and Youtube-BB due to its flexible zero-shot capability, and on ImageNet-R, which likely reflects CLIP’s pre-training distribution including significant amounts of creative content. A similar behavior has been documented for the Instagram pre-trained ResNeXt models as discussed in Taori et al. (2020).

APPENDIX F MODEL HYPERPARAMETERS

| Hyperparameter | Value |
| - | - |
| Batch size | 32768 |
| Vocabulary size | 49408 |
| Training epochs | 32 |
| Maximum temperature | 100.0 |
| Weight decay | 0.2 |
| Warm-up iterations | 2000 |
| Adam β1 | 0.9 |
| Adam β2 | 0.999 (ResNet), 0.98 (ViT) |
| Adam ϵ | 10^-8 (ResNet), 10^-6 (ViT) |

Table 18: Common CLIP hyperparameters

| Model | Learning rate | Embedding dimension | Input resolution | ResNet blocks | ResNet width | Text layers | Text width | Text heads |
| - | - | - | - | - | - | - | - | - |
| RN50 | 5×10^-4 | 1024 | 224 | (3, 4, 6, 3) | 2048 | 12 | 512 | 8 |
| RN101 | 5×10^-4 | 512 | 224 | (3, 4, 23, 3) | 2048 | 12 | 512 | 8 |
| RN50x4 | 5×10^-4 | 640 | 288 | (4, 6, 10, 6) | 2560 | 12 | 640 | 10 |
| RN50x16 | 4×10^-4 | 768 | 384 | (6, 8, 18, 8) | 3072 | 12 | 768 | 12 |
| RN50x64 | 3.6×10^-4 | 1024 | 448 | (3, 15, 36, 10) | 4096 | 12 | 1024 | 16 |

Table 19: CLIP-ResNet hyperparameters

| Model | Learning rate | Embedding dimension | Input resolution | ViT layers | ViT width | ViT heads | Text layers | Text width | Text heads |
| - | - | - | - | - | - | - | - | - | - |
| ViT-B/32 | 5×10^-4 | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-B/16 | 5×10^-4 | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-L/14 | 4×10^-4 | 768 | 224 | 24 | 1024 | 16 | 12 | 768 | 12 |
| ViT-L/14-336px | 2×10^-5 | 768 | 336 | 24 | 1024 | 16 | 12 | 768 | 12 |

Table 20: CLIP-ViT hyperparameters