Sigmoid Loss for Language Image Pre-Training
Abstract
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and the negative-to-positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
1. Introduction
Contrastive pre-training using weak supervision from image-text pairs found on the web is becoming the go-to method for obtaining generic computer vision backbones, slowly replacing pre-training on large labelled multi-class datasets. The high-level idea is to simultaneously learn an aligned representation space for images and texts using paired data. Seminal works CLIP [36] and ALIGN [23] established the viability of this approach at a large scale, and following their success, many large image-text datasets became available privately [59, 13, 21, 49] and publicly [40, 6,15,7,41].
The standard recipe to pre-train such models leverages the image-text contrastive objective. It aligns the image and
Table 1: SigLiT and SigLIP results. Sigmoid loss is memory efficient, allowing larger batch sizes (BS) that unlock language-image pre-training with a small number of chips. A SigLiT model with a frozen public B/8 checkpoint [42], trained on the LiT image-text dataset [59] using four TPUv4 chips for one day, achieves 79.7% 0-shot accuracy on ImageNet. The same setup with a g/14 checkpoint [58] leads to 84.5% accuracy, trained for two days. With a public unlocked B/16 image checkpoint [42], trained on the WebLI dataset [13], SigLIP achieves 71.0% 0-shot accuracy using 16 TPU-v4 chips for three days. The last two rows show results with randomly initialized models.
| | Image | Text | BS | #TPUv4 | Days | INet-0 |
|---|---|---|---|---|---|---|
| SigLiT | B/8 | L* | 32k | 4 | 1 | 79.8 |
| SigLiT | g/14 | L | 20k | 4 | 2 | 84.5 |
| SigLIP | B/16 | B | 16k | 16 | 3 | 71.0 |
| SigLIP | B/16 | B | 32k | 32 | 2 | 72.1 |
| SigLIP | B/16 | B | 32k | 32 | 5 | 73.4 |

*: We use a variant of the L model with 12 layers.
text embeddings for matching (positive) image-text pairs while making sure that unrelated (negative) image-text pairs are dissimilar in the embedding space. This is achieved via a batch-level softmax-based contrastive loss, applied twice to normalize the pairwise similarity scores across all images, then all texts. A naive implementation of the softmax is numerically unstable; it is usually stabilized by subtracting the maximum input value before applying the softmax [18], which requires another pass over the full batch.
In this paper, we propose a simpler alternative: the sigmoid loss. It does not require any operation across the full batch and hence greatly simplifies the distributed loss implementation and boosts efficiency. Additionally, it conceptually decouples the batch size from the definition of the task. We compare the proposed sigmoid loss with the standard softmax loss across multiple setups. In particular, we investigate the sigmoid-based loss with two prominent approaches for image-text learning: CLIP [36] and LiT [59], which we call sigmoid language-image pre-training (SigLIP) and sigmoid LiT (SigLiT), respectively. We find that the sigmoid loss performs significantly better than the softmax loss when the batch size is smaller than 16k. As the training batch size grows, the gap closes. Importantly, the sigmoid loss is symmetric, requires just a single pass, and a typical implementation requires less memory than the softmax loss. This enables successful training of a SigLiT model at a batch size of one million. However, we find that the performance saturates with growing batch size, both for softmax and sigmoid. The good news is that a reasonable batch size, i.e. 32k, is sufficient for image-text pre-training. This conclusion also holds for multilingual SigLIP training on over 100 languages.
In Table 1, we present setups for image-text pre-training that require a moderate amount of TPUv4 chips for training. SigLiT is surprisingly efficient, reaching 79.7% zero-shot accuracy on ImageNet in just a single day on four chips. SigLIP's more demanding from-scratch training reaches 73.4% zero-shot accuracy in 5 days with 32 TPUv4 chips. This compares favorably to prior works such as FLIP [30] and CLIP [36], which require approximately 5 and 10 days respectively on 256 TPUv3 cores. When fine-tuning an unlocked pre-trained vision backbone in SigLIP (third row in Table 1), we found that disabling the weight decay on the pre-trained backbone leads to better results (see Figure 4 for details). We hope our work paves the way for making the nascent language-image pre-training field more accessible.
2. Related Work
Contrastive learning with the sigmoid loss. One prior work proposes a similar sigmoid loss for the task of unsupervised dimensionality reduction [19]; in the scope of contrastive image-text learning, the vast majority of works rely on the softmax-based InfoNCE loss as popularized by [46]. In supervised classification, the sigmoid loss has already been shown to be slightly more effective and robust than the softmax loss [3, 51].
Contrastive language-image pre-training has become popular since CLIP [36] and ALIGN [23] applied softmax contrastive learning [60, 46, 10, 24] to large-scale image-text datasets. Both models perform very well on zero-shot transfer tasks, including classification and retrieval. Follow-up works show that contrastively pre-trained models produce good representations for fine-tuning [53, 16], linear regression [23], object detection [31], semantic segmentation [33] and video tasks [57].
Generative language-image pre-training. Besides softmax contrastive pre-training, various alternatives have been proposed. GIT [49], SimVLM [50], and LEMON [21] successfully pre-train models using a generative text decoder instead, while CoCa [56] adds such a decoder to the discriminative CLIP/ALIGN setup, thus combining the pros and cons of both approaches into a single very capable model. BLIP [28] further proposes CapFilt, which uses the generative decoder to create better captions and the discriminative part of the model to filter pairs. Language-image pre-training is a very active field and surveys [8] rapidly become outdated.
Algorithm 1 Sigmoid loss pseudo-implementation.
```python
# img_emb    : image model embedding [n, dim]
# txt_emb    : text model embedding [n, dim]
# t_prime, b : learnable temperature and bias
# n          : mini-batch size

t = exp(t_prime)
zimg = l2_normalize(img_emb)
ztxt = l2_normalize(txt_emb)
logits = dot(zimg, ztxt.T) * t + b
labels = 2 * eye(n) - ones(n)  # -1 with diagonal 1
l = -sum(log_sigmoid(labels * logits)) / n
```
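For readers who want to run the listing directly, the following is a minimal executable sketch of Algorithm 1 in JAX (our own illustrative rendering, not code from the paper; the random inputs and the helper name `sigmoid_loss` are ours), using the initialization $t' = \log 10$, $b = -10$ described in Section 3.2.

```python
import jax
import jax.numpy as jnp

def sigmoid_loss(img_emb, txt_emb, t_prime, b):
    """Pairwise sigmoid loss for a mini-batch of n aligned image/text embeddings."""
    n = img_emb.shape[0]
    t = jnp.exp(t_prime)                                   # learnable temperature
    zimg = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
    ztxt = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = zimg @ ztxt.T * t + b                         # [n, n] pairwise logits
    labels = 2 * jnp.eye(n) - jnp.ones((n, n))             # +1 on diagonal, -1 elsewhere
    return -jnp.sum(jax.nn.log_sigmoid(labels * logits)) / n

# Toy example with random embeddings and the paper's initialization.
key_img, key_txt = jax.random.split(jax.random.PRNGKey(0))
img = jax.random.normal(key_img, (8, 16))
txt = jax.random.normal(key_txt, (8, 16))
print(sigmoid_loss(img, txt, t_prime=jnp.log(10.0), b=-10.0))
```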
Efficient language-image pre-training. On the other hand, few works have tried making language-image pre-training more efficient. LiT [59] and FLIP [30] are notable attempts; the former requires a pre-trained and locked backbone, and the latter sacrifices quality by randomly dropping visual tokens. BASIC [35] and LAION [52] look at scaling batch size but only go up to 16k and 160k respectively, by using many hundreds of chips, and for the former also mixing in a large private classification dataset [35, 55]. The recent Lion optimizer [12] claims to be able to reduce the training cost needed to reach similar quality.
3. Method
In this section, we first review the widely used softmax-based contrastive loss. We then introduce the pairwise sigmoid loss and discuss its efficient implementation.
Given a mini-batch $\mathcal{B} = \{(I_1, T_1), (I_2, T_2), \ldots\}$ of image-text pairs, the contrastive learning objective encourages embeddings of matching pairs $(I_i, T_i)$ to align with each other, while pushing embeddings of unmatched pairs $(I_i, T_{j\neq i})$ apart. For practical purposes, it is assumed that for all images $i$, the text associated with a different image $j$ is not related to image $i$, and vice-versa. This assumption is usually noisy and imperfect.
3.1. Softmax loss for language image pre-training
When using the softmax loss to formalize this objective, an image model $f(\cdot)$ and a text model $g(\cdot)$ are trained to
(a) Initially each device holds 4 image and 4 text representations. Each device needs to see the representations from other devices to calculate the full loss.
(b) They each compute the component of the loss (highlighted) for their representations, which includes the positives.
(c) Texts are swapped across the devices, so device 1 now has $I_{1:4}$ and $T_{5:8}$ etc. The new loss is computed and accumulated with the previous.
(d) This repeats till every image & text pair have interacted, e.g. device 1 has the loss of $I_{1:4}$ and $T_{1:12}$. A final cross-device sum brings everything together.
Figure 1: Efficient loss implementation demonstrated via a mock setup with 3 devices and a global batch size of 12. There are no all-gathers, and at any point in time only the bright yellow square (of size $4\times4$) is materialized in memory.
minimize the following objective:

$$
-\frac{1}{2|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\Bigg(\underbrace{\log\frac{e^{t\,\mathbf{x}_i\cdot\mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|}e^{t\,\mathbf{x}_i\cdot\mathbf{y}_j}}}_{\text{image}\rightarrow\text{text softmax}}+\underbrace{\log\frac{e^{t\,\mathbf{x}_i\cdot\mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|}e^{t\,\mathbf{x}_j\cdot\mathbf{y}_i}}}_{\text{text}\rightarrow\text{image softmax}}\Bigg)
$$

where $\mathbf{x}_i = \frac{f(I_i)}{\|f(I_i)\|_2}$ and $\mathbf{y}_i = \frac{g(T_i)}{\|g(T_i)\|_2}$. In this paper, we adopt the vision transformer architecture [17] for images and the transformer architecture [47] for texts. Note that due to the asymmetry of the softmax loss, the normalization is independently performed two times: across images and across texts [36]. The scalar $t$ is parametrized as $\exp(t')$, where $t'$ is a global freely learnable parameter.
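For comparison with Algorithm 1, here is a minimal sketch of this softmax-based objective in JAX (our own illustrative rendering, not code from the paper), assuming `zimg` and `ztxt` are already $\ell_2$-normalized `[n, dim]` embeddings:

```python
import jax
import jax.numpy as jnp

def softmax_contrastive_loss(zimg, ztxt, t):
    """Symmetric softmax (InfoNCE-style) loss; matching pairs sit on the diagonal."""
    logits = zimg @ ztxt.T * t                                         # [n, n] similarities
    img2txt = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits, axis=1)))  # normalize over texts
    txt2img = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits, axis=0)))  # normalize over images
    return 0.5 * (img2txt + txt2img)
```

Unlike the sigmoid loss, each term here depends on a full row or column of the similarity matrix, which is why the softmax variant needs a global view of the batch (and, for numerical stability, the usual max-subtraction pass inside `log_softmax`).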
3.2. Sigmoid loss for language image pre-training
Instead of the softmax-based contrastive loss, we propose a simpler alternative that does not require computing global normalization factors. The sigmoid-based loss processes every image-text pair independently, effectively turning the learning problem into standard binary classification on the dataset of all pair combinations, with positive labels for the matching pairs $(I_i, T_i)$ and negative labels for all other pairs $(I_i, T_{j\neq i})$. It is defined as follows:

$$
-\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{j=1}^{|\mathcal{B}|}\log\frac{1}{1+e^{-z_{ij}\left(t\,\mathbf{x}_i\cdot\mathbf{y}_j+b\right)}}
$$
where $z_{ij}$ is the label for a given image and text input, which equals $1$ if they are paired and $-1$ otherwise. At initialization, the heavy imbalance coming from the many negatives dominates the loss, leading to large initial optimization steps attempting to correct this bias. To alleviate this, we introduce an additional learnable bias term $b$, similar to the temperature $t$. We initialize $t'$ and $b$ to $\log 10$ and $-10$ respectively. This makes sure the training starts roughly close to the prior and does not require massive over-correction. Algorithm 1 presents a pseudocode implementation of the proposed sigmoid loss for language image pre-training.
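To make the effect of this initialization concrete, here is a short back-of-the-envelope check (our own sketch, assuming roughly orthogonal embeddings at initialization so that $\mathbf{x}_i\cdot\mathbf{y}_j\approx 0$, with $\sigma$ the sigmoid function). Every logit then starts near $t\,\mathbf{x}_i\cdot\mathbf{y}_j + b \approx b = -10$, so each negative pair contributes $-\log\sigma(10)\approx 4.5\cdot10^{-5}$ to the loss while the positive pair contributes $-\log\sigma(-10)\approx 10$. At a batch size of 32k the initial per-example loss is therefore roughly

$$
10 + (32{,}768-1)\cdot 4.5\cdot 10^{-5} \approx 11.5,
$$

whereas without the bias ($b=0$) the same negatives would contribute about $32{,}767\cdot\log 2\approx 2.3\cdot 10^{4}$, i.e. the many negatives would completely dominate the initial loss.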
3.3. Efficient “chunked” implementation
Contrastive training typically utilizes data parallelism. Computing the loss when data is split across $D$ devices necessitates gathering all embeddings [59] with expensive all-gathers and, more importantly, the materialization of a memory-intensive $|\mathcal{B}|\times|\mathcal{B}|$ matrix of pairwise similarities.
The sigmoid loss, however, is particularly amenable to a memory efficient, fast, and numerically stable implementation that ameliorates both these issues. Denoting the per-device batch size as $b = \frac{|\mathcal{B}|}{D}$, the loss is reformulated as:

$$
-\frac{1}{|\mathcal{B}|}\,\underbrace{\sum_{d=1}^{D}}_{\text{A}}\;\underbrace{\sum_{d'=1}^{D}}_{\text{B}}\;\underbrace{\sum_{i\in\mathcal{B}_d}\sum_{j\in\mathcal{B}_{d'}}\log\frac{1}{1+e^{-z_{ij}\left(t\,\mathbf{x}_i\cdot\mathbf{y}_j+b\right)}}}_{\text{C}}
$$

where $\mathcal{B}_d$ denotes the chunk of the batch held by device $d$.
This is particularly simple for the sigmoid loss, as each pair is an independent term in the loss. Figure 1 illustrates this method. In words, we first compute the component of the loss corresponding to the positive pairs and $b-1$ negative pairs. We then permute representations across devices, so each device takes negatives from its neighbouring device (next iteration of sum B). The loss is then calculated with respect to this chunk (sum C). This is done independently on each device, such that each device computes the loss with respect to its local batch $b$. Losses can then simply be summed across all devices (sum A). Individual collective permutes (for sum B) are fast (and indeed $D$ collective permutes are typically faster than two all-gathers between $D$ devices), and the memory cost at any given moment is reduced from $|\mathcal{B}|^2$ to $b^2$ (for sum C). Usually $b$ is constant, as scaling $|\mathcal{B}|$ is achieved by increasing the number of accelerators. Because it is quadratic with respect to the batch size, the vanilla loss computation rapidly becomes a bottleneck when scaling up. This chunked approach enabled training with batch sizes over one million on relatively few devices.
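The following is a minimal sketch of this chunked computation in JAX (our own illustration, not the paper's implementation). It assumes it runs inside `jax.pmap` over a named axis `'devices'`, that each device holds one $\ell_2$-normalized `[b, dim]` shard of image and text embeddings, and that `num_devices` is passed in as a static Python integer; `block_loss` and `chunked_sigmoid_loss` are hypothetical names.

```python
import jax
import jax.numpy as jnp

def block_loss(zimg, ztxt, t, bias, with_positives):
    """Sigmoid loss of the local image chunk against one text chunk ("sum C")."""
    logits = zimg @ ztxt.T * t + bias                    # only a [b, b] block in memory
    labels = -jnp.ones_like(logits)                      # all pairs negative by default
    if with_positives:                                   # diagonal holds the matching pairs
        labels = labels + 2 * jnp.eye(logits.shape[0])
    return -jnp.sum(jax.nn.log_sigmoid(labels * logits))

def chunked_sigmoid_loss(zimg, ztxt, t, bias, num_devices, axis_name='devices'):
    # Local block first: it is the only one containing positives.
    loss = block_loss(zimg, ztxt, t, bias, with_positives=True)
    shift = [(d, (d + 1) % num_devices) for d in range(num_devices)]
    for _ in range(num_devices - 1):                     # "sum B": rotate the text chunks
        ztxt = jax.lax.ppermute(ztxt, axis_name, perm=shift)
        loss += block_loss(zimg, ztxt, t, bias, with_positives=False)
    # "sum A": add up per-device losses and normalize by the global batch size.
    return jax.lax.psum(loss, axis_name) / (zimg.shape[0] * num_devices)
```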
Figure 2: The effect of pre-training batch size. Left: SigLiT results, trained for 18B seen examples. The sigmoid loss outperforms the softmax loss significantly with small batch sizes, and performs similarly at larger batch sizes. We successfully trained a SigLiT model with up to one million batch size. However, performance for both sigmoid and softmax saturates at around 32k batch size. Middle: SigLIP results, trained for 9B seen examples. Both the sigmoid loss and the softmax loss saturate at a reasonable batch size, while the peak of the sigmoid loss comes earlier and slightly outperforms the peak of the softmax loss. A very large batch size hurts both losses. Right: mSigLIP results, trained for 30B seen examples. With a multilingual setup using over 100 languages, a 32k batch size is surprisingly sufficient and scaling beyond that hurts performance on a 36-language cross-modal retrieval task.
4. Results
In this section, we evaluate the proposed SigLiT and SigLIP models across a wide range of batch sizes. We discuss what can be achieved with a small number of accelerator chips, using both the SigLiT and SigLIP recipes. We also briefly discuss the impact of batch size on multilingual language-image pre-training. We ablate the importance of our large-batch stabilization modification and the introduced learned bias term, and present a study on the effect of the positive-to-negative pair ratio in the sigmoid loss. Lastly, we explore SigLIP's robustness to data noise.
To validate our models, we report zero-shot transfer results on the ImageNet dataset [14] and zero-shot retrieval results across 36 languages on the XM3600 dataset [44]. We use the ScalingViT-Adafactor optimizer [58] by default for all our experiments.
4.1. SigLiT: Scaling batch size to the limit
Following [59], we use the same precomputed embeddings for the images using a ViT-g vision model, and train a Base-sized text tower from scratch with the same hyperparameters using the LiT image-text dataset [59].
We perform a study over a wide range of batch sizes, from 512 to 1M, demonstrating the impact of batch size on contrastive learning. Results are presented in Figure 2 (left). When the batch size is smaller than 16k, the sigmoid loss outperforms the softmax loss by a large margin. With growing batch sizes, we observe that the softmax loss quickly catches up and potentially slightly underperforms the sigmoid loss with a large enough batch size. Overall, we recommend using the SigLIP recipe for large batch sizes as well, due to its simplicity, compute savings, and straightforward memory-efficient implementation.
There is a consensus that contrastive learning benefits from large batch sizes, while most existing studies stop at a 64k batch size [59, 35, 10]. We successfully trained a SigLiT model at a batch size of one million to explore the limit of contrastive learning. To our surprise, the performance saturates at a 32k batch size; further scaling up the batch size only gives a minor boost, and the model peaks at
Figure 3: SigLiT ImageNet 0-shot transfer results with different training durations. Large batch size results in a big performance boost, but needs a sufficiently long schedule to ramp up, as for short schedules, very large batch size results in a small number of gradient update steps.
a 256k batch size. Our best SigLiT with a B-sized text model achieves 84.7% zero-shot transfer accuracy on ImageNet, while the original LiT paper reports a slightly better 85.2% score with a 10 times larger g-sized text model. Figure 3 presents the impact of training duration for different batch sizes. It demonstrates that a large, 262k batch size significantly outperforms a smaller 8k batch size when trained for a sufficiently long time. Note that for short training durations, a large batch size leads to fewer absolute update steps and thus needs more time to ramp up.
4.2. SigLIP: Sigmoid loss is beneficial for language-image pre-training
We pre-train SigLIP models on the WebLI dataset [13], using only English image and text pairs. We use CLIP (WebLI) to denote the CLIP baseline pre-trained on WebLI with the standard softmax loss. We use moderately-sized models: a B/16 ViT for image embeddings and a B-sized transformer for text embeddings. The input images are resized to $224\times224$ resolution. The text is tokenized by a 32k-vocabulary SentencePiece tokenizer [27] trained on the English C4 dataset [37], and a maximum of 16 text tokens are kept. Figure 2 (middle) shows SigLIP results. With less than 32k batch size, SigLIP outperforms the CLIP (WebLI) baseline. On the other end of the scale, the memory efficiency of the sigmoid loss enables much larger batch sizes. For example, with four TPU-v4 chips, we could fit a batch size of 4096 with a Base SigLIP but only 2048 with a corresponding CLIP model. The two advantages together demonstrate significant benefits of the sigmoid loss for language-image pre-training with fixed resources, which will be discussed in Section 4.5.
| | 16k | 32k | 64k | 128k | 240k |
|---|---|---|---|---|---|
| INet-0 | 71.6 | 73.2 | 73.2 | 73.2 | 73.1 |
| XM avg | 34.8 | 34.9 | 34.4 | 33.6 | 32.7 |
| XM de | 54.7 | 54.8 | 55.4 | 54.3 | 54.7 |
| XM en | 46.5 | 46.2 | 46.5 | 46.6 | 46.6 |
| XM hi | 9.1 | 8.5 | 7.9 | 8.1 | 7.3 |
| XM ru | 50.1 | 49.9 | 49.7 | 48.6 | 49.3 |
| XM zh | 30.7 | 32.5 | 32.0 | 30.6 | 23.7 |
Table 2: Multilingual SigLIP results with various batch sizes, pre-trained for 30 billion seen examples. We report zero-shot transfer results on ImageNet (INet-0) and averaged text-to-image retrieval results across 36 languages on the Crossmodal-3600 dataset (XM). The full table across 36 languages can be found in the Appendix.
As batch size increases, the gap between the sigmoid and the softmax losses diminishes. SigLIP performs best at a batch size of 32k, whereas the softmax loss requires 98k for optimal performance and still does not outperform the sigmoid-based variant. Scaling further, a larger batch size like 307k hurts both losses.
4.3. mSigLIP: Multi-lingual pre-training
We further scale up the training data by keeping all 100 languages from the WebLI dataset [13]. With multilingual data, one usually needs to use a larger international vocabulary. We first verify the impact of two tokenizers: a small multilingual vocabulary with 32k tokens [37], and a large multilingual vocabulary with 250k tokens [54]. We train B-sized ViT and text models for 900M total examples seen, and observe slightly more than a 1% improvement when using the larger vocabulary.
However, the token embeddings become huge for very large vocabulary sizes. Following the standard setup, we would need to store an $N\times W$ token embedding lookup table to train the multilingual model, where $N$ is the vocabulary size mentioned above and $W$ is the embedding dimension of the text model. To save memory, we propose to use a "bottlenecked" token embedding. We use an $N\times K$ embedding matrix and an additional $K\times W$ projection, where the bottleneck $K$ is much smaller than $W$.
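A minimal sketch of such a bottlenecked embedding in JAX (our own illustration; the function names and the 0.02 initialization scale are assumptions, not details from the paper):

```python
import jax
import jax.numpy as jnp

def init_bottleneck_embedding(key, vocab_size, width, bottleneck):
    """N x K lookup table plus K x W projection instead of a full N x W table."""
    k1, k2 = jax.random.split(key)
    lookup = 0.02 * jax.random.normal(k1, (vocab_size, bottleneck))   # N x K
    proj = 0.02 * jax.random.normal(k2, (bottleneck, width))          # K x W
    return lookup, proj

def embed_tokens(params, token_ids):
    lookup, proj = params
    return lookup[token_ids] @ proj                                   # [..., K] -> [..., W]

# With N=250k and W=768 the full table would hold ~192M parameters; with K=96
# the lookup plus projection hold ~24M, roughly an 8x reduction.
params = init_bottleneck_embedding(jax.random.PRNGKey(0), 250_000, 768, 96)
print(embed_tokens(params, jnp.array([[1, 2, 3]])).shape)             # (1, 3, 768)
```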
In our experiments, we observed that using a large multilingual vocabulary with a bottleneck can be scaled up as efficiently as using a small multilingual vocabulary. Specifically, by enabling a bottleneck of size $K=96$ for the Base architecture with $W=768$, we only see about a half-percent quality drop on ImageNet zero-shot transfer, compared to using the full 250k vocabulary.
Figure 4: Top: SigLIP with pre-trained encoders ramps up quickly. However, only disabling weight decay on the pre-trained encoder weights leads to stable behavior and good ImageNet 0-shot transfer results. Bottom: ImageNet 10-shot transfer results, where decaying the pre-trained weights leads to deterioration of the pre-trained model's visual representation quality. Disabling weight decay flattens the curve.
With the memory improvements, we train mSigLIP models for various batch sizes, for a total of 30 billion examples seen. Table 2 and Figure 2 (right plot) show the results. We were expecting a large batch size to improve multilingual pre-training, where the model sees more examples from the same language as hard negatives in a single mini-batch. However, we did not observe clear improvements with a batch size larger than 32k. A batch size of 32k is sufficient for a multilingual setup as well. On the XM3600 cross-modal retrieval tasks, we found that going beyond a 32k batch size leads to worse results on average, while on ImageNet zero-shot transfer performance stays flat. mSigLIP sets a new state of the art on the XM3600 text-to-image retrieval task with only a Base-sized model. Our best result is 34.9%, which is more than 6% higher than the previously reported 28.5% [13] achieved with a standard LiT model [59] using a much larger four-billion-parameter ViT-e model. We further scale up mSigLIP training in Section 4.6.
4.4. SigLiT with four TPU-v4 chips
For many practitioners, the important question usually is "what can be trained with a limited amount of resources?" In this section we explore the use of SigLiT models with only four TPU-v4 chips, as the memory-efficient sigmoid loss is well suited to this application scenario.
Figure 5: The effect of Adam and AdaFactor's $\beta_{2}$. As we increase the batch size, we observe more frequent training instability. This instability, seen in the loss curves (top), is caused by spikes in the gradient norm (middle) leading to large parameter updates (bottom). Decreasing the $\beta_{2}$ momentum stabilizes training. Occasional gradient spikes still happen (see the step at 2B), but do not destabilize the training process.
We follow the same setup as in Section 4.1. We use the publicly available ViT-AugReg-B/8 [42] model as the frozen vision tower, and precompute embeddings to accelerate the training [59]. The text model is a Large Transformer, but with a depth of only 12 layers (instead of 24). It is trained using the LION [12] optimizer with decoupled weight decay $1\times10^{-7}$, a linear warm-up of the learning rate over 6.5k steps up to a peak of $1\times10^{-4}$, followed by a cosine decay to 0. We train for a total of 65,000 steps with a batch size of 32k; this leads to just under one day of training. Table 1 shows the results when training a model on four chips for one day, achieving 79.7% 0-shot ImageNet classification accuracy; very competitive in this limited-resource regime. With a ViT-g/14 [58] model as the vision tower and a Large text tower, we can train at a 20k batch size on four chips for 107k steps in under two days. This further pushes the 0-shot ImageNet classification accuracy up to 84.5%.
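As a small illustration of the schedule described above, the following sketch implements a linear warm-up to the peak learning rate followed by a cosine decay to zero (our own helper, with the step counts from this recipe used as defaults):

```python
import math

def lr_schedule(step, peak=1e-4, warmup_steps=6_500, total_steps=65_000):
    """Linear warm-up to `peak`, then cosine decay to zero at `total_steps`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, the end of warm-up, midway through the decay, and the final step.
print([round(lr_schedule(s), 7) for s in (0, 6_500, 35_750, 65_000)])
```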
4.5. SigLIP with a small number of TPU-v4 chips
Training a CLIP model from scratch is generally resource demanding; with SigLIP it is possible to fit a larger training batch size on a smaller number of chips. In this section, we explore ways to train SigLIP models efficiently with pre-trained weights. We use pre-trained weights to initialize the image model to accelerate pre-training, as originally discussed in [59]. We use the public and unlocked ViT-AugReg-B/16 [42] model to initialize our vision tower and fine-tune on the same WebLI English data as used for SigLIP. In all experiments, we apply a 0.1 learning-rate multiplier to the pre-trained image tower to make it suitable for fine-tuning.
Figure 6: The effect of batch composition. We simulate various batch compositions by masking out negatives, either randomly, keeping only the hardest, or keeping only the easiest. With no masking, we have 16k negatives for each positive in the batch (1:16k), and the strongest masking we apply (1:1.6) results in almost balanced mini-batches. In one setting we match the total pairs seen by training for significantly longer. We observe the ImageNet 0-shot score, the final value of the learned bias, and the average logits of positive and negative pairs. Overall, the imbalance does not seem to be detrimental, but finding an efficient way of mining negatives might be beneficial.
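For concreteness, random masking of negatives as used in this ablation could look roughly like the following sketch (our own illustration; `subsampled_sigmoid_loss` is a hypothetical helper operating on the precomputed `[n, n]` logit matrix):

```python
import jax
import jax.numpy as jnp

def subsampled_sigmoid_loss(logits, keep_prob, key):
    """Sigmoid loss where each negative pair is kept with probability `keep_prob`."""
    n = logits.shape[0]
    labels = 2 * jnp.eye(n) - 1                       # +1 positives, -1 negatives
    is_positive = jnp.eye(n, dtype=bool)
    keep = jnp.logical_or(is_positive,                # always keep the positives
                          jax.random.bernoulli(key, keep_prob, (n, n)))
    per_pair = -jax.nn.log_sigmoid(labels * logits)
    return jnp.sum(jnp.where(keep, per_pair, 0.0)) / n
```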
Figure 4 presents unlocked fine-tuning results alongside from-scratch, randomly initialized baselines. We used 16 TPU-v4 chips and trained at a 16k batch size for 2.4B examples seen. We found that the fine-tuning setup does not perform well out of the box; this is consistent with prior work [59] where fine-tuning image models degraded visual representation quality. This is evidenced by ImageNet 10-shot linear classification, where in Figure 4 the fine-tuned setup is barely better than the from-scratch baseline.
We hypothesize that the default weight decay applied to the pre-trained weights reduces their effectiveness. Motivated by the fine-tuning recipe from [17, 58, 25], which uses no weight decay, we also propose disabling weight decay on the pre-trained weights for SigLIP training. Weight decay is therefore only applied to the randomly initialized weights in the text model. This simple modification significantly improves SigLIP results. Figure 4 shows that with our improved recipe, SigLIP reaches 71% 0-shot accuracy on ImageNet, using a 16k batch size, trained on 16 chips for three days. We also present from-scratch results in the bottom rows of Table 1: with 32 TPUv4 chips for only two days, SigLIP achieves 72.1% 0-shot accuracy. This represents a significant training cost reduction, e.g. compared to CLIP (approx. 2500 TPUv3-days for 72.6%) as reported in [30].