Sigmoid Loss for Language Image Pre-Training
Abstract
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and of the negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
1. Introduction
Contrastive pre-training using weak supervision from image-text pairs found on the web is becoming the go-to method for obtaining generic computer vision backbones, slowly replacing pre-training on large labelled multi-class datasets. The high-level idea is to simultaneously learn an aligned representation space for images and texts using paired data. Seminal works CLIP [36] and ALIGN [23] established the viability of this approach at a large scale, and following their success, many large image-text datasets became available privately [59, 13, 21, 49] and publicly [40, 6, 15, 7, 41].
The standard recipe to pre-train such models leverages the image-text contrastive objective. It aligns the image and text embeddings for matching (positive) image-text pairs while making sure that unrelated (negative) image-text pairs are dissimilar in the embedding space. This is achieved via a batch-level softmax-based contrastive loss, applied twice to normalize the pairwise similarity scores across all images, then all texts. A naive implementation of the softmax is numerically unstable; it is usually stabilized by subtracting the maximum input value before applying the softmax [18], which requires another pass over the full batch.
Table 1: SigLiT and SigLIP results. The sigmoid loss is memory efficient and allows larger batch sizes (BS), which unlocks language-image pre-training with a small number of chips. A SigLiT model with a frozen public B/8 checkpoint [42], trained on the LiT image-text dataset [59] using four TPUv4 chips for one day, achieves 79.7% 0-shot accuracy on ImageNet. The same setup with a g/14 checkpoint [58] leads to 84.5% accuracy, trained for two days. With a public unlocked B/16 image checkpoint [42], trained on the WebLI dataset [13], SigLIP achieves 71.0% 0-shot accuracy using 16 TPU-v4 chips for three days. The last two rows show results with randomly initialized models.
| Model | Image | Text | BS | #TPUv4 | Days | INet-0 |
|---|---|---|---|---|---|---|
| SigLiT | B/8 (frozen) | L* | 32k | 4 | 1 | 79.8 |
| SigLiT | g/14 (frozen) | L | 20k | 4 | 2 | 84.5 |
| SigLIP | B/16 (pre-trained, unlocked) | B | 16k | 16 | 3 | 71.0 |
| SigLIP | B/16 (random init) | B | 32k | 32 | 2 | 72.1 |
| SigLIP | B/16 (random init) | B | 32k | 32 | 5 | 73.4 |
L* denotes a variant of the L model with 12 layers.
In this paper, we propose a simpler alternative: the sigmoid loss. It does not require any operation across the full batch and hence greatly simplifies the distributed loss implementation and boosts efficiency. Additionally, it conceptually decouples the batch size from the definition of the task. We compare the proposed sigmoid loss with the standard softmax loss across multiple setups. In particular, we investigate the sigmoid-based loss with two prominent approaches for image-text learning, CLIP [36] and LiT [59], which we call sigmoid language image pre-training (SigLIP) and sigmoid LiT (SigLiT), respectively. We find that the sigmoid loss performs significantly better than the softmax loss when the batch size is smaller than 16k. As the training batch size grows, the gap closes. Importantly, the sigmoid loss is symmetric, requires just a single pass, and a typical implementation requires less memory than the softmax loss. This enables successful training of a SigLiT model at a batch size of one million. However, we find that the performance saturates with growing batch size, both for softmax and sigmoid. The good news is that a reasonable batch size, i.e. 32k, is sufficient for image-text pre-training. This conclusion also holds for multilingual SigLIP training on over 100 languages.
In Table 1, we present setups for image-text pre-training that require a moderate number of TPUv4 chips for training. SigLiT is surprisingly efficient, reaching 79.7% zero-shot accuracy on ImageNet in just a single day on four chips. SigLIP's more demanding from-scratch training reaches 73.4% zero-shot accuracy in 5 days with 32 TPUv4 chips. This compares favorably to prior works such as FLIP [30] and CLIP [36], which require approximately 5 and 10 days respectively on 256 TPUv3 cores. When fine-tuning a pre-trained vision backbone in SigLIP (the pre-trained, unlocked B/16 row in Table 1), we found that disabling the weight decay on the pre-trained backbone leads to better results (see Figure 4 for details). We hope our work paves the way for making the nascent language-image pre-training field more accessible.
2. Related Work
Contrastive learning with the sigmoid loss. One prior work proposes a similar sigmoid loss for the task of unsupervised dimensionality reduction [19]; in the scope of contrastive image-text learning, the vast majority of works rely on the softmax-based InfoNCE loss as popularized by [46]. In supervised classification, the sigmoid loss has already been shown to be slightly more effective and robust than the softmax loss [3, 51].
Contrastive language-image pre-training has become popular since CLIP [36] and ALIGN [23] applied softmax contrastive learning [60, 46, 10, 24] to large-scale image-text datasets. Both models perform very well on zero-shot transfer tasks, including classification and retrieval. Follow-up works show that contrastively pre-trained models produce good representations for fine-tuning [53, 16], linear regression [23], object detection [31], semantic segmentation [33] and video tasks [57].
Generative language-image pre-training. Besides softmax contrastive pre-training, various alternatives have been proposed. GIT [49], SimVLM [50], and LEMON [21] successfully pre-train models using a generative text decoder instead, while CoCa [56] adds such a decoder to the discriminative CLIP/ALIGN setup, thus combining the pros and cons of both approaches into a single very capable model. BLIP [28] further proposes CapFilt, which uses the generative decoder to create better captions and the discriminative part of the model to filter pairs. Language-image pre-training is a very active field and surveys [8] rapidly become outdated.
Algorithm 1 Sigmoid loss pseudo-implementation.
    # img_emb    : image model embedding [n, dim]
    # txt_emb    : text model embedding [n, dim]
    # t_prime, b : learnable temperature and bias
    # n          : mini-batch size
    t = exp(t_prime)
    zimg = l2_normalize(img_emb)
    ztxt = l2_normalize(txt_emb)
    logits = dot(zimg, ztxt.T) * t + b
    labels = 2 * eye(n) - ones(n)  # -1 with diagonal 1
    l = -sum(log_sigmoid(labels * logits)) / n
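For readers who want to run Algorithm 1, the following is a minimal NumPy sketch of the same steps. It is not the authors' released code: the helper names `l2_normalize`, `log_sigmoid`, and `sigmoid_loss`, and the random demo inputs, are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)


def sigmoid_loss(img_emb, txt_emb, t_prime, b):
    """Pairwise sigmoid loss following Algorithm 1; inputs are [n, dim]."""
    n = img_emb.shape[0]
    t = np.exp(t_prime)
    zimg = l2_normalize(img_emb)
    ztxt = l2_normalize(txt_emb)
    logits = (zimg @ ztxt.T) * t + b
    labels = 2.0 * np.eye(n) - np.ones((n, n))  # -1 everywhere, +1 on the diagonal
    return -np.sum(log_sigmoid(labels * logits)) / n


# Hypothetical demo inputs: 8 random image and text embeddings of width 16.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 16))
txt_emb = rng.normal(size=(8, 16))
loss = sigmoid_loss(img_emb, txt_emb, t_prime=np.log(10.0), b=-10.0)
print(f"sigmoid loss: {loss:.4f}")
```

Using `np.logaddexp` avoids overflow in `exp` for large negative arguments, which a direct `log(1 / (1 + exp(-x)))` would not.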
Efficient language-image pre-training. On the other hand, few works have tried making language-image pre-training more efficient. LiT [59] and FLIP [30] are notable attempts: the former requires a pre-trained and locked backbone, and the latter sacrifices quality by randomly dropping visual tokens. BASIC [35] and LAION [52] look at scaling the batch size but only go up to 16k and 160k respectively, by using many hundreds of chips, and for the former also mixing in a large private classification dataset [35, 55]. The recent Lion optimizer [12] claims to be able to reduce the training cost to reach similar quality.
3. Method
In this section, we first review the widely-used softmax-based contrastive loss. We then introduce the pairwise sigmoid loss and discuss its efficient implementation.
Given a mini-batch $\mathcal{B}=\{(I_{1},T_{1}),(I_{2},T_{2}),\ldots\}$ of image-text pairs, the contrastive learning objective encourages embeddings of matching pairs $(I_{i},T_{i})$ to align with each other, while pushing embeddings of unmatched pairs $(I_{i},T_{j\neq i})$ apart. For practical purposes, it is assumed that for all images $i$, the text associated with a different image $j$ is not related to $i$, and vice-versa. This assumption is usually noisy and imperfect.
Figure 1: Efficient loss implementation demonstrated via a mock setup with 3 devices and a global batch size of 12. There are no all-gathers, and at any point in time only the bright yellow square (size $4\times4$) is materialized in memory.
(a) Initially each device holds 4 image and 4 text representations. Each device needs to see the representations from other devices to calculate the full loss.
(b) They each compute the component of the loss (highlighted) for their representations, which includes the positives.
(c) Texts are swapped across the devices, so device 1 now has $I_{1:4}$ and $T_{5:8}$ etc. The new loss is computed and accumulated with the previous.
(d) This repeats till every image and text pair have interacted, e.g. device 1 has the loss of $I_{1:4}$ and $T_{1:12}$. A final cross-device sum brings everything together.
3.1. Softmax loss for language image pre-training
When using the softmax loss to formalize this objective, an image model $f(\cdot)$ and a text model $g(\cdot)$ are trained to minimize the following objective:
$$-\frac{1}{2|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\left(\overbrace{\log\frac{e^{t\mathbf{x}_{i}\cdot\mathbf{y}_{i}}}{\sum_{j=1}^{|\mathcal{B}|}e^{t\mathbf{x}_{i}\cdot\mathbf{y}_{j}}}}^{\text{image}\rightarrow\text{text softmax}}+\overbrace{\log\frac{e^{t\mathbf{x}_{i}\cdot\mathbf{y}_{i}}}{\sum_{j=1}^{|\mathcal{B}|}e^{t\mathbf{x}_{j}\cdot\mathbf{y}_{i}}}}^{\text{text}\rightarrow\text{image softmax}}\right)$$
where $\mathbf{x}_{i}=\frac{f(I_{i})}{\|f(I_{i})\|_{2}}$ and $\mathbf{y}_{i}=\frac{g(T_{i})}{\|g(T_{i})\|_{2}}$. In this paper, we adopt the vision transformer architecture [17] for images and the transformer architecture [47] for texts. Note that due to the asymmetry of the softmax loss, the normalization is independently performed two times: across images and across texts [36]. The scalar $t$ is parametrized as $\exp(t^{\prime})$ where $t^{\prime}$ is a global freely learnable parameter.
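To make the two normalizations of the softmax objective concrete, here is a small NumPy sketch. It is an illustration under assumptions rather than the implementation used in the paper: the function names are hypothetical, and the max-subtraction is the standard stabilization trick mentioned in the introduction.

```python
import numpy as np


def log_softmax(logits, axis):
    """Stable log-softmax: subtract the maximum along `axis` before exponentiating."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))


def softmax_contrastive_loss(img_emb, txt_emb, t_prime):
    """Symmetric softmax contrastive loss over a batch of [n, dim] embeddings."""
    n = img_emb.shape[0]
    x = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    y = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = np.exp(t_prime) * (x @ y.T)
    img_to_txt = np.diag(log_softmax(logits, axis=1))  # normalize over all texts
    txt_to_img = np.diag(log_softmax(logits, axis=0))  # normalize over all images
    return -np.sum(img_to_txt + txt_to_img) / (2 * n)


# Hypothetical demo: perfectly aligned, mutually orthogonal pairs.
emb = np.eye(4)
print(softmax_contrastive_loss(emb, emb, t_prime=np.log(100.0)))
```

Both `log_softmax` calls need a reduction over a full row or column of the similarity matrix, which is the global view of the batch that the sigmoid loss avoids.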
3.2. Sigmoid loss for language image pre-training
Instead of the softmax-based contrastive loss, we propose a simpler alternative that does not require computing global normalization factors. The sigmoid-based loss processes every image-text pair independently, effectively turning the learning problem into standard binary classification on the dataset of all pair combinations, with positive labels for the matching pairs $(I_{i},T_{i})$ and negative labels for all other pairs $(I_{i},T_{j\neq i})$. It is defined as follows:
$$-\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{j=1}^{|\mathcal{B}|}\underbrace{\log\frac{1}{1+e^{-z_{ij}(t\mathbf{x}_{i}\cdot\mathbf{y}_{j}+b)}}}_{\mathcal{L}_{ij}}$$
where $z_{ij}$ is the label for a given image and text input, which equals 1 if they are paired and $-1$ otherwise. At initialization, the heavy imbalance coming from the many negatives dominates the loss, leading to large initial optimization steps attempting to correct this bias. To alleviate this, we introduce an additional learnable bias term $b$, similar to the temperature $t$. We initialize $t^{\prime}$ and $b$ to $\log 10$ and $-10$ respectively. This makes sure the training starts roughly close to the prior and does not require massive over-correction. Algorithm 1 presents a pseudocode implementation of the proposed sigmoid loss for language image pre-training.
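The effect of the bias initialization can be checked numerically. The sketch below evaluates the loss of Algorithm 1 on random unit-norm embeddings, once with $b=0$ and once with $b=-10$, both with $t=10$. The batch size, embedding width, and random inputs are arbitrary assumptions chosen only to illustrate the imbalance argument.

```python
import numpy as np


def sigmoid_loss_from_logits(logits):
    """Sigmoid loss of Algorithm 1, given the [n, n] matrix of pair logits."""
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - np.ones((n, n))
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n


# Hypothetical setup: untrained models produce unrelated, random embeddings.
rng = np.random.default_rng(0)
n, dim = 256, 64
x = rng.normal(size=(n, dim))
y = rng.normal(size=(n, dim))
x /= np.linalg.norm(x, axis=-1, keepdims=True)
y /= np.linalg.norm(y, axis=-1, keepdims=True)
sims = x @ y.T

t = 10.0  # exp(t') with t' = log 10
loss_no_bias = sigmoid_loss_from_logits(t * sims + 0.0)
loss_with_bias = sigmoid_loss_from_logits(t * sims - 10.0)
print(f"b = 0:   {loss_no_bias:.2f}")
print(f"b = -10: {loss_with_bias:.2f}")
```

With $b=0$, each of the $n-1$ negatives per image contributes a sizeable term, so the loss grows with the batch size; with $b=-10$ the negatives are already predicted as negative and the initial loss is much smaller.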
3.3. Efficient "chunked" implementation
Contrastive training typically utilizes data parallelism. Computing the loss when data is split across $D$ devices necessitates gathering all embeddings [59] with expensive all-gathers and, more importantly, the materialization of a memory-intensive $|\mathcal{B}|\times|\mathcal{B}|$ matrix of pairwise similarities.
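The sigmoid loss, however, is particularly amenable to a memory-efficient, fast, and numerically stable implementation that ameliorates both these issues. Denoting the per-device batch size as $b=\frac{|\mathcal{B}|}{D}$, the loss is reformulated as:
$$-\frac{1}{|\mathcal{B}|}\underbrace{\sum_{d_{i}=1}^{D}}_{\textbf{A}:\ \forall\ \text{device}\ d_{i}}\ \underbrace{\sum_{d_{j}=1}^{D}}_{\textbf{B}:\ \text{swap negatives across devices}}\ \underbrace{\sum_{i=b(d_{i}-1)+1}^{b\,d_{i}}\ \sum_{j=b(d_{j}-1)+1}^{b\,d_{j}}}_{\textbf{C}:\ \text{per-device loss}}\mathcal{L}_{ij}$$
where $\mathcal{L}_{ij}$ is the log-sigmoid term of the sigmoid loss from Section 3.2 for the pair $(I_{i},T_{j})$.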
This is particularly simple for the sigmoid loss as each pair is an independent term in the loss. Figure 1 illustrates this method. In words, we first compute the component of the loss corresponding to the positive pairs and $b-1$ negative pairs. We then permute representations across devices, so each device takes negatives from its neighbouring device (next iteration of sum B). The loss is then calculated with respect to this chunk (sum C). This is done independently on each device, such that each device computes the loss with respect to its local batch $b$. Losses can then simply be summed across all devices (sum A). Individual collective permutes (for sum B) are fast (and indeed $D$ collective permutes are typically faster than two all-gathers between $D$ devices), and the memory cost at any given moment is reduced from $|\mathcal{B}|^{2}$ to $b^{2}$ (for sum C). Usually $b$ is constant, as scaling $|\mathcal{B}|$ is achieved by increasing the number of accelerators. Because it is quadratic with respect to the batch size, the vanilla loss computation rapidly becomes a bottleneck when scaling up. This chunked approach enabled training with batch sizes over 1 million on relatively few devices.
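The chunked computation can be simulated on a single machine to confirm that it reproduces the full-batch loss. The sketch below mimics the mock setup of Figure 1 (3 devices, global batch size 12) with plain NumPy loops. The rotation order, function names, and random inputs are assumptions for illustration; a real implementation would use collective permutes between accelerators instead of list indexing.

```python
import numpy as np


def block_loss(zimg, ztxt, t, bias, has_positives):
    """Sum of -log sigmoid(label * logit) over one block of image-text pairs."""
    logits = t * (zimg @ ztxt.T) + bias
    labels = -np.ones_like(logits)
    if has_positives:  # only a device's own texts contain its positives
        labels += 2.0 * np.eye(logits.shape[0])
    return np.sum(np.logaddexp(0.0, -labels * logits))


def chunked_sigmoid_loss(zimg, ztxt, t, bias, num_devices):
    """Accumulate the loss chunk by chunk; only a b x b block exists at a time."""
    n = zimg.shape[0]
    img_chunks = np.split(zimg, num_devices)
    txt_chunks = np.split(ztxt, num_devices)
    total = 0.0
    for step in range(num_devices):          # sum B: one round per text permutation
        for d in range(num_devices):         # sum A: every device (parallel in practice)
            src = (d + step) % num_devices   # owner of the texts device d holds now
            total += block_loss(img_chunks[d], txt_chunks[src], t, bias,
                                has_positives=(src == d))  # sum C
    return total / n


# Hypothetical demo matching Figure 1: 3 devices, global batch of 12.
rng = np.random.default_rng(0)
zimg = rng.normal(size=(12, 8))
ztxt = rng.normal(size=(12, 8))
zimg /= np.linalg.norm(zimg, axis=-1, keepdims=True)
ztxt /= np.linalg.norm(ztxt, axis=-1, keepdims=True)

full = block_loss(zimg, ztxt, 10.0, -10.0, has_positives=True) / 12
chunked = chunked_sigmoid_loss(zimg, ztxt, 10.0, -10.0, num_devices=3)
print(full, chunked)
```

Because every pair contributes an independent term, the order in which blocks are visited does not matter; the same decomposition would not work for the softmax loss without first computing its global normalizers.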

Figure 2: The effect of pre-training batch size. Left: SigLiT results, trained for 18B seen examples. The sigmoid loss outperforms the softmax loss significantly with small batch sizes, and performs similarly at larger batch sizes. We successfully trained a SigLiT model with up to one million batch size. However, performance for both sigmoid and softmax saturates at around 32k batch size. Middle: SigLIP results, trained for 9B seen examples. Both the sigmoid loss and the softmax loss saturate at a reasonable batch size, while the peak of the sigmoid loss comes earlier and slightly outperforms the peak of the softmax loss. A very large batch size hurts both losses. Right: mSigLIP results, trained for 30B seen examples. With a multilingual setup using over 100 languages, a 32k batch size is surprisingly sufficient, and scaling beyond that hurts performance on a 36-language cross-modal retrieval task.
4. Results
In this section, we evaluate the proposed SigLiT and SigLIP models across a wide range of batch sizes. We discuss what can be achieved with a small number of accelerator chips, using both SigLiT and SigLIP recipes. We also briefly discuss the impact of batch size on multilingual language image pre-training. We ablate the importance of our large-batch stabilization modification and the introduced learned bias term, and present a study on the effect of the ratio of positive to negative pairs in the sigmoid loss. Lastly, we explore SigLIP's data noise robustness.
To validate our models, we report zero-shot transfer results on the ImageNet dataset [14] and zero-shot retrieval results across 36 languages on the XM3600 dataset [44]. We use the ScalingViT-Adafactor optimizer [58] by default for all our experiments.
4.1. SigLiT: Scaling batch size to the limit
Following [59], we use the same precomputed embeddings for the images using a ViT-g vision model, and train a Base-size text tower from scratch with the same hyperparameters using the LiT image-text dataset [59].
We perform a study over a wide range of batch sizes, from 512 to 1M, demonstrating the impact of batch size for contrastive learning. Results are presented in Figure 2 (left). When the batch size is smaller than 16k, the sigmoid loss outperforms the softmax loss by a large margin. With growing batch sizes, we observe that the softmax loss quickly catches up and potentially slightly underperforms the sigmoid loss with a large enough batch size. Overall, we recommend using the SigLIP recipe for large batch sizes as well, due to the simplicity, compute savings, and straightforward memory-efficient implementation.
Figure 3: SigLiT ImageNet 0-shot transfer results with different training durations. A large batch size results in a big performance boost, but needs a sufficiently long schedule to ramp up, since for short schedules a very large batch size results in a small number of gradient update steps.
There is a consensus that contrastive learning benefits from large batch sizes, while most of the existing studies stop at 64k batch size [59, 35, 10]. We successfully trained a SigLiT model at one million batch size to explore the limit of contrastive learning. To our surprise, the performance saturates at 32k batch size; further scaling up the batch size only gives a minor boost, and the model peaks at 256k batch size. Our best SigLiT with a B-sized text model achieves 84.7% zero-shot transfer accuracy on ImageNet, while the original LiT paper reports a slightly better 85.2% score with a 10 times larger g-sized text model. Figure 3 presents the impact of training duration for different batch sizes. It demonstrates that a large 262k batch size significantly outperforms a smaller 8k batch size when trained for a sufficiently long time. Note that for short training durations, a large batch size leads to fewer update steps in absolute terms and thus needs more time to ramp up.
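The trade-off between batch size and the number of update steps is simple arithmetic: at a fixed budget of seen examples, the number of gradient updates is the budget divided by the batch size. The sketch below uses the 18B-example budget quoted for Figure 2 (left) purely as an example, and assumes that "8k" and "262k" mean 8,192 and 262,144.

```python
def update_steps(examples_seen, batch_size):
    """Number of full gradient updates for a fixed budget of seen examples."""
    return examples_seen // batch_size


examples_seen = 18_000_000_000      # example budget, taken from Figure 2 (left)
small_batch = 8 * 1024              # assumed meaning of "8k"
large_batch = 256 * 1024            # assumed meaning of "262k"

steps_small = update_steps(examples_seen, small_batch)
steps_large = update_steps(examples_seen, large_batch)
print(steps_small, steps_large, steps_small // steps_large)
```

Under these assumptions the large batch performs 32 times fewer updates for the same number of seen examples, which is why it needs a longer schedule before its advantage shows.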
4.2. SigLIP: Sigmoid loss is beneficial for language-image pre-training
We pre-train SigLIP models on the WebLI dataset [13], using only English image and text pairs. We use CLIP (WebLI) to denote the CLIP baseline pre-trained on WebLI with the standard softmax loss. We use moderately-sized models: a B/16 ViT for image embeddings and a B-sized transformer for text embeddings. The input images are resized to $224\times224$ resolution. The text is tokenized by a 32k-vocabulary SentencePiece tokenizer [27] trained on the English C4 dataset [37], and a maximum of 16 text tokens are kept. The middle plot of Figure 2 shows the SigLIP results. With less than 32k batch size, SigLIP outperforms the CLIP (WebLI) baselines. On the other end of the scale, the memory efficiency of the sigmoid loss enables much larger batch sizes. For example, with four TPU-v4 chips, we could fit a batch size of 4096 with a Base SigLIP but only 2048 with a corresponding CLIP model. The two advantages together demonstrate significant benefits of the sigmoid loss for language image pre-training with fixed resources, which will be discussed in Section 4.5.
| Batch size | 16k | 32k | 64k | 128k | 240k |
|---|---|---|---|---|---|
| INet-0 | 71.6 | 73.2 | 73.2 | 73.2 | 73.1 |
| XM avg | 34.8 | 34.9 | 34.4 | 33.6 | 32.7 |
| XM de | 54.7 | 54.8 | 55.4 | 54.3 | 54.7 |
| XM en | 46.5 | 46.2 | 46.5 | 46.6 | 46.6 |
| XM hi | 9.1 | 8.5 | 7.9 | 8.1 | 7.3 |
| XM ru | 50.1 | 49.9 | 49.7 | 48.6 | 49.3 |
| XM zh | 30.7 | 32.5 | 32.0 | 30.6 | 23.7 |
Table 2: Multilingual SigLIP results with various batch sizes, pre-trained for 30 billion seen examples. We report zero-shot transfer results on ImageNet (INet-0) and averaged text to image retrieval results across 36 languages on the Crossmodal-3600 dataset (XM). The full table on 36 languages can be found in the Appendix.
As batch size increases, the gap between the sigmoid and the softmax losses diminishes. SigLIP performs best at batch size 32k, whereas the softmax loss required 98k for optimal performance and still did not outperform the sigmoid-based variant. Scaling further, a larger batch size like 307k hurts both losses.
4.3. mSigLIP: Multi-lingual pre-training
We further scale up the training data by keeping all the 100 languages from the WebLI dataset [13]. With multilingual data, one usually needs to use a larger international vocabulary. We first verify the impact of two tokenizers: a small multilingual vocabulary with 32k tokens [37], and a large multilingual vocabulary with 250k tokens [54]. We train B-sized ViT and text models for 900M total examples seen, and observe slightly more than 1% improvement when using the larger vocabulary.
However, the token embeddings become huge for very large vocabulary sizes. Following the standard setup, we would need to store an $N\times W$ token embedding lookup table to train the multilingual model, where $N$ is the vocabulary size mentioned above and $W$ is the embedding dimension of the text model. To save memory, we propose to use a "bottlenecked" token embedding. We use an $N\times K$ embedding matrix and an additional $K\times W$ projection, where the bottleneck $K$ is much smaller than $W$.
In our experiments, we observed that using a large multilingual vocabulary with a bottleneck can be scaled up as efficiently as using a small multilingual vocabulary. Specifically, by enabling a bottleneck of size $K=96$ for the Base architecture with $W=768$, we only see about a half percent quality drop on ImageNet zero-shot transfer, compared to using the full 250k vocabulary.
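The memory saving of the bottlenecked embedding follows directly from the parameter counts $N\times W$ versus $N\times K+K\times W$. The sketch below works this out for $W=768$ and $K=96$, assuming that "250k" means exactly 250,000 tokens, and shows a toy lookup. The class `BottleneckedEmbedding`, its initialization scale, and the small demo vocabulary are hypothetical and not taken from the released code.

```python
import numpy as np


def embedding_params(vocab_size, width, bottleneck=None):
    """Parameter count of a full or a bottlenecked token embedding."""
    if bottleneck is None:
        return vocab_size * width
    return vocab_size * bottleneck + bottleneck * width


class BottleneckedEmbedding:
    """Lookup into an [N, K] table followed by a [K, W] projection."""

    def __init__(self, vocab_size, width, bottleneck, rng):
        self.table = rng.normal(scale=0.02, size=(vocab_size, bottleneck))
        self.proj = rng.normal(scale=0.02, size=(bottleneck, width))

    def __call__(self, token_ids):
        return self.table[token_ids] @ self.proj


N, W, K = 250_000, 768, 96          # N is an assumed exact value for "250k"
full_params = embedding_params(N, W)
small_params = embedding_params(N, W, K)
print(full_params, small_params, round(full_params / small_params, 2))

# Toy lookup with a small hypothetical vocabulary to keep the demo light.
emb = BottleneckedEmbedding(vocab_size=1000, width=W, bottleneck=K,
                            rng=np.random.default_rng(0))
out = emb(np.array([[1, 2, 3], [4, 5, 6]]))
print(out.shape)
```

Under these assumptions the embedding shrinks from 192M to roughly 24M parameters, close to the factor $W/K=8$, because the $K\times W$ projection is negligible next to the lookup table.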

Figure 4: Top: SigLIP with pre-trained encoders ramps up quickly. However, only disabling weight decay on the pre-trained encoder weights leads to stable behavior and good ImageNet 0-shot transfer results. Bottom: ImageNet 10-shot transfer results, where decaying the pre-trained weights leads to deterioration of the pre-trained model's visual representation quality. Disabling weight decay flattens the curve.
With the memory improvements, we train mSigLIP models for various batch sizes, for a total of 30 billion examples seen. Table 2 and Figure 2 (right plot) show the results. We were expecting a large batch size to improve multilingual pre-training, where the model sees more examples from the same language as hard negatives in a single mini-batch. However, we did not observe clear improvements with a batch size larger than 32k. A batch size of 32k is sufficient for a multilingual setup as well. On the XM3600 cross-modal retrieval tasks, we found that going beyond 32k batch size leads to worse results on average, while on ImageNet zero-shot transfer it stays flat. mSigLIP sets the new state of the art on the XM3600 text to image retrieval task with only a Base-size model. Our best result is 34.9%, which is more than 6% higher than the previously reported result of 28.5% [13], obtained with a standard LiT model [59] using a much larger four-billion-parameter ViT-e model. We further scale up mSigLIP training in Section 4.6.
4.4. SigLiT with four TPU-v4 chips
For many practitioners, the important question usually is "what can be trained with a limited amount of resources?" We explore the usage of SigLiT models in this section with only four TPU-v4 chips, as the memory-efficient sigmoid loss is suitable for this application scenario.

Figure 5: The effect of Adam and AdaFactor's $\beta_{2}$. As we increase the batch size, we observe more frequent training instability. This instability seen in the loss curves (top) is caused by spikes in gradient norm (middle) leading to large parameter updates (bottom). Decreasing the $\beta_{2}$ momentum stabilizes training. Occasional gradient spikes still happen (see step at 2B), but do not destabilize the training process.
We follow the same setup as in Section 4.1. We use the publicly available ViT-AugReg-B/8 [42] model as the frozen (locked) vision tower, and precompute embeddings to accelerate the training [59]. The text model is a Large Transformer, but with a depth of only 12 layers (instead of 24). It is trained using the LION [12] optimizer with decoupled weight decay $1\times10^{-7}$, a linear warm-up of the learning rate over $6.5\,\mathrm{k}$ steps up to a peak of $1\times10^{-4}$, followed by a cosine decay to 0. We train for a total of $65\,000$ steps with a batch size of $32\,\mathrm{k}$, which takes just under one day. Table 1 shows the results when training a model on four chips for one day, achieving $79.7\%$ 0-shot ImageNet classification accuracy, which is very competitive in this limited-resource regime. With a ViT-g/14 [58] model as the vision tower and a Large text tower, we can train at $20\,\mathrm{k}$ batch size on four chips for $107\,\mathrm{k}$ steps in under two days. This further pushes the 0-shot ImageNet classification accuracy up to $84.5\%$.
我们遵循与第4.1节相同的设置。我们使用公开的ViT-AugReg-B/8 [42]模型作为冻结(锁定)的视觉塔,并预计算嵌入以加速训练 [59]。文本模型是一个大型Transformer,但深度仅为12层(而不是24层)。它使用LION [12]优化器进行训练,解耦权重衰减为 $1\times10^{-7}$ ,学习率在 $6.5\,\mathrm{k}$ 步内线性预热至峰值 $1\times10^{-4}$ ,随后余弦衰减至 0。我们总共训练了 $65\,000$ 步,批量大小为 $32\,\mathrm{k}$ ,训练时间略少于一天。表1显示了在四块芯片上训练一天的结果,实现了 $79.7\%$ 的零样本ImageNet分类准确率;在有限的资源条件下非常有竞争力。使用ViT-g/14 [58]模型作为视觉塔和大型文本塔,我们可以在四块芯片上以 $20\,\mathrm{k}$ 的批量大小训练 $107\,\mathrm{k}$ 步,时间不到两天。这将零样本ImageNet分类准确率进一步提升至 $84.5\%$。
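The learning-rate schedule described above (linear warm-up, then cosine decay to zero) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `warmup_cosine_lr` is hypothetical, and the default values are simply the ones quoted in the text (peak $1\times10^{-4}$, 6.5 k warm-up steps, 65 k total steps).

```python
import math

def warmup_cosine_lr(step, peak_lr=1e-4, warmup_steps=6_500, total_steps=65_000):
    """Learning rate at `step`: linear warm-up to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Start, mid warm-up, peak, mid decay, and end of training.
for s in (0, 3_250, 6_500, 35_750, 65_000):
    print(s, warmup_cosine_lr(s))
```

With these defaults the decay phase spans the remaining 58.5 k steps, so the rate is back at half the peak at step 35 750 and reaches zero at step 65 000.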
4.5. SigLIP with a small amount of TPU-v4 chips
4.5. 少量 TPU-v4 芯片上的 SigLIP
Training a CLIP model from scratch is generally resource demanding; with SigLIP, it is possible to fit a larger training batch size on fewer chips. In this section, we explore ways to train SigLIP models efficiently with pre-trained weights. We use pre-trained weights to initialize the image model to accelerate the pre-training, as originally discussed in [59]. We use the public and unlocked ViT-AugReg-B/16 [42] model to initialize our vision tower and fine-tune on the same WebLI English data as used for SigLIP. In all the experiments, we apply a 0.1 learning rate multiplier to the pre-trained image tower to make it suitable for fine-tuning.
训练一个 CLIP 模型通常需要大量资源,而使用 SigLIP 则可以在更少的芯片上适应更大的训练批次。在本节中,我们探讨了如何使用预训练权重高效地训练 SigLIP 模型。我们使用预训练权重来初始化图像模型以加速预训练,这一点最初在 [59] 中讨论过。我们使用公开且未锁定的 ViT-AugReg-B/16 [42] 模型来初始化我们的视觉塔,并在与 SigLIP 相同的 WebLI 英文数据集上进行微调。在所有实验中,我们对预训练的图像塔应用 0.1 的学习率乘数,使其适合微调。

Figure 6: The effect of batch composition. We simulate various batch compositions by masking out negatives, either randomly, keeping only the hardest, or keeping only the easiest. With no masking, we have $16\,\mathrm{k}$ negatives for each positive in the batch (1:16k), and the strongest masking we apply (1:1.6) results in almost balanced mini-batches. In one setting, we match the total pairs seen by training for significantly longer. We report the ImageNet 0-shot score, the final value of the learned bias, and the average logits of positive and negative pairs. Overall, the imbalance does not seem to be detrimental, but finding an efficient way of mining negatives might be beneficial.
图 6: 批次组成的影响。我们通过随机掩码、仅保留最难的或最简单的负样本来模拟不同的批次组成。在没有掩码的情况下,每个正样本对应 $16,\mathrm{k}$ 个负样本 (1:16k),而最强的掩码 (1:1.6) 导致几乎平衡的小批次。在一种设置中,我们通过显著延长训练时间来匹配总样本对。我们观察到 ImageNet 零样本分数、学习偏置的最终值以及正负样本对的平均 logits。总体而言,不平衡似乎并不有害,但找到一种有效的负样本挖掘方法可能是有益的。
Figure 4 presents unlocked fine-tuning results alongside from-scratch randomly initialized baselines. We use 16 TPU-v4 chips and train at $16\,\mathrm{k}$ batch size for $2.4\,\mathrm{B}$ examples seen. We found that the fine-tuning setup does not perform well out of the box; this is consistent with prior work [59], where fine-tuning image models degraded visual representation quality. This is evidenced by ImageNet 10-shot linear classification, where in Figure 4 the fine-tuned setup is barely better than the from-scratch baseline.
图 4 展示了解锁的微调结果以及从头开始随机初始化的基线。我们使用了 16 个 TPU-v4 芯片,并在 $16\,\mathrm{k}$ 的批量大小下训练了 $2.4\,\mathrm{B}$ 个样本。我们发现,微调设置在没有额外调整的情况下表现不佳;这与之前的工作 [59] 一致,其中微调图像模型降低了视觉表示质量。这一点在 ImageNet 10-shot 线性分类中得到了验证,在图 4 中,微调设置仅略优于从头开始的基线。
We hypothesize that the default weight decay applied to the pre-trained weights reduces their effectiveness. Motivated by the fine-tuning recipe from [17, 58, 25], which uses no weight decay, we propose disabling weight decay on the pre-trained weights for SigLIP training. Weight decay is therefore only applied to the randomly initialized weights in the text model. This simple modification significantly improved SigLIP results. Figure 4 shows that with our improved recipe, SigLIP reaches $71\%$ 0-shot accuracy on ImageNet, using a $16\,\mathrm{k}$ batch size, trained on 16 chips for three days. We also present from-scratch results in the bottom rows of Table 1: with 32 TPUv4 chips for only two days, SigLIP achieves $72.1\%$ 0-shot accuracy. This is a significant reduction in training cost, e.g. compared to CLIP (approx. 2500 TPUv3-days for $72.6\%$) as reported in [30].
我们假设应用于预训练权重的默认权重衰减降低了它们的有效性。受 [17, 58, 25] 中不使用权重衰减的微调方法的启发,我们建议在 SigLIP 训练中禁用预训练权重的权重衰减。因此,权重衰减仅应用于文本模型中随机初始化的权重。这一简单的修改显著提升了 SigLIP 的结果。图 4 显示,使用我们改进的方法,SigLIP 在 ImageNet 上达到了 $71%$ 的零样本准确率,使用 $16k$ 的批量大小,在 16 个芯片上训练了三天。我们还在表 1 的底部展示了从头开始训练的结果:仅使用 32 个 TPUv4 芯片训练两天,SigLIP 达到了 $72.1%$ 的零样本准确率。与 [30] 中报告的 CLIP(约 2500 个 TPUv3 天达到 $72.6%$)相比,这显著降低了训练成本。
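The fine-tuning recipe above (a 0.1 learning-rate multiplier on the pre-trained image tower, and weight decay only on randomly initialized weights) amounts to choosing optimizer settings per parameter. The sketch below is one hypothetical way to express that rule. The `img/` and `txt/` name prefixes are assumptions made for illustration, not the naming used in the authors' codebase, and the default learning rate and weight decay are the 0.001 and 0.0001 defaults quoted in Appendix C.

```python
def optimizer_settings(param_name, base_lr=1e-3, base_wd=1e-4, image_lr_mult=0.1):
    """Return (learning_rate, weight_decay) for a single parameter.

    Assumption for illustration: image-tower parameters are named "img/..."
    and everything else (e.g. the text tower, "txt/...") is randomly initialized.
    """
    if param_name.startswith("img/"):
        # Pre-trained image tower: reduced learning rate, weight decay disabled.
        return base_lr * image_lr_mult, 0.0
    # Randomly initialized weights keep the default settings.
    return base_lr, base_wd

for name in ("img/block0/kernel", "txt/block0/kernel"):
    print(name, optimizer_settings(name))
```

Keeping the rule in one small function makes it easy to audit which parameters are exempt from weight decay before handing the settings to an optimizer.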
4.6. Scaling up SigLIP and mSigLIP
4.6. 扩展 SigLIP 和 mSigLIP
In this section, we scale up SigLIP by “over training” the model [45, 1]. We present results in Table 3 using ViT-B, ViT-L or $\mathrm{So}{-}400\mathrm{m}$ [1] as the vision encoder, with a text encoder of the same size (B, L and $\mathrm{So}{-}400\mathrm{m}$ respectively). Following the recipe described in Section 4.2, we train both models for 40 billion examples seen at batch size $32\,\mathrm{k}$, but use $(256/16)^{2}=256$ image patches and 64 text tokens (instead of 16). To get SigLIP models for different resolutions, we train for 5 billion more examples at the target resolution, with a $100\times$ smaller learning rate and no weight decay. In Table 3, we report zero-shot classification results on ImageNet [14], ObjectNet [2], ImageNet-v2 [39], ImageNet ReaL [3], and zero-shot image-to-text $(\mathrm{I}{\rightarrow}\mathrm{T})$ retrieval and text-to-image $(\mathrm{T}{\rightarrow}\mathrm{I})$ retrieval results on MSCOCO [11].
在本节中,我们通过“过度训练”模型 [45, 1] 来扩展 SigLIP。我们在表 3 中展示了使用 ViT-B、ViT-L 或 $\mathrm{So}{-}400\mathrm{m}$ [1] 作为视觉编码器的结果,并搭配相同大小的文本编码器(分别为 B、L 和 $\mathrm{So}{-}400\mathrm{m}$)。按照第 4.2 节中描述的方法,我们以批量大小 $32\,\mathrm{k}$ 对两个模型进行了 400 亿个样本的训练,但使用了 $(256/16)^{2}=256$ 的图像块和 64 个文本 Token(而不是 16)。为了获得不同分辨率的 SigLIP 模型,我们在目标分辨率下额外训练了 50 亿个样本,学习率缩小了 $100\times$,并且没有权重衰减。在表 3 中,我们报告了在 ImageNet [14]、ObjectNet [2]、ImageNet-v2 [39]、ImageNet ReaL [3] 上的零样本分类结果,以及在 MSCOCO [11] 上的零样本图像到文本 $(\mathrm{I}{\rightarrow}\mathrm{T})$ 检索和文本到图像 $(\mathrm{T}{\rightarrow}\mathrm{I})$ 检索结果。
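The patch count $(256/16)^{2}=256$ quoted above follows from tiling a square image with non-overlapping square patches. As a small worked example, the sketch below repeats that arithmetic for the input resolutions listed in the model card. The helper `num_patches` is hypothetical, and a patch size of 16 is assumed for every resolution purely for illustration.

```python
def num_patches(resolution, patch_size=16):
    """Number of non-overlapping square patches covering a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    return (resolution // patch_size) ** 2

for res in (224, 256, 384, 512):
    print(res, num_patches(res))
```

Because the count grows quadratically with resolution, doubling the resolution from 256 to 512 quadruples the image sequence length from 256 to 1024 patches.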
We also scale up the multilingual mSigLIP ViT-B model in the same way. We report image-text retrieval results across 36 languages on the XM3600 benchmark [44]. The scaled-up mSigLIP ViT-B model achieves state-of-the-art $42.6\%$ image retrieval recall@1 and $54.1\%$ text retrieval recall@1 for a Base model. This is slightly outperformed by the Large model in [48], which reaches $42.96\%$ image retrieval recall@1. Detailed results are provided in Appendix Table 9 and Figure 8, denoted as $^{*}32\,\mathrm{k}$.
我们同样以相同的方式扩大了多语言 mSigLIP ViT-B 模型的规模。我们在 XM3600 基准测试 [44] 中报告了 36 种语言的图像-文本检索结果。扩大后的 mSigLIP ViT-B 模型在 Base 模型上实现了最先进的 $42.6\%$ 图像检索召回率@1 和 $54.1\%$ 文本检索召回率@1。这一结果略低于 [48] 中的 Large 模型,后者实现了 $42.96\%$ 的图像检索召回率@1。详细结果见附录表 9 和图 8,标记为 $^{*}32\,\mathrm{k}$。
4.7. Stabilizing large-batch training
4.7. 稳定大规模批量训练
As we move to large batch sizes, language-image pre-training using transformers becomes increasingly unstable, even when using a modestly sized model (e.g. Base size). The reason for these instabilities is large spikes in the gradient norms, which translate to large-magnitude changes in the weights that may destabilize the training process (see Figure 5). We observe that reducing $\beta_{2}$ in Adam and AdaFactor from its default 0.999 to 0.95 (as suggested in [20, 9]) is enough to stabilize the training. Intuitively, this allows recovering from gradient spikes more quickly. We opt for setting $\beta_{2}=0.95$ for all our experiments.
随着批量大小的增加,使用 Transformer 的语言图像预训练变得越来越不稳定,即使是使用中等规模的模型(例如 Base 大小)也是如此。这些不稳定的原因是梯度范数中的大幅波动,这会导致权重大幅度变化,从而可能破坏训练过程,见图 5。我们观察到,将 Adam 和 AdaFactor 中的 $\beta_{2}$ 从默认的 0.999 降低到 0.95(如 [20, 9] 所建议的)足以稳定训练。直观地说,这允许更快地从梯度波动中恢复。我们选择在所有实验中设置 $\beta_{2}=0.95$。
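One way to read the intuition about $\beta_{2}$ is through the exponential moving average of squared gradients that Adam-style optimizers maintain, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$. The sketch below is an illustration of that standard update only, not an analysis from the paper: it computes how much weight the newest squared gradient receives and how many steps it takes for a past gradient's weight to halve. Both helper names are hypothetical.

```python
import math

def ema_half_life(beta2):
    """Steps after which a past squared gradient's weight in the EMA has halved."""
    return math.log(0.5) / math.log(beta2)

def newest_gradient_weight(beta2):
    """Weight of the current squared gradient in v_t = beta2*v_{t-1} + (1-beta2)*g_t**2."""
    return 1.0 - beta2

for beta2 in (0.999, 0.95):
    print(beta2, round(ema_half_life(beta2), 1), newest_gradient_weight(beta2))
```

With $\beta_{2}=0.999$ the half-life is roughly 693 steps, against roughly 14 steps for $\beta_{2}=0.95$, so the second-moment estimate tracks a sudden change in gradient scale far sooner with the smaller value.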
Table 3: Comparison with other publicly released models. Our SigLIP models outperform all prior models, e.g. OpenCLIP [22] and CLIP [36], by a significant margin on both zero-shot classification and retrieval tasks. Compared to the concurrent EVA-CLIP [43] and CLIPA-v2 [29], our SigLIP-L performs better across the board, in both the low- and high-resolution cases. Especially noteworthy is the Shape-Optimized 400M parameter ViT [1] architecture, which outperforms all significantly larger models. We publicly release our models: https://github.com/google-research/big_vision.
表 3: 与其他公开发布的模型对比。我们的 SigLIP 模型在零样本分类和检索任务上均显著优于所有先前模型,例如 OpenCLIP [22] 和 CLIP [36]。与同时期的 EVA-CLIP [43] 和 CLIPA-v2 [29] 相比,我们的 SigLIP-L 在低分辨率和高分辨率情况下均表现更好。尤其值得注意的是 Shape-Optimized 400M 参数 ViT [1] 架构,它在所有显著更大的模型中表现优异。我们公开了我们的模型:https://github.com/google-research/big_vision。
4.8. Negative ratio in sigmoid loss
4.8. Sigmoid 损失中的负比率
One question that arises when shifting the perspective from the softmax's “pick the right class” view to the sigmoid's “rate this pair” view is the imbalance between positive and negative pairs. For a batch size $|\mathcal{B}|$, the batch contains $|\mathcal{B}|$ positive pairs, but $|\mathcal{B}|^{2}-|\mathcal{B}|$ negative examples. At the modest batch size of $16\,\mathrm{k}$, there are actually $268\,\mathrm{M}$ negative examples for only $16\,\mathrm{k}$ positive ones. At the same time, because the sigmoid loss decomposes into a sum of per-example losses, we can perform controlled experiments to study the effect of the mini-batch composition and distribution of examples visited. We run experiments in the SigLiT setup at batch size $16\,\mathrm{k}$ for $900\,\mathrm{M}$ steps and vary the composition of the batch by masking out (i.e. ignoring) enough negative examples to reach a target “positive : negative” ratio, masking in the following ways:
当视角从 softmax 的“选择正确类别”转向 sigmoid 的“评估这对”时,出现的一个问题是正负对之间的不平衡。对于一个批次大小为 $|\mathcal{B}|$ 的情况,批次中包含 $|\mathcal{B}|$ 个正对,但有 $|\mathcal{B}|^{2}-|\mathcal{B}|$ 个负样本。在批次大小为 $16\,\mathrm{k}$ 的情况下,实际上有 $268\,\mathrm{M}$ 个负样本,而只有 $16\,\mathrm{k}$ 个正样本。同时,由于 sigmoid 损失分解为每个样本损失的总和,我们可以进行控制实验,以研究小批次组成和样本分布的影响。我们在 SigLiT 设置下运行实验,批次大小为 $16\,\mathrm{k}$,进行 $900\,\mathrm{M}$ 步,并通过屏蔽(即忽略)足够多的负样本来达到目标“正:负”比例,屏蔽方式如下:
· Random: Randomly choose negative pairs to mask.
· Hard: Keep the hardest negative pairs (highest loss).
· Easy: Keep the easiest negative pairs (lowest loss).
· Hard + matching total pairs seen: Masking examples while training for a fixed number of steps decreases the total number of pairs seen during training. Hence, in the matched-pairs setting, we increase the number of training steps by the masking ratio in order to keep the number of pairs seen constant.
· 随机 (Random):随机选择负对进行掩码。
· 困难 (Hard):保留最难的负对(损失最高)。
· 简单 (Easy):保留最简单的负对(损失最低)。
· 困难 $^+$ 匹配总对数 (Hard $^+$ matching total pairs seen):在训练过程中固定步数进行掩码确实会减少训练期间看到的总对数。因此,在匹配对设置中,我们根据掩码比例增加训练步数以保持看到的总对数不变。
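The pair counts and the Random, Hard and Easy masking strategies listed above can be sketched on a tiny batch as follows. This is an illustrative toy, not the experimental code. It assumes the per-pair sigmoid loss has the form $-\log\sigma(z_{ij}\,\ell_{ij})$ with $z_{ij}=+1$ for matching pairs and $-1$ otherwise, where $\ell_{ij}$ is the pair's logit, and that the sum is divided by the batch size. All function names are hypothetical.

```python
import math
import random

def pair_loss(logit, is_positive):
    """Sigmoid loss for one pair: -log sigmoid(z * logit), z = +1 or -1."""
    z = 1.0 if is_positive else -1.0
    x = -z * logit
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def count_pairs(batch_size):
    """(positives, negatives) among all image-text pairs of one batch."""
    return batch_size, batch_size * batch_size - batch_size

def select_negatives(logits, keep, strategy, rng=random):
    """Return the set of (i, j) negative pairs that are kept; the rest are masked."""
    if strategy not in ("random", "hard", "easy"):
        raise ValueError("unknown strategy: " + strategy)
    n = len(logits)
    negatives = [(i, j) for i in range(n) for j in range(n) if i != j]
    if strategy == "random":
        return set(rng.sample(negatives, keep))
    # Rank by per-pair loss: "hard" keeps the highest, "easy" the lowest.
    ranked = sorted(negatives,
                    key=lambda p: pair_loss(logits[p[0]][p[1]], False),
                    reverse=(strategy == "hard"))
    return set(ranked[:keep])

def masked_loss(logits, kept_negatives):
    """Positive-pair losses plus kept negative-pair losses, divided by batch size."""
    n = len(logits)
    total = sum(pair_loss(logits[i][i], True) for i in range(n))
    total += sum(pair_loss(logits[i][j], False) for i, j in kept_negatives)
    return total / n

# Toy 3x3 logit matrix; the diagonal holds the positive pairs.
logits = [[2.0, -1.0, 0.5],
          [0.0, 1.0, -3.0],
          [1.5, -2.0, 3.0]]
print(count_pairs(16_384))
for strategy in ("random", "hard", "easy"):
    kept = select_negatives(logits, keep=2, strategy=strategy)
    print(strategy, sorted(kept), masked_loss(logits, kept))
```

For a batch of 16 384 examples, `count_pairs` gives 268 419 072 negatives, matching the roughly 268 M quoted in the text. The matched-pairs setting is not simulated here, since it only changes the number of training steps.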
Figure 6 shows the effect of the various masking strategies. Randomly removing negatives to rebalance does deteriorate performance. Keeping the easiest examples does not work at all, while keeping the hardest negatives does almost maintain the quality, indicating that, as could be expected, a lot of the learning on the negative side comes from the harder examples. This is further confirmed by the slightly increased performance of training longer on the hardest examples in order to match the total pairs seen.
图 6 展示了各种掩码策略的效果。随机移除负样本来重新平衡确实会降低性能。保留最简单样本的策略完全不起作用,而保留最困难负样本的策略几乎能维持质量,这表明,正如预期的那样,负样本侧的大部分学习来自于较困难的样本。这一点通过在最困难的样本上训练更长时间以匹配所看到的总样本对时性能略有提升而得到进一步证实。

Figure 7: Sigmoid training increases robustness to data noise. Titles show the type of corruption applied, and x-axes show the probability with which it is applied. With increasing corruption severity, M-scale models trained with sigmoid loss for 3.6 billion examples retain superiority over the corresponding softmax baseline.
图 7: Sigmoid 训练提高了对数据噪声的鲁棒性。标题显示所应用的噪声类型,$\mathbf{X}$ 轴显示其应用的概率。随着噪声严重程度的增加,使用 sigmoid 损失训练的 36 亿样本的 M-scale 模型在对应的 softmax 基线上保持了优势。
We also look at the value of the learned bias at the end of training, as well as the average logit value for positive and negative examples across these settings, and find that the results mostly follow what one would expect: as fewer negatives are present, the bias and logits become more positive overall. Interestingly, when training with more hard negative pairs, the average logit of positive pairs stays mostly flat.
我们还观察了训练结束时学习到的偏置值以及在这些设置下正负样本的平均 logit 值,发现结果基本符合预期:随着负样本减少,偏置和 logit 总体上变得更正。有趣的是,当使用更多困难负样本对进行训练时,正样本对的平均 logit 基本保持稳定。
This study confirms that (1) the imbalance does not seem to be a major reason for concern, while at the same time (2) coming up with an efficient way of including more negative examples can be promising but is not trivial.
本研究证实:(1) 不平衡问题似乎并不是需要重点关注的主要原因,同时 (2) 找到一种高效的方式来包含更多负样本可能很有前景,但并非易事。
4.9. Bias term in sigmoid loss
4.9. Sigmoid损失中的偏置项
We ablate the bias term in the loss function, using the Base architecture with an $8\,\mathrm{k}$ batch size, trained for 900M examples with the SigLIP setup. Zero-shot transfer results are reported on ImageNet [14], Oxford-IIIT Pet [34] and CIFAR-100 [26]. Table 4 presents results with and without a bias term in the sigmoid loss.
我们在损失函数中消融了偏置项,使用 Base 架构,批大小为 $8\,\mathrm{k}$,在 SigLIP 设置下训练了 900M 样本。零样本迁移结果在 ImageNet [14]、Oxford-IIIT Pet [34] 和 CIFAR-100 [26] 上报告。表 4 展示了在 sigmoid 损失中包含和不包含偏置项的结果。
Table 4: Bias (b) and temperature $(\mathbf{t}^{\prime})$ initialization. Results are reported using Base architecture, $8,\mathrm{k}$ batch size, trained for 900M examples. Enabling the bias term b with $-10$ initialization improves results consistently.
表 4: 偏差 (b) 和温度 $(\mathbf{t}^{\prime})$ 初始化。结果使用基础架构 (Base architecture)、$8,\mathrm{k}$ 批次大小 (batch size) 训练 900M 样本 (examples) 得出。启用偏差项 b 并使用 $-10$ 的初始化能持续提升结果。
| b | t' | INet-0 | Pet-0 | C100-0 |
|---|---|---|---|---|
| n/a | log 10 | 62.0 | 81.8 | 59.9 |
| -10 | log 10 | 63.0 | 82.4 | 61.0 |
| -10 | log 1 | 61.0 | 80.0 | 60.4 |
| 0 | log 10 | 61.7 | 79.9 | 59.0 |
| 0 | log 1 | 53.7 | 73.2 | 53.8 |
Enabling the bias term with a $-10$ initialization consistently improves performance across all tasks. This is because the bias term ensures that training starts close to the prior, preventing dramatic over-correction in early optimization. In contrast, a randomly chosen bias term initialization, such as the 0 initialization in Table 4, fails to address the over-correction issue, leading to significantly worse results. This effect is particularly noticeable when using a small temperature $t^{\prime}$ initialization. We set the bias and temperature initialization to $b=-10$ and $t^{\prime}=\log{10}$ (hence $t=10$) as the default for all experiments.
使用 $-10$ 初始化偏置项在所有任务中始终能提高性能。这是因为偏置项确保训练开始时接近先验,防止早期优化中出现剧烈的过度修正。相比之下,随机选择的偏置项初始化,如表 4 中的 0 初始化,无法解决过度修正问题,导致结果显著变差。这种效果在使用较小的温度 $t^{\prime}$ 初始化时尤为明显。我们将偏置和温度初始化设置为 $b=-10$ 和 $t^{\prime}=\log{10}$(因此 $t=10$)作为所有实验的默认值。
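To make the "starts close to the prior" argument concrete, the sketch below compares the initial loss for $b=-10$ and $b=0$ at the $8\,\mathrm{k}$ batch size used in this ablation. It is a back-of-the-envelope illustration under stated assumptions: every pair's scaled similarity is taken to be 0 at initialization (so each logit equals the bias), the per-pair loss is $-\log\sigma(z_{ij}\,\ell_{ij})$, and the sum is divided by the batch size. The function names and the choice of 8192 for "8 k" are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def initial_loss_per_example(batch_size, bias, scaled_similarity=0.0):
    """Sigmoid loss per example when every pair shares the logit t * sim + b."""
    logit = scaled_similarity + bias
    positive = softplus(-logit)                     # -log sigmoid(logit)
    negatives = (batch_size - 1) * softplus(logit)  # -log sigmoid(-logit) per negative
    return positive + negatives

def prior_positive_fraction(batch_size):
    """Fraction of all pairs in a batch that are positives: |B| / |B|^2."""
    return 1.0 / batch_size

batch_size = 8192
for bias in (-10.0, 0.0):
    predicted = 1.0 / (1.0 + math.exp(-bias))  # sigmoid(bias)
    print(bias, predicted, initial_loss_per_example(batch_size, bias))
print("prior:", prior_positive_fraction(batch_size))
```

Under these assumptions, $b=-10$ predicts a match probability of about $4.5\times10^{-5}$ for every pair, the same rough order as the $1/8192$ share of positives, and gives an initial loss near 10. With $b=0$ every pair is predicted at 0.5 and the initial loss is in the thousands, dominated by the negatives.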
4.10. Label noise robustness
4.10. 标签噪声鲁棒性
Prior works demonstrated improved robustness against label noise when using the sigmoid loss for classification models [3]. This property would be particularly useful here, given the famously noisy nature of popular large-scale image-text datasets. In order to study this for SigLIP, we train M/16 image models alongside an M text model at batch size 16384 for 3.6 billion seen examples. We corrupt the training data using one of the following methods:
先前的研究表明,在分类模型中使用 sigmoid 损失 (sigmoid loss) 可以提高对标签噪声的鲁棒性 [3]。在面对众所周知的大规模图像-文本数据集的噪声特性时,这一特性将特别有用。为了在 SigLIP 中研究这一点,我们训练了 M/16 图像模型和 M 文本模型,批量大小为 16384,共处理了 36 亿个样本。我们使用以下方法之一对训练数据进行损坏:
Results from varying the likelihood of the corruption are shown in Figure 7. Models trained with sigmoid loss are increasingly robust to all kinds of added noise.
不同损坏概率的结果如图 7 所示。使用 sigmoid 损失训练的模型对所有类型的附加噪声具有越来越强的鲁棒性。
5. Conclusion
5. 结论
We conducted a study on two language-image pre-training instances that use the sigmoid loss: SigLiT and SigLIP. Our results demonstrate that the sigmoid loss performs better than the softmax baseline, particularly for small training batch sizes. This loss function is also more memory efficient, which allows larger training batch sizes without requiring additional resources. We performed a thorough investigation of the batch size in contrastive learning. Surprisingly, we found that a relatively modest batch size of $32\,\mathrm{k}$ yielded nearly optimal performance. Further studies were performed to better understand the bias term introduced in the sigmoid loss, the robustness to data noise, and the impact of the ratio of positive to negative pairs in the sigmoid loss. We hope this work will facilitate language-image pre-training research with limited resources.
我们对使用 sigmoid 损失的两个语言-图像预训练实例 SigLiT 和 SigLIP 进行了研究。我们的结果表明,sigmoid 损失的表现优于 softmax 基线,特别是在小训练批量大小时。该损失函数还更节省内存,从而允许在不增加资源的情况下使用更大的训练批量。我们对对比学习中的批量大小进行了深入研究。令人惊讶的是,我们发现相对较小的批量大小 $32,\mathrm{k}$ 几乎可以达到最佳性能。我们进一步研究了 sigmoid 损失中引入的偏差项、对数据噪声的鲁棒性以及正负样本比例对 sigmoid 损失的影响。我们希望这项工作能促进在资源有限的情况下进行语言-图像预训练研究。
Acknowledgements. We thank Daniel Keysers, Ilya Tolstikhin, Olivier Bousquet and Michael Tschannen for their valuable feedback and discussions on this paper. We thank Joan Puigcerver, Josip Djolonga and Blake Hechtman for discussions on efficient implementations of the chunked contrastive loss. We thank Kaiming He and Xinlei Chen for the discussion of $\beta_{2}$ to stabilize the training. We also thank Ross Wightman for spotting a mistake in the pseudocode in the first version of this paper, and Boris Dayma and Krzysztof Maziarz for spotting typos in the second and third versions which made $t$ vs $t^{\prime}$ confusing. We thank the Google DeepMind team for providing a supportive research environment. We use the big-vision codebase [5, 4] for all experiments in this project.
致谢。我们感谢 Daniel Keysers、Ilya Tolstikhin、Olivier Bousquet 和 Michael Tschannen 对本文的宝贵反馈和讨论。我们感谢 Joan Puigcerver、Josip Djolonga 和 Blake Hechtman 关于分块对比损失高效实现的讨论。我们感谢 Kaiming He 和 Xinlei Chen 关于 $\beta_{2}$ 稳定训练的讨论。我们还要感谢 Ross Wightman 发现了本文第一版伪代码中的错误,感谢 Boris Dayma 和 Krzysztof Maziarz 发现了第二版和第三版中的拼写错误,这些错误使得 $t$ 与 $t^{\prime}$ 的关系变得混乱。我们感谢 Google DeepMind 团队提供了支持性的研究环境。我们在本项目中的所有实验都使用了 big-vision 代码库 [5, 4]。
References
参考文献
MLHC 2022, 5-6 August 2022, Durham, NC, USA, volume 182 of Proceedings of Machine Learning Research, pages 2-25. PMLR, 2022.
MLHC 2022, 2022年8月5-6日, 美国北卡罗来纳州达勒姆, 第182卷机器学习研究论文集, 第2-25页. PMLR, 2022.
A. More results for SigLiT
A. SigLiT 的更多结果
In Section 4.1, we use the same precomputed image embeddings, obtained with a ViT-g vision model from [59]. Only resize augmentation is applied, to a fixed $288\times288$ resolution. We train a standard Base-size text tower, using the ScalingViT-Adafactor optimizer [58] with $\beta_{1}=0.9$ and $\beta_{2}=0.95$. We use a learning rate of 0.001 with a linear warmup schedule for the first $200\,\mathrm{M}$ examples seen, after which the learning rate is decayed to zero with a cosine schedule. Weight decay is set to 0.0001 for all the experiments. The text is tokenized by a $32\,\mathrm{k}$ vocabulary sentencepiece tokenizer [27] trained on the English C4 dataset [37], and a maximum of 16 text tokens are kept. Table 8 shows results for multiple numbers of training examples seen and batch sizes, for both the sigmoid loss and the softmax loss baseline.
在 4.1 节中,我们使用来自 [59] 的 ViT-g 视觉模型为图像使用相同的预计算嵌入。仅应用了调整大小的增强,将其调整为固定的 $288\times288$ 分辨率。我们使用 ScalingViT-Adafactor 优化器 [58] 训练了一个标准基础大小的文本塔,其中 $\beta_{1}=0.9$ 和 $\beta_{2}=0.95$。我们使用 0.001 的学习率,并在前 $200\,\mathrm{M}$ 个样本中采用线性预热计划,然后通过余弦学习率计划将学习率衰减至零。所有实验的权重衰减设置为 0.0001。文本通过一个 $32\,\mathrm{k}$ 词汇量的 sentencepiece tokenizer [27] 进行 Token 化,该 Token 器在英语 C4 数据集 [37] 上训练,最多保留 16 个文本 Token。表 8 展示了在使用多个训练样本和批量大小时,sigmoid 损失和 softmax 损失基线的结果。
For training SigLiT in under a day with 4 chips (Section 4.4), we used the LION optimizer with peak learning rate $1\times10^{-4}$ and weight decay $1\times10^{-7}$ . The learning rate was warmed linearly to the peak in $6.5,\mathrm{k}$ steps, then cosine decayed to zero for the remaining $58.5,\mathrm{k}$ steps.
为了在 4 个芯片上在一天内训练 SigLiT(第 4.4 节),我们使用了 LION 优化器,峰值学习率为 $1\times10^{-4}$,权重衰减为 $1\times10^{-7}$。学习率在 $6.5,\mathrm{k}$ 步内线性预热至峰值,然后在剩余的 $58.5,\mathrm{k}$ 步内余弦衰减至零。
B. More results for SigLIP
B. SigLIP 更多结果
In Table 5, we present more results for SigLIP Base with multiple train examples seen: 3 billion examples and 9 billion examples respectively.
在表 5 中,我们展示了 SigLIP Base 在多训练样本下的更多结果:分别为 30 亿样本和 90 亿样本。
| Batch size | 3B sigmoid | 3B softmax | 9B sigmoid | 9B softmax |
|---|---|---|---|---|
| 512 | 51.5 | 47.7 | - | - |
| 1k | 57.3 | 53.2 | - | - |
| 2k | 62.1 | 59.3 | - | - |
| 4k | 65.3 | 63.8 | 68.4 | 66.6 |
| 8k | 68.6 | 66.6 | 70.6 | 69.4 |
| 16k | - | - | 72.3 | 71.7 |
| 32k | 69.9 | 69.9 | 73.4 | 72.9 |
| 98k | 69.5 | 69.7 | 73.0 | 73.2 |
| 307k | - | - | 71.6 | 72.6 |
Table 5: SigLIP zero-shot accuracy $(\%)$ on the ImageNet benchmark. Both the sigmoid loss and the softmax loss baseline are presented. Experiments are performed for multiple numbers of training examples seen ($3\,\mathrm{B}$, $9\,\mathrm{B}$) and training batch sizes (from 512 to $307\,\mathrm{k}$). When trained for $9\,\mathrm{B}$ examples, the peak of the sigmoid loss comes earlier, at $32\,\mathrm{k}$, than the peak of the softmax loss at $98\,\mathrm{k}$. Together with the memory efficiency of the sigmoid loss, this allows one to train the best language-image model with far fewer accelerators.
表 5: SigLIP 在 ImageNet 基准测试中的零样本准确率 $(\%)$。展示了 sigmoid 损失和 softmax 损失基线的结果。实验在多个训练样本数量($3\,\mathrm{B}$、$9\,\mathrm{B}$)和训练批量大小(从 512 到 $307\,\mathrm{k}$)下进行。当训练到 $9\,\mathrm{B}$ 样本时,sigmoid 损失的峰值出现在 $32\,\mathrm{k}$,比 softmax 损失的峰值 $98\,\mathrm{k}$ 更早。结合 sigmoid 损失的内存效率优势,这使得可以用更少的加速器训练出最佳的语言-图像模型。
| BS | Default | Best | Best LR | Best WD |
|---|---|---|---|---|
| 8k | 70.1 | 70.1 | 0.001 | 0.0001 |
| 16k | 70.0 | 70.0 | 0.001 | 0.0001 |
| 32k | 68.2 | 69.0 | 0.0003 | 0.00003 |
Table 6: Default hyperparameters across different batch sizes perform either the best or close to the best in a hyperparameter sweep. Zero-shot accuracy on ImageNet is reported. BS = batch size, LR = learning rate, WD = weight decay.
表 6: 不同批次大小的默认超参数,在超参数扫描中表现最佳或接近最佳。报告了 ImageNet 上的零样本准确率。BS = 批次大小,LR = 学习率,WD = 权重衰减。
C. Robustness of SigLIP results
C. SigLIP 结果的鲁棒性
Hyperparameters for different batch sizes. The sigmoid loss does not require tuning hyperparameters for different batch sizes. For example, in both the SigLIP and SigLiT setups, we only used the default learning rate of 0.001 and weight decay of 0.0001 across a wide range of batch sizes (from 512 to 1024k). We further performed a sweep of 9 hyperparameter settings across 3 batch sizes on the from-scratch SigLIP setup for 3B seen examples: learning rate $\{0.0003, 0.001, 0.003\}$ $\times$ weight decay $\{0.00003, 0.0001, 0.0003\}$ $\times$ batch size $\{8\,\mathrm{k}, 16\,\mathrm{k}, 32\,\mathrm{k}\}$. We observed in Table 6 that the default LR/WD is either the best or close to the best.
不同批量大小的超参数。Sigmoid损失不需要为不同的批量大小调整超参数。例如,在SigLIP和SigLiT设置中,我们仅使用了默认的0.001学习率和0.0001权重衰减,适用于广泛的批量大小(从512到1024k)。我们进一步在从头开始的SigLIP设置中对3B个样本进行了9组超参数设置的扫描,涵盖了3种批量大小:学习率 $\{0.0003, 0.001, 0.003\}$ $\times$ 权重衰减 $\{0.00003, 0.0001, 0.0003\}$ $\times$ 批量大小 $\{8\,\mathrm{k}, 16\,\mathrm{k}, 32\,\mathrm{k}\}$。我们在表6中观察到,默认的LR/WD要么是最佳值,要么接近最佳值。
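The sweep described above is a full Cartesian product of three learning rates, three weight decays and three batch sizes. A minimal sketch of enumerating it is shown below. The config dictionary keys are hypothetical, and "k" is assumed to mean 1024 (consistent with the 16384 batch size quoted elsewhere in the paper).

```python
import itertools

LEARNING_RATES = (0.0003, 0.001, 0.003)
WEIGHT_DECAYS = (0.00003, 0.0001, 0.0003)
BATCH_SIZES = (8 * 1024, 16 * 1024, 32 * 1024)  # assumes k = 1024

def sweep_configs():
    """All (learning rate, weight decay, batch size) combinations in the sweep."""
    return [
        {"lr": lr, "wd": wd, "batch_size": bs}
        for lr, wd, bs in itertools.product(LEARNING_RATES, WEIGHT_DECAYS, BATCH_SIZES)
    ]

configs = sweep_configs()
print(len(configs))  # 9 LR/WD settings for each of 3 batch sizes
print(configs[0])
```

The default setting (learning rate 0.001, weight decay 0.0001) appears once per batch size, which is the entry Table 6 compares against the best run of each sweep.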
Standard deviation. We repeat SigLIP training five times, using the recommended 32k batch size and 3B seen examples. We report the average and std in Table 7. The std of the five runs is very small for both sigmoid and softmax.
标准差。我们重复了五次 SigLIP 训练,使用推荐的 32k 批次大小和 3B 样本。我们在表 7 中报告了平均值和标准差。五次运行的 sigmoid 和 softmax 的标准差都非常小。
Alternative optimizers. We repeat the same experiment five times with the AdamW optimizer and obtain very similar results and std, as reported in Table 7. We also tested a linear learning rate schedule instead of the default cosine learning rate schedule; it achieves $69.9\%$ accuracy.
替代优化器。我们使用 AdamW 优化器重复了相同的实验五次,得到了与表 7 中报告的结果和标准差非常相似的结果。我们测试了线性学习率调度器,而不是默认的余弦学习率调度器,其准确率达到 $69.9%$。
D. More results for mSigLIP
D. mSigLIP 的更多结果
We present the mSigLIP Base cross-modal retrieval results on the Crossmodal-3600 dataset, across all 36 languages, in Figure 8 and Table 9.
我们在图 8 和表 9 中展示了 mSigLIP Base 在 Crossmodal-3600 数据集上的跨模态检索结果,涵盖了所有 36 种语言。
Table 7: Mean and standard deviation of five repeated experiments. Zero-shot accuracy on ImageNet is reported.
| Loss | Optimizer | Result (%) |
|---|---|---|
| Softmax | ViT-Adafactor | 69.9 ± 0.1 |
| Sigmoid | ViT-Adafactor | 70.1 ± 0.2 |
| Sigmoid | AdamW | 70.3 ± 0.1 |
表 7: 五次重复实验的均值和标准差。报告了 ImageNet 上的零样本准确率。
| Batch Size | 450 M sigmoid | 450 M softmax | 900 M sigmoid | 900 M softmax | 3B sigmoid | 3B softmax | 18 B sigmoid | 18 B softmax |
|---|---|---|---|---|---|---|---|---|
| 512 | 72.5 | 69.5 | 75.0 | 72.8 | 77.2 | 74.6 | - | - |
| 1k | 75.5 | 73.6 | 77.2 | 76.0 | 79.6 | 77.9 | - | - |
| 2k | 77.1 | 76.3 | 79.3 | 78.1 | 81.3 | 80.1 | 82.2 | 81.2 |
| 4k | 79.2 | 78.3 | 80.8 | 79.8 | 82.4 | 81.2 | 83.0 | 82.0 |
| 8k | 80.8 | 79.7 | 82.0 | 81.0 | 83.1 | 82.6 | 83.6 | 83.1 |
| 16k | 81.2 | 81.2 | 82.7 | 82.1 | 83.8 | 83.5 | 84.2 | 84.1 |
| 32k | 81.9 | 81.4 | 83.1 | 82.7 | 84.2 | 84.0 | 84.6 | 84.4 |
| 64k | 81.6 | 81.6 | 83.0 | 82.8 | 84.3 | 84.1 | 84.7 | 84.4 |
| 128k | 80.5 | 80.0 | 83.1 | 83.2 | 84.2 | 84.4 | 84.7 | 84.6 |
| 256k | 72.8 | 72.2 | 82.1 | 81.7 | 84.3 | 84.2 | 84.7 | 84.6 |
| 1024k | - | - | - | - | - | - | 84.7 | - |
Table 8: SigLiT zero-shot accuracy $(\%)$ on the ImageNet benchmark. Both the sigmoid loss and the softmax loss baseline are presented. Extensive experiments are performed for multiple numbers of training examples seen (450 M, 900 M, 3B, 18 B) and training batch sizes (from 512 to $1\,\mathrm{M}$).
表 8: SigLiT 在 ImageNet 基准上的零样本准确率 $(%)$。展示了 sigmoid 损失和 softmax 损失基线。在多种训练样本数(450 M、900 M、3B、18 B)和训练批量大小(从 512 到 $1,\mathrm{M}$)上进行了广泛的实验。

Figure 8: Image-to-text and text-to-image zero-shot retrieval recall $@1$ results on all 36 languages of Crossmodal-3600. Top: Image to text. Bottom: text to image. Colors are batch sizes. $^{*}32\mathrm{k}$ represents the scaled up results as described in Section 4.6.
图 8: Crossmodal-3600 的 36 种语言的图像到文本和文本到图像的零样本检索召回率 $@1$ 结果。顶部:图像到文本。底部:文本到图像。颜色表示批量大小。$^{*}32\mathrm{k}$ 表示第 4.6 节中描述的扩展结果。
E. Label noise experiments
E. 标签噪声实验
All models had an M/16 image tower and an M text tower. They were trained from random initialisation for 3.6B examples seen, with a batch size of 16384. A cosine learning rate schedule was used, with an initial linear warmup for $10\%$ of steps up to a peak learning rate of 0.001.
所有模型都包含一个 M/16 的图像塔和一个 M 的文本塔。它们从随机初始化开始训练,共观察了 3.6B 个样本,批量大小为 16384。使用了余弦学习率调度,初始线性预热步骤为总步骤的 $10%$,峰值学习率为 0.001。
Table 9: Image-to-text (text retrieval) and text-to-image (image retrieval) zero-shot recall@1 results on all 36 languages of Crossmodal-3600, with mSigLIP models trained at different batch sizes for $30\,\mathrm{B}$ total examples seen. $^{*}32\,\mathrm{k}$ represents the scaled-up results as described in Section 4.6.
表 9: 在所有 36 种语言的 Crossmodal-3600 上,使用不同批量大小训练的 mSigLIP 模型进行图像到文本(文本检索)和文本到图像(图像检索)的零样本召回率@1 结果,总计处理了 $30\,\mathrm{B}$ 个样本。$^{*}32\,\mathrm{k}$ 表示如第 4.6 节所述的放大结果。
F. Model Card
F. 模型卡
We provide a description of our models following [32].
我们按照[32]提供了模型的描述。
· Model Architecture: The model is trained using the contrastive pre-training technique with sigmoid loss as described in this paper. This contrastive model contains two encoders, i.e. vision transformer encoder [17] and language transformer encoder [47]. The vision and language encoders always have the same size, one of ViT-B, ViT-L and SoViT-400M [1].
· 模型架构:该模型使用对比预训练技术进行训练,并采用本文中描述的 sigmoid 损失。该对比模型包含两个编码器,即视觉 Transformer 编码器 [17] 和语言 Transformer 编码器 [47]。视觉和语言编码器始终具有相同的大小,分别为 ViT-B、ViT-L 和 SoViT-400M [1] 之一。
· Inputs: The vision encoder takes an image ($224\times224\times3$, $256\times256\times3$, $384\times384\times3$, or $512\times512\times3$) as input. The text encoder takes a tokenized text [38, 54] cropped to the first 64 tokens as input.
· 输入:视觉编码器以图像($224\times224\times3$、$256\times256\times3$、$384\times384\times3$ 或 $512\times512\times3$)作为输入。文本编码器以前 64 个 Token 截断的 Token 化文本 [38, 54] 作为输入。
· Outputs: The vision and text encoders both output a $d$ dimensional feature vector, where $d$ is 768, 1024 and 1152 for ViT-B, ViT-L and SoViT-400M, respectively.
输出:视觉和文本编码器都输出一个 $d$ 维特征向量,其中 $d$ 对于 ViT-B、ViT-L 和 SoViT-400M 分别为 768、1024 和 1152。
· Intended Use: The models are designed for multimodal research purposes. The models can be used for zero-shot image classification and zero-shot image-text retrieval by comparing both feature vectors. We provide both en-only and i18n-trained models to encourage research on the impact of this choice.
· 预期用途:这些模型设计用于多模态研究目的。通过比较特征向量,这些模型可用于零样本图像分类和零样本图像文本检索。我们提供了仅英语训练和多语言训练的模型,以鼓励研究这一选择的影响。
· Training Data: The contrastive model is pre-trained from-scratch using the WebLI [13] dataset. SigLIP models are pre-trained on a WebLI subset filtered to contain mostly English. mSigLIP models are pretrained on the WebLI dataset without language filters.
训练数据:对比模型从头开始使用 WebLI [13] 数据集进行预训练。SigLIP 模型在过滤后主要包含英语的 WebLI 子集上进行预训练。mSigLIP 模型在没有语言过滤器的情况下在 WebLI 数据集上进行预训练。
· Evaluation Data: Zero-shot classification is performed on ImageNet [14], ImageNet v2 [39], ImageNet Real [3], and ObjectNet [2]. Zero-shot retrieval is performed on COCO [11] and the multilingual XM3600 dataset [44].
评估数据:零样本分类在 ImageNet [14]、ImageNet v2 [39]、ImageNet Real [3] 和 ObjectNet [2] 上进行。零样本检索在 COCO [11] 和多语言 XM3600 数据集 [44] 上进行。
· Hardware & Software: The models are developed in the big-vision codebase [5, 4] and trained on Google Cloud TPUs.
硬件与软件:这些模型在big-vision代码库[5, 4]中开发,并在Google Cloud TPU上进行训练。
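The intended use described in the model card, zero-shot classification "by comparing both feature vectors", can be sketched as picking the class whose text embedding is most similar to the image embedding. The example below is a toy illustration with 3-dimensional vectors, whereas the real encoders output 768, 1024 or 1152 dimensions. It assumes cosine similarity as the comparison, and the function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are not handled)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best class index, cosine similarity to each class text embedding)."""
    img = l2_normalize(image_embedding)
    scores = []
    for text_embedding in class_text_embeddings:
        txt = l2_normalize(text_embedding)
        scores.append(sum(a * b for a, b in zip(img, txt)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings standing in for encoder outputs.
image = [1.0, 0.0, 0.0]
class_texts = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(zero_shot_classify(image, class_texts))
```

Retrieval works the same way with the roles reversed: one text embedding is scored against many image embeddings, or one image against many texts, and the candidates are ranked by similarity.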
