[论文翻译]YOLOE: 实时视觉感知


原文地址:https://arxiv.org/pdf/2503.07465v1


YOLOE: Real-Time Seeing Anything

Abstract

摘要

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or a prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose a Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present a Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce a Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and a specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3\times less training cost and 1.4\times inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 \mathrm{AP}^{b} and 0.4 \mathrm{AP}^{m} gains over the closed-set YOLOv8-L with nearly 4\times less training time. Code and models are available at https://github.com/THU-MIG/yoloe.

目标检测与分割在计算机视觉应用中被广泛采用,然而传统的模型如 YOLO 系列虽然高效且准确,但受限于预定义类别,在开放场景中的适应性较差。最近的开放集方法利用文本提示、视觉线索或无提示范式来克服这一问题,但由于高计算需求或部署复杂性,往往在性能与效率之间做出妥协。在本工作中,我们提出了 YOLOE,它将检测与分割整合到单一高效的模型中,支持多样化的开放提示机制,实现了实时“看到任何东西”。对于文本提示,我们提出了可重参数化的区域-文本对齐策略(RepRTA)。它通过可重参数化的轻量级辅助网络优化预训练的文本嵌入,并在零推理和迁移开销的情况下增强视觉-文本对齐。对于视觉提示,我们提出了语义激活的视觉提示编码器(SAVPE)。它采用解耦的语义和激活分支,以最小的复杂度带来改进的视觉嵌入和准确性。对于无提示场景,我们引入了惰性区域-提示对比策略(LRPC)。它利用内置的大词汇量和专用嵌入来识别所有对象,避免了对昂贵语言模型的依赖。大量实验表明,YOLOE 在零样本性能和迁移能力方面表现出色,具有高推理效率和低训练成本。值得注意的是,在 LVIS 上,YOLOE-v8-S 以 3\times 更少的训练成本和 1.4\times 的推理加速,超越了 YOLO-Worldv2-S 3.5 AP。在迁移到 COCO 时,YOLOE-v8-L 在封闭集 YOLOv8-L 的基础上实现了 0.6 \mathrm{AP}^{b} 和 0.4 \mathrm{AP}^{m} 的提升,训练时间减少了近 4\times。代码和模型可在 https://github.com/THU-MIG/yoloe 获取。

1. Introduction

1. 引言

Object detection and segmentation are foundational tasks in computer vision [15, 48], with widespread applications spanning autonomous driving [2], medical analyses [55], and robotics [8], etc. Traditional approaches like the YOLO series [1, 3, 21, 47] have leveraged convolutional neural networks to achieve remarkable real-time performance. However, their dependence on predefined object categories constrains flexibility in practical open scenarios. Such scenarios increasingly demand models capable of detecting and segmenting arbitrary objects guided by diverse prompt mechanisms, such as texts, visual cues, or no prompt at all.

目标检测与分割是计算机视觉中的基础任务 [15, 48],其应用广泛覆盖自动驾驶 [2]、医学分析 [55] 和机器人技术 [8] 等领域。传统方法如 YOLO 系列 [1, 3, 21, 47] 利用卷积神经网络实现了显著的实时性能。然而,它们对预定义目标类别的依赖限制了在实际开放场景中的灵活性。这些场景越来越需要能够通过多种提示机制(如文本、视觉提示或无提示)检测和分割任意目标的模型。


Figure 1. Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and advanced YOLO-Worldv2 in terms of open text prompts. LVIS AP is evaluated on minival set and FPS w/ TensorRT and w/ CoreML is measured on T4 GPU and iPhone 12, respectively. The results highlight our superiority.

图 1: YOLOE (Ours) 与先进的 YOLO-Worldv2 在开放文本提示下的性能、训练成本和推理效率对比。LVIS AP 在 minival 集上评估,FPS w/ TensorRT 和 w/ CoreML 分别在 T4 GPU 和 iPhone 12 上测量。结果凸显了我们的优势。

Given this, recent efforts have shifted towards enabling models to generalize for open prompts [5, 20, 49, 80]. They target a single prompt type, e.g., GLIP [32], or multiple prompt types in a unified way, e.g., DINO-X [49]. Specifically, with region-level vision-language pretraining [32, 37, 65], text prompts are usually processed by a text encoder to serve as contrastive objectives for region features [20, 49], achieving recognition for arbitrary categories, e.g., YOLO-World [5]. For visual prompts, they are often encoded as class embeddings tied to specified regions for identifying similar objects, by the interaction with image features or a language-aligned visual encoder [5, 19, 30, 49], e.g., T-Rex2 [20]. In the prompt-free scenario, existing methods typically integrate language models, finding all objects and generating the corresponding category names conditioned on region features sequentially [49, 62], e.g., GenerateU [33].

因此,最近的研究工作转向了使模型能够对开放提示进行泛化 [5, 20, 49, 80]。它们针对单一提示类型,例如 GLIP [32],或以统一的方式处理多种提示类型,例如 DINO-X [49]。具体而言,通过区域级视觉语言预训练 [32, 37, 65],文本提示通常由文本编码器处理,作为区域特征的对比目标 [20, 49],实现对任意类别的识别,例如 YOLO-World [5]。对于视觉提示,它们通常被编码为与指定区域绑定的类别嵌入,通过与图像特征或语言对齐的视觉编码器的交互来识别相似对象 [5, 19, 30, 49],例如 T-Rex2 [20]。在无提示场景中,现有方法通常集成语言模型,依次找到所有对象并根据区域特征生成相应的类别名称 [49, 62],例如 GenerateU [33]。

Despite notable advancements, a single model that supports diverse open prompts for arbitrary objects with high efficiency and accuracy is still lacking. For example, DINO-X [49] features a unified architecture, which, however, incurs resource-intensive training and inference overhead. Additionally, individual designs for different prompts in separate works exhibit suboptimal trade-offs between performance and efficiency, making it difficult to directly combine them into one model. For example, text-prompted approaches often incur substantial computational overhead when incorporating large vocabularies, due to the complexity of cross-modality fusion [5, 32, 37, 49]. Visual-prompted methods usually compromise deployability on edge devices owing to the transformer-heavy design or reliance on an additional visual encoder [20, 30, 67]. Prompt-free ways, meanwhile, depend on large language models, introducing considerable memory and latency costs [33, 49].

尽管取得了显著进展,但能够高效且准确地支持任意对象多样化开放提示的单一模型仍然缺乏。例如,DINO-X [49] 采用了一种统一架构,然而,这带来了资源密集型的训练和推理开销。此外,不同工作中针对不同提示的独立设计在性能和效率之间表现出次优的权衡,使得很难将它们直接结合到一个模型中。例如,文本提示方法在融合大词汇量时,由于跨模态融合的复杂性,通常会带来大量的计算开销 [5, 32, 37, 49]。视觉提示方法通常由于 Transformer 密集型设计或依赖额外的视觉编码器而在边缘设备上的部署能力上做出妥协 [20, 30, 67]。与此同时,无提示方法依赖大语言模型,引入了相当大的内存和延迟成本 [33, 49]。

In light of these, in this paper, we introduce YOLOE (ye), a highly efficient, unified, and open object detection and segmentation model, like the human eye, under different prompt mechanisms, like texts, visual inputs, and the prompt-free paradigm. We begin with YOLO models with widely proven efficacy. For text prompts, we propose a Re-parameterizable Region-Text Alignment (RepRTA) strategy, which employs a lightweight auxiliary network to improve pretrained textual embeddings for better visual-semantic alignment. During training, pre-cached textual embeddings require only the auxiliary network to process text prompts, incurring low additional cost compared with closed-set training. At inference and transferring, the auxiliary network is seamlessly re-parameterized into the classification head, yielding an architecture identical to YOLOs with zero overhead. For visual prompts, we design a Semantic-Activated Visual Prompt Encoder (SAVPE). By formalizing regions of interest as masks, SAVPE fuses them with multi-scale features from PAN to produce grouped prompt-aware weights in low dimension in an activation branch and extract prompt-agnostic semantic features in a semantic branch. Prompt embeddings are derived through aggregation of them, resulting in favorable performance with minimal complexity. For the prompt-free scenario, we introduce a Lazy Region-Prompt Contrast (LRPC) strategy. Without relying on costly language models, LRPC leverages a specialized prompt embedding to find all objects and a built-in large vocabulary for category retrieval. By matching only anchor points with identified objects against the vocabulary, LRPC ensures high performance with low overhead.

基于这些,本文中我们提出了YOLOE(ye),一种高效、统一且开放的目标检测与分割模型,它能够像人眼一样在不同的提示机制下工作,如文本、视觉输入以及无提示范式。我们从已被广泛验证有效的YOLO模型出发。针对文本提示,我们提出了一种可重参数化的区域-文本对齐(RepRTA)策略,该策略采用轻量级辅助网络来优化预训练的文本嵌入,以实现更好的视觉语义对齐。训练期间,预缓存的文本嵌入仅需辅助网络处理文本提示,相比闭集训练,额外成本较低。在推理和迁移时,辅助网络无缝地重参数化到分类头中,生成与YOLO结构相同且无额外开销的架构。对于视觉提示,我们设计了语义激活视觉提示编码器(SAVPE)。通过将感兴趣区域形式化为掩码,SAVPE将其与PAN的多尺度特征融合,在激活分支中生成低维度的分组提示感知权重,并在语义分支中提取与提示无关的语义特征。提示嵌入通过它们的聚合得出,以最小的复杂度实现优异的性能。对于无提示场景,我们引入了惰性区域-提示对比(LRPC)策略。LRPC不依赖于成本高昂的语言模型,而是利用专门的提示嵌入来寻找所有对象,并内置大词汇表用于类别检索。通过仅将锚点与识别出的对象与词汇表匹配,LRPC确保了高性能与低开销。

Thanks to them, YOLOE excels in detection and segmentation across diverse open prompt mechanisms within one model, enjoying high inference efficiency and low training cost. Notably, as shown in Fig. 1, under 3\times less training cost, YOLOE-v8-S significantly outperforms YOLO-Worldv2-S [5] by 3.5 AP on LVIS [14], with 1.4\times and 1.3\times inference speedups on T4 and iPhone 12, respectively. In visual-prompted and prompt-free settings, YOLOE-v8-L outperforms T-Rex2 by 3.3 \mathrm{AP}_{r} and GenerateU by 0.4 AP with 2\times less training data and 6.3\times fewer parameters, respectively. For transferring to COCO [34], YOLOE-v8-M / L outperforms YOLOv8-M / L by 0.4 / 0.6 \mathrm{AP}^{b} and 0.4 / 0.4 \mathrm{AP}^{m} with nearly 4\times less training time. We hope that YOLOE can establish a strong baseline and inspire further advancements in real-time open prompt-driven vision tasks.

得益于它们,YOLOE 在单一模型内跨多种开放提示机制的检测与分割中表现出色,具备高推理效率和低训练成本。值得注意的是,如图 1 所示,在训练成本减少 3\times 的情况下,YOLOE-v8-S 在 LVIS [14] 上显著优于 YOLO-Worldv2-S [5],AP 高出 3.5,并且在 T4 和 iPhone 12 上的推理速度分别提升了 1.4\times 和 1.3\times。在视觉提示和无提示设置下,YOLOE-v8-L 分别以 3.3 \mathrm{AP}_{r} 和 0.4 AP 优于 T-Rex2 和 GenerateU,同时训练数据减少 2\times,参数减少 6.3\times。在迁移到 COCO [34] 时,YOLOE-v8-M / L 以 0.4 / 0.6 \mathrm{AP}^{b} 和 0.4 / 0.4 \mathrm{AP}^{m} 优于 YOLOv8-M / L,且训练时间减少近 4\times。我们希望 YOLOE 能够建立一个强大的基线,并激发实时开放提示驱动视觉任务的进一步进展。

2. Related Work

2. 相关工作

Traditional detection and segmentation. Traditional approaches for object detection and segmentation primarily operate under closed-set paradigms. Early two-stage frameworks [4, 12, 15, 48], exemplified by Faster R-CNN [48], introduce region proposal networks (RPNs) followed by region-of-interest (ROI) classification and regression. Meanwhile, single-stage detectors [10, 35, 38, 56, 72] prioritize speed through grid-based predictions within a single network. The YOLO series [1, 21, 27, 47, 59, 60] plays a significant role in this paradigm and is widely used in the real world. Moreover, DETR [28] and its variants [28, 69, 77] mark a major shift by removing heuristic-driven components with transformer-based architectures. To achieve finer-grained results, existing instance segmentation methods predict pixel-level masks rather than bounding box coordinates [15]. For this, YOLACT [3] facilitates real-time instance segmentation through integration of prototype masks and mask coefficients. Based on DINO [69], MaskDINO [29] utilizes query embeddings and a high-resolution pixel embedding map to produce binary masks.

传统检测与分割。传统的目标检测和分割方法主要在封闭集范式下运行。早期的两阶段框架 [4, 12, 15, 48],以 Faster RCNN [48] 为例,引入了区域提议网络 (RPNs),随后进行感兴趣区域 (ROI) 的分类和回归。与此同时,单阶段检测器 [10, 35, 38, 56, 72] 通过在单一网络中进行基于网格的预测来优先考虑速度。YOLO 系列 [1, 21, 27, 47, 59, 60] 在这一范式中扮演了重要角色,并在现实世界中得到广泛应用。此外,DETR [28] 及其变体 [28, 69, 77] 通过使用基于 Transformer 的架构移除启发式驱动的组件,标志着一个重大转变。为了获得更细粒度的结果,现有的实例分割方法预测像素级掩码而非边界框坐标 [15]。为此,YOLACT [3] 通过整合原型掩码和掩码系数实现了实时实例分割。基于 DINO [69],MaskDINO [29] 利用查询嵌入和高分辨率像素嵌入图生成二进制掩码。

Text-prompted detection and segmentation. Recent advancements in open-vocabulary object detection [13, 25, 61, 68, 74–76] have focused on detecting novel categories by aligning visual features with textual embeddings. Specifically, GLIP [32] unifies object detection and phrase grounding through grounded pre-training on large-scale image-text pairs, demonstrating robust zero-shot performance. DetCLIP [65] facilitates open-vocabulary learning by enriching the concepts with descriptions. Besides, Grounding DINO [37] enhances this by integrating cross-modality fusion into DINO, improving alignment between text prompts and visual representations. YOLO-World [5] further shows the potential of pretraining small detectors with open recognition capabilities based on the YOLO architecture. YOLO-UniOW [36] builds upon YOLO-World by leveraging the adaptive decision-learning strategy. Similarly, several open-vocabulary instance segmentation models [11, 18, 26, 45, 63] learn rich visual-semantic knowledge from advanced foundation models to perform segmentation on novel object categories. For example, X-Decoder [79] and OpenSeeD [71] explore both open-vocabulary detection and segmentation tasks. APE [54] introduces a universal visual perception model that aligns and prompts all objects in an image using various text prompts.

文本提示的检测与分割。最近在开放词汇目标检测领域的进展 [13, 25, 61, 68, 74-76] 主要集中在通过将视觉特征与文本嵌入对齐来检测新类别。具体来说,GLIP [32] 通过在大规模图像-文本对上的接地预训练统一了目标检测与短语定位,展示了强大的零样本性能。DetCLIP [65] 通过用描述丰富概念来促进开放词汇学习。此外,Grounding DINO [37] 通过将跨模态融合集成到 DINO 中,提高了文本提示与视觉表示之间的对齐。YOLO-World [5] 进一步展示了基于 YOLO 架构的小型检测器在开放识别能力方面的预训练潜力。YOLO-UniOW [36] 在 YOLO-World 的基础上,利用了自适应决策学习策略。同样,一些开放词汇实例分割模型 [11, 18, 26, 45, 63] 从先进的基础模型中学习丰富的视觉-语义知识,以对新对象类别进行分割。例如,X-Decoder [79] 和 OpenSeeD [71] 探索了开放词汇检测和分割任务。APE [54] 引入了一个通用视觉感知模型,使用各种文本提示对齐和提示图像中的所有对象。

Visual-prompted detection and segmentation. While text prompts offer a generic description, certain objects can be challenging to describe with language alone, such as those requiring specialized domain knowledge. In such cases, visual prompts can guide detection and segmentation more flexibly and specifically, complementing text prompts [19, 20]. OV-DETR [67] and OWL-ViT [41] leverage CLIP encoders to process text and image prompts. MQDet [64] augments text queries with class-specific visual information from query images. DINOv [30] explores visual prompts as in-context examples for generic and referring vision tasks. T-Rex2 [20] integrates visual and text prompts by region-level contrastive alignment. For segmentation, based on large-scale data, SAM [23] presents a flexible and strong model that can be prompted interactively and iteratively. SEEM [80] further explores segmenting objects with more various prompt types. Semantic-SAM [31] excels in semantic comprehension and granularity detection, handling both panoptic and part segmentation tasks.

视觉提示检测与分割。尽管文本提示提供了通用的描述,但某些对象可能难以仅用语言描述,例如那些需要专业领域知识的对象。在这种情况下,视觉提示可以更灵活和具体地指导检测和分割,与文本提示形成互补 [19, 20]。OV-DETR [67] 和 OWL-ViT [41] 利用 CLIP 编码器处理文本和图像提示。MQDet [64] 通过从查询图像中提取特定类别的视觉信息来增强文本查询。DINOv [30] 探索了将视觉提示作为通用和引用视觉任务的上下文示例。T-Rex2 [20] 通过区域级别的对比对齐整合了视觉和文本提示。在分割方面,基于大规模数据,SAM [23] 提出了一个灵活且强大的模型,可以交互式和迭代地进行提示。SEEM [80] 进一步探索了使用更多样化的提示类型来分割对象。Semantic-SAM [31] 在语义理解和粒度检测方面表现出色,能够处理全景和部分分割任务。


Figure 2. The overview of YOLOE, which supports detection and segmentation for diverse open prompt mechanisms. For text prompts, we design a re-parameterizable region-text alignment strategy to improve performance with zero inference and transferring overhead. For visual prompts, SAVPE is employed to encode visual cues with enhanced prompt embedding under minimal cost. For the prompt-free setting, we introduce a lazy region-prompt contrast strategy to provide category names for all identified objects efficiently by retrieval.

图 2: YOLOE 的概述,支持多种开放提示机制的检测和分割。对于文本提示,我们设计了一种可重参数化的区域-文本对齐策略,以在零推理和迁移开销下提高性能。对于视觉提示,SAVPE 被用于在最小成本下编码视觉提示,并增强提示嵌入。对于无提示设置,我们引入了惰性区域-提示对比策略,通过检索高效地为所有识别对象提供类别名称。

Prompt-free detection and segmentation. Existing approaches still depend on explicit prompts during inference for open-set detection and segmentation. To address this limitation, several works [33, 40, 49, 62, 66] explore integrating with generative language models to produce object descriptions for all found objects. For instance, GRiT [62] employs a text decoder for both dense captioning and object detection tasks. DetCLIPv3 [66] trains an object captioner on large-scale data, enabling model to generate rich label information. GenerateU [33] leverages the language model to generate object names in a free-form way.

无提示检测与分割。现有方法在开放集检测和分割的推理过程中仍然依赖显式提示。为了解决这一限制,一些工作 [33, 40, 49, 62, 66] 探索了与生成式语言模型(Generative Language Model)的集成,为所有发现的对象生成描述。例如,GRiT [62] 在密集字幕生成和对象检测任务中使用了一个文本解码器。DetCLIPv3 [66] 在大规模数据上训练了一个对象字幕生成器,使模型能够生成丰富的标签信息。GenerateU [33] 则利用语言模型以自由形式生成对象名称。

Closing remarks. To the best of our knowledge, aside from DINO-X [49], few efforts have achieved object detection and segmentation across various open prompt mechanisms within a single architecture. However, DINO-X entails extensive training cost and notable inference overhead, severely constraining the practicality for real-world edge deployments. In contrast, our YOLOE aims to deliver an efficient and unified model that enjoys real-time performance and efficiency with easy deployability.

结语。据我们所知,除了 DINO-X [49] 之外,很少有研究在单一架构内实现了跨多种开放提示机制的目标检测和分割。然而,DINO-X 需要大量的训练成本和显著的推理开销,严重限制了其在现实世界边缘部署中的实用性。相比之下,我们的 YOLOE 旨在提供一个高效且统一的模型,兼具实时性能与效率,并且易于部署。

3. Methodology

3. 方法论

In this section, we detail designs of YOLOE. Building upon YOLOs (Sec. 3.1), YOLOE supports text prompts through RepRTA (Sec. 3.2), visual prompts via SAVPE (Sec. 3.3), and prompt-free scenario with LRPC (Sec. 3.4).

在本节中,我们详细介绍 YOLOE 的设计。基于 YOLOs(第 3.1 节),YOLOE 通过 RepRTA(第 3.2 节)支持文本提示,通过 SAVPE(第 3.3 节)支持视觉提示,并通过 LRPC(第 3.4 节)支持无提示场景。

3.1. Model architecture

3.1. 模型架构

As shown in Fig. 2, YOLOE adopts the typical YOLOs' architecture [1, 21, 47], consisting of a backbone, PAN, regression head, segmentation head, and object embedding head. The backbone and PAN extract multi-scale features for the image. For each anchor point, the regression head predicts the bounding box for detection, and the segmentation head produces the prototype and mask coefficients for segmentation [3]. The object embedding head follows the structure of the classification head in YOLOs, except that the output channel number of the last 1\times1 convolution layer is changed from the class number in the closed-set scenario to the embedding dimension. Meanwhile, given text and visual prompts, we employ RepRTA and SAVPE to encode them as normalized prompt embeddings \mathcal{P} , respectively. They serve as the classification weights and contrast with the anchor points' object embeddings \mathcal{O} to obtain category labels. The process can be formalized as

如图 2 所示,YOLOE 采用了典型的 YOLOs 架构 [1, 21, 47],包括骨干网络、PAN、回归头、分割头和物体嵌入头。骨干网络和 PAN 提取图像的多尺度特征。对于每个锚点,回归头预测检测的边界框,分割头生成分割的原型和掩码系数 [3]。物体嵌入头遵循 YOLOs 中分类头的结构,只是将最后一层 1\times1 卷积层的输出通道数从封闭集场景中的类别数更改为嵌入维度。同时,给定文本和视觉提示,我们使用 RepRTA 和 SAVPE 将它们分别编码为归一化的提示嵌入 \mathcal{P}。它们作为分类权重,并与锚点的物体嵌入 \mathcal{O} 进行对比以获得类别标签。该过程可以形式化为

$$\mathrm{Label}=\mathcal{O}\cdot\mathcal{P}^{\top}\in\mathbb{R}^{N\times C},\quad\mathcal{O}\in\mathbb{R}^{N\times D},\ \mathcal{P}\in\mathbb{R}^{C\times D},$$

where N denotes the number of anchor points, C indicates the number of prompts, and D means the feature dimension of embeddings, respectively.

其中 N 表示锚点数量,C 表示提示数量,D 表示嵌入的特征维度。
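To make the contrast above concrete, here is a minimal PyTorch sketch of the matching step between object embeddings and prompt embeddings. The shapes follow the notation in the text; the random tensors stand in for real network outputs and are not part of the released implementation.

```python
import torch
import torch.nn.functional as F

# Shapes follow the notation above: N anchor points, C prompts, D embedding dim.
N, C, D = 8400, 80, 512

# Placeholders for the object embedding head output and the prompt embeddings
# produced by RepRTA (text) or SAVPE (visual); real values come from the network.
object_emb = F.normalize(torch.randn(N, D), dim=-1)   # O in R^{N x D}
prompt_emb = F.normalize(torch.randn(C, D), dim=-1)   # P in R^{C x D}

# Contrast object embeddings with prompt embeddings to get per-anchor category scores.
label = object_emb @ prompt_emb.T                     # R^{N x C}
print(label.shape)  # torch.Size([8400, 80])
```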

3.2. Re-parameterizable region-text alignment

3.2. 可重参数化的区域-文本对齐

In open-set scenarios, the alignment between textual and object embeddings determines the accuracy of identified categories. Prior works usually introduce complex cross-modality fusion to improve the visual-textual representation for better alignment [5, 37]. However, these ways incur notable computational overhead, especially with a large number of texts. Given this, we present the Re-parameterizable Region-Text Alignment (RepRTA) strategy, which improves pretrained textual embeddings during training through a re-parameterizable lightweight auxiliary network. The alignment between textual and anchor points' object embeddings can be enhanced with zero inference and transferring cost.

在开放集场景中,文本嵌入与物体嵌入的对齐决定了识别类别的准确性。先前的工作通常引入复杂的跨模态融合来改进视觉-文本表示,以实现更好的对齐 [5, 37]。然而,这些方法会带来显著的计算开销,尤其是在文本数量较多时。鉴于此,我们提出了可重参数化的区域文本对齐 (RepRTA) 策略,该策略通过可重参数化的轻量级辅助网络在训练过程中改进预训练的文本嵌入。文本嵌入与锚点物体嵌入的对齐可以在零推理和迁移成本的情况下得到增强。

Specifically, given text prompts T of length C , we first employ the CLIP text encoder [44, 57] to obtain the pretrained textual embedding P=\mathrm{TextEncoder}(T) . Before training, we cache all embeddings of texts in the datasets in advance, and the text encoder can be removed with no extra training cost. Meanwhile, as shown in Fig. 3.(a), we introduce a lightweight auxiliary network f_{\theta} with only one feed-forward block [53, 58], where \theta indicates the trainable parameters and introduces low overhead compared with closed-set training. It derives the enhanced textual embedding \mathcal{P}=f_{\theta}(P)\in\mathbb{R}^{C\times D} for contrasting with the anchor points' object embedding during training, leading to improved visual-semantic alignment. Let K\in\mathbb{R}^{D\times D^{\prime}\times1\times1} be the kernel parameters of the last convolution layer with input features I\in\mathbb{R}^{D^{\prime}\times H\times W} in the object embedding head, \circledast be the convolution operator, and R be the reshape function; we have

具体来说,对于长度为 C 的文本提示 T,我们首先使用 CLIP 文本编码器 [44, 57] 获得预训练的文本嵌入 P=\mathrm{TextEncoder}(T)。在训练之前,我们预先缓存了数据集中所有文本的嵌入,因此文本编码器可以在不增加额外训练成本的情况下移除。同时,如图 3(a) 所示,我们引入了一个轻量级的辅助网络 f_{\theta},它仅包含一个前馈块 [53, 58],其中 \theta 表示可训练参数,与闭集训练相比,引入的开销较低。它生成增强的文本嵌入 \mathcal{P}=f_{\theta}(P)\in\mathbb{R}^{C\times D},用于在训练过程中与锚点的对象嵌入进行对比,从而改善视觉-语义对齐。设 K\in\mathbb{R}^{D\times D^{\prime}\times1\times1} 为对象嵌入头中最后一个卷积层的核参数,输入特征为 I\in\mathbb{R}^{D^{\prime}\times H\times W},\circledast 为卷积运算符,R 为重塑函数,我们有

$$\mathcal{O}=R(I\circledast K)\in\mathbb{R}^{N\times D},\quad\mathrm{Label}=\mathcal{O}\cdot\mathcal{P}^{\top}\in\mathbb{R}^{N\times C}.$$

Moreover, after training, the auxiliary network can be re-parameterized with the object embedding head into the identical classification head of YOLOs. The new kernel parameters K^{\prime}\in\mathbb{R}^{C\times D^{\prime}\times1\times1} for the last convolution layer after re-parameterization can be derived by

此外,训练完成后,辅助网络可以与对象嵌入头一起重参数化为与 YOLOs 相同的分类头。重参数化后,最后一层卷积层的新核参数 K^{\prime}\in\mathbb{R}^{C\times D^{\prime}\times1\times1} 可以通过以下方式得出:

$$K^{\prime}=f_{\theta}(P)\cdot K\in\mathbb{R}^{C\times D^{\prime}\times1\times1}.$$

The final prediction can be obtained by \mathrm{Label}=I\circledast K^{\prime} , which is identical to the original YOLO architecture, leading to zero overhead for deployment and transferring to downstream closed-set tasks.

最终的预测可以通过 \mathrm{Label}=I\circledast K^{\prime} 获得,这与原始的 YOLO 架构相同,因此在部署和迁移到下游闭集任务时不会产生额外开销。
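As a sanity check on this re-parameterization, the sketch below builds a single SwiGLU feed-forward block as the auxiliary network, runs the training-time contrast, folds the refined prompts into the 1\times1 convolution kernel, and verifies that both paths give the same predictions. All tensor sizes, the hidden width, and the module layout are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """One SwiGLU feed-forward block acting as the auxiliary network f_theta."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

C, D, D_in, H, W = 80, 512, 256, 80, 80       # prompts, embed dim, head channels, feature size
f_theta = SwiGLUFFN(D, hidden=1024)           # hidden width is an assumption
P_cached = torch.randn(C, D)                  # pre-cached CLIP text embeddings
K = torch.randn(D, D_in, 1, 1)                # last 1x1 conv kernel of the object embedding head
I = torch.randn(1, D_in, H, W)                # input features of that conv layer

with torch.no_grad():
    # Training-time path: refined prompts contrast with reshaped object embeddings.
    P = f_theta(P_cached)                                        # C x D
    O = F.conv2d(I, K).flatten(2).squeeze(0).T                   # N x D, N = H * W
    label_train = O @ P.T                                        # N x C

    # Re-parameterization: fold f_theta(P) into a new 1x1 kernel K'.
    K_prime = (P @ K.flatten(1)).view(C, D_in, 1, 1)             # C x D_in x 1 x 1
    label_infer = F.conv2d(I, K_prime).flatten(2).squeeze(0).T   # plain closed-set-style head

    assert torch.allclose(label_train, label_infer, rtol=1e-4, atol=1e-2)  # identical predictions
```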

3.3. Semantic-activated visual prompt encoder

3.3. 语义激活的视觉提示编码器

Visual prompts are designed to indicate the object category of interest through visual cues, e.g., boxes and masks. To produce the visual prompt embedding, prior works often employ transformer-heavy designs [20, 30], e.g., deformable attention [78], or an additional CLIP vision encoder [44, 67]. These ways, however, introduce challenges in deployment and efficiency due to complex operators or high computational demands. Considering this, we introduce the Semantic-Activated Visual Prompt Encoder (SAVPE) for efficiently processing visual cues. It features two decoupled lightweight branches: (1) the semantic branch outputs prompt-agnostic semantic features in D channels without the overhead of fusing visual cues, and (2) the activation branch produces grouped prompt-aware weights by interacting visual cues with image features in far fewer channels at low cost. Their aggregation then leads to an informative prompt embedding under minimal complexity.

视觉提示旨在通过视觉线索(例如框和掩码)指示感兴趣的对象类别。为了生成视觉提示嵌入,先前的工作通常采用基于 Transformer 的设计 [20, 30],例如可变形注意力 [78],或额外的 CLIP 视觉编码器 [44, 67]。然而,由于复杂的操作或高计算需求,这些方法在部署和效率方面带来了挑战。考虑到这一点,我们引入了语义激活视觉提示编码器 (SAVPE) 来高效处理视觉线索。它具有两个解耦的轻量级分支:(1) 语义分支在没有融合视觉线索的开销下输出与提示无关的语义特征,通道数为 D;(2) 激活分支通过在较少通道中以低成本将视觉线索与图像特征交互,生成分组的提示感知权重。它们的聚合在最小复杂度下产生信息丰富的提示嵌入。


Figure 3. (a) The structure of lightweight auxiliary network in RepRTA, which consists of one SwiGLU FFN block [53]. (b) The structure of SAVPE, which consists of semantic branch to generate prompt-agnostic semantic features and activation branch to provide grouped prompt-aware weights. Visual prompt embedding can thus be efficiently derived by their aggregation.

图 3: (a) RepRTA 中的轻量级辅助网络结构,由一个 SwiGLU FFN 块 [53] 组成。(b) SAVPE 的结构,由语义分支和激活分支组成,语义分支生成与提示无关的语义特征,激活分支提供分组提示感知权重。通过它们的聚合,可以高效地生成视觉提示嵌入。

As shown in Fig. 3.(b), in the semantic branch, we adopt a similar structure to the object embedding head. With multi-scale features {P_{3},P_{4},P_{5}} from PAN, we employ two 3\times3 convs for each scale, respectively. After upsampling, features are concatenated and projected to derive semantic features S\in\mathbb{R}^{D\times H\times W} . In the activation branch, we formalize the visual prompt as a mask with 1 for the indicated region and 0 for others. We downsample it and leverage a 3\times3 conv to derive the prompt feature F_{V}\in\mathbb{R}^{A\times H\times W} . Besides, we obtain image features F_{I}\in\mathbb{R}^{A\times H\times W} for fusion with it from {P_{3},P_{4},P_{5}} by convs. F_{V} and F_{I} are then concatenated and utilized to output prompt-aware weights \mathcal{W}\in\mathbb{R}^{A\times H\times W} , which are normalized using softmax within the prompt-indicated region. Moreover, we divide the channels of S into A groups with \textstyle{\frac{D}{A}} channels in each. The channels in the i-th group share the weight \mathcal{W}_{i:i+1} from the i-th channel of \mathcal{W} . With A\ll D , we can process visual cues with image features in low dimension, bringing minimal cost. Furthermore, the prompt embedding can be derived by aggregation of the two branches as

如图 3(b) 所示,在语义分支中,我们采用了与对象嵌入头相似的结构。利用 PAN 提供的多尺度特征 {P_{3},P_{4},P_{5}},我们分别为每个尺度使用了两个 3\times3 卷积。上采样后,特征被拼接并投影以得到语义特征 S\in\mathbb{R}^{D\times H\times W}。在激活分支中,我们将视觉提示形式化为掩码,指示区域为 1,其他区域为 0。我们对其进行下采样,并利用 3\times3 卷积得到提示特征 F_{V}\in\mathbb{R}^{A\times H\times W}。此外,我们从 {P_{3},P_{4},P_{5}} 通过卷积获得图像特征 F_{I}\in\mathbb{R}^{A\times H\times W},用于与其融合。F_{V} 和 F_{I} 随后被拼接并用于输出提示感知权重 \mathcal{W}\in\mathbb{R}^{A\times H\times W},该权重在提示指示区域内使用 softmax 进行归一化。此外,我们将 S 的通道分为 A 组,每组有 \textstyle{\frac{D}{A}} 个通道。第 i 组中的通道共享来自 \mathcal{W} 的第 i 个通道的权重 \mathcal{W}_{i:i+1}。由于 A\ll D,我们可以在低维度下处理视觉线索,带来最小的成本。此外,提示嵌入可以通过两个分支的聚合得到:

$$\mathcal{P}=\mathrm{Concat}\Big(\Big\{\sum_{h,w}\mathcal{W}_{i,h,w}\cdot S_{\frac{(i-1)D}{A}:\frac{iD}{A},\,h,w}\Big\}_{i=1}^{A}\Big)\in\mathbb{R}^{D}.$$

It can thus contrast with anchor points' object embeddings to identify objects of the category of interest.

因此,它可以与锚点的对象嵌入进行对比,以识别具有感兴趣类别的对象。
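A minimal sketch of this grouped aggregation is given below, assuming the outputs of the semantic and activation branches are already available. The masked softmax and channel grouping follow the description above; the feature sizes and the example box region are placeholders.

```python
import torch

D, A, H, W = 512, 16, 80, 80          # embedding dim, groups (A << D), feature map size

# Placeholders for the two branch outputs described above.
S = torch.randn(D, H, W)              # semantic branch: prompt-agnostic features
logits = torch.randn(A, H, W)         # activation branch: pre-softmax weights
mask = torch.zeros(H, W)              # visual prompt rasterized as a binary mask
mask[20:40, 30:60] = 1.0              # hypothetical box region

# Softmax over spatial positions, restricted to the prompt-indicated region.
logits = logits.masked_fill(~mask.bool(), float("-inf"))
weights = torch.softmax(logits.flatten(1), dim=-1).view(A, H, W)   # W in R^{A x H x W}

# Each of the A channel groups of S (D / A channels) shares one weight map.
weights_full = weights.repeat_interleave(D // A, dim=0)            # R^{D x H x W}
prompt_emb = (S * weights_full).sum(dim=(1, 2))                    # P in R^{D}
prompt_emb = torch.nn.functional.normalize(prompt_emb, dim=0)
print(prompt_emb.shape)  # torch.Size([512])
```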

3.4. Lazy region-prompt contrast

3.4. 惰性区域提示对比

In the prompt-free scenario without explicit guidance, models are expected to identify all objects in the image along with their names. Prior works usually formulate such a setting as a generative problem, where a language model is employed to generate categories for densely found objects [33, 49, 62]. However, this introduces notable overhead, where language models, e.g., FlanT5-base [6] with 250M parameters in GenerateU [33] and OPT-125M [73] in DINO-X [49], are far from meeting high efficiency requirements. Given this, we reformulate such a setting as a retrieval problem and present the Lazy Region-Prompt Contrast (LRPC) strategy. It lazily retrieves category names from a built-in large vocabulary for anchor points with objects in a cost-effective way. Such a paradigm enjoys zero dependency on language models while achieving favorable efficiency and performance.

在没有明确指导的无提示场景中,模型需要识别图像中的所有对象并给出其名称。先前的工作通常将此类场景形式化为生成问题,即使用语言模型为密集找到的对象生成类别 [33, 49, 62]。然而,这引入了显著的开销,例如 GenerateU [33] 中使用的 FlanT5-base [6](具有 2.5 亿参数)和 DINO-X [49] 中的 OPT-125M [73],远未满足高效率要求。鉴于此,我们将此类场景重新形式化为检索问题,并提出了惰性区域-提示对比(LRPC)策略。它以成本效益高的方式从内置的大词汇表中惰性检索带有对象的锚点的类别名称。这种范式完全不依赖语言模型,同时具有良好的效率和性能。

Specifically, with the pretrained YOLOE, we introduce a specialized prompt embedding and train it exclusively to find all objects, where objects are treated as one category. Meanwhile, we follow [16] to collect a large vocabulary which covers various categories and serves as the built-in data source for retrieval. One may directly leverage the large vocabulary as text prompts for YOLOE to identify all objects, which, however, incurs notable computational cost by contrasting abundant anchor points' object embeddings with numerous textual embeddings. Instead, we employ the specialized prompt embedding \mathcal{P}_{s} to find the set \mathcal{O}^{\prime} of anchor points corresponding to objects by

具体来说,借助预训练的 YOLOE,我们引入了一个专门的提示嵌入 (prompt embedding),并对其进行专门训练以找到所有对象,这些对象被视为一个类别。同时,我们遵循 [16] 的方法收集了一个覆盖各类别的大词汇表,并将其作为检索的内置数据源。可以直接利用这个大词汇表作为 YOLOE 的文本提示来识别所有对象,但这会通过将大量锚点的对象嵌入与众多文本嵌入进行对比而带来显著的计算成本。相反,我们采用专门的提示嵌入 \mathcal{P}_{s} 来找到对应于对象的锚点集合 \mathcal{O}^{\prime}

$$\mathcal{O}^{\prime}=\{o\in\mathcal{O}\mid o\cdot\mathcal{P}_{s}^{\top}>\delta\},$$

where \mathcal{O} denotes all anchor points and \delta is the threshold hyper-parameter for filtering. Then, only anchor points in \mathcal{O}^{\prime} are lazily matched against the built-in vocabulary to retrieve category names, bypassing the cost for irrelevant anchor points. This further improves efficiency without a performance drop, facilitating real-world applications.

其中 \mathcal{O} 表示所有锚点,\delta 是用于过滤的阈值超参数。然后,仅将 \mathcal{O}^{\prime} 中的锚点与内置词汇表进行惰性匹配,以检索类别名称,从而避免了对不相关锚点的开销。这进一步提高了效率且不会导致性能下降,有助于实际应用。
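The two-step retrieval can be sketched as follows: filter anchors with the specialized prompt embedding, then match only the surviving anchors against cached vocabulary text embeddings. The sigmoid score mapping, the random tensors, and the variable names are assumptions for illustration; only the threshold value mirrors the paper's default.

```python
import torch
import torch.nn.functional as F

N, D, V = 8400, 512, 4585             # anchors, embedding dim, built-in vocabulary size
delta = 0.001                         # filtering threshold (the paper's default for LRPC)

# Placeholders: anchor object embeddings, the specialized "find all objects" prompt
# embedding, and cached text embeddings of the built-in vocabulary.
O = F.normalize(torch.randn(N, D), dim=-1)
p_s = F.normalize(torch.randn(D), dim=0)
vocab = F.normalize(torch.randn(V, D), dim=-1)

# Step 1: keep only anchor points whose objectness score exceeds delta
# (a sigmoid score mapping is assumed here).
scores = (O @ p_s).sigmoid()
kept = scores > delta
O_kept = O[kept]                      # O': anchor points that contain objects

# Step 2: lazily retrieve a category for the kept anchor points only.
category_idx = (O_kept @ vocab.T).argmax(dim=-1)
print(kept.sum().item(), category_idx.shape)
```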

3.5. Training objective

3.5. 训练目标

During training, we follow [5] to obtain an online vocabulary for each mosaic sample with the texts involved in the images as positive labels. Following [21], we leverage task-aligned label assignment to match predictions with ground truths. The binary cross entropy loss is employed for classification, with IoU loss and distributed focal loss adopted for regression. For segmentation, we follow [3] to utilize binary cross-entropy loss for optimizing masks.

在训练过程中,我们遵循 [5] 为每个马赛克样本获取在线词汇表,并将图像中的文本作为正标签。根据 [21],我们利用任务对齐的标签分配将预测与真实值匹配。分类采用二元交叉熵损失,回归采用 IoU 损失和分布式焦点损失。对于分割,我们遵循 [3] 使用二元交叉熵损失来优化掩码。
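For reference, a compact sketch of how these loss terms could be combined on matched positive anchors is shown below. The task-aligned assigner and the distributed focal loss term are omitted, and the loss weights and tensor contents are placeholders, not the training configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """IoU of matched box pairs in (x1, y1, x2, y2) format."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    return inter / (area_a + area_b - inter + 1e-7)

# Placeholder predictions and targets for M matched positive anchors, C prompts.
M, C = 32, 80
cls_logits = torch.randn(M, C)
cls_targets = torch.rand(M, C)                      # soft targets from the assigner
xy = torch.rand(M, 2) * 100
pred_boxes = torch.cat([xy, xy + torch.rand(M, 2) * 50 + 1], dim=1)
xy = torch.rand(M, 2) * 100
gt_boxes = torch.cat([xy, xy + torch.rand(M, 2) * 50 + 1], dim=1)
pred_masks = torch.randn(M, 160, 160)
gt_masks = torch.randint(0, 2, (M, 160, 160)).float()

cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
box_loss = (1.0 - pairwise_box_iou(pred_boxes, gt_boxes)).mean()
mask_loss = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)

# Placeholder weighting; the distributed focal loss term would be added here as well.
total_loss = cls_loss + 7.5 * box_loss + mask_loss
```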

4. Experiments

4. 实验

4.1. Implementation details

4.1. 实现细节

Model. For fair comparison with [5], we employ the same YOLOv8 architecture [21] for YOLOE. Besides, to verify its good generalizability on other YOLOs, we also experiment with the YOLO11 architecture [21]. For both of them, we provide three model scales, i.e., small (S), medium (M), and large (L), to suit various application needs. Text prompts are encoded using the pretrained MobileCLIP-B(LT) [57] text encoder. We empirically use A=16 in SAVPE, by default.

模型。为了与[5]进行公平比较,我们在YOLOE中采用了相同的YOLOv8架构[21]。此外,为了验证其在其他YOLO架构上的良好泛化能力,我们还使用了YOLO11架构[21]进行实验。对于这两种架构,我们提供了三种模型规模,即小型(S)、中型(M)和大型(L),以适应不同的应用需求。文本提示使用预训练的MobileCLIP-B(LT)[57]文本编码器进行编码。默认情况下,我们经验性地在SAVPE中使用A=16

Data. We follow [5] to utilize detection and grounding datasets, including Objects365 (V1) [52] and GoldG [22] (including GQA [17] and Flickr30k [43]), where images from COCO [34] are excluded. Besides, we leverage the advanced SAM-2.1 [46] model to generate pseudo instance masks using ground truth bounding boxes from the detection and grounding datasets for segmentation data. These masks undergo filtering and simplification to eliminate noise [9]. For visual prompt data, we follow [20] to leverage ground truth bounding boxes for visual cues. In prompt-free tasks, we reuse the same datasets, but annotate all objects as a single category to learn a specialized prompt embedding.

数据。我们遵循 [5] 使用检测和定位数据集,包括 Objects365 (V1) [52]、GoldG [22](包含 GQA [17] 和 Flickr30k [43]),其中排除了来自 COCO [34] 的图像。此外,我们利用先进的 SAM-2.1 [46] 模型,通过检测和定位数据集中的真实边界框生成伪实例掩码,用于分割数据。这些掩码经过过滤和简化以消除噪声 [9]。对于视觉提示数据,我们遵循 [20] 利用真实边界框作为视觉提示。在无提示任务中,我们复用相同的数据集,但将所有对象标注为单一类别以学习专门的提示嵌入。
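The mask filtering and simplification step is described only at a high level; below is a minimal OpenCV sketch of one plausible post-processing pass over a SAM-generated binary mask. The area threshold, the tolerance ratio, and the choice to keep only the largest contour are assumptions, not the paper's exact procedure.

```python
import cv2
import numpy as np
from typing import Optional

def clean_pseudo_mask(mask: np.ndarray, min_area: float = 64.0,
                      eps_ratio: float = 0.01) -> Optional[np.ndarray]:
    """Keep the largest contour of a binary mask and simplify it to a polygon."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                                   # empty mask: drop the sample
    largest = max(contours, key=cv2.contourArea)
    if cv2.contourArea(largest) < min_area:           # discard tiny, noisy masks
        return None
    eps = eps_ratio * cv2.arcLength(largest, True)
    return cv2.approxPolyDP(largest, eps, True)       # simplified polygon vertices
```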

Training. Due to limited computational resources, unlike YOLO-World's training for 100 epochs, we first train YOLOE with text prompts for 30 epochs. Then, we only train the SAVPE for merely 2 epochs with visual prompts, which avoids the additional significant training cost that comes with supporting visual prompts. At last, we train the specialized prompt embedding for only 1 epoch for prompt-free scenarios. During the text prompt training stage, we adopt the same settings as [5]. Notably, YOLOE-v8-S / M / L can be trained on 8 Nvidia RTX 4090 GPUs in 12.0 / 17.0 / 22.5 hours, with 3\times less cost compared with YOLO-World. For visual prompt training, we freeze all other parts and adopt the same setting as in text prompt training. To enable prompt-free capability, we leverage the same data to train a specialized embedding. We can see that YOLOE not only enjoys low training costs but also shows exceptional zero-shot performance. Besides, to verify YOLOE's good transferability on downstream tasks, we fine-tune our YOLOE on COCO [34] for closed-set detection and segmentation. We experiment with two distinct practical fine-tuning strategies: (1) Linear probing: only the classification head is learnable, and (2) Full tuning: all parameters are trainable. For linear probing, we train all models for only 10 epochs. For full tuning, we train small-scale models including YOLOE-v8-S / 11-S for 160 epochs, and medium and large scale models including YOLOE-v8-M / L and YOLOE-11-M / L for 80 epochs, respectively.

训练。由于计算资源有限,与 YOLO-World 的 100 轮训练不同,我们首先使用文本提示训练 YOLOE 30 轮。然后,我们仅使用视觉提示训练 SAVPE 2 轮,这避免了支持视觉提示带来的额外显著训练成本。最后,我们仅为无提示场景训练专用提示嵌入 1 轮。在文本提示训练阶段,我们采用与 [5] 相同的设置。值得注意的是,YOLOE-v8-S / M / L 可以在 8 个 Nvidia RTX 4090 GPU 上分别训练 12.0 / 17.0 / 22.5 小时,与 YOLO-World 相比,成本减少了 3\times。对于视觉提示训练,我们冻结所有其他部分并采用与文本提示训练相同的设置。为了实现无提示能力,我们利用相同的数据训练专用嵌入。我们可以看到,YOLOE 不仅训练成本低,而且表现出出色的零样本性能。此外,为了验证 YOLOE 在下游任务中的良好迁移能力,我们在 COCO [34] 上微调 YOLOE 以进行闭集检测和分割。我们实验了两种不同的实际微调策略:(1)线性探测:仅分类头可学习;(2)完全微调:所有参数可训练。对于线性探测,我们仅训练所有模型 10 轮。对于完全微调,我们训练包括 YOLOE-v8-S / 11-S 的小规模模型 160 轮,训练包括 YOLOE-v8-M / L 和 YOLOE-11-M / L 的中大规模模型 80 轮。

Metric. For text prompt evaluation, we utilize all category names from the benchmark as inputs, adhering to the standard protocol for open-vocabulary object detection tasks. For visual prompt evaluation, following [20], for each category, we randomly sample N training images (N{=}16 by default), extract visual embeddings using their ground truth bounding boxes, and compute the average prompt embedding. For prompt-free evaluation, we employ the same protocol as [33]. A pretrained text encoder [57] is employed to map open-ended predictions to semantically similar category names within the benchmark. In contrast to [33], we streamline the mapping process by selecting the most confident prediction, eliminating the need for top-k selection and beam search. We use the tag list from [16] as the built-in large vocabulary with a total of 4585 category names, and empirically use \delta=0.001 for LRPC, by default. For all three prompt types, following [5, 20, 33], evaluations are conducted on LVIS [14] in a zero-shot manner, which contains 1,203 categories. By default, Fixed AP [7] on the LVIS minival subset is reported. For transferring to COCO, standard AP is evaluated, following [1, 21]. Besides, we measure the FPS for all models on an Nvidia T4 GPU with TensorRT and the mobile device iPhone 12 with CoreML.

评估指标。对于文本提示的评估,我们利用基准测试中的所有类别名称作为输入,遵循开放词汇对象检测任务的标准协议。对于视觉提示的评估,遵循 [20],对于每个类别,我们默认随机采样 N 张训练图像 (N{=}16),使用其真实边界框提取视觉嵌入,并计算平均提示嵌入。对于无提示的评估,我们采用与 [33] 相同的协议。使用预训练的文本编码器 [57] 将开放预测映射到基准测试中的语义相似类别名称。与 [33] 不同的是,我们通过选择最具信心的预测来简化映射过程,无需进行 top-k 选择和束搜索。我们使用 [16] 中的标签列表作为内置的大词汇表,总共有 4585 个类别名称,并默认使用 \delta=0.001 进行 LRPC。对于所有三种提示类型,遵循 [5, 20, 33],在 LVIS [14] 上进行零样本评估,其中包含 1,203 个类别。默认情况下,报告在 LVIS minival 子集上的 Fixed AP [7]。对于迁移到 COCO,遵循 [1, 21] 评估标准 AP。此外,我们在 Nvidia T4 GPU 上使用 TensorRT 和在 iPhone 12 移动设备上使用 CoreML 测量所有模型的 FPS。
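The averaging of visual prompt embeddings over sampled exemplars can be sketched as follows; `encode_visual_prompt` is a hypothetical stand-in for running SAVPE on one image with one ground-truth box, and the random embedding it returns only makes the example self-contained.

```python
import torch
import torch.nn.functional as F

def category_prompt_embedding(encode_visual_prompt, samples, n_samples: int = 16) -> torch.Tensor:
    """Average the visual prompt embeddings of up to n_samples (image, box) exemplars."""
    embs = [encode_visual_prompt(img, box) for img, box in samples[:n_samples]]
    return F.normalize(torch.stack(embs).mean(dim=0), dim=0)

# Hypothetical stand-in so the sketch runs end-to-end (D = 512).
fake_encoder = lambda img, box: torch.randn(512)
emb = category_prompt_embedding(fake_encoder, samples=[(None, None)] * 20)
print(emb.shape)  # torch.Size([512])
```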

Table 1. Zero-shot detection evaluation on LVIS. For fair comparisons, Fixed AP is reported on the LVIS minival set in a zero-shot manner. The training time is for text prompts, based on 8 Nvidia V100 GPUs for [32, 65] and 8 RTX 4090 GPUs for YOLO-World and YOLOE. The FPS is measured on an Nvidia T4 GPU using TensorRT and on an iPhone 12 using CoreML, respectively. Results are provided with text prompt (T) and visual prompt (V) types. For training data, OI, HT, and CH indicate OpenImages [24], HierText [39], and CrowdHuman [51], respectively. OG indicates Objects365 [52] and GoldG [22], and G-20M represents Grounding-20M [50].

表 1: LVIS 上的零样本检测评估。为了公平比较,Fixed AP 在 LVIS minival 集上以零样本方式报告。训练时间针对文本提示,[32, 65] 基于 8 个 Nvidia V100 GPU,YOLO-World 和 YOLOE 基于 8 个 RTX 4090 GPU。FPS 分别在 Nvidia T4 GPU 上使用 TensorRT 和在 iPhone 12 上使用 CoreML 进行测量。结果提供了文本提示 (T) 和视觉提示 (V) 类型。对于训练数据,OI、HT 和 CH 分别表示 OpenImages [24]、HierText [39] 和 CrowdHuman [51]。OG 表示 Objects365 [52] 和 GoldG [22],G-20M 表示 Grounding-20M [50]。

| 模型 | 提示类型 | 参数量 | 训练数据 | 训练时间 | FPS (T4 / iPhone) | AP | APr | APc | APf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP-T [32] | T | 232M | OG, Cap4M | 1337.6h | - / - | 26.0 | 20.8 | 21.4 | 31.0 |
| GLIPv2-T [70] | T | 232M | OG, Cap4M | - | - / - | 29.0 | - | - | - |
| GDINO-T [37] | T | 172M | OG, Cap4M | - | - / - | 27.4 | 18.1 | 23.3 | 32.7 |
| DetCLIP-T [65] | T | 155M | OG | 250.0h | - / - | 34.4 | 26.9 | 33.9 | 36.3 |
| G-1.5 Edge [50] | T | - | G-20M | - | - / - | 33.5 | 28.0 | 34.3 | 33.9 |
| T-Rex2 [20] | V | - | O365, OI, HT, CH, SA-1B | - | - / - | 37.4 | 29.9 | 33.9 | 41.8 |
| YWorldv2-S [5] | T | 13M | OG | 41.7h | 216.4 / 48.9 | 24.4 | 17.1 | 22.5 | 27.3 |
| YWorldv2-M [5] | T | 29M | OG | 60.0h | 117.9 / 34.2 | 32.4 | 28.4 | 29.6 | 35.5 |
| YWorldv2-L [5] | T | 48M | OG | 80.0h | 80.0 / 22.1 | 35.5 | 25.6 | 34.6 | 38.1 |
| YOLOE-v8-S | T / V | 12M / 13M | OG | 12.0h | 305.8 / 64.3 | 27.9 / 26.2 | 22.3 / 21.3 | 27.8 / 27.7 | 29.0 / 25.7 |
| YOLOE-v8-M | T / V | 27M / 30M | OG | 17.0h | 156.7 / 41.7 | 32.6 / 31.0 | 26.9 / 27.0 | 31.9 / 31.7 | 34.4 / 31.1 |
| YOLOE-v8-L | T / V | 45M / 50M | OG | 22.5h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 |
| YOLOE-11-S | T / V | 10M / 12M | OG | 13.0h | 301.2 / 73.3 | 27.5 / 26.3 | 21.4 / 22.5 | 26.8 / 27.1 | 29.3 / 26.4 |
| YOLOE-11-M | T / V | 21M / 27M | OG | 18.5h | 168.3 / 39.2 | 33.0 / 31.4 | 26.9 / 27.1 | 32.5 / 31.9 | 34.5 / 31.7 |
| YOLOE-11-L | T / V | 26M / 32M | OG | 23.5h | 130.5 / 35.1 | 35.2 / 33.7 | 29.1 / 28.1 | 35.0 / 34.6 | 36.5 / 33.8 |

4.2. Text and visual prompt evaluation

4.2. 文本与视觉提示评估

As shown in Tab. 1, for detection on LVIS, YOLOE exhibits favorable trade-offs between efficiency and zero-shot performance across different model scales. We also note that such results are achieved under much less training time, e.g., 3\times faster than YOLO-Worldv2. Specifically, YOLOE-v8-S / M / L outperforms YOLO-Worldv2-S / M / L by 3.5 / 0.2 / 0.4 AP, along with 1.4\times / 1.3\times / 1.3\times and 1.3\times / 1.2\times / 1.2\times inference speedups on T4 and iPhone 12, respectively. Besides, for rare categories which are challenging, our YOLOE-v8-S and YOLOE-v8-L obtain significant improvements of 5.2% and 7.6% \mathrm{AP}_{r} . Besides, compared with YOLO-Worldv2, while YOLOE-v8-M / L achieves lower \mathrm{AP}_{f} , this performance gap primarily stems from YOLOE's integration of both detection and segmentation in one model. Such multi-task learning introduces a trade-off that adversely impacts detection performance on frequent categories, as shown in Tab. 5. Besides, YOL