[Paper Translation] 1 AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems


Original paper: https://arxiv.org/pdf/2503.06669v2


1 AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems


Team AgiBot-World* Project website: https://agibot-world.com/ Code: https://github.com/OpenDriveLab/AgiBot-World



Fig. 1: Introducing AgiBot World Colosseo, an open-sourced large-scale manipulation platform comprising data, models, benchmarks, and an ecosystem. AgiBot World stands out for its unparalleled scale and diversity compared to prior counterparts. A fleet of 100 dual-arm humanoid robots is deployed. We further propose a generalist policy (GO-1) with a latent action planner. Trained across a diverse data corpus, it scales predictably and achieves a 32% performance gain over prior art.


Abstract We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of this data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over a 60% success rate on complex tasks and outperforming the prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.


I. INTRODUCTION


Manipulation is a cornerstone task in robotics, enabling an agent to interact with and adapt to the physical world. While significant progress has been made on general-purpose foundation models for natural language processing [1] and computer vision [2], robotics lags behind due to the difficulty of collecting (high-quality) data. In controlled lab settings, simple tasks such as pick-and-place have been well studied [3], [4]. Yet in the open-set real-world setting, tasks spanning fine-grained object interaction, mobile manipulation, and collaboration remain a formidable challenge [5]. These tasks require not only physical dexterity but also the ability to generalize across diverse environments and scenarios, a capability beyond the reach of current robotic systems. The widely accepted reason is the lack of high-quality data: unlike images and text, which are abundant and standardized, robotic datasets suffer from fragmented clips due to heterogeneous hardware and unstandardized collection procedures, leading to low-quality and inconsistent outcomes. In this work, we ask: how can we effectively address real-world complexity by scaling up real-world robot data?


Recent efforts, such as Open X-Embodiment (OXE) [6], have addressed this by aggregating and standardizing existing datasets. Despite advancements in large-scale cross-embodiment learning, the resulting policies are constrained to naive, short-horizon tasks and generalize weakly to out-of-domain scenarios [4]. DROID [7] collected expert data through crowd-sourcing from diverse real-life scenes. The absence of data quality assurance (with human feedback) and the reliance on a constrained hardware setup (i.e., fixed, single-arm robots) limit its real-world applicability and broader effectiveness. More recently, Lin et al. [8] explored scaling laws governing generalizability across intra-category objects and environments, albeit limited to a few simple, single-step tasks. These efforts represent a notable advancement toward developing generalist policies, moving beyond the traditional focus on single-task learning within narrow domains [9], [3]. Nevertheless, existing robot learning datasets remain constrained by their reliance on short-horizon tasks in highly controlled laboratory environments, failing to adequately capture the complexity and diversity inherent in real-world manipulation tasks. To achieve general-purpose robotic intelligence, it is essential to develop datasets that scale in size and diversity while capturing real-world variability, supported by general-purpose humanoid robots for robust skill acquisition, a standardized data collection pipeline with assured quality, and carefully curated tasks reflecting real-world challenges.


As depicted in Fig. 1, we introduce AgiBot World Colosseo, a full-stack large-scale robot learning platform curated for advancing bimanual manipulation in scalable and intelligent embodied systems. A full-scale 4,000-square-meter facility is constructed to represent five major domains (domestic, retail, industrial, restaurant, and office environments), all dedicated to high-fidelity data collection in authentic everyday scenarios. With over 1 million trajectories collected from 100 real robots, AgiBot World offers unprecedented diversity and complexity. It spans over 100 real-world scenarios, addressing challenging tasks such as fine-grained manipulation, tool usage, and multi-robot synergistic collaboration. Unlike prior datasets, AgiBot World is collected with a fully standardized pipeline, ensuring high data quality and scalability, while incorporating human-in-the-loop verification to guarantee reliability. Our hardware setup includes mobile-base humanoid robots with whole-body control, dexterous hands, and visuo-tactile sensors, enabling rich, multimodal data collection. Each episode is meticulously designed, featuring multiple camera views, depth information, camera calibration, and language annotations for both the overall task and each individual sub-step. This well-rounded hardware setup, combined with various long-horizon, real-world tasks, opens new avenues for developing next-generation generalist policies and fosters diverse future research in robotics.


Our experimental results highlight the transformative potential of the AgiBot World dataset. Policies pre-trained on our dataset achieve an average success rate improvement of 30% compared to those trained on the prior large-scale robot dataset OXE [6]. Notably, even when utilizing only a fraction of our dataset, equivalent to 1/10 of the data volume in hours compared to OXE, the generalizability of pre-trained policies is elevated by 18%. These findings underscore the dataset's efficacy in bridging the gap between controlled laboratory environments and real-world robotic applications. Building on our dataset, and to address the limitations of previous robot foundation models that heavily rely on in-domain robot datasets, we present Genie Operator-1 (GO-1), a novel generalist policy that utilizes latent action representations to enable learning from heterogeneous data and efficiently bridges general-purpose vision-language models (VLMs) with robotic sequential decision-making. Through unified pre-training on web-scale data, spanning human videos to our high-quality robot dataset, GO-1 achieves superior generalization and dexterity, outperforming prior generalist policies such as RDT [10] and our variant without the latent action planner. Moreover, GO-1's performance exhibits robust scalability with increasing dataset size, underscoring its potential for sustained advancement as larger datasets become available.


Beyond its immediate impact, AgiBot World lays a strong foundation for future research in robotic manipulation. By open-sourcing the dataset, toolchain, and pre-trained models, we aim to foster community-wide innovation, enabling researchers to explore more authentic and diverse applications, from household assistance to industrial automation. AgiBot World is more than yet another dataset; it is a step toward scalable, general-purpose robotic intelligence, empowering robots to tackle the complexities of the real world.


Contribution. 1) We construct the AgiBot World dataset, a multifarious robot learning dataset accompanied by open-source tools to advance research on policy learning at scale. As a pioneering initiative, AgiBot World employs an inclusive, optimized pipeline, from scene configuration, task design, and data collection to human-in-the-loop verification, which ensures unparalleled data quality. 2) We propose GO-1, a robot foundation policy using latent action representations to unlock web-scale pre-training on heterogeneous data. Empowered by the AgiBot World dataset, it outperforms prior generalist policies in generalization and dexterity, with performance scaling predictably with dataset size.


Limitation. All evaluations are conducted in real-world scenarios. We are currently developing a simulation environment aligned with the real-world setup, aiming to reflect real-world policy deployment outcomes and thereby facilitate fast and reproducible evaluation.


II. RELATED WORK


Data scaling in robotics. Robot learning datasets from automated scripts or human teleoperation have enabled policy learning, with early efforts like RoboTurk [19] and BridgeData [12] offering small-scale datasets with 2.1k and 7.2k trajectories, respectively. Larger datasets, such as RT-1 [14] (130k trajectories), expand the scope yet remain limited to a few environments and skills. Open X-Embodiment [6] aggregates various datasets into a unified format, growing to more than 2.4 million trajectories; as a consequence, it suffers from significant variability in embodiments and observation perspectives, as well as inconsistent data quality, limiting its overall effectiveness. More recently, DROID [7] moves towards scaling up scenes for greater diversity by crowd-sourcing demonstrations yet falls short in data scale and quality control. The prior datasets above generally face limitations in data scale, task practicality, and scenario naturalness, compounded by inadequate quality assurance and hardware restrictions, which impedes generalist policy training. As shown in Tab. I, our dataset addresses these gaps adequately. We build a data collection facility spanning five scenarios to reconstruct real-world diversity and authenticity. With over 1 million trajectories gathered by skilled teleoperators through rigorous verification protocols, AgiBot World utilizes humanoid robots equipped with visuo-tactile sensors and dexterous hands to enable multimodal demonstrations, setting it apart from previous efforts. Unlike Pumacay et al. [20], which serves as a simulation benchmark for evaluating generalization, what we propose is a full-stack platform with data, models, benchmarks, and an ecosystem.


TABLE I: Comparison to existing datasets. AgiBot World features the largest number of trajectories to date. We replicate real-world environments at a 1:1 scale for the industrial and retail scenarios, which were barely represented before. Extensive human annotations are offered, including item-, scene-, skill- (sub-task segmented), and task-level annotations. Notably, to expand data applicability and potential, we include imperfect data (i.e., failure recovery data with annotated error states) and tasks with dexterous hands. To ensure data quality, we adopt a human-in-the-loop philosophy: policy learning is performed on collected demonstrations, and the deployment results are adopted as feedback to improve the collection protocol.


| Dataset | # Traj. | # Skills | # Scenes | Detailed Annot. | Camera Calib. | Arm Type | Dexterous Hand | Failure Recovery | Human-in-the-Loop | Collection |
|---|---|---|---|---|---|---|---|---|---|---|
| RoboNet [11] | 162k | n/a | 10 | | | single-arm | × | × | | scripted |
| BridgeData [12] | 7.2k | 4 | 12 | | | single-arm | | | | human teleop |
| BC-Z [13] | 26k | 3 | 1 | | | single-arm | × | | | human teleop |
| RT-1 [14] | 130k | 8 | 2 | | | single-arm | × | × | | human teleop |
| RH20T [15] | 13k | 33 | 7 | | | single-arm | × | | | human teleop |
| RoboSet [16] | 98.5k | 6 | 11 | | | single-arm | × | × | | 30% human / 70% scripted |
| BridgeData V2 [17] | 60.1k | 13 | 24 | × | | single-arm | × | × | | 85% human / 15% scripted |
| DROID [7] | 76k | 86 | 564 | | | single-arm | × | | | human teleop |
| RoboMIND [18] | 55k | 36 | n/a | × | | single-arm + dual-arm | | | | human teleop |
| Open X-Embodiment [6] | 1.4M | 217 | 311 | (×) | | single-arm + dual-arm | | | | dataset aggregation |
| AgiBot World Dataset | 1M+ | 87 | 106 | ✓ | ✓ | dual-arm | ✓ | ✓ | ✓ | human teleop |

Policy learning at scale. Robotic foundation models often co-evolve with the growth of dataset scale, equipping robots with escalating general-purpose capabilities through diverse, large-scale training. Several prior works use only web-scale video to facilitate policy learning, given the limited scale of action-labeled robot datasets [21], [22], [23]. Another line of work lies in large, end-to-end models trained on robot trajectories as robotics data scales up [4], [24], [14], [25]. For instance, RDT [10] employs Diffusion Transformers, initially pre-trained on heterogeneous multi-robot datasets and fine-tuned on over 6k dual-arm trajectories, showcasing the benefits of pre-training on diverse sources. $\pi_{0}$ [26] uses a pre-trained VLM backbone and a flow-based action expert, advancing dexterous manipulation for complex tasks like laundry. LAPA [27] introduces the use of latent actions as pre-training targets; however, its latent planning capability is not preserved for downstream tasks. Building on a variety of innovative ideas from recent research, we advance the field by transferring web-scale knowledge to robotic control through the adaptation of vision-language models (VLMs) with latent actions, leveraging both human videos and robot data for scalable training. Our work demonstrates how the integration of a latent action planner enhances long-horizon task execution and enables more efficient policy learning, significantly improving upon existing generalist policies.


III. AGIBOT WORLD: PLATFORM AND DATA


AgiBot World is a full-stack, open-source embodied intelligence ecosystem. Based on the hardware platform developed by us, AgiBot G1, we construct AgiBot World, an open-source robot manipulation dataset collected by more than 100 homogeneous robots, providing high-quality data for challenging tasks spanning a wide spectrum of real-life scenarios. The latest version contains 1,001,552 trajectories, with a total duration of 2,976.4 hours, covering 217 specific tasks, 87 skills, and 106 scenes. We go beyond basic tabletop tasks such as pick-and-place in lab environments; instead, we concentrate on real-world scenarios involving dual-arm manipulation, dexterous hands, and collaborative tasks. AgiBot World aims to provide an inclusive benchmark to drive the future development of advanced and robust algorithms.


We plan to release all resources to enable the community to build upon AgiBot World. The dataset is available under the CC BY-NC-SA 4.0 license, along with the model checkpoints and code.

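As a usage illustration, the snippet below sketches how a release hosted on the Hugging Face Hub could be fetched. This is a minimal sketch assuming a Hub-hosted release; the repo id and file pattern are illustrative assumptions, so consult the project website for the authoritative location and layout.

```python
# A minimal sketch of fetching dataset metadata, assuming a Hugging Face
# hosted release. The repo id and file pattern below are assumptions;
# see https://agibot-world.com/ for the authoritative location.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="agibot-world/AgiBotWorld-Alpha",  # hypothetical repo id
    repo_type="dataset",
    allow_patterns=["*.json"],  # fetch lightweight metadata first; videos are large
)
print(local_dir)
```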

A. Hardware: A Versatile Humanoid Robot


The hardware platform is the cornerstone of AgiBot World, determining the lower bound of its quality. The standardization of hardware is also key to streamlining distributed data collection and ensuring reproducible results. We meticulously develop a novel hardware platform for AgiBot World, distinguished by visuo-tactile sensors and durable 6-DoF dexterous hands in a humanoid configuration.


As illustrated in Fig. 1, our robotic platform features dual 7-DoF arms, a mobile chassis, and an adjustable waist. The end-effectors are modular, allowing for the use of either a standard gripper or a 6-DoF dexterous hand, depending on task requirements. For tasks necessitating tactile feedback, a gripper equipped with visuo-tactile sensors is utilized. The robot is outfitted with eight cameras: an RGB-D camera and three fisheye cameras for the front view, RGB-D or fisheye cameras mounted on each end-effector, and two fisheye cameras positioned at the rear. Image observations and proprioceptive states, including joint and end-effector positions, are recorded at a control frequency of 30 Hz.

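To make the recording format concrete, the sketch below models one timestep of this multimodal stream as a plain data structure. The field names and shapes are illustrative assumptions derived from the description above, not the dataset's actual schema.

```python
# A hedged sketch of one recorded timestep, mirroring the sensor suite
# described above (eight cameras plus joint and end-effector states at
# 30 Hz). Field names and shapes are illustrative, not the actual schema.
from dataclasses import dataclass
from typing import List
import numpy as np

CONTROL_HZ = 30  # recording/control frequency stated in the text

@dataclass
class RobotObservation:
    timestamp: float                  # seconds since episode start
    head_rgbd: np.ndarray             # (H, W, 4) front RGB-D camera
    fisheye_views: List[np.ndarray]   # 3 front + 2 rear fisheye frames
    wrist_views: List[np.ndarray]     # one RGB-D or fisheye per end-effector
    joint_positions: np.ndarray       # dual 7-DoF arms -> shape (14,)
    eef_poses: np.ndarray             # (2, 7): position + quaternion per arm
    end_effector_state: np.ndarray    # gripper width or 6-DoF hand joints
```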


Fig. 2: Data collection pipeline. We embrace a human-in-the-loop framework to ensure high quality, enriched with detailed annotations and error recovery behaviors. Human feedback plays a critical role not only in post-collection review but also in actively guiding the data collection process, which is largely overlooked in prior efforts.


We employ two teleoperation systems: VR headset control and whole-body motion capture control. The VR controller maps hand gestures to end-effector translation and rotation, which are subsequently converted to joint angles through inverse kinematics. The thumbsticks and buttons on the controller enable robot base and body movement, while the trigger buttons control end-effector actuation. However, the VR controller restricts the dexterous hand to only a few predefined gestures. To more fully unlock our robot's capabilities, we adapt a motion capture system that records the data of human joints, including the fingers, and maps them to robot posture, enabling more nuanced control, including individual finger movements, torso pose, and head orientation. This system provides the posture flexibility and execution precision required to achieve more complex manipulation tasks.

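The sketch below illustrates the VR control path described above: a controller pose is mapped to an end-effector target, converted to joint angles via inverse kinematics, and the trigger drives end-effector actuation. The `solve_ik` stub and the pose dictionary format are stand-ins for whatever IK solver and message types the real system uses.

```python
# A minimal sketch of the VR teleoperation mapping: controller pose ->
# end-effector target -> joint angles via inverse kinematics. `solve_ik`
# is a placeholder, not a real solver.
import numpy as np

def solve_ik(eef_position, eef_quaternion, seed_joints):
    """Placeholder for a 7-DoF arm IK solver (e.g., damped least squares)."""
    return seed_joints  # stand-in: a real solver returns 7 joint angles

def teleop_step(vr_pose, trigger_value, seed_joints):
    # Map the hand-held controller pose to an end-effector target.
    target_pos = vr_pose["position"]      # (3,) in the robot base frame
    target_quat = vr_pose["orientation"]  # (4,) unit quaternion
    joint_cmd = solve_ik(target_pos, target_quat, seed_joints)
    gripper_cmd = trigger_value           # trigger drives actuation in [0, 1]
    return joint_cmd, gripper_cmd

# Example: one 30 Hz control tick with a dummy pose.
pose = {"position": np.zeros(3), "orientation": np.array([0.0, 0.0, 0.0, 1.0])}
print(teleop_step(pose, 0.5, np.zeros(7)))
```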

B. Data Collection: Protocol and Quality


The data collection session, as shown in Fig. 2, can be broadly divided into three phases. (1) Before formally commencing data collection, we first conduct preliminary data acquisition to validate the feasibility of each task and establish corresponding collection standards. (2) After feasibility validation and review of the collection standards, skilled teleoperators arrange the initial scene and formally begin data collection according to the established standards. All data undergo an initial validity verification locally, such as verifying the absence of missing frames. Once the data are confirmed to be complete, they are uploaded to the cloud for the next phase. (3) During post-processing, data annotators verify whether each episode meets the collection standards established in phase (1) and provide language annotations.

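As an example of the local validity verification in phase (2), the function below flags dropped frames by scanning for gaps in the 30 Hz timestamp stream. This is a sketch under assumed tolerances, not the project's actual checker.

```python
# A sketch of a missing-frame check: detect gaps in the 30 Hz timestamp
# stream. The tolerance value is an assumption for illustration.
def find_missing_frames(timestamps, hz=30, tolerance=1.5):
    """Return indices where the gap to the next frame exceeds
    `tolerance` nominal periods (i.e., at least one frame was dropped)."""
    period = 1.0 / hz
    gaps = []
    for i in range(len(timestamps) - 1):
        if timestamps[i + 1] - timestamps[i] > tolerance * period:
            gaps.append(i)
    return gaps

# Example: a frame is missing between t=0.066 and t=0.133.
print(find_missing_frames([0.0, 0.033, 0.066, 0.133]))  # -> [2]
```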

Failure recovery. During data collection, teleoperators may occasionally commit errors, such as inadvertently dropping objects while manipulating the robotic arms. However, they are often able to recover from these errors and successfully complete the task without requiring a full reconfiguration of the setup. Rather than discarding such trajectories, we retain them and manually annotate each with the corresponding failure reasons and timestamps. These trajectories, referred to as failure recovery data, constitute approximately one percent of the dataset. We consider them invaluable for achieving policy alignment [28] and failure reflection [29], essential for advancing the next generation of robot foundation models.

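A possible shape for such annotations is sketched below: each episode carries its failure segments, each with a reason and timestamps. The schema is hypothetical, meant only to illustrate how failure recovery data can be stored alongside regular demonstrations.

```python
# An illustrative (hypothetical) schema for failure-recovery annotations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FailureSegment:
    start_t: float  # when the error occurs (seconds)
    end_t: float    # when recovery completes
    reason: str     # e.g., "object slipped from gripper"

@dataclass
class EpisodeAnnotation:
    task_instruction: str
    success: bool  # task completed despite the intermediate error
    failures: List[FailureSegment] = field(default_factory=list)

ep = EpisodeAnnotation(
    task_instruction="place the cup on the shelf",
    success=True,
    failures=[FailureSegment(12.4, 18.9, "object slipped from gripper")],
)
```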

Human-in-the-loop. Concurrent with feedback collection from data annotators, we adopt a human-in-the-loop approach to assess and refine data quality. This process involves an iterative cycle of collecting a small set of demonstrations, training a policy, and deploying the resulting policy to evaluate data availability. Based on the policy's performance, we iteratively refine the data collection pipeline to address identified gaps or inefficiencies. For instance, during real-world deployment, the model exhibited prolonged pauses at the onset of actions, aligning with data annotator feedback highlighting inconsistent transitions and excessive idle time in the collected data. In response, we revised the data collection protocols and introduced a post-processing step to eliminate idle frames, thereby enhancing the dataset's overall utility for policy learning. This feedback-driven methodology ensures continuous improvement in data quality.

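The idle-frame removal step can be illustrated with a simple velocity-based filter, as sketched below. The thresholds are assumed values for illustration; the actual post-processing pipeline may differ.

```python
# A hedged sketch of idle-frame removal: drop frames that sit inside a
# long motionless stretch, judged by joint velocity. Thresholds are
# assumptions, not the project's actual settings.
import numpy as np

def drop_idle_frames(joint_positions, hz=30, vel_thresh=1e-3, min_idle_s=1.0):
    """joint_positions: (T, D) array. Returns indices of frames to keep."""
    vel = np.linalg.norm(np.diff(joint_positions, axis=0), axis=1) * hz
    idle = np.concatenate([[False], vel < vel_thresh])
    keep, run_start = [], None
    for t, is_idle in enumerate(idle):
        if is_idle and run_start is None:
            run_start = t  # an idle run begins here
        if not is_idle:
            run_start = None
        # Keep the frame unless it sits deep inside a long idle run.
        if run_start is None or (t - run_start) < min_idle_s * hz:
            keep.append(t)
    return keep
```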

C. Dataset Statistics and Analysis: Beyond Scale


AgiBot World is developed through a large-scale data collection facility spanning over 4,000 square meters. This extensive environment contains over 3,000 unique objects in a variety of scenes, meticulously designed to reflect real-world settings. The dataset covers a wide range of scenarios and scene setups, ensuring both scale and diversity in the pursuit of a generalizable robot policy.


Reconstructing the diversity of the real world. Key statistics of our dataset are presented in Fig. 3. AgiBot World provides extensive coverage across five key domains: domestic, retail, industrial, restaurant, and office environments. Within each domain, we further define specific scene categories. For instance, the domestic domain includes detailed environments such as bedrooms, kitchens, living rooms, and balconies, while the retail domain features distinct areas like shelving units and fresh produce sections. Our dataset also features over 3,000 distinct objects, systematically categorized across various scenes. These objects span a wide range of everyday items, including food, furniture, clothing, electronic devices, and more. The distribution of object categories, as illustrated in Fig. 3(a), highlights the relative frequency of different object types within each scene.



Fig. 3: Dataset Statistics. a) AgiBot World dataset covers the vast majority of robotic application scenarios, as well as a wide range of interactive objects. b) Our dataset features long-horizon tasks, with the majority of trajectories ranging from 30s to 60s. In contrast, widely used datasets, such as DROID, primarily consist of trajectories ranging from 5s to 20s, while OXE v1.0 predominantly contains trajectories within 5s. c) AgiBot World dataset focuses on valuable atomic skills, spanning a wide spectrum of skills, each supported by a minimum of 100 trajectories (red dashed line above).


Long-horizon manipulation. A distinguishing feature of the AgiBot World dataset is its emphasis on long-horizon manipulation. As shown in Fig. 3(b), prior datasets predominantly focus on tasks involving single atomic skills, with most trajectories lasting no more than 5 seconds. In contrast, AgiBot World is built upon continuous and complete tasks composed of multiple atomic skills, such as "make a coffee". Trajectories in our dataset typically span approximately 30 seconds, some lasting over 2 minutes. We also provide key-frame and instruction annotations for each sub-step to facilitate policy learning in such challenging scenarios.

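The sketch below shows one way such sub-step annotations can be consumed during training: a long-horizon episode is sliced into atomic-skill clips, each paired with its instruction. The annotation tuple format and the example boundaries are assumptions for illustration.

```python
# A sketch of slicing a long-horizon episode (e.g., "make a coffee") into
# atomic-skill clips using key-frame boundaries. The (start, end,
# instruction) format is a hypothetical annotation layout.
def slice_episode(frames, substeps):
    """substeps: list of (start_idx, end_idx, instruction) tuples."""
    clips = []
    for start, end, instruction in substeps:
        clips.append({"instruction": instruction, "frames": frames[start:end]})
    return clips

substeps = [
    (0, 240, "pick up the cup"),            # ~8 s at 30 Hz
    (240, 600, "place it under the spout"),
    (600, 1050, "press the brew button"),
]
clips = slice_episode(list(range(1050)), substeps)
print([c["instruction"] for c in clips])
```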

Comprehensive skill coverage. In terms of task design, while generic atomic skills, such as "pick-and-place", dominate the majority of tasks, we have intentionally incorporated tasks that emphasize less frequently used but highly valuable skills, such as "chop" and "plug" (as shown in Fig. 3(c)). This ensures that our dataset adequately represents a broad spectrum of skills, providing sufficient data for each to support robust policy learning.


IV. AGIBOT WORLD: MODEL


To effectively utilize our high-quality AgiBot World dataset and enhance the policy's generalizability, we propose a hierarchical Vision-Language-Latent-Action (ViLLA) framework with three training stages, as depicted in Fig. 4. Compared to Vision-Language-Action (VLA) models, where actions are conditioned directly on vision and language, the ViLLA model predicts latent action tokens, on which the generation of subsequent robot control actions is conditioned.

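To make the distinction concrete, the sketch below shows a toy forward pass in this hierarchical style: a backbone encodes the vision-language context, a planner predicts latent action tokens, and the action head is conditioned on both. All modules and dimensions are stand-ins, not GO-1's actual architecture.

```python
# A highly simplified sketch of the ViLLA idea: context -> latent action
# tokens -> robot actions conditioned on those tokens. Module choices and
# dimensions are illustrative assumptions only.
import torch
import torch.nn as nn

class ViLLASketch(nn.Module):
    def __init__(self, d=512, n_latent_tokens=4, vocab=32, action_dim=14):
        super().__init__()
        self.backbone = nn.Linear(768, d)  # stand-in for a VLM encoder
        self.planner = nn.Linear(d, n_latent_tokens * vocab)
        self.action_head = nn.Linear(d + n_latent_tokens * vocab, action_dim)
        self.n, self.v = n_latent_tokens, vocab

    def forward(self, vl_features):
        h = self.backbone(vl_features)
        logits = self.planner(h).view(-1, self.n, self.v)
        latent = torch.softmax(logits, dim=-1).flatten(1)  # soft latent tokens
        # Actions are conditioned on both context and predicted latent actions.
        return self.action_head(torch.cat([h, latent], dim=-1))

out = ViLLASketch()(torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 14])
```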

In Stage 1, we project consecutive images into a latent action space by training an encoder-decoder latent action model (LAM) on internet-scale heterogeneous data. This allows the latent action to serve as an intermediate representation, bridging the gap between general image-t