THE AGENT COMPANY: BENCHMARKING LLM AGENTS ON CONSEQUENTIAL REAL WORLD TASKS

Original paper: https://arxiv.org/pdf/2412.14161

Frank F. Xu1* Yufan Song2* Boxuan Li2* Yuxuan Tang2 Kritanjali Jain1 Mengxue Bao2 Zora Z. Wang1 Xuhui Zhou1 Zhitong Guo1 Murong Cao2 Mingyang Yang2 Hao Yang Lu2 Amaad Martin1 Zhe Su1 Leander Maben1 Raj Mehta1 Wayne Chi1 Lawrence Jang1 Yiqing Xie1 Shuyan Zhou3 Graham Neubig1

1Carnegie Mellon University 2Independent 3Duke University {fangzhex, gneubig}@cs.cmu.edu, {yufans, boxuanli}@alumni.cmu.edu

ABSTRACT


We interact with computers every day, in both daily life and work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has been rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into its workflows and for economic policy seeking to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents on real-world professional tasks, we introduce The Agent Company, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with coworkers. We build a self-contained environment with internal websites and data that mimics a small software company, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks can be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.

Website: https://the-agent-company.com
Code: https://github.com/TheAgentCompany/TheAgentCompany
Evaluations: https://github.com/TheAgentCompany/experiments

1 INTRODUCTION


We are in the midst of a technological transformation. With the rapid year-by-year and month-by-month progress brought about by large language models (LLMs), we are seeing AI-based assistance or automation become commonplace in tasks that were unthinkable only a few years ago. In fact, the pace of progress is so fast that some have gone so far as to claim that the majority of human labor may be automatable within the next couple of years (Eloundou et al., 2023; Amodei & Fridman, 2024). On the other hand, others are skeptical, claiming that language models cannot truly reason (Kambhampati et al., 2024), do not generalize well to novel tasks (Chollet et al., 2024), and may only have an impact on a small minority of the labor market (Wittenstein, 2024).

What is the reason for this disconnect? We argue that it is, in part, due to a lack of objective benchmarks that not only demonstrate the power of existing LLM-based agents to accelerate a wide variety of repetitive tasks encountered in everyday workplaces, but also provide appropriate caveats about the tasks that agents cannot do. This is a pressing issue, because the commercial and policy implications of diverse and effective acceleration or automation of work-related tasks will be broad, both positive (e.g. improved quality of life and accelerated scientific discovery) and negative (e.g. potential displacement or loss of jobs and increased wealth disparities). In this paper, we take some first steps towards closing this gap, providing a clearer view of where we are now with respect to acceleration or automation of consequential work-related tasks, and a litmus test for future development in this direction.


Figure 1: An overview of The Agent Company benchmark. It features a reproducible and self-hosted environment, simulated colleagues to test agent communication capabilities, checkpoint and execution-based evaluation, and a set of 175 diverse, realistic and professional tasks in a software engineering company setting.

Concretely, we propose a benchmark, The Agent Company (Figure 1), that estimates the ability of AI agents to perform tasks encountered in everyday workplaces. We create a simulated software development company where agents must perform tasks related to software engineering, project management, financial analysis, and other typical tasks encountered in such business settings. The agents must browse the web, code, and interact with other simulated co-workers to achieve success on the provided tasks. The Agent Company’s environment is based entirely on open-source software and is self-hostable for reproducibility purposes, and we create rigorous evaluators that also assign partial credit when the agent gets the answer partially correct.

We perform experiments using seven large language model backbones, including API-based models such as Anthropic Claude (Anthropic, 2023), OpenAI GPT-4o (OpenAI, 2024), Google Gemini (Team et al., 2023), and Amazon Nova (Intelligence, 2024), as well as open models including Meta Llama (Dubey et al., 2024) and Alibaba Qwen (Yang et al., 2024). All models are run using the OpenHands agent framework (Wang et al., 2024b), which provides a stable and strong agent harness for both web browsing and coding. In our experiments, we find that the best-performing model, Claude 3.5 Sonnet, was able to autonomously perform 24.0% of the provided tests to completion, and achieved a score of 34.4% on our metric that provides extra credit for partially completed tasks.

These results present a nuanced picture of the current ability of AI agents to perform tasks. Agents powered by the current gold-standard AI techniques are able to autonomously perform a wide variety of tasks encountered in everyday work. However, they are not close to automating every task encountered in a workspace, even on the subset of tasks presented in The Agent Company, which are well-scoped administrative and coding tasks encountered in a software company’s day-to-day work.


In the rest of this paper, we detail comparisons to other existing benchmarks (§ 2), how we set up realistic and reproducible environments (§ 3), how we define tasks (§ 4) and how we create them (§ 5), our baseline agent (§ 6), experimental results (§ 7), and finally implications and future directions (§ 8).

Table 1: Comparison of different AI agent benchmarks. Interface: the interface the agent has access to, among web browser, desktop, API usage, Python script, chat platform, and bash terminal. Supported Tasks: tasks in the benchmark; * indicates tasks with no association with real-world occupations; SE refers to software engineering, HR is human resources, PM is project manager. Checkpoint-based evaluation: whether tasks are evaluated at intermediate checkpoints and assigned partial scores. Interact with NPC Agents: whether the agent can interact with other NPC agents during task-solving.

Framework | Diverse Real-world Work | Task Categories | Requires Interaction | Long-Horizon w/ Checkpoints | Interface | Self-Hosted Environment
MiniWob++ (Liu et al., 2018) × Browsing*
Mind2Web (Deng et al., 2023) Browsing* × ×
WebLINX (Lu et al., 2024) Browsing*
AssistantBench (Yoran et al., 2024) × Browsing*
WebArena (Zhou et al., 2023) × Browsing* × × LR
VisualWebArena (Koh et al., 2024) × Browsing* × × LR
VideoWebArena (Jang et al., 2024) × Browsing* × × R
WorkArena (Drouin et al., 2024) Enterprise Software
OSWorld (Xie et al., 2024) Office, Coding ×
Windows Agent Arena (Bonatti et al., 2024) Browsing*, Office, Coding ×
AppWorld (Trivedi et al., 2024) × Daily
Gorilla APIBench (Patil et al., 2023) × Coding ×
$\tau$-bench (Yao et al., 2024) Retail, Airline × ×
SWE-bench (Jimenez et al., 2024) × SWE × ×
DevBench (Li et al., 2024) × SWE × × ×
Smallville (Park et al., 2023) × Social*
Sotopia (Zhou et al., 2024) × Social* ×
TheAgentCompany SWE, HR, Admin, PM, Research, Finance

2 BENCHMARK DESIDERATA AND COMPARISON TO OTHER BENCHMARKS


In order to evaluate the ability of agents to perform tasks in complex real-world settings, we built The Agent Company with a number of desiderata in mind. The comparison with several existing prominent agent benchmarks with respect to these desiderata is in Table 1.


Coverage of Multiple Work-related Tasks: In order to make any valid statements about the potential of AI to accelerate or automate various types of real-world work, we should have tasks that are motivated by real-world work across multiple job categories. Many benchmarks are either not relevant to real-world work (e.g. MiniWob++ (Liu et al., 2018)) or very relevant to real-world work but only over a limited scope of tasks (e.g. SWE-Bench (Jimenez et al., 2024)). In contrast, The Agent Company contains a set of more diverse, realistic, and professional tasks that would typically be completed by multiple job roles in a software engineering company.

Requirement for Interaction: If agents are to integrate into real-world workplaces, they will need to communicate with the other human members of the workspace. Most other benchmarks do not measure communication or interactivity, with the exception of $\tau$-bench (Yao et al., 2024), which only measures interaction in customer service scenarios. The Agent Company provides a better testbed for communication, as many tasks involve asking colleagues for information and providing information to them as part of a more complex task.

Long-horizon Tasks with Checkpoints: In real-world settings, many tasks require taking many different steps to achieve a higher-level goal. One major novel contribution of The Agent Company is that it both (1) contains tasks that require an agent to perform significantly more consecutive work (i.e. involving more steps and realistically taking human professionals longer to accomplish) than previous benchmarks, and (2) provides granular evaluators that measure the ability of models to perform subtasks of these larger tasks.

Versatile Environment Interface: In order to handle a diversity of tasks in real-world settings, agents should at minimum be able to interact with the tools that real-world workers use, including web interfaces, programs, command-line terminals, and communication tools. The Agent Company covers all of these interfaces, while most previous benchmarks focus only on one or two.

Self-hosted and Reproducible: In order to allow for careful comparisons between different methods that remain constant over time, the benchmark should be fully self-hosted and reproducible. This contrasts with existing benchmarks that do not have execution environments (e.g. Mind2Web (Deng et al., 2023)) or require the usage of third-party software (e.g. WorkArena (Drouin et al., 2024)).


3 THE AGENT COMPANY ENVIRONMENT SETUP


Our benchmark is set in an imaginary software engineering startup called The Agent Company, hence the benchmark’s name. Within The Agent Company, we create tasks inspired by tasks handled by workers inside such companies. More details about the company’s imaginary background, overview and employees can be found in Appendix A. The benchmark environment contains multiple components.


Local Workspace: The local workspace runs locally on the agent’s host, which is analogous to a human professional’s local workspace, e.g. their work laptop. This environment is created as a sandboxed Docker environment to provide a safe execution environment that will not affect other parts of the evaluation machine (Wang et al., 2024b). This environment is where agents work on the task, and within it the baseline agent of The Agent Company (§ 6) uses a browser, a code editor, and a Linux terminal with typical software preinstalled.

Intranet: This part of the environment mimics the company’s internal websites that host code, documents, project management software, and communications software. To achieve our goal of a reproducible, self-contained environment, we follow WebArena (Zhou et al., 2023) in using open-source, self-hostable software to host our environment. The environment mainly contains the following websites:

  1. GitLab, an open-source alternative to source-code repositories such as GitHub. This is used for hosting The Agent Company’s code repositories and tech-oriented wiki pages.
  2. OwnCloud, an open-source alternative to office software such as Google Drive or Microsoft Office. This is used to save and share files, especially for document storage and collaborative editing.
  3. Plane, an open-source alternative to task management software such as Jira or Linear. This is used to track issues, run sprint cycles, and manage product roadmaps.
  4. RocketChat, an open-source alternative to communication software such as Slack. This is a company-internal real-time messaging tool that facilitates collaboration between employees.

All the hosted websites are reproducible and resettable, with mock data inspired by that of a software engineering company. The data inside these company-internal websites is populated with real-world software project data, as well as data manually curated by co-authors who have some experience in the relevant corporate roles.

Simulated Colleague Communication: One major aspect of working in a company is communicating with other company members, and in The Agent Company we also test the ability of models to perform this type of communication. Specifically, we allow agents to use RocketChat to message other company members and obtain information that may not be available in the original task description. To create these simulated colleagues, we rely on the Sotopia platform (Zhou et al., 2024), which supports the creation of simulated human characters with LLMs. Each simulated colleague is equipped with a detailed profile that includes their name, role, responsibilities, and project affiliations.

Table 2: Example task intents and checkpoints for three domains.

| Domain | Task Intent | Checkpoints |
| --- | --- | --- |
| SWE | Set up JanusGraph and run it locally with an HTTP endpoint: · Clone JanusGraph directory under /workspace folder. · Build the binary file. · Launch JanusGraph server locally on port 8182 with an HTTP endpoint. | Checkpoint 1 (1pt): Check if JanusGraph repo is cloned. Checkpoint 2 (3pts): Check if the binary file is built (requires skipping Docker in pom.xml, hence higher points). Checkpoint 3 (2pts): Check the JanusGraph Server as an HTTP endpoint. |
| Finance | Navigate to ownCloud at http://the-agent-company.com:8092 and complete Section B (Alternative Simplified Credit) of IRS Form 6765: · Gather necessary information from /Documents/Financials/TAC_financials.csv and /workspace/research_wages.csv. · Consult /Documents/Financials/f6765_instructions.pdf for instructions. · Contact the finance director (David Wong) on Rocket.Chat (http://the-agent-company.com:3000/home) for ambiguous questions. · Save the filled form as /workspace/filled_f6765.pdf. | Checkpoint 1 (5pts): Check if all 16 questions in Section B of the form have been answered correctly. Checkpoint 2 (3pts): Check if the correct finance director (David Wong) was contacted about the ambiguous questions. |
| PM | Analyze The Agent Company's performance and create a summary in Plane: · Access Plane (http://the-agent-company.com:8091/tac/) and navigate to "Analytics". · Collect metrics: Open Tasks, Backlog Tasks, Unstarted Tasks, Started Tasks, Unassigned Issues, Pending Issues. · Create a summary and share it on Rocket.Chat (http://the-agent-company.com:3000/home) in the #kudos channel. | Checkpoint 1 (1pt): Check if Plane was accessed and the agent navigated to the "Analytics" section. Checkpoint 2 (3pts): Check if all required project metrics were collected. Checkpoint 3 (1pt): Check if the summary was shared in the #kudos channel on Rocket.Chat. |

For example, the profile of Sarah Johnson, who serves as the CTO, states that she oversees technical strategy planning and R&D team leadership, with access to all technical channels. Agents can interact with these simulated colleagues through direct messages or in specific channels, as is standard in RocketChat and other platforms. By default, all simulated human characters are backed by the Claude-3-5-Sonnet-20241022 LLM across experiments, as we found that it provided the best results during preliminary experiments. For example conversations between the agent and the simulated colleagues drawn from empirical experiments, please refer to Appendix B.

4 TASK STRUCTURE


The tasks in The Agent Company include a task intent, a list of checkpoints that the agent must achieve, a programmatic evaluator to check success on these checkpoints, and code to initialize and finalize the environment. We show some examples in Table 2, and describe each aspect in detail below.

Task Intent: Each task begins with an English description, simulating how a user would instruct an LLM-based agent to perform a real-world task. In general, we aim for these tasks to be clear enough that a human worker would be able to complete the task without asking for further instructions directly from the user (although they may need to ask questions of their other co-workers).

Checkpoints: Tasks are divided into checkpoints representing intermediate milestones. Each checkpoint is assigned a point value, based on its significance to the overall completion of the task, that is used to measure progress. Checkpoints are written in English, and typically specify one or more of the following:

• Action Completion: Verifying whether required actions, such as using tools, navigating to URLs, or collecting data, were carried out successfully.

• Data Accuracy: Evaluating the correctness and completeness of the output, such as extracted data or formatted documents.

• Collaboration: Assessing interactions with simulated colleagues or sharing of output, such as posting messages or asking for additional information to complete the task.
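The task structure described so far can be pictured as a small data model. The following is a minimal sketch, not the benchmark's actual code: the class names `Checkpoint` and `Task` and their fields are hypothetical, and the example values are taken from the SWE row of Table 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    """One intermediate milestone (hypothetical structure)."""
    description: str
    points: int        # points available for this checkpoint
    awarded: int = 0   # points granted by an evaluator, between 0 and `points`


@dataclass
class Task:
    """A task intent together with its checkpoints (hypothetical structure)."""
    intent: str
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def total_points(self) -> int:
        # Sum of points available across all checkpoints.
        return sum(c.points for c in self.checkpoints)

    def awarded_points(self) -> int:
        # Sum of points actually granted so far.
        return sum(c.awarded for c in self.checkpoints)


# Values from the SWE example in Table 2.
janusgraph_task = Task(
    intent="Set up JanusGraph and run it locally with an HTTP endpoint",
    checkpoints=[
        Checkpoint("JanusGraph repo is cloned", points=1),
        Checkpoint("Binary file is built", points=3),
        Checkpoint("JanusGraph Server runs as an HTTP endpoint", points=2),
    ],
)
```

Keeping the available and awarded points on each checkpoint makes it straightforward to derive both a per-checkpoint report and task-level totals.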

Evaluators: Checkpoints are created in the task design phase, but for actual evaluation, each of the checkpoints must be concretely implemented through an evaluator, a program that checks the completion of the checkpoint. These evaluators are implemented by examining environment states, such as the local workspace, intranet status, or simulated colleague interactions, or by analyzing agent trajectories, for example by verifying browsing history or action sequences.

In most cases, these evaluators are deterministic and written as simple Python functions. For instance, in the SWE task in Table 2, the checkpoints are deterministic: verifying whether the JanusGraph repository is cloned, the binary file is built, and the server is launched with an HTTP endpoint. However, for tasks with more complex and unstructured deliverables, deterministic evaluation can be challenging due to subjectivity and variability. For example, the last checkpoint of the Finance task in Table 2 requires contacting the correct finance director (David Wong) to resolve ambiguous questions, which involves a judgment from a (simulated) human colleague. In such cases, we employ LLM-based evaluation: we prompt LLMs with predefined rubrics or reference outputs to assess the agent’s deliverables, enabling a more nuanced and flexible evaluation of these tasks. As with the NPC backbone, all LLM-based evaluators are backed by Claude-3-5-Sonnet-20241022.
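As an illustration of what a deterministic evaluator might look like, the sketch below checks the first SWE checkpoint by inspecting the local workspace. It is an assumption-laden example rather than the benchmark's evaluator: the function names, the `janusgraph` directory name, and the choice of testing for a `.git` directory are all hypothetical.

```python
import os
import tempfile


def repo_is_cloned(workspace: str, repo_name: str) -> bool:
    """Return True if <workspace>/<repo_name> contains a .git directory."""
    return os.path.isdir(os.path.join(workspace, repo_name, ".git"))


def grade_clone_checkpoint(workspace: str, points: int = 1) -> int:
    """Award all points if the (hypothetically named) repo is present, else 0."""
    return points if repo_is_cloned(workspace, "janusgraph") else 0


# Usage: simulate a workspace before and after a clone.
with tempfile.TemporaryDirectory() as ws:
    before = grade_clone_checkpoint(ws)
    os.makedirs(os.path.join(ws, "janusgraph", ".git"))
    after = grade_clone_checkpoint(ws)
```

Because the check depends only on the state of the file system, running it twice on the same workspace gives the same result, which is the property that makes such evaluators reproducible.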

4.1 EVALUATION METRICS


Due to our checkpoint-based evaluation scheme and the need for showcasing both the progress of the agent’s capability improvement as well as the eventual goal completion ability, we calculate two scalar agent capability metrics and two efficiency metrics.


Full completion score: We define the full completion score $S_{\mathrm{full}}$ as:

$$
S_{\mathrm{full}}=\begin{cases}1 & \text{if all checkpoints are successfully passed,}\\ 0 & \text{otherwise.}\end{cases}
$$

This binary metric evaluates whether the agent successfully completed the task by passing all checkpoints.


Partial completion score: To provide a more nuanced measure that rewards partial task completion while strongly incentivizing full task completion, we define the partial completion score $S_{\mathrm{partial}}$ as:

$$
S_{\mathrm{partial}}=0.5\cdot\frac{\mathrm{Result}}{\mathrm{Total}}+0.5\cdot S_{\mathrm{full}},
$$

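where Result is the sum of points awarded across all checkpoints, Total is the sum of points available across all checkpoints, and $S_{\mathrm{full}}$ is the full completion score defined above.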

This formulation ensures that agents are awarded partial credit in proportion to the points achieved, reflecting their progress toward task completion. At the same time, full task completion is strongly incentivized by incorporating an additional 50% credit, which is awarded only when all checkpoints are successfully completed. This design ensures that agents achieving partial progress receive scores scaled linearly with their performance, while those reaching 100% completion are distinctly rewarded to emphasize the importance of achieving the end goal.
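The two capability metrics can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the function names are hypothetical, checkpoints are represented as `(awarded, available)` point pairs, and a checkpoint is assumed to count as passed only when it receives all of its available points.

```python
from typing import Sequence, Tuple

# Each checkpoint is an (awarded, available) pair of points.
CheckpointScore = Tuple[int, int]


def full_completion_score(checkpoints: Sequence[CheckpointScore]) -> int:
    """S_full: 1 if every checkpoint earned all of its points, else 0."""
    return 1 if all(awarded == available for awarded, available in checkpoints) else 0


def partial_completion_score(checkpoints: Sequence[CheckpointScore]) -> float:
    """S_partial = 0.5 * Result / Total + 0.5 * S_full."""
    result = sum(awarded for awarded, _ in checkpoints)
    total = sum(available for _, available in checkpoints)
    if total <= 0:
        raise ValueError("a task needs at least one checkpoint with positive points")
    return 0.5 * result / total + 0.5 * full_completion_score(checkpoints)


# Half of the points earned, task not fully completed.
half_done = partial_completion_score([(2, 2), (1, 1), (1, 2), (0, 2), (0, 1)])
# Every checkpoint fully passed.
all_done = partial_completion_score([(2, 2), (1, 1), (2, 2), (2, 2), (1, 1)])
```

Under this formula an agent that earns every point but one can score at most just under 0.5, which is how the metric keeps a clear gap between near-complete and complete tasks.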

Figure 2: Example The Agent Company workflow illustrating an agent managing a sprint for the RisingWave project. The task involves identifying and moving unfinished issues to the next sprint cycle, notifying assignees of those issues, running a code coverage script, uploading a summarized report to OwnCloud, and incorporating feedback on the report from a simulated project manager.

Number of steps: The number of steps is defined as the total number of LLM calls made during the task execution. This metric quantifies the operational effort required to perform the task.

Cost per instance: The cost per instance measures the monetary cost of querying the underlying LLM through its API to complete a task. Assuming no prompt caching, the cost is calculated as:

$$
\mathrm{Cost}=(\text{Prompt token count}\times\text{Prompt token cost})+(\text{Completion token count}\times\text{Completion token cost}).
$$
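A token-based cost of this kind can be sketched as follows, assuming separate per-token prices for prompt and completion tokens and no prompt caching. The function name, parameter names, token counts, and prices are all hypothetical and chosen only for illustration; they are not taken from the paper or from any provider's price list.

```python
def cost_per_instance(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_token: float,
    completion_price_per_token: float,
) -> float:
    """Monetary cost of one task run, summed over all LLM calls (no caching)."""
    return (
        prompt_tokens * prompt_price_per_token
        + completion_tokens * completion_price_per_token
    )


# Illustrative numbers only: 1M prompt tokens at $3 per million and
# 100k completion tokens at $15 per million.
example_cost = cost_per_instance(1_000_000, 100_000, 3e-6, 15e-6)
```

Since each step re-sends the accumulated context, prompt tokens typically dominate the total for long-horizon tasks, so the step count and the cost tend to grow together.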

4.2 WORKFLOW


Each task typically follows a workflow involving the following stages:


  1. Initialization: The agent sets up its workspace and prepares to execute the task.
  2. Execution: The agent completes subtasks, such as navigating tools, collecting data, or processing information. If required by the task, the agent also interacts with simulated colleagues or shares results via communication platforms.
  3. Finalization: The agent produces and submits the final output for evaluation.

Example Task: We consider a task designed to evaluate an agent’s ability to perform realistic project management workflows using multiple tools and services hosted in the benchmark. The task involves managing a sprint for the RisingWave project, requiring the agent to execute interdependent steps such as sprint issue management, team communication, repository operations, and report generation while incorporating feedback from a simulated project manager.

The workflow as illustrated in Figure 2 begins with the agent identifying unfinished issues in the current sprint on Plane and updating their sprint assignments. This step is worth 2 points and is fully completed, earning the agent the maximum score of 2/2. Next, the agent successfully notifies the relevant assignees using Rocket.Chat regarding their pending tasks and earns 1/1 point.

如图 2 所示的工作流程始于 AI 智能体识别 Plane 上当前冲刺中未完成的问题并更新其冲刺任务。此步骤价值 2 分且已完全完成,AI 智能体因此获得满分 2/2 分。接着,AI 智能体成功使用 Rocket.Chat 通知相关任务负责人其待办任务,并获得了 1/1 分。

The agent then proceeds to clone the RisingWave repository from GitLab and execute a Python script in the terminal to calculate updated code coverage. This step, worth 2 points, is only partially completed: the agent successfully clones the repository but fails to run the code coverage script. As a result, the agent earns $1/2$ points for this checkpoint. The subsequent steps, generating and sharing the sprint summary report on OwnCloud and incorporating feedback from a simulated project manager, are not completed, resulting in scores of $0/2$ and $0/1$, respectively. Notably, these checkpoints can also fail if the report does not meet the quality standards of the LLM-based evaluator, which assesses the report for clarity, completeness, and successful incorporation of feedback. This ensures that the assessment reflects both the generation of outputs and their qualitative relevance to the task.

随后,智能体从 GitLab 克隆 RisingWave 仓库,并在终端执行 Python 脚本来计算更新的代码覆盖率。这一步价值 2 分,但仅部分完成,因为智能体成功克隆了仓库但未能运行代码覆盖率计算。因此,智能体在此检查点获得 $1/2$ 分。接下来的步骤——在 OwnCloud 上生成并分享冲刺总结报告,以及整合模拟项目经理的反馈——均未完成,分别获得 $0/2$ 和 $0/1$ 分。值得注意的是,如果报告未能达到基于大语言模型的评估器的质量标准,检查点也可能失败。该评估器会评估报告的清晰度、完整性以及反馈的成功整合情况。这确保了评估不仅反映输出的生成,还反映其与任务的质量相关性。

Finally, the overall score is calculated using the partial completion formula defined in § 4.1, where the total possible points are 8 and the awarded points sum to 4. Substituting these values, the agent achieves a final score of 0.25 (25%). Our scoring mechanism thus rewards incremental progress while strongly incentivizing full completion.

最后,使用 § 4.1 中定义的部分完成公式计算总体得分,其中总分为 8 分,获得的分数总和为 4 分。代入这些值后,智能体的最终得分为 0.25 (25%)。我们的评分机制因此奖励了逐步进展,同时强烈鼓励完全完成。
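The arithmetic of this example can be sketched as follows. The sketch assumes the § 4.1 formula gives half of the credit for the fraction of checkpoint points earned and the other half only when every checkpoint is passed, which is consistent with 4 of 8 points yielding 0.25 above. The function and variable names are illustrative.

```python
def partial_completion_score(awarded: int, total: int) -> float:
    """Score in [0, 1]: half for checkpoint progress, half for full completion.

    Assumed form: 0.5 * (awarded / total) + 0.5 * (1 if fully completed else 0).
    """
    fully_completed = 1.0 if awarded == total else 0.0
    return 0.5 * (awarded / total) + 0.5 * fully_completed


# (awarded, possible) points for the five checkpoints of the sprint task.
checkpoints = [
    (2, 2),  # move unfinished issues to the next sprint on Plane
    (1, 1),  # notify assignees on Rocket.Chat
    (1, 2),  # clone the repository and run code coverage (only cloned)
    (0, 2),  # generate and share the report on OwnCloud
    (0, 1),  # incorporate the project manager's feedback
]

awarded = sum(a for a, _ in checkpoints)
total = sum(t for _, t in checkpoints)
score = partial_completion_score(awarded, total)
```

Under this form, an agent that passes every checkpoint but one can score at most just under 0.5, which is how the metric favors full completion.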

This example represents a typical task in The Agent Company benchmark, where agents are required to handle complex workflows involving multiple tools and interdependent steps. By evaluating both partial progress and overall outcomes, our benchmark provides a rigorous and realistic measure of agent performance, allowing us to identify agents' strengths and pinpoint areas for improvement in task execution.

此示例代表了The Agent Company基准测试中的典型任务,要求AI智能体处理涉及多种工具和相互依赖步骤的复杂工作流。通过评估部分进展和整体结果,我们的基准测试提供了严格且现实的智能体性能衡量标准,使我们能够识别其优势并找出任务执行中的改进领域。

5 TASK CREATION

5 任务创建

5.1 CHOOSING TASK CATEGORIES

5.1 选择任务类别

Many previous agent benchmarks discussed in § 2 were created to evaluate agents on tasks people perform in daily life (Zhou et al., 2023; Lù et al., 2024; Deng et al., 2023), or tasks that accomplish digital chores (Yoran et al., 2024; Trivedi et al., 2024). Obtaining realistic tasks for a benchmark poses challenges. Some benchmarks (Xie et al., 2024; Drouin et al., 2024; Yoran et al., 2024) crowdsourced tasks based on predetermined interfaces, platforms, and services available to the agent. They also adopt a strategy of first gathering task templates and then instantiating more task instances by filling in the variables. Other benchmarks (Zhou et al., 2023; Koh et al., 2024; Bonatti et al., 2024) took a semi-systematic approach of reviewing the action history of the research team and choosing tasks that reflected the types of task the researchers carried out in their daily life. These approaches have several obvious issues if we want The Agent Company benchmark to evaluate agents with broader implications. Despite some grounding in realistic data, the process of creating tasks from these data relied on heuristics, and no consideration was given to how important or time-consuming the tasks are. The tasks are biased towards those important for academics in computer science and do not reflect the tasks performed by the entire population.

许多先前在 $\S~2$ 中讨论的智能体基准测试是为了评估智能体在人们日常生活中执行的任务(Zhou et al., 2023; Lù et al., 2024; Deng et al., 2023),或者完成数字杂务的任务(Yoran et al., 2024; Trivedi et al., 2024)。为基准测试获取现实任务带来了挑战。一些基准测试(Xie et al., 2024; Drouin et al., 2024; Yoran et al., 2024)基于预定的接口、平台和服务,通过众包方式获取任务。他们还采用了一种策略,即首先收集任务模板,然后通过填充变量来实例化更多任务实例。一些基准测试(Zhou et al., 2023; Koh et al., 2024; Bonatti et al., 2024)采用了半系统化的方法,审查研究团队的行动历史,并选择反映研究人员日常生活中执行的任务类型。如果我们希望在 The Agent Company 基准测试中评估具有更广泛意义的智能体,这种方法存在几个明显的问题。尽管这些数据有一定的现实基础,但从这些数据中创建任务的过程容易受到启发式方法的影响,并且没有考虑任务的重要性或耗时程度。这些任务偏向于对计算机科学学者重要的任务,并不能反映整个人群执行的任务。

In The Agent Company, we attempt to cover a wide variety of tasks motivated by real-world work. While it is highly challenging to create a representative sample of tasks, fortunately we can rely on existing resources created for other purposes as a reference. Specifically, we start by referencing the 29.1 release of the $\mathrm{O^{*}NET}$ database ($\mathrm{O^{*}NET}$, 2024; Rounds et al., 1999), which is a database of jobs performed by workers in the US created by the US Department of Labor. It also contains information about tasks performed within the context of each job, abilities required to perform each task, whether the task is a major or minor task for that job category, and other pieces of relevant information.

在 The Agent Company 中,我们尝试涵盖由现实工作驱动的各种任务。虽然创建一个具有代表性的任务样本极具挑战性,但幸运的是,我们可以依赖为其他目的创建的现有资源作为参考。具体来说,我们从参考 $\mathrm{O^{*}NET}$ 数据库的 29.1 版本($\mathrm{O^{*}NET}$,2024;Rounds 等,1999)开始,该数据库是由美国劳工部创建的,记录了美国工人从事的工作。它还包含每项工作中执行的任务信息、执行每项任务所需的能力、该任务在该工作类别中是主要任务还是次要任务,以及其他相关信息。

Based on this data, we first identified a few occupation categories to focus on. First, based on statistics from $\mathrm{O^{*}NET}$, we identified job categories in which a large number of people are employed. Then, we took the median salary for each of these job categories from US Department of Labor statistics and multiplied it by the number of employees in that category to estimate the aggregate value of performing this job.

基于这些数据,我们首先确定了几类职业类别作为重点。首先,根据 $\mathrm{O^{*}}\mathrm{NET}$ 的统计数据,我们确定了从事人数较多的职业类别。然后,我们使用美国劳工统计局的每个职业类别的中位数薪资信息,并乘以该类别中的员工人数,以估算从事该职业的总价值。
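The ranking step described above can be sketched as follows. The occupation names and figures are placeholders chosen only to make the example run; they are not O\*NET or Department of Labor data, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Occupation:
    name: str
    employment: int        # number of people employed in the category
    median_salary: float   # median annual salary in USD


def aggregate_value(occupation: Occupation) -> float:
    """Estimate the aggregate value of a job category as headcount x median salary."""
    return occupation.employment * occupation.median_salary


# Placeholder rows, NOT real statistics.
occupations = [
    Occupation("Occupation A", 1_000_000, 50_000.0),
    Occupation("Occupation B", 200_000, 120_000.0),
    Occupation("Occupation C", 3_000_000, 30_000.0),
]

# Categories with the largest estimated aggregate value come first.
ranked = sorted(occupations, key=aggregate_value, reverse=True)
ranked_names = [o.name for o in ranked]
```

A large but modestly paid category can outrank a small, highly paid one under this estimate, which is why both headcount and salary enter the product.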

Based on this, we identified several categories of jobs such as “General and Operations Managers”, “Registered Nurses”, “Software Developers”, and “Financial Managers” that have both a high population and high average salary. Because The Agent Company is designed to be a non-embodied benchmark in the digital domain, we excluded the categories that require extensive physical labor such as “Registered Nurses”, and eventually settled on the setting of a software company, which would allow us to cover tasks from the other categories.

基于此,我们确定了几类既拥有高从业人数又具有高平均薪资的工作,如“总经理和运营经理”、“注册护士”、“软件开发人员”和“财务经理”。由于The Agent Company被设计为数字领域中的非实体基准,我们排除了需要大量体力劳动的类别,如“注册护士”,最终选择了软件公司的设定,这样我们就能涵盖其他类别中的任务。

5.2 CHOOSING TASKS

5.2 选择任务

Next, within this setting, we chose tasks to implement. We attempted to create a diverse set of tasks, but mostly focused on concrete tasks that have well-defined goals and success criteria. These tasks were created through a combination of referencing the $\mathrm{O^{*}NET}$ task list, introspection by paper co-authors who had experience in each task category, and brainstorming lists with language models. It is important to note that in no case have we covered an exhaustive list of all the tasks that are performed in a particular occupational category, and therefore we caution against making any assumptions about whether a particular job may be in danger of full automation based solely on The Agent Company. Rather, it may provide insight into whether certain tasks within jobs may be accelerated or automated, and inform further analysis by labor professionals into this question.

接下来,在此背景下,我们选择了要实施的任务。在此背景下,我们尝试创建多样化的任务,但主要集中在具有明确目标和成功标准的具体任务上。这些任务是通过参考 $\mathrm{O^{*}NET}$ 任务列表、基于有经验的论文合著者的内省以及与语言模型的头脑风暴列表相结合而创建的。需要注意的是,我们并未涵盖特定职业类别中执行的所有任务的广泛列表,因此我们提醒不要仅基于 The Agent Company 做出任何关于某个特定工作是否可能面临完全自动化的假设。相反,它可能提供关于工作中某些任务是否可能被加速或自动化的见解,并为劳动专业人员进一步分析这一问题提供信息。

Figure 3: Overview of OpenHands’ default CodeAct + Browsing agent architecture, the baseline agent used throughout the experiments.

图 3: OpenHands 默认的 CodeAct + 浏览智能体架构概览,这是整个实验中使用的基础智能体。

5.3 MANUAL TASK CURATION

5.3 手动任务整理

Once we set up the environment required for our desired jobs and task categories (§ 3), we return to the curated list, and perform a manual curation process for tasks. For each task, this consists of the following steps: We first create a description of task intent, checkpoints, and how to evaluate each checkpoint. We then identify and import the required data for the task that are currently missing in the company Intranet services and create any necessary data. We then write scripts to configure the required initialization state in the local workspace. Finally, we implement the checkpoint evaluators that calculate the scalar scores for each checkpoint.

一旦我们为所需的工作和任务类别设置了环境(§ 3),我们就会回到策划的任务列表,并对任务进行手动策划。对于每个任务,这包括以下步骤:我们首先创建任务意图的描述、检查点以及如何评估每个检查点。然后,我们识别并导入任务所需的当前在公司内网服务中缺失的数据,并创建任何必要的数据。接着,我们编写脚本来配置本地工作区中所需的初始化状态。最后,我们实现检查点评估器,为每个检查点计算标量分数。
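As a rough illustration of the last step, a checkpoint evaluator can be thought of as a function that inspects the final state and returns the points earned. The sketch below is hypothetical: the `Checkpoint` class, the dictionary used as a stand-in for the workspace state, and the two example checks are assumptions for illustration, not the benchmark's actual evaluator code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Checkpoint:
    description: str
    points: int
    # Returns the points earned given a snapshot of the final state.
    grade: Callable[[Dict[str, str]], int]


def evaluate(checkpoints: List[Checkpoint], state: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Grade every checkpoint and return (description, awarded, possible) tuples."""
    results = []
    for cp in checkpoints:
        # Clamp so a buggy grader can never award more than the checkpoint is worth.
        awarded = max(0, min(cp.points, cp.grade(state)))
        results.append((cp.description, awarded, cp.points))
    return results


checkpoints = [
    Checkpoint("repository cloned", 1, lambda s: 1 if "repo" in s else 0),
    Checkpoint("report uploaded", 2, lambda s: 2 if s.get("report") == "uploaded" else 0),
]

# Hypothetical final state: the repository exists but no report was uploaded.
results = evaluate(checkpoints, {"repo": "risingwave"})
```

Keeping each grader a small, independent function makes it straightforward to write the evaluator tests that the curation process encourages.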

All tasks were created by coauthors of the paper. Overall, it took 20 computer science students, software engineers, and project managers over 2 months, consuming approximately 3,000 person-hours in total. Some of the more complex tasks took more than 10 hours each to design, implement, test, and verify. To ensure quality control of the task creation process, we implemented several checks and verification processes. For each task implementation, we require screenshot proof that the evaluator is valid and that the task can receive a full score when successfully completed. We also encourage including tests for the implemented evaluator programs. Each task contribution is also code reviewed by a panel of lead authors before merging into the benchmark. After all tasks were created, a final round of manual double-checking of the required environment data, evaluator behavior, and checkpoint scoring was performed for every task to ensure quality. Notably, during this process a person who did not curate the tasks checked all the checkpoint score assignments to make sure that the importance scoring is consistent across all tasks and that it correlates reasonably with the relative importance of each checkpoint within its task.

所有任务均由论文的共同作者创建。总体而言,20名计算机科学专业的学生、软件工程师和项目经理花费了超过2个月的时间,总计消耗了约3000人时。一些较为复杂的任务每个任务的设计、实现、测试和验证耗时超过10小时。为了确保任务创建过程的质量控制,我们实施了多项检查和验证流程。对于每个任务的实现,我们要求提供截图证明,证明评估者是有效的,并且任务在成功完成后能够获得满分。我们还鼓励为实现的评估程序编写测试。每个任务的贡献在合并到基准测试之前,还会由一组主要作者进行代码审查。在创建所有任务后,我们进行了最后一轮人工双重检查,确保每个任务所需的环境数据、评估者行为和检查点评分都符合质量要求。值得注意的是,在此过程中,一个未参与任务策划的人员会检查所有检查点的评分分配,以确保所有任务的重要性评分一致,并且与任务中检查点的相对重要性合理相关。

6 BASELINE AGENT

6 基线智能体

To test the current state-of-the-art performance on The Agent Company benchmark, we need agents that can at least perform tasks using a browser, operate a local workspace using a terminal, and write and execute programs to perform most of the tasks. Throughout this paper, we experiment with OpenHands’ main agent (Wang et al., 2024b;a; Song et al., 2024), CodeAct Agent with Browsing. An overview of the agent architecture is illustrated in Figure 3.

为了测试当前在 The Agent Company 基准上的最先进性能,我们需要至少能够使用浏览器执行任务、使用终端操作本地工作空间、并编写和执行程序以完成大部分任务的智能体。在本文中,我们实验了 OpenHands 的主要智能体 (Wang et al., 2024b;a; Song et al., 2024),即带有浏览功能的 CodeAct 智能体。智能体架构的概览如图 3 所示。

Interfaces The agent can interact with the environment through 3 interfaces. (1) A bash shell that connects with the local workspace operating system environment for command execution. (2) A Jupyter IPython server that handles interactive Python (IPython) code execution requests and returns the execution results. (3) A Chromium browser based on Playwright. This interface provides a set of action primitives defined by BrowserGym (ServiceNow; Drouin et al., 2024), such as navigation, clicking, typing, and scrolling. After executing these actions, the browser runtime provides a rich set of observations about the current state of the browser, including HTML, DOM, accessibility tree (Mozilla), screenshot, and opened tabs. These observations can also be augmented with configurable attributes that help agents better understand web pages, such as set-of-marks annotations on the screenshot (Yang et al., 2023; He et al., 2024), visible element marking, the focused element, interactable element marking, and in-viewport element filtering (Zhou et al., 2023).

接口

AI智能体可以通过3种接口与环境进行交互。(1) 一个与本地工作区操作系统环境连接的bash shell,用于执行命令。(2) 一个Jupyter IPython服务器,用于处理交互式Python (IPython) 代码执行请求并返回执行结果。(3) 一个基于Playwright的Chromium浏览器。该提供商提供了一组由BrowserGym (ServiceNow; Drouin et al., 2024) 定义的操作原语,例如导航、点击、输入和滚动。执行这些操作后,浏览器运行时会提供一组丰富的关于浏览器当前状态的观察结果,包括HTML、DOM、可访问性树 (Mozilla)、截图、打开的标签页等。这些观察结果还可以通过可配置的属性进行增强,使智能体能够更好地理解网页观察结果,例如在截图上使用一组标记 (Yang et al., 2023; He et al., 2024)、可见元素标记、聚焦元素、可交互元素标记、视口内元素过滤 (Zhou et al., 2023) 等。

Actions The agent connects with the environment through a core set of general actions. The actions IPythonRunCellAction and CmdRunAction enable the agent to execute arbitrary Python code and bash commands inside the sandbox environment (e.g., a secure isolated Linux operating system used as our local workspace). BrowserInteractiveAction enables interaction with a web browser through a domain-specific language for browsing introduced by BrowserGym (Chezelles et al., 2024; Drouin et al., 2024). These actions provide a comprehensive yet flexible set of primitives, including navigation, clicking, hovering, and typing, that cover most of the tasks performed by human employees of The Agent Company.

动作
AI智能体通过一组核心通用动作与环境进行交互。动作包括 Python 运行单元动作 (Python Run Cell Action) 和 CmdRun 动作 (CmdRun Action),这些动作使得智能体能够在沙盒环境(例如,用作本地工作空间的安全隔离的 Linux 操作系统)中执行任意的 Python 代码和 bash 命令。浏览器交互动作 (Browser Interactive Action) 则允许智能体通过 BrowserGym 引入的特定领域语言与网页浏览器进行交互 (Chezelles et al., 2024; Drouin et al., 2024)。这些动作提供了一套全面且灵活的基础操作,涵盖了 The Agent Company 员工执行的大多数任务,包括导航、点击、悬停和输入等。

Observations Observations describe the environmental changes that the agent observes. The main types of observations used in the CodeAct agent are the execution results of bash terminal commands, Python programs, and browser actions. Specifically, the execution result of a browser action usually consists of a browser snapshot and a textual representation of the current browser viewport in the form of an accessibility tree.

观察描述了智能体观察到的环境变化。CodeAct 智能体中使用的主要观察类型包括 bash 终端命令、Python 程序和浏览器操作的执行结果。具体来说,浏览器操作的执行结果通常是浏览器快照和当前浏览器视口的可访问性树的文本表示。

Workflow At each step, the underlying backbone LLM takes in a prompt consisting of the previous agent history and the current observation of the environment, and generates a response containing the action to execute next. At a higher level, the agent performs the task by executing code: bash commands, Python code, or the browser-specific programming language defined in BrowserGym. This general action space allows the agent to perform various tasks, including editing files, browsing the Web, and running programs.

工作流程
在每个步骤中,底层的大语言模型 (LLM) 会接收由先前的智能体历史记录和当前环境观察结果组成的提示,并生成包含下一步要执行的操作的响应。在更高层次上,智能体可以通过执行代码来完成任务,包括执行 bash 命令、Python 代码或浏览器特定的编程语言(在 BrowserGym 中定义)。这种通用的操作空间使智能体能够执行各种任务,包括编辑文件、浏览网页、运行程序等。
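The step loop described above can be sketched as follows. This is a simplified illustration, not OpenHands' implementation: the `Action` class, the `run_agent` function, the scripted stand-in for the LLM, and the fake executor are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str     # "bash", "python", "browse", or "finish"
    payload: str  # command, code, or browsing instruction


@dataclass
class AgentState:
    history: List[Tuple[str, Action]] = field(default_factory=list)
    steps: int = 0  # number of LLM calls made so far


def run_agent(
    llm: Callable[[List[Tuple[str, Action]], str], Action],
    execute: Callable[[Action], str],
    first_observation: str,
    max_steps: int = 10,
) -> AgentState:
    """Repeatedly ask the LLM for an action and feed back the resulting observation."""
    state = AgentState()
    observation = first_observation
    while state.steps < max_steps:
        action = llm(state.history, observation)  # one LLM call = one step
        state.steps += 1
        if action.kind == "finish":
            break
        state.history.append((observation, action))
        observation = execute(action)             # result becomes the next observation
    return state


def scripted_llm(history, observation):
    """Stand-in for a real model: replays a fixed plan based on history length."""
    plan = [Action("bash", "ls"), Action("python", "print(1 + 1)"), Action("finish", "")]
    return plan[len(history)]


def fake_execute(action: Action) -> str:
    """Stand-in for the sandbox: echoes what would have been executed."""
    return f"ran {action.kind}: {action.payload}"


final_state = run_agent(scripted_llm, fake_execute, "task description")
```

The `steps` counter here corresponds to the "number of steps" metric, since it is incremented once per LLM call.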

7 EXPERIMENTAL RESULTS

7 实验结果

In this section, we evaluate popular foundation models, both closed and open, on The Agent Company benchmark. We use the OpenHands CodeAct agent (§ 6) for all experiments. This serves as a baseline for future development of both the foundation LLMs and the agent infrastructure. Note that since LLM evaluators and NPCs are part of the environment rather than the agent being evaluated, we fix their backbone LLM to Claude-3-5-Sonnet-20241022, which demonstrated the best qualitative accuracy in simulating human colleagues and judging deliverables in preliminary experiments.

在本节中,我们评估了The Agent Company基准测试中的流行基础模型,包括闭源和开源模型。我们使用OpenHands CodeAct智能体(§ 6)进行所有实验,这为未来基础大语言模型和智能体基础设施的开发提供了基线。需要注意的是,由于大语言模型评估者和NPC是环境的一部分,而非被评估的智能体,因此我们将它们的骨干大语言模型固定为Claude-3-5-Sonnet-20241022,该模型在初步实验中展示了在模拟人类同事和判断交付物方面的最佳定性准确性。

7.1 RESULT OVERVIEW

7.1 结果概述

Table 3 shows the evaluation results of both closed and open foundation models on the full evaluation set of The Agent Company (175 tasks). Claude-3.5-Sonnet is the clear winner across all models. However, even this strongest frontier model completes only 24% of the tasks and achieves a score of 34.4% when partial completion credit is taken into account. This result also comes at a cost: it requires an average of almost 30 steps and more than \$6 per task, making it one of the most expensive models to run in both time and cost. This is expected, as most of the tasks in our benchmark are long-horizon. The Gemini 2.0 Flash model, which comes second in terms of capability, requires 40 steps on average, which is time consuming, yet achieves less than half the success rate of the top-performing model. Surprisingly, its cost is less than \$1, making it a very cost-efficient yet relatively strong model. A qualitative examination showed that its high step count was due to instances where the agent got stuck in a loop or aimlessly explored the environment.

表 3 展示了封闭和开放基础模型在 The Agent Company 完整评估集(175 个任务)上的评估结果。我们可以看到,Claude-3.5-Sonnet 在所有模型中表现最为突出。然而,即使是最强大的前沿模型,也只能完成总任务的 24%,并且在考虑部分完成积分的情况下,得分为 34.4%。需要注意的是,这一结果的代价是:每个任务平均需要近 30 步和超过 \$6 的成本,使其成为运行时间和成本上最昂贵的模型。这是可以预见的,因为我们基准测试中的大多数任务都是长期任务。能力排名第二的 Gemini 2.0 Flash 模型平均需要 40 步来完成这些任务,虽然耗时,但其成功率还不到表现最佳模型的一半。令人惊讶的是,它的成本不到 \$1,使其成为一个非常经济高效且相对强大的模型。定性分析表明,这是由于智能体在某些情况下陷入循环或漫无目的地探索环境所致。

Among the open-weight models, Llama 3.1 (405B) achieves the highest performance, nearly on par with OpenAI’s GPT-4o model, though still having a big gap behind the leading Claude 3.5 Sonnet.

在开源权重模型中,Llama 3.1 (405B) 表现最佳,几乎与 OpenAI 的 GPT-4o 模型持平,但仍与领先的 Claude 3.5 Sonnet 有较大差距。

Table 3: Performance comparison of various foundation models on The Agent Company.

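| Model | Success | Score | Steps | Costs |
|---|---|---|---|---|
| **API-based Models** | | | | |
| Claude-3.5-Sonnet | 24.0% | 34.4% | 29.17 | \$6.34 |
| Gemini-2.0-Flash | 11.4% | 19.0% | 39.85 | \$0.79 |
| GPT-4o | 8.6% | 16.7% | 14.55 | \$1.29 |
| Gemini-1.5-Pro | 3.4% | 8.0% | 22.10 | \$6.78 |
| Amazon-Nova-Pro-v1 | 1.7% | 5.7% | 19.59 | \$1.55 |
| **Open-weights Models** | | | | |
| Llama-3.1-405b | 7.4% | 14.1% | 22.95 | \$3.21 |
| Llama-3.3-70b | 6.9% | 12.8% | 20.93 | \$0.93 |
| Qwen-2.5-72b | 5.7% | 11.8% | 23.99 | \$1.53 |
| Llama-3.1-70b | 1.7% | 6.5% | 19.18 | \$0.83 |
| Qwen-2-72b | 1.1% | 4.2% | 23.70 | \$0.28 |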

表 3: 各基础模型在 The Agent Company 上的性能对比

| 模型 | 成功率 | 得分 | 步骤数 | 成本 |
|---|---|---|---|---|
| **基于 API 的模型** | | | | |
| Claude-3.5-Sonnet | 24.0% | 34.4% | 29.17 | \$6.34 |
| Gemini-2.0-Flash | 11.4% | 19.0% | 39.85 | \$0.79 |
| GPT-4o | 8.6% | 16.7% | 14.55 | \$1.29 |
| Gemini-1.5-Pro | 3.4% | 8.0% | 22.10 | \$6.78 |
| Amazon-Nova-Pro-v1 | 1.7% | 5.7% | 19.59 | \$1.55 |
| **开源权重模型** | | | | |
| Llama-3.1-405b | 7.4% | 14.1% | 22.95 | \$3.21 |
| Llama-3.3-70b | 6.9% | 12.8% | 20.93 | \$0.93 |
| Qwen-2.5-72b | 5.7% | 11.8% | 23.99 | \$1.53 |
| Llama-3.1-70b | 1.7% | 6.5% | 19.18 | \$0.83 |
| Qwen-2-72b | 1.1% | 4.2% | 23.70 | \$0.28 |

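Table 4: Performance of the models in tasks that require different platforms in The Agent Company. All numbers are percentages (%).

| Model | GitLab (71 tasks) Success (%) | GitLab (71 tasks) Score (%) | Plane (17 tasks) Success (%) | Plane (17 tasks) Score (%) | RocketChat (79 tasks) Success (%) | RocketChat (79 tasks) Score (%) | ownCloud (70 tasks) Success (%) | ownCloud (70 tasks) Score (%) |
|---|---|---|---|---|---|---|---|---|
| **API-based Models** | | | | | | | | |
| Claude-3.5-Sonnet | 30.99 | 40.25 | 41.18 | 50.37 | 21.52 | 34.68 | 10.00 | 21.81 |
| Gemini-2.0-Flash | 11.27 | 18.21 | 17.65 | 29.84 | 13.92 | 23.34 | 2.86 | 8.52 |
| GPT-4o | 11.27 | 19.46 | 23.53 | 33.68 | 5.06 | 16.08 | 1.43 | 7.76 |
| Gemini-1.5-Pro | 2.82 | 3.88 | 5.88 | 14.05 | 3.80 | 10.97 | 0.00 | 4.22 |
| Amazon-Nova-Pro-v1 | 2.82 | 7.22 | 5.88 | 16.67 | 1.27 | 5.36 | 0.00 | 2.43 |
| **Open-weights Models** | | | | | | | | |
| Llama-3.1-405b | 5.63 | 11.84 | 29.41 | 39.12 | 8.86 | 16.46 | 0.00 | 4.45 |
| Llama-3.3-70b | 8.45 | 14.26 | 11.76 | 21.65 | 5.06 | 12.06 | 0.00 | 3.76 |
| Qwen-2.5-72b | 5.63 | 11.33 | 11.76 | 23.56 | 5.06 | 12.60 | 0.00 | 4.14 |
| Llama-3.1-70b | 1.41 | 6.09 | 5.88 | 15.35 | 2.53 | 8.23 | 0.00 | 3.32 |
| Qwen-2-72b | 1.41 | 1.94 | 5.88 | 12.45 | 0.00 | 4.88 | 0.00 | 2.60 |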

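表 4: The Agent Company 中不同平台任务下的模型性能。所有数字均为百分比 (%)。

| 模型 | GitLab (71 任务) 成功率 (%) | GitLab (71 任务) 得分 (%) | Plane (17 任务) 成功率 (%) | Plane (17 任务) 得分 (%) | RocketChat (79 任务) 成功率 (%) | RocketChat (79 任务) 得分 (%) | ownCloud (70 任务) 成功率 (%) | ownCloud (70 任务) 得分 (%) |
|---|---|---|---|---|---|---|---|---|
| **基于 API 的模型** | | | | | | | | |
| Claude-3.5-Sonnet | 30.99 | 40.25 | 41.18 | 50.37 | 21.52 | 34.68 | 10.00 | 21.81 |
| Gemini-2.0-Flash | 11.27 | 18.21 | 17.65 | 29.84 | 13.92 | 23.34 | 2.86 | 8.52 |
| GPT-4o | 11.27 | 19.46 | 23.53 | 33.68 | 5.06 | 16.08 | 1.43 | 7.76 |
| Gemini-1.5-Pro | 2.82 | 3.88 | 5.88 | 14.05 | 3.80 | 10.97 | 0.00 | 4.22 |
| Amazon-Nova-Pro-v1 | 2.82 | 7.22 | 5.88 | 16.67 | 1.27 | 5.36 | 0.00 | 2.43 |
| **开源权重模型** | | | | | | | | |
| Llama-3.1-405b | 5.63 | 11.84 | 29.41 | 39.12 | 8.86 | 16.46 | 0.00 | 4.45 |
| Llama-3.3-70b | 8.45 | 14.26 | 11.76 | 21.65 | 5.06 | 12.06 | 0.00 | 3.76 |
| Qwen-2.5-72b | 5.63 | 11.33 | 11.76 | 23.56 | 5.06 | 12.60 | 0.00 | 4.14 |
| Llama-3.1-70b | 1.41 | 6.09 | 5.88 | 15.35 | 2.53 | 8.23 | 0.00 | 3.32 |
| Qwen-2-72b | 1.41 | 1.94 | 5.88 | 12.45 | 0.00 | 4.88 | 0.00 | 2.60 |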

Interestingly, comparing the number of steps and costs between the open Llama 3.1 (405B) model and the closed OpenAI GPT-4o model, Llama 3.1 takes more steps and costs nearly 2× more to run, while having a lower success rate than GPT-4o. Anecdotally, our inspection showed that GPT-4o seems to be better at giving up early, saving steps and costs if the task is clearly beyond the capabilities of the agent. This suggests that, once serving cost is considered, open-weight models are not always the most cost-effective choice for agents, especially on highly complex tasks.

有趣的是,比较开源的 Llama 3.1 (405B) 模型和闭源的 OpenAI GPT-4o 模型的步骤数量和成本,Llama 3.1 需要更多的步骤,运行成本几乎是 GPT-4o 的 2 倍,而成功率却低于 GPT-4o。根据我们的观察,GPT-4o 似乎更擅长在任务明显超出智能体能力范围时提前放弃,从而节省步骤和成本。这表明,考虑到服务成本,开源权重模型在智能体中并不总是最具成本效益的选择,尤其是在处理高度复杂的任务时。
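The comparison can be reproduced directly from the Table 3 rows for the two models. The snippet below is only an illustration of the arithmetic; the dictionary layout and variable names are assumptions, while the numbers are copied from Table 3.

```python
# Rows copied from Table 3: success rate (%), average steps, average cost (USD).
table3 = {
    "GPT-4o": {"success": 8.6, "steps": 14.55, "cost": 1.29},
    "Llama-3.1-405b": {"success": 7.4, "steps": 22.95, "cost": 3.21},
}

llama = table3["Llama-3.1-405b"]
gpt4o = table3["GPT-4o"]

# How many times more steps and dollars Llama 3.1 (405B) uses per task.
step_ratio = llama["steps"] / gpt4o["steps"]
cost_ratio = llama["cost"] / gpt4o["cost"]

# Difference in success rate, in percentage points (negative: Llama is lower).
success_gap = llama["success"] - gpt4o["success"]

print(f"steps: {step_ratio:.2f}x, cost: {cost_ratio:.2f}x, success gap: {success_gap:+.1f} pts")
```

Looking at steps and cost side by side helps separate how much of the extra expense comes from longer trajectories and how much from a higher price per step.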

On the other hand, the newer generation of Llama models, Llama 3.3 (70B), achieves a considerably high success rate of 6.9%, on par with the much larger (405B), older-generation Llama 3.1 model. This model also costs significantly less because of its smaller size. This suggests a promising future for LLM development, as smaller and more efficient models begin to catch up in agent performance.

另一方面,新一代的 Llama 模型 Llama 3.3 (70B) 实现了相当高的性能,成功率达到 6.9%,与更大规模(405B)的老一代模型(Llama 3.1)相当。由于规模较小,该模型的成本也显著降低。这表明大语言模型的发展前景广阔,因为更小、更高效的模型在智能体性能方面开始迎头赶上。

7.2 ANALYSIS

7.2 分析

How well do agents operate on different platforms? Table 4 presents a performance breakdown on tasks that involve different platforms in The Agent Company. A task is categorized under a platform if it is one of the platforms that the task requires, so a task that uses several platforms is counted under each of them. From Figure 4a, we can see that most models struggle with RocketChat and ownCloud. The RocketChat platform is where all the social interaction with peers happens, and the low performance on this platform suggests that current-day LLMs still need improvement in communicating with others. The ownCloud platform provides online Office suite functionality, and given the complexity of the UI of web-based Office software, it is expected that current LLMs fail badly on this platform. This suggests that the browsing capability of the agents, especially on more complex websites, still needs improvement. These results underscore the inherent challenges and complexities of performing tasks that occur in real-world work environments, involve social interaction, or require understanding and navigating complex web interfaces.

AI智能体在不同平台上的表现如何?表4展示了The Agent Company中涉及不同平台任务的表现细分。如果任务需要某个平台,则该任务被归类为该平台下的任务。从图4a中可以看出,大多数模型在RocketChat和ownCloud上表现不佳。RocketChat平台是所有与同行社交互动发生的地方,而在此平台上的低表现表明,当前的大语言模型在与人沟通方面仍需改进。ownCloud平台提供在线办公套件功能,由于基于Web的办公软件用户界面的复杂性,当前的大语言模