THE AGENT COMPANY: BENCHMARKING LLM AGENTS ON CONSEQUENTIAL REAL WORLD TASKS
Frank F. Xu1* Yufan Song2* Boxuan Li2* Yuxuan Tang2 Kritanjali Jain1 Mengxue Bao2 Zora Z. Wang1 Xuhui Zhou1 Zhitong Guo1 Murong Cao2 Mingyang Yang2 Hao Yang Lu2 Amaad Martin1 Zhe Su1 Leander Maben1 Raj Mehta1 Wayne Chi1 Lawrence Jang1 Yiqing Xie1 Shuyan Zhou3 Graham Neubig1
1Carnegie Mellon University 2 Independent 3Duke University {fangzhex, gneubig}@cs.cmu.edu, {yufans, boxuanli}@alumni.cmu.edu
ABSTRACT
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper, we introduce The Agent Company, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture on task automation with LM agents – in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
Website https://the-agent-company.com Code https://github.com/TheAgentCompany/TheAgentCompany Evaluations https://github.com/TheAgentCompany/experiments
1 INTRODUCTION
We are in the midst of a technological transformation. With the rapid year-by-year and month-by-month progress brought about by large language models (LLMs), we are seeing AI-based assistance or automation become commonplace in tasks that were unthinkable only a few years ago. In fact, the pace of progress is so fast that some have gone so far as to claim that the majority of human labor may be automatable within the next couple of years (Eloundou et al., 2023; Amodei & Fridman, 2024). On the other hand, others are skeptical, claiming that language models cannot truly reason (Kambhampati et al., 2024), do not generalize well to novel tasks (Chollet et al., 2024), and may only have an impact on a small minority of the labor market (Wittenstein, 2024).
What is the reason for this disconnect? We argue that it is, in part, due to a lack of objective benchmarks that not only demonstrate the power of existing LLM-based agents to accelerate a wide variety of repetitive tasks encountered in everyday workplaces, but also provide appropriate caveats about the tasks that agents cannot do. This is a pressing issue, because the commercial and policy implications of diverse and effective acceleration or automation of work-related tasks will be broad, both positive (e.g. increase of quality of life and accelerated scientific discovery) and negative (e.g. potential displacement or loss of jobs and increase in wealth disparities). In this paper, we take some first steps towards resolving this gap and providing a clearer view of where we are now with respect to acceleration or automation of consequential work-related tasks, and a litmus test for future development in this direction.

Figure 1: An overview of The Agent Company benchmark. It features a reproducible and self-hosted environment, simulated colleagues to test agent communication capabilities, checkpoint and execution-based evaluation, and a set of 175 diverse, realistic and professional tasks in a software engineering company setting.
Concretely, we propose a benchmark, The Agent Company (Figure 1) that estimates the ability of AI agents to perform tasks encountered in everyday workplaces. We create a simulated software development company where agents must perform tasks related to software engineering, project management, financial analysis, and other typical tasks encountered in such business settings. The agents must browse the web, code, and interact with other simulated co-workers to achieve success on the provided tasks. The Agent Company’s environment is based entirely on open-source software and self-hostable for reproducibility purposes, and we create rigorous evaluators that also assign partial credit when the agent gets the answer partially correct.
We perform experiments using seven large language model backbones, including API-based models such as Anthropic Claude (Anthropic, 2023), OpenAI GPT-4o (OpenAI, 2024), Google Gemini (Team et al., 2023), Amazon Nova (Intelligence, 2024), as well as open models including Meta Llama (Dubey et al., 2024) and Alibaba Qwen (Yang et al., 2024). All models are run using the OpenHands agent framework (Wang et al., 2024b), which provides a stable and strong agent harness for both web browsing and coding. As a result of these experiments, we find that the best-performing model, Claude 3.5 Sonnet, was able to autonomously perform 24.0% of the provided tasks to completion, and achieve a score of 34.4% on our metric that provides extra credit for partially completed tasks.
These results present a nuanced picture of the current ability of AI agents to perform tasks. Agents powered by the current gold-standard AI techniques are able to autonomously perform a wide variety of tasks encountered in everyday work. However, they are not close to automating every task encountered in a workspace, even on the subset of tasks presented in The Agent Company, which are well-scoped administrative and coding tasks encountered in a software company’s day-to-day work.
In the rest of this paper, we detail comparisons to other existing benchmarks (§ 2), how we set up realistic and reproducible environments (§ 3), how we define tasks (§ 4) and how we create them (§ 5), our baseline agent (§ 6), experimental results (§ 7), and finally implications and future directions (§ 8).
Table 1: Comparison of different AI agent benchmarks. Interface: the interface the agent has access to (web browser, desktop, API usage, Python script, chat platform, or bash terminal). Supported Tasks: tasks in the benchmark; * indicates tasks with no association with real-world occupations; SWE refers to software engineering, HR is human resources, PM is project manager. Checkpoint-based evaluation: whether tasks are evaluated at intermediate checkpoints and assigned partial scores. Interact with NPC Agents: whether the agent can interact with other NPC agents during task-solving.
| Framework | Diverse Real-world Work | Task Categories | Requires Interaction | Long-Horizon w/ Checkpoints | Interface | Self-Hosted Environment |
|---|---|---|---|---|---|---|
| MiniWob++ (Liu et al., 2018) | × | Browsing* | | | | |
| Mind2Web (Deng et al., 2023) | | Browsing* | × | × | | |
| WebLINX (Lu et al., 2024) | | Browsing* | | | | |
| AssistantBench (Yoran et al., 2024) | × | Browsing* | | | | |
| WebArena (Zhou et al., 2023) | × | Browsing* | × | × | | |
| VisualWebArena (Koh et al., 2024) | × | Browsing* | × | × | | |
| VideoWebArena (Jang et al., 2024) | × | Browsing* | × | × | | |
| WorkArena (Drouin et al., 2024) | | Enterprise Software | | | | |
| OSWorld (Xie et al., 2024) | | Office, Coding | × | | | |
| Windows Agent Arena (Bonatti et al., 2024) | | Browsing*, Office, Coding | × | | | |
| AppWorld (Trivedi et al., 2024) | × | Daily | | | | |
| Gorilla APIBench (Patil et al., 2023) | × | Coding | × | | | |
| τ-bench (Yao et al., 2024) | | Retail, Airline | × | × | | |
| SWE-bench (Jimenez et al., 2024) | × | SWE | × | × | | |
| DevBench (Li et al., 2024) | × | SWE | × | × | × | |
| Smallville (Park et al., 2023) | × | Social* | ✓ | | | |
| Sotopia (Zhou et al., 2024) | × | Social* | × | | | |
| TheAgentCompany | | SWE, HR, Admin, PM, Research, Finance | | | | |
2 BENCHMARK DESIDERATA AND COMPARISON TO OTHER BENCHMARKS
In order to evaluate the ability of agents to perform tasks in complex real-world settings, we built The Agent Company with a number of desiderata in mind. The comparison with several existing prominent agent benchmarks with respect to these desiderata is in Table 1.
Coverage of Multiple Work-related Tasks: In order to make any valid statements about the potential of AI to accelerate or automate various types of real-world work, we should have tasks that are motivated by real-world work across multiple job categories. Many benchmarks are not relevant to real-world work (e.g. MiniWob++ (Liu et al., 2018)) or very relevant to real-world work, but only over a limited scope of tasks (e.g. SWE-Bench (Jimenez et al., 2024)). In contrast, The Agent Company contains a set of more diverse, realistic, and professional tasks that would typically be completed by multiple job roles in a software engineering company.
Requirement for Interaction: If agents are to integrate into real-world workplaces, they will need to communicate with the other human members of the workplace. Most other benchmarks do not measure communication or interactivity, with the exception of $\tau$-bench (Yao et al., 2024), which only measures interaction in customer service scenarios. The Agent Company provides a better testbed for communication as many tasks involve asking and providing information to colleagues as part of a more complex task.
Long-horizon Tasks with Checkpoints: In real-world settings, many tasks require taking many different steps to achieve a higher-level goal. One major novel contribution of The Agent Company is that we both (1) contain tasks that require an agent to perform significantly more consecutive work (i.e. involving more steps and realistically taking human professionals longer to accomplish) than previous benchmarks, and (2) provide granular evaluators that measure the ability of models to perform subtasks of these larger tasks.
Versatile Environment Interface: In order to handle a diversity of tasks in real-world settings, we minimally should be able to interact with the tools that real-world workers use – including web interfaces, programs, command-line terminals, and communication tools. The Agent Company covers all of these interfaces, while most previous benchmarks focus only on one or two.
Self-hosted and Reproducible: In order to allow for careful comparisons between different methods that remain constant over time, the benchmark should be fully self-hosted and reproducible. This contrasts with existing benchmarks that do not have execution environments (e.g. Mind2Web (Deng et al., 2023)) or require the usage of third-party software (e.g. WorkArena (Drouin et al., 2024)).
3 THE AGENT COMPANY ENVIRONMENT SETUP
Our benchmark is set in an imaginary software engineering startup called The Agent Company, hence the benchmark’s name. Within The Agent Company, we create tasks inspired by tasks handled by workers inside such companies. More details about the company’s imaginary background, overview and employees can be found in Appendix A. The benchmark environment contains multiple components.
Local Workspace The local workspace runs locally on the agent's host, which is analogous to a human professional's local workspace, e.g. their work laptop computer. This environment is created as a sandboxed Docker environment to provide a safe execution environment that will not affect other parts of the evaluation machine (Wang et al., 2024b). This environment is where agents work on the task, and within this environment The Agent Company baseline agent (§ 6) uses a browser, code editor and a Linux terminal with typical software preinstalled.
Intranet This part of the environment mimics the company’s internal websites that host code, documents, project management software, and communications software. To achieve our goal of a reproducible, self-contained environment, we follow WebArena (Zhou et al., 2023), in using open-source, self-hostable software to host our environment. The environment mainly contains the following websites:
1. GitLab, an open-source alternative to source-code repositories such as GitHub. This is used for hosting The Agent Company's code repositories and tech-oriented wiki pages. 2. OwnCloud, an open-source alternative to office software such as Google Drive or Microsoft Office. This is used to save and share files, especially for document storage and collaborative editing. 3. Plane, an open-source alternative to task management software such as Jira or Linear. This is used to track issues, run sprint cycles, and manage product roadmaps. 4. RocketChat, an open-source alternative to communication software such as Slack. This is a company-internal real-time messaging tool that facilitates collaboration between employees.
All the websites hosted are reproducible and resettable with mock data inspired by that from a software engineering company. The data inside these company-internal websites is populated with real-world software project data, as well as data manually curated by co-authors who have some experience in the relevant corporate roles.
Simulated Colleague Communication One major aspect of working in a company is communicating with other company members, and in The Agent Company we also test the ability of models to perform this type of communication. Specifically, we allow agents to use RocketChat to message other company members and obtain information that may not be available in the original task description. To create these simulated colleagues, we rely on the Sotopia platform (Zhou et al., 2024), which supports the creation of simulated human characters with LLMs. Each simulated colleague is equipped with a detailed profile that includes their name, role, responsibilities, and project affiliations.
Table 2: Example task intents and checkpoints for three domains.

| Domain | Task Intent | Checkpoints |
|---|---|---|
| SWE | Set up JanusGraph and run it locally with an HTTP endpoint: · Clone the JanusGraph repository under the /workspace folder. · Build the binary file. · Launch the JanusGraph server locally on port 8182 with an HTTP endpoint. | Checkpoint 1 (1 pt): Check if the JanusGraph repo is cloned. Checkpoint 2 (3 pts): Check if the binary file is built (requires skipping Docker in pom.xml, hence higher points). Checkpoint 3 (2 pts): Check the JanusGraph server as an HTTP endpoint. |
| Finance | Navigate to ownCloud at http://the-agent-company.com:8092 and complete Section B (Alternative Simplified Credit) of IRS Form 6765: · Gather necessary information from /Documents/Financials/TAC_financials.csv and /workspace/research_wages.csv. · Consult /Documents/Financials/f6765_instructions.pdf for instructions. · Contact the finance director (David Wong) on Rocket.Chat (http://the-agent-company.com:3000/home) for ambiguous questions. · Save the filled form as /workspace/filled_f6765.pdf. | Checkpoint 1 (5 pts): Check if all 16 questions in Section B of the form have been answered correctly. Checkpoint 2 (3 pts): Check if the correct finance director (David Wong) was contacted for ambiguous questions. |
| PM | Analyze The Agent Company's performance and create a summary in Plane: · Access Plane (http://the-agent-company.com:8091/tac/) and navigate to "Analytics". · Collect metrics: Open Tasks, Backlog Tasks, Unstarted Tasks, Started Tasks, Unassigned Issues, Pending Issues. · Create a summary and share it on Rocket.Chat (http://the-agent-company.com:3000/home) in the #kudos channel. | Checkpoint 1 (1 pt): Check if Plane was accessed and the agent navigated to the "Analytics" section. Checkpoint 2 (3 pts): Check if all required project metrics were collected. Checkpoint 3 (1 pt): Check if the summary was shared in the #kudos channel on Rocket.Chat. |
(e.g., Sarah Johnson, who serves as the CTO, oversees technical strategy planning and R&D team leadership, with access to all technical channels). Agents can interact with these simulated colleagues through direct messages or in specific channels, as is standard in RocketChat and other platforms. By default, all simulated human characters are backed by the Claude-3-5-Sonnet-20241022 LLM across experiments, as we found that it provided the best results during preliminary experiments. For example conversations between the agent and the simulated colleagues drawn from empirical experiments, please refer to Appendix B.
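As an illustration, a simulated colleague's profile could be represented as a small data structure that seeds the backing LLM's system prompt. This is a hypothetical sketch, not the Sotopia API: the class name, field names, and prompt wording are our own, and only the profile details for Sarah Johnson come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class ColleagueProfile:
    """Minimal profile for a simulated colleague (illustrative only)."""
    name: str
    role: str
    responsibilities: str
    channels: list = field(default_factory=list)

    def system_prompt(self) -> str:
        # The prompt that would condition the NPC's replies on RocketChat.
        return (
            f"You are {self.name}, the {self.role} at The Agent Company. "
            f"Your responsibilities: {self.responsibilities}. "
            "Answer coworkers' messages on RocketChat in character."
        )

cto = ColleagueProfile(
    name="Sarah Johnson",
    role="CTO",
    responsibilities="technical strategy planning and R&D team leadership",
    channels=["all technical channels"],
)
```

In the benchmark, a prompt like this would be sent to Claude-3-5-Sonnet-20241022 together with the incoming message to generate the colleague's reply.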
4 TASK STRUCTURE
The tasks in The Agent Company include a task intent, a list of checkpoints that the agent must achieve, a programmatic evaluator to check success on these checkpoints, and code to initialize and finalize the environment. We show some examples in Table 2, and describe each aspect in detail below.
Task Intent Each task begins with an English description, simulating how a user would instruct an LLM-based agent to perform a real-world task. In general, we aim for these tasks to be clear enough so that a human worker would be able to complete the task without asking for further instructions directly from the user (although they may need to ask questions of their other co-workers).
Checkpoints Tasks are divided into checkpoints representing intermediate milestones, each assigned a point value to measure progress. Each checkpoint is awarded a certain number of points based on its significance to the overall completion of the task. Checkpoints are written in English, and typically specify one or more of the following:
• Action Completion: Verifying whether required actions, such as using tools, navigating to URLs, or collecting data, were carried out successfully.
• Data Accuracy: Evaluating the correctness and completeness of the output, such as extracted data or formatted documents.
• Collaboration: Assessing interactions with simulated colleagues or sharing of output, such as posting messages or asking for additional information to complete the task.
Evaluators Checkpoints are created in the task design phase, but for actual evaluation, each of the checkpoints must be concretely implemented through an evaluator – a program that checks the completion of the checkpoint. These evaluators are implemented by examining environment states, such as the local workspace, intranet status, simulated colleague interactions, or by analyzing agent trajectories, like verifying browsing history or action sequences.
In most cases, these evaluators are deterministic and written as simple Python functions. For instance, in the SWE task in Table 2, the checkpoints are deterministic: verifying if the JanusGraph repository is cloned, the binary file is built, and the server is launched with an HTTP endpoint. However, for tasks with more complex and unstructured deliverables, deterministic evaluation can be challenging due to subjectivity and variability; for example, the last checkpoint in the Finance task in Table 2 requires contacting the correct finance director (David Wong) to resolve ambiguous questions, which involves a judgment from a (simulated) human colleague. In such cases, we employ LLM-based evaluation. This involves prompting LLMs with predefined rubrics or reference outputs to assess the agent's deliverables, enabling a more nuanced and flexible evaluation of these tasks. As with the NPC backbone, all LLM-based evaluators are backed by Claude-3-5-Sonnet-20241022.
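To make the idea concrete, here is a minimal sketch of what deterministic evaluators for the SWE task in Table 2 might look like. The function names, the repository path, and the port are our own assumptions, not the benchmark's actual code; a real evaluator would inspect the agent's sandboxed workspace.

```python
import os
import urllib.request

def check_repo_cloned(path="/workspace/janusgraph") -> bool:
    """Checkpoint 1 (1 pt): the repository directory exists and is a git clone."""
    return os.path.isdir(os.path.join(path, ".git"))

def check_server_running(url="http://localhost:8182") -> bool:
    """Checkpoint 3 (2 pts): the JanusGraph server answers over HTTP."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except OSError:  # covers URLError, timeouts, connection refused
        return False

def grade(checkpoints) -> int:
    """Sum the points of checkpoints whose evaluator returned True."""
    return sum(points for passed, points in checkpoints if passed)
```

For example, an agent that cloned the repository and launched the server but failed the build would receive `grade([(True, 1), (False, 3), (True, 2)])`, i.e. 3 of 6 points.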
4.1 EVALUATION METRICS
Due to our checkpoint-based evaluation scheme and the need for showcasing both the progress of the agent’s capability improvement as well as the eventual goal completion ability, we calculate two scalar agent capability metrics and two efficiency metrics.
Full completion score We define the full completion score $S_{\mathrm{full}}$ as:
$$
S_{\mathrm{full}}=\begin{cases}1 & \text{if all checkpoints are successfully passed,}\\ 0 & \text{otherwise.}\end{cases}
$$
This binary metric evaluates whether the agent successfully completed the task by passing all checkpoints.
Partial completion score To provide a more nuanced measure that rewards partial task completion while strongly incentivizing full task completion, we define the partial completion score $S_{\mathrm{partial}}$ as:
$$
S_{\mathrm{partial}}=0.5\cdot\frac{\mathrm{Result}}{\mathrm{Total}}+0.5\cdot S_{\mathrm{full}},
$$

where Result is the number of checkpoint points the agent earned and Total is the maximum number of points achievable across all checkpoints.
This formulation ensures that agents are awarded partial credit in proportion to the points achieved, reflecting their progress toward task completion. At the same time, full task completion is strongly incentivized by incorporating an additional 50% credit, which is awarded only when all checkpoints are successfully completed. This design ensures that agents achieving partial progress receive scores scaled linearly with their performance, while those reaching 100% completion are distinctly rewarded to emphasize the importance of achieving the end goal.
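The definition of $S_{\mathrm{partial}}$ can be expressed in a few lines of code. The sketch below takes per-checkpoint (passed, points) pairs; the function name and input shape are illustrative, not the benchmark's implementation.

```python
def partial_completion_score(results):
    """Compute S_partial = 0.5 * (Result / Total) + 0.5 * S_full.

    results: list of (passed: bool, points: int), one entry per checkpoint.
    """
    total = sum(points for _, points in results)            # Total
    earned = sum(points for passed, points in results if passed)  # Result
    s_full = 1.0 if all(passed for passed, _ in results) else 0.0
    return 0.5 * earned / total + 0.5 * s_full
```

Note the discontinuity this creates: an agent passing a 1-point checkpoint but failing a 3-point one scores 0.125, while passing everything jumps the score to 1.0 thanks to the full-completion bonus.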


Figure 2: Example The Agent Company workflow illustrating an agent managing a sprint for the RisingWave project. The task involves identifying and moving unfinished issues to the next sprint cycle, notifying assignees of those issues, running a code coverage script, uploading a summarized report to OwnCloud, and incorporating feedback on the report from a simulated project manager.
Number of steps The number of steps is defined as the total number of LLM calls made during the task execution. This metric quantifies the operational effort required to perform the task.
Cost per instance The cost per instance measures the monetary cost of querying the underlying LLM through its API to complete a task. Assuming no prompt caching, the cost is calculated by summing, over all LLM calls made during the task, the number of prompt tokens multiplied by the per-token input price plus the number of completion tokens multiplied by the per-token output price.
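Under the no-caching assumption, this cost is a straightforward sum over LLM calls. A minimal sketch (the function name, input shape, and the token prices in the usage note are hypothetical, not any provider's actual pricing):

```python
def cost_per_instance(calls, input_price_per_token, output_price_per_token):
    """Total API cost for one task, with no prompt caching.

    calls: list of (prompt_tokens, completion_tokens), one per LLM call.
    Prices are in USD per token.
    """
    return sum(
        prompt * input_price_per_token + completion * output_price_per_token
        for prompt, completion in calls
    )
```

For instance, with illustrative prices of $3 and $15 per million tokens, two calls of (1000, 100) and (2000, 200) tokens would cost $0.0135.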
4.2 WORKFLOW
Each task typically follows a workflow involving the following stages:
- Initialization: The agent sets up its workspace and prepares to execute the task.
- Execution: The agent completes subtasks, such as navigating tools, collecting data, or processing information; if required by the task, the agent interacts with simulated colleagues or shares results via communication platforms.
- Finalization: The agent produces and submits the final output for evaluation.
Example Task We consider a task designed to evaluate an agent’s ability to perform realistic project management workflows using multiple tools and services hosted in the benchmark. The task involves managing a sprint for the RisingWave project, requiring the agent to execute interdependent steps such as sprint issue management, team communication, repository operations, and report generation while incorporating feedback from a simulated project manager.
The workflow as illustrated in Figure 2 begins with the agent identifying unfinished issues in the current sprint on Plane and updating their sprint assignments. This step is worth 2 points and is fully completed, earning the agent the maximum score of 2/2. Next, the agent successfully notifies the relevant assignees using Rocket.Chat regarding their pending tasks and earns 1/1 point.
The agent then proceeds to clone the RisingWave repository from GitLab and execute a Python script in the terminal to calculate updated code coverage. This step, worth 2 points, is only partially completed, as the agent successfully clones the repository but fails to run code coverage. As a result, the agent earns 1/2 points for this checkpoint. The subsequent steps, generating and sharing the sprint summary report on OwnCloud and incorporating feedback from a simulated project manager, are not completed, resulting in 0/2 and 0/1 scores, respectively. Notably, the checkpoints can also fail if the report does not meet quality standards as assessed by the LLM-based evaluator, which evaluates the report for clarity, completeness, and successful incorporation of feedback. This ensures that the assessment reflects both the generation of outputs and their qualitative relevance to the task.
Finally, the overall score is calculated using the partial completion formula defined in § 4.1, where the total possible points are 8 and the awarded points sum to 4. Substituting these values, the agent achieves a final score of 0.25 (25%). Our scoring mechanism thus rewards incremental progress while strongly incentivizing full completion.
This example represents a typical task in The Agent Company benchmark, where agents are required to handle complex workflows involving multiple tools and interdependent steps. By evaluating both partial progress and overall outcomes, our benchmark provides a rigorous and realistic measure of agent performance, allowing us to identify agents' strengths and pinpoint areas for improvement in task execution.
5 TASK CREATION
5.1 CHOOSING TASK CATEGORIES
Many previous agent benchmarks discussed in § 2 were created to evaluate agents on tasks people perform in daily life (Zhou et al., 2023; Lù et al., 2024; Deng et al., 2023), or tasks that accomplish digital chores (Yoran et al., 2024; Trivedi et al., 2024). Obtaining realistic tasks for a benchmark poses challenges. Some benchmarks (Xie et al., 2024; Drouin et al., 2024; Yoran et al., 2024) crowdsourced tasks based on predetermined interfaces, platforms, and services available to the agent. They also adopted a strategy of first gathering task templates and then instantiating more task instances by filling in the variables. Other benchmarks (Zhou et al., 2023; Koh et al., 2024; Bonatti et al., 2024) took a semi-systematic approach of reviewing the action history of the research team and choosing tasks that reflected the types of tasks the researchers carried out in their daily lives. There are several obvious issues with this if we want to evaluate agents with broader implications in The Agent Company benchmark. Despite some grounding in realistic data, the process of creating tasks from these data was susceptible to heuristic choices, and no consideration was made for how important or time-consuming the tasks are. The tasks are biased towards those important for academics in computer science and do not reflect the tasks performed by the entire population.
In The Agent Company, we attempt to cover a wide variety of tasks motivated by real-world work. While it is highly challenging to create a representative sample of tasks, fortunately we can rely on existing resources created for other purposes as a reference. Specifically, we start by referencing the 29.1 release of the O*NET database (O*NET, 2024; Rounds et al., 1999), a database of jobs performed by workers in the US created by the US Department of Labor. It also contains information about the tasks performed within the context of each job, the abilities required to perform each task, whether a task is a major or minor task for that job category, and other pieces of relevant information.
Based on this data, we first identified a few occupation categories to focus on. First, based on statistics from O*NET, we identified job categories that have a large number of people performing the job. Then, we used median salary information for each of these job categories from the US Bureau of Labor Statistics and multiplied it by the number of employees in that category to estimate the aggregate value of performing the job.
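The prioritization heuristic above amounts to a simple product per occupation. A sketch follows; the occupation names match § 5.1, but the employment and wage figures are illustrative placeholders, not actual O*NET or BLS statistics:

```python
# Hypothetical figures for illustration only -- not actual BLS/O*NET data.
occupations = {
    "General and Operations Managers": (3_500_000, 100_000),  # (employees, median salary in $)
    "Registered Nurses": (3_200_000, 85_000),
    "Software Developers": (1_900_000, 130_000),
    "Financial Managers": (800_000, 140_000),
}

# Aggregate value of an occupation = number of employees x median salary.
ranked = sorted(
    ((name, n * salary) for name, (n, salary) in occupations.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: ${value:,}")
```

High-ranking occupations by this product then become candidates, subject to the digital-only filter described next.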
Based on this, we identified several categories of jobs such as “General and Operations Managers”, “Registered Nurses”, “Software Developers”, and “Financial Managers” that have both a high population and high average salary. Because The Agent Company is designed to be a non-embodied benchmark in the digital domain, we excluded the categories that require extensive physical labor such as “Registered Nurses”, and eventually settled on the setting of a software company, which would allow us to cover tasks from the other categories.
5.2 CHOOSING TASKS
Next, within this setting we chose tasks to implement. In this setting, we attempted to create a diversity of tasks, but mostly focused on concrete tasks that have well-defined goals and success criteria. These tasks were created through a combination of referencing the $\mathrm{O^{*}N E T}$ task list, introspection based on paper co-authors who had experience in each task category, and brainstorming lists with language models. It is important to note that in no cases have we covered an extensive list of all the tasks that are performed in a particular occupational category, and therefore we caution against making any assumptions about whether a particular job may be in danger of full automation based solely on

Figure 3: Overview of OpenHands' default CodeAct + Browsing agent architecture, the baseline agent used throughout the experiments.
The Agent Company. Rather, it may provide insight into whether certain tasks within jobs may be accelerated or automated, and inform further analysis by labor professionals into this question.
5.3 MANUAL TASK CURATION
Once we set up the environment required for our desired jobs and task categories (§ 3), we return to the curated list, and perform a manual curation process for tasks. For each task, this consists of the following steps: We first create a description of task intent, checkpoints, and how to evaluate each checkpoint. We then identify and import the required data for the task that are currently missing in the company Intranet services and create any necessary data. We then write scripts to configure the required initialization state in the local workspace. Finally, we implement the checkpoint evaluators that calculate the scalar scores for each checkpoint.
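Concretely, a checkpoint evaluator is a script that inspects the environment state and returns a scalar score. A minimal sketch of the pattern follows; the deliverable file name and point values are hypothetical, not taken from an actual benchmark task:

```python
from pathlib import Path

def evaluate_checkpoint(workspace: str = "/workspace") -> int:
    """Award 2 points if the agent produced a non-empty report file,
    1 point if the file exists but is empty, and 0 otherwise."""
    report = Path(workspace) / "sprint_summary.md"  # hypothetical deliverable
    if not report.exists():
        return 0
    return 2 if report.stat().st_size > 0 else 1
```

Real evaluators can be considerably more involved, e.g. querying service APIs for issue states or delegating quality judgments to an LLM evaluator, but each ultimately reduces its checkpoint to a scalar score like this one.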
All tasks were created by coauthors of the paper. Overall, it took 20 computer science students, software engineers, and project managers over 2 months, consuming approximately 3,000 person-hours in total. Some of the more complex tasks took more than 10 hours each to design, implement, test, and verify. To ensure quality control of the task creation process, we implemented several checks and verification processes. For each task implementation, we require screenshot proof that the evaluator is valid and that the task is able to get a full score when successfully completed. We also encourage including tests for the implemented evaluator programs. Each task contribution is also code-reviewed by a panel of lead authors before merging into the benchmark. After creating all tasks, a final round of manual double-checking of the required environment data, evaluator behavior, and checkpoint scoring for every task is performed to ensure quality. Notably, during this process a person who had not curated the tasks checks all the checkpoint score assignments to make sure that the importance scoring is consistent over all the tasks and that it correlates reasonably with the relative importance of each checkpoint within its task.
6 BASELINE AGENT
To test the current state-of-the-art performance on the The Agent Company benchmark, we need agents that can at least perform tasks using a browser, operate a local workspace using a terminal, and write and execute programs to perform most of the tasks. Throughout this paper, we experiment with OpenHands’ main agent (Wang et al., 2024b;a; Song et al., 2024), CodeAct Agent with Browsing.8 An overview of the agent architecture is illustrated in Figure 3.
Interfaces The agent can interact with the environment through 3 interfaces. (1) A bash shell that connects with the local workspace operating system environment for command execution. (2) A Jupyter IPython server to handle interactive Python (IPython) code execution requests and return the execution results. (3) A Chromium browser based on Playwright. The browser provides a set of action primitives defined by BrowserGym (ServiceNow; Drouin et al., 2024), such as navigation, clicking, typing, and scrolling. After executing these actions, the browser runtime provides a rich set of observations about the current state of the browser, including HTML, DOM, accessibility tree (Mozilla), screenshot, opened tabs, etc. These observations can also be augmented with configurable attributes that allow agents to better understand web page observations, such as using a set-of-marks on the screenshot (Yang et al., 2023; He et al., 2024), visible element marking, focused element, interactable element marking, in-viewport element filtering (Zhou et al., 2023), etc.
Actions The agent connects with the environment through a core set of general actions. The IPythonRunCellAction and CmdRunAction actions enable the agent to execute arbitrary Python code and bash commands inside the sandbox environment (e.g., a secure isolated Linux operating system used as our local workspace). BrowseInteractiveAction enables interaction with a web browser using a domain-specific language for browsing introduced by BrowserGym (Chezelles et al., 2024; Drouin et al., 2024). These actions provide a comprehensive yet flexible set of primitives that cover most of the tasks performed by human employees of The Agent Company, including navigating, clicking, hovering, typing, etc.
Observations Observations describe the environmental changes that the agent observes. The main types of observations used in the CodeAct agent are the execution results of bash terminal commands, Python programs, and browser actions. Specifically, the execution result of a browser action is usually a browser snapshot along with a textual representation of the current browser viewport in the form of an accessibility tree.
Workflow At each step, the underlying backbone LLM will take in prompts consisting of previous agent history and the current observation of the environment, and generate a response consisting of the action to execute next. On a higher level, the agent can perform the task by executing code, including executing bash commands, Python code, or browser-specific programming language (defined in BrowserGym).9 This general action space allows the agent to perform various tasks, including editing files, browsing the Web, running programs, etc.
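The prompt-act-observe loop described above can be sketched as follows. This is a simplified illustration, not the actual OpenHands implementation (which adds history condensation, error handling, and typed actions/observations); `policy` stands in for the backbone LLM call, and the toy policy and environment are ours:

```python
def run_agent(policy, env, task, max_steps=50):
    """Minimal prompt -> action -> observation loop.
    `policy` maps (task, history) to an action string; `env` executes it."""
    history = []
    for _ in range(max_steps):
        action = policy(task, history)  # one LLM call == one "step"
        if action == "finish":
            break
        observation = env(action)       # bash / IPython / browser result
        history.append((action, observation))
    return history

# Toy policy and environment to show the control flow:
def toy_policy(task, history):
    return "finish" if history else "cat report.txt"

def toy_env(action):
    return f"ran: {action}"

trace = run_agent(toy_policy, toy_env, task="summarize report")
print(trace)  # [('cat report.txt', 'ran: cat report.txt')]
```

The `max_steps` cap mirrors the step budget used in the experiments; the "number of steps" metric in § 4.1 simply counts iterations of this loop.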
7 EXPERIMENTAL RESULTS
In this section, we evaluate popular foundation models, both closed and open, on The Agent Company benchmark. We use the OpenHands CodeAct agent (§ 6) for all experiments. This serves as a baseline for future development of both the foundation LLMs and the agent infrastructure. Note that since the LLM evaluators and NPCs are part of the environment rather than the agent being evaluated, we fix their backbone LLM to Claude-3-5-Sonnet-20241022, which demonstrated the best qualitative accuracy in simulating human colleagues and judging deliverables in preliminary experiments.
7.1 RESULT OVERVIEW
Table 3 shows the evaluation results of both closed and open foundation models on the full evaluation set of The Agent Company (175 tasks). We can see that Claude-3.5-Sonnet is the clear winner across all models. However, even this strongest frontier model only manages to complete 24% of the total tasks, achieving a score of 34.4% when partial completion credits are taken into account. Note that this result comes at a cost: it requires an average of almost 30 steps and more than \$6 to complete each task, making it the most expensive model to run in both time and cost. This is expected, as most of the tasks in our benchmark are long-horizon in nature. The Gemini 2.0 Flash model, which comes second in terms of capability, requires 40 steps on average to complete the tasks, which is time-consuming, yet achieves less than half the success rate of the top-performing model. Surprisingly, its cost is less than \$1, making it a very cost-efficient yet relatively strong model. A qualitative examination showed that its high step count was due to instances where the agent got stuck in a loop or aimlessly explored the environment.
Among the open-weight models, Llama 3.1 (405B) achieves the highest performance, nearly on par with OpenAI’s GPT-4o model, though still having a big gap behind the leading Claude 3.5 Sonnet.
Table 3: Performance comparison of various foundation models on The Agent Company.
| Model | Success | Score | Steps | Costs |
|---|---|---|---|---|
| API-based Models | | | | |
| Claude-3.5-Sonnet | 24.0% | 34.4% | 29.17 | \$6.34 |
| Gemini-2.0-Flash | 11.4% | 19.0% | 39.85 | \$0.79 |
| GPT-4o | 8.6% | 16.7% | 14.55 | \$1.29 |
| Gemini-1.5-Pro | 3.4% | 8.0% | 22.10 | \$6.78 |
| Amazon-Nova-Pro-v1 | 1.7% | 5.7% | 19.59 | \$1.55 |
| Open-weights Models | | | | |
| Llama-3.1-405b | 7.4% | 14.1% | 22.95 | \$3.21 |
| Llama-3.3-70b | 6.9% | 12.8% | 20.93 | \$0.93 |
| Qwen-2.5-72b | 5.7% | 11.8% | 23.99 | \$1.53 |
| Llama-3.1-70b | 1.7% | 6.5% | 19.18 | \$0.83 |
| Qwen-2-72b | 1.1% | 4.2% | 23.70 | \$0.28 |
Table 4: Performance of the models on tasks that require different platforms in The Agent Company: GitLab (71 tasks), Plane (17 tasks), RocketChat (79 tasks), ownCloud (70 tasks). All numbers are percentages (%).
| Model | GitLab Success | GitLab Score | Plane Success | Plane Score | RocketChat Success | RocketChat Score | ownCloud Success | ownCloud Score |
|---|---|---|---|---|---|---|---|---|
| API-based Models | | | | | | | | |
| Claude-3.5-Sonnet | 30.99 | 40.25 | 41.18 | 50.37 | 21.52 | 34.68 | 10.00 | 21.81 |
| Gemini-2.0-Flash | 11.27 | 18.21 | 17.65 | 29.84 | 13.92 | 23.34 | 2.86 | 8.52 |
| GPT-4o | 11.27 | 19.46 | 23.53 | 33.68 | 5.06 | 16.08 | 1.43 | 7.76 |
| Gemini-1.5-Pro | 2.82 | 3.88 | 5.88 | 14.05 | 3.80 | 10.97 | 0.00 | 4.22 |
| Amazon-Nova-Pro-v1 | 2.82 | 7.22 | 5.88 | 16.67 | 1.27 | 5.36 | 0.00 | 2.43 |
| Open-weights Models | | | | | | | | |
| Llama-3.1-405b | 5.63 | 11.84 | 29.41 | 39.12 | 8.86 | 16.46 | 0.00 | 4.45 |
| Llama-3.3-70b | 8.45 | 14.26 | 11.76 | 21.65 | 5.06 | 12.06 | 0.00 | 3.76 |
| Qwen-2.5-72b | 5.63 | 11.33 | 11.76 | 23.56 | 5.06 | 12.60 | 0.00 | 4.14 |
| Llama-3.1-70b | 1.41 | 6.09 | 5.88 | 15.35 | 2.53 | 8.23 | 0.00 | 3.32 |
| Qwen-2-72b | 1.41 | 1.94 | 5.88 | 12.45 | 0.00 | 4.88 | 0.00 | 2.60 |
Interestingly, comparing the number of steps and costs between the open Llama 3.1 (405B) model and the closed OpenAI GPT-4o model, Llama 3.1 takes more steps and costs nearly 2x more to run, while having a lower success rate than GPT-4o. Anecdotally, our inspection showed that GPT-4o seems to be better at giving up early, saving steps and costs when a task is clearly beyond the agent's capacity. This suggests that open-weights models are not always the most cost-effective choice for agents given serving costs, especially on highly complex tasks.
On the other hand, the newer-generation Llama model, Llama 3.3 (70B), achieves a considerably high 6.9% success rate, on par with the much larger (405B) older-generation Llama 3.1 model. This model also costs significantly less because of its smaller size. This suggests a promising future for LLM development, as smaller and more efficient models begin to catch up in agent performance.
7.2 ANALYSIS
How well do agents operate on different platforms? Table 4 presents a performance breakdown on tasks that involve different platforms in The Agent Company. A task is categorized under a platform if the task requires that platform. From Figure 4a, we can see that most models struggle with RocketChat and ownCloud. The RocketChat platform is where all social interaction with peers happens, and the low performance on this platform suggests that current-day LLMs still need improvement in communicating with others. The ownCloud platform provides online Office suite functionality, and due to the complexity of the UI of web-based Office software, it is expected that current LLMs fail badly on this platform. This suggests that the browsing capability of the agents, especially on more complex websites, still needs improvement. These results underscore the inherent challenges and complexities of performing tasks that occur in real-world work environments, involve social interaction, or require understanding and navigating complex web interfaces.
Table 5: Performance of various models on tasks of different natures in The Agent Company: SDE (69 tasks), PM (28 tasks), DS (14 tasks), Admin (15 tasks), HR (29 tasks), Finance (12 tasks), Other (8 tasks). All numbers are percentages (%).
| Model | SDE Success | SDE Score | PM Success | PM Score | DS Success | DS Score | Admin Success | Admin Score | HR Success | HR Score | Finance Success | Finance Score | Other Success | Other Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| API-based Models | | | | | | | | | | | | | | |
| Claude-3.5-Sonnet | 30.43 | 38.02 | 35.71 | 51.31 | 14.29 | 21.70 | 0.00 | 11.59 | 24.14 | 34.49 | 8.33 | 25.17 | 12.50 | 22.40 |
| Gemini-2.0-Flash | 13.04 | 18.99 | 17.86 | 31.71 | 0.00 | 6.49 | 6.67 | 15.20 | 17.24 | 23.08 | 0.00 | 4.31 | 0.00 | 10.05 |
| GPT-4o | 13.04 | 19.18 | 17.86 | 32.27 | 0.00 | 4.70 | 6.67 | 13.89 | 0.00 | 8.28 | 0.00 | 7.36 | 0.00 | 10.78 |
| Gemini-1.5-Pro | 4.35 | 5.64 | 3.57 | 13.19 | 0.00 | 4.82 | 6.67 | 9.92 | 3.45 | 11.42 | 0.00 | 2.78 | 0.00 | 8.07 |
| Amazon-Nova-Pro-v1 | 2.90 | 6.07 | 3.57 | 12.54 | 0.00 | 3.27 | 0.00 | 0.00 | 0.00 | 4.27 | 0.00 | 2.78 | 0.00 | 2.86 |
| Open-weights Models | | | | | | | | | | | | | | |
| Llama-3.1-405b | 5.80 | 11.33 | 21.43 | 35.62 | 0.00 | 5.42 | 0.00 | 3.33 | 6.90 | 12.56 | 0.00 | 5.00 | 12.50 | 17.45 |
| Llama-3.3-70b | 11.59 | 16.49 | 7.14 | 19.83 | 0.00 | 4.70 | 0.00 | 1.67 | 6.90 | 11.38 | 0.00 | 5.69 | 0.00 | 7.03 |
| Qwen-2.5-72b | 7.25 | 11.99 | 10.71 | 22.90 | 0.00 | 5.42 | 0.00 | 2.14 | 6.90 | 12.36 | 0.00 | 7.15 | 0.00 | 5.99 |
| Llama-3.1-70b | 1.45 | 4.77 | 3.57 | 15.16 | 0.00 | 5.42 | 0.00 | 2.42 | 3.45 | 7.19 | 0.00 | 3.82 | 0.00 | 2.86 |
| Qwen-2-72b | 2.90 | 3.68 | 0.00 | 7.44 | 0.00 | 4.70 | 0.00 | 0.56 | 0.00 | 4.14 | 0.00 | 3.61 | 0.00 | 4.95 |

Figure 4: Comparing agent success rate across platforms (left) and task categories (right).
How well do agents perform on different types of tasks? Table 5 presents a performance breakdown on different types of tasks in The Agent Company. According to the nature of each task, i.e. which kind of professional is usually assigned the task, the tasks in The Agent Company can be categorized into several job departments: Software Development Engineering (SDE), Project Management (PM), Data Science (DS), Administrative (Admin), Human Resources (HR), Financial (Finance), and all remaining tasks (Other).
From the success rates shown in Figure 4b, we can see that data science, administrative, and finance tasks have among the lowest success rates, with many LLMs completing none of these tasks successfully, and even the strongest Claude model achieving much lower scores than on the rest of the tasks. On the other hand, software engineering tasks, which may seem much harder to many humans, result in a higher success rate. This suggests that there exists a gap between the perceived difficulty of tasks for humans versus their difficulty for LLM agents.
For example, some tasks in the administrative and finance categories involve making spreadsheets, collecting and filling in a lot of information from various people, or reading and understanding images scanned by employees. In terms of professional skill sets, these tasks are arguably easier conceptually for humans than software engineering, as SDE jobs usually have a higher barrier of entry and more knowledge prerequisites. Nevertheless, most LLMs achieve a much higher score on the SDE tasks, failing these seemingly easier tasks due to a lack of ability to understand documents, communicate with other people, navigate complex software and tedious processes, and autonomously automate repetitive tasks. We hypothesize that part of the reason lies in the fact that current LLM development is heavily focused on software engineering abilities such as coding, due to several high-profile benchmarks that measure this capability (e.g. HumanEval, SWE-Bench) as well as the abundance of publicly available training data related to software. Administrative and financial tasks, on the other hand, usually involve private data within companies that is not readily available for training LLMs.
7.3 COMMON AGENT FAILURES
Overall, agent performance on The Agent Company is still low, and a majority of tasks are failed. Among these failures, we identify some common and interesting agent mistakes that are often surprising because humans rarely make them.
Lack of commonsense Some tasks fail because the agent lacks the common sense and domain background knowledge required to infer implicit assumptions. For example, one task asked the agent to "Write the responses to /workspace/answer.docx" but did not explicitly state that this is a Microsoft Word file. A human can infer this requirement from the file extension. The agent instead treated it as a plain text file, writing text directly to the file, resulting in a task failure.
Lack of social skills Sometimes, the agent fails to understand the implications and goals in the social conversations with colleagues in The Agent Company. For example, one task involves asking Alex for help, and the agent first successfully asks the right question “Could you tell me who I should introduce myself to next on the team?” Then the simulated colleague Alex replied “You should introduce yourself to Chen Xinyi next. She’s on our frontend team and would be a great person to connect with!” At this point, a human would then talk to Chen Xinyi, but instead the agent then decides to not follow up with her, and prematurely considers the task accomplished.
Incompetence in browsing Oftentimes, the biggest obstacle in a task is the part that requires browsing the Web. This is expected, as browsing remains hard for agents given the complexity of modern web UIs and the numerous distractions on a webpage. For example, on many tasks that involve ownCloud, a closable popup that occasionally appears, asking the user to download the mobile apps for a better experience, becomes an obstacle. A human can simply click the “x” to close the popup, while agents get stuck. Similarly, when trying to download a file from ownCloud, there are several popups to click through before the actual download, and each step is error-prone for agents due to the complex UI.
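The missing step can be made explicit. The sketch below operates on a simplified accessibility-tree dump represented as a list of dicts (the `id`/`role`/`name` fields are illustrative, not ownCloud's actual markup): before clicking the intended target, it emits clicks for any dismissible close buttons, which is exactly the step agents routinely skip.

```python
CLOSE_LABELS = {"x", "×", "close", "dismiss"}

def plan_clicks(elements, target_id):
    """Dismiss blocking popups first, then click the intended target.

    `elements` is a simplified accessibility-tree dump: a list of dicts
    with `id`, `role`, and `name` keys (illustrative field names).
    """
    actions = []
    for el in elements:
        if el["role"] == "button" and el["name"].strip().lower() in CLOSE_LABELS:
            actions.append(("click", el["id"]))  # close the popup first
    actions.append(("click", target_id))  # then perform the real action
    return actions

page = [
    {"id": 1, "role": "button", "name": "×"},         # mobile-app popup
    {"id": 2, "role": "button", "name": "Download"},  # the actual goal
]
print(plan_clicks(page, target_id=2))  # [('click', 1), ('click', 2)]
```

In a real browser harness the actions would be issued through the automation layer (e.g., a Playwright click on each element), but the planning logic is the part current agents get wrong.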
Deceiving oneself Interestingly, we find that on some tasks, when the agent is unclear about what the next steps should be, it sometimes tries to be clever and creates fake “shortcuts” that omit the hard part of the task. For example, during one task, the agent could not find the right person to ask questions on RocketChat. It then decided to create a shortcut solution by renaming another user to the name of the intended user.
8 IMPLICATIONS AND FUTURE DIRECTIONS
In this paper, we present The Agent Company, a new benchmark that stands out because it specifically focuses on tasks that would be tackled in the context of real-world work. Unsurprisingly, current state-of-the-art agents fail to solve a majority of the tasks, suggesting a large gap before current AI agents can autonomously perform most of the jobs a human worker would do, even in a relatively simplified benchmarking setting. Looking at how different models perform on different types of tasks, we argue that the most challenging tasks are those that involve social interaction with other humans, that require navigating complex user interfaces designed for professionals, or that are typically performed in private, without significant open and publicly available resources. At the same time, new LLMs are making significant progress: not only are they becoming more capable in terms of raw performance, but also more cost-efficient (e.g., Gemini 2.0 Flash). Open-weights models are closing the gap with proprietary frontier models, and newer models are getting smaller (e.g., Llama 3.3 70B) while matching the performance of previous, much larger models, showing that efficiency will continue to improve.
That said, this is just a first step towards forming a firmer grasp of how AI may affect the tasks performed within a workplace, and it has its limitations. First, our tasks are generally on the more straightforward side, due to the need for automatic evaluation with programs and test cases, and we do not cover more complex creative tasks such as brainstorming new product ideas or designing system architectures. Second, we use only one agent scaffold to establish baseline performance, and others may differ. Third, while it would be interesting to know how human professionals actually perform on these tasks, for comparison against LLM agents, resource limitations prevented us from making this comparison in the current iteration of The Agent Company. Fourth, the topics and content of the tasks were mostly created through introspection by people familiar with these workplaces, which may result in some disconnect from the tasks actually performed in enterprise settings.
Based on this, there are many future directions for improving The Agent Company or other related benchmarks in this space. These include expanding the benchmark to tasks encountered in other industries, or to tasks that require physical labor. Benchmarks may also be expanded with tasks that have vaguer intents, to better simulate real-world scenarios where the goal is not immediately clear at the outset. Further, benchmarks could include higher-level, longer-horizon tasks such as conceptualizing a new product and carrying it through to execution. We hope that The Agent Company provides a first step, but not the only step, towards these goals, and that we or others may build upon its open-source release to expand further in these promising directions.
AUTHOR CONTRIBUTIONS
This work was an open source collaborative effort between multiple institutions and many independent individuals. We used a point-based system to determine contributions and award authorship. Frank Xu, Boxuan Li and Yufan Song led the project, coordinating overall development and paper writing efforts. Detailed contributions were as follows:
ACKNOWLEDGMENTS
This work was supported by a grant from Open Philanthropy. We thank Daniel Fried, Ruslan Salakhutdinov, Chris Donahue, Jing Yu Koh, Brandon Trabucco, Julie Nys, Olga Rogunova, Aditya Soni, Alex Lawsen, and Stephen McAleer for their insightful discussions at various stages of the project. We also thank the OpenHands team for their help with some of the task implementations.
REFERENCES
Dario Amodei and Lex Fridman. Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452, November 2024. URL https://lexfridman.com/dario-amodei-transcript. 1
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2023. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf. 2
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024. 3, 8
Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Lacoste, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, et al. The browsergym ecosystem for web agent research. arXiv preprint arXiv:2412.05467, 2024. 10
François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2024: Technical Report, December 2024. URL https://arcprize.org/media/arc-prize-2024-technical-report.pdf. 1
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw. 3, 4, 8
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. 3, 4, 8, 10
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 2
Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, 2023. 1
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024. 10
Amazon Artificial General Intelligence. The amazon nova family of models: Technical report and model card. Amazon Technical Reports, 2024. URL https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card. 2
IPython. Jupyter and the future of IPython — IPython. URL https://ipython.org. 9
Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, and Kazuhito Koishida. Video web arena: Evaluating long context multimodal agents with video understanding web tasks, 2024. URL https://arxiv.org/abs/2410.19100. 3
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66. 3
Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. Llms can’t plan, but can help planning in llm-modulo frameworks. arXiv preprint arXiv:2402.01817, 2024. 1
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.50. 3, 8
Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 2024. 3
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018. URL https://arxiv.org/abs/1802.08802. 3
Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue, 2024. URL https://arxiv.org/abs/2402.05930. 3, 8
Mozilla. Accessibility tree - MDN Web Docs Glossary: Definitions of Web-related terms | MDN. URL https://developer.mozilla.org/en-US/docs/Glossary/Accessibility_tree. 10
O*NET. The 29.1 release of the O*NET database, November 2024. URL https://www.onetcenter.org/dictionary/29.1/excel/. 8
OpenAI. Introducing gpt-4o and more tools to chatgpt free users, 2024. URL https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/. 2
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 3
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023. 3
Playwright. Fast and reliable end-to-end testing for modern web apps | Playwright. URL https://playwright.dev/. 9
James Rounds, Thomas Smith, Lawrence Hubert, Phil Lewis, and David Rivkin. Development of occupational interest profiles for O*NET. Raleigh, NC: National Center for O*NET Development, 1999. 8
ServiceNow. BrowserGym: a Gym Environment for Web Task Automation. URL https://github.com/ServiceNow/BrowserGym. 10
Yueqi Song, Frank Xu, Shuyan Zhou, and Graham Neubig. Beyond browsing: Api-based web agents. arXiv preprint arXiv:2410.16464, 2024. 9
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.acl-long.850. URL https://aclanthology.org/2024.acl-long.850. 3, 8
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable Code Actions Elicit Better LLM Agents. In ICML, 2024a. 9
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai software developers as generalist agents, 2024b. URL https://arxiv.org/abs/2407.16741. 2, 4, 9
Jeran Wittenstein. AI Can Only Do 5% of Jobs, Says MIT Economist Who Fears Crash, October 2024. URL https://www.bloomberg.com/news/articles/2024-10-02/ai-can-only-do-5-of-jobs-says-mit-economist-who-fears-crash. 1
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=tN61DTr4Ed. 3, 4, 8
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 2
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023. 10
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$ -bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024. 3
A MORE THE AGENT COMPANY ENVIRONMENT DETAILS
A.1 THE AGENT COMPANY OVERVIEW
A.2 THE AGENT COMPANY EMPLOYEE ROSTER WITH PROJECT ASSIGNMENTS AND SLACK CHANNELS
- AI Agent (Agent employee being tested in The Agent Company) - Role: All - Responsibilities: All - Project: All - Slack Channels: All
- Sarah Johnson (Female, 42 years old) - Role: CTO - Responsibilities: Technical strategy planning, R&D team leadership, new technology assessment - Project: Oversees all technical projects - Slack Channels: All technical channels, #general, #tech-talk
- Li Ming (Male, 35 years old) - Role: Database Team Project Manager - Responsibilities: Managing database projects, resource coordination, ensuring timely delivery - Skills: Java, distributed systems - Project: JanusGraph (Graph Database) - Slack Channels: #project-graphdb, #engineering, #tech-talk
- Zhang Wei (Male, 31 years old)
- Wang Fang (Female, 28 years old)
- Mike Chen (Male, 33 years old)
- Emily Zhou (Female, 29 years old)
- Liu Qiang (Male, 36 years old)
- Priya Sharma (Female, 27 years old)
- Mark Johnson (Male, 40 years old)
- Jessica Lee (Female, 32 years old)
- Chen Xinyi (Female, 30 years old)
- David Wong (Male, 45 years old) - Role: Finance Director - Responsibilities: Financial planning, budget management, financial reporting - Project: N/A (Finance) - Slack Channels: #general
- Huang Jie (Male, 34 years old) - Role: Product Manager (Search Engine Team) - Responsibilities: Defining product requirements, planning product roadmap, communicating with clients - Project: OpenSearch (Search Engine) - Slack Channels: #project-search, #product, #tech-talk
- Sophia Rodriguez (Female, 37 years old) - Role: UX Designer - Responsibilities: Designing user interfaces, improving user experience, conducting user research - Project: All projects (focusing on user experience) - Slack Channels: All project channels, #product, #tech-talk
- Alex Turner (Male, 30 years old) - Role: Software Engineer (Low-Code Platform Team) - Project: Node-RED (Low-Code Platform) - Slack Channels: #project-lowcode, #engineering, #tech-talk
- Emma Lewis (Female, 33 years old) - Role: Software Engineer (API Team) - Project: API-server (Python project) - Slack Channels: #engineering, #tech-talk
- Jessica Chen (Female, 28 years old) - Role: Frontend Software Engineer - Responsibilities: Developing user interfaces, implementing responsive designs, optimizing web performance - Project: E-commerce Website Redesign - Slack Channels: #project-ecommerce, #frontend, #tech-talk
A.3 THE AGENT COMPANY Q3 2024 QUARTERLY SPRINT GOALS
Engineering Teams
- Graph Database Team (JanusGraph) - Optimize large-scale graph query performance - Implement new graph analysis algorithms - Improve stability of distributed deployments
- Streaming Database Team (RisingWave)
- Implement new stream processing operators - Optimize memory usage - Improve fault recovery mechanisms
- AI Team (OpenHands & llama.cpp)
- Integrate latest LLM models - Optimize model inference speed - Develop model fine-tuning functionality
- Web Crawler Team (Colly)
- Implement distributed crawling functionality - Improve anti-crawling detection and bypass mechanisms - Develop data cleaning and preprocessing modules
- Search Engine Team (OpenSearch) - Optimize full-text search performance - Implement new relevance ranking algorithms - Develop custom analyzer functionality
- Low-Code Platform Team (Node-RED) - Design new visual components - Improve workflow execution engine - Develop more third-party service integrations
Product Team
- Conduct user research, collect product feedback - Develop Q4 product roadmap - Optimize product documentation and user guides

Quality Assurance Team

- Develop automated test suites - Conduct performance and load testing - Improve bug tracking and reporting processes

Sales and Marketing Team

- Organize industry trade show participation - Launch new content marketing campaigns - Develop sales team training programs
Human Resources Team

- Implement new employee development plans - Optimize recruitment processes - Organize team-building activities
Finance Team
- Prepare Q2 financial reports - Develop Q4 budget plans - Optimize financial analysis tools
A.4 THE AGENT COMPANY INTERNAL DOCUMENTS
B AGENT-SIMULATED COLLEAGUES CONVERSATION EXAMPLES
We present some examples (see Figure 5, Figure 6 and Figure 7) of the agent’s interaction with the simulated colleagues within our environment.


Figure 5: Simulated Colleague Communication Example 1 – The agent is tasked with collecting required equipment while adhering to the department’s budget. After calculating that the requested items exceed the budget, the agent negotiates with the simulated colleague to reduce the request, showcasing its ability to communicate effectively.
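The arithmetic underlying the Figure 5 negotiation reduces to a budget check: keep requested items, in priority order, only while the running total fits the budget. A minimal sketch (item names, prices, and the budget figure are made up for illustration; they are not the task's actual values):

```python
def trim_to_budget(items, budget):
    """Greedily keep items, in priority order, while the total fits the budget."""
    kept, total = [], 0.0
    for name, price in items:
        if total + price <= budget:
            kept.append((name, price))
            total += price
    return kept, total

# Hypothetical equipment request against a $1,000 department budget.
request = [
    ("laptop stand", 120.0),
    ("monitor", 650.0),
    ("docking station", 280.0),  # dropped: would push the total to $1,050
    ("webcam", 90.0),
]
kept, total = trim_to_budget(request, budget=1000.0)
print(kept, total)
```

An agent that can do this computation still has to negotiate the reduced request with the colleague, which is the part Figure 5 illustrates.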


Figure 6: Simulated Colleague Communication Example 2 – The agent is tasked with writing a job description for a new-graduate software engineering position. To fulfill the task, the agent communicates with a simulated Project Manager to gather requirements, requesting the job description template, the minimum and preferred qualifications, and the ideal salary range. This interaction evaluates the agent’s ability to gather information systematically and clarify task-related requirements through effective communication.


Figure 7: Simulated Colleague Communication Example 3 – The agent is tasked with scheduling a meeting between NPCs Emily Zhou and Liu Qiang based on their availability. Emily is available on Wednesday and Thursday, while Liu is only available on Thursday. The agent identifies Thursday as the common free day and successfully proposes a mid-morning slot at 10:30 AM, which both participants confirm. This example highlights the agent’s ability to manage multi-turn conversations, effectively going back and forth between participants to align schedules and finalize a meeting time.
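The core of the Figure 7 scheduling logic is intersecting the participants' availability and proposing a slot on a common day. A sketch (the day names mirror the example; the week order is an assumption of the illustration):

```python
def common_days(*availabilities):
    """Intersect each participant's available days, preserving week order."""
    week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
    shared = set(week)
    for days in availabilities:
        shared &= set(days)
    return [d for d in week if d in shared]

emily = {"Wednesday", "Thursday"}
liu = {"Thursday"}
print(common_days(emily, liu))  # ['Thursday']
```

The harder part the benchmark tests is not this intersection but the multi-turn back-and-forth required to elicit each person's availability and confirm the final time.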
