OmniParser for Pure Vision Based GUI Agent
Yadong Lu$^{1}$, Jianwei Yang$^{1}$, Yelong Shen$^{2}$, Ahmed Awadallah$^{1}$
$^{1}$Microsoft Research
$^{2}$Microsoft Gen AI
{yadonglu, jianwei.yang, yeshe, ahmed.awadallah}@microsoft.com
Abstract
The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent across multiple operating systems and applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OMNIPARSER, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OMNIPARSER significantly improves GPT-4V's performance on the ScreenSpot benchmark. Moreover, on the Mind2Web and AITW benchmarks, OMNIPARSER with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
1 Introduction
Large language models have shown great success in their understanding and reasoning capabilities. More recent works have explored the use of large vision-language models (VLMs) as agents to perform complex tasks on the user interface (UI), with the aim of completing tedious tasks in place of human effort [YZL+23, YYZ+23, DGZ+23, ZGK+24, HWL+23, YZS+24, WXJ+24, GFH+24, CSC+24]. Despite the promising results, there remains a significant gap between the current state of the art and widely usable agents that can work across multiple platforms, e.g. Windows/MacOS and iOS/Android, and multiple applications (web browsers, Office365, PhotoShop, Adobe), with most previous work focusing on a limited set of applications or platforms.
While large multimodal models like GPT-4V and other models trained on UI data [HWL+23, YZS+24, CSC+24] have demonstrated the ability to understand basic elements of a UI screenshot, action grounding remains one of the key challenges in converting the actions predicted by LLMs into actual actions on screen, in terms of keyboard/mouse movement or API calls [ZGK+24]. It has been noted that GPT-4V is unable to produce the exact x-y coordinates of a button location. Set-of-Mark prompting [YZL+23] proposes to overlay a group of bounding boxes, each with a unique numeric ID, onto the original image as a visual prompt sent to the GPT-4V model. By applying Set-of-Mark prompting, GPT-4V is able to ground the action to a specific bounding box with a known ground truth location instead of a specific x-y coordinate value, which greatly improves the robustness of action grounding [ZGK+24]. However, current solutions using SoM rely on parsed HTML information to extract bounding boxes for actionable elements such as buttons, which limits their usage to web browsing tasks. We aim to build a general approach that works on a variety of platforms and applications.
In this work, we argue that previous pure vision-based screen parsing techniques are not satisfactory, which leads to a significant underestimation of the GPT-4V model's understanding capabilities. A reliable vision-based screen parsing method that works well on general user interfaces is key to improving the robustness of the agentic workflow across various operating systems and applications. We present OMNIPARSER, a general screen parsing tool that extracts information from UI screenshots into structured bounding boxes and labels, which enhances GPT-4V's performance in action prediction across a variety of user tasks.
We list our contributions as follows:
2 Related Works
2.1 UI Screen Understanding
There has been a line of modeling works focusing on detailed understanding of UI screens, such as Screen2Words [WLZ+21], UI-BERT [BZX+21], Widget Captioning [LLH+20], and ActionBERT [HSZ+21]. These works demonstrated effective usage of multimodal models for extracting the semantics of the user screen. However, these models rely on additional information such as the view hierarchy, or are trained for visual question answering or screen summarization tasks.
There are also a couple of publicly available datasets on UI screen understanding. Most notable is the Rico dataset [DHF+17], which contains more than 66k unique UI screens and their view hierarchies. Later, [SWL+22] augmented Rico by providing 500k human annotations on the original 66k Rico screens, identifying various icons based on their shapes and semantics, as well as associations between selected general UI elements (like icons, form fields, radio buttons, text inputs) and their text labels. Also on the mobile platform, PixelHelp [LHZ+20] provides a dataset that contains UI elements of screens spanning 88 common tasks. In the same paper, the authors also released RicoSCA, a cleaned version of Rico. For the web and general OS domains, several works such as Mind2Web [DGZ+23], MiniWoB++ [LGP+18], Visual-WebArena [KLJ+24, ZXZ+24], and OSWorld [XZC+24] provide simulated environments, but do not explicitly provide datasets for general screen understanding tasks such as interactable icon detection on real-world websites.
To address the absence of a large-scale, general web UI understanding dataset, and to keep pace with the rapid evolution of UI design, we curated an icon detection dataset using the DOM information from popular URLs available on the web. This dataset features up-to-date designs of icons and buttons, with their bounding boxes retrieved from the DOM tree, providing ground truth locations.
2.2 Autonomous GUI Agent
Recently there have been many works on designing autonomous GUI agents to perform tasks in place of human users. One line of work trains an end-to-end model to directly predict the next action; representative works include Pixel2Act [SJC+23] and WebGUM [FLN+24] in the web domain, and Ferret [YZS+24], CogAgent [HWL+23], and Fuyu [BEH+23] in the mobile domain. Another line of work leverages existing multimodal models such as GPT-4V to perform user tasks. Representative works include the MindAct agent [DGZ+23] and the SeeAct agent [ZGK+24] in the web domain, and the agents in [YYZ+23, WXY+24, RLR+23] for the mobile domain. These works often leverage the DOM information in the web browser, or the view hierarchies in mobile apps, to get the ground truth positions of interactable elements on the screen, and use Set-of-Marks [YZL+23] to overlay the bounding boxes on top of the screenshot before feeding it into the vision-language model. However, ground truth information about interactable elements may not always be available when the goal is to build a general agent for cross-platform and cross-application tasks. Therefore, we focus on providing a systematic approach for extracting structured elements from general user screens.
3 Methods
A complex task can usually be broken down into several steps of actions. Each step requires the model's (e.g. GPT-4V's) ability to: 1) understand the UI screen in the current step, i.e. analyze what the screen content is overall and what the functions of the detected icons labeled with numeric IDs are, and 2) predict the next action on the current screen that is likely to help complete the whole task. Instead of trying to accomplish the two goals in one call, we found it beneficial to extract some of the information, such as semantics, in the screen parsing stage, to alleviate the burden on GPT-4V so that it can leverage more information from the parsed screen and focus more on action prediction.
Hence we propose OMNIPARSER, which integrates the outputs from a finetuned interactable icon detection model, a finetuned icon description model, and an OCR module. This combination produces a structured, DOM-like representation of the UI and a screenshot overlaid with bounding boxes for potential interactable elements. We discuss each component of OMNIPARSER in more detail in the rest of this section.
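As a concrete illustration, the sketch below shows one way the three module outputs could be combined into such a DOM-like structure. The class and function names (`ParsedElement`, `parse_screen`, the module callables) are illustrative assumptions and not the released implementation; box merging is covered in Section 3.1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class ParsedElement:
    """One entry of the structured, DOM-like screen representation."""
    element_id: int   # numeric ID overlaid on the screenshot
    bbox: Box
    kind: str         # "icon" or "text"
    content: str      # generated icon description or OCR text


def parse_screen(screenshot,
                 icon_detector: Callable,
                 icon_captioner: Callable,
                 ocr_engine: Callable) -> List[ParsedElement]:
    """Compose the three modules (interfaces are assumptions)."""
    icon_boxes = icon_detector(screenshot)        # [(x1, y1, x2, y2), ...]
    text_boxes = ocr_engine(screenshot)           # [((x1, y1, x2, y2), text), ...]

    elements: List[ParsedElement] = []
    for box in icon_boxes:
        elements.append(ParsedElement(len(elements), box, "icon",
                                      icon_captioner(screenshot, box)))
    for box, text in text_boxes:
        elements.append(ParsedElement(len(elements), box, "text", text))
    return elements
```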
3.1 Interactable Region Detection
Identifying interactable regions of the UI screen is a crucial step for reasoning about what actions should be performed given a user task. Instead of directly prompting GPT-4V to predict the x-y coordinate value of the screen that it should operate on, we follow previous works and use the Set-of-Marks approach [YZL+23] to overlay bounding boxes of interactable icons on top of the UI screenshot, and ask GPT-4V to generate the ID of the bounding box to perform an action on. However, different from [ZGK+24, KLJ+24], which use the ground truth button locations retrieved from the DOM tree in a web browser, and [YYZ+23], which uses labeled bounding boxes in the AITW dataset [RLR+23], we finetune a detection model to extract interactable icons/buttons.
Specifically, we curate an interactable icon detection dataset containing 67k unique screenshot images, each labeled with bounding boxes of interactable icons derived from the DOM tree. We first took a 100k uniform sample of popular publicly available URLs on the web [OXL+22], and collected bounding boxes of the interactable regions of each webpage from its DOM tree. Some examples of webpages and their interactable regions are shown in Figure 2.
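For illustration, the Playwright-based sketch below shows one plausible way to collect interactable bounding boxes from a live page's DOM. The selector rules, viewport size, and filtering are assumptions; the section does not specify the exact crawler used to build the 67k-image dataset.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# Selector for commonly interactable elements; the exact rules used for the
# dataset are not specified here, so this list is a guess.
INTERACTABLE_SELECTOR = "a, button, input, select, textarea, [role='button'], [onclick]"


def collect_interactable_boxes(url: str):
    """Return a screenshot (PNG bytes) and (x1, y1, x2, y2) boxes for one URL."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        boxes = []
        for handle in page.query_selector_all(INTERACTABLE_SELECTOR):
            rect = handle.bounding_box()  # None if the element is not rendered
            if rect and rect["width"] > 0 and rect["height"] > 0:
                boxes.append((rect["x"], rect["y"],
                              rect["x"] + rect["width"],
                              rect["y"] + rect["height"]))
        screenshot = page.screenshot()
        browser.close()
    return screenshot, boxes
```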
Apart from interactable region detection, we also have an OCR module to extract bounding boxes of text. We then merge the bounding boxes from the OCR detection module and the icon detection module, removing boxes that have high overlap (we use over 90% as the threshold). For every bounding box, we label it with a unique ID next to it, using a simple algorithm that minimizes the overlap between numeric labels and other bounding boxes.
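A minimal sketch of the merge-and-label step described above. The paper only specifies the 90% overlap threshold, so the overlap measure used here (intersection over the smaller box) is an assumption, and the label placement is kept deliberately simple rather than implementing the overlap-minimizing placement algorithm.

```python
from PIL import Image, ImageDraw


def overlap_ratio(a, b) -> float:
    """Intersection area over the smaller box's area (assumed overlap measure)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller > 0 else 0.0


def merge_boxes(ocr_boxes, icon_boxes, threshold=0.9):
    """Keep OCR boxes, then add icon boxes that do not heavily overlap them."""
    merged = list(ocr_boxes)
    for box in icon_boxes:
        if all(overlap_ratio(box, kept) < threshold for kept in merged):
            merged.append(box)
    return merged


def draw_set_of_marks(image: Image.Image, boxes) -> Image.Image:
    """Overlay each box and its numeric ID (placement heuristic simplified)."""
    draw = ImageDraw.Draw(image)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, max(0, y1 - 12)), str(i), fill="red")
    return image
```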
3.2 Incorporating Local Semantics of Functionality
We found that in many cases, inputting only the UI screenshot overlaid with bounding boxes and associated IDs can be misleading to GPT-4V. We argue this limitation stems from GPT-4V's constrained ability to simultaneously perform the composite tasks of identifying each icon's semantic information and predicting the next action on a specific icon box. This has also been observed by several other works [YYZ+23, ZGK+24]. To address this issue, we incorporate the local semantics of functionality into the prompt: for each icon detected by the interactable region detection model, we use a finetuned model to generate a description of the icon's functionality, and for each text box, we use the detected text and its label.
We perform a more detailed analysis of this topic in Section 4.1. To the best of our knowledge, there is no public model that is specifically trained for up-to-date UI icon description and suitable for our purpose of providing fast and accurate local semantics for UI screenshots. Therefore we curate a dataset of 7k icon-description pairs using GPT-4o, and finetune a BLIP-v2 model [LLSH23] on this dataset. Details of the dataset and training can be found in Appendix 7.1. After finetuning, we found the model to be much more reliable in its descriptions of common app icons. Examples can be seen in Figure 4. In Figure 3, we show that it is helpful to incorporate the semantics of local bounding boxes in the form of a text prompt along with the UI screenshot visual prompt.
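For reference, the snippet below sketches how a BLIP-2 model from Hugging Face transformers could be prompted on a cropped icon. The base checkpoint name and prompt wording are assumptions; the finetuned weights and training recipe are described in Appendix 7.1 rather than reproduced here.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Off-the-shelf checkpoint only; the paper finetunes on 7k icon-description pairs.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype).to(device)


def describe_icon(screenshot: Image.Image, bbox,
                  prompt="Question: What does this UI icon do? Answer:") -> str:
    """Crop the detected icon and generate a short functionality description."""
    icon = screenshot.crop(bbox)  # bbox = (x1, y1, x2, y2)
    inputs = processor(images=icon, text=prompt, return_tensors="pt").to(device, dtype)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```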
Figure 1: Examples of the parsed screenshot image and local semantics produced by OMNIPARSER. The inputs to OMNIPARSER are a user task and a UI screenshot, from which it produces: 1) a parsed screenshot image with bounding boxes and numeric IDs overlaid, and 2) local semantics containing both the extracted text and the icon descriptions.
4 Experiments and Results
We conduct experiments on several benchmarks to demonstrate the effectiveness of OMNIPARSER. We start with a motivating experiment showing that the current GPT-4V model with Set-of-Mark prompting [YZL+23] is prone to incorrectly assigning label IDs to the referred bounding boxes. We then evaluate on the ScreenSpot and Mind2Web benchmarks to further showcase that OMNIPARSER with local semantics improves GPT-4V's performance on real user tasks across different platforms and applications.
4.1 Evaluation on SeeAssign Task
To test the ability of GPT-4V to correctly predict the label ID given a description of a bounding box, we handcrafted a dataset, SeeAssign, that contains 112 tasks consisting of samples from 3 different platforms: mobile, desktop, and web browser. Each task includes a concise task description and a screenshot image. The task descriptions are manually created, and we make sure each task refers to one of the detected bounding boxes, e.g. 'click on "settings"', 'click on the minimize button'. During evaluation, GPT-4V is prompted to predict the bounding box ID associated with it. The detailed prompt is specified in the Appendix. The task screenshot images are sampled from the ScreenSpot [CSC+24] benchmark, where they are labeled with Set-of-Marks using OMNIPARSER. The tasks are further divided into 3 sub-categories by difficulty: easy (fewer than 10 bounding boxes), medium (10-40 bounding boxes), and hard (more than 40 bounding boxes).
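A rough sketch of how such an evaluation loop could look. The prompt wording is illustrative only (the exact prompt is given in the Appendix), the `call_gpt4v` wrapper is hypothetical, and the element fields reuse the `ParsedElement` sketch from Section 3.

```python
import re
from typing import Callable, List


def build_prompt(task: str, elements: List, with_local_semantics: bool) -> str:
    """Assemble the text half of the prompt (wording is an assumption)."""
    lines = [f"Task: {task}",
             "The attached screenshot is annotated with numbered bounding boxes."]
    if with_local_semantics:
        lines.append("Box contents:")
        lines += [f"  {e.element_id}: {e.kind} - {e.content}" for e in elements]
    lines.append("Reply with the ID of the box to act on.")
    return "\n".join(lines)


def seeassign_accuracy(tasks: List, call_gpt4v: Callable,
                       with_local_semantics: bool = True) -> float:
    """tasks: list of (task_text, annotated_image, elements, ground_truth_id).
    call_gpt4v(prompt, image) -> str is a hypothetical wrapper around the API."""
    correct = 0
    for task_text, image, elements, gt_id in tasks:
        reply = call_gpt4v(build_prompt(task_text, elements, with_local_semantics), image)
        match = re.search(r"\d+", reply)
        correct += int(match is not None and int(match.group()) == gt_id)
    return correct / max(1, len(tasks))
```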
Figure 2: Examples from the Interactable Region Detection dataset. The bounding boxes are based on the interactable regions extracted from the DOM tree of the webpage.
From Table 1, we see that GPT-4V often mistakenly assigns the numeric ID, especially when there are many bounding boxes on the screen. By adding local semantics, including the text within the boxes and short descriptions of the detected icons, GPT-4V's accuracy in assigning the correct ID improves from 0.705 to 0.938.
From Figure 3, we see that without a description of the referred icon in the task, GPT-4V often fails to link the icon required by the task to the ground truth icon ID in the SoM-labeled screenshot, which leads to hallucination in the response. With fine-grained local semantics added to the text prompt, it becomes much easier for GPT-4V to find the correct icon ID for the referred icon.
Table 1: Comparison of GPT-4V with and without local semantics
| Method | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| GPT-4V w.o. local semantics | 0.913 | 0.692 | 0.620 | 0.705 |
| GPT-4V w. local semantics | 1.00 | 0.949 | 0.900 | 0.938 |
4.2 Evaluation on ScreenSpot
The ScreenSpot dataset [CSC+24] is a benchmark that includes over 600 interface screenshots from mobile (iOS, Android), desktop (macOS, Windows), and web platforms. The task instructions are manually created so that each instruction corresponds to an actionable element on the UI screen. We first evaluate the performance of OMNIPARSER on this benchmark. In Table 2, we can see that across the 3 different platforms (mobile, desktop, and web), OMNIPARSER significantly improves the GPT-4V baseline. Noticeably, OMNIPARSER's performance even surpasses models that are specifically finetuned on GUI datasets, including SeeClick, CogAgent, and Fuyu, by a large margin. We also note that incorporating the local semantics (OMNIPARSER w. LS in the table) further improves the overall performance. This corroborates the findings in Section 4.1 that incorporating local semantics of the UI screenshot in text format, i.e. adding the OCR text and descriptions of the icon bounding boxes, further helps GPT-4V to accurately identify the correct element to operate on. Furthermore, our findings indicate that the interactable region detection (ID) model we finetuned improves the overall accuracy by an additional 4.3% compared to using the raw Grounding DINO model. This underscores the importance of accurately detecting interactable elements for the success of UI tasks. Overall, the results demonstrate that the UI screen understanding capability of GPT-4V is significantly underestimated and can be greatly enhanced with more accurate interactable element detection and the incorporation of functional local semantics.
Figure 3: Examples from the SeeAssign evaluation. We can see that fine-grained local semantics improve GPT-4V's ability to assign correct labels to the referred icon.
Table 2: Comparison of different approaches on the ScreenSpot benchmark. LS is short for local semantics of functionality, GD is short for Grounding DINO, and ID is short for the interactable region detection model we finetune.
| Method | Model Size | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
|---|---|---|---|---|---|---|---|---|
| Fuyu | 8B | 41.0% | 1.3% | 33.0% | 3.6% | 33.9% | 4.4% | 19.5% |
| CogAgent | 18B | 67.0% | 24.0% | 74.2% | 20.0% | 70.4% | 28.6% | 47.4% |
| SeeClick | 9.6B | 78.0% | 52.0% | 72.2% | 30.0% | 55.7% | 32.5% | 53.4% |
| MiniGPT-v2 | 7B | 8.4% | 6.6% | 6.2% | 2.9% | 6.5% | 3.4% | 5.7% |
| Qwen-VL | 9.6B | 9.5% | 4.8% | 5.7% | 5.0% | 3.5% | 2.4% | 5.2% |
| GPT-4V | - | 22.6% | 24.5% | 20.2% | 11.8% | 9.2% | 8.8% | 16.2% |
| OmniParser (w.o. LS, w. GD) | - | 92.7% | 49.4% | 64.9% | 26.3% | 77.3% | 39.7% | 58.38% |
| OmniParser (w. LS + GD) | - | 94.8% | 53.7% | 89.3% | 44.9% | 83.0% | 45.1% | 68.7% |
| OmniParser (w. LS + ID) | - | 93.9% | 57.0% | 91.3% | 63.6% | 81.3% | 51.0% | 73.0% |
4.3 Evaluation on Mind2Web
In order to test how OMNIPARSER helps in the web navigation scenario, we evaluate on the Mind2Web [DGZ+23] benchmark. There are 3 different categories of tasks in the test set: Cross-Domain, Cross-Website, and Cross-Task. We used a cleaned version of the Mind2Web test set processed from the raw HTML dump, which eliminates a small number of samples that have incorrect bounding boxes. In total, we have 867, 167, and 242 tasks in the test set from the Cross-Domain, Cross-Website, and Cross-Task categories, respectively. During evaluation, we feed both the parsed screen results and the action history as a text prompt, along with the SoM-labeled screenshot, to GPT-4V, similar to the prompting strategy in [YYZ+23, ZGK+24]. Following the original paper, we perform an offline evaluation focusing on element accuracy, operation F1, and step success rate, averaged across tasks.
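For clarity, the snippet below gives a simplified re-implementation of the three step-level metrics. The matching rules follow our reading of the Mind2Web evaluation protocol (element accuracy checks the selected element, operation F1 is token-level F1 over the predicted operation string, and a step succeeds only when both match) and may differ in detail from the official scripts.

```python
from collections import Counter
from typing import Dict, List


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and ground-truth operation strings,
    e.g. "CLICK" vs. "TYPE best sellers" (simplified re-implementation)."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_steps(steps: List[Dict]) -> Dict[str, float]:
    """steps: dicts with pred_element, gold_element, pred_op, gold_op.
    A step counts as successful only if both the element and the operation match."""
    ele_acc = [s["pred_element"] == s["gold_element"] for s in steps]
    op_f1 = [token_f1(s["pred_op"], s["gold_op"]) for s in steps]
    step_sr = [a and f == 1.0 for a, f in zip(ele_acc, op_f1)]
    n = len(steps)
    return {"Ele.Acc": sum(ele_acc) / n,
            "Op.F1": sum(op_f1) / n,
            "Step SR": sum(step_sr) / n}
```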
In the first section of the table (rows 1-3), we report numbers from a set of open-source VL models as they appear in [ZGK+24, CSC+24]. Here CogAgent and Qwen-VL are not finetuned on the Mind2Web training set. More detailed information about the model settings can be found in Appendix 7.4.
In the second section of the table (rows 4-9), we report numbers from the Mind2Web paper [DGZ+23] and the SeeAct paper [ZGK+24]. In this section, all of the approaches use the HTML elements selected by an element proposal model finetuned on the Mind2Web training set, which produces the top 50 relevant elements on the HTML page based on the user task. Additionally, GPT-4V+SOM and GPT-4V+textual choices correspond to SeeAct with the image annotation and textual choices grounding methods, respectively. In GPT-4V+SOM, the Set-of-Mark (SOM) boxes are selected by the element proposal model and are labeled with the ground truth locations extracted from the HTML. In contrast, GPT-4V+textual choices uses the DOM information of the selected relevant elements directly in the text prompt, rather than overlaying bounding boxes on top of the screenshot. The better performance of textual choices corroborates the experimental results in Section 4.1.
In the last section (rows 10-11), we report numbers from OMNIPARSER. We observe that GPT-4V augmented with the local semantics of icon functionality and the finetuned interactable region detection model (w. LS + ID) performs better than the model with the raw Grounding DINO model (w. LS + GD) in all of the categories.
Further, without using parsed HTML information, OMNIPARSER is able to outperform the GPT-4 performance that uses HTML in every sub-category by a large margin, suggesting the substantial benefit of the screen parsing results provided by OMNIPARSER. Additionally, OMNIPARSER outperforms GPT-4V+SOM by a large margin. Compared to GPT-4V+textual choices, OMNIPARSER significantly outperforms in the Cross-Website and Cross-Domain categories (+4.1% and +5.2%), while slightly underperforming (-0.8%) in the Cross-Task category. This indicates that OMNIPARSER provides higher quality information than the ground truth element information from the DOM and the top-k relevant element proposals used in the GPT-4V+textual choices setup, making it easier for GPT-4V to make an accurate action prediction. Lastly, OMNIPARSER with GPT-4V significantly outperforms all other models trained using only UI screenshots, such as SeeClick and Qwen-VL.
Table 3: Comparison of different methods across various categories on the Mind2Web benchmark. (Only the columns recoverable from the source are shown; the reported values correspond to the surviving Ele.Acc and Op.F1 columns.)

| Method | HTML-free | Ele.Acc | Op.F1 |
|---|---|---|---|
| CogAgent | | 18.4 | 42.2 |
| Qwen-VL | | 13.2 | 83.5 |
| SeeClick | | 21.4 | 80.6 |
| MindAct (gen) | ✗ | 13.9 | 44.7 |
| MindAct | ✗ | 42.0 | 65.2 |