OmniParser for Pure Vision Based GUI Agent
Yadong Lu¹, Jianwei Yang¹, Yelong Shen², Ahmed Awadallah¹
¹Microsoft Research   ²Microsoft Gen AI
{yadonglu, jianwei.yang, yeshe, ahmed.awadallah}@microsoft.com
Abstract
The recent success of large vision language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent across multiple operating systems and applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OMNIPARSER, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages, along with an icon description dataset. These datasets were used to fine-tune specialized models: a detection model that parses interactable regions on the screen and a caption model that extracts the functional semantics of the detected elements. OMNIPARSER significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OMNIPARSER with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
1 Introduction
Large language models have shown great success in their understanding and reasoning capabilities. More recent works have explored the use of large vision-language models (VLMs) as agents that perform complex tasks on the user interface (UI), with the aim of completing tedious tasks in place of human effort [YZL+23, YYZ+23, DGZ+23, ZGK+24, HWL+23, YZS+24, WXJ+24, GFH+24, CSC+24]. Despite the promising results, there remains a significant gap between the current state of the art and widely usable agents that can work across multiple platforms, e.g. Windows/MacOS, iOS/Android, and multiple applications (web browsers, Office365, PhotoShop, Adobe), with most previous work focusing on a limited set of applications or platforms.
While large multimodal models like GPT-4V and other models trained on UI data [HWL+23, YZS+24, CSC+24] have demonstrated the ability to understand basic elements of a UI screenshot, action grounding remains one of the key challenges in converting the actions predicted by LLMs into actual actions on screen, whether keyboard/mouse movements or API calls [ZGK+24]. It has been noted that GPT-4V is unable to produce the exact x-y coordinate of a button location. Set-of-Mark prompting [YZL+23] proposes to overlay a group of bounding boxes, each with a unique numeric ID, onto the original image as a visual prompt sent to the GPT-4V model. With set-of-marks prompting, GPT-4V is able to ground the action to a specific bounding box with a known ground-truth location instead of a specific x-y coordinate value, which greatly improves the robustness of action grounding [ZGK+24]. However, current solutions using SoM rely on parsed HTML information to extract bounding boxes for actionable elements such as buttons, which limits their usage to web browsing tasks. We aim to build a general approach that works on a variety of platforms and applications.
In this work, we argue that previous pure vision-based screen parsing techniques are not satisfactory, which leads to a significant underestimation of the GPT-4V model's understanding capabilities. A reliable vision-based screen parsing method that works well on general user interfaces is key to improving the robustness of agentic workflows across operating systems and applications. We present OMNIPARSER, a general screen parsing tool that extracts information from UI screenshots into structured bounding boxes and labels, which enhances GPT-4V's performance in action prediction across a variety of user tasks.
We list our contributions as follows:
2 Related Works
2.1 UI Screen Understanding
There has been a line of modeling works focusing on detailed understanding of UI screens, such as Screen2Words [WLZ+21], UI-BERT [BZX+21], Widget Captioning [LLH+20], and ActionBERT [HSZ+21]. These works demonstrated effective usage of multimodal models for extracting the semantics of the user screen. However, these models rely on additional information such as the view hierarchy, or are trained for visual question answering or screen summarization tasks.
There are also a few publicly available datasets on UI screen understanding. Most notable is the Rico dataset [DHF+17], which contains more than 66k unique UI screens and their view hierarchies. Later, [SWL+22] augmented Rico by providing 500k human annotations on the original 66k Rico screens, identifying various icons based on their shapes and semantics, as well as associations between selected general UI elements (such as icons, form fields, radio buttons, and text inputs) and their text labels. Also on the mobile platform, PixelHelp [LHZ+20] provides a dataset containing screen UI elements spanning 88 common tasks. The same paper also released RicoSCA, a cleaned version of Rico. For the web and general OS domain, there are several works such as Mind2Web [DGZ+23], MiniWoB++ [LGP+18], Visual-WebArena [KLJ+24, ZXZ+24], and OSWorld [XZC+24] that provide simulated environments, but they do not explicitly provide datasets for general screen understanding tasks such as interactable icon detection on real-world websites.
To address the absence of a large-scale, general web UI understanding dataset, and to keep pace with the rapid evolution of UI design, we curated an icon detection dataset using the DOM information from popular URLs available on the web. This dataset features up-to-date designs of icons and buttons, with their bounding boxes retrieved from the DOM tree, providing ground truth locations.
2.2 Autonomous GUI Agent
Recently there have been many works on designing autonomous GUI agents to perform tasks in place of human users. One line of work trains an end-to-end model to directly predict the next action; representative works include Pixel2Act [SJC+23] and WebGUM [FLN+24] in the web domain, and Ferret [YZS+24], CogAgent [HWL+23], and Fuyu [BEH+23] in the mobile domain. Another line of work leverages existing multimodal models such as GPT-4V to perform user tasks. Representative works include the MindAct agent [DGZ+23] and SeeAct agent [ZGK+24] in the web domain, and the agents in [YYZ+23, WXY+24, RLR+23] for the mobile domain. These works often leverage the DOM information in the web browser, or the view hierarchies in mobile apps, to obtain the ground truth positions of interactable elements on the screen, and use Set-of-Marks [YZL+23] to overlay bounding boxes on top of the screenshot before feeding it into vision-language models. However, ground truth information about interactable elements may not always be available when the goal is to build a general agent for cross-platform and cross-application tasks. Therefore, we focus on providing a systematic approach for obtaining structured elements from general user screens.
3 Methods
A complex task can usually be broken down into several steps of actions. Each step requires the model's (e.g. GPT-4V's) ability to: 1) understand the UI screen at the current step, i.e. analyze the overall screen content and the functions of the detected icons labeled with numeric IDs, and 2) predict the next action on the current screen that is likely to help complete the whole task. Instead of trying to accomplish both goals in one call, we found it beneficial to extract some of this information, such as semantics, in the screen parsing stage, to alleviate the burden on GPT-4V so that it can leverage more information from the parsed screen and focus more on action prediction.
Hence we propose OMNIPARSER, which integrates the outputs from a finetuned interactable icon detection model, a finetuned icon description model, and an OCR module. This combination produces a structured, DOM-like representation of the UI and a screenshot overlaid with bounding boxes for potential interactable elements. We discuss each component of OMNIPARSER in more detail in the rest of this section.
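To make the data flow concrete, below is a minimal sketch of how the three modules' outputs could be combined into the structured representation described above; the wrappers `detect_icons`, `run_ocr`, and `describe_icon` are hypothetical placeholders for the finetuned detector, the OCR engine, and the finetuned caption model, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    elem_id: int    # numeric ID overlaid on the screenshot (SoM label)
    bbox: tuple     # (x1, y1, x2, y2) in pixels
    kind: str       # "icon" or "text"
    content: str    # icon functionality description, or OCR text

def parse_screenshot(image, detect_icons, run_ocr, describe_icon):
    """Combine icon detection, OCR, and icon captioning into a
    structured, DOM-like list of elements with numeric SoM IDs."""
    elements = []
    for bbox in detect_icons(image):                 # finetuned interactable icon detector
        elements.append(ParsedElement(0, bbox, "icon", describe_icon(image, bbox)))
    for bbox, text in run_ocr(image):                # OCR module
        elements.append(ParsedElement(0, bbox, "text", text))
    for i, el in enumerate(elements):                # assign the unique numeric IDs
        el.elem_id = i
    return elements
```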
3.1 Interactable Region Detection
Identifying interactable regions on the UI screen is a crucial step for reasoning about which actions should be performed given a user task. Instead of directly prompting GPT-4V to predict the x-y coordinate value of the screen location it should operate on, we follow previous works and use the Set-of-Marks approach [YZL+23] to overlay bounding boxes of interactable icons on top of the UI screenshot, asking GPT-4V to generate the ID of the bounding box to act on. However, unlike [ZGK+24, KLJ+24], which use ground truth button locations retrieved from the DOM tree in the web browser, and [YYZ+23], which uses labeled bounding boxes in the AITW dataset [RLR+23], we finetune a detection model to extract interactable icons/buttons.
Specifically, we curate an interactable icon detection dataset containing 67k unique screenshot images, each labeled with bounding boxes of interactable icons derived from the DOM tree. We first took a uniform sample of 100k popular, publicly available URLs on the web [OXL+22], and collected bounding boxes of interactable regions of each webpage from its DOM tree. Some examples of the webpages and the interactable regions are shown in Figure 2.
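As a rough illustration of this collection step (the acknowledgement notes the actual pipeline adapted AutoGen's multimodal websurfer code), a sketch using Playwright might look like the following; the CSS selector and the size filter are assumptions, not the authors' exact extraction rules.

```python
from playwright.sync_api import sync_playwright

# Heuristic: elements that are natively clickable or annotated as clickable.
CLICKABLE = "a, button, input, select, textarea, [role='button'], [onclick]"

def collect_interactable_bboxes(url: str):
    """Return (screenshot_bytes, list of bounding boxes) for likely interactable elements."""
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(url, wait_until="load")
        shot = page.screenshot()
        boxes = []
        for handle in page.query_selector_all(CLICKABLE):
            box = handle.bounding_box()   # {'x', 'y', 'width', 'height'} or None if hidden
            if box and box["width"] > 1 and box["height"] > 1:
                boxes.append((box["x"], box["y"],
                              box["x"] + box["width"], box["y"] + box["height"]))
        return shot, boxes
```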
Apart from interactable region detection, we also have an OCR module to extract bounding boxes of text. We then merge the bounding boxes from the OCR module and the icon detection module, removing boxes with high overlap (we use over 90% as the threshold). For every bounding box, we label it with a unique ID next to it, using a simple algorithm to minimize the overlap between numeric labels and other bounding boxes.
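A minimal sketch of this merge-and-label step is shown below; the paper does not specify the exact overlap measure or which box is dropped, so the ratio used here (intersection over the smaller box) and the choice to drop the icon box are assumptions.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: max(0, r[2] - r[0]) * max(0, r[3] - r[1])
    smaller = min(area(a), area(b)) or 1
    return inter / smaller

def merge_boxes(ocr_boxes, icon_boxes, threshold=0.9):
    """Keep all OCR boxes; drop icon boxes that largely overlap an OCR box,
    then assign a unique numeric ID to every remaining box."""
    merged = list(ocr_boxes)
    for icon in icon_boxes:
        if all(overlap_ratio(icon, t) <= threshold for t in ocr_boxes):
            merged.append(icon)
    return {i: box for i, box in enumerate(merged)}   # ID -> bounding box
```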
3.2 Incorporating Local Semantics of Functionality
We found that in many cases, inputting only the UI screenshot overlaid with bounding boxes and associated IDs can be misleading to GPT-4V. We argue this limitation stems from GPT-4V's constrained ability to simultaneously perform the composite tasks of identifying each icon's semantic information and predicting the next action on a specific icon box. This has also been observed by several other works [YYZ+23, ZGK+24]. To address this issue, we incorporate the local semantics of functionality into the prompt: for each icon detected by the interactable region detection model, we use a finetuned model to generate a description of the icon's functionality, and for each text box, we use the detected text and its label.
We perform a more detailed analysis of this topic in Section 4.1. To the best of our knowledge, there is no public model that is specifically trained for up-to-date UI icon description and is suitable for our purpose of providing fast and accurate local semantics for UI screenshots. We therefore curate a dataset of 7k icon-description pairs using GPT-4o, and finetune a BLIP-v2 model [LLSH23] on this dataset. Details of the dataset and training can be found in Appendix 7.1. After finetuning, we found the model to be much more reliable in its descriptions of common app icons. Examples can be seen in Figure 4. In Figure 3, we show that it is helpful to incorporate the semantics of local bounding boxes as a text prompt along with the UI screenshot visual prompt.
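As a sketch of how the finetuned caption model can be queried per detected icon, using the Hugging Face BLIP-2 classes (the checkpoint path below is a placeholder, not a released artifact):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Hypothetical path to a BLIP-2 checkpoint finetuned on icon-description pairs.
CKPT = "path/to/finetuned-blip2-icon-captioner"
processor = Blip2Processor.from_pretrained(CKPT)
model = Blip2ForConditionalGeneration.from_pretrained(CKPT, torch_dtype=torch.float16).to("cuda")

def describe_icon(screenshot: Image.Image, bbox) -> str:
    """Crop the detected icon and generate a short functionality description."""
    crop = screenshot.crop(bbox)   # bbox = (x1, y1, x2, y2)
    inputs = processor(images=crop, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()
```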

Figure 1: Examples of a parsed screenshot image and local semantics produced by OMNIPARSER. The inputs to OMNIPARSER are a user task and a UI screenshot, from which it produces: 1) a parsed screenshot image with bounding boxes and numeric IDs overlaid, and 2) local semantics containing both the extracted text and the icon descriptions.
4 Experiments and Results
We conduct experiments on several benchmarks to demonstrate the effectiveness of OMNIPARSER. We start with a motivating experiment showing that the current GPT-4V model with set-of-marks prompting [YZL+23] is prone to incorrectly assigning label IDs to the referred bounding boxes. We then evaluate on the ScreenSpot and Mind2Web benchmarks to further showcase that OMNIPARSER with local semantics improves GPT-4V's performance on real user tasks across different platforms and applications.
4.1 Evaluation on SeeAssign Task
To test GPT-4V's ability to correctly predict the label ID given a description of the target bounding box, we handcrafted a dataset, SeeAssign, that contains 112 tasks consisting of samples from 3 different platforms: mobile, desktop, and web browser. Each task includes a concise task description and a screenshot image. The task descriptions are manually created, and we make sure each task refers to one of the detected bounding boxes, e.g. 'click on "settings"', 'click on the minimize button'. During evaluation, GPT-4V is prompted to predict the associated bounding box ID. The detailed prompts are specified in the Appendix. The task screenshot images are sampled from the ScreenSpot [CSC+24] benchmark and labeled with set-of-marks using OMNIPARSER. The tasks are further divided into 3 sub-categories by difficulty: easy (fewer than 10 bounding boxes), medium (10-40 bounding boxes), and hard (more than 40 bounding boxes).
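A condensed sketch of how this evaluation can be scripted is shown below, assuming an OpenAI-style chat client and the prompt templates from Appendix 7.3; the model name and helper names are illustrative, and the answer is parsed from the 'Box with label ID: [xx]' format that the prompt requests.

```python
import base64
import re

from openai import OpenAI

client = OpenAI()

def ask_gpt4v(task: str, screenshot_path: str, semantics: str | None = None) -> str:
    """Query GPT-4V with the SoM-labeled screenshot and, optionally, local semantics."""
    prompt = ("Here is a UI screenshot image with bounding boxes and corresponding "
              "labeled ID overlayed on top of it, ")
    if semantics:
        prompt += f"and here is a list of icon/text box description: {semantics}. "
    prompt += (f"Your task is {task}. Which bounding box label you should operate on? "
               "Give a brief analysis, then put your answer in the format of "
               '"Box with label ID: [xx]"')
    image_b64 = base64.b64encode(open(screenshot_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # model name is an assumption, not the checkpoint used in the paper
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

def parse_box_id(answer: str) -> int | None:
    """Extract the predicted numeric ID from 'Box with label ID: [xx]'."""
    m = re.search(r"Box with label ID:\s*\[?(\d+)\]?", answer)
    return int(m.group(1)) if m else None

# Accuracy over the 112 SeeAssign tasks (tasks is a list of task records):
# acc = sum(parse_box_id(ask_gpt4v(t.task, t.image, t.semantics)) == t.gt_id for t in tasks) / len(tasks)
```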

Figure 2: Examples from the Interactable Region Detection dataset. The bounding boxes are based on the interactable regions extracted from the DOM tree of the webpage.
From Table 1, we see that GPT-4V often assigns the numeric label ID incorrectly, especially when there are many bounding boxes on the screen. By adding local semantics, including the text within the boxes and short descriptions of the detected icons, GPT-4V's ability to correctly assign the icon ID improves from 0.705 to 0.938.
From Figure 3, we see that without a description of the referred icon in the task, GPT-4V often fails to link the icon required by the task to the ground truth icon ID in the SoM-labeled screenshot, which leads to hallucination in the response. With fine-grained local semantics added to the text prompt, it becomes much easier for GPT-4V to find the correct icon ID for the referred icon.
Table 1: Comparison of GPT-4V with and without local semantics
| Method | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| GPT-4V w.o. local semantics | 0.913 | 0.692 | 0.620 | 0.705 |
| GPT-4V w. local semantics | 1.00 | 0.949 | 0.900 | 0.938 |
4.2 Evaluation on ScreenSpot
The ScreenSpot dataset [CSC+24] is a benchmark that includes over 600 interface screenshots from mobile (iOS, Android), desktop (macOS, Windows), and web platforms. The task instructions are manually created so that each instruction corresponds to an actionable element on the UI screen. We first evaluate the performance of OMNIPARSER using this benchmark. In Table 2, we can see that across the 3 different platforms (mobile, desktop, and web), OMNIPARSER significantly improves the GPT-4V baseline. Noticeably, OMNIPARSER's performance even surpasses models specifically finetuned on GUI datasets, including SeeClick, CogAgent, and Fuyu, by a large margin. We also note that incorporating local semantics (OMNIPARSER w. LS in the table) further improves the overall performance. This corroborates the findings in Section 4.1 that incorporating local semantics of the UI screenshot in text form, i.e. adding OCR text and descriptions of the icon bounding boxes, further helps GPT-4V accurately identify the correct element to operate on. Furthermore, our findings indicate that the interactable region detection (ID) model we finetuned improves overall accuracy by an additional 4.3% compared to using the raw Grounding DINO model. This underscores the importance of accurately detecting interactable elements for the success of UI tasks. Overall, the results demonstrate that the UI screen understanding capability of GPT-4V is significantly underestimated and can be greatly enhanced with more accurate interactable element detection and the incorporation of functional local semantics.

Figure 3: Examples from the SeeAssign evaluation. We can see that fine-grained local semantics improve GPT-4V's ability to assign correct labels to the referred icon.
Table 2: Comparison of different approaches on the ScreenSpot benchmark. LS is short for local semantics of functionality, GD is short for Grounding DINO, and ID is short for the interactable region detection model we finetune.
| Method | Model Size | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
|---|---|---|---|---|---|---|---|---|
| Fuyu | 8B | 41.0% | 1.3% | 33.0% | 3.6% | 33.9% | 4.4% | 19.5% |
| CogAgent | 18B | 67.0% | 24.0% | 74.2% | 20.0% | 70.4% | 28.6% | 47.4% |
| SeeClick | 9.6B | 78.0% | 52.0% | 72.2% | 30.0% | 55.7% | 32.5% | 53.4% |
| MiniGPT-v2 | 7B | 8.4% | 6.6% | 6.2% | 2.9% | 6.5% | 3.4% | 5.7% |
| Qwen-VL | 9.6B | 9.5% | 4.8% | 5.7% | 5.0% | 3.5% | 2.4% | 5.2% |
| GPT-4V | - | 22.6% | 24.5% | 20.2% | 11.8% | 9.2% | 8.8% | 16.2% |
| OmniParser (w.o. LS, w. GD) | - | 92.7% | 49.4% | 64.9% | 26.3% | 77.3% | 39.7% | 58.38% |
| OmniParser (w. LS + GD) | - | 94.8% | 53.7% | 89.3% | 44.9% | 83.0% | 45.1% | 68.7% |
| OmniParser (w. LS + ID) | - | 93.9% | 57.0% | 91.3% | 63.6% | 81.3% | 51.0% | 73.0% |
4.3 Evaluation on Mind2Web
In order to test how OMNIPARSER helps in the web navigation scenario, we evaluate on the Mind2Web [DGZ+23] benchmark. There are 3 different categories of tasks in the test set: Cross-Domain, Cross-Website, and Cross-Task. We used a cleaned version of the Mind2Web test set processed from the raw HTML dump, which eliminates a small number of samples that have incorrect bounding boxes. In total, the test set contains 867, 167, and 242 tasks from the Cross-Domain, Cross-Website, and Cross-Task categories, respectively. During evaluation, we feed both the parsed screen results and the action history as a text prompt, along with the SoM-labeled screenshot, to GPT-4V, similar to the prompting strategy in [YYZ+23, ZGK+24]. Following the original paper, we perform offline evaluation focusing on element accuracy, operation F1, and step success rate averaged across tasks.
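For reference, here is a minimal sketch of how these offline metrics could be computed from per-step predictions; the record fields are illustrative, and the official Mind2Web evaluation script defines the exact token-level F1.

```python
from collections import Counter

def op_f1(pred_op: str, gold_op: str) -> float:
    """Token-level F1 between predicted and gold operation strings (e.g. 'CLICK', 'TYPE query')."""
    p, g = pred_op.lower().split(), gold_op.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if not common:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def evaluate(steps):
    """steps: list of dicts with 'pred_elem', 'gold_elems', 'pred_op', 'gold_op'."""
    ele_acc = [s["pred_elem"] in s["gold_elems"] for s in steps]
    f1s = [op_f1(s["pred_op"], s["gold_op"]) for s in steps]
    # A step succeeds only if both the element and the operation are correct.
    step_sr = [acc and f1 == 1.0 for acc, f1 in zip(ele_acc, f1s)]
    n = len(steps)
    return {"Ele.Acc": sum(ele_acc) / n, "Op.F1": sum(f1s) / n, "Step SR": sum(step_sr) / n}
```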
In the first section of the table (rows 1-3), we report numbers from a set of open-source VL models as they appear in [ZGK+24, CSC+24]. Here, CogAgent and Qwen-VL are not finetuned on the Mind2Web training set. More detailed information about the model settings can be found in Appendix 7.4.
In the second section of the table (rows 4-9), we report numbers from the Mind2Web paper [DGZ+23] and the SeeAct paper [ZGK+24]. In this section, all approaches use HTML elements selected by an element proposal model finetuned on the Mind2Web training set, which produces the top 50 relevant elements on the HTML page based on the user task. Additionally, GPT-4V+SOM and GPT-4V+textual choices correspond to SeeAct with the image annotation and textual choices grounding methods, respectively. In GPT-4V+SOM, the set-of-marks (SOM) boxes are selected from the element proposal model and labeled with the ground truth locations extracted from HTML. In contrast, GPT-4V+textual uses the DOM information of the selected relevant elements directly in the text prompt, rather than overlaying bounding boxes on top of the screenshot. The better performance of textual choices corroborates the experimental results in Section 4.1.
In the last section (rows 10-11), we report numbers from OMNIPARSER. We observe that GPT-4V augmented with local semantics of icon functionality and the finetuned interactable region detection model (w. LS + ID) performs better than the model with the raw Grounding DINO model (w. LS + GD) in all categories.
Further, without using parsed HTML information, OMNIPARSER is able to outperform GPT-4 with HTML in every sub-category by a large margin, suggesting the substantial benefit of the screen parsing results provided by OMNIPARSER. Additionally, OMNIPARSER outperforms GPT-4V+SOM by a large margin. Compared to GPT-4V+textual choices, OMNIPARSER significantly outperforms in the Cross-Website and Cross-Domain categories (+4.1% and +5.2%), while slightly underperforming (-0.8%) in the Cross-Task category. This indicates that OMNIPARSER provides higher-quality information than the ground truth element information from the DOM and the top-k relevant element proposals used by the GPT-4V+textual choices setup, making it easier for GPT-4V to make accurate action predictions. Lastly, OMNIPARSER with GPT-4V significantly outperforms all other trained models that use only UI screenshots, such as SeeClick and Qwen-VL.
Table 3: Comparison of different methods across various categories on Mind2Web benchmark.
| Method | HTML-free | Image | Cross-Website Ele.Acc | Cross-Website Op.F1 |
|---|---|---|---|---|
| CogAgent | | | 18.4 | 42.2 |
| Qwen-VL | | | 13.2 | 83.5 |
| SeeClick | | | 21.4 | 80.6 |
| MindAct (gen) | x | | 13.9 | 44.7 |
| MindAct | x | | 42.0 | 65.2 |
| GPT-3.5-Turbo | | | 19.3 | 48.8 |
| GPT-4 | x | | 35.8 | 51.1 |
| GPT-4V+SOM | x | V | | |
| GPT-4V+textual choice | x | | 38.0 | 67.8 |
| OmniParser (w. LS + GD) | | | 41.5 | 83.2 |
| OmniParser (w. LS + ID) | V | | 41.0 | 84.8 |
4.4 Evaluation on AITW
In addition to the evaluation on multi-step web browsing tasks, we assess OMNIPARSER on the mobile navigation benchmark AITW [RLR+23], which contains 30k instructions and 715k trajectories. We use the same train/test split as in [CSC+24], based on instructions, which retains only one trajectory per instruction with no intersection between train and test. For a fair comparison, we only use their test split for evaluation and discard the train set, as our method does not require finetuning.
In Table 4, we report the GPT-4V baseline from [YYZ+23], which corresponds to the best-performing setup (GPT-4V ZS+history) that uses UI elements detected by IconNet [SWL+22] through set-of-marks prompting [YZL+23] for each screenshot at every step of the evaluation. The detected UI elements consist of either OCR-detected text or an icon class label, which is one of the 96 possible icon types identified by IconNet. Action history is also incorporated into each step's prompt. We used the exact same prompt format as [YYZ+23], except that the results from the IconNet model are replaced with the output of the finetuned interactable region detection (ID) model. Interestingly, we found that the ID model generalizes well to mobile screens. By replacing IconNet with the interactable region detection (ID) model we finetuned on the collected webpages, and incorporating local semantics of icon functionality (LS), we find that OMNIPARSER delivers significantly improved performance across most sub-categories, and a 4.7% increase in the overall score compared to the best-performing GPT-4V+history baseline.
Table 4: Comparison of different methods across various tasks and overall performance in AITW benchmark.
| Method | Modality | General | Install | GoogleApps | Single | WebShopping | Overall |
|---|---|---|---|---|---|---|---|
| ChatGPT-CoT | Text | 5.9 | 4.4 | 10.5 | 9.4 | 8.4 | 7.7 |
| PaLM2-CoT | Text | - | - | - | - | - | 39.6 |
| GPT-4V image-only | Image | 41.7 | 42.6 | 49.8 | 72.8 | 45.7 | 50.5 |
| GPT-4V+history | Image | 43.0 | 46.1 | 49.2 | 78.3 | 48.2 | 53.0 |
| OmniParser (w. LS + ID) | Image | 48.3 | 57.8 | 51.6 | 77.4 | 52.9 | 57.7 |
5 Discussions
In this section, we discuss a few common failure cases of OMNIPARSER with examples, along with potential approaches for improvement.
Repeated Icons/Texts From analysis of GPT-4V's response logs, we found that GPT-4V often fails to make the correct prediction when the OMNIPARSER results contain multiple repeated icons/texts, which leads to failure if the user task requires clicking on one specific button. This is illustrated in Figure 7 (Left) in the Appendix. A potential solution is to add finer-grained descriptions to the repeated elements in the UI screenshot, so that GPT-4V is aware of the existence of repeated elements and takes them into account when predicting the next action.
Coarse Prediction of Bounding Boxes One common failure case of OMNIPARSER is that it fails to detect bounding boxes at the correct granularity. In Figure 7 (Right), the task is to click on the text 'MORE'. The OCR module of OMNIPARSER detects text bounding box 8, which encompasses the desired text. But since the center of the box is used as the predicted click point, the click falls outside the ground truth bounding box (see the sketch below). This is essentially because the OCR module we use has no notion of which text regions are hyperlinks and clickable. Hence we plan to train a model that combines OCR and interactable region detection into one module, so that it can better detect clickable text/hyperlinks.
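A tiny sketch of this failure mode: the click point is taken as the center of the detected box, so when OCR returns a box larger than the clickable text, the center can land outside the ground truth region.

```python
def click_point(box):
    """Predicted click point: the center of the detected bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def hits(point, gt_box):
    """Does the predicted click land inside the ground truth clickable region?"""
    x, y = point
    return gt_box[0] <= x <= gt_box[2] and gt_box[1] <= y <= gt_box[3]
```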
Icon Misinterpretation We found that in some cases, icons with similar shapes can have different meanings depending on the UI screenshot. For example, in Figure 8, the task is to find the button related to 'More information', where the ground truth is to click the three-dots icon in the upper right part of the screenshot. OMNIPARSER successfully detects all the relevant bounding boxes, but the icon description model interprets the icon as "a loading or buffering indicator". We think this is because the icon description model only sees each icon cropped from the image, and not the whole picture, during both training and inference. Without the full context of the image, a symbol of three dots can indeed mean a loading buffer in other scenarios. A potential fix is to train an icon description model that is aware of the full context of the image.
6 Conclusion
In this report, we propose OMNIPARSER, a general vision-only approach that parses UI screenshots into structured elements. OMNIPARSER encompasses two finetuned models: an icon detection model and a functional description model. To train them, we curated an interactable region detection dataset using popular webpages, and an icon functional description dataset. We demonstrate that with the parsed results, the performance of GPT-4V is greatly improved on the ScreenSpot benchmark. It achieves better performance than a GPT-4V agent that uses HTML-extracted information on Mind2Web, and outperforms GPT-4V augmented with a specialized Android icon detection model on the AITW benchmark. We hope OMNIPARSER can serve as a general and easy-to-use tool capable of parsing general user screens across both PC and mobile platforms, without any dependency on extra information such as HTML or the Android view hierarchy.
Acknowledgement
We would like to thank Corby Rosset and the authors of ClueWeb22 for providing the seed URLs that we used to collect data for finetuning the interactable region detection model. The data collection pipeline adapted AutoGen's multimodal websurfer code for extracting interactable elements in the DOM, for which we thank Adam Fourney. We also thank Dillon DuPont for providing the processed version of the Mind2Web benchmark.
References
7 Appendix
7.1 Details of Icon-Description Dataset
In Figure 4, we see that the original BLIP-2 model tends to focus on describing the shapes and colors of app icons, while struggling to recognize their semantics. This motivated us to finetune the model on an icon description dataset. For the dataset, we use the icon bounding boxes inferred by the interactable icon detection model on the ScreenSpot dataset, since it contains screenshots from both mobile and PC. For the descriptions, we ask GPT-4o whether the object in the parsed bounding box is an app icon. If GPT-4o decides the image is an icon, it outputs a one-sentence description of the icon's potential functionality. If not, GPT-4o outputs 'this is not an icon', and we still include the pair in the dataset. In the end, we collected 7,185 icon-description pairs for finetuning.
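A sketch of how such a labeling pass could be scripted against GPT-4o is shown below; the instruction wording is illustrative, not the exact prompt used in the paper.

```python
import base64
import io

from openai import OpenAI

client = OpenAI()
INSTRUCTION = ("Decide whether this image is an app icon. If it is, answer with a one-sentence "
               "description of its likely functionality. If it is not, answer exactly: "
               "'this is not an icon'.")

def label_crop(crop) -> str:
    """crop: a PIL.Image of one detected bounding box; returns the GPT-4o description."""
    buf = io.BytesIO()
    crop.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": INSTRUCTION},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content.strip()

# dataset = [(crop, label_crop(crop)) for crop in detected_icon_crops]  # ~7k pairs in the paper
```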
We finetune the BLIP-2 model for 1 epoch on the generated dataset with a constant learning rate of 1e-5, no weight decay, and the Adam optimizer. We show a few qualitative examples of the finetuned model vs. the original model in Figure 4.
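A compressed sketch of such a finetuning loop under the stated hyperparameters (1 epoch, constant learning rate 1e-5, no weight decay, Adam) follows; the base checkpoint, batch size, and the dataset object `icon_description_pairs` are placeholders.

```python
import torch
from torch.utils.data import DataLoader
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to("cuda")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=0.0)

def collate(batch):
    images, captions = zip(*batch)              # batch of (PIL icon crop, description string)
    enc = processor(images=list(images), text=list(captions),
                    padding=True, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()    # caption tokens double as LM labels
    return enc

loader = DataLoader(icon_description_pairs, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:                            # a single epoch over the ~7k pairs
    batch = {k: v.to("cuda") for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```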


Figure 4: Example comparisons of the icon description model using BLIP-2 (Left) and its finetuned version (Right). The original BLIP-2 model tends to focus on describing the shapes and colors of app icons. After finetuning on the functionality semantics dataset, the model is able to show an understanding of the semantics of some common app icons.
7.2 Training Details of the Interactable Icon Region Detection Model
As introduced in Section 3.1, we train a YOLOv8 model on the interactable icon region detection dataset. We collect 66,990 samples in total, of which 95% (63,641) are used for training and 5% (3,349) for validation. We train for 20 epochs with a batch size of 256, a learning rate of 1e-3, and the Adam optimizer on 4 GPUs. We show the training curve in Figure 5.
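With the Ultralytics API, a run with the stated hyperparameters could be launched roughly as follows; the dataset YAML path and the YOLOv8 variant are placeholders, since the paper does not specify them.

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv8 checkpoint (the variant is an assumption).
model = YOLO("yolov8n.pt")

model.train(
    data="interactable_icons.yaml",  # 63,641 train / 3,349 val images with DOM-derived boxes
    epochs=20,
    batch=256,
    lr0=1e-3,
    optimizer="Adam",
    device=[0, 1, 2, 3],             # 4 GPUs
)
```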
7.3 Details of SeeAssign Evaluation
7.3.1 Prompt Used for GPT-4V
GPT-4V without local semantics:
Here is a UI screenshot image with bounding boxes and corresponding labeled ID overlayed on top of it, your task is {task}. Which icon box label you should operate on? Give a brief analysis, then put your answer in the format of: "Box with label ID: [xx]"
GPT-4V with local semantics:
Here is a UI screenshot image with bounding boxes and corresponding labeled ID overlayed on top of it, and here is a list of icon/text box description: {parsed local semantics}. Your task is {task}. Which bounding box label you should operate on? Give a brief analysis, then put your answer in the format of: "Box with label ID: [xx]"

Figure 5: Training curves of the interactable icon region detection model.
7.4 Details of Mind2Web Evaluation
Here we provide more details on each baseline in Table 3.
SeeClick, Qwen-VL SeeClick is a finetuned version of Qwen-VL on the Mind2Web training set; we report both of their numbers as given in the SeeClick paper.
CogAgent The CogAgent number is taken from the SeeAct paper [ZGK+24], which reports the cogagent-chat-hf checkpoint that is not finetuned on Mind2Web.
MindAct(Gen), MindAct, GPT-3.5-Turbo, GPT-4 The numbers for these baselines are taken from the Mind2Web paper [DGZ+23], where HTML information is used to augment the corresponding web agent.
GPT-4V+SOM This model corresponds to the image annotation grounding method in the SeeAct paper, where the SoM boxes extracted from the selected HTML elements are provided to GPT-4V to make action predictions.
GPT-4V+textual choice This corresponds to the best-performing scenario in the SeeAct paper (except the Oracle), which uses the selected HTML element information in a multiple-choice question format as input to the GPT-4V agent.
7.4.1 Qualitative Examples
We list a few more examples in Figure 6 to demonstrate that the local semantics of icon functionality descriptions help GPT-4V make better action predictions.

Figure 6: More examples showing how local semantics of icon functionality help GPT-4V ground actions.
The I"voice search iconl" is typically represented by a microphone icon.Looking at the list of icon/text box descriptions,the description that matches this representation is "con Box ID 40: a microphone icon on a white background'.In\nTherefore,the bounding box label to operate on is:InlnBox with label ID:
“语音搜索图标”通常由一个麦克风图标表示。查看图标/文本框描述列表,与之匹配的描述是“文本框 ID 40:白色背景上的麦克风图标”。因此,需要操作的边界框标签是:带有标签 ID 的文本框:
Task:More information
任务:更多信息
GPT-4V local semantics:
GPT-4V 本地语义:


Analysis: The screenshot appears to be from a music streaming application, possibly Apple Music,given the text"Apple Music Classical” in TextBoxID3.The task instruction is to findmore information about “Classical Christmas."In the screenshot,there is a section with the title "Classical Christmas"(Text BoxID 2)and abrief description (Text Box ID 7) with a “MORE" link at the end (Text Box ID 8), suggesting that clicking on this link will provide additional information about the “Classical Christmas” section.ln\nTo achieve the task, the bounding box with the description that most likely corresponds to the "MORE" link should be clicked. Based on the bounding boxes and their descriptions, Text Box ID 8 with the text "MORE"is the most relevant and accurate choice for this action.inln'In summary,thenext action|will perform is:{'Click IDV: 8*
分析:截图似乎来自一款音乐流媒体应用,可能是 Apple Music,因为 TextBoxID 3 中显示了文本“Apple Music Classical”。任务指令是查找有关“Classical Christmas”的更多信息。在截图中,有一个标题为“Classical Christmas”的部分(Text Box ID 2)以及一段简短的描述(Text Box ID 7),描述末尾有一个“MORE”链接(Text Box ID 8),表明点击该链接将提供有关“Classical Christmas”部分的更多信息。
为了完成任务,应点击最有可能与“MORE”链接对应的描述部分。根据边界框及其描述,Text Box ID 8 中的文本“MORE”是最相关且准确的选择。
综上所述,下一步要执行的操作是:{'Click ID: 8'}
Figure 7: Analysis of failure cases. All the bounding boxes are labeled by OMNIPARSER, which relies only on the screenshot. Left: There are in total 7 similar enable buttons for 7 different alarm times in the parsed screenshot, and the correct icon ID corresponding to the 7:30 alarm is 27. GPT-4V fails to make the correct prediction. Right: The ground truth region to click is the text 'MORE' inside bounding box 8. The OCR fails to detect the bold text 'MORE' on its own and only detects bounding box 8, which encompasses 'MORE'. Since the click point is predicted as the center of the box, the predicted click point falls outside of the ground truth region, which leads to failure on this task.


Figure 8: Analysis of failure cases. The task is to find the button related to 'More information', and the ground truth is to click the three-dots icon in the upper right part of the screenshot. The icon functional description model does not take into account the context of the page and interprets the icon as "a loading or buffering indicator", which causes the failure.
