OmniParser for Pure Vision Based GUI Agent
Abstract
The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent across multiple operating systems and applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OMNIPARSER, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages, as well as an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OMNIPARSER significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OMNIPARSER with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
1 Introduction
Large language models have shown great success in their understanding and reasoning capabilities. More recent works have explored the use of large vision-language models (VLMs) as agents that perform complex tasks on the user interface (UI), with the aim of completing tedious tasks in place of human effort [YZL+23, YYZ+23, DGZ+23, ZGK+24, HWL+23, YZS+24, WXJ+24, GFH+24, CSC+24]. Despite the promising results, there remains a significant gap between the current state of the art and widely usable agents that can work across multiple platforms, e.g. Windows/macOS, iOS/Android, and multiple applications (web browsers, Office365, PhotoShop, Adobe), with most previous work focusing on a limited set of applications or platforms.
While large multimodal models like GPT-4V and other models trained on UI data [HWL+23, YZS+24, CSC+24] have demonstrated the ability to understand basic elements of a UI screenshot, action grounding remains one of the key challenges in converting the actions predicted by LLMs into actual actions on screen, in terms of keyboard/mouse movements or API calls [ZGK+24]. It has been noted that GPT-4V is unable to produce the exact x-y coordinates of a button location. Set-of-Mark prompting [YZL+23] proposes to overlay a group of bounding boxes, each with a unique numeric ID, onto the original image as a visual prompt sent to the GPT-4V model. With Set-of-Mark prompting, GPT-4V can ground the action in a specific bounding box that has a ground truth location, instead of a specific x-y coordinate value, which greatly improves the robustness of the action grounding [ZGK+24]. However, current solutions using SoM rely on parsed HTML information to extract bounding boxes for actionable elements such as buttons, which limits their usage to web browsing tasks. We aim to build a general approach that works on a variety of platforms and applications.
In this work, we argue that previous pure vision-based screen parsing techniques are not satisfactory, which leads to a significant underestimation of the GPT-4V model's understanding capabilities. A reliable vision-based screen parsing method that works well on general user interfaces is key to improving the robustness of the agentic workflow across various operating systems and applications. We present OMNIPARSER, a general screen parsing tool that extracts information from UI screenshots into structured bounding boxes and labels, enhancing GPT-4V's action prediction performance across a variety of user tasks.
We list our contributions as follows:
2 Related Works
2.1 UI Screen Understanding
There has been a line of modeling works focusing on detailed understanding of UI screens, such as Screen2Words [WLZ+21], UI-BERT [BZX+21], Widget Captioning [LLH+20], and ActionBERT [HSZ+21]. These works demonstrated effective usage of multimodal models for extracting the semantics of user screens. However, these models rely on additional information such as the view hierarchy, or are trained for visual question answering or screen summarization tasks.
There are also a couple of publicly available datasets on UI screen understanding. Most notable is the Rico dataset [DHF+17], which contains more than 66k unique UI screens and their view hierarchies. Later, [SWL+22] augments Rico by providing 500k human annotations on the original 66k Rico screens, identifying various icons based on their shapes and semantics, as well as associations between selected general UI elements (like icons, form fields, radio buttons, text inputs) and their text labels. Also on the mobile platform, PixelHelp [LHZ+20] provides a dataset that contains UI elements of screens spanning 88 common tasks. The same paper also released RicoSCA, a cleaned version of Rico. For the web and general OS domain, there are several works such as Mind2Web [DGZ+23], MiniWob++ [LGP+18], Visual-WebArena [KLJ+24, ZXZ+24], and OSWorld [XZC+24] that provide simulated environments, but they do not explicitly provide datasets for general screen understanding tasks such as interactable icon detection on real-world websites.
To address the absence of a large-scale, general web UI understanding dataset, and to keep pace with the rapid evolution of UI design, we curated an icon detection dataset using the DOM information of popular URLs available on the web. This dataset features the up-to-date design of icons and buttons, with their bounding boxes retrieved from the DOM tree, providing ground truth locations.
2.2 Autonomous GUI Agent
Recently there have been many works on designing autonomous GUI agents to perform tasks in place of human users. One line of work trains an end-to-end model to directly predict the next action; representative works include Pixel2Act [SJC+23] and WebGUM [FLN+24] in the web domain, and Ferret [YZS+24], CogAgent [HWL+23], and Fuyu [BEH+23] in the mobile domain. Another line of work leverages existing multimodal models such as GPT-4V to perform user tasks; representative works include the MindAct agent [DGZ+23] and SeeAct agent [ZGK+24] in the web domain, and the agents in [YYZ+23, WXY+24, RLR+23] in the mobile domain. These works often leverage the DOM information in the web browser, or the view hierarchies in mobile apps, to obtain the ground truth positions of interactable elements on the screen, and use Set-of-Marks [YZL+23] to overlay the bounding boxes on top of the screenshot, which is then fed into the vision-language models. However, ground truth information about interactable elements may not always be available when the goal is to build a general agent for cross-platform and cross-application tasks. Therefore, we focus on providing a systematic approach for obtaining structured elements from general user screens.
3 Methods
A complex task can usually be broken down into several steps of actions. Each step requires the model's (e.g. GPT-4V's) ability to: 1) understand the UI screen in the current step, i.e. analyze the overall screen content and the functions of the detected icons labeled with numeric IDs, and 2) predict the next action on the current screen that is likely to help complete the whole task. Instead of trying to accomplish both goals in one call, we found it beneficial to extract some of the information, such as semantics, in the screen parsing stage. This alleviates the burden on GPT-4V, so that it can leverage more information from the parsed screen and focus more on action prediction.
Hence we propose OMNIPARSER, which integrates the outputs of a finetuned interactable icon detection model, a finetuned icon description model, and an OCR module. This combination produces a structured, DOM-like representation of the UI and a screenshot overlaid with bounding boxes for potential interactable elements. We discuss each component of OMNIPARSER in more detail in the rest of this section.
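To make this output format concrete, the sketch below shows one possible schema for the structured, DOM-like representation and how it could be serialized into the text portion of the prompt. It is a minimal illustration under our own assumptions: the field names (`element_id`, `kind`, `content`) and the JSON serialization are not the format released with the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple

@dataclass
class ParsedElement:
    element_id: int                 # numeric ID overlaid on the screenshot (Set-of-Marks)
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coordinates
    kind: str                       # "text" (from the OCR module) or "icon" (from the detector)
    content: str                    # recognized text, or generated functional description

def serialize_for_prompt(elements: List[ParsedElement]) -> str:
    """Turn the DOM-like element list into the text part of the GPT-4V prompt."""
    return json.dumps([asdict(e) for e in elements], indent=2)
```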
3.1 Interactable Region Detection
Identifying interactable regions on the UI screen is a crucial step for reasoning about which actions should be performed given a user task. Instead of directly prompting GPT-4V to predict the x-y coordinate value of the screen location it should operate on, we follow previous works and use the Set-of-Marks approach [YZL+23] to overlay bounding boxes of interactable icons on top of the UI screenshot, and ask GPT-4V to generate the ID of the bounding box to perform the action on. However, different from [ZGK+24, KLJ+24], which use ground truth button locations retrieved from the DOM tree of the web browser, and [YYZ+23], which uses labeled bounding boxes from the AITW dataset [RLR+23], we finetune a detection model to extract interactable icons/buttons.
Specifically, we curate an interactable icon detection dataset containing 67k unique screenshot images, each labeled with bounding boxes of interactable icons derived from the DOM tree. We first took a 100k uniform sample of popular, publicly available URLs on the web [OXL+22], and collected the bounding boxes of interactable regions of each webpage from its DOM tree. Some examples of the webpages and their interactable regions are shown in Figure 2.
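The paper does not specify the tooling used for this collection step. As a hedged sketch of how such bounding boxes could be harvested from a live page, the snippet below uses Playwright; the CSS selector list is an assumption about what counts as an interactable element, not the authors' definition.

```python
# Sketch: collect bounding boxes of candidate interactable elements from a webpage's DOM.
from playwright.sync_api import sync_playwright

# Assumed notion of "interactable"; the actual dataset criteria may differ.
INTERACTABLE_SELECTOR = "a, button, input, select, textarea, [role='button'], [onclick]"

def collect_boxes(url: str):
    boxes = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="load")
        page.screenshot(path="screenshot.png")  # paired image for the detection dataset
        for element in page.query_selector_all(INTERACTABLE_SELECTOR):
            bb = element.bounding_box()  # None for hidden elements
            if bb and bb["width"] > 0 and bb["height"] > 0:
                boxes.append((bb["x"], bb["y"], bb["x"] + bb["width"], bb["y"] + bb["height"]))
        browser.close()
    return boxes
```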
Apart from interactable region detection, we also have an OCR module to extract bounding boxes of text. We then merge the bounding boxes from the OCR module and the icon detection module, removing boxes that have high overlap (we use over 90% as the threshold). For every bounding box, we label it with a unique ID next to it, using a simple algorithm that minimizes the overlap between the numeric labels and the other bounding boxes.
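As a concrete illustration of the merging step, the sketch below drops an OCR box when more than 90% of its area is covered by an already-kept box, and then assigns sequential numeric IDs. The paper only states the 90% threshold, so the direction of removal and the intersection-over-box-area criterion here are assumptions, and the actual label-placement routine is only hinted at in a comment.

```python
def overlap_ratio(box_a, box_b):
    """Fraction of box_a's area covered by box_b; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(1e-6, (box_a[2] - box_a[0]) * (box_a[3] - box_a[1]))
    return inter / area_a

def merge_boxes(icon_boxes, ocr_boxes, threshold=0.9):
    """Merge detector and OCR boxes, skipping OCR boxes that overlap a kept box by > threshold."""
    merged = list(icon_boxes)
    for box in ocr_boxes:
        if all(overlap_ratio(box, kept) <= threshold for kept in merged):
            merged.append(box)
    # Assign a unique numeric ID to each kept box; a separate layout routine would then
    # place each ID label next to its box while minimizing overlap with other boxes.
    return {i: box for i, box in enumerate(merged)}
```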
3.2 Incorporating Local Semantics of Functionality
We found that in many cases, inputting only the UI screenshot overlaid with bounding boxes and associated IDs can be misleading to GPT-4V. We argue that this limitation stems from GPT-4V's constrained ability to simultaneously perform the composite task of identifying each icon's semantic information and predicting the next action on a specific icon box. This has also been observed by several other works [YYZ+23, ZGK+24]. To address this issue, we incorporate the local semantics of functionality into the prompt: for each icon detected by the interactable region detection model, we use a finetuned model to generate a description of the icon's functionality, and for each text box, we use the detected text and its label.
We perform a more detailed analysis of this topic in Section 4.1. To the best of our knowledge, there is no public model specifically trained for up-to-date UI icon description that is suitable for our purpose of providing fast and accurate local semantics for UI screenshots. Therefore, we curate a dataset of 7k icon-description pairs using GPT-4o, and finetune a BLIP-v2 model [LLSH23] on this dataset. Details of the dataset and training can be found in Appendix 7.1. After finetuning, we found the model to be much more reliable in its descriptions of common app icons. Examples can be seen in Figure 4. In Figure 3, we show that it is helpful to incorporate the semantics of the local bounding boxes in the form of a text prompt along with the UI screenshot visual prompt.
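A minimal inference sketch for the icon description step is shown below, assuming a BLIP-2 checkpoint finetuned as described above is available locally. The checkpoint path is a placeholder and the decoding settings are our own choices, not the released configuration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Placeholder path for a BLIP-2 model finetuned on icon-description pairs (assumption).
MODEL_PATH = "path/to/finetuned-blip2-icon-captioner"

processor = Blip2Processor.from_pretrained(MODEL_PATH)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16
).to("cuda")

def describe_icon(icon_crop: Image.Image) -> str:
    """Generate a short functional description for a cropped icon image."""
    inputs = processor(images=icon_crop, return_tensors="pt").to("cuda", torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```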
Figure 1: Examples of the parsed screenshot image and local semantics produced by OMNIPARSER. The inputs to OMNIPARSER are the user task and the UI screenshot, from which it produces: 1) a parsed screenshot image with bounding boxes and numeric IDs overlaid, and 2) local semantics containing both the extracted text and the icon descriptions.
4 Experiments and Results
We conduct experiments on several benchmarks to demonstrate the effectiveness of OMNIPARSER. We start with a motivating experiment showing that the current GPT-4V model with Set-of-Mark prompting [YZL+23] is prone to incorrectly assigning label IDs to the referred bounding boxes. We then evaluate on the ScreenSpot and Mind2Web benchmarks to further showcase that OMNIPARSER with local semantics can improve GPT-4V's performance on real user tasks across different platforms and applications.
4.1 Evaluation on SeeAssign Task
To test GPT-4V's ability to correctly predict the label ID given a description of a bounding box, we handcrafted a dataset, SeeAssign, containing 112 tasks with samples from 3 different platforms: mobile, desktop, and web browser. Each task includes a concise task description and a screenshot image. The task descriptions were manually created, and we made sure each task refers to one of the detected bounding boxes, e.g. 'click on "settings"', 'click on the minimize button'. During evaluation, GPT-4V is prompted to predict the ID of the associated bounding box. The detailed prompt is specified in the Appendix. The task screenshot images are sampled from the ScreenSpot [CSC+24] benchmark and labeled with a set of marks using OMNIPARSER. The tasks are further divided into 3 sub-categories by difficulty: easy (fewer than 10 bounding boxes), medium (10-40 bounding boxes), and hard (more than 40 bounding boxes).
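The exact evaluation prompt is given in the Appendix. The snippet below is only a schematic of how the task description and the parsed elements with local semantics could be combined into a text prompt for this evaluation; the wording and the element tuples are illustrative, not the prompt used in the paper.

```python
def build_seeassign_prompt(task: str, elements) -> str:
    """elements: iterable of (element_id, kind, content) from the parsed screenshot."""
    lines = [f"Task: {task}", "Labeled elements on the screen:"]
    for element_id, kind, content in elements:
        lines.append(f"  [{element_id}] {kind}: {content}")
    lines.append("Answer with the numeric ID of the bounding box the task refers to.")
    return "\n".join(lines)

# Example usage with made-up parsed elements:
prompt = build_seeassign_prompt(
    "click on 'settings'",
    [(0, "text", "File"), (1, "icon", "Gear icon that opens the settings menu")],
)
```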
Figure 2: Examples from the Interactable Region Detection dataset. The bounding boxes are based on the interactable regions extracted from the DOM tree of each webpage.
From Table 1, we see that GPT-4V often assigns the numeric ID to the wrong element, especially when there are many bounding boxes on the screen. By adding local semantics, including the text within the boxes and short descriptions of the detected icons, GPT-4V's accuracy in correctly assigning the referred icon improves from 0.705 to 0.938.
From Figure 3, we see that without a description of the referred icon in the task, GPT-4V often fails to link the icon required by the task to the ground truth icon ID in the SoM-labeled screenshot, which leads to hallucination in the response. With fine-grained local semantics added to the text prompt, it becomes much easier for GPT-4V to find the correct icon ID for the referred icon.
Table 1: Comparison of GPT-4V with and without local semantics
| | Easy | Medium | Hard | Overall |
|---|---|---|---|---|
| GPT-4V w.o. local semantics | 0.913 | 0.692 | 0.620 | 0.705 |
| GPT-4V w. local semantics | 1.00 | 0.949 | 0.900 | 0.938 |
4.2 Evaluation on ScreenSpot
The ScreenSpot dataset [CSC+24] is a benchmark that includes over 600 interface screenshots from mobile (iOS, Android), desktop (macOS, Windows), and web platforms. The task instructions were manually created so that each instruction corresponds to an actionable element on the UI screen. We first evaluate the performance of OMNIPARSER on this benchmark. In Table 2, we can see that across the 3 different platforms (mobile, desktop, and web), OMNIPARSER significantly improves the GPT-4V baseline. Notably, OMNIPARSER's performance even surpasses models specifically finetuned on GUI datasets, including SeeClick, CogAgent, and Fuyu, by a large margin. We also note that incorporating the local semantics (OMNIPARSER w. LS in the table) further improves the overall performance. This corroborates the findings in Section 4.1 that incorporating the local semantics of the UI screenshot in text format, i.e. adding the OCR text and the descriptions of the icon bounding boxes, further helps GPT-4V accurately identify the correct element to operate on. Furthermore, our findings indicate that the interactable region detection (ID) model we finetuned improves the overall accuracy by an additional 4.3% compared to using the raw Grounding DINO model. This underscores the importance of accurately detecting interactable elements for the success of UI tasks. Overall, the results demonstrate that the UI screen understanding capability of GPT-4V is significantly underestimated, and can be greatly enhanced with more accurate interactable element detection and the incorporation of functional local semantics.
Figure 3: Examples from the SeeAssign evaluation. We can see that fine-grained local semantics improve GPT-4V's ability to assign correct labels to the referred icon.
Table 2: Comparison of different approaches on the ScreenSpot benchmark. LS is short for the local semantics of functionality, GD is short for Grounding DINO, and ID is short for the interactable region detection model we finetune.
4.3 Evaluation on Mind2Web
In order to test how OMNIPARSER helps in the web navigation scenario, we evaluate on the Mind2Web [DGZ+23] benchmark. There are 3 different categories of tasks in the test set: Cross-Domain, Cross-Website, and Cross-Task. We used a cleaned version of the Mind2Web test set processed from the raw HTML dump, which eliminates a small number of samples that have incorrect bounding boxes. In total, the test set contains 867, 167, and 242 tasks in the Cross-Domain, Cross-Website, and Cross-Task categories, respectively. During evaluation, we feed both the parsed screen results and the action history as a text prompt, along with the SoM-labeled screenshot, to GPT-4V, similar to the prompting strategy in [YYZ+23, ZGK+24]. Following the original paper, we perform an offline evaluation focusing on element accuracy, operation F1, and step success rate averaged across the tasks.
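As a hedged sketch of how the SoM-labeled screenshot, the serialized parsed elements, and the action history might be packaged into a single GPT-4V request via the OpenAI chat completions API, consider the snippet below. The model name and the prompt wording are illustrative placeholders, not the exact configuration used in our experiments.

```python
import base64
from openai import OpenAI

client = OpenAI()

def query_gpt4v(som_image_path: str, parsed_elements_text: str,
                action_history: str, task: str) -> str:
    # Encode the SoM-labeled screenshot as a base64 data URL for the vision input.
    with open(som_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Task: {task}\nAction history: {action_history}\n"
                          f"Parsed screen elements:\n{parsed_elements_text}\n"
                          "Predict the next action and the ID of the element to act on.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content
```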
In the first section of the table (rows 1-3), we report numbers from a set of open-source VL models as they appear in [ZGK+24, CSC+24]. Here, CogAgent and Qwen-VL are not finetuned on the Mind2Web training set. More detailed information about the model settings can be found in Appendix 7.4.
In the second section of the table (rows 4-9), we report numbers from the Mind2Web paper [DGZ+23] and the SeeAct paper [ZGK+24]. In this section, all of the approaches use the HTML elements selected by an element proposal model finetuned on the Mind2Web training set, which produces the top 50 relevant elements on the HTML page based on the user task. Additionally, GPT-4V+SOM and GPT-4V+textual choices correspond to SeeAct with the image annotation and textual choices grounding methods, respectively. In GPT-4V+SOM, the Set-of-Marks (SOM) boxes are selected by the element proposal model and are labeled with the ground truth locations extracted from the HTML. In contrast, GPT-4V+textual choices uses the DOM information of the selected relevant elements directly in the text prompt, rather than overlaying bounding boxes on top of the screenshot. The better performance of textual choices corroborates the experiment results in Section 4.1.
In the last section (rows 10-11), we report numbers from OMNIPARSER. We observe that GPT-4V augmented with the local semantics of icon functionality and the finetuned interactable region detection model (w. LS + ID) performs better than the model with the raw Grounding DINO model (w. LS + GD) in all of the categories.
Further, without using parsed HTML information, OMNIPARSER outperforms GPT-4 with HTML in every sub-category by a large margin, suggesting the substantial benefit of the screen parsing results provided by OMNIPARSER. Additionally, OMNIPARSER outperforms GPT-4V+SOM by a large margin. Compared to GPT-4V+textual choices, OMNIPARSER significantly outperforms in the Cross-Website and Cross-Domain categories (+4.1% and +5.2%), while slightly underperforming (-0.8%) in the Cross-Task category. This indicates that OMNIPARSER provides higher-quality information than the ground truth element information from the DOM and the top-k relevant element proposals used in the GPT-4V+textual choices setup, making it easier for GPT-4V to make an accurate action prediction. Lastly, OMNIPARSER with GPT-4V significantly outperforms all the other trained models that use only the UI screenshot, such as SeeClick and Qwen-VL.
Table 3: Comparison of different methods across various categories on the Mind2Web benchmark.
4.4 Evaluation on AITW
In addition to the evaluation on multi-step web browsing tasks, we assess OMNIPARSER on the mobile navigation benchmark AITW [RLR+23], which contains 30k instructions and 715k trajectories. We use the same train/test split as in [CSC+24], which is based on instructions, retains only one trajectory per instruction, and has no intersection between train and test. For a fair comparison, we only use their test split for evaluation and discard the train set as