[论文翻译]OmAgent:面向复杂视频理解的多模态智能体框架与任务分治策略


原文地址:https://arxiv.org/pdf/2406.16620v3


OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

OmAgent:面向复杂视频理解的多模态智能体框架与任务分治策略

Lu Zhang1, Tiancheng Zhao1,2, Heting Ying1, Yibo Ma1, Kyusong Lee1,2

Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, Kyusong Lee

1Om AI Research, 2Binjiang Institute of Zhejiang University {zhang_lu, yingheting, ma-yibo}@hzlh.com {tianchez, kyusongl}@zju-bj.com

1Om AI Research, 2浙江大学滨江研究院 {zhang_lu, yingheting, ma-yibo}@hzlh.com {tianchez, kyusongl}@zju-bj.com

Abstract

摘要

Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent's efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks. Code: https://github.com/om-ai-lab/OmAgent

大语言模型 (LLMs) 的最新进展已将其能力扩展到多模态领域,包括全面的视频理解。然而,处理诸如 24 小时监控录像或完整电影等大量视频,由于数据量和处理需求巨大,带来了显著挑战。传统方法,如提取关键帧或将帧转换为文本,通常会导致大量信息丢失。为了解决这些不足,我们开发了 OmAgent,它能够高效地存储和检索特定查询相关的视频帧,保留视频的详细内容。此外,它具备一种分治循环 (Divide-and-Conquer Loop) 能力,能够自主推理,动态调用 API 和工具以增强查询处理的效率和准确性。这种方法确保了强大的视频理解能力,显著减少了信息丢失。实验结果证实了 OmAgent 在处理各种类型视频和复杂任务方面的有效性。此外,我们赋予了它更大的自主性和强大的工具调用系统,使其能够完成更为复杂的任务。代码:https://github.com/om-ai-lab/OmAgent

1 Introduction

1 引言

Large Language Models (LLMs) have advanced remarkably in recent years, greatly expanding their capabilities across various applications (Touvron et al., 2023a,b; Bai et al., 2023; OpenAI, 2023a). As these models have evolved, they have been increasingly applied to multimodal contexts, allowing them to process and interpret not just text but also images and other media types. Initially, the focus was on single images, utilizing the models' ability to generate and understand detailed descriptions or responses based on static visuals (Awadalla et al., 2023; Liu et al., 2023, 2024; OpenAI, 2023b)

大语言模型 (LLMs) 近年来取得了显著进展,极大地扩展了其在各种应用中的能力 (Touvron et al., 2023a,b; Bai et al., 2023; OpenAI, 2023a) 。随着这些模型的不断发展,它们越来越多地应用于多模态场景,使其不仅能够处理文本,还能处理图像和其他媒体类型。最初,研究重点集中在单张图像上,利用模型生成和理解基于静态视觉的详细描述或回应的能力 (Awadalla et al., 2023; Liu et al., 2023, 2024; OpenAI, 2023b) 。

However, as LLMs have become more complex and powerful, there has been growing interest in applying them to more dynamic media, such as video content (Lin et al., 2023a; Zhu et al., 2023; Zhang et al., 2023b; Zhao et al., 2023; Tang et al., 2023). This interest arises from the potential to provide deeper and more nuanced interpretations of video data, similar to how humans understand and interact with moving images and sound. Currently, most video understanding models are limited to processing short videos, typically only a few minutes or even several seconds long. Despite these advancements, significant challenges remain, especially in handling long video inputs like 24-hour CCTV footage or full-length movies, which involve massive amounts of data and require substantial processing power.

然而,随着大语言模型变得更加复杂和强大,人们越来越关注将其应用于更具动态性的媒体,例如视频内容(Lin et al., 2023a; Zhu et al., 2023; Zhang et al., 2023b; Zhao et al., 2023; Tang et al., 2023)。这种兴趣源于其潜力,即能够对视频数据进行更深入和更细致的解读,类似于人类如何理解和与动态图像和声音互动。目前,大多数视频理解模型仅限于处理短视频,通常只有几分钟甚至几秒钟的长度。尽管取得了这些进展,但仍然存在重大挑战,尤其是在处理长时间视频输入时,如24小时的监控录像或全长电影,这些视频涉及大量数据,需要强大的处理能力。

Traditionally, one solution has been to extract key frames from these long videos or convert all frames into textual descriptions before processing (Lin et al., 2023b; Yang et al., 2023). While this approach makes the task more manageable for LLMs, it often results in information loss. Key frame extraction might miss subtle but important details in the omitted frames, and converting visual data to text can oversimplify or misrepresent visual nuances, leading to a less accurate understanding of the content.

传统上,一种解决方案是从这些长视频中提取关键帧或将所有帧转换为文本描述后再进行处理 (Lin et al., 2023b; Yang et al., 2023)。虽然这种方法使得大语言模型处理任务更加可行,但通常会导致信息丢失。关键帧提取可能会忽略被省略帧中的细微但重要的细节,而将视觉数据转换为文本可能会过度简化或错误地表示视觉上的细微差别,导致对内容的理解不够准确。

To address these limitations, the application of Retrieval-Augmented Generation (RAG) technology in video understanding has emerged as a promising solution (Chen et al., 2017; Liu, 2022; Chase, 2022; Arefeen et al., 2024). RAG enables the storage and efficient retrieval of video frames based on their relevance to a query. This method allows for more precise and contextual responses by referencing specific content directly from the video, rather than relying on potentially incomplete or inaccurate textual summaries. However, given that video is a continuous stream of information, when it is stored, this continuous flow is segmented into discrete blocks of data. The information lost during this segmentation is irretrievable.

为了解决这些限制,检索增强生成 (Retrieval-Augmented Generation, RAG) 技术在视频理解中的应用成为了一种有前景的解决方案 (Chen et al., 2017; Liu, 2022; Chase, 2022; Arefeen et al., 2024)。RAG 能够根据查询的相关性存储并高效检索视频帧。这种方法通过直接从视频中引用特定内容,而不是依赖可能不完整或不准确的文本摘要,提供了更精确和上下文相关的响应。然而,由于视频是连续的信息流,当它被存储时,这种连续流被分割成离散的数据块。在这种分割过程中丢失的信息是不可恢复的。

To address the aforementioned issues, we attempt to analyze the human approach to handling complex, long video question-answering tasks, seeking breakthroughs from this perspective. When a person watches a content-rich, lengthy video, such as a movie, they retain a general impression of the movie in their mind. This impression includes a rough outline of the video's content at various time points. When asked about specific details, the person may not recall the details immediately but can quickly locate the relevant time point in the video and rewatch the segment to retrieve the missing information. The key insight of OmAgent is to replicate this process by integrating multimodal RAG and a generalist AI agent. OmAgent consists of two main components: (1) A Video2RAG video preprocessor to extract and store the generalized information from the video, akin to the foundational impression a video imprints upon the viewer's memory. (2) A Divide-and-Conquer Loop (DnC Loop) for task planning and execution, which is equipped with tool invocation capabilities.

为了解决上述问题,我们尝试分析人类处理复杂、长视频问答任务的方法,并从中寻求突破。当一个人观看内容丰富、时长较长的视频,例如电影时,他们会在脑海中保留对该电影的大致印象。这种印象包括视频在不同时间点的内容概要。当被问及具体细节时,这个人可能无法立即回忆起细节,但可以快速定位到视频中的相关时间点,并重新观看该片段以找回缺失的信息。OmAgent 的核心思想是通过整合多模态 RAG 和通用型 AI 智能体来复制这一过程。OmAgent 由两个主要组件组成:(1) 一个 video2RAG 视频预处理器,用于提取并存储视频的概括信息,类似于视频在观众记忆中留下的基础印象。(2) 一个用于任务规划和执行的 Divide-and-Conquer Loop (DnC Loop),该循环具备工具调用能力。

We abstract the human ability to reposition and review video details as a tool named "rewinder," which can be autonomously selected and utilized by an AI agent, similar to how a person might use a video player's progress bar to navigate to points of interest. OmAgent not only can retrieve detailed information from videos but also can actively seek external information, enabling more advanced video understanding and question-answering. Existing benchmarks are insufficient to accurately quantify these capabilities, so we propose a new complex video understanding benchmark to fulfill this task. The contributions are: (1) OmAgent, the first complex video understanding framework integrating multimodal RAG and a generalist AI agent. (2) A benchmark dataset that contains 2000+ Q&A pairs for evaluating video understanding systems. (3) Experiments that show the proposed agentic method is able to outperform strong baselines for solving complex video understanding problems.

我们将人类重新定位和回顾视频细节的能力抽象为一个名为“rewinder”的工具,AI智能体可以像人们使用视频播放器的进度条导航到感兴趣的点一样,自主选择和使用该工具。OmAgent不仅可以从视频中检索详细信息,还可以主动寻求外部信息,从而实现更高级的视频理解和问答。现有的基准测试不足以准确量化这些能力,因此我们提出了一个新的复杂视频理解基准测试来完成这一任务。贡献如下:(1) OmAgent,首个集成多模态RAG和通用AI智能体的复杂视频理解框架。(2) 包含 2000+ 问答对的基准数据集,用于评估视频理解系统。(3) 实验表明,所提出的智能体方法在解决复杂视频理解问题方面优于强基线。

2 Related Work

2 相关工作

Video LLMs Analyzing and understanding video content using large-scale language models (LLMs) typically involves fine-tuning or pre-training methods. Pre-training strategies, such as supervised or contrastive learning, develop video LLMs, while instruction fine-tuning updates adapter parameters to enable video comprehension (Tang et al., 2023). For example, LaViLa (Zhao et al., 2023) enhances video subtitle generation through a cross-attention module and rewriting mechanism, improving coverage and diversity. Video-LLaMA (Zhang et al., 2023b) addresses spatio-temporal visual variations using separate video and audio encoders with an advanced audiovisual Q-former, significantly boosting video comprehension. Video-LLaVA (Lin et al., 2023a) connects multimodal representations into a unified semantic space with LLMs, improving video understanding tasks.

视频大语言模型 使用大语言模型 (LLM) 分析和理解视频内容通常涉及微调或预训练方法。预训练策略(如监督学习或对比学习)用于构建视频大语言模型,而指令微调则通过更新适配器参数来实现视频理解 (Tang et al., 2023)。例如,LaViLa (Zhao et al., 2023) 通过交叉注意力模块和重写机制增强视频字幕生成,提高了覆盖率和多样性。Video-LLaMA (Zhang et al., 2023b) 使用独立的视频和音频编码器以及先进的视听 Q-former 来处理时空视觉变化,显著提升了视频理解能力。Video-LLaVA (Lin et al., 2023a) 将多模态表征连接到与大语言模型统一的语义空间中,改进了视频理解任务。

However, these methods often consume a lot of computational resources and time during the training process, and the models can usually only target specific tasks related to the training data. In addition, video LLMs trained from scratch may not be able to achieve the expected performance when dealing with longer or previously unseen videos, showing shortcomings in understanding long videos and dealing with complex video question-answer tasks.

然而,这些方法在训练过程中通常会消耗大量的计算资源和时间,并且模型通常只能针对与训练数据相关的特定任务。此外,从头开始训练的视频大语言模型在处理较长或之前未见过的视频时可能无法达到预期的性能,显示出在理解长视频和处理复杂视频问答任务方面的不足。

Long Video Understanding System with LLMs LLMs and Multimodal LLMs (MLLMs) applied to long video comprehension tasks utilize external systems to process extensive content. This involves analyzing visual elements, actions, scenes, and objects over time, aligning multimodal information with textual modalities, and leveraging the powerful text processing capabilities of MLLMs (Tang et al., 2023). For instance, Vlog (Kevin, 2023) uses pre-trained models for different modalities to record and interpret visual and audio information, summarizing it into detailed text for MLLM comprehension. MM-REACT (Yang et al., 2023) employs visual expert tools via internal prompts, enhancing MLLMs' visual understanding. MMVID (Lin et al., 2023b) segments videos using ASR and Scene Detection tools, generating and integrating textual descriptions to complete Q&A tasks with MLLMs. LLoVi (Zhang et al., 2023a) uses a process that generates a summary based on subtitles and questions, then uses the summary for question-answering. VideoTree (Wang et al., 2024b) employs a three-step process to understand long videos, clustering video frames, calculating relevance, and performing depth expansion for question-answering. VideoAgent (Fan et al., 2024) preprocesses videos to generate captions and object data, and uses an agent with pre-provided tools to obtain answers.

基于大语言模型的长视频理解系统 大语言模型 (LLM) 和多模态大语言模型 (MLLM) 应用于长视频理解任务时,利用外部系统处理大量内容。这涉及分析视觉元素、动作、场景和随时间变化的物体,将多模态信息与文本模态对齐,并利用 MLLM 的强大文本处理能力 (Tang et al., 2023)。例如,Vlog (Kevin, 2023) 使用针对不同模态的预训练模型记录和解释视觉与音频信息,将其总结为详细的文本供 MLLM 理解。MM-REACT (Yang et al., 2023) 通过内部提示使用视觉专家工具,增强 MLLM 的视觉理解能力。MMVID (Lin et al., 2023b) 使用 ASR 和场景检测工具对视频进行分段,生成并整合文本描述,与 MLLM 共同完成问答任务。LLoVi (Zhang et al., 2023a) 先基于字幕和问题生成摘要,再利用该摘要进行问答。VideoTree (Wang et al., 2024b) 采用三步过程理解长视频:聚类视频帧、计算相关性并进行深度扩展以完成问答。VideoAgent (Fan et al., 2024) 预处理视频以生成字幕和物体数据,并使用配备预置工具的智能体获得答案。

Compared to training entirely new video LLMs, these MLLM-based approaches significantly reduce the need for computational resources while allowing the system to integrate or update external tools according to new technical or performance requirements. However, such long video understanding methodologies will lose a large amount of video information when transforming modalities, and do not fully utilize the multimodal processing capabilities of MLLM. In addition, these systems usually lack sufficient autonomy to support more complex video questioning and interaction, which limits their depth and breadth in practical applications.

与训练全新的视频大语言模型相比,这些基于MLLM的方法显著减少了对计算资源的需求,同时使系统能够根据新的技术或性能要求集成或更新外部工具。然而,这种长视频理解方法在转换模态时会丢失大量视频信息,并且未能充分利用MLLM的多模态处理能力。此外,这些系统通常缺乏足够的自主性来支持更复杂的视频提问和交互,这限制了它们在实际应用中的深度和广度。

MultiModal RAG Multimodal Retrieval-Augmented Generation (RAG) leverages images, videos, audio, and other non-text data for information retrieval, enhancing content relevance and context for complex query and generation tasks (Zhang et al., 2018). The LlamaIndex (Liu, 2022) framework improves relevance and accuracy by enabling quick retrieval and processing of multimodal content through precise embedding and efficient indexing. Indexify (Tensor Lake AI, 2024) provides a robust framework for building multimodal RAG systems with real-time data pipelines, extractor SDKs, and powerful storage interfaces for efficient information extraction and indexing. iRAG (Arefeen et al., 2024) uses AI model selection for perceptual queries, improving the speed and quality of multimodal data-to-text conversion, particularly for real-time, long video understanding.

多模态RAG 多模态检索增强生成(RAG)利用图像、视频、音频和其他非文本数据进行信息检索,增强复杂查询和生成任务的内容相关性和上下文(Zhang 等,2018)。LlamaIndex(Liu,2022)框架通过精确嵌入和高效索引,实现快速检索和处理多模态内容,从而提高相关性和准确性。Indexify(Tensor Lake AI,2024)提供了一个强大的框架,用于构建具有实时数据管道、提取器 SDK 和强大存储接口的多模态 RAG 系统,以实现高效信息提取和索引。iRAG(Arefeen 等,2024)使用 AI 模型选择进行感知查询,提高多模态数据到文本转换的速度和质量,特别是实时长视频理解。

Although these multimodal RAG systems offer great advantages in integrating multimodal information, they are still unable to eliminate the significant loss of information that occurs when data is transformed from video to knowledge in a RAG system. Our OmAgent, on the other hand, suffers no significant loss of information thanks to the task planning and autonomous tool-calling capability of the DnC Loop, especially the "rewinder" mechanism.

尽管这些多模态检索增强生成(RAG)系统在整合多模态信息方面提供了巨大优势,但它们仍然无法消除在RAG系统中将数据从视频转换为知识时发生的信息显著损失。而我们的OmAgent,得益于DnC Loop的任务规划和自主工具调用能力,特别是“回放”机制,没有出现显著的信息损失。

3 Method

3 方法

The process of OmAgent's video understanding can be bifurcated into two primary parts: Video2RAG and DnC Loop. As illustrated in Figure 1, all video data must undergo preliminary processing before being stored in the knowledge database in preparation for subsequent tasks. The preprocessing phase of Video2RAG encompasses a series of model identification and vectorization procedures, culminating in the extraction of the core content of the video files for storage. When undertaking video understanding tasks, the initial step is to extract temporal information from the query. This information will then be used to filter the retrieved results. Subsequently, the query is encoded by a text encoder, and the embedding is employed to retrieve pertinent video segment information from the knowledge database.

OmAgent 的视频理解过程可以分为两个主要部分:Video2RAG 和 DnC Loop。如图 1 所示,所有视频数据在进行后续任务之前都必须经过初步处理并存储在知识数据库中。Video2RAG 的预处理阶段包括一系列模型识别和向量化过程,最终提取视频文件的核心内容进行存储。在执行视频理解任务时,首先从查询中提取时间信息,该信息将用于过滤检索结果。随后,查询通过文本编码器进行编码,并使用嵌入从知识数据库中检索相关的视频片段信息。
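下面用一段最小的 Python 示意说明“从查询中提取时间信息”这一步;其中的正则模式和函数名都是为说明而假设的,并非 OmAgent 的真实实现:

```python
import re

def extract_time_range(query: str):
    """Pull an explicit time window such as "between 03:58 and 04:02" out of a query.

    Returns (start_sec, end_sec) or None. Illustrative sketch only; the pattern
    and function name are assumptions, not OmAgent's actual temporal parser.
    """
    matches = re.findall(r"(\d{1,2}):(\d{2})(?::(\d{2}))?", query)
    if len(matches) < 2:
        return None

    def to_seconds(parts):
        secs = 0
        for p in parts:
            if p:
                secs = secs * 60 + int(p)
        return secs

    start, end = sorted(to_seconds(m) for m in matches[:2])
    return start, end
```

提取到的时间范围随后被用作知识库检索时的过滤条件。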

The retrieved video clip information and the original task will be transmitted to DnC Loop, the intelligent agent capable of autonomously planning and executing tasks, for processing. Complex tasks will be recursively subdivided into executable subtasks. If at any point the agent deems that specific video details need to be reviewed, it will utilize the rewind tool to examine the relevant content. Once all subtasks are successfully completed, the execution results will be conveyed to a node dedicated to synthesizing the final answer.

检索到的视频片段信息和原始任务将被传输到能够自主规划和执行任务的智能体 DnC Loop 进行处理。复杂任务将被递归细分为可执行的子任务。如果智能体认为需要查看特定的视频细节,它将使用回放工具来检查相关内容。一旦所有子任务成功完成,执行结果将被传送到一个专门用于合成最终答案的节点。

3.1 Video to RAG

3.1 视频到RAG

OmAgent's preprocessing (as shown in Figure 1) of video data is similar to a multimodal RAG. This approach avoids treating the entire content of a very long video as context input to the large language model, which would lead to three serious issues: (1) The length of the context would limit the maximum length of the video that can be processed. (2) Using an extremely long context for each question and answer session would cause an explosive increase in token usage. (3) An overly long context increases the difficulty of LLM inference, affecting the accuracy of question and answer sessions. OmAgent's Video2RAG processing mainly consists of the following steps.

OmAgent 的视频数据预处理 (如图 1 所示) 与多模态 RAG 类似。这种方法避免了将非常长的视频的全部内容作为上下文输入到大语言模型中,因为这会导致三个严重问题:(1) 上下文的长度会限制可处理视频的最大长度。(2) 在每个问答会话中使用极长的上下文会导致 token 使用量的爆炸性增长。(3) 过长的上下文会增加 LLM 推理的难度,影响问答会话的准确性。OmAgent 的 Video2RAG 处理主要包括以下步骤。

Scene Detection Firstly, an algorithm is used to segment the video into relatively independent video blocks. The main purpose of this step is to locate the key nodes of the video. We can determine whether to segment the scene by assessing the degree of change in the frames; overly short segments will be merged together. The extracted video segments will have their start and end timestamps recorded, and 10 frames will be uniformly sampled from every segment.

场景检测 首先,使用算法将视频分割为相对独立的视频块。此步骤的主要目的是定位视频的关键节点。我们可以通过评估帧之间的变化程度来判断是否需要切分场景;过短的片段会被合并在一起。提取出的视频片段会记录其起始和结束时间戳,并从每个片段中均匀采样 10 帧。
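以下给出基于帧间直方图差异的场景切分与均匀采样的示意实现(假设可用 OpenCV);其中的阈值、相似度度量与最短片段长度均为示例性假设,论文并未指明所用的具体场景检测算法:

```python
import cv2
import numpy as np

def detect_scenes(video_path, diff_threshold=0.4, min_len_frames=30):
    """Split a video wherever the grayscale histogram changes sharply.

    Segments shorter than `min_len_frames` are merged into the previous one
    (no cut is made), and 10 frame indices are sampled uniformly per segment.
    Illustrative sketch only; thresholds are assumptions.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.normalize(cv2.calcHist([gray], [0], None, [64], [0, 256]), None).flatten()
        if prev_hist is not None:
            diff = 1.0 - cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if diff > diff_threshold and idx - boundaries[-1] >= min_len_frames:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    boundaries.append(idx)

    segments = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        sampled = np.linspace(start, max(start, end - 1), num=10, dtype=int).tolist()
        segments.append({"start_sec": start / fps, "end_sec": end / fps, "frames": sampled})
    return segments
```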

Visual Prompting During the video preprocessing stage, additional algorithms can be used to provide more information. For example, using facial recognition, we can obtain information about the characters in the video. OmAgent will annotate

视觉提示在视频预处理阶段,可以使用额外的算法提供更多信息。例如,利用面部识别,我们可以获取视频中人物的信息。OmAgent 将对其进行标注。



Figure 1: How OmAgent understands video. In Video2RAG, the video is processed by different algorithms (e.g. Scene Detection, ASR and face recognition) and then summarized by MLLMs to generate Scene Captions. Those captions are encoded and saved in the knowledge database. When OmAgent receives a query, it filters and retrieves from the knowledge database based on timestamps (if available). The retrieved information is processed by the Divide-and-Conquer Loop and summarized by Conclusive Synthesis to generate the final answer.

图 1: OmAgent 如何理解视频。在 Video2RAG 中,视频通过不同的算法(例如场景检测、ASR 和面部识别)进行处理,然后由 MLLM 进行总结以生成场景描述。这些描述被编码并保存在知识数据库中。当 OmAgent 收到查询时,它会根据时间戳(如果可用)在知识数据库中进行过滤和检索。检索到的信息通过分治循环进行处理,并通过总结性合成生成最终答案。

this algorithmic information directly on the images through visual prompting, i.e., drawing the corresponding recognition boxes and adding explanatory text above the bounding boxes. This allows for the full utilization of the powerful understanding capabilities of MLLMs.

直接在图像上通过视觉提示展示算法信息,即绘制相应的识别框并在边界框上方使用文字进行解释。这样可以充分利用 MLLMs 的强大理解能力。
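下面是“画出识别框并在框上方标注文字”这一视觉提示操作的一个简单示意(假设使用 OpenCV);检测结果的格式为示例性约定:

```python
import cv2

def draw_visual_prompts(frame, detections):
    """Overlay recognition results (e.g., face recognition boxes and names) on a frame.

    `detections` is assumed to be a list of (x1, y1, x2, y2, label) tuples produced
    by any upstream algorithm. Illustrative sketch only.
    """
    annotated = frame.copy()
    for x1, y1, x2, y2, label in detections:
        cv2.rectangle(annotated, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(annotated, label, (x1, max(0, y1 - 8)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return annotated
```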

Text Representation of Audio The audio information in the video is as important as the visual information. OmAgent uses ASR algorithms to convert the speech in the video into text and employs speaker diarization algorithms to distinguish between different speakers.

音频的文本表示 视频中的音频信息与视觉信息同样重要。OmAgent 使用 ASR 算法将视频中的语音转换为文本,并采用说话人分离 (speaker diarization) 算法来区分不同的说话人。
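以下为该步骤的示意代码,假设使用 openai-whisper 进行语音转写;说话人分离部分用一个占位回调表示,论文未指明具体的分离模型:

```python
import whisper  # pip install openai-whisper

def transcribe_segment(media_path, diarize=None):
    """Turn the audio of a segment into speaker-attributed text lines.

    `diarize` is a hypothetical callable mapping (start_sec, end_sec) to a speaker id;
    if absent, a generic placeholder speaker is used. Illustrative sketch only.
    """
    model = whisper.load_model("base")
    result = model.transcribe(media_path)
    lines = []
    for seg in result["segments"]:
        speaker = diarize(seg["start"], seg["end"]) if diarize else "SPEAKER"
        lines.append(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {speaker}: {seg["text"].strip()}')
    return "\n".join(lines)
```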

Scene Caption Using MLLMs that support multiple images, each video segment's content is summarized. The inputs include video frames that have already been annotated with visual prompting and the transcribed audio information. In the process of generating dense captions at this step, we have delineated a set of pivotal elements to guide the MLLM in generating effective and comprehensive captions, ensuring that vital information is not overlooked in the absence of explicit objectives. OmAgent specifies the following dimensions as instructions to MLLMs:

场景描述 使用支持多图输入的 MLLM 对每个视频片段的内容进行总结。输入包括已通过视觉提示标注的视频帧以及转录的音频信息。在这一步生成密集描述的过程中,我们划定了一组关键要素来引导 MLLM 生成有效且全面的描述,确保在没有明确目标的情况下不会遗漏重要信息。OmAgent 为 MLLM 指定了以下维度作为指令:

· Provide an overall description and summary of the content of this video.

· 提供此视频内容的整体描述和摘要。
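下面示意如何为支持多图输入的 MLLM 组装一次场景描述请求(以 OpenAI 风格的视觉消息格式为例);上文仅保留了其中一个指令维度,其余维度此处以参数形式传入,具体提示词文本属于示例性假设:

```python
import base64

def build_caption_request(frame_paths, transcript, instructions):
    """Assemble a multi-image chat message for scene captioning.

    `instructions` holds the captioning dimensions (e.g. the overall-summary item
    quoted above). The prompt wording is an illustrative assumption.
    """
    prompt = ("Summarize this video segment.\n"
              "Dialogue transcript:\n" + transcript + "\n"
              "Cover the following dimensions:\n- " + "\n- ".join(instructions))
    content = [{"type": "text", "text": prompt}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```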

Encode and Save The final step in video processing involves vectorizing the scene captions and storing them in a vector database (knowledge database). Additionally, the original text of the captions is also stored in the memory for keyword-based retrieval. The start and end timestamps of the video segments are used as filtering fields and are likewise stored in the memory repository of the OmAgent agent.

编码与保存
视频处理的最后一步涉及将场景字幕向量化并存储到向量数据库(知识数据库)中。此外,字幕的原始文本也会存储在内存中,以便基于关键词进行检索。视频片段的开始和结束时间戳作为过滤字段,同样存储在 OmAgent 代理的内存库中。
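下面用一个内存版的玩具实现示意“编码与保存”以及后续按时间戳过滤的相似度检索;实际使用的向量数据库与嵌入模型在本节中未具体指明,均属示例性假设:

```python
import numpy as np

class KnowledgeBase:
    """A toy in-memory stand-in for the vector + keyword store described above."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn      # any text-embedding function (assumption)
        self.records = []             # raw captions kept for keyword retrieval

    def add(self, caption, start_sec, end_sec):
        vec = np.asarray(self.embed_fn(caption), dtype=float)
        vec = vec / (np.linalg.norm(vec) + 1e-9)
        self.records.append({"caption": caption, "vector": vec,
                             "start": start_sec, "end": end_sec})

    def search(self, query, top_k=5, time_range=None):
        q = np.asarray(self.embed_fn(query), dtype=float)
        q = q / (np.linalg.norm(q) + 1e-9)
        hits = [r for r in self.records
                if time_range is None
                or (r["end"] >= time_range[0] and r["start"] <= time_range[1])]
        hits.sort(key=lambda r: float(np.dot(q, r["vector"])), reverse=True)
        return hits[:top_k]
```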

3.2 Divide-and-Conquer Loop

3.2 分治循环

In computer science, divide-and-conquer (DnC) is a highly classical algorithm design paradigm. A divide-and-conquer approach entails the iterative decomposition of a problem into multiple subproblems. This process continues until the subproblems reach a level of simplicity that allows for direct resolution (Zhao et al., 2016). The solutions to these sub-problems are subsequently merged to yield a resolution to the initial problem. In order to ensure that OmAgent is not limited to simple video Q&A functionality but possesses robust problem-solving capabilities, we initially aimed to build a general task-solving agent system when designing and constructing the agent framework. Inspired by XAgent's (Team, 2023) double-looped planner, we designed an agent framework based on the divide-and-conquer task processing loop (DnC Loop), which is capable of performing recursive task decomposition and execution. The DnC Loop task-solving procedure is shown in Algorithm 1.

在计算机科学中,分治法 (DnC) 是一种高度经典的算法设计范式。分治法通过将问题迭代分解为多个子问题来解决问题。这个过程持续进行,直到子问题达到可以直接解决的简单程度 (Zhao et al., 2016)。这些子问题的解随后被合并,以得到初始问题的解。为了确保 OmAgent 不仅仅局限于简单的视频问答功能,而是具备强大的问题解决能力,我们在设计和构建智能体框架时,最初的目标是构建一个通用的任务解决智能体系统。受 XAgent (Team, 2023) 的双循环规划器启发,我们设计了一个基于分治法任务处理循环 (DnC Loop) 的智能体框架,该框架能够执行递归任务分解和执行。DnC Loop 任务解决过程如算法 1 所示。

Conqueror Conqueror is the entry point of the DnC loop. It is responsible for evaluating and processing the current task. For a given task, Conqueror may return one of the following three types of results:

Conqueror 是 DnC 循环的入口点,负责评估和处理当前任务。对于给定的任务,Conqueror 可能会返回以下三种类型的结果之一:

· If the current task is too complex and needs to be divided into subtasks, Conqueror returns the reason for the division.
· If the execution of the current task requires the use of a specific tool, Conqueror returns both the task information and the tool information. These pieces of information will be passed to the tool execution module for tool invocation.
· If the current task can be answered directly by the LLM, the result is returned directly.

Conqueror will detect the depth of the task tree and terminate task execution when it exceeds the user's setting, to prevent tasks from being infinitely split.

· 如果当前任务过于复杂且需要划分为子任务,Conqueror 返回划分的原因。
· 如果当前任务的执行需要使用特定工具,Conqueror 返回任务信息和工具信息。这些信息将传递给工具执行模块以进行工具调用。
· 如果当前任务可以直接由大语言模型回答,则直接返回结果。

Conqueror 会检测任务树的深度,并在超过用户设置时终止任务执行,以防止任务被无限拆分。

```
Require: The input query UserTask; the max depth of the TaskTree N
Initialize Task = TaskTree.init(UserTask)
procedure DNC(Task, N)
    Result ← Conqueror(Task)
    if Result.type = "too complex" then
        Subtasks ← Divider(Task, Result.reason)
        if Subtasks.success then
            Task.add(Subtasks.tasks)
            for all Subtask ∈ Task.subtasks do
                if Task.depth ≤ N then
                    DNC(Subtask, N)
                else
                    return "Task tree depth exceeded"
                end if
            end for
        else
            return Subtasks.reason
        end if
    else if Result.type = "requires tool" then
        ToolResult ← ToolCall(Task, Result.tool)
        Task.update(Task, ToolResult)
        return ToolResult
    else if Result.type = "direct answer" then
        Task.update(Task, Result.answer)
        return Result.answer
    end if
end procedure
```

```
要求:输入查询 UserTask;任务树的最大深度 N
初始化 Task = TaskTree.init(UserTask)
过程 DNC(Task, N)
    Result ← Conqueror(Task)
    如果 Result.type = "too complex" 那么
        Subtasks ← Divider(Task, Result.reason)
        如果 Subtasks.success 那么
            Task.add(Subtasks.tasks)
            对于所有 Subtask ∈ Task.subtasks 执行
                如果 Task.depth ≤ N 那么
                    DNC(Subtask, N)
                否则 返回 "任务树深度超出"
                结束如果
            结束对于
        否则 返回 Subtasks.reason
        结束如果
    否则如果 Result.type = "requires tool" 那么
        ToolResult ← ToolCall(Task, Result.tool)
        Task.update(Task, ToolResult)
        返回 ToolResult
    否则如果 Result.type = "direct answer" 那么
        Task.update(Task, Result.answer)
        返回 Result.answer
    结束如果
结束过程
```
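为便于理解控制流,下面将算法 1 改写为一段紧凑的 Python 骨架;其中 Conqueror、Divider 与工具调用均以可调用对象占位,仅示意递归的分治过程,并非 OmAgent 的真实实现:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class TaskNode:
    content: str
    depth: int = 0
    result: Optional[str] = None
    subtasks: List["TaskNode"] = field(default_factory=list)

def dnc(task: TaskNode, max_depth: int,
        conqueror: Callable, divider: Callable, tool_call: Callable) -> str:
    """Recursive divide-and-conquer over the task tree, mirroring Algorithm 1.

    `conqueror` is assumed to return a dict with "type" in {"too complex",
    "requires tool", "direct answer"}; all three callables stand in for the
    corresponding LLM-driven modules.
    """
    verdict = conqueror(task)
    if verdict["type"] == "too complex":
        subtasks = divider(task, verdict["reason"])
        if not subtasks["success"]:
            return subtasks["reason"]
        for sub in subtasks["tasks"]:
            node = TaskNode(content=sub, depth=task.depth + 1)
            task.subtasks.append(node)
            if node.depth > max_depth:
                return "Task tree depth exceeded"
            node.result = dnc(node, max_depth, conqueror, divider, tool_call)
        task.result = "; ".join(r for r in (n.result for n in task.subtasks) if r)
        return task.result
    elif verdict["type"] == "requires tool":
        task.result = tool_call(task, verdict["tool"])
        return task.result
    else:  # "direct answer"
        task.result = verdict["answer"]
        return task.result
```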

The position of Conqueror in the whole process is shown in Figure 2.

图 2 展示了 Conqueror 在整个过程中的位置。

Divider The Divider component is responsible for breaking down complex tasks into simpler ones while ensuring that the execution results of these simple tasks are equivalent to the original task. When the Conqueror component determines the necessity of task division, it delegates the task to the Divider for attempted division. Successfully divided tasks are then integrated into the Task tree as child nodes of the original task node. If the division fails, the Divider is asked to provide the reason.

分割器 (Divider)
分割器组件负责将复杂任务分解为更简单的任务,同时确保这些简单任务的执行结果与原始任务等效。当征服者 (Conqueror) 组件确定需要分割任务时,它会将任务委托给分割器尝试分割。成功分割的任务会作为原始任务节点的子节点集成到任务树 (Task tree) 中。如果分割失败,分割器会被要求提供原因。

Rescuer Rescuer is an auxiliary module in the Conqueror's execution process. It attempts to repair issues and ensure the smooth completion of the Conqueror's execution when errors occur. A typical scenario is when the agent tries to execute a piece of code, but a required package is missing in the environment. The Rescuer can attempt to fix the runtime environment issue. The position of Rescuer in the whole process is shown in Figure 2.

救援者
救援者是Conqueror执行过程中的辅助模块。当错误发生时,它尝试修复问题并确保Conqueror的顺利执行。一个典型的场景是当代理尝试执行一段代码时,环境中缺少所需的包。救援者可以尝试修复运行时环境问题。救援者在整个过程中的位置如图2所示。
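针对上文提到的“缺少依赖包”这一典型场景,下面给出 Rescuer“修复并重试”行为的一个玩具示意;真实的 Rescuer 处理的异常类型更广,此处仅作说明:

```python
import subprocess
import sys

def run_with_rescue(code: str, max_retries: int = 2) -> str:
    """Execute generated code; on ModuleNotFoundError, install the missing
    package and retry, mimicking the Rescuer's repair-and-retry behaviour.
    Illustrative sketch only."""
    for _ in range(max_retries + 1):
        try:
            namespace: dict = {}
            exec(code, namespace)                     # code produced by the Conqueror
            return str(namespace.get("result", "ok"))
        except ModuleNotFoundError as err:
            subprocess.run([sys.executable, "-m", "pip", "install", err.name], check=False)
        except Exception as err:                      # other failures are reported back
            return f"execution failed: {err}"
    return "execution failed after retries"
```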

Task tree In software development practice, divide-and-conquer is often implemented using recursion. OmAgent uses a recursive tree structure to store all the paths of task execution. With the help of the Loop node, it achieves recursive operations for task decomposition and execution.

任务树 在软件开发实践中,分治法通常通过递归来实现。OmAgent 使用递归的树结构来存储任务执行的所有路径,并借助 Loop 节点实现任务分解与执行的递归操作。


Figure 2: Divider and Conqueror Loop task-solving procedure. In the DnC Loop, simple problems are directly executed by Conqueror, while complex problems are split by Divider until they can be executed. The Rescuer recognizes exceptions and retries the task. The Tool Manager organizes the external tools. It is worth mentioning that the Rewinder tool can go back through the entire video to find information and missing details. Finally, the DnC Loop outputs the relevant content whether the execution fails or succeeds.

图 2: Divider and Conqueror Loop 任务解决流程。在 DnC Loop 中,简单问题由 Conqueror 直接执行,而复杂问题则由 Divider 拆分,直到可以执行为止。Rescuer 识别异常并重试任务。Tool Manager 组织外部工具。值得一提的是,Rewinder 工具可以回放整个视频以查找信息和遗漏的细节。最后,无论执行成功还是失败,DnC Loop 都会输出相关内容。

3.3 Tool call

3.3 工具调用

As a standard capability of intelligent agents, the principle of tool calling lies in utilizing the powerful logical generation ability of large language models (LLMs) to generate corresponding tool invocation request parameters based on task information. In addition to conventional tools, OmAgent specifically offers a video detail rewinder tool for further information extraction within specific time ranges of a video. OmAgent can autonomously choose to view details of a particular segment of a video when necessary, addressing the issue of information loss that occurs when video data transitions from a continuous information source to a discrete one during the preprocessing stage.

作为智能体的标准能力,工具调用的原理在于利用大语言模型强大的逻辑生成能力,根据任务信息生成相应的工具调用请求参数。除了常规工具外,OmAgent 特别提供了视频细节回放工具,用于在视频的特定时间范围内进一步提取信息。OmAgent 可以在需要时自主选择查看视频某个片段的细节,解决了视频数据在预处理阶段从连续信息源转变为离散信息源时出现的信息丢失问题。
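下面示意一个 rewinder 风格的工具接口:在给定时间范围内重新从原始视频抽帧,供智能体回看细节;函数签名与参数均为示例性假设,可能与 OmAgent 实际提供的工具不同:

```python
import cv2

def rewinder(video_path: str, start_sec: float, end_sec: float, num_frames: int = 8):
    """Re-open the original video and sample frames inside [start_sec, end_sec],
    so the agent can re-inspect details lost during preprocessing.
    Illustrative sketch; the real tool's signature may differ."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    step = max((end_sec - start_sec) / max(num_frames - 1, 1), 0.04)
    t = start_sec
    while t <= end_sec and len(frames) < num_frames:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if ok:
            frames.append((t, frame))
        t += step
    cap.release()
    return frames
```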

Furthermore, OmAgent provides conventional tools such as internet search tools, facial recognition tools, and file processing tools to meet more complex user tasks.

此外,OmAgent 提供了互联网搜索工具、面部识别工具和文件处理工具等常规工具,以满足更复杂的用户任务。

4 Experimental Settings

4 实验设置

To validate the efficacy of the OmAgent system in addressing complex problems within real-world scenarios, we have designed a two-phase experimental approach:

为验证 OmAgent 系统在解决现实世界复杂问题中的效果,我们设计了一个两阶段的实验方法:

4.1 General problem-solving capabilities

4.1 通用问题解决能力

We hypothesize that understanding lengthy and intricate videos relies significantly on an agent's comprehensive problem-solving skills. To test this, we used two general-purpose intelligent benchmarks, MBPP (Austin et al., 2021) and FreshQA (Vu et al., 2023). We focused on the DnC Loop's ability to plan and execute tasks and its proficiency in utilizing tools to address complex issues.

我们假设理解冗长且复杂的视频在很大程度上依赖于智能体的综合问题解决能力。为了验证这一点,我们使用了两个通用智能基准测试:MBPP (Austin et al., 2021) 和 FreshQA (Vu et al., 2023)。我们重点关注 DnC Loop 在规划和执行任务方面的能力,以及其利用工具解决复杂问题的熟练程度。

Datasets The Mostly Basic Programming Problems (MBPP) benchmark includes 976 elementary Python coding tasks. These problems are designed to evaluate the system's ability to plan solutions, select and invoke the right tools, and fix errors effectively.

数据集 Mostly Basic Programming Problems (MBPP) 基准测试包含 976 个基础的 Python 编程任务。这些问题旨在评估系统规划解决方案、选择和调用正确工具以及有效修复错误的能力。

FreshQA is a continuously updated collection of real-world questions and answers, reflecting the constantly changing nature of reality. As a result, FreshQA focuses on the system's ability to learn and integrate new information from external sources.

FreshQA 是一个持续更新的现实世界问答集合,反映了现实不断变化的本质。因此,FreshQA 关注系统从外部来源学习和整合新信息的能力。

Settings To study OmAgent's capability in comprehending complex long-form videos, we devised two control groups:

设置
为了研究 OmAgent 在理解复杂长视频方面的能力,我们设计了两组对照组:

First, we aimed to ascertain the performance that could be achieved by one of the most advanced MLLMs - GPT-4o, based solely on a limited number of video frames (restricted to 20 by the Microsoft Azure GPT-4o service) and basic dialogue textual information. In this experiment, we initially extracted 20 frames evenly from the video and paired them with the dialogue text obtained through Whisper (Radford et al., 2023) as the context for input into the MLLM for question-answering.

首先,我们的目标是确定仅基于有限数量的视频帧(受限于 Microsoft Azure GPT-4o 服务的 20 帧)和基本对话文本信息,最先进的多模态大语言模型(MLLM)之一——GPT-4o 能够达到的性能。在这个实验中,我们首先从视频中均匀提取了 20 帧,并将其与通过 Whisper (Radford et al., 2023) 获得的对话文本配对,作为输入到 MLLM 的上下文进行问答。
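这一基线的预处理可以用如下示意代码表达(假设使用 OpenCV 均匀抽取 20 帧、openai-whisper 转写对白;对 Azure GPT-4o 的实际调用在此省略):

```python
import cv2
import numpy as np
import whisper

def prepare_baseline_context(video_path: str, num_frames: int = 20):
    """Evenly sample `num_frames` frames and transcribe the dialogue, producing
    the context for the frames-with-STT baseline. Illustrative sketch only."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num=num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    transcript = whisper.load_model("base").transcribe(video_path)["text"]
    return frames, transcript
```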

In the second control experiment, we sought to evaluate the strengths and weaknesses of using a multimodal RAG approach compared to our Agent strategy. To ensure a fair comparison, we isolated the Video2RAG component from OmAgent to serve as the RAG system. When a query is input, the system first retrieves relevant video clip information from the knowledge database, then inputs this pertinent data as context into the MLLM for question-answering.

在第二个控制实验中,我们旨在评估使用多模态RAG方法与我们的AI智能体策略相比的优缺点。为了确保公平比较,我们将OmAgent中的Video2RAG组件单独作为RAG系统使用。当输入查询时,系统首先从知识库中检索相关的视频片段信息,然后将这些相关数据作为上下文输入到MLLM中进行问答。

4.2 Long-form videos understanding capabilities

4.2 长视频理解能力

We created a benchmark with over 2000 question-answer pairs to evaluate OmAgent's ability to understand, answer questions, and recall detailed information from long-form videos. We aimed to assess the general understanding of ultra-long video content and the ability to recall specific details. Our benchmark was designed to be logically coherent and narrative-rich, with video segments up to an hour long. This benchmark allows us to measure the capabilities of our intelligent agent and compare its performance against large language models.

我们创建了一个包含超过2000个问答对的基准测试,以评估OmAgent在理解、回答问题以及从长视频中回忆详细信息方面的能力。我们的目标是评估对超长视频内容的整体理解能力以及回忆具体细节的能力。我们的基准测试设计为逻辑连贯且叙事丰富,视频片段长达一小时。该基准测试使我们能够衡量智能体的能力,并将其性能与大语言模型进行比较。

Datasets Publicly available long-form video understanding datasets are very scarce. Datasets similar to FreshQA/MovieQA only contain videos at the minute level, and their questions are not complex enough; for example, "How does Talia die?" in MovieQA is a question that can be inferred from consecutive frames and thus cannot meet our needs. SOK-Bench (Wang et al., 2024a) addresses scenarios slightly different from ours, as it demands the testing program to integrate situated and general knowledge to answer questions. MoVQA is a long movie question-and-answer dataset that utilizes 100 well-known movies to create complex long video question-answer pairs, which align with our requirements, but it is not yet open-sourced. Therefore, we created the dataset ourselves. We collected some long videos familiar to annotators, selected the top 100 videos in terms of frequency, and created 20 questions for each video. These questions were first proofread by two different annotators and then revised by a third annotator. Videos include episodes and movies, variety shows, documentaries, and vlogs. These types of videos exhibit significant differences in themes, filming and editing techniques, data density, scene lengths, and alignment of video audio and visuals, fully demonstrating the diversity of the data.

数据集公开可用的长视频理解数据集非常稀缺,类似于 FreshQA/MovieQA 数据集,这些数据集仅包含分钟级别的视频,而且问题不够复杂,例如 MovieQA 中的 "Talia 是如何死的?" 这样的问题,可以从连续的帧中推断出来,无法满足我们的需求。SOK-Bench (Wang et al., 2024a) 解决的场景与我们的略有不同,因为它要求测试程序整合情境知识和一般知识来回答问题。MoVQA 是一个长电影问答数据集,它利用 100 部知名电影创建了复杂的长视频问答对,这符合我们的要求,但尚未开源。因此,我们自己创建了数据集。我们收集了一些标注者熟悉的长视频,选择了频率最高的 100 个视频,并为每个视频创建了 20 个问题。这些问题首先由两位不同的标注者进行校对,然后由第三位标注者进行修订。视频包括剧集和电影、综艺、纪录片和 vlog。这些类型的视频在主题、拍摄和剪辑技术、数据密度、场景长度以及视频音频和视觉的对齐方面表现出显著差异,充分展示了数据的多样性。

Settings To evaluate the system's understanding of long-form videos and timelines, we defined questions in four categories: reasoning, information summary, event localization, and external knowledge.

设置
为了评估系统对长视频和时间线的理解能力,我们定义了四类问题:推理、信息总结、事件定位和外部知识。

Reasoning, information summary, and external knowledge questions are multiple-choice. Event localization requires precise timestamps or time spans, with a deviation within ±2 seconds for timestamps and an IoU exceeding 90% for time spans.

推理、信息摘要和外部知识问题为选择题。事件定位需要精确的时间戳或时间段,时间戳的偏差需在 ±2 秒以内,时间段的 IoU 需超过 90%。
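事件定位的判分标准可以用如下两个辅助函数示意(±2 秒容差与时间段 IoU 阈值 0.9),仅为说明计算方式:

```python
def timestamp_correct(pred_sec: float, gold_sec: float, tol: float = 2.0) -> bool:
    """A predicted timestamp counts as correct within ±2 seconds."""
    return abs(pred_sec - gold_sec) <= tol

def span_correct(pred, gold, iou_threshold: float = 0.9) -> bool:
    """A predicted time span counts as correct if its temporal IoU with the
    reference span exceeds the threshold."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return union > 0 and inter / union > iou_threshold
```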

5 Results and Analysis

5 结果与分析

5.1 General problem-solving capabilities

5.1 通用问题解决能力

GPT-4, recognized as a benchmark for evaluating Large Language Models, is known for its strong reasoning abilities and serves as the baseline model for our experiments. XAgent, an advanced agent system, features a well-designed Dual-Loop Mechanism that allows it to address problems from both broad and detailed perspectives.

GPT-4 被视为评估大语言模型的基准,以其强大的推理能力而著称,并作为我们实验的基线模型。XAgent 是一个先进的 AI 智能体系统,具有精心设计的双循环机制,使其能够从宏观和细节两个角度解决问题。

Table 1: Results on MBPP and FreshQA compared with GPT-4 and XAgent, showing that OmAgent has a strong generalized task-solving capability.

表 1: MBPP 和 FreshQA 上与 GPT-4 和 XAgent 的对比结果,展示了 OmAgent 具有强大的泛化任务解决能力。

Table 2: Results of different types of videos on OmAgent and the other five baselines.

| 方法 | Vlog | 剧集和电影 | 综艺 | 纪录片 | 总计 |
| --- | --- | --- | --- | --- | --- |
| OmAgent | 57.14% | 56.25% | 23.53% | 36.84% | 45.45% |
| Video2RAG | 42.86% | 32.35% | 19.88% | 31.57% | 27.27% |
| Frames with STT | 42.85% | 29.41% | 17.64% | 31.58% | 28.57% |
| VideoAgent | 41.72% | 23.53% | 11.76% | 26.32% | 23.38% |
| VideoTree | 34.52% | 31.48% | 21.27% | 27.35% | 26.76% |
| LLoVi | 28.57% | 24.16% | 17.65% | 21.05% | 23.63% |

表 2: OmAgent 和其他五个基线在不同类型视频上的结果。

Table 3: Results of different types of queries on OmAgent and the other five baselines.

| 方法 | 推理 | 事件定位 | 信息摘要 | 外部知识 |
| --- | --- | --- | --- | --- |
| OmAgent | 81.82% | 19.05% | 72.74% | 57.21% |
| Video2RAG | 72.73% | 4.76% | 50.17% | 23.36% |
| Frames with STT | 63.64% | 2.38% | 63.63% | 19.46% |
| VideoAgent | 64.66% | 2.25% | 45.45% | 23.78% |
| VideoTree | 35.30% | 18.62% | 47.27% | 29.57% |
| LLoVi | 27.27% | 11.90% | 45.46% | 24.57% |

表 3: OmAgent与其他五个基线在不同类型查询上的结果。

5.2 Long-form video understanding capabilities

5.2 长视频理解能力

The results in Table 1 clearly show that both agent systems outperform the basic inferential capabilities of GPT-4 alone. Notably, OmAgent surpasses XAgent (Team, 2023) in overall performance. Analysis reveals that XAgent's Dual-Loop Mechanism, while thorough, often leads to overthinking and complicates problem-solving. In contrast, OmAgent's Rescuer mechanism proves more effective, especially in handling code-related tasks. This mechanism enables OmAgent to dynamically correct issues based on real-time results, leading to superior performance.

表 1 中的结果清楚地表明,两个智能体系统均优于 GPT-4 单独的基本推理能力。值得注意的是,OmAgent 在整体性能上超越了 XAgent (Team, 2023)。分析表明,XAgent 的双循环机制虽然全面,但常常导致过度思考并使问题解决复杂化。相比之下,OmAgent 的 Rescuer 机制在处理代码相关任务时更为有效。该机制使 OmAgent 能够根据实时结果动态纠正问题,从而获得更优的性能。

Table 2 compares the scores of five baselines and OmAgent across different types of long-form video understanding. OmAgent achieved the highest scores. Vlogs, variety shows, and documentaries contain extensive narration, so the STT data and the resulting scene captions encompass most relevant information of the video. Therefore, the performance difference between OmAgent and the other two methods in these categories is not as significant as in episodes and movies. In episodes and movies, scenes change frequently, complex queries involve cross-scene information, and STT data might span scene transitions. Compared to frames with STT, Video2RAG retrieves data related to the query, reducing data redundancy, and thus has higher scores than frames with STT. However, since it only retrieves relevant information from a vector database, complex questions such as "Are there any scene changes between 03:58 and 04:02, and what is their connection?" cannot be handled. On the other hand, OmAgent's DnC Loop breaks down complex questions into several sub-questions, including "Extract frames between 03:58 and 04:02," "Analyze the extracted frames to identify any scene changes," and "Determine the connection between the scenes based on the identified changes." By leveraging the rewinder capability, it pinpoints the relevant segments for rewatching, thereby arriving at the correct answer.

表 2 比较了五种基线方法和 OmAgent 在不同类型的长视频理解中的得分。OmAgent 取得了最高分。Vlogs、综艺和纪录片包含大量的叙述,因此 STT 数据和生成的场景描述涵盖了视频的大部分相关信息。因此,OmAgent 与其他两种方法在这些类别中的性能差异不像在剧集和电影中那么显著。在剧集和电影中,场景变化频繁,复杂查询涉及跨场景信息,而 STT 数据可能跨越场景转换。与带有 STT 的帧相比,Video2RAG 检索与查询相关的数据,减少了数据冗余,因此得分高于带有 STT 的帧。然而,由于它只从向量数据库中检索相关信息,因此无法实现诸如“在 03:58 到 04:02 之间是否有场景变化,它们的联系是什么?”这样的复杂问题。另一方面,OmAgent 的 DnC 循环将复杂问题分解为几个子问题,包括“提取 03:58 到 04:02 之间的帧”,“分析提取的帧以识别任何场景变化”,以及“根据识别到的变化确定场景之间的联系”。通过利用倒带功能,它精确定位了需要重新观看的相关片段,从而得出正确答案。

Furthermore, we conducted a detailed analysis of different question types. Table 3 provides a comparison of OmAgent and five baselines in terms of reasoning, event localization, information summary, and external knowledge. The results show that OmAgent achieves the highest scores in all four types of questions. Through its rewinder capability, OmAgent can extract more detailed video information and accurately locate timestamps, which leads to significant improvements in reasoning, event localization, and information summarization tasks compared to frames with STT and Video2RAG. The external knowledge task has stricter requirements for information retrieval. Although GPT can answer some questions through its own capabilities and scene information, OmAgent achieves higher scores by utilizing various external tools (such as facial recognition, web search, etc.) to obtain more accurate relevant information. Notably, in the question type of information summary, Video2RAG scored lower than frames with STT. Analysis reveals that Video2RAG's information source came from scen