[论文翻译] FILMAGENT:用于虚拟3D空间中端到端电影自动化的多智能体框架


原文地址:https://arxiv.org/pdf/2501.12909


FILMAGENT: A MULTI-AGENT FRAMEWORK FOR END-TO-END FILM AUTOMATION IN VIRTUAL 3D SPACES

FILMAGENT:用于虚拟3D空间中端到端电影自动化的多智能体框架


Figure 1: We introduce FILMAGENT, a multi-agent collaborative framework for end-to-end film automation powered by large language models (LLMs). A team of LLM-based agents takes on film crew roles, and simulates the human workflow in 3D virtual spaces by sequentially engaging in idea development, script writing, and cinematography, finally completing the filmmaking process.

图 1: 我们介绍了 FILMAGENT,这是一个基于大语言模型 (LLMs) 的多智能体协作框架,用于端到端的电影自动化制作。一个由基于LLM的智能体组成的团队扮演电影工作人员的角色,通过依次参与创意开发、剧本编写和电影摄影,在3D虚拟空间中模拟人类工作流程,最终完成电影制作过程。

ABSTRACT

摘要

Virtual film production requires intricate decision-making processes, including script writing, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FILMAGENT, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FILMAGENT simulates various crew roles—directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) script writing elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FILMAGENT outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FILMAGENT, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI’s text-to-video model Sora and our FILMAGENT in filmmaking.

虚拟电影制作需要复杂的决策过程,包括剧本编写、虚拟摄影以及精确的演员定位和动作。受近期基于语言智能体社会的自动化决策进展的启发,本文介绍了 FILMAGENT,一种基于大语言模型的多智能体协作框架,用于在我们构建的 3D 虚拟空间中实现端到端的电影自动化。FILMAGENT 模拟了多种工作人员角色——导演、编剧、演员和摄影师,并涵盖了电影制作工作流程的关键阶段:(1) 创意开发将头脑风暴的想法转化为结构化的故事大纲;(2) 剧本编写详细描述了每个场景的对话和角色动作;(3) 摄影确定了每个镜头的摄像机设置。一组智能体通过迭代反馈和修订进行协作,从而验证中间剧本并减少幻觉。我们在 15 个创意和 4 个关键方面评估生成的视频。人类评估表明,FILMAGENT 在所有方面均优于所有基线,平均得分为 3.98 分(满分 5 分),展示了多智能体协作在电影制作中的可行性。进一步分析表明,尽管 FILMAGENT 使用的是较不先进的 GPT-4o 模型,但它超越了单一智能体 o1,展示了协调良好的多智能体系统的优势。最后,我们讨论了 OpenAI 的文本到视频模型 Sora 和我们的 FILMAGENT 在电影制作中的互补优势和劣势。

1 INTRODUCTION

1 引言

Virtual film production entails a methodical and disciplined approach to directing, camera placement and actor positioning (He et al., 1996). Recent advancements in deep learning have started to automate film production practices, where sophisticated neural networks enable the movement of virtual cameras through 3D environments (Jiang et al., 2020). However, films are not only about moving pictures; they are crafted through language. They are produced through the dialogues spoken by the characters, the screenplays that outline the story, the shooting scripts that instruct the cinematographers, and, undeniably, the guidance given by directors (Jiang et al., 2024). Therefore, filmmaking is fundamentally a communication-driven collaborative task, motivating our design of a multi-agent system based on large language models (LLMs).

虚拟电影制作需要对导演、摄像机摆放和演员站位进行系统化和规范化的处理 (He et al., 1996)。近年来,深度学习的进展已经开始自动化电影制作实践,复杂的神经网络使得虚拟摄像机能够在3D环境中移动 (Jiang et al., 2020)。然而,电影不仅仅是关于动态画面;它们是通过语言构建的。通过角色之间的对话、故事大纲的剧本、指导摄影师的分镜头脚本,以及不可否认的导演指导 (Jiang et al., 2024) 制作而成。因此,电影制作本质上是一项以沟通为驱动的协作任务,这促使我们设计了一个基于大语言模型的多智能体系统。

In this paper, we propose FILMAGENT, the first LLM-based multi-agent collaborative framework designed to automate end-to-end virtual film production. In this framework, LLM-based agents fulfill various film crew roles, including director, screenwriter, actor, and cinematographer, to collectively create a film. As shown in Figure 1, the collaborative method emulates the human workflow and divides the process into three sequential stages: idea development, script writing, and cinematography. In the first stage, given a brief story idea, the director develops character profiles and expands the idea into a detailed scene outline, specifying the where, what, and who of each segment. During script writing, the director, screenwriter, and actors collaborate on dialogue development and choreograph movements. In the cinematography stage, the cinematographers and director work together to design camera setups for each line, selecting between static and dynamic shots to effectively convey the narrative visually. In addition, we propose two multi-agent collaboration algorithms, Critique-Correct-Verify and Debate-Judge, applied in the script writing and cinematography stages respectively, to refine the script and camera settings. Finally, once the script is fully annotated, the film is shot within our meticulously constructed 3D spaces. The virtual 3D spaces include 15 locations, 65 designated actor positions, 272 shots covering 9 types of static and dynamic shots, 21 actor actions depicting expressive gestures and emotions, and speech audio generation.

在本文中,我们提出了FILMAGENT,这是首个基于大语言模型的多智能体协作框架,旨在自动化端到端的虚拟电影制作。在该框架中,基于大语言模型的智能体扮演电影制作中的各种角色,包括导演、编剧、演员和摄影师,共同创作一部电影。如图 1 所示,该协作方法模拟了人类的工作流程,并将过程分为三个连续的阶段:构思开发、剧本编写和摄影。在第一阶段,导演根据简要的故事构思,开发角色档案,并将构思扩展为详细的场景大纲,详细说明每个片段的背景、内容和人物。在剧本编写阶段,导演、编剧和演员合作进行对话开发并编排动作。在摄影阶段,摄影师和导演共同为每一句台词设计摄像机设置,选择静态或动态镜头,以有效地通过视觉传达叙事。此外,我们提出了两种多智能体协作算法,分别是 Critique-Correct-Verify 和 Debate-Judge,分别应用于剧本编写和摄影阶段,以优化剧本和摄像机设置。最后,剧本完全注释后,电影在我们精心构建的 3D 空间中进行拍摄。虚拟 3D 空间包括 15 个地点、65 个指定的演员位置、272 个镜头,涵盖 9 种静态和动态镜头、21 个描绘表情和情感的动作,以及语音音频生成。

Human evaluations of the generated videos across 15 ideas validate the effectiveness of our framework. The results show that the collaborative FILMAGENT achieves an average score of 3.98 out of 5, significantly outperforming single-agent efforts across four aspects: plot coherence, alignment between dialogue and actor profiles, appropriateness of camera setting, and accuracy of actor actions. Further preference analysis underscores the importance of multi-agent collaboration in addressing hallucinations, enhancing plot coherence and improving camera choices. We also experiment with OpenAI’s large reasoning model o1 and find that FILMAGENT, despite using a less advanced GPT-4o as foundational model, outperforms the single-agent o1. This highlights that a well-coordinated multi-agent system can exceed the performance of a more advanced foundational model.

对15个创意生成的视频进行的人工评估验证了我们框架的有效性。结果表明,协作式的FILMAGENT平均得分为3.98(满分5分),在情节连贯性、对话与演员档案的匹配度、镜头设置的恰当性以及演员动作的准确性四个方面均显著优于单智能体。进一步的偏好分析强调了多智能体协作在解决幻觉、增强情节连贯性和改进镜头选择方面的重要性。我们还实验了OpenAI的大型推理模型o1,发现尽管FILMAGENT使用了相对不那么先进的GPT-4o作为基础模型,但其表现仍优于单智能体o1。这突显了良好协调的多智能体系统可以超越更先进的基础模型的性能。

A case study with OpenAI’s text-to-video model Sora reveals the complementary strengths and weaknesses of Sora and FILMAGENT. While Sora shows great adaptability, it struggles with consistency and narrative delivery. In contrast, FILMAGENT produces coherent, physics-compliant videos with strong storytelling capabilities, due to its foundation on pre-designed 3D spaces and characters within a collaborative workflow. As an early exploration of LLM-based multi-agent systems in virtual film production, we hope that this project lays the groundwork for end-to-end film automation, showing the potential of collaborative AI agents in this creative domain.

通过 OpenAI 的文本到视频模型 Sora 的案例研究,揭示了 Sora 和 FILMAGENT 在优势和弱点上的互补性。虽然 Sora 展现了强大的适应性,但在一致性和叙事传达方面存在困难。相比之下,FILMAGENT 由于其基于预设计的 3D 空间和角色,并在协作工作流程中运行,能够生成连贯且符合物理规律的视频,具备强大的叙事能力。作为基于大语言模型的多智能体系统在虚拟电影制作中的早期探索,我们希望这个项目能为端到端的电影自动化奠定基础,展示协作 AI 智能体在这一创意领域的潜力。

In summary, our main contributions are as follows:

总结来说,我们的主要贡献如下:

• We present FILMAGENT, a novel LLM-based multi-agent collaborative framework for end-to-end film automation, which mirrors the traditional film set process within meticulously crafted 3D virtual spaces.
• We incorporate two multi-agent collaboration strategies within the workflow, which substantially reduces hallucinations and enhances the quality of scripts and camera settings.
• Extensive human evaluations validate the effectiveness of FILMAGENT, indicating LLM-based multi-agent systems as a promising avenue for automating film production.

• 我们提出了FILMAGENT,一个基于大语言模型的多智能体协作框架,用于端到端的电影自动化制作,它在精心设计的3D虚拟空间中模拟传统电影拍摄过程。
• 我们在工作流程中融入了两种多智能体协作策略,这大大减少了幻觉并提升了剧本和摄像机设置的质量。
• 广泛的人类评估验证了FILMAGENT的有效性,表明基于大语言模型的多智能体系统是实现电影制作自动化的一条有前景的途径。

2 RELATED WORK

2 相关工作

2.1 VIRTUAL FILM PRODUCTION

2.1 虚拟电影制作

Virtual film production is defined as “a broad term referring to a spectrum of computer-aided production and visualization filmmaking methods” (Bodini et al., 2024). This method supports remote collaboration and enhances accessibility due to its virtual nature (Nebeling et al., 2021). It has gained substantial attention in the entertainment industry, following its prominent use in The Mandalorian television series (Kavakli & Cremona, 2022). Recently, game engines are revolutionizing filmmaking with the Virtual Camera Plugin, which allows real-time rendering of simulated environments. This enables filmmakers to play around in a virtual environment before shooting, potentially replacing traditional pre-visualization methods like storyboards (Legato & Deschanel, 2019).

虚拟电影制作被定义为“一个广泛的术语,指代一系列计算机辅助制作和可视化电影制作方法” (Bodini et al., 2024)。由于其虚拟性,这种方法支持远程协作并提高了可访问性 (Nebeling et al., 2021)。在《曼达洛人》电视剧中的突出应用后,它在娱乐行业引起了广泛关注 (Kavakli & Cremona, 2022)。最近,游戏引擎通过虚拟摄像机插件正在革新电影制作,该插件允许实时渲染模拟环境。这使得电影制作人可以在拍摄前在虚拟环境中进行探索,可能取代传统的预可视化方法,如故事板 (Legato & Deschanel, 2019)。

Deep learning-based virtual production. Virtual film production covers a wide spectrum of problems, from narrative aspects (de Lima et al., 2009), camera control (Li & Cheng, 2008; Christie et al., 2008) and even cutting and editing problems (Leake et al., 2017). In recent years, the field has embraced deep neural networks due to their remarkable generalization ability. When applying cinematography in computer graphics environments, Jiang et al. (2020) combine the Toric coordinate system (Lino & Christie, 2015) with a Mixture-of-Experts model to generate styled camera motions based on different video references. Jiang et al. (2021) further introduce keyframing for finer control of camera motions with an LSTM-based backbone. In this work, based on the understanding that filmmaking is a communication-driven collaborative process (Jiang et al., 2024), we design a multi-agent system that uses large language models (LLMs) to enhance this collaboration.

基于深度学习的虚拟制作。虚拟电影制作涵盖了广泛的问题,从叙事方面 (de Lima et al., 2009)、相机控制 (Li & Cheng, 2008; Christie et al., 2008) 甚至到剪辑和编辑问题 (Leake et al., 2017)。近年来,该领域因深度神经网络的显著泛化能力而开始采用这些技术。在将电影摄影应用于计算机图形环境时,Jiang et al. (2020) 将 Toric 坐标系 (Lino & Christie, 2015) 与 Mixture-of-Experts 模型结合,基于不同的视频参考生成风格化的相机运动。Jiang et al. (2021) 进一步引入了关键帧技术,使用基于 LSTM 的框架对相机运动进行更精细的控制。在本研究中,基于对电影制作是一个以沟通为驱动的协作过程的理解 (Jiang et al., 2024),我们设计了一个多智能体系统,利用大语言模型来增强这种协作。

Preliminary exploration with LLMs. Recent works in virtual production have begun to utilize the emergent reasoning and planning capabilities of LLMs (Wei et al., 2022). Qing et al. (2023) address the Story-to-Motion task, which requires characters to navigate to locations and perform specific actions based on textual descriptions. Here, LLMs are utilized as text-driven motion schedulers, extracting sequences of (text, position, duration) tuples from long text. VideoDirectorGPT (Lin et al., 2023) and Anim-Director (Li et al., 2024b) employ LLMs to plan videos, generating detailed scene descriptions, along with the positioning and layout of entities, for consistent multi-scene video production. In our work, we expand the use of LLMs to cover all aspects of film production, fully automating tasks from plot planning to cinematography within 3D virtual spaces.

大语言模型的初步探索。近期的虚拟制作研究已开始利用大语言模型涌现的推理与规划能力 (Wei et al., 2022)。Qing et al. (2023) 研究了 Story-to-Motion 任务,该任务要求角色根据文本描述导航到指定位置并执行特定动作;其中大语言模型被用作文本驱动的动作调度器,从长文本中提取 (文本、位置、时长) 元组序列。VideoDirectorGPT (Lin et al., 2023) 和 Anim-Director (Li et al., 2024b) 利用大语言模型规划视频,生成详细的场景描述以及实体的位置与布局,以实现一致的多场景视频制作。在我们的工作中,我们将大语言模型的应用扩展到电影制作的各个方面,在 3D 虚拟空间中完全自动化从情节规划到摄影的各项任务。

2.2 MULTI-AGENT FRAMEWORK

2.2 多智能体框架

Recently, LLM-based autonomous agents have gained tremendous interest in both industry and academia (Wang et al., 2024). Voyager (Wang et al., 2023), AppAgent (Zhang et al., 2023) and Claude 3.5 Computer Use (Hu et al., 2024a) are typical task-oriented agents that can autonomously interact with the environment and solve simple tasks. However, single agents struggle to achieve effective, coherent, and accurate problem-solving processes, particularly when there is a need for meaningful collaborative interaction (Zhuge et al., 2023; Qian et al., 2023).

近日,基于大语言模型的自主智能体在业界和学术界引起了极大关注 (Wang et al., 2024)。Voyager (Wang et al., 2023)、AppAgent (Zhang et al., 2023) 和 Claude 3.5 Computer Use (Hu et al., 2024a) 是典型的任务导向型智能体,它们能够自主与环境交互并解决简单任务。然而,单个智能体在实现高效、连贯且准确的问题解决过程时面临挑战,尤其是在需要有意义的协作互动时 (Zhuge et al., 2023; Qian et al., 2023)。

In the transition from single-agent frameworks into multi-agent frameworks, the pioneering research on Generative Agents (Park et al., 2023) has laid the groundwork for the development of “Simulated Society”. These societies are conceptualized as dynamic systems where multiple agents engage in intricate interactions within a well-defined environment (Xi et al., 2023). This approach aligns with the Society of Mind (SoM) theory (Minsky, 1988), which suggests that intelligence arises from the interaction of computational modules, achieving collective goals beyond the capabilities of individual modules. To this end, many works (Xu et al., 2023; Zhang et al., 2024; Cohen et al., 2023) have improved reasoning and factuality of LLMs by integrating discussions among multiple agents. Furthermore, ChatDev (Qian et al., 2023), MetaGPT (Hong et al., 2024), TransAgents (Wu et al., 2024a;b) and Agent Laboratory (Schmidgall et al., 2025) have successfully implemented multi-agent collaborative schemes through simulating standard human practices and workflows such as requirement design, coding and testing. Motivated by the promising outcomes of multi-agent collaboration, we have developed a multi-agent system called FILMAGENT to replicate human workflows and automate the end-to-end film production process.

在从单智能体框架向多智能体框架的过渡中,生成式智能体 (Generative Agents) 的开创性研究(Park 等人,2023)为“模拟社会”的发展奠定了基础。这些社会被概念化为动态系统,其中多个智能体在明确定义的环境中参与复杂的互动(Xi 等人,2023)。这种方法与“心智社会 (Society of Mind)”理论(Minsky, 1988)一致,该理论认为智能源于计算模块的互动,从而实现单个模块无法实现的集体目标。为此,许多研究(Xu 等人,2023;Zhang 等人,2024;Cohen 等人,2023)通过整合多个智能体之间的讨论,改善了大语言模型的推理能力和事实性。此外,ChatDev(Qian 等人,2023)、MetaGPT(Hong 等人,2024)、TransAgents(Wu 等人,2024a;b)和 Agent Laboratory(Schmidgall 等人,2025)在模拟标准人类实践和工作流程(如需求设计、编码和测试)中成功实施了多智能体协作方案。受多智能体协作成果的启发,我们开发了一个名为 FILMAGENT 的多智能体系统,以复制人类工作流程并实现端到端的电影制作自动化。


Figure 2: A vertical view of one of the 3D spaces (the living room) in FILMAGENT built with Unity. The environment is pre-configured with designated positions for actors and various camera setups for cinematography. These include static shots from multiple distances and dynamic shots that either follow or orbit around characters. Full camera setup of this space is provided in Figure 8.

图 2: FILMAGENT 中用 Unity 构建的一个 3D 空间(客厅)的垂直视图。该环境已预先配置了演员的指定位置和用于电影摄影的各种摄像机设置。这些设置包括来自多个距离的静态镜头以及跟随或环绕角色的动态镜头。该空间的完整摄像机设置见图 8。

3 FILMAGENT

3 FILMAGENT

FILMAGENT is an LLM-based multi-agent framework for end-to-end film automation in a 3D sandbox environment. The basic process is illustrated in Figure 1. An introduction of our constructed virtual 3D spaces for filmmaking is in Sec. 3.1. We describe the overview of FILMAGENT in Sec. 3.2, the core collaboration strategies in Sec. 3.3, and the production workflow in Sec. 3.4.

FILMAGENT 是一个基于大语言模型 (LLM) 的多智能体框架,用于在 3D 沙盒环境中实现端到端的电影自动化。基本流程如图 1 所示。我们在第 3.1 节介绍了我们所构建的用于电影制作的虚拟 3D 空间。我们在第 3.2 节描述了 FILMAGENT 的概览,在第 3.3 节介绍了核心协作策略,并在第 3.4 节详细说明了生产工作流程。

3.1 ENVIRONMENT SETUP

3.1 环境设置

We have meticulously built virtual 3D spaces ready for filmmaking. These Unity spaces include 15 locations that reflect everyday settings, such as living rooms, kitchens, offices and roadsides, thus providing versatile backdrops for a wide range of narratives. A screenshot of the living room is presented in Figure 2. Each scene is pre-configured with actor positions and camera setups. All locations are listed in Figure 7 in Appendix A.

我们精心构建了可用于电影制作的虚拟 3D 空间。这些 Unity 空间包含了 15 个反映日常场景的地点,例如客厅、厨房、办公室和路边,从而为各种叙事提供了多样化的背景。图 2 展示了客厅的截图。每个场景都预先配置了演员位置和摄像机设置。所有地点均在附录 A 中的图 7 中列出。

Positions. The environment includes 32 standing points and 33 sitting points, each accompanied by a human-written description indicating its position. For example, Position B in Figure 2 is described as “near the sofa, sittable, between Positions A and C, allowing easy communication with characters at these positions”.

位置。环境包括 32 个站立点和 33 个坐点,每个位置都附有人工编写的描述,指明其位置。例如,图 2 中的位置 B 被描述为“靠近沙发,可坐,位于位置 A 和 C 之间,便于与这些位置的角色进行交流”。
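
为便于理解,下面给出此类位置元数据的一个最小 Python 示意。其中的字段名与组织方式是本文为说明而做的假设,并非论文公开的实际数据结构:

```python
# 位置元数据的示意性表示(字段名为假设,仅作说明)
positions = {
    "Position A": {
        "type": "sitting",          # 就座点或站立点
        "location": "living room",
        "description": "on the sofa, next to Position B",  # 示意描述
    },
    "Position B": {
        "type": "sitting",
        "location": "living room",
        # 论文中给出的人工描述原文
        "description": "near the sofa, sittable, between Positions A and C, "
                       "allowing easy communication with characters at these positions",
    },
}

def position_prompt(positions: dict) -> str:
    """把位置描述拼接进提示词,供智能体用自然语言选择站位。"""
    return "\n".join(f"{name}: {meta['description']}" for name, meta in positions.items())
```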

Actions. Each character can perform 21 different actions, selected from Mixamo. These actions range from basic movements like sitting down and walking to more expressive gestures, such as joyful jumping and annoyed head-shaking. All actions are listed in Appendix A.

动作。每个角色可以执行21种不同的动作,这些动作选自Mixamo。动作范围包括从坐下和行走等基本动作,到高兴跳跃和生气摇头等更具表现力的手势。所有动作均列于附录A中。

Cameras. Following the principles of the “language of film” (Wohl, 2004), we define 9 types of shots, including 3 static shots from various distances (e.g., close-up, medium, and long shots as shown by Cameras 1-3 in Figure 2) and 6 dynamic shots that track or orbit around a character (e.g., the pan shot represented by Camera 4 in Figure 2, zoom shot, arc shot, etc.). The descriptions and views of these static and dynamic shots in Figure 2 are shown in Tables 1 and 4 in Appendix A. In total, all virtual 3D spaces contain 165 static shots and 107 dynamic shots.

摄像机。遵循“电影语言” (Wohl, 2004) 的原则,我们定义了 9 种镜头类型,包括 3 种不同距离的静态镜头(例如,如图 2 中摄像机 1-3 所示的特写、中景和远景)和 6 种跟踪或环绕角色的动态镜头(例如,图 2 中摄像机 4 所代表的平移镜头、缩放镜头、弧线镜头等)。这些静态和动态镜头的描述和视图在图 2 中展示,并在附录 A 中的表 1 和表 4 中列出。总体而言,所有虚拟 3D 空间包含 165 个静态镜头和 107 个动态镜头。

Table 1: Examples of 3 types of static shots in Figure 2, targeted at Position B.

表 1: 图 2 中针对位置 B 的三种静态镜头的示例

| 编号 | 镜头类型 | 描述 |
| --- | --- | --- |
| 1 | 特写镜头 (Close-up Shot, CU) | 应靠近主体,通常包括衣领,体现身份。 |
| 2 | 中景镜头 (Medium Shot, MS) | 应包括姿势(如身体语言)和身体动作(如行走)。 |

Audio. To create more natural and expressive audio, we utilize ChatTTS to generate the speech for each line in the script. The duration of each camera shot and action in the video is synchronized with the length of the corresponding audio segment.

音频。为了创建更加自然和富有表现力的音频,我们利用ChatTTS为脚本中的每一行生成语音。视频中每个镜头和动作的持续时间与相应音频片段的长度同步。
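
下面用一小段 Python 说明“镜头与动作时长同相应语音对齐”的思路:先为每句台词合成语音(论文使用 ChatTTS,此处假设语音已生成为 WAV 文件),再以音频时长驱动时间轴。脚本行的字段名为示意性假设:

```python
import wave

def audio_duration_seconds(wav_path: str) -> float:
    """读取 WAV 文件的时长(秒)。"""
    with wave.open(wav_path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())

def build_timeline(script_lines: list[dict]) -> list[dict]:
    """为每句台词生成 (开始时间, 时长, 镜头, 动作) 的时间轴,
    使镜头与动作的持续时间和对应语音片段长度保持一致。"""
    timeline, t = [], 0.0
    for line in script_lines:  # 每行形如 {"shot": ..., "action": ..., "wav": 音频路径}
        dur = audio_duration_seconds(line["wav"])
        timeline.append({"start": t, "duration": dur,
                         "shot": line["shot"], "action": line["action"]})
        t += dur
    return timeline
```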

With these configurations in place, our virtual 3D spaces can support automatic film production.

通过以上配置,我们的虚拟3D空间可以支持自动影片制作。

3.2 OVERVIEW

3.2 概述

Clear role specialization allows for the breakdown of complex work into smaller and more specific tasks (Li et al., 2023; Hong et al., 2024). In FILMAGENT, we define four main characters: Director, Screenwriter, Actor and Cinematographer, as shown in Figure 3. Each of these roles carries its own set of responsibilities.

清晰的角色分工使得复杂的工作能够分解为更小、更具体的任务 (Li et al., 2023; Hong et al., 2024)。在 FILMAGENT 中,我们定义了四个主要角色:导演 (Director)、编剧 (Screenwriter)、演员 (Actor) 和摄影师 (Cinematographer),如图 3 所示。每个角色都有其特定的职责。

The Director initiates and oversees the entire filmmaking project. This role includes setting character profiles, developing video outlines, providing feedback on the script, engaging in discussions with other crew members, and making final decisions when conflicts arise. The Screenwriter works closely under the Director’s guidance. Its responsibilities go beyond writing dialogue; they also specify the positioning and actions for each line, and continuously update the script to ensure it is coherent, captivating, and well-structured, based on the Director’s critiques. Actors are responsible for making minor adjustments to their lines based on their character profiles, ensuring the dialogue aligns with the characters, and communicating any necessary changes to the Director. Cinematographers select the camera settings for each line according to shot usage guidelines, collaborate with peer cinematographers to compare and discuss these choices, and ensure the appropriateness of camera settings.

导演负责启动并监督整个电影制作项目。该角色包括设定角色档案、制定视频大纲、提供剧本反馈、与其他剧组成员进行讨论,以及在出现冲突时做出最终决定。编剧在导演的指导下紧密合作。其职责不仅限于编写对话;他们还需为每句台词指定定位和动作,并根据导演的批评不断更新剧本,确保其连贯、引人入胜且结构良好。演员负责根据角色档案对台词进行微调,确保对话符合角色,并将必要的修改传达给导演。摄影师根据镜头使用指南为每句台词选择摄像机设置,与同行摄影师合作比较并讨论这些选择,并确保摄像机设置的适当性。

3.3 AGENT COLLABORATION STRATEGIES

3.3 智能体协作策略

In this section, we introduce two collaboration strategies used in this work, including Critique-Correct-Verify (Algorithm 1) and Debate-Judge (Algorithm 2).

在本节中,我们介绍了本工作中使用的两种协作策略,包括 Critique-Correct-Verify (算法 1) 和 Debate-Judge (算法 2)。


Figure 3: Workflow of FILMAGENT. Given a story idea and 3D virtual spaces, the director creates character profiles and a scene outline. Actors, the screenwriter, and the director then collaborate on dialogue and movements. Cinematographers annotate camera setups for each line. Finally, the film is shot within the 3D spaces. LLM-based agents take on various film crew roles, collaborating through Critique-Correct-Verify and Debate-Judge strategies.

图 3: FILMAGENT 的工作流程。给定一个故事创意和 3D 虚拟空间,导演创建角色档案和场景概要。随后,演员、编剧和导演共同协作完成台词和动作。摄影师为每一句台词标注摄影机设置。最后,在 3D 空间内完成电影的拍摄。基于大语言模型的智能体承担各种电影制作人员的角色,通过 Critique-Correct-Verify 和 Debate-Judge 策略进行协作。

Algorithm 1: Critique-Correct-Verify Collaboration

算法 1: Critique-Correct-Verify 协作

Critique-Correct-Verify Collaboration. As outlined in Algorithm 1, this strategy involves two agents working collaboratively. First, the Action agent $\mathbf{P}$ generates a response $\mathbf{R}$ based on the given context $\mathbf{C}$ and instruction $\mathbf{I}$. Next, the Critique agent $\mathbf{Q}$ reviews the response $\mathbf{R}$ and writes critiques $\mathbf{F}$ highlighting potential areas for improvement. The Action agent $\mathbf{P}$ then integrates the critiques and corrects the response. Finally, the Critique agent $\mathbf{Q}$ evaluates the updated response $\mathbf{R}$ to determine whether the critiques $\mathbf{F}$ have been adequately addressed or if further iterations are necessary.

Critique-Correct-Verify协作策略。如算法1所述,该策略涉及两个AI智能体的协作。首先,Action智能体$\mathbf{P}$根据给定的上下文$\mathbf{C}$和指令$\mathbf{I}$生成响应$\mathbf{R}$。接着,Critique智能体$\mathbf{Q}$审查响应$\mathbf{R}$并撰写评语$\mathbf{F}$,指出需要改进的潜在领域。然后,Action智能体$\mathbf{P}$整合评语并修正响应。最后,Critique智能体$\mathbf{Q}$评估更新后的响应$\mathbf{R}$,以确定评语$\mathbf{F}$是否已得到充分解决,或者是否需要进一步迭代。

Algorithm 1 ends by returning the final response $\mathbf{R}$.

算法 1 以返回最终响应 $\mathbf{R}$ 结束。
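
原文算法 1 的伪代码在转载中未能完整保留,这里按上文描述给出一个最小 Python 示意。其中 action_agent、critique_agent 代表一次大语言模型调用,提示词与终止判定均为假设:

```python
MAX_ITERS = 3  # 论文将协作算法的最大迭代次数设为 3

def critique_correct_verify(action_agent, critique_agent, context: str, instruction: str) -> str:
    """Critique-Correct-Verify 协作的示意实现(两个可调用的智能体)。"""
    response = action_agent(f"{context}\n{instruction}")  # P 生成初始响应 R
    for _ in range(MAX_ITERS):
        critique = critique_agent(  # Q 撰写评语 F
            f"请审查以下响应,指出可改进之处:\n{response}")
        response = action_agent(    # P 整合评语并修正响应
            f"{context}\n{instruction}\n评语:\n{critique}\n请据此修正原响应:\n{response}")
        verdict = critique_agent(   # Q 验证评语是否已得到充分解决
            f"评语:\n{critique}\n修正后的响应:\n{response}\n评语是否已得到充分解决?仅回答“是”或“否”。")
        if verdict.strip().startswith("是"):
            break
    return response  # 返回最终响应 R
```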

Debate-Judge Collaboration involves multiple agents who propose their responses and then engage in a debate to persuade each other. A third-party agent ultimately summarizes the discussion and delivers the final judgment. We present the details of our collaboration strategy in Algorithm 2. Two peer agents $\mathbf{P}$ and $\mathbf{Q}$ independently generate their responses, and then provide feedback on each other’s work about the discrepancy during each iteration. After several rounds of debate, the Judgment agent $\mathbf{J}$ concludes the discussion and makes the final decision $\mathbf{R}$ .

辩论-评判协作涉及多个AI智能体,它们提出各自的响应,然后进行辩论以说服对方。最终,第三方智能体总结讨论并做出最终判断。我们在算法2中展示了协作策略的细节。两个对等智能体 $\mathbf{P}$ 和 $\mathbf{Q}$ 独立生成各自的响应,然后在每次迭代中对彼此的工作提供关于差异的反馈。经过几轮辩论后,评判智能体 $\mathbf{J}$ 总结讨论并做出最终决定 $\mathbf{R}$。

Algorithm 2: Debate-Judge Collaboration

算法 2: 辩论-评判协作
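
与算法 1 类似,算法 2 的伪代码同样可用如下最小 Python 示意来理解(轮数控制与提示词为假设):

```python
def debate_judge(agent_p, agent_q, judge, task: str, max_rounds: int = 3) -> str:
    """Debate-Judge 协作的示意实现:P、Q 独立作答并就分歧辩论,评判智能体 J 裁决。"""
    answer_p, answer_q = agent_p(task), agent_q(task)  # 两个对等智能体独立生成响应
    transcript = [f"P: {answer_p}", f"Q: {answer_q}"]
    for _ in range(max_rounds):
        if answer_p == answer_q:  # 无分歧则提前结束辩论
            break
        answer_p = agent_p(f"{task}\n对方的回答:{answer_q}\n请为你的回答辩护,或在必要时修正:{answer_p}")
        answer_q = agent_q(f"{task}\n对方的回答:{answer_p}\n请为你的回答辩护,或在必要时修正:{answer_q}")
        transcript += [f"P: {answer_p}", f"Q: {answer_q}"]
    debate_log = "\n".join(transcript)
    # 评判智能体(在摄影阶段由导演担任)总结辩论并给出最终决定 R
    return judge(f"{task}\n辩论记录:\n{debate_log}\n请总结辩论并给出最终决定。")
```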


Figure 4: The responsibilities of a screenwriter extend beyond writing dialogues; they also involve annotating the corresponding action for each line.

图 4: 编剧的职责不仅仅局限于编写对话,还包括为每一行台词标注相应的动作。

3.4 WORKFLOW

3.4 工作流程

As shown in Figure 3, following the traditional film set workflow, we divide the whole film production process into three sequential stages: idea development, script writing and cinematography, and apply the collaboration strategies discussed earlier in Sec. 3.3. The prompts for each stage are detailed in Appendix D.

如图 3 所示,按照传统的电影制作流程,我们将整个电影制作过程分为三个连续的阶段:创意开发、剧本编写和摄影,并应用了前面 3.3 节中讨论的协作策略。每个阶段的具体提示详见附录 D。

Idea development. From a brief story idea, the director generates various character profiles that could be relevant to the story. The profiles include key attributes such as gender, occupation, and personality traits. Using these profiles and a set of 15 predefined locations in our 3D virtual spaces, the director expands the initial story idea into a detailed scene outline, specifying the where, what, and who of each segment (as illustrated in Figure 1).

创意开发。从一个简短的故事创意出发,导演生成与故事相关的各种角色简介。这些简介包括性别、职业和个性特征等关键属性。利用这些简介和我们3D虚拟空间中的15个预定义地点,导演将初始的故事创意扩展为详细的场景大纲,明确每个部分的地点、事件和人物(如图1所示)。
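
导演产出的角色档案与场景大纲可以视为如下结构化数据(角色与内容为虚构示例,字段名为假设):

```python
character_profiles = [
    {"name": "Dana", "gender": "female", "occupation": "startup founder",
     "personality": "decisive, impatient"},
    {"name": "Alex", "gender": "male", "occupation": "engineer",
     "personality": "calm, detail-oriented"},
]

scene_outline = [
    {"where": "living room",  # 从 15 个预定义地点中选取
     "what": "Dana 与 Alex 因项目延期发生争执",
     "who": ["Dana", "Alex"]},
    {"where": "kitchen",
     "what": "两人冷静下来,商定新的计划",
     "who": ["Dana", "Alex"]},
]
```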

Script writing is a collaborative stage involving the screenwriter, director, and actors and is divided into three parts:

剧本写作是一个涉及编剧、导演和演员的协作阶段,分为三个部分:

(1) The screenwriter drafts the dialogue and annotates an action for each line based on the scene outline. (2) The director reviews the draft and, together with the screenwriter, refines it through the Critique-Correct-Verify cycle. (3) Each actor, in collaboration with the screenwriter, employs the same Critique-Correct-Verify cycle to refine the script. Once the director confirms no further changes are needed, the script is finalized.

(1) 编剧根据场景大纲撰写对白,并为每句台词标注相应的动作;(2) 导演审阅初稿,并与编剧通过“批评-修正-验证”循环加以完善;(3) 每位演员与编剧协作,采用相同的“批评-修正-验证”循环进一步打磨剧本。导演确认无需进一步修改后,剧本即告定稿。

Cinematography is a collaborative process among two peer cinematographers and the director in the Debate-Judge manner to ensure diverse and appropriate camera choices. Cinematographers (agents P and Q) independently assign camera choices to each line of the script. They then engage in a debate to address any discrepancies in their choices. Consider Figure 3 as an example. In this scenario, the cinematographers debate over the best shot, with one preferring a medium shot to capture body language, while the other favors a zoom shot to emphasize Dana’s surprise. Through this debate, the pros and cons of each option are thoroughly explored. After several rounds, the director (the Judgment agent J) summarizes the debate process, resolves any remaining conflicts, and finalizes the camera setup based on the discussion.

摄影是两位同侪摄影师和导演在辩论-评审方式下合作的过程,以确保多样且合适的镜头选择。摄影师(智能体 P 和 Q)独立地为剧本的每一行分配镜头选择,然后通过辩论来解决选择上的分歧。以图 3 为例,在这个场景中,摄影师就最佳镜头进行辩论,一位倾向于使用中景镜头捕捉肢体语言,而另一位则偏爱变焦镜头以突出 Dana 的惊讶。通过辩论,充分探讨了每个选项的优缺点。经过几轮讨论后,导演(评审智能体 J)总结辩论过程,解决剩余冲突,并根据讨论确定最终的镜头设置。

After these stages, each line in the script is specified with the positions of the actors, their actions, and the chosen camera shots. An example of a fully annotated script is displayed in Appendix C. We can simulate the entire script within the constructed 3D environment and begin filming. The duration of each line in the video corresponds to the length of its speech audio.

经过这些阶段后,脚本中的每一行都指定了演员的位置、他们的动作以及选择的摄像机镜头。附录 C 展示了一个完整注释脚本的示例。我们可以在构建的 3D 环境中模拟整个脚本并开始拍摄。视频中每一行的持续时间与其语音音频的长度相对应。
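
作为参考,一行完整标注后的剧本大致包含如下信息(字段名与取值为示意,完整示例见附录 C):

```python
annotated_line = {
    "scene": "living room",
    "speaker": "Dana",
    "text": "You promised this would be done by Friday.",  # 台词(虚构示例)
    "position": "Position B",          # 预定义演员站位之一
    "action": "annoyed head-shaking",  # 21 种预置动作之一
    "shot": "Medium Shot",             # 9 种镜头类型之一,对准说话者
}
```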

4 EXPERIMENTS

4 实验

4.1 EXPERIMENTAL SETUP

4.1 实验设置

Data. We manually brainstorm 15 story ideas that can be implemented within the constraints of our constructed virtual 3D spaces, such as “a quarrel and breakup scene”, “late night brainstorming for a startup” and “casual meet-up with an old friend”.

数据。我们手动构思了15个可以在我们构建的虚拟3D空间限制内实现的故事创意,例如“争吵和分手场景”、“深夜为创业公司头脑风暴”和“与老朋友的随意会面”。

Evaluation Scheme. We evaluate the generated videos across five key aspects: the script’s fidelity to the intended theme, the appropriateness of camera settings, the alignment of the script with actor profiles, the accuracy of actor actions, and the overall plot coherence. In our preliminary study, we found that all scripts faithfully adhered to the intended story ideas. Therefore, we conduct comprehensive human annotations on the remaining four aspects of the videos. We use a 5-point Likert scale to assess the script’s alignment with actor profiles, the appropriateness of camera settings, and the overall plot coherence. The evaluation guideline is in Appendix B. To evaluate the accuracy of actor actions, we randomly select 50 actions from the generated scripts and annotate their accuracy. Finally, we normalize the action accuracy scores to a 0-5 scale and calculate average scores across the four aspects.

评估方案。我们从五个关键方面对生成的视频进行评估:剧本对既定主题的忠实度、摄像机设置的恰当性、剧本与演员档案的契合度、演员动作的准确性以及整体情节的连贯性。在初步研究中,我们发现所有剧本都忠实地遵循了既定的故事创意,因此我们对视频的其余四个方面进行了全面的人工标注。我们使用5点李克特量表(Likert scale)来评估剧本与演员档案的契合度、摄像机设置的恰当性以及整体情节的连贯性,评估指南见附录B。为了评估演员动作的准确性,我们从生成的剧本中随机选取50个动作并标注其准确性。最后,我们将动作准确性得分归一化为0-5分,并计算四个方面的平均得分。
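
分数的归一化与平均方式可以用表 2 中 FILMAGENT (Group) 的数值直接复算:

```python
action_acc = 0.88                         # 动作准确率,取值 0-1(见表 2)
plot, profile, camera = 3.53, 4.44, 3.53  # 三项 5 分制李克特评分(见表 2)

action_score = action_acc * 5             # 归一化到 0-5:0.88 -> 4.4
average = (action_score + plot + profile + camera) / 4
# average = 3.975,四舍五入后即论文报告的 3.98
```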

Baselines. Following the experimental setup of AgentVerse (Chen et al., 2024), to validate the superiority of FILMAGENT in facilitating agent collaboration over standalone agents, we compare it against the following baselines: (1) CoT: A single agent, guided by hints about key stages in the prompt, directly generates the chain-of-thought rationale and produces the complete script. (2) Solo: A single agent is responsible for idea development, script writing, and cinematography, representing our FILMAGENT framework without multi-agent collaboration algorithms. (3) Group, i.e. the full FILMAGENT framework, utilizing multi-agent collaboration. All the experiments are done in a zero-shot setting.

基线。遵循 AgentVerse (Chen et al., 2024) 的实验设置,为了验证 FILMAGENT 在促进智能体协作方面相对于独立智能体的优势,我们将其与以下基线进行比较:(1) CoT:单个智能体在提示词中关于关键阶段的提示引导下,直接生成思维链推理并产出完整剧本。(2) Solo:单个智能体独自负责创意开发、剧本编写和摄影,相当于去掉多智能体协作算法的 FILMAGENT 框架。(3) Group:即完整的 FILMAGENT 框架,采用多智能体协作。所有实验均在零样本设置下进行。

Implementation Details. Our experiments employ the “gpt-4o-2024-05-13” version of OpenAI API to simulate multi-agent virtual film production. For the “o1-preview” model, we access it through the ChatGPT webpage. The maximum number of iterations in multi-agent collaboration algorithms is set to 3.

实现细节。我们的实验使用 OpenAI API 的“gpt-4o-2024-05-13”版本来模拟多智能体虚拟电影制作。对于“o1-preview”模型,我们通过 ChatGPT 网页进行访问。多智能体协作算法的最大迭代次数设置为 3。
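
一次智能体调用的实现方式大致如下。这是基于 OpenAI 官方 Python 客户端的最小示意,角色提示词为假设,并非论文的原始实现:

```python
from openai import OpenAI

client = OpenAI()  # 需预先设置 OPENAI_API_KEY 环境变量

def agent_turn(role_prompt: str, user_prompt: str) -> str:
    """以某一剧组角色(导演/编剧/演员/摄影师)完成一次大语言模型调用。"""
    resp = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # 论文实验所用的模型版本
        messages=[
            {"role": "system", "content": role_prompt},  # 角色设定,如“你是一名电影导演……”
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content
```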

4.2 RESULTS

4.2 结果

The human evaluation results in Table 2 show that FILMAGENT achieves an average score of 3.98 out of 5, validating the effectiveness of FILMAGENT. Agents configured using FILMAGENT (both Solo and Group setups) consistently outperform the standalone CoT agent. This demonstrates the efficacy of decomposing complex tasks into manageable sub-tasks. Comparative analysis between the Solo and Group configurations of FILMAGENT highlights the benefits of the multi-agent framework. FILMAGENT facilitates iterative feedback and revisions through multiple collaboration algorithms, leading to significant improvements across all aspects, especially in plot coherence and the appropriateness of camera settings. Further analysis and detailed case studies underscoring the importance of multi-agent collaboration are provided in Section 4.3.

表 2 中的人工评估结果显示,FILMAGENT 的平均得分为 3.98(满分为 5 分),验证了 FILMAGENT 的有效性。使用 FILMAGENT 配置的 AI 智能体(无论是 Solo 还是 Group 设置)始终优于单独的 CoT 智能体。这表明将复杂任务分解为可管理的子任务的效果显著。FILMAGENT 的 Solo 和 Group 配置之间的比较分析突出了多智能体框架的优势。FILMAGENT 通过多种协作算法促进了迭代反馈和修订,从而在所有方面都取得了显著改进,尤其是在情节连贯性和摄像机设置的合理性方面。第 4.3 节提供了进一步的详细分析和案例研究,强调了多智能体协作的重要性。

Table 2: Comparison of baselines using human annotations for actor actions, overall plot coherence, script alignment with actor profiles, and appropriateness of camera settings. The evaluation metric for Action is accuracy (0-1), while the others use a 5-point Likert scale.

表 2: 使用人工标注对演员动作、整体剧情连贯性、剧本与演员档案的匹配度以及摄像机设置的合理性进行基线比较。动作的评估指标为准确率 (0-1),其他指标则使用 5 点李克特量表。

| 方法 | LLM | 动作 | 剧情 | 档案 | 摄像机 | 平均 |
| --- | --- | --- | --- | --- | --- | --- |
| CoT | GPT-4o | 0.68 | 1.60 | 3.84 | 1.67 | 2.63 |
| CoT | o1 | 0.80 | 2.73 | 3.60 | 2.86 | 3.30 |
| FILMAGENT (Group) | GPT-4o | 0.88 | 3.53 | 4.44 | 3.53 | 3.98 |

Comparison with o1. Recently, OpenAI has released a large reasoning model called o1, optimized for complex multi-step tasks and achieving superior performance compared to GPT-4o (Xu et al., 2025). This advantage is reflected in Table 2: an o1-based CoT agent not only outperforms a GPT-4o-based CoT agent, but also surpasses the single-agent version of FILMAGENT in certain aspects. This highlights o1’s ability to autonomously decompose complex tasks and solve sub-tasks step by step. However, our findings also show that the multi-agent FILMAGENT framework, despite being built on a less advanced GPT-4o foundational model, outperforms the single-agent o1. This demonstrates that a well-coordinated multi-agent system can exceed the performance of a more advanced underlying model.

与 o1 的对比。最近,OpenAI 发布了一个名为 o1 的大型推理模型,专为复杂的多步任务优化,相比 GPT-4o 表现出色 (Xu 等, 2025)。这一优势在表 2 中得以体现:基于 o1 的 CoT 智能体不仅优于基于 GPT-4o 的 CoT 智能体,还在某些方面超越了 FILMAGENT 的单智能体版本。这凸显了 o1 在自主分解复杂任务并逐步解决子任务方面的能力。然而,我们的研究也表明,尽管多智能体 FILMAGENT 框架建立在较不先进的 GPT-4o 基础模型上,但仍优于单智能体 o1。这证明了协调良好的多智能体系统可以超越更先进的基础模型的性能。

4.3 PREFERENCE ANALYSIS

4.3 偏好分析

To further analyze the effectiveness of multi-agent collaboration, we compare 15 scripts before and after Critique-Correct-Verify, including the Director-Screenwriter Discussion (referred to as Script writing #2, representing the second stage of script writing) and the Actor-Director-Screenwriter Discussion (referred to as Script writing #3, representing the third stage of script writing). Additionally, we examine 50 randomly selected modifications on the camera choices before and after Debate-Judge in the Cinematography Stage (denoted as Cinematography). For each case, we determine whether the updated version "wins," "loses," or "ties" compared to the original version.

为了进一步分析多智能体协作的有效性,我们比较了15个剧本在 Critique-Correct-Verify 前后的变化,包括导演-编剧讨论(称为剧本写作#2,代表剧本写作的第二阶段)和演员-导演-编剧讨论(称为剧本写作#3,代表剧本写作的第三阶段)。此外,我们还检查了摄影阶段(记为 Cinematography)中 Debate-Judge 前后随机选取的50处镜头选择修改。对于每个案例,我们判定更新后的版本相对原始版本是“胜出”、“落后”还是“持平”。

Figure 5 presents the winning rates of revised scripts and reveals a clear preference by human evaluators for the revised scripts over the original versions. These results highlight the effectiveness of iterative feedback and verification in multi-agent collaboration strategies, as demonstrated by the four cases in Table 3. For the script writing stage, as illustrated by Case #1, the Director-Screenwriter discussion reduces hallucinations of non-existent actions (e.g., "standing suggest"), enhances plot coherence, and ensures consistency across scenes. Case #2 shows that the Actor-Director-Screenwriter discussion improves the alignment of dialogue with character profiles.

图 5 展示了修订剧本的胜率,并揭示人类评估者对修订剧本的明显偏好。这些结果凸显了多智能体协作策略中迭代反馈和验证的有效性,如表 3 中的四个案例所示。在剧本写作阶段,如案例 #1 所示,导演-编剧讨论减少了不存在动作(例如 "standing suggest")的幻觉,增强了情节连贯性,并确保场景之间的一致性。案例 #2 表明,演员-导演-编剧讨论提高了对话与角色档案的契合度。


Figure 5: Compared with the original version, the win, tie, and lose rates of the updated script and camera choices after multi-agent collaboration.

图 5: 与原始版本相比,多智能体协作后更新脚本和相机选择的胜率、平局率和负率。

For the Debate-Judge method in cinematography, Case #3 demonstrates the correction of an inappropriate dynamic shot, which is replaced with a medium shot to better convey body language. Case #4 replaces a series of identical static shots with a mix of dynamic and static shots, resulting in a more diverse camera setup.

对于电影摄影中的辩论-评判方法,案例 #3 展示了对不恰当动态镜头的修正,将其替换为中景镜头以更好地传达肢体语言。案例 #4 将一系列相同的静态镜头替换为动态与静态镜头的组合,从而形成更加多样化的摄像机设置。

5 DISCUSSION AND FUTURE WORK

5 讨论与未来工作

Comparison with Sora. Sora is a video generation tool developed by OpenAI, designed to create high-quality videos from text prompts, images or existing videos (Cho et al., 2024).

与 Sora 的比较。Sora 是由 OpenAI 开发的视频生成工具,旨在根据文本提示、图像或现有视频创建高质量视频 (Cho 等人, 2024)。

Table 3: Comparisons of the scripts and camera settings before (left) and after (right) multi-agent collaboration, with excerpts from their discussion process. Case #1 and #2 are from the Critique-Correct-Verify method in Script writing #2 and #3 stages respectively. Case #3 and #4 are from the Debate-Judge method in Cinematography.

表 3: 多智能体协作前(左)后(右)的脚本和摄像机设置对比,并摘录了他们的讨论过程。案例 #1 和 #2 分别来自剧本编写 #2 和 #3 阶段的 Critique-Correct-Verify 方法。案例 #3 和 #4 来自摄影阶段的 Debate-Judge 方法。

We experiment with the storyboard function on Sora’s official website, which allows users to describe what they want to happen at a specific time in the generated video. Specifically, we utilize the director’s planned scenes in FILMAGENT as prompts for each segment of the video. Given that the maximum duration for Sora’s generated video is currently 10 seconds, we allocate 3-4 seconds per scene on the storyboard timeline.

我们使用 Sora 官方网站上的故事板功能进行实验,该功能允许用户描述生成视频中特定时间点想要发生的内容。具体来说,我们利用 FILMAGENT 中导演规划好的场景作为视频每个片段的提示。鉴于目前 Sora 生成视频的最大时长为 10 秒,我们在故事板时间线上为每个场景分配 3-4 秒。


Figure 6: Comparison of videos showing “a quarrel and breakup scene” produced by FILMAGENT and Sora. Sora demonstrates excellent adaptability to various scenes, styles, and shots, while FILMAGENT can produce coherent, physics-compliant videos with storytelling capabilities.

图 6: FILMAGENT 和 Sora 生成的“争吵和分手场景”视频对比。Sora 展示了其对各种场景、风格和镜头的出色适应性,而 FILMAGENT 能够生成连贯、符合物理规律且具有叙事能力的视频。

As illustrated in Figure 6, we compare the videos produced by FILMAGENT with those generated by Sora, and analyze their complementary strengths and weaknesses: Sora excels at quickly adapting to diverse scenes, realistic styles, and various shots (e.g., the close-up of the woman in the first scene to convey her anger). This makes Sora a useful tool for video creators seeking rapid brainstorming and idea validation. In contrast, FILMAGENT requires pre-built virtual 3D spaces, characters and cameras. However, Sora faces several challenges: (1) Inconsistencies: The generated videos sometimes fail to align with the text instructions. For example, in the second scene, the prompt specifies that only two main characters should be involved, yet there are four. Additionally, we observe character inconsistencies across frames during our tests. (2) Non-compliance with physics: There are strange artifacts that defy real-world physics. For example, in the second scene, the woman’s face and right hand blend together unnaturally, and then another phone suddenly appears in her right hand. (3) Limited storytelling capability: Due to short video durations and lack of variation, Sora struggles to convey complete stories. In the third scene, only a close-up of the man talking is shown, with minimal variation between frames (just lips moving), and no subtitles or audio to indicate the dialogue, making it hard to follow the plot. In contrast, FILMAGENT effectively addresses these issues by utilizing 3D spaces in game engines and a collaborative workflow, ensuring coherence and a more comprehensive storytelling capability.

如图 6 所示,我们将 FILMAGENT 生成的视频与 Sora 生成的视频进行了比较,并分析了它们的互补优势和不足:Sora 擅长快速适应多样场景、写实风格和各种镜头(例如,第一个场景中女性的特写镜头以传达她的愤怒)。这使得 Sora 成为视频创作者寻求快速头脑风暴和创意验证的有用工具。相比之下,FILMAGENT 需要预建的虚拟 3D 空间、角色和摄像机。然而,Sora 面临几个挑战:(1) 不一致性:生成的视频有时无法与文本指令对齐。例如,在第二个场景中,提示指定只应涉及两个主要角色,然而却有四个。此外,我们在测试中观察到帧之间的角色不一致。(2) 不符合物理规律:存在一些违背现实世界物理规律的奇怪伪影。例如,在第二个场景中,女性的脸和右手不自然地融合在一起,然后她的右手中突然出现了另一部手机。(3) 叙事能力有限:由于视频持续时间短且缺乏变化,Sora 难以传达完整的故事。在第三个场景中,只展示了男性谈话的特写镜头,帧之间的变化很小(只有嘴唇在动),并且没有字幕或音频来指示对话,这使得情节难以理解。相比之下,FILMAGENT 通过利用游戏引擎中的 3D 空间和协作工作流程,有效地解决了这些问题,确保了连贯性和更全面的叙事能力。

Limitations. The primary limitation of our system is its reliance on predefined virtual 3D spaces with limited action spaces and preset camera settings. Recent advancements in 3D scene synthesis, motion, and camera adjustments driven by textual instructions (Qing et al., 2023; Jiang et al., 2024; Hu et al., 2024b) provide more flexible and dynamic alternatives. Future research could integrate these adaptable components into the FILMAGENT framework. Additionally, there are other important areas for improvement: (1) Fine-Grained Control: The current system lacks precise control over actions and camera settings. Annotating actions and camera movements at the line level is too coarse-grained, as a single line of script may involve multiple character actions and camera transitions. (2) Multimodal LLM Integration: Film automation is inherently a multimodal task requiring visual inputs. Incorporating multimodal LLMs presents a promising direction for improving the accuracy of feedback and verification processes (Xu et al., 2024; Li et al., 2024a). (3) Expanded Crew Roles: To create a video that meets the standards of a “film”, essential crew roles such as music composition, color grading, and video editing need to be included.

局限性。我们系统的主要局限性在于它依赖于预定义的虚拟3D空间,这些空间的动作空间有限且相机设置是预设的。最近由文本指令驱动的3D场景合成、运动和相机调整的进展(Qing et al., 2023; Jiang et al., 2024; Hu et al., 2024b)提供了更灵活和动态的替代方案。未来的研究可以将这些适应性组件整合到FILMAGENT框架中。此外,还有其他重要的改进领域:(1) 细粒度控制:当前系统缺乏对动作和相机设置的精确控制。在行级别标注动作和相机移动过于粗糙,因为单行脚本可能涉及多个角色动作和相机过渡。(2) 多模态大语言模型集成:电影自动化本质上是一种需要视觉输入的多模态任务。集成多模态大语言模型为提高反馈和验证过程的准确性提供了一个有前景的方向(Xu et al., 2024; Li et al., 2024a)。(3) 扩展的剧组角色:为了制作符合“电影”标准的视频,需要包括音乐创作、色彩分级和视频剪辑等必不可少的剧组角色。

6 CONCLUSION

6 结论

We present FILMAGENT, an LLM-based multi-agent framework that automates end-to-end film production in virtual 3D spaces. This framework incorporates our constructed 3D spaces, simulates efficient human workflows, and employs multi-agent collaboration strategies. Extensive human evaluations rate the videos produced by FILMAGENT with an average score of 3.98 out of 5, underscoring its effectiveness. Further analysis shows that multi-agent collaboration significantly enhances script quality, improves camera selection, and reduces hallucination errors. These findings demonstrate the potential of FILMAGENT to advance film automation through multi-agent systems.

我们推出FILMAGENT,这是一个基于大语言模型的多智能体框架,能够在虚拟3D空间中自动化端到端的电影制作。该框架结合了我们构建的3D空间,模拟高效的人类工作流程,并采用了多智能体协作策略。大量的人工评估显示,FILMAGENT制作的视频平均得分为3.98分(满分5分),突显其有效性。进一步分析表明,多智能体协作显著提升了剧本质量,改进了镜头选择,并减少了幻觉错误。这些发现展示了FILMAGENT通过多智能体系统推动电影自动化的潜力。

REFERENCES

参考文献

Aimone Bodini, Arthi Manohar, Federico Colecchia, David Harrison, and Vanja Garaj. Envisioning the future of virtual production in filmmaking: A remote co-design study. Multimedia Tools and Applications, 83(7):19015–19039, February 2024.

Aimone Bodini, Arthi Manohar, Federico Colecchia, David Harrison, and Vanja Garaj. 展望电影制作中虚拟制作的未来:一项远程共设计研究. Multimedia Tools and Applications, 83(7):19015–19039, February 2024.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=EHg5GDnyq1.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, 和 Jie Zhou. Agentverse:促进多智能体协作与探索涌现行为. 在第十二届国际学习表征会议 (The Twelfth International Conference on Learning Representations) 上, 2024. URL https://openreview.net/forum?id=EHg5GDnyq1.

Joseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng, Jingyao Zheng, Lik-Hang Lee, Tae-Ho Kim, Choong Seon Hong, and Chaoning Zhang. Sora as an agi world model? a complete survey on text-to-video generation, 2024. URL https://arxiv.org/abs/2403.05131.

Joseph Cho、Fachrina Dewi Puspitasari、Sheng Zheng、Jingyao Zheng、Lik-Hang Lee、Tae-Ho Kim、Choong Seon Hong 和 Chaoning Zhang。Sora作为通用人工智能世界模型?文本到视频生成的全面调查,2024。URL https://arxiv.org/abs/2403.05131

Marc Christie, Patrick Olivier, and Jean-Marie Normand. Camera control in computer graphics. Computer Graphics Forum, 27(8):2197–2218, 2008. doi: https://doi.org/10.1111/j.1467-8659.2008.01181.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8659.2008.01181.x.

Marc Christie, Patrick Olivier, 和 Jean-Marie Normand. 计算机图形学中的相机控制. 计算机图形学论坛, 27(8):2197–2218, 2008. doi: https://doi.org/10.1111/j.1467-8659.2008.01181.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8659.2008.01181.x.

Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. LM vs LM: Detecting factual errors via cross examination. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12621–12640, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.778. URL https://aclanthology.org/2023.emnlp-main.778.

Roi Cohen, May Hamri, Mor Geva, 和 Amir Globerson。《大语言模型与大语言模型:通过交叉检验检测事实错误》。收录于 Houda Bouamor, Juan Pino, 和 Kalika Bali 编辑的《2023年自然语言处理实证方法会议论文集》,第12621–12640页,新加坡,2023年12月。计算语言学协会。doi: 10.18653/v1/2023.emnlp-main.778。URL https://aclanthology.org/2023.emnlp-main.778

Edirlei E. S. de Lima, Cesar T. Pozzer, Marcos C. d’Ornellas, Angelo E. M. Ciarlini, Bruno Feijó, and Antonio L. Furtado. Virtual cinematography director for interactive storytelling. In Proceedings of the International Conference on Advances in Computer Entertainment Technology, ACE ’09, pp. 263–270, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605588643. doi: 10.1145/1690388.1690432. URL https://doi.org/10.1145/1690388.1690432.

Edirlei E. S. de Lima, Cesar T. Pozzer, Marcos C. d’Ornellas, Angelo E. M. Ciarlini, Bruno Feijó 和 Antonio L. Furtado. 虚拟电影导演在交互式叙事中的应用。在《国际计算机娱乐技术进展会议论文集》中,ACE ’09,第 263–270 页,美国纽约,2009 年。计算机协会。ISBN 9781605588643。doi: 10.1145/1690388.1690432。URL https://doi.org/10.1145/1690388.1690432

Li-wei He, Michael F. Cohen, and David H. Salesin. The virtual cinematographer: a paradigm for automatic real-time camera control and directing. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pp. 217–224, New York, NY, USA, 1996. Association for Computing Machinery. ISBN 0897917464. doi: 10.1145/237170.237259. URL https://doi.org/10.1145/237170.237259.

Li-wei He、Michael F. Cohen 和 David H. Salesin。虚拟摄影师:自动实时相机控制和导演的范式。在《第23届计算机图形与交互技术年会论文集》(SIGGRAPH '96)中,第217-224页,美国纽约州纽约市,1996年。计算机械协会。ISBN 0897917464。doi: 10.1145/237170.237259。URL https://doi.org/10.1145/237170.237259

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. 在第十二届国际学习表征会议 (The Twelfth International Conference on Learning Representations) 上, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.

Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A preliminary case study with claude 3.5 computer use, 2024a. URL https://arxiv.org/abs/2411.10323.

Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. GUI智能体的黎明:与Claude 3.5计算机使用的初步案例研究, 2024a. URL https://arxiv.org/abs/2411.10323.

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. Scenecraft: An llm agent for synthesizing 3d scene as blender code, 2024b. URL https://arxiv.org/abs/2403.01248.

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. Scenecraft: 一个用于合成3D场景为Blender代码的大语言模型智能体, 2024b. URL https://arxiv.org/abs/2403.01248.

Hongda Jiang, Bin Wang, Xi Wang, Marc Christie, and Baoquan Chen. Example-driven virtual cinematography by learning camera behaviors. ACM Trans. Graph., 39(4), aug 2020. ISSN 0730-0301. doi: 10.1145/3386569.3392427. URL https://doi.org/10.1145/3386569.3392427.

Hongda Jiang, Bin Wang, Xi Wang, Marc Christie, 和 Baoquan Chen。通过学习相机行为实现示例驱动的虚拟电影摄影。ACM Trans. Graph., 39(4), 2020年8月。ISSN 0730-0301。doi: 10.1145/3386569.3392427。URL https://doi.org/10.1145/3386569.3392427

Hongda Jiang, Marc Christie, Xi Wang, Libin Liu, Bin Wang, and Baoquan Chen. Camera keyframing with style and control. ACM Trans. Graph., 40(6), December 2021. ISSN 0730-0301. doi: 10.1145/3478513.3480533. URL https://doi.org/10.1145/3478513.3480533.

Hongda Jiang, Marc Christie, Xi Wang, Libin Liu, Bin Wang, and Baoquan Chen. 风格与控制相结合的相机关键帧技术。ACM Trans. Graph., 40(6), December 2021. ISSN 0730-0301. doi: 10.1145/3478513.3480533. URL https://doi.org/10.1145/3478513.3480533.

Hongda Jiang, Xi Wang, Marc Christie, Libin Liu, and Baoquan Chen. Cinematographic camera diffusion model, 2024.

Hongda Jiang, Xi Wang, Marc Christie, Libin Liu, 和 Baoquan Chen。电影摄影相机扩散模型,2024。

Manolya Kavakli and Cinzia Cremona. The virtual production studio concept – an emerging game changer in filmmaking. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 29–37, 2022. doi: 10.1109/VR51125.2022.00020.

Manolya Kavakli 和 Cinzia Cremona. 虚拟制作工作室概念——电影制作中的新兴变革者。在 2022 IEEE 虚拟现实与 3D 用户界面会议 (VR) 中,第 29-37 页,2022。doi: 10.1109/VR51125.2022.00020。

Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. Computational video editing for dialogue-driven scenes. ACM Trans. Graph., 36(4), jul 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073653. URL https://doi.org/10.1145/3072959.3073653.

Mackenzie Leake, Abe Davis, Anh Truong, 和 Maneesh Agrawala. 对话驱动场景的计算视频编辑. ACM Trans. Graph., 36(4), 2017年7月. ISSN 0730-0301. doi: 10.1145/3072959.3073653. URL https://doi.org/10.1145/3072959.3073653.

Rob Legato and Caleb Deschanel. Disney presents: the making of the lion king. In ACM SIGGRAPH 2019 Productio