Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback


Figure 1. Motivated by the challenge of achieving all-modality human preference alignment, particularly the limitations of binary preferences in accurately reflecting human preferences, we introduce align-anything: Data: align-anything-200k, which covers text, image, audio, and video modalities and 8+ specific subtasks, annotated with preference and language feedback; Algorithm: improving all-modality alignment by learning from language feedback; Evaluation: encompassing all-modality understanding and generation.
Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions – such as instruction following – becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., models with input and output in any modality, also named any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Second, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains an unexplored area. Finally, there is a lack of a systematic framework to evaluate the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes meticulously annotated 200k all-modality human preference data. We then introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model's instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework – eval-anything. All data, models, and code have been open-sourced for the community. For more details, please refer to https://github.com/PKUAlignment/align-anything.

Figure 2. Composition and distribution of align-anything-200k. Our dataset comprises 8 subtasks across text, image, audio, and video modalities. Each modality exhibits distinct semantic features and distribution patterns, covering various latent spaces. This highlights that all-modality alignment cannot rely solely on data from specific modalities; rather, it requires the integration of data across modalities.
1. Introduction
Our world is inherently multimodal [62, 100, 126]. Humans perceive the world through various sensory organs, acquiring information in multiple modalities such as text, images, audio, video, and others. These different forms of information often complement and interact with each other. Each sensory channel has unique advantages in conveying specific concepts and enhancing our understanding of the world. With the success of large language models (LLMs) [1, 7, 128], researchers aim to extend these models to handle multiple modalities, enabling them to perceive and generate any modality [5, 65, 98, 120]. This would allow the models to respond using the most appropriate modality, achieving truly human-like AI [16, 34, 39, 52, 64, 104, 108, 109, 117].
Consequently, the research community has actively developed foundational models capable of handling arbitrary modalities [58, 108]. Building on LLMs, prior work uses individual image/text encoders or domain-specific decoders for input or output processing [65, 108, 124, 135], leveraging the MoE architecture [60] and diffusion techniques [41]. In this line of work, each modality is encoded by a single encoder, with non-text modality information mapped to the text space via projection layers. Additionally, Chameleon [98] has experimented with encoding images during the pre-training phase using fully token-based representations to handle both image and text modalities. However, handling more modalities remains a significant challenge.
On the other hand, reinforcement learning from human feedback (RLHF) plays a significant role in aligning models with human intentions [81]. GPT-4 [1] has effectively boosted the model's instruction following using RLHF. LLaMA series models [99] have significantly improved performance in code, mathematics, and reasoning through post-training methods, e.g., DPO [87]. However, this line of work is limited to a single modality. Researchers have tried to apply RLHF or DPO directly to multimodal scenarios, such as models with text and image modalities [95, 120]. However, these methods, which rely on aligning responses with binary human preferences (e.g., one answer is better than another) [136], struggle to be effective with more complex and diverse modalities. ImageBind [38] proposes using single-modality information as a binding, such as using images as the bridging feature to learn joint embeddings. So, how can we establish unified preference modeling that remains effective across any modalities?
All-modality models refer to models capable of accepting input and generating output in any modality, essentially functioning as any-to-any models. Language can serve not only as a binding across different modalities but also as a natural human preference carrier. In this work, we propose utilizing language feedback to unify human feedback across all modalities, marking the first attempt to extend RLHF/DPO to arbitrary modality space and promoting the development of a general all-modality model. Overall, our work makes the following contributions:
• Data (Sec. 2): The first all-modality human preference dataset – align-anything-200k – on text, image, audio, and video modalities. Utilizing a two-stage human annotation process, this dataset accurately captures genuine human preferences, aiming to improve the alignment of all-modality models with human intentions.
• Algorithm (Sec. 3): The first algorithm applicable to enhance RLHF/DPO across all modalities: learning from language feedback (LLF). Our empirical results demonstrate that LLF enhances RLHF performance across 5 modalities, 5 open-source models, and 7 popular benchmarks, achieving an average 5.83 times improvement over the baseline RLHF.
• Evaluation (Sec. 4): To our knowledge, there is no system specifically designed for all-modality model evaluation. To address this gap, we propose the first evaluation tool (eval-anything), tailored for all-modality evaluation. This tool not only meticulously constructs evaluation tasks across various modalities but also emphasizes assessing the unique features of all-modality models, such as modality selection and synergy.
• Framework and Open Source: To further the alignment of the all-modality model with human intentions, we have released all of our resources to the public.

Response A: The person is using some kind of object. Response B: $\checkmark$ The person in the video is blowing a tube-shaped object to imitate the sound of a train whistle.

Preference Dimensions
Language Feedback
Critique: Response A is not very helpful … the fact that they are blowing into a tube-like object (Video) and the possible train whistle sound (Audio) being introduced in the audio.
Refinement: The response should focus on information in the audio and video: the person in the picture is blowing a flute to imitate the sound of a train whistle.
Figure 3. All-modality preference and language feedback annotation of align-anything-200k. For all-modality preference annotation, we classify the instruction-following metrics into two categories: modality-agnostic and modality-specific. Each fine-grained dimension is assigned a corresponding score along with a rationale. Additionally, we offer detailed language feedback, including critiques and refinement suggestions, which integrate information from multiple modalities within the responses.
2. Datasets
A primary challenge lies in the fact that existing preference datasets are predominantly focused on single-modal tasks [26, 52, 73, 120], lacking comprehensive datasets that encompass all modalities, thus limiting further research progress.
In response, we open-source the first all-modality human preference dataset – align-anything-200k – to enhance the instruction-following capabilities of all-modality models across tasks such as question answering, complex reasoning, etc. During the annotation process, we observed that introducing additional modal information significantly reduced binary human preference consistency (as shown in Tab. 1). Therefore, in addition to annotating fine-grained binary preferences, we introduce language feedback as an additional constraint to more accurately capture human preferences. In short, our dataset includes the following:
• All-modality Human Preference Dataset. We annotate human preference data across various modalities (including text, image, video, and audio) with a focus on instruction-following capabilities (Fig. 2).
• One More Thing – Language Feedback. In addition to the widely used binary preferences, we incorporate language feedback (critique and refinement) to improve human preference consistency across all modalities.
2.1. All-Modality Human Preference Dataset
The align-anything-200k dataset aims to capture human preferences across 8 subtasks of all modalities, as shown in Fig. 2. With the increase in modalities, the complexity and inconsistency of human intent preferences rise significantly, driven by the unique semantic characteristics of each modality. To achieve consistent all-modality preference modeling, we decouple annotation targets into two types, which serve as evaluation metrics for instruction-following:
• Modality-agnostic refers to dimensions that are applied universally across modalities, including: (1) Prompt adherence, which requires responses to be consistent with the input prompts, accurately reflecting the specified elements. (2) Rule conformity, where responses adhere to logical, physical, or scientific principles related to the scenario or theme described in the prompt. (3) Information richness, which emphasizes that responses should be thorough and detailed in addressing the query.
• Modality-specific represents preference dimensions tailored to the characteristics of each modality. For example, in video-related subtasks, we introduce three additional dimensions – temporal consistency, content coherence, and motion naturalness – to better evaluate details of the output such as duration and dynamics.
By decomposing the instruction-following dimensions, we establish the evaluation standards for all-modality preference annotation. More details about the annotation document can be found in Appendix: Datasets.
2.2. One More Thing – Language Feedback
During the annotation process, human preference consistency declines when binary preference annotations are conducted directly. As illustrated in Tab. 1, the introduction of different modalities makes it challenging for binary preferences to fully capture human preference [47, 59, 90]. To address this issue, we introduce language feedback, utilizing natural language to describe discrepancies between the model's response and the standard criteria, thereby improving the consistency of human annotation.
Table 1. Human preference agreement across different modalities. The results demonstrate that traditional annotation struggles to capture modality-specific details, whereas language feedback enhances consistency by conveying fine-grained human preferences effectively.
| | T2T | TI2T | T2I | TI2TI | TV2T | T2V | TA2T | T2A |
|---|---|---|---|---|---|---|---|---|
| w/o Language Feedback (%) | 73.2 ± 2.1 | 62.5 ± 7.3 | 63.2 ± 6.3 | 62.7 ± 7.7 | 62.1 ± 5.7 | 60.8 ± 4.6 | 61.8 ± 6.9 | 58.7 ± 5.8 |
| w/ Language Feedback (%) | 74.1 ± 1.2 | 70.5 ± 3.4 | 69.2 ± 2.2 | 68.3 ± 2.1 | 67.2 ± 1.8 | 65.8 ± 1.2 | 68.6 ± 2.4 | 64.3 ± 1.7 |
Specifically, we provide language feedback for each response, which includes two parts: critique, which assesses the strengths and weaknesses of the response based on detailed criteria, and refinement, which offers specific improvement suggestions to enhance the models' instruction-following capabilities. Through the quality control process with annotators and training verification (Sec. 3), we identify two key advantages of using language feedback: (1) as a natural carrier of human preferences, it clearly identifies fine-grained defects and areas for improvement in the responses, providing clearer guidance compared to binary preferences [10]; (2) as a connecting binding, it bridges human intent across different modalities and creates a unified, modality-agnostic preference modeling approach, capable of handling more modalities than typical methods. Further examples can be found in Appendix: Datasets.
2.3. Dataset Construction
2.3. 数据集构建
We collect initial prompts from 24 multimodal datasets and refine them based on modality-agnostic and modality-specific dimensions with existing multimodal models (e.g., GPT-4o [80]). For example, we emphasize dynamic temporal aspects for video or quality characteristics for audio to better adapt to each modality. Finally, we gather the responses from 27 models for the 8 subtasks. More details can be found in Appendix: Datasets.
Annotation Process Inspired by PKU-SafeRLHF [48], we utilize a Human-AI joint annotation process to conduct preference annotations across fine-grained dimensions for each subtask. As shown in Fig. 3, for each fine-grained dimension’s binary preferences, we assign scores from 0 to 3 following strict guidelines and detailed rationales. Additionally, we collect language feedback from human annotators and API-based models. The annotators provide language feedback for each response by defining the critique scope, performing the critique, offering refinement suggestions, and organizing the critique and refinements into comprehensive language feedback. As shown in Fig. 3, language feedback provides more fine-grained improvement directions for cross-modal responses, serving as a carrier for expressing natural human preferences across modalities.
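As a concrete illustration, a single annotated pair in this scheme might be stored as follows. Field names are hypothetical sketches, not the released dataset's actual schema:

```python
# Hypothetical sketch of one align-anything-200k annotation record.
# Field names are illustrative; the released dataset's schema may differ.
record = {
    "prompt": "What is the person in the video doing?",
    "modality": "TV2T",  # text+video input, text output
    "responses": {
        "A": "The person is using some kind of object.",
        "B": "The person in the video is blowing a tube-shaped object "
             "to imitate the sound of a train whistle.",
    },
    # Per-dimension scores on the 0-3 scale (rationales omitted here).
    "scores": {
        "A": {"prompt_adherence": 1, "information_richness": 0},
        "B": {"prompt_adherence": 3, "information_richness": 2},
    },
    "preferred": "B",
    # Unified language feedback: critique plus refinement suggestion.
    "feedback": {
        "A": {
            "critique": "Response A is not very helpful; it ignores the "
                        "tube-like object (video) and the train-whistle "
                        "sound (audio).",
            "refinement": "Describe what is seen and heard: the person is "
                          "blowing a tube-shaped object to imitate a train "
                          "whistle.",
        }
    },
}

# The binary preference is consistent with the per-dimension score totals.
total = {r: sum(s.values()) for r, s in record["scores"].items()}
assert max(total, key=total.get) == record["preferred"]
```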
Human and AI Agreement Analysis We integrate API-based multimodal models (e.g., GPT-4o, and others) into human evaluation for binary preference annotations. To analyze the consistency of the dataset with human preferences, we sample 100 data pairs from each subtask, ask 10 annotators to perform pairwise comparisons, and statistically measure the preference consistency, as shown in Tab. 1. The results demonstrate that with the traditional binary preference annotation pipeline, agreement tends to decline as multimodal information is incorporated. This suggests that the conventional pipeline struggles to scale effectively across all modalities. Because language-based feedback offers targeted and fine-grained human preference information, higher consistency is achieved across all modalities.
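The agreement statistic in Tab. 1 can be computed as the fraction of annotator pairs that choose the same preferred response; a minimal sketch with made-up labels (the paper does not specify its exact agreement formula, so this is one plausible reading):

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of annotator pairs that agree on the preferred response.

    labels: per-annotator choices ("A" or "B") for one data pair.
    """
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Example: 10 annotators judge one response pair; 7 prefer "A".
labels = ["A"] * 7 + ["B"] * 3
rate = pairwise_agreement(labels)  # (C(7,2)+C(3,2)) / C(10,2) = 24/45
print(round(rate, 3))  # -> 0.533
```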
3. Learning from Language Feedback
In this section, we introduce learning from language feedback (LLF). It utilizes language feedback to optimize responses, synthesizing preference data that enhances all-modality alignment performance. First, we review the RLHF pipeline, including PPO [81] and DPO [87], highlighting the limitations of binary preference feedback. Then we demonstrate how to practically implement LLF, including its two main stages: feedback modeling and self-improving. Finally, we empirically verify that LLF achieves an average 5.83 times improvement across 5 modalities, 5 open-source models, and 7 popular benchmarks.
3.1. Background and Preliminary
PPO consists of two main stages: step 1, preference modeling, and step 2, policy optimization. The former involves the collection of comparison data, essential for training the reward model $r_{\mathrm{RM}}(\cdot\mid\cdot)$. The process starts with response pairs $(y_{1},y_{2})$ generated by the initial model from shared prompts $x$. Human annotators are then tasked with selecting their preferred response from each pair, denoted as $y_{w}\succ y_{l}\mid x$, where $y_{w}$ and $y_{l}$ denote the preferred and dispreferred answers. The latter (step 2) is guided by $r_{\mathrm{RM}}(\cdot\mid\cdot)$. This process is commonly modeled as a bandit setting, where a reward is obtained from $r_{\mathrm{RM}}$ at the end of each response. The RL objective is,
$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\left[r_{\mathrm{RM}}(y\mid x)\right]-\beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_{\theta}(\cdot\mid x)\,\|\,\pi_{\theta_{0}}(\cdot\mid x)\right]$$
DPO directly fine-tunes the model to align with preference pairs $y_{w}\succ y_{l}\mid x$. It consolidates the two RLHF stages of preference modeling and policy optimization into one stage. Let $\theta_{0}$ denote the initial model parameters. The optimization objective can be rewritten as,
$$\mathcal{L}_{\mathrm{DPO}}(\theta)=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\theta_{0}}(y_{w}\mid x)}-\beta\log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\theta_{0}}(y_{l}\mid x)}\right)\right]$$
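A minimal sketch of the DPO objective for a single preference pair, assuming per-sequence log-probabilities (summed over tokens) under the policy and the frozen initial model have already been computed:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair y_w > y_l | x.

    Inputs are per-sequence log-probabilities (summed over tokens) under
    the policy (theta) and the frozen initial model (theta_0).
    """
    # Implicit reward margin between the chosen and rejected responses.
    logit = beta * ((policy_logp_w - ref_logp_w)
                    - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(logit)), written in a numerically stable form.
    return math.log1p(math.exp(-logit))

# Toy example: the policy already slightly favors y_w over the reference.
loss = dpo_loss(-10.0, -11.0, -10.5, -10.5)
print(round(loss, 4))  # -> 0.6444
```

In practice this would be averaged over a batch; minimizing it pushes the policy to widen the margin between chosen and rejected responses relative to the initial model.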
Figure 4. Learning from language feedback pipeline: (1) Feedback Modeling: we perform SFT on the initial model using annotated language feedback. (2) Self Improving: the initial model optimizes responses given the language feedback to synthesize preference pairs.
Table 2. Main experiment results. We validate that LLF can enhance the performance of RLHF across 5 modalities, 5 open-source models, and 7 popular benchmarks, achieving an average 5.83 times improvement over the baseline RLHF.
| Modality | Benchmark | Focus | Initial Model | Initial Score | DPO | DPO+LLF | PPO | PPO+LLF |
|---|---|---|---|---|---|---|---|---|
| TI2T | LLaVA-Bench [65] | Visual QA | LLaVA-1.5-7B | 90.03 | 88.20 (-1.83) | 97.36 (+7.33) | 92.67 (+2.64) | 98.76 (+8.73) |
| TI2T | LLaVA-Bench [65] | Visual QA | LLaVA-1.5-13B | 90.46 | 98.36 (+7.90) | 100.33 (+9.87) | 95.63 (+5.17) | 100.59 (+10.13) |
| TI2T | MIA-Bench [84] | Hierarchical visual QA | LLaVA-1.5-7B | 61.15 | 64.30 (+3.15) | 65.32 (+4.17) | 70.94 (+9.79) | 72.59 (+11.44) |
| TI2T | MIA-Bench [84] | Hierarchical visual QA | LLaVA-1.5-13B | 70.34 | 70.72 (+0.38) | 75.43 (+5.09) | 76.57 (+6.23) | 80.26 (+9.92) |
| T2I | ImageReward [113] | Fine-grained preference | Chameleon-7B | -0.80 | -0.65 (+0.15) | -0.46 (+0.34) | -0.52 (+0.28) | -0.44 (+0.36) |
| T2I | HPS v2 [109] | Coarse-grained preference | Chameleon-7B | 25.19 | 25.56 (+0.37) | 25.62 (+0.43) | 25.56 (+0.37) | 25.71 (+0.52) |
| TI2TI | InterleavedBench [67] | Interleaved text-image QA | Chameleon-7B | 1.63 | 1.65 (+0.02) | 1.70 (+0.07) | 1.35 (-0.28) | 1.78 (+0.15) |
| TV2T | Video-ChatGPT [72] | Detailed video QA | Qwen2-VL-7B | 3.01 | 3.03 (+0.02) | 3.29 (+0.28) | 3.02 (+0.01) | 3.05 (+0.04) |
| TA2T | AIR-Bench [116] | Mixed audio QA | Qwen2-Audio-7B | 6.61 | 6.64 (+0.03) | 6.71 (+0.10) | 6.49 (-0.12) | 6.63 (+0.02) |

Since both PPO and DPO are modality-agnostic, all-modality alignment can intuitively be achieved with preference pairs $y_{w}\succ y_{l}\mid x$, and recent works have validated its effectiveness [3, 24, 95, 132]. However, the preference between coupled responses reflects a trade-off across different dimensions, e.g., their style and correctness. As the number of modalities increases, these dimensions become more complex, making it harder to identify misaligned behavior from binary preferences alone [46, 59].
3.2. Practical Implementation
Inspired by Constitutional AI [10], LLF comprises two steps, feedback modeling and self-improving, as shown in Fig. 4. The former employs SFT to train the feedback model, enabling it to provide language feedback based on $x$ and $y$. The latter allows the model to refine its responses based on the language feedback $c$.
Feedback Modeling The training process utilizes a dataset $\mathcal{D}=\{(x_{i},y_{i},c_{i})\}_{i=1}^{N}$, where $N$ is the size of the dataset, $x_{i}$ denotes the prompt, $y_{i}$ the response, and $c_{i}$ the corresponding language feedback. Let $P_{\varphi}(c_{i}\mid x_{i},y_{i})$ denote the probability of the target sequence $c_{i}$ given the input sequence $(x_{i},y_{i})$ and the model parameters $\varphi$; the training objective of the feedback model can be expressed by the cross-entropy loss:
$$\mathcal{L}(\varphi)=-\frac{1}{N}\sum_{i=1}^{N}\log P_{\varphi}(c_{i}\mid x_{i},y_{i})$$
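Concretely, this SFT objective reduces to the mean negative log-likelihood of the feedback tokens, with the loss masked to the feedback sequence only; a minimal sketch (the token-level log-probabilities stand in for a decoder conditioned on the prompt and response):

```python
import math

def feedback_sft_loss(token_logprobs):
    """Cross-entropy (mean NLL) of a target feedback sequence c_i.

    token_logprobs: log P_phi(c_t | x, y, c_<t) for each target token.
    In practice, loss is computed only on feedback tokens; the (x, y)
    context tokens are masked out of the objective.
    """
    return -sum(token_logprobs) / len(token_logprobs)

# Toy target of 4 feedback tokens with assumed per-token probabilities.
logps = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(round(feedback_sft_loss(logps), 4))  # -> 1.0397
```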
Self Improving We first collect the initial model's response $y$ to the given prompt $x$ online. Then, we gather feedback $c=M_{\phi}(x,y)$ from the feedback model $M_{\phi}$. Finally, we have the initial model generate responses $y^{*}=M_{\theta}(x,c)$ conditioned on this feedback. For example, $y$ may suffer from redundancy and hallucination, while $c$ will remind the generation of $y^{*}$ to avoid these problems. Based on online sampling, this often makes $y^{*}$ and $y$ differ significantly in certain aspects (e.g., reduced redundancy), thereby generating more learnable preference pairs.
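The self-improving stage can be sketched as a simple loop over online prompts. The wrappers `M_theta` and `M_phi` below are hypothetical stand-ins for the policy and feedback models, not the paper's actual interfaces:

```python
def synthesize_preference_pairs(prompts, M_theta, M_phi):
    """Sketch of LLF self-improving: synthesize preference pairs online.

    M_theta(x, feedback=None) -> response; M_phi(x, y) -> language feedback.
    Both signatures are illustrative stand-ins for the real models.
    """
    pairs = []
    for x in prompts:
        y = M_theta(x)                    # initial response
        c = M_phi(x, y)                   # critique + refinement
        y_star = M_theta(x, feedback=c)   # refined response, conditioned on c
        # (y_star, y) forms a chosen/rejected pair for DPO or reward modeling.
        pairs.append({"prompt": x, "chosen": y_star, "rejected": y})
    return pairs

# Toy stand-ins that only illustrate the data flow.
M_theta = lambda x, feedback=None: x + (" [refined]" if feedback else " [draft]")
M_phi = lambda x, y: "avoid redundancy and hallucination"
pairs = synthesize_preference_pairs(["describe the video"], M_theta, M_phi)
print(pairs[0]["chosen"])  # -> describe the video [refined]
```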
3.3. Experiment
We empirically verify that LLF offers several key advantages over traditional binary feedback. Unified preference: as language feedback typically optimizes responses along key dimensions, models can easily learn genuine human preferences beyond pairs. Rich information: since the feedback model can generate a substantial amount of language

Figure 5. Comparison of DPO+LLF with DPO on varying language feedback amounts. We trained feedback models using 25%, 50%, and 75% as much language feedback (LF) as binary feedback (BF), synthesized an equal number of preference pairs from each, and then compared the performance of DPO against the initial model. We find that a small amount of language feedback can synthesize preference pairs that surpass those derived from binary feedback.
| Modality | TI2T | TA2T | TV2T | T2I | TI2TI |
|---|---|---|---|---|---|
| w/o Feedback Model | 44.8% | 47.5% | 39.6% | 33.4% | 41.8% |
| w/ Feedback Model | 57.8% | 64.5% | 63.8% | 56.9% | 67.2% |
Table 3. Ablation study of the feedback model. We compare the win rate of self-improving models against initial models on all-modality hold-out evaluation prompts. The w/o feedback model setting uses the initial model as the feedback model, judged by GPT-4o (TI2T, T2T, TI2TI) and Gemini 1.5 Pro (TA2T, TV2T).
feedback, LLF can effectively serve as a robust synthesizer of preference data. Specifically, we train Chameleon-7B [98] for T2I and TI2TI, LLaVA-7B and LLaVA-13B [65] for TI2T, Qwen2-Audio-7B [25] for TA2T, and Qwen2-VL-7B [103] for TV2T, all based on align-anything-200k. For more details about the training settings, please refer to the Appendix: Training Details.
Improvement with Unified Preference As shown in Tab. 2, LLF-synthesized preference pairs reflect more unified human preferences, enhancing all-modality alignment performance. We observe that DPO and RLHF using binary pairs fall short in some modalities. With LLF, however, they yield positive improvements across all modalities. Interestingly, we find that the improvement of LLF on LLaVA-13B is greater than on LLaVA-7B, suggesting that LLF performs better on stronger models.
Efficiency with Rich Information LLF provides richer information and supports efficient preference data synthesis. As shown in Fig. 5, despite using only a smaller amount of language feedback, DPO+LLF outperforms DPO with binary feedback. At the same time, the reduction in data amount does not significantly weaken the capabilities of LLF. This suggests that in all-modality alignment, labeling many binary preference pairs is less effective than labeling a smaller amount of language feedback.
Ablation Study: Feedback Model is Necessary As shown in Tab. 3, the feedback model generates language feedback on the initial model's responses to hold-out evaluation prompts, which is then used to regenerate these responses. Results indicate that this approach yields improvements, while replacing the feedback model with the initial model does not achieve comparable positive effects.
4. Evaluation: Eval-Anything
Currently, evaluating all-modality models relies on human experts, which is inefficient and costly. While combining benchmarks for individual modalities could offer a broader evaluation [125], differences in data preparation, post-processing, and metrics across benchmarks hinder accurate performance assessment. Additionally, all-modality models uniquely select the appropriate modalities based on user queries, enabling seamless cross-modal synergy, a capability that traditional single-modality evaluation pipelines fail to capture fully.
目前,评估全模态模型依赖人类专家进行评估,效率低且成本高昂。虽然结合单一模态的基准测试可以提供更广泛的评估 [125],但不同基准在数据准备、后处理和指标上的差异阻碍了准确的性能评估。此外,全模态模型能够根据用户查询选择适当的模态,实现无缝的跨模态协同,这是传统单一模态评估流程无法完全捕捉的能力。
To address this gap, we deliver an evaluation framework specifically designed for all-modality models – eval-anything – including (1) all-modality understanding (AMU), which assesses a model's ability to simultaneously process and integrate information from all modalities, and (2) all-modality generation (AMG), which evaluates a model's ability to follow user instructions, autonomously select modalities, and work synergistically across different modalities for output.
4.1. Composition of Evaluation Dataset
4.1.1 All-Modality Understanding
All-modality models aim to both understand individual modalities and combine information across them to generate high-quality responses. To assess their comprehensive multimodal processing, we create 164 test entries, each containing textual, visual (image or video), and auditory (audio or speech) components. These interconnected modalities require the model to integrate all inputs accurately, as failure in any one modality leads to incorrect answers. For instance, as shown in Fig. 6, if a model fails to process visual inputs, it may miss that the person in the picture is frightened. Similarly, without auditory processing, it might not understand that the person’s fear is due to a barking dog.
We use GPT-4 [1] to evaluate model responses on a scale of 1 to 10. However, previous studies [65, 116] underscore limitations in multimodal evaluation that relies on a single human annotation as a reference. As the number of modalities increases, so does the complexity, making it harder to reach consensus and increasing subjective bias in single annotations. Moreover, since GPT-4 lacks true multimodal comprehension, evaluating only its text responses does not confirm whether essential information from each modality is fully understood. To address this, we gather responses from 10 annotators and extract key terms from each modality. As illustrated in Fig. 7, the distribution of multiple annotations mitigates the bias associated with single-reference evaluation. Additionally, the inclusion of key terms enables GPT-4 to more accurately detect potential errors in its responses. The details for AMU can be found in Appendix: Evaluation.
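The key-term check described above can be sketched as a simple coverage function. This is a minimal illustration, not the benchmark's actual implementation: in practice the extracted terms are supplied to GPT-4 as references rather than string-matched, and the modality names and term lists below are hypothetical.

```python
def key_term_recall(response, key_terms):
    """Fraction of each modality's key terms mentioned in the response.

    key_terms maps a modality name to the terms annotators extracted from
    that modality's input; a miss in any modality suggests the model
    failed to process the corresponding input.
    """
    text = response.lower()
    return {
        modality: sum(term.lower() in text for term in terms) / len(terms)
        for modality, terms in key_terms.items()
    }

# Hypothetical example mirroring Fig. 6: the answer is only correct when
# the visual (frightened person) and auditory (barking dog) inputs combine.
scores = key_term_recall(
    "The person looks frightened because a dog is barking at them.",
    {"visual": ["frightened"], "auditory": ["barking"]},
)
```

A response that covers the terms from every modality scores 1.0 in each entry; a dropped modality shows up as a low recall in exactly that entry, localizing the failure.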

Figure 6. The eval-anything benchmark consists of two components: (Up) AMU: All-Modality Understanding, where the model answers open-ended questions by integrating textual instructions, images, videos, and audio. (Down) AMG: All-Modality Generation is divided into subtasks of instruction-following, modality selection, and synergy. The model generates outputs for each modality (text, image, video, audio) based on instructions, with human-preferred combinations guiding modality selection metrics. A trained judge model evaluates the relevance, consistency, and synergy across different modalities in the outputs.

Figure 7. Distribution of model and human annotation. In AMU task, comparing the distribution of an all-modality model’s responses (blue) with human annotations (red) reveals that a single annotation inadequately covers the model’s semantic space (left), whereas multiple annotations broaden the red region, improving evaluation coverage by capturing more response diversity (right).
4.1.2 All-Modality Generation
In generation tasks, all-modality models can outperform single-modal ones by delivering diverse information across multiple formats under the same user query. To achieve this, they must follow SSI principles: (1) Select relevant modalities automatically to reduce redundancy, (2) Synergistic integration for maximum information gain, and (3) Instruction-following in each modality. The overall score for the AMG task is as follows:

Instruction Following To assess instruction-following, we design 100 prompts per modality. Using GPT-4o, we score the text and image outputs. For the audio and video modalities, inspired by the TIFA [44] pipeline, we create multiple-choice questions based on the prompts and employ Qwen2-Audio [24] and Qwen2-VL [103] as evaluation models. This process yields independent modality scores $S_{T}, S_{V}, S_{A}$, each ranging from 0 to 10.
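For the audio and video tracks, the multiple-choice protocol reduces scoring to answer accuracy. A minimal sketch of that reduction follows; the `answer_fn` stub stands in for Qwen2-Audio or Qwen2-VL inspecting the generated output, and the question contents are hypothetical:

```python
def instruction_following_score(checks, answer_fn):
    """Score one generated audio/video output on a 0-10 scale.

    checks: list of (question, options, correct_option) tuples derived
    from the prompt, TIFA-style. answer_fn(question, options) returns
    the option the evaluation model picks after seeing the output.
    """
    correct = sum(answer_fn(q, opts) == gold for q, opts, gold in checks)
    return 10.0 * correct / len(checks)

# Toy stub evaluator that always picks the first option.
checks = [
    ("What animal appears?", ["cat", "dog"], "cat"),
    ("What is the weather?", ["rainy", "sunny"], "sunny"),
]
score = instruction_following_score(checks, lambda q, opts: opts[0])
```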
Modality Selection Properly combining modalities in the model's response can provide rich perspectives while reducing redundant information. We employ 25 crowdsourced annotators to identify the expected modality combinations for given text instructions. The voting process yields one of the quantitative metrics for the AMG task, with appropriate rewards and penalties applied based on the modalities generated by the model. As shown in Fig. 6, $\alpha$ represents the percentage of human votes for each modality combination; if the model outputs all three modalities, this parameter will be one-third of $\alpha_{T,V,A}$.
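One way to read the reward/penalty rule is that a combination's vote share $\alpha$ is split evenly across the modalities the model emits, so outputting all three modalities yields one-third of $\alpha_{T,V,A}$ as stated. A sketch under that assumption (the benchmark's exact scheme may apply additional penalties):

```python
def modality_selection_score(output_modalities, vote_shares):
    """Assumed scoring rule: the human vote share for the emitted
    combination, divided by the combination size to penalize redundancy.

    vote_shares maps a frozenset of modality labels (e.g. {"T", "V"})
    to that combination's share of the 25 annotators' votes.
    """
    combo = frozenset(output_modalities)
    if not combo:
        return 0.0
    return vote_shares.get(combo, 0.0) / len(combo)

# Hypothetical vote distribution for one instruction.
votes = {frozenset({"T"}): 0.6, frozenset({"T", "V", "A"}): 0.3}
```

Under this reading, a model that emits all three modalities when most annotators wanted text alone earns only 0.3/3 = 0.1, while matching the majority's text-only choice earns 0.6.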
Modality Synergy refers to the consistency between different modalities in a model's responses. We assess modality synergy by training a judge model on human annotations obtained through data synthesis. Specifically, we develop a data synthesis pipeline centered on LLM-based agents that use tools to invoke audio, image, and video generation models, constructing a dataset with textual inputs and multimodal outputs. We employ human annotators to annotate preference data considering the synergy between the modalities in each response, and then train a judge model that accepts all-modality input. As shown in Fig. 6, responses with high synergy scores $(S_{T,V}, S_{T,A}, S_{V,A})$ should show high relevance and consistency between modalities, with visual and auditory elements enriching the text from various perspectives. More details on training the judge model can be found in Appendix: Evaluation.
Table 4. The performance of models in the eval-anything benchmark. Additional input modalities are masked for models that do not support all-modality input. Since most current open-source models lack support for all-modality output, (†) indicates that models are used as agents to invoke AudioLDM2-Large [66] and FLUX.1-schnell [53] for audio and image generation.

Figure 8. Performance of models in modality selection metric. We analyze the model’s modality selection performance across all instructions in AMG task, comparing it to the voting results on preferred modality combinations from 25 human annotators.
4.2. Evaluation Results & Analysis
Human and AI Agreement In the modality synergy task, after training on the 5k preference dataset, the experiment reveals a $66.4\%$ agreement rate between the judge model and human annotators. This figure is consistent with human agreement ratios reported in similar studies on modeling human preferences [113] in the multimodal large language model domain.
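The reported agreement rate is a simple exact-match fraction between the judge model's preferred response and the human-annotated preference on held-out pairs; a sketch (the per-pair labels below are hypothetical):

```python
def agreement_rate(judge_prefs, human_prefs):
    """Fraction of preference pairs on which the judge model and the
    human annotator pick the same response (e.g. 'A' or 'B' per pair)."""
    assert len(judge_prefs) == len(human_prefs)
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)
```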
Inputs vs. Outputs Most models in Tab. 4 support partial modality input and obtain baseline scores, but Gemini-1.5-Pro outperforms the others due to its ability to process all three modalities. In the AMG task, the average score is relatively low, with no model demonstrating a clear advantage across all sub-items. The results indicate that, compared to modality generation, models are more advanced in modality understanding, consistent with the developmental trends of all-modality models.
Truly All-Modality Model Current models still fall far short of truly all-modality models. For all-modality understanding, models using only a single modality score less than half the maximum points, as shown in Tab. 4. Even Gemini-1.5-Pro, which processes both visual and auditory inputs, fails to attain a perfect score. Unlike humans, who integrate information from multiple modalities, these models are limited by their inability to perceive different modalities in a fully integrated way, even though they perform nearly at the human level within individual modalities. Moreover, models align poorly with human choices when selecting output modalities, as illustrated in Fig. 8. Humans can adaptively choose the best combination of modalities based on the instruction. In contrast, models tend to output either only textual information, resulting in information loss, or all available modalities, causing redundancy. The limited multimodal capability, especially in models trained mainly on text, hampers their ability to synthesize information effectively, making it difficult to balance detail and conciseness.
5. Conclusion
In this work, we make the first exploration of fine-tuning all-modality models using human preference data across diverse modalities to ensure alignment with human intentions. We have open-sourced the align-anything dataset, incorporating 200k annotated human preference data across modalities. Our proposed alignment method leverages language feedback to capture complex, modality-specific human preferences, significantly enhancing the model's ability to follow instructions. To assess all-modality models, we developed an evaluation benchmark: eval-anything. All data, models, and code have been made openly available.
Ethics Impact Our data collection, approved by the Institutional Review Board, will be released under the CC BY-NC 4.0 license. The dataset integrates Q-A data from open-source and API-based models, offering the potential for developing AI aligned with human intentions. Although it could theoretically be misused for harmful purposes, we are committed to promoting safe AI technology.
Limitations and Future Work Despite having 45 experienced annotators, the team lacks sufficient diversity, which may limit the representativeness of human preferences. To address this, we plan to recruit a more varied group via platforms such as Amazon MTurk. Additionally, the absence of open-source all-modality foundation models restricts validation to individual modalities, so we encourage community efforts to develop such models and aim to expand experiments across more models. This is an ongoing effort, with the goal of scaling the dataset to millions.
References
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
Supplementary Material
6. Ethics Responsibility
Our data collection has been approved by an Institutional Review Board (IRB). The IRB file contains institutional information. To maintain anonymity in the double-blind review process, we did not upload the IRB documents alongside the supplementary materials. If needed, we are willing to discuss the IRB file further with the Ethics Reviewer, provided it does not compromise the double-blind review protocol.
7. Open-Source Assets and License
All dataset examples, code, and demos are attached to our supplementary material. In Sec. 6, we discuss potential risks and mitigation strategies related to model open-sourcing in detail. After the double-blind review process, we will actively engage with community feedback regarding the data, framework, and code and promptly address any issues related to version inconsistencies to further advance scientific research on large model alignment.
The following assets are planned for open-source release after the double-blind review process:
We have included samples from each modality of the dataset, along with additional related materials. After the double-blind review process, we will release all content as open source.
8. More Details of Related Works
Building on the success of large language models (LLMs) and the latest advancements in multimodal large language models (MLLMs), there is growing anticipation for integrating multiple modalities into a single model to achieve truly all-modality capabilities, i.e., all-modality models. However, numerous challenges must be overcome to reach this goal. We will introduce the related works from the aspects of dataset, algorithms, and evaluation.
Dataset Recently, several studies have proposed preference datasets for multimodal models, encompassing both understanding and generation tasks [6, 101]. However, current preference datasets are restricted to specific modalities and tasks, and they lack comprehensive cross-modal annotations. In the domain of Text-Image-to-Text (TI2T) understanding, existing preference datasets primarily aim to reduce hallucinations and enhance helpfulness in MLLMs [56, 95, 120, 134]. In the Text-to-Image (T2I) domain, most existing studies employ binary human ratings or preference rankings to reflect authentic human preferences [52, 109, 113]. Liang et al. [59], in contrast, gather more comprehensive feedback, focusing on misaligned regions and key terms. In the video and audio domains, aside from datasets such as LLaVA-Hound [127], SafeSora [27], and AudioAlpaca [73], which provide binary human preference data, further research on preferences remains relatively scarce.
Algorithms Recent all-modality alignment algorithms directly extend the RLHF pipeline to the corresponding modality, since both PPO [81] and DPO [87] are modality-agnostic. RLHF-V [120] has used human feedback to annotate image question-answering data, significantly reducing hallucinations in MLLMs. Similarly, Qwen2-Audio-7B-Instruct [24] has applied this approach to audio language models, improving audio comprehension and question-answering tasks. Similar strategies have also been practiced in video question-answering [3], image generation [11], and video generation [27] tasks. However, the preferences over coupled responses trade off across different dimensions, e.g., style and correctness. As the number of modalities increases, these dimensions become more complex, making it harder to identify misaligned behavior with binary preferences [46, 59]. Given the increased difficulty and cost of annotating all-modality data [121], there is an urgent need for an alignment algorithm that utilizes human feedback more efficiently.
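PPO and DPO are modality-agnostic because they consume only sequence log-probabilities, regardless of whether the output tokens encode text, image patches, or audio frames. For reference, the standard per-pair DPO objective can be written as:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy-vs-reference log-prob gaps on the
    chosen (w) and rejected (l) responses. Only log-probabilities are needed,
    which is why the same loss applies to any output modality."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))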
Evaluation Current benchmarks for all-modality models primarily focus on evaluating individual modalities. Numerous benchmarks are developed to assess the capabilities of large vision models across multiple dimensions, including image perception [35, 71, 93, 122], object hallucination [21, 50, 57], and safety performance [69, 70, 105]. For the video modality, Fu et al. [36], Li et al. [55], Ning et al. [79] develop various benchmarks to assess large vision models. Furthermore, in the audio domain, there are dedicated evaluation datasets for large audio models [107, 116]. However, these studies are not specifically designed to assess their cross-modal comprehensive understanding capabilities. Existing evaluation methods are primarily based on a series of tasks, such as Audio-Visual Question Answering [54, 115, 123], Audio-Visual Scene-aware Dialog [4, 92], and Audio-Visual Retrieval [17, 18]. Although involving multiple modalities, these tasks are not specifically designed to assess the all-modality capabilities of MLLMs.
9.1. Training Part
The align-anything framework integrates all-modality alignment algorithms (e.g., Reinforcement Learning from Human Feedback, RLHF), supporting SFT, RM, DPO, and PPO. Additionally, align-anything implements KTO [32], SimPO [76], and ORPO [42] in the text-to-text modality. Besides, align-anything offers a highly scalable model registration mechanism and currently supports training and deploying over 25 models. For more details, please refer to the align-anything-code/README.md file of our supplementary materials.

9. Align-Anything Framework

Figure 9. The open-sourced framework of align-anything.
Our work is built on the framework of align-anything, which is designed for training and evaluation across all modalities. As shown in Fig. 9, the align-anything framework aims to align all-modality large models, including large language models (LLMs), vision language models (VLMs), and others, with human intentions and values. Overall, this framework has the following characteristics:
9.2. Evaluation Part
The align-anything evaluation framework now supports over 30 commonly used benchmarks, covering all common modalities. For more details, please refer to the align-anything-code/README.md file of our supplementary materials.
10. Training Details
This section introduces the implementation details of learning from language feedback (LLF), hyper-parameter settings, case studies, and the computational devices involved in the experiments.
10.1. Implementation Details
LLF comprises two primary steps: feedback modeling and self-improving. The first step employs maximum likelihood to enable the model to learn from align-anything-200k how to generate language feedback for a given prompt and response. This includes evaluating the response (critique part) and providing suggestions for improvement (refinement part). During the self-improving phase, when optimizing the response, we incorporate the refinement along the specific dimensions into the prompt, thereby creating preference pairs with the original response.
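The two steps above can be sketched end-to-end. In this sketch, `policy` and `feedback_model` are hypothetical callables standing in for the initial model and the trained feedback model, and the refinement prompt template is illustrative, not the paper's actual template:

```python
def synthesize_preference_pair(prompt, policy, feedback_model):
    """LLF self-improving step: critique-and-refine the policy's own
    response, then pair the refined response (chosen) against the
    original one (rejected) for downstream DPO/RLHF training."""
    original = policy(prompt)
    # The feedback model returns both parts of the language feedback.
    feedback = feedback_model(prompt, original)  # {'critique': ..., 'refinement': ...}
    refined_prompt = f"{prompt}\n[Refinement]: {feedback['refinement']}"
    improved = policy(refined_prompt)
    return {"prompt": prompt, "chosen": improved, "rejected": original}

# Toy stand-ins showing the data flow only.
policy = lambda p: "more detailed answer" if "[Refinement]" in p else "short answer"
feedback_model = lambda p, r: {"critique": "too brief", "refinement": "add detail"}
pair = synthesize_preference_pair("Describe the image.", policy, feedback_model)
```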

Figure 10. Examples of LLF synthesized preference pairs. The responses from current models are often not perfect. Using language feedback to enhance prompts can improve responses in certain dimensions, synthesizing more learnable preference pairs.

Figure 11. Further experiment results of LLF on the TI2T, TV2T, TA2T, T2I, and TI2TI modalities. We compare LLF with common open-source [22–24, 28, 62, 64, 68, 78, 89, 94, 96, 98, 102, 108, 129] and closed-source models [7, 72, 80, 88].
In the RLHF phase, we synthesized a preference dataset of the same size as the baseline data (the detailed size for each modality can be found in Fig. 12). We will open-source this preference dataset synthesized by LLF. Fig. 10 provides a simple example.
10.2. Further Comparison
This section will compare LLF with other open-source and closed-source models. These comparison results reveal two surprising phenomena. For modalities with initially weaker models, such as TI2T, LLF can significantly enhance their performance. As shown in Fig. 11, on the MIA-Bench [84], although the initial model LLaVA-1.5-13B [62] is not outstanding, after fine-tuning with LLF-based PPO, it surpasses open-source models with much larger parameter sizes (e.g., LLaVA-NeXT-110B [64]) and even some closed-source models (e.g., Claude-3-Opus [7]). Meanwhile, for modalities with initially stronger models, like TA2T, LLF can still further improve their performance.
10.3. Models
The LLF pipeline requires multimodal models with language understanding capabilities. Considering popularity, performance, and open-source availability, we select the following models for our experiment.
• LLaVA-1.5-7B & LLaVA-1.5-13B [62] are visual language models whose performance has been validated across 11 benchmarks. Their base language models are Vicuna-7B-v1.5 and Vicuna-13B-v1.5 [82], with CLIP-ViT-L-336px [85] as the visual encoder and an MLP as the multimodal projection layer. The 13B version is trained on approximately 1.2M publicly available data. They excel at extracting effective information from inputs that combine language and images, and provide answers by integrating the world knowledge of the language model.
• Qwen2-Audio-7B-Instruct [24] is a large-scale audio-language model that can process diverse audio signal inputs to perform audio analysis or generate direct textual responses to speech instructions. Qwen2-Audio-7B-Instruct uses Whisper-large-v3 [86] as the audio encoder and Qwen-7B [9] as the language model. It was pretrained on over 520k audio QA data and completed SFT and DPO alignment, achieving outstanding performance on 9 benchmarks.
• Chameleon-7B [98] is an early-fusion, token-based multimodal model designed for unified image and text understanding and generation. Chameleon-7B employs a newly trained image tokenizer, based on [37], to encode $512\times512$ images into 1024 discrete tokens. The model was pre-trained using a Llama2-style language model architecture on text-image interleaved data with this encoder. The original paper introduced two model variants: Chameleon-7B and Chameleon-34B. Due to computational constraints, we focused on evaluating our method using Chameleon-7B. However, the full Chameleon-7B parameters with image generation capabilities were not open-sourced by the authors. To address this limitation, we used the Align-Anything framework to fine-tune the original Chameleon-7B model on the LAION-Art dataset [91]. This resulted in the AA-Chameleon-7B-base model, which integrates both image generation and understanding capabilities. For simplicity, we refer to this model as Chameleon-7B throughout our paper.
• Qwen2-VL-7B-Instruct [103] is a large multimodal model that employs a unified paradigm for processing both images and videos, representing a significant advancement in video understanding capabilities. Through its innovative approach to visual processing, the model can effectively analyze video content using the same architecture that handles static images, creating a seamless integration of temporal and spatial information for comprehensive video comprehension. Qwen2-VL-7B-Instruct uses a Vision Transformer [30] as the vision encoder and Qwen2-7B [114] as the language model. Trained on 1.4 trillion tokens and fine-tuned through SFT, the model demonstrates exceptional video comprehension capabilities.
10.4. Benchmarks
• LLaVA-Bench (COCO) [65] is a benchmark designed to evaluate the alignment behavior and capabilities of models with consistent visual inputs. It utilizes a subset of the COCO-Val-2014 dataset, selecting 30 images and generating three distinct types of questions for each image: conversation, detailed description, and complex reasoning. This results in a total of 90 questions, crafted using a specialized data generation pipeline. The benchmark aims to assess how effectively models can follow user instructions and handle diverse question types.
• MIA-Bench [84] is a pioneering benchmark designed to assess the capabilities of VLMs in adhering to complex, layered instructions. This benchmark features a diverse collection of 400 meticulously crafted image-prompt pairs, each intended to rigorously test the models' ability to generate responses that precisely follow intricate directives. Through comprehensive evaluations of various state-of-the-art VLMs, MIA-Bench uncovers significant disparities in performance, thereby identifying key areas for enhancement in instruction fidelity.
• Image Reward [113] is a general-purpose T2I human preference reward benchmark. Trained on 137k expert-annotated comparisons, the Image Reward model utilizes a systematic annotation pipeline emphasizing alignment, fidelity, and harmlessness in generated images. It significantly outperforms prior metrics such as CLIP and FID in capturing human preferences, achieving higher consistency with human rankings. The benchmark demonstrates robustness across various datasets, establishing itself as a promising metric for evaluating T2I generative models.
• Human Preference Score v2 (HPS v2) [109] is a benchmark designed to evaluate the alignment of T2I models with human preferences. It is built upon Human Preference Dataset v2 (HPD v2), a comprehensive dataset comprising 798k human-annotated pairwise comparisons across nine generative models and real images from the COCO dataset. HPS v2 trains a fine-tuned CLIP-based preference prediction model, demonstrating superior generalization and sensitivity to algorithmic improvements compared to prior metrics such as Inception Score and Fréchet Inception Distance (FID).
• Interleaved Bench [67] is a holistic benchmark designed for evaluating the interleaved generation of text and images. It encompasses diverse real-world use cases, including storytelling, marketing, and multimodal script generation, with 815 instances distributed across 10 categories. Unlike prior benchmarks, Interleaved Bench emphasizes arbitrary interleaving of text and images within both inputs and outputs. Its evaluation metric, Interleave dEval, powered by GPT-4, assesses five dimensions: text quality, perceptual quality, image coherence, text-image coherence, and helpfulness.
• Video-ChatGPT [72] contains a sophisticated framework for evaluating the text generation capabilities of video-based conversational models, leveraging the ActivityNet-200 [12] dataset's rich collection of videos with dense descriptive captions. This comprehensive evaluation system employs GPT-3.5 to assess model outputs on a 1-5 scale across five fundamental dimensions: Correctness of Information (verifying factual accuracy and alignment with video content), Detail Orientation (examining response completeness and specificity), Contextual Understanding (assessing comprehension of overall video context), Temporal Understanding (evaluating grasp of event sequences), and Consistency (measuring reliability across similar queries).
• AIR-Bench [116] is designed to assess the audio-centric interaction capabilities of Large Audio-Language Models (LALMs). Unlike previous benchmarks that primarily focus on fundamental tasks like automatic speech recognition, AIR-Bench offers a comprehensive evaluation of LALMs' ability to understand and interact with various audio signals, including human speech, natural sounds, and music, in a textual format. It comprises two main components: the foundation benchmark, which includes 19 tasks with approximately 19,000 single-choice questions to test basic single-task abilities, and the chat benchmark, featuring 2,000 instances of open-ended QA data to evaluate complex audio comprehension and instruction-following capabilities, utilizing advanced language models such as GPT-4 for scoring. Since Qwen2-Audio-Instruct is a chat model, our experiment results come from its open-ended QA part.
10.5. Hyper-Parameters
The hyper-parameters involved in our experiment are as follows. Their selection was based on community implementations [95, 130] and our experimental observations. They may not be optimal, but we have confirmed that they do not affect the correctness of the training. These hyperparameters vary across different modalities, but we ensure consistency between language feedback and binary feedback scenarios.
Table 5. Hyper-parameters for feedback modeling.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 3 | 3 | 3 | 3 | 3 |
| BatchSizePerDevice | 4 | 4 | 4 | 3 | 4 |
| Learning Rate | 1.e-6 | 1.e-6 | 1.e-6 | 1.e-7 | 1.e-6 |
| SchedulerType | cosine | cosine | cosine | cosine | cosine |
| WarmupRatio | 0.03 | 0.03 | 0.03 | 0.1 | 0.03 |
| GradientAccumulation | 1 | 2 | 2 | 1 | 1 |
| WeightDecay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MaxTokenLength | 4096 | 4096 | 4096 | 4096 | 4096 |
| BFloat16 | True | True | True | True | True |
Table 6. Hyper-parameters for reward modeling.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 3 | 3 | 3 | 3 | 3 |
| BatchSizePerDevice | 4 | 4 | 4 | 1 | 4 |
| Learning Rate | 1.e-6 | 5.e-6 | 5.e-7 | 1.e-6 | 1.e-6 |
| SchedulerType | cosine | cosine | cosine | cosine | cosine |
| WarmupRatio | 0.03 | 0.03 | 0.03 | 0.01 | 0.03 |
| GradientAccumulation | 1 | 2 | 2 | 1 | 1 |
| WeightDecay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MaxTokenLength | 2048 | 4096 | 4096 | 4096 | 2048 |
| BFloat16 | True | True | True | True | True |
Table 7. Hyper-parameters for DPO.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 2 | 2 | 2 | 2 | 2 |
| BatchSizePerDevice | 4 | 4 | 2 | 3 | 4 |
| Learning Rate | 1.e-6 | 5.e-7 | 5.e-7 | 1.e-7 | 1.e-6 |
| SchedulerType | cosine | cosine | cosine | cosine | cosine |
| WarmupRatio | 0.03 | 0.03 | 0.03 | 0.1 | 0.03 |
| GradientAccumulation | 1 | 2 | 2 | 1 | 1 |
| WeightDecay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MaxTokenLength | 2048 | 4096 | 4096 | 4096 | 2048 |
| BFloat16 | True | True | True | True | True |
| ScaleCoefficient | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 |
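The ScaleCoefficient row of Table 7 corresponds to the β temperature in the standard DPO objective. Below is a minimal sketch of the per-pair loss, assuming sequence-level log-probabilities under the policy and the frozen reference model are already computed; the function and argument names are illustrative, not taken from the training code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.10):
    # Implicit reward margin between the chosen and rejected responses,
    # scaled by beta (the ScaleCoefficient of 0.10 in Table 7).
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss shrinks as the policy assigns relatively more probability to the chosen response.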
Table 8. Hyper-parameters for PPO.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 3 | 3 | 3 | 3 | 3 |
| BatchSizePerDevice | 4 | 4 | 4 | 1 | 4 |
| ActorLearningRate | 1.e-5 | 1.e-5 | 1.e-5 | 5.e-8 | 1.e-6 |
| ActorSchedulerType | cosine | cosine | cosine | cosine | cosine |
| ActorWarmupRatio | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
| CriticLearningRate | 5.e-6 | 5.e-6 | 5.e-6 | 5.e-8 | 5.e-6 |
| CriticSchedulerType | constant | constant | constant | constant | constant |
| CriticWarmupRatio | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
| GradientAccumulation | 1 | 4 | 4 | 1 | 1 |
| WeightDecay | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 |
| MaxTokenLength | 2048 | 4096 | 4096 | 2048 | 2048 |
| BFloat16 | True | True | True | True | True |
| SamplingTemperature | 1.0 | 0.2 | 0.2 | 1.0 | 1.0 |
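The cosine SchedulerType and WarmupRatio entries appearing throughout Tables 5-8 describe a linear-warmup-then-cosine-decay learning-rate schedule. The sketch below is an illustrative reimplementation of that schedule, not the exact training code (the constant schedule used for the PPO critic would simply return the base rate after warmup).

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    # Linear warmup over the first `warmup_ratio` fraction of training.
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with a base rate of 1.e-6 and a 0.03 warmup ratio over 1,000 steps, the rate ramps up linearly for the first 30 steps, peaks at 1.e-6, and decays to zero by the final step.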
11. Dataset Card
11.1. Dataset Overview
As the number of modalities increases, current all-modality models encounter significant challenges in effectively following instructions. These challenges include difficulties in comprehending multimodal instructions and generating outputs that align with the intended directives [73, 118, 120]. While RLHF has demonstrated its effectiveness in addressing such issues for specific modalities, such as text and images [120], its applicability to all-modality scenarios remains uncertain.
Furthermore, existing preference datasets predominantly focus on single-modal tasks, lacking the comprehensive information required to capture the intricacies of all-modality features. In response, we introduce align-anything-200k, the first all-modality human preference dataset designed to enhance the instruction-following capabilities of all-modality models. This dataset encompasses eight subtasks across text, image, audio, and video modalities (see Fig. 12). For examples, please refer to the dataset/dataset examples included in the supplementary materials.
To ensure consistent modality preference modeling, we categorize targets into modality-agnostic and modality-specific types, which act as evaluation metrics for instruction-following (see Sec. 11.2). A comprehensive dataset collection and annotation pipeline is outlined in Sec. 11.3, while Sec. 11.4 provides a detailed analysis of dataset features, including prompt semantics, preference features, etc. Additionally, examples from the align-anything-200k dataset are presented in Sec. 11.5.
11.2. Instruction-Following Dimensions
In the context of text modality, instruction-following refers to the ability of LLMs to effectively execute human-provided instructions, such as answering questions or summarizing text. This capability enables them to function as helpful, harmless, and honest assistants [81, 99]. However,
| Task | Dataset | Model |
|---|---|---|
| T2T | UltraFeedback [26], PKU-SafeRLHF [48], Evol-Instruct [112] | UltraLM [29], WizardLM [111], Alpaca-7B, Alpaca2-7B, Alpaca3-8B, LLaMA2-7B/13B/70B-Chat [99] |
| TI2T | RLHF-V [120], ShareGPT4V [15], ART500K [74], MovieNet [45], LLaVA-Instruct-150K [62] | Llama 3.2-Vision-11B [31], LLaVA-v1.5-7B [62], GPT-4o [80], LLaVA-v1.6-Vicuna-13B [63] |
| T2I | HPDv2 [109], Pick-a-Pic-v2 [52], MS COCO [20], DiffusionDB [106] | StableDiffusionXL [83], StableDiffusion v2-1 [89], FLUX.1-schnell, Chameleon-7B [98] |
| TI2TI | RLHF-V [120], LLaVA-Instruct-150K [62], BDD100K [119], Ego4D [40] | Chameleon-7B [98], Gemini 1.5 Pro [88] |
| TV2T | NExTQA [110], Panda-70M [19], ShareGPT4Video [16], VidProm [104] | Qwen2-VL-7B [103], CogVideoX-5B [117] |
| T2V | mixkit [77], ShareGPT4Video [16], Wavcaps [75], GigaSpeech [13] | CogVideo [43], Open-Sora [131], Pika [8], Gemini 1.5 Pro [88] |
| TA2T | OpenAQA [39], MusicCaps [2], Wavcaps [75] | Qwen2-Audio-7B [24], AudioLDM 2-large [66] |
| T2A | Audio-Alpaca [73], AudioCaps [51] | Tango 2 [73], Stable Audio Open 1.0 [33] |
Table 9. Related datasets and models for each subtask. We extensively collect datasets of various modalities and enhance the prompt sources for these datasets, resulting in the final prompt sources of align-anything-200k. We also widely gather generation results from both open-source and closed-source models. Llama2-7B and Llama3-8B are fine-tuned with Alpaca-52K [97], resulting in Alpaca2-7B and Alpaca3-8B.
as the number of modalities increases, establishing a unified metric for instruction-following across all modalities becomes increasingly challenging. While the goal is for an all-modality model to effectively perform instructions across various modalities, each specific modality (e.g., video) may have unique requirements, such as ensuring temporal consistency in video outputs.
To address the diversity inherent in the concept of instruction-following within an all-modality alignment setting, we decompose instruction-following into modality-agnostic and modality-specific dimensions. The modality-agnostic dimensions are universally applicable across all modalities, while the modality-specific dimensions represent preference criteria tailored to the unique characteristics of each modality. For the annotation prompts of each subtask, please refer to dataset/annotation prompts included in the supplementary materials.
11.2.1 Modality-Agnostic Dimensions
In this section, we provide a detailed overview of modality-agnostic dimensions for evaluating instruction-following in all-modality scenarios.
Prompt adherence Prompt adherence refers to the extent to which responses align with the given input prompts, accurately reflecting the specified elements, themes, or instructions. This ensures that the output remains faithful to the user’s intent and maintains relevance to the multimodal content of prompts.

Figure 12. Composition of the align-anything-200k preference dataset. The composition of enhanced prompts encompasses text, image, video, and audio modalities, aiming to improve the instruction-following capabilities of foundational all-modality models. In addition, we visualize the overall text prompt distribution across all tasks, demonstrating that the data covers different semantic embedding spaces.
Rule conformity Rule conformity refers to the compliance of responses with logical, physical, biological, or scientific principles applicable to the scenario or theme described in the prompt. This metric evaluates the response's consistency with established rules and modality-specific constraints, ensuring realism and plausibility.
Information richness Information richness evaluates the depth and detail provided in responses, emphasizing the thoroughness and comprehensiveness with which the problem or query is addressed. High-quality responses should offer nuanced insights, detailed explanations, and well-rounded coverage of the topic.
11.2.2 Modality-Specific Dimensions
In the following part, we provide a detailed overview of the instruction-following standards designed based on the unique characteristics of different modalities, along with the subtasks to which they apply.
Clarity Clarity refers to the degree to which responses are clear, comprehensible, and easy to understand. It emphasizes overall coherence, language quality, and grammatical correctness, ensuring that the output effectively communicates its intended meaning without ambiguity. This metric applies to subtasks such as T2T, TI2T, TI2TI, TA2T, and TV2T, where textual response quality plays a critical role in achieving successful multimodal communication.
Aesthetics Aesthetics refers to the evaluation of the visual and auditory appeal of multimodal outputs, including images, videos, and audio. It encompasses the sensory and emotional impact these outputs deliver to the audience.
For image-related subtasks (e.g., T2I, TI2TI), aesthetics involves analyzing aspects such as lighting, color harmony, richness of detail, creativity, and the overall emotional resonance or enjoyment the image evokes. In audio subtasks (e.g., T2A), the focus lies on sound quality, emphasizing clarity, smoothness, and the absence of disruptive elements like noise or distortion. For video subtasks (e.g., T2V), aesthetics includes the evaluation of visual appeal and aesthetic quality, taking into account factors like lighting, color harmony, the richness of detail, and the emotional or immersive experience the video provides.
Cross-modal Consistency Cross-modal consistency refers to the degree of alignment and harmony between all output modalities. For the TI2TI subtask, this involves ensuring that the text and image are consistent in terms of content, style, and message, so the modalities complement each other and produce a coherent, unified output.
Audio Consistency Audio consistency evaluates the coherence of an audio output’s acoustic qualities, focusing on smooth transitions along the time axis, natural flow, and the absence of abrupt changes. This metric ensures that the audio maintains a steady tone and rhythm, enhancing the listening experience which is applied in the T2A subtask.
Temporal Consistency Temporal consistency assesses the smoothness of transitions between video frames, ensuring a natural and fluid progression without sudden jumps or erratic changes in motion. This metric is critical for maintaining the visual flow and is applied in the T2V subtask.
Content Coherence Content coherence measures the semantic and narrative alignment within a video, ensuring that all elements logically work together to deliver a clear and cohesive message. This metric prevents disjointed or illogical segments and is used in the T2V subtask.
Motion Naturalness Motion naturalness evaluates the realism and fluidity of object and character movements in a video. It ensures that movements adhere to realistic physical laws, resulting in actions that appear natural, smooth, and believable. This metric is applied in the T2V subtask.
11.3. Details of Dataset Construction
11.3.1 Annotation Pipeline Overview
In this subsection, we detail the annotation pipeline of the dataset. For text and image modalities, we rely on GPT-4o [80] to support the human-AI joint annotation process, while for video and audio modalities, we utilize Gemini-1.5-Pro [88], as these models currently provide the highest-quality multimodal annotations.
Data Pairs Collection To comprehensively cover all-modality tasks and diverse prompt distributions, in constructing align-anything-200k we first collect open-source multimodal datasets, followed by a refined secondary screening of the corresponding text prompts, videos, and audio. Based on this carefully curated multimodal content, we then utilize advanced multimodal models [24, 80, 88, 103] to enhance prompts to fit all-modality alignment requirements, ensuring a strong focus on instruction-following capabilities across modalities. We then gather the responses from open-source and API-based models. The related datasets and models are listed in Tab. 9.
Fine-grained Preference Annotation We apply a detailed preference annotation process to each question-answer pair, combining insights from both GPT-4o and human annotators. The approach ensures a thorough assessment across modality-agnostic and modality-specific dimensions, with each dimension scored from 0 to 3 based on strict criteria. Annotators also provide justifications and rationales after assigning scores, significantly enhancing the consistency and reliability of the annotation process.
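Under this scheme, each response carries a 0-3 score on every annotated dimension, from which a binary preference label can be derived by comparison. A simplified illustration of that aggregation follows; the released data also records per-dimension rationales, and tie handling here is a simplifying assumption rather than the paper's procedure.

```python
def pairwise_preference(scores_a, scores_b):
    # scores_a / scores_b: dicts mapping dimension name -> score in [0, 3],
    # covering both modality-agnostic and modality-specific dimensions.
    total_a, total_b = sum(scores_a.values()), sum(scores_b.values())
    if total_a == total_b:
        return "tie"  # ties would need human adjudication in practice
    return "A" if total_a > total_b else "B"
```

For instance, a response scoring higher on prompt adherence with equal clarity would be labeled the preferred one.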
Language Feedback Annotation To refine response quality further, we implement a comprehensive language feedback annotation process. The process involves defining the critique scope for each response, conducting a structured critique, and offering targeted refinement suggestions. Both human annotators and AI models contribute to this feedback, which is organized into cohesive guidance that addresses specific improvement areas considering different modalities. This feedback process captures fine-grained, modality-related preferences and effectively serves as a rich source of natural human preference across all modalities.

Figure 13. Human-AI preference consistency analysis of modality-agnostic dimensions and overall preference.
11.3.2 Annotation Agreement Analysis
We evaluate the consistency between AI and human preferences across two dimensions: modality-agnostic and modality-specific. For each subtask, 100 data pairs are randomly selected, with 10 annotators performing pair-wise comparisons. We assess preference consistency for both modality-agnostic preferences and overall preferences. Additionally, we examine the consistency of modality-specific preferences for each subtask, observing that AI can partially replace human judgment and perform effectively in certain modality-specific tasks. However, our findings reveal that both consistency and accuracy between AI and human preferences decline in multimodal scenarios, suggesting that incorporating multimodal information heightens the complexity of preference judgment. In Fig. 13, we illustrate the modality-agnostic dimension consistency across 8 subtasks, as well as the overall consistency considering both modality-agnostic and modality-specific dimensions.
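Concretely, the per-subtask consistency can be computed as a match rate between the AI annotator's pairwise choice and the human majority vote over the 10 annotators. This is a hypothetical minimal formulation for illustration; the paper does not specify its exact aggregation.

```python
from collections import Counter

def majority_vote(votes):
    # Collapse the 10 annotators' pairwise choices ("A" or "B")
    # into a single human preference label.
    return Counter(votes).most_common(1)[0][0]

def agreement_rate(ai_prefs, human_vote_lists):
    # Fraction of data pairs where the AI picks the human-majority winner.
    human_prefs = [majority_vote(v) for v in human_vote_lists]
    matches = sum(a == h for a, h in zip(ai_prefs, human_prefs))
    return matches / len(ai_prefs)
```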
11.4. Statistical Properties of Preference Dataset
We analyze the composition of the dataset, including the tasks associated with each modality and the categories of prompts. Based on the characteristics of each modality, we enhance the prompts along instruction-following dimensions. These enhancements cover a variety of types, such as question-answering, multimodal generation, emotional expression, modality selection, and complex reasoning, leveraging advanced multimodal models [24, 80, 88, 103]. For more details, please refer to Fig. 12.
11.5. Examples of Align-anything-200k
In Fig. 14, we showcase the training data across 8 subtasks. For each task, we provide detailed fine-grained preferences and evaluation rationales. Furthermore, we include language feedback annotations for each response. Each response is accompanied by specific critique and refinement feedback, focusing on modality information and directions for improvement. Additionally, the total language feedback further specifies how each response should be revised. For more examples of our dataset, please refer to the dataset folder in the supplementary materials.
12. Evaluation
12.1. All-Modality Understanding
12.1.1 Dataset Composition
The AMU dataset is partially derived from the test set of VGGSound [14], supplemented with additional data collected from the internet and generated using generative models. The data is categorized into perception, reasoning, instruction following, and safety, with each category encompassing various types of content. For perception, reasoning, and instruction following, image+audio and video+audio pairs are sourced from VGGSound, while other materials, including manually curated multimedia such as videos, audio, and images, are gathered from the internet. For the safety category, the dataset integrates both internet-sourced content and data generated using generative models.
12.1.2 Annotation Details
To collect human responses for evaluating the AMU task, we instruct crowd workers to provide detailed answers and annotations for each test sample. As shown in Fig. 15, each test instance in our dataset consists of one piece of visual data (image or video), one piece of auditory data (audio or speech), and one text question. For video, audio, and speech content, we maintain a consistent duration of approximately 10 seconds. Each test instance is annotated by 10 human annotators who are instructed to provide detailed responses according to the following requirements:
For data in the instruction following category, annotators are required to follow the instructions strictly when providing responses. In contrast, for data in the safety category, annotators are encouraged to make their own judgments based on their personal values regarding whether the content in the given scenario is appropriate to respond to. If the content is deemed sensitive and unsuitable for a response, annotators should politely decline the request to answer.
12.1.3 Evaluation Details
The default evaluation system prompt consists of two sections. The first section contains instructions for GPT-4, outlining the content it should consider during evaluation and specifying the expected output format. The last section includes reference answers and keywords for each question.
To assess response scores, we guide GPT-4 to evaluate from multiple perspectives, using keywords to determine if it correctly interprets multimodal information. Additionally, multiple manually annotated reference answers are provided to evaluate the alignment between the model’s responses and human response distributions.
Due to the unique nature of the safety and instruction-following components, we design additional system prompts to evaluate response safety and instruction following. For safety evaluation, GPT-4 checks whether the model should refuse to answer based on the distribution of human references. For instruction following, GPT-4 examines whether the response follows the special instructions in the test cases by checking whether the reference answers have a special output format. All instruction-following related guidelines and evaluation methods are referenced from IFEval [133] and FollowBench [49], respectively. The prompts for the AMU task are shown in the evaluation/prompt directory included in the supplementary materials.
12.2. All-Modality Generation
Due to the limitations of current all-modality generation models, we employ LLMs as agents to perform multimodal tool calls during experiments, allowing models to generate outputs in non-textual modalities. The prompt for LLM agents is shown in the evaluation/prompt directory included in the supplementary materials.
12.2.1 Instruction Following
To evaluate the instruction-following capability of multimodal models in different modality generation tasks, we design a pipeline for evaluating instruction-following across various modalities. For the four modalities (text, image, video, and audio), we consider various tasks and dimensions of instruction-following and meticulously design 100 instruction-following prompts for each modality. During the evaluation phase, since different modalities require different evaluation models, we use GPT-4 [1] and GPT-4o [80] for the text and image modalities to directly assess the instruction-following level of the generated content with a scalar score from 0 to 10. For the audio and video modalities, inspired by the TIFA [44] pipeline, we design multiple-choice questions closely related to the instruction to capture the information in the prompts in more detail. We use Qwen2-Audio [24] and Qwen2-VL [103] as evaluation models to assess the generated audio and video, scoring based on the accuracy of the multiple-choice answers, with scores ranging from 0 to 10. The system prompts involved in the entire scoring process can be found in the evaluation/prompt directory included in the supplementary materials.
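For the audio and video tracks, the multiple-choice accuracy is mapped onto the same 0-10 scale used by the text and image graders. A sketch of that conversion is below; the helper name is ours, not from the evaluation code.

```python
def mc_score(answers, references):
    # Fraction of multiple-choice questions answered correctly,
    # rescaled to the paper's 0-10 scoring range.
    correct = sum(a == r for a, r in zip(answers, references))
    return 10.0 * correct / len(references)
```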

Figure 14. Examples from the align-anything-200k dataset. We present the training data across 8 subtasks, showcasing fine-grained preferences, evaluation rationales, and detailed language feedback (critique and refinement) for each response. The feedback highlights modality-specific suggestions and directions for improvement.
12.2.2 Modality Selection
To assess the model's flexibility in choosing different modality combinations based on textual instructions, we engage human crowd workers to annotate the expected output modality for each of the 100 prompts. Specifically, each instruction is annotated through multiple-choice selections by 25 human annotators, with options covering modality combinations such as text, image, and audio, or text, video, and audio. This process produces a distribution of votes for each instruction, reflecting preferences for the expected output modalities. Modality combinations that closely align with human preferences earn additional points in the modality selection metric. The human annotation criteria are as follows:
• Information Richness: Annotators should specify the desired modality of the model output for a given textual instruction to maximize information gain in the response.

Figure 15. Example of AMU task. The top left corner shows a sample of test data, which includes a piece of visual information, a piece of auditory information, and a question. The right side presents an example of annotations provided by annotators. For an open-ended question, each annotator’s response may differ. However, as long as the answer correctly addresses the multimodal information, it will be accepted as a reference answer. The bottom left corner displays the responses of various models to the test data. The more modalities a model can recognize, the more details its reply can incorporate, resulting in higher quality and, consequently, a higher score.
• Necessity & Conciseness: When choosing modality combinations, necessity and conciseness should guide the selection. If a modality adds no significant information, it should be excluded based on the principle of conciseness.
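One way to turn the 25-way vote into a score is to credit the model with the vote share of its chosen modality combination, treating combinations as unordered sets. This is an illustrative formulation only; the paper states that closely aligned combinations earn additional points without fixing an exact formula.

```python
from collections import Counter

def modality_selection_score(model_choice, human_votes):
    # model_choice: an iterable of modality names, e.g. ("text", "image").
    # human_votes: one combination per annotator (25 in the paper).
    # Combinations are order-insensitive, so normalize to frozensets.
    votes = Counter(frozenset(v) for v in human_votes)
    return votes[frozenset(model_choice)] / len(human_votes)
```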
Modality synergy is a complex attribute, and currently, no simple or objective method exists to uniformly evaluate information across different modalities. While human evaluation or the use of AI under strictly defined annotation guidelines are feasible solutions, they introduce additional costs and complexities. Therefore, we meticulously construct training data for the Modality Synergy task and train a judge model. The judge model can evaluate the synergy metrics of multi-modal information generated according to the task instructions.
• Data Collection Due to the current immaturity of end-to-end multimodal generation models, generating high-quality training data remains a challenge. To address this, we develop a data generation system centered on an LLM agent. Upon receiving user instructions, the agent calls image, audio, or video generation models through tool usage, producing the corresponding modality information and responding to the user's instructions in conjunction with text.
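The tool-calling loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the keyword router stands in for the LLM agent's tool-selection step, and the lambdas stand in for real generation models (e.g., Stable-Diffusion-v1-5, AudioLDM, CogVideoX-2B).

```python
# Hypothetical sketch of the LLM-agent data-generation loop.
# Routing heuristic and tool stubs are illustrative assumptions.

def route_modalities(instruction: str) -> list[str]:
    """Naive keyword router standing in for the LLM's tool-selection step."""
    keywords = {
        "image": ["draw", "picture", "image", "photo"],
        "audio": ["sound", "audio", "music", "speech"],
        "video": ["video", "clip", "animation"],
    }
    return [m for m, kws in keywords.items()
            if any(k in instruction.lower() for k in kws)]

def generate_response(instruction: str, tools: dict) -> dict:
    """Call one generator per selected modality and pair outputs with text."""
    outputs = {"text": f"Response to: {instruction}"}
    for modality in route_modalities(instruction):
        outputs[modality] = tools[modality](instruction)
    return outputs

tools = {
    "image": lambda p: f"<image generated for '{p}'>",  # e.g. a diffusion model
    "audio": lambda p: f"<audio generated for '{p}'>",  # e.g. an audio LDM
    "video": lambda p: f"<video generated for '{p}'>",  # e.g. a video model
}

result = generate_response("Draw a picture of a temple with ambient sound", tools)
```

The agent thus emits only the modalities the instruction calls for, which is what the modality-selection metric later evaluates.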
• Data Annotation As there is currently no fully automated annotation tool for all-modality information, we develop a comprehensive human annotation pipeline specifically for the modality synergy task. Human annotators receive data pairs consisting of one instruction and outputs from two models, where the model outputs involve any two modalities among text, image, video, and audio. Annotators evaluate the two responses based on relevance and consistency, assigning preference annotations in line with the annotation guidelines. For detailed annotation guidelines, please refer to Sec. 12.2.4.
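A single annotation record from this pipeline might be modeled as below; the field names and score range are assumptions for illustration, not the released dataset's schema.

```python
from dataclasses import dataclass

# Illustrative record for one modality-synergy preference annotation.
# Field names are hypothetical, not the actual dataset schema.
@dataclass
class SynergyAnnotation:
    instruction: str
    modalities: tuple   # any two of ("text", "image", "video", "audio")
    relevance_a: int    # relevance/consistency score for response A
    relevance_b: int    # relevance/consistency score for response B
    preferred: str      # "A" or "B", per the annotation guidelines

    def preference_pair(self):
        """Return (chosen, rejected) labels for reward-model training."""
        return ("A", "B") if self.preferred == "A" else ("B", "A")

ann = SynergyAnnotation("Describe a beach with waves", ("text", "audio"), 4, 2, "A")
```

Records in this chosen/rejected form feed directly into the judge-model (RM) training described next.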
• RM Training Following reward model training in the text-only modality, the base model of our judge model is Llama3-8B, equipped with a score head. ImageBind [38] and independent projectors for each modality enable the model to achieve multimodal understanding. The training process has two stages: first, the parameters of the LLM and ImageBind are frozen, and only the projectors are trained; second, both the LLM and projectors are updated simultaneously.
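The two-stage schedule can be sketched as a freeze/unfreeze policy over the named components. This is a minimal sketch using plain dicts in place of real modules; the treatment of the score head in both stages is an assumption not stated in the text.

```python
# Sketch of the two-stage judge-model training schedule.
# Component names follow the text; real training would toggle
# requires_grad on module parameters instead of returning flags.

def set_trainable(model: dict, stage: int) -> dict:
    """Stage 1: only projectors train; stage 2: LLM + projectors train.
    ImageBind stays frozen throughout, per the described recipe."""
    trainable = {}
    for name in model:
        if name.startswith("projector"):
            trainable[name] = True
        elif name == "imagebind":
            trainable[name] = False
        elif name == "llm":
            trainable[name] = (stage == 2)
        else:  # e.g. the score head trains in both stages (assumption)
            trainable[name] = True
    return trainable

judge = {"llm": ..., "imagebind": ..., "projector_image": ...,
         "projector_audio": ..., "score_head": ...}
stage1 = set_trainable(judge, stage=1)
stage2 = set_trainable(judge, stage=2)
```

Freezing the backbone first lets the projectors learn to map ImageBind embeddings into the LLM's representation space before full fine-tuning.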
12.2.3 More Experiment Results
To evaluate agent performance across various multimodal generative models, we conduct supplementary experiments. Specifically, the LLM-based agent invokes Stable-Diffusion-v1-5 [89], AudioLDM-Small-full-v2 [61], and CogVideoX-2B [117] for image, audio, and video generation, respectively. The results of these experiments are presented in Tab. 10 and Tab. 11.
The two tables align with the conclusions in Sec. 4.2, reinforcing that current models remain limited and that no single model exhibits a clear advantage in the AMG task.
| Initial Model | All-Modality Understanding | All-Modality Generation | Overall |
|---|---|---|---|
| Category | AMU | Modality | |
| Perception | Reasoning | IF | Safety |
| LLaVA-v1.5-7B† | 2.66 | 2.67 | 2.50 |
| Qwen2-VL† | 2.76 | 3.07 | 2.40 |
| Qwen2-Audio† | 3.58 | 4.53 | 3.40 |
| Chameleon-7B† | 1.44 | 2.97 | 2.80 |
| Llama3.1-8B-Instruct† | 1.05 | 1.20 | 1.20 |
| Gemini-1.5-Pro† | 5.36 | 5.67 | 6.70 |
| GPT-4o† | 2.66 | 3.48 | 4.20 |

Table 10. Model performance on the eval-anything benchmark. Model performance on the AMU task is the same as in Table 4. (†) indicates the model is used as an agent to call AudioLDM2-Large [66] and CogVideoX-2B [117] for audio and video generation.
| Initial Model | All-Modality Understanding | All-Modality Generation | |
|---|---|---|---|
| Category | Modality Instruction Following | | |
| Perception | Reasoning | IF | Safety |
| LLaVA-v1.5-7B† | 2.66 | 2.67 | 2.50 |
| Qwen2-VL† | 2.76 | 3.07 | 2.40 |
| Qwen2-Audio† | 3.58 | 4.53 | 3.40 |
| Chameleon-7B† | 1.44 | 2.97 | 2.80 |
| Llama3.1-8B-Instruct† | 1.05 | 1.20 | 1.20 |
| Gemini-1.5-Pro† | 5.36 | 5.67 | 6.70 |
| GPT-4o† | 2.66 | 3.48 | 4.20 |
As for the instruction-following metric, the performance of closed-source LLM-based agents declines significantly as the capability of the tool models decreases. Analysis of specific test cases shows that, for identical test prompts, closed-source models often generate more fine-grained instructions than open-source models. Stronger tool models use these fine-grained instructions effectively to produce outputs that meet the test requirements. In contrast, weaker tool models struggle with overly complex and detailed instructions, leading to reduced output quality and lower instruction-following scores. This issue is especially evident in tasks such as video and audio generation, which rely heavily on temporal sequencing. The two closed-source models often provide instructions refined down to the temporal dimension, but the corresponding video and audio generation models cannot fully adhere to the required temporal sequences, leading to the loss of critical information. Conversely, open-source models, through more general instructions, better guide the tool models to produce outputs that closely follow the test instructions.
12.2.4 Human Annotation for Modality Synergy Task
This guide outlines the process for evaluating multimodal outputs generated by two different models. Your task is to assess the correlation and consistency between the outputs of modality-1 and modality-2 for each model and assign a score to reflect how effectively these outputs complement one another. When scoring, it is important to distinguish between the two models' outputs, avoiding identical scores for both models unless absolutely necessary.
Evaluation Criteria
Scoring Guidelines

Figure 16. Examples of the AMG task. We create two distinct instruction test sets for evaluating the three metrics of the AMG task, dividing the instruction-following section into four modality-based subsets. The rationale is that in multimodal scenarios, evaluating instruction-following requires specific instructions to assess the model's ability to generate content that aligns with the instructions. Conversely, for the modality selection and modality synergy tasks, the instructions should enable open-ended responses to assess the model's autonomous use of modalities. For detailed example analysis, please refer to Sec. 12.2.5.
• Score 3: The outputs display basic correlation and consistency. Both modalities describe the same topic or object but do not offer additional insights or value when combined. The information is redundant, and no significant gain is observed from their interaction.
• Score 4: The outputs demonstrate strong correlation and consistency, addressing the same topic or object. While minor variations in presentation may exist, they do not conflict but instead provide additional perspectives that moderately enhance understanding. The interaction between modalities is effective, though not as impactful as in the 5-point case.
• Score 5: The outputs exhibit an exceptional level of correlation and consistency. Each modality significantly enhances the other, providing unique yet complementary information. Together, they form a cohesive and synergistic combination, offering insights beyond the sum of their individual contributions.
Important Notes
12.2.5 Evaluation Examples
Evaluation examples for the AMG task are shown in Fig. 16. In the Instruction-Following module, we evaluate the model's ability to follow instructions across the text, visual (images and videos), and audio modalities. For the text and image modalities, the generated content aligns well with the requirements in the instructions, resulting in high evaluation scores. However, the generated content fails to match the instructions for the video and audio modalities: for example, the video does not depict the action of a woman standing up, and the audio does not reflect the described sound transition. As a result, the model can only correctly complete part of the multiple-choice questions derived from the original instructions, leading to lower scores.
In the left part of the Modality Selection & Synergy module, the text on the left describes elements such as sunlight, a magnificent temple, and glowing symbols, all of which are effectively mirrored in the corresponding image. In contrast, the image on the right fails to depict key features like the hidden entrance and seaweed forest mentioned in the text. This highlights the superior modality synergy of the left response.
In the right part of the Modality Selection & Synergy module, the left text aligns with its corresponding image, while the right image depicts a scene similar to the left one; however, the lack of visual description in the right text creates a significant disparity in their modality synergy scores. In terms of text and audio synergy, both responses feature nearly identical audio content, but the absence of rain and wave sound descriptions in the left text results in a lower score than the right. Regarding image and audio synergy, the information conveyed by the images and audio is nearly identical in both responses, resulting in comparable scores.
