Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback


Figure 1. Motivated by the challenge of achieving all-modality human preference alignment, particularly the limitations of binary preferences in accurately reflecting human preferences, we introduce align-anything: Data: align-anything-200k, which covers text, image, audio, and video modalities and 8+ specific subtasks, annotated with preference and language feedback; Algorithm: improving all-modality alignment by learning from language feedback; Evaluation: encompassing all-modality understanding and generation.
Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions – such as instruction following – becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., models with input and output in any modality, also named any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Second, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains an unexplored area. Finally, there is a lack of a systematic framework to evaluate the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes meticulously annotated 200k all-modality human preference data. We then introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model's instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework – eval-anything. All data, models, and code have been open-sourced for the community. For more details, please refer to https://github.com/PKUAlignment/align-anything.

Figure 2. Composition and distribution of align-anything-200k. Our dataset comprises 8 subtasks across text, image, audio, and video modalities. Each modality exhibits distinct semantic features and distribution patterns, covering various latent spaces. This highlights that all-modality alignment cannot rely solely on data from specific modalities; rather, it requires the integration of data across modalities.
1. Introduction
Our world is inherently multimodal [62, 100, 126]. Humans perceive the world through various sensory organs, acquiring information in multiple modalities such as text, images, audio, video, and others. These different forms of information often complement and interact with each other. Each sensory channel has unique advantages in conveying specific concepts and enhancing our understanding of the world. With the success of large language models (LLMs) [1, 7, 128], researchers aim to extend these models to handle multiple modalities, enabling them to perceive and generate any modality [5, 65, 98, 120]. This would allow the models to respond using the most appropriate modality, achieving truly human-like AI [16, 34, 39, 52, 64, 104, 108, 109, 117].
Consequently, the research community has actively developed foundational models capable of handling arbitrary modalities [58, 108]. Building on LLMs, prior work uses individual image/text encoders or domain-specific decoders for input or output processing [65, 108, 124, 135], leveraging the MoE architecture [60] and diffusion techniques [41]. In this line of work, each modality is encoded by a single encoder, with non-text modality information mapped to the text space via projection layers. Additionally, Chameleon [98] has experimented with encoding images during the pre-training phase using fully token-based representations to handle both image and text modalities. However, handling more modalities remains a significant challenge.
On the other hand, reinforcement learning from human feedback (RLHF) plays a significant role in aligning models with human intentions [81]. GPT-4 [1] has effectively boosted the model's instruction following using RLHF. LLaMA series models [99] have significantly improved performance in code, mathematics, and reasoning through post-training methods, e.g., DPO [87]. However, this line of work is limited to a single modality. Researchers have tried to apply RLHF or DPO directly to multimodal scenarios, such as models with text and image modalities [95, 120]. However, these methods, which rely on aligning responses with binary human preferences (e.g., one answer is better than another) [136], struggle to be effective with more complex and diverse modalities. ImageBind [38] proposes using single-modality information as a binding, such as using images as the bridging feature to learn joint embeddings. So, how can we establish unified preference modeling that remains effective across any modalities?
All-modality models refer to models capable of accepting input and generating output in any modality, essentially functioning as any-to-any models. Language can serve not only as a binding across different modalities but also as a natural human preference carrier. In this work, we propose utilizing language feedback to unify human feedback across all modalities, marking the first attempt to extend RLHF/DPO to arbitrary modality space and promoting the development of a general all-modality model. Overall, our work makes the following contributions:
• Data (Sec. 2): The first all-modality human preference dataset – align-anything-200k – on text, image, audio, and video modalities. Utilizing a two-stage human annotation process, this dataset accurately captures genuine human preferences, aiming to improve the alignment of all-modality models with human intentions.
• Algorithm (Sec. 3): The first algorithm applicable to enhance RLHF/DPO across all modalities: learning from language feedback (LLF). Our empirical results demonstrate that LLF enhances RLHF performance across 5 modalities, 5 open-source models, and 7 popular benchmarks, achieving an average 5.83 times improvement over the baseline RLHF.
• Evaluation (Sec. 4): To our knowledge, there is no system specifically designed for all-modality model evaluation. To address this gap, we propose the first evaluation tool (eval-anything), tailored for all-modality evaluation. This tool not only meticulously constructs evaluation tasks across various modalities but also emphasizes assessing the unique features of all-modality models, such as modality selection and synergy.
• Framework and Open Source: To further the alignment of the all-modality model with human intentions, we have released all of our resources to the public.

Response A: The person is using some kind of object. Response B: $\checkmark$ The person in the video is blowing a tube-shaped object to imitate the sound of a train whistle.

Preference Dimensions
Language Feedback
Critique: Response A is not very helpful … the fact that they are blowing into a tube-like object (Video) and the possible train whistle sound (Audio) being introduced in the audio.
Refinement: The response should focus on information in the audio and video: the person in the picture is blowing a flute to imitate the sound of a train whistle.
Figure 3. All-modality preference and language feedback annotation of align-anything-200k. For all-modality preference annotation, we classify the instruction-following metrics into two categories: modality-agnostic and modality-specific. Each fine-grained dimension is assigned a corresponding score along with a rationale. Additionally, we offer detailed language feedback, including critiques and refinement suggestions, which integrate information from multiple modalities within the responses.
2. Datasets
A primary challenge lies in the fact that existing preference datasets are predominantly focused on single-modal tasks [26, 52, 73, 120], lacking comprehensive datasets that encompass all modalities, thus limiting further research progress.
In response, we open-source the first all-modality human preference dataset – align-anything-200k – to enhance the instruction-following capabilities of all-modality models across tasks such as question answering, complex reasoning, etc. During the annotation process, we observed that introducing additional modal information significantly reduced binary human preference consistency (as shown in Tab. 1). Therefore, in addition to annotating fine-grained binary preferences, we introduce language feedback as an additional constraint to more accurately capture human preferences. In short, our dataset includes the following:
• All-modality Human Preference Dataset. We annotate human preference data across various modalities (including text, image, video, and audio) with a focus on instruction-following capabilities (Fig. 2).
• One More Thing – Language Feedback. In addition to the widely used binary preferences, we incorporate language feedback (critique and refinement) to improve human preference consistency across all modalities.
2.1. All-Modality Human Preference Dataset
The align-anything-200k dataset aims to capture human preferences across 8 subtasks of all modalities, as shown in Fig. 2. With the increase in modalities, the complexity and inconsistency of human intent preferences rise significantly, driven by the unique semantic characteristics of each modality. To achieve consistent all-modality preference modeling, we decouple annotation targets into two types, which serve as evaluation metrics for instruction-following:
• Modality-agnostic refers to dimensions that are applied universally across modalities, including: (1) Prompt adherence, which requires responses to be consistent with the input prompts, accurately reflecting the specified elements. (2) Rule conformity, where responses adhere to logical, physical, or scientific principles related to the scenario or theme described in the prompt. (3) Information richness, which emphasizes that responses should be thorough and detailed in addressing the query.
• Modality-specific represents preference dimensions tailored to the characteristics of each modality. For example, in video-related subtasks, we introduce three additional dimensions – temporal consistency, content coherence, and motion naturalness – to better evaluate details of the output such as duration and dynamics.
By decomposing the instruction-following dimensions, we establish the evaluation standards for all-modality preference annotation. More details about the annotation document can be found in Appendix: Datasets.
2.2. One More Thing – Language Feedback
During the annotation process, human preference consistency declines when binary preference annotations are conducted directly. As illustrated in Tab. 1, the introduction of different modalities makes it challenging for binary preferences to fully capture human preference [47, 59, 90]. To address this issue, we introduce language feedback, utilizing natural language to describe discrepancies between the model's response and the standard criteria, thereby improving the consistency of human annotation.
Table 1. Human preference agreement across different modalities. The results demonstrate that traditional annotation struggles to capture modality-specific details, whereas language feedback enhances consistency by conveying fine-grained human preferences effectively.
| | T2T | TI2T | T2I | TI2TI | TV2T | T2V | TA2T | T2A |
|---|---|---|---|---|---|---|---|---|
| w/o Language Feedback (%) | 73.2 ± 2.1 | 62.5 ± 7.3 | 63.2 ± 6.3 | 62.7 ± 7.7 | 62.1 ± 5.7 | 60.8 ± 4.6 | 61.8 ± 6.9 | 58.7 ± 5.8 |
| w/ Language Feedback (%) | 74.1 ± 1.2 | 70.5 ± 3.4 | 69.2 ± 2.2 | 68.3 ± 2.1 | 67.2 ± 1.8 | 65.8 ± 1.2 | 68.6 ± 2.4 | 64.3 ± 1.7 |
Specifically, we provide language feedback for each response, which includes two parts: critique, which assesses the strengths and weaknesses of the response based on detailed criteria, and refinement, which offers specific improvement suggestions to enhance the models' instruction-following capabilities. Through the quality control process with annotators and training verification (Sec. 3), we identify two key advantages of using language feedback: (1) as a natural carrier of human preferences, it clearly identifies fine-grained defects and areas for improvement in the responses, providing clearer guidance compared to binary preferences [10]; (2) as a connecting binding, it bridges human intent across different modalities and creates a unified, modality-agnostic preference modeling approach, capable of handling more modalities than typical methods. Further examples can be found in Appendix: Datasets.
2.3. Dataset Construction
2.3. 数据集构建
We collect initial prompts from 24 multimodal datasets and refine them based on modality-agnostic and modality-specific dimensions with existing multimodal models (e.g., GPT-4o [80]). For example, we emphasize dynamic temporal aspects for video or quality characteristics for audio to better adapt to each modality. Finally, we gather the responses from 27 models for the 8 subtasks. More details can be found in Appendix: Datasets.
Annotation Process Inspired by PKU-SafeRLHF [48], we utilize a Human-AI joint annotation process to conduct preference annotations across fine-grained dimensions for each subtask. As shown in Fig. 3, for each fine-grained dimension’s binary preferences, we assign scores from 0 to 3 following strict guidelines and detailed rationales. Additionally, we collect language feedback from human annotators and API-based models. The annotators provide language feedback for each response by defining the critique scope, performing the critique, offering refinement suggestions, and organizing the critique and refinements into comprehensive language feedback. As shown in Fig. 3, language feedback provides more fine-grained improvement directions for cross-modal responses, serving as a carrier for expressing natural human preferences across modalities.
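As a concrete illustration, a single annotated pair in this scheme might be stored as follows. Field names are hypothetical sketches, not the released dataset's actual schema:

```python
# Hypothetical sketch of one align-anything-200k annotation record.
# Field names are illustrative; the released dataset's schema may differ.
record = {
    "prompt": "What is the person in the video doing?",
    "modality": "TV2T",  # text+video input, text output
    "responses": {
        "A": "The person is using some kind of object.",
        "B": "The person in the video is blowing a tube-shaped object "
             "to imitate the sound of a train whistle.",
    },
    # Per-dimension scores on the 0-3 scale (rationales omitted here).
    "scores": {
        "A": {"prompt_adherence": 1, "information_richness": 0},
        "B": {"prompt_adherence": 3, "information_richness": 2},
    },
    "preferred": "B",
    # Unified language feedback: critique plus refinement suggestion.
    "feedback": {
        "A": {
            "critique": "Response A is not very helpful; it ignores the "
                        "tube-like object (video) and the train-whistle "
                        "sound (audio).",
            "refinement": "Describe what is seen and heard: the person is "
                          "blowing a tube-shaped object to imitate a train "
                          "whistle.",
        }
    },
}

# The binary preference is consistent with the per-dimension score totals.
total = {r: sum(s.values()) for r, s in record["scores"].items()}
assert max(total, key=total.get) == record["preferred"]
```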
Human and AI Agreement Analysis We integrate API-based multimodal models (e.g., GPT-4o, and others) into human evaluation for binary preference annotations. To analyze the consistency of the dataset with human preferences, we sample 100 data pairs from each subtask, ask 10 annotators to perform pairwise comparisons, and statistically measure the preference consistency, as shown in Tab. 1. The results demonstrate that with the traditional binary preference annotation pipeline, agreement tends to decline as multimodal information is incorporated. This suggests that the conventional pipeline struggles to scale effectively across all modalities. Because language-based feedback offers targeted and fine-grained human preference information, higher consistency is achieved across all modalities.
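The agreement statistic in Tab. 1 can be computed as the fraction of annotator pairs that choose the same preferred response; a minimal sketch with made-up labels (the paper does not specify its exact agreement formula, so this is one plausible reading):

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of annotator pairs that agree on the preferred response.

    labels: per-annotator choices ("A" or "B") for one data pair.
    """
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Example: 10 annotators judge one response pair; 7 prefer "A".
labels = ["A"] * 7 + ["B"] * 3
rate = pairwise_agreement(labels)  # (C(7,2)+C(3,2)) / C(10,2) = 24/45
print(round(rate, 3))  # -> 0.533
```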
3. Learning from Language Feedback
In this section, we introduce learning from language feedback (LLF). It utilizes language feedback to optimize responses, synthesizing preference data that enhances all-modality alignment performance. First, we review the RLHF pipeline, including PPO [81] and DPO [87], highlighting the limitations of binary preference feedback. Then we demonstrate how to practically implement LLF, including its two main stages: feedback modeling and self-improving. Finally, we empirically verify that LLF achieves an average 5.83 times improvement across 5 modalities, 5 open-source models, and 7 popular benchmarks.
3.1. Background and Preliminary
PPO consists of two main stages: step 1, preference modeling, and step 2, policy optimization. The former involves the collection of comparison data, essential for training the reward model $r_{\mathrm{RM}}(\cdot\mid\cdot)$. The process starts with response pairs $(y_{1},y_{2})$ generated by the initial model from shared prompts $x$. Human annotators are then tasked with selecting their preferred response from each pair, denoted as $y_{w}\succ y_{l}\mid x$, where $y_{w}$ and $y_{l}$ denote the preferred and dispreferred answers. The latter (step 2) is guided by $r_{\mathrm{RM}}(\cdot\mid\cdot)$. This process is commonly modeled as a bandit setting, where a reward is obtained from $r_{\mathrm{RM}}$ at the end of each response. The RL objective is,
$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\left[r_{\mathrm{RM}}(y\mid x)\right]-\beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_{\theta}(\cdot\mid x)\,\|\,\pi_{\theta_{0}}(\cdot\mid x)\right]$$
DPO directly fine-tunes the model to align with preference pairs $y_{w}\succ y_{l}\mid x$. It consolidates the two RLHF stages of preference modeling and policy optimization into one stage. Let $\theta_{0}$ denote the initial model parameters. The optimization objective can be rewritten as,
$$\mathcal{L}_{\mathrm{DPO}}(\theta)=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\theta_{0}}(y_{w}\mid x)}-\beta\log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\theta_{0}}(y_{l}\mid x)}\right)\right]$$
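A minimal sketch of the DPO objective for a single preference pair, assuming per-sequence log-probabilities (summed over tokens) under the policy and the frozen initial model have already been computed:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair y_w > y_l | x.

    Inputs are per-sequence log-probabilities (summed over tokens) under
    the policy (theta) and the frozen initial model (theta_0).
    """
    # Implicit reward margin between the chosen and rejected responses.
    logit = beta * ((policy_logp_w - ref_logp_w)
                    - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(logit)), written in a numerically stable form.
    return math.log1p(math.exp(-logit))

# Toy example: the policy already slightly favors y_w over the reference.
loss = dpo_loss(-10.0, -11.0, -10.5, -10.5)
print(round(loss, 4))  # -> 0.6444
```

In practice this would be averaged over a batch; minimizing it pushes the policy to widen the margin between chosen and rejected responses relative to the initial model.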
Figure 4. Learning from language feedback pipeline: (1) Feedback Modeling: we perform SFT on the initial model using annotated language feedback. (2) Self Improving: the initial model optimizes responses given the language feedback to synthesize preference pairs.
Table 2. Main experiment results. We validate that LLF can enhance the performance of RLHF across 5 modalities, 5 open-source models, and 7 popular benchmarks, achieving an average 5.83 times improvement over the baseline RLHF.
| Modality | Benchmark | Focus | Initial Model | Initial Score | DPO | DPO+LLF | PPO | PPO+LLF |
|---|---|---|---|---|---|---|---|---|
| TI2T | LLaVA-Bench [65] | Visual QA | LLaVA-1.5-7B | 90.03 | 88.20 (-1.83) | 97.36 (+7.33) | 92.67 (+2.64) | 98.76 (+8.73) |
| TI2T | LLaVA-Bench [65] | Visual QA | LLaVA-1.5-13B | 90.46 | 98.36 (+7.90) | 100.33 (+9.87) | 95.63 (+5.17) | 100.59 (+10.13) |
| TI2T | MIA-Bench [84] | Hierarchical visual QA | LLaVA-1.5-7B | 61.15 | 64.30 (+3.15) | 65.32 (+4.17) | 70.94 (+9.79) | 72.59 (+11.44) |
| TI2T | MIA-Bench [84] | Hierarchical visual QA | LLaVA-1.5-13B | 70.34 | 70.72 (+0.38) | 75.43 (+5.09) | 76.57 (+6.23) | 80.26 (+9.92) |
| T2I | ImageReward [113] | Fine-grained preference | Chameleon-7B | -0.80 | -0.65 (+0.15) | -0.46 (+0.34) | -0.52 (+0.28) | -0.44 (+0.36) |
| T2I | HPS v2 [109] | Coarse-grained preference | Chameleon-7B | 25.19 | 25.56 (+0.37) | 25.62 (+0.43) | 25.56 (+0.37) | 25.71 (+0.52) |
| TI2TI | InterleavedBench [67] | Interleaved text-image QA | Chameleon-7B | 1.63 | 1.65 (+0.02) | 1.70 (+0.07) | 1.35 (-0.28) | 1.78 (+0.15) |
| TV2T | Video-ChatGPT [72] | Detailed video QA | Qwen2-VL-7B | 3.01 | 3.03 (+0.02) | 3.29 (+0.28) | 3.02 (+0.01) | 3.05 (+0.04) |
| TA2T | AIR-Bench [116] | Mixed audio QA | Qwen2-Audio-7B | 6.61 | 6.64 (+0.03) | 6.71 (+0.10) | 6.49 (-0.12) | 6.63 (+0.02) |

Since both PPO and DPO are modality-agnostic, all-modality alignment can intuitively be achieved with preference pairs $y_{w}\succ y_{l}\mid x$, and recent works have validated its effectiveness [3, 24, 95, 132]. However, the preference between coupled responses reflects a trade-off across different dimensions, e.g., their style and correctness. As the number of modalities increases, these dimensions become more complex, making it harder to identify misaligned behavior from binary preferences alone [46, 59].
3.2. Practical Implementation
Inspired by Constitutional AI [10], LLF comprises two steps, feedback modeling and self-improving, as shown in Fig. 4. The former employs SFT to train the feedback model, enabling it to provide language feedback based on $x$ and $y$. The latter allows the model to refine its responses based on the language feedback $c$.
Feedback Modeling The training process utilizes a dataset $\mathcal{D}=\{(x_{i},y_{i},c_{i})\}_{i=1}^{N}$, where $N$ is the size of the dataset, $x_{i}$ denotes the prompt, $y_{i}$ the response, and $c_{i}$ the corresponding language feedback. Let $P_{\varphi}(c_{i}\mid x_{i},y_{i})$ denote the probability of the target sequence $c_{i}$ given the input sequence $(x_{i},y_{i})$ and the model parameters $\varphi$; the training objective of the feedback model can be expressed by the cross-entropy loss:
$$\mathcal{L}(\varphi)=-\frac{1}{N}\sum_{i=1}^{N}\log P_{\varphi}(c_{i}\mid x_{i},y_{i})$$
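Concretely, this SFT objective reduces to the mean negative log-likelihood of the feedback tokens, with the loss masked to the feedback sequence only; a minimal sketch (the token-level log-probabilities stand in for a decoder conditioned on the prompt and response):

```python
import math

def feedback_sft_loss(token_logprobs):
    """Cross-entropy (mean NLL) of a target feedback sequence c_i.

    token_logprobs: log P_phi(c_t | x, y, c_<t) for each target token.
    In practice, loss is computed only on feedback tokens; the (x, y)
    context tokens are masked out of the objective.
    """
    return -sum(token_logprobs) / len(token_logprobs)

# Toy target of 4 feedback tokens with assumed per-token probabilities.
logps = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(round(feedback_sft_loss(logps), 4))  # -> 1.0397
```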
Self Improving We first collect the initial model's response $y$ to the given prompt $x$ online. Then, we gather feedback $c=M_{\phi}(x,y)$ from the feedback model $M_{\phi}$. Finally, we have the initial model generate responses $y^{*}=M_{\theta}(x,c)$ conditioned on this feedback. For example, $y$ may suffer from redundancy and hallucination, while $c$ will remind the generation of $y^{*}$ to avoid these problems. Based on online sampling, this often makes $y^{*}$ and $y$ differ significantly in certain aspects (e.g., reduced redundancy), thereby generating more learnable preference pairs.
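The self-improving stage can be sketched as a simple loop over online prompts. The wrappers `M_theta` and `M_phi` below are hypothetical stand-ins for the policy and feedback models, not the paper's actual interfaces:

```python
def synthesize_preference_pairs(prompts, M_theta, M_phi):
    """Sketch of LLF self-improving: synthesize preference pairs online.

    M_theta(x, feedback=None) -> response; M_phi(x, y) -> language feedback.
    Both signatures are illustrative stand-ins for the real models.
    """
    pairs = []
    for x in prompts:
        y = M_theta(x)                    # initial response
        c = M_phi(x, y)                   # critique + refinement
        y_star = M_theta(x, feedback=c)   # refined response, conditioned on c
        # (y_star, y) forms a chosen/rejected pair for DPO or reward modeling.
        pairs.append({"prompt": x, "chosen": y_star, "rejected": y})
    return pairs

# Toy stand-ins that only illustrate the data flow.
M_theta = lambda x, feedback=None: x + (" [refined]" if feedback else " [draft]")
M_phi = lambda x, y: "avoid redundancy and hallucination"
pairs = synthesize_preference_pairs(["describe the video"], M_theta, M_phi)
print(pairs[0]["chosen"])  # -> describe the video [refined]
```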
3.3. Experiment
We empirically verify that LLF offers several key advantages over traditional binary feedback. Unified preference: as language feedback typically optimizes responses along key dimensions, models can easily learn genuine human preferences beyond pairs. Rich information: since the feedback model can generate a substantial amount of language

Figure 5. Comparison of DPO+LLF with DPO on varying language feedback amounts. We trained feedback models using 25%, 50%, and 75% as much language feedback (LF) as binary feedback (BF), synthesized an equal number of preference pairs from each, and then compared the performance of DPO against the initial model. We find that a small amount of language feedback can synthesize preference pairs that surpass those derived from binary feedback.
| Modality | TI2T | TA2T | TV2T | T2I | TI2TI |
|---|---|---|---|---|---|
| w/o Feedback Model | 44.8% | 47.5% | 39.6% | 33.4% | 41.8% |
| w/ Feedback Model | 57.8% | 64.5% | 63.8% | 56.9% | 67.2% |
Table 3. Ablation study of the feedback model. We compare the win rate of self-improving models against initial models on all-modality hold-out evaluation prompts. The w/o feedback model setting uses the initial model as the feedback model, judged by GPT-4o (TI2T, T2T, TI2TI) and Gemini 1.5 Pro (TA2T, TV2T).
feedback, LLF can effectively serve as a robust synthesizer of preference data. Specifically, we train Chameleon-7B [98] for T2I and TI2TI, LLaVA-7B and LLaVA-13B [65] for TI2T, Qwen2-Audio-7B [25] for TA2T, and Qwen2-VL-7B [103] for TV2T, all based on align-anything-200k. For more details about the training settings, please refer to the Appendix: Training Details.
Improvement with Unified Preference As shown in Tab. 2, LLF-synthesized preference pairs reflect more unified human preferences, enhancing all-modality alignment performance. We observe that DPO and RLHF using binary pairs fall short in some modalities. With LLF, however, they yield positive improvements across all modalities. Interestingly, we find that the improvement of LLF on LLaVA-13B is greater than on LLaVA-7B, suggesting that LLF performs better on stronger models.
Efficiency with Rich Information LLF provides richer information and supports efficient preference data synthesis. As shown in Fig. 5, despite using only a smaller amount of language feedback, DPO+LLF outperforms DPO with binary feedback. At the same time, the reduction in data amount does not significantly weaken the capabilities of LLF. This suggests that in all-modality alignment, labeling many binary preference pairs is less effective than labeling a smaller amount of language feedback.
Ablation Study: Feedback Model is Necessary As shown in Tab. 3, the feedback model generates language feedback on the initial model's responses to hold-out evaluation prompts, which is then used to regenerate these responses. Results indicate that this approach yields improvements, while replacing the feedback model with the initial model does not achieve comparable positive effects.
4. Evaluation: Eval-Anything
Currently, evaluating all-modality models relies on human experts, which is inefficient and costly. While combining benchmarks for individual modalities could offer a broader evaluation [125], differences in data preparation, post-processing, and metrics across benchmarks hinder accurate performance assessment. Additionally, all-modality models uniquely select the appropriate modalities based on user queries, enabling seamless cross-modal synergy, a capability that traditional single-modality evaluation pipelines fail to capture fully.
目前,评估全模态模型依赖人类专家进行评估,效率低且成本高昂。虽然结合单一模态的基准测试可以提供更广泛的评估 [125],但不同基准在数据准备、后处理和指标上的差异阻碍了准确的性能评估。此外,全模态模型能够根据用户查询选择适当的模态,实现无缝的跨模态协同,这是传统单一模态评估流程无法完全捕捉的能力。
To address this gap, we deliver an evaluation framework specifically designed for all-modality models – eval-anything – including (1) all-modality understanding (AMU), which assesses a model's ability to simultaneously process and integrate information from all modalities, and (2) all-modality generation (AMG), which evaluates a model's ability to follow user instructions, autonomously select modalities, and work synergistically across different modalities for output.
4.1. Composition of Evaluation Dataset
4.1.1 All-Modality Understanding
All-modality models aim to both understand individual modalities and combine information across them to generate high-quality responses. To assess their comprehensive multimodal processing, we create 164 test entries, each containing textual, visual (image or video), and auditory (audio or speech) components. These interconnected modalities require the model to integrate all inputs accurately, as failure in any one modality leads to incorrect answers. For instance, as shown in Fig. 6, if a model fails to process visual inputs, it may miss that the person in the picture is frightened. Similarly, without auditory processing, it might not understand that the person’s fear is due to a barking dog.
We use GPT-4 [1] to evaluate model responses on a scale of 1 to 10. However, previous studies [65, 116] underscore limitations in multimodal evaluation that relies on a single human annotation as a reference. As the number of modalities increases, so does the complexity, making it harder to reach consensus and increasing subjective bias in single annotations. Moreover, since GPT-4 lacks true multimodal comprehension, evaluating only its text responses does not confirm whether essential information from each modality is fully understood. To address this, we gather responses from 10 annotators and extract key terms from each modality. As illustrated in Fig. 7, the distribution of multiple annotations mitigates the bias associated with single-reference evaluation. Additionally, the inclusion of key terms enables GPT-4 to more accurately detect potential errors in its responses. The details for AMU can be found in Appendix: Evaluation.
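The key-term check described above can be sketched as a simple coverage function. This is a minimal illustration, not the benchmark's actual implementation: in practice the extracted terms are supplied to GPT-4 as references rather than string-matched, and the modality names and term lists below are hypothetical.

```python
def key_term_recall(response, key_terms):
    """Fraction of each modality's key terms mentioned in the response.

    key_terms maps a modality name to the terms annotators extracted from
    that modality's input; a miss in any modality suggests the model
    failed to process the corresponding input.
    """
    text = response.lower()
    return {
        modality: sum(term.lower() in text for term in terms) / len(terms)
        for modality, terms in key_terms.items()
    }

# Hypothetical example mirroring Fig. 6: the answer is only correct when
# the visual (frightened person) and auditory (barking dog) inputs combine.
scores = key_term_recall(
    "The person looks frightened because a dog is barking at them.",
    {"visual": ["frightened"], "auditory": ["barking"]},
)
```

A response that covers the terms from every modality scores 1.0 in each entry; a dropped modality shows up as a low recall in exactly that entry, localizing the failure.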

Figure 6. The eval-anything benchmark consists of two components: (Up) AMU: All-Modality Understanding, where the model answers open-ended questions by integrating textual instructions, images, videos, and audio. (Down) AMG: All-Modality Generation is divided into subtasks of instruction-following, modality selection, and synergy. The model generates outputs for each modality (text, image, video, audio) based on instructions, with human-preferred combinations guiding modality selection metrics. A trained judge model evaluates the relevance, consistency, and synergy across different modalities in the outputs.

Figure 7. Distribution of model and human annotation. In AMU task, comparing the distribution of an all-modality model’s responses (blue) with human annotations (red) reveals that a single annotation inadequately covers the model’s semantic space (left), whereas multiple annotations broaden the red region, improving evaluation coverage by capturing more response diversity (right).
4.1.2 All-Modality Generation
In generation tasks, all-modality models can outperform single-modal ones by delivering diverse information across multiple formats under the same user query. To achieve this, they must follow SSI principles: (1) Select relevant modalities automatically to reduce redundancy, (2) Synergistic integration for maximum information gain, and (3) Instruction-following in each modality. The overall score for the AMG task is as follows:

Instruction Following To assess instruction-following, we design 100 prompts per modality. Using GPT-4o, we score the text and image outputs. For the audio and video modalities, inspired by the TIFA [44] pipeline, we create multiple-choice questions based on the prompts and employ Qwen2-Audio [24] and Qwen2-VL [103] as evaluation models. This process yields independent modality scores $S_{T}, S_{V}, S_{A}$, each ranging from 0 to 10.
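For the audio and video tracks, the multiple-choice protocol reduces scoring to answer accuracy. A minimal sketch of that reduction follows; the `answer_fn` stub stands in for Qwen2-Audio or Qwen2-VL inspecting the generated output, and the question contents are hypothetical:

```python
def instruction_following_score(checks, answer_fn):
    """Score one generated audio/video output on a 0-10 scale.

    checks: list of (question, options, correct_option) tuples derived
    from the prompt, TIFA-style. answer_fn(question, options) returns
    the option the evaluation model picks after seeing the output.
    """
    correct = sum(answer_fn(q, opts) == gold for q, opts, gold in checks)
    return 10.0 * correct / len(checks)

# Toy stub evaluator that always picks the first option.
checks = [
    ("What animal appears?", ["cat", "dog"], "cat"),
    ("What is the weather?", ["rainy", "sunny"], "sunny"),
]
score = instruction_following_score(checks, lambda q, opts: opts[0])
```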
Modality Selection Properly combining modalities in the model's response can provide rich perspectives while reducing redundant information. We employ 25 crowdsourced annotators to identify the expected modality combinations for given text instructions. The voting process yields one of the quantitative metrics for the AMG task, with appropriate rewards and penalties applied based on the modalities generated by the model. As shown in Fig. 6, $\alpha$ represents the percentage of human votes for each modality combination; if the model outputs all three modalities, this parameter will be one-third of $\alpha_{T,V,A}$.
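One way to read the reward/penalty rule is that a combination's vote share $\alpha$ is split evenly across the modalities the model emits, so outputting all three modalities yields one-third of $\alpha_{T,V,A}$ as stated. A sketch under that assumption (the benchmark's exact scheme may apply additional penalties):

```python
def modality_selection_score(output_modalities, vote_shares):
    """Assumed scoring rule: the human vote share for the emitted
    combination, divided by the combination size to penalize redundancy.

    vote_shares maps a frozenset of modality labels (e.g. {"T", "V"})
    to that combination's share of the 25 annotators' votes.
    """
    combo = frozenset(output_modalities)
    if not combo:
        return 0.0
    return vote_shares.get(combo, 0.0) / len(combo)

# Hypothetical vote distribution for one instruction.
votes = {frozenset({"T"}): 0.6, frozenset({"T", "V", "A"}): 0.3}
```

Under this reading, a model that emits all three modalities when most annotators wanted text alone earns only 0.3/3 = 0.1, while matching the majority's text-only choice earns 0.6.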
Modality Synergy refers to the consistency between different modalities in a model's responses. We assess modality synergy by training a judge model on human annotations obtained through data synthesis. Specifically, we develop a data synthesis pipeline centered on LLM-based agents that use tools to invoke audio, image, and video generation models, constructing a dataset with textual inputs and multimodal outputs. We employ human annotators to annotate preference data considering the synergy between the modalities in each response, and then train a judge model that accepts all-modality input. As shown in Fig. 6, responses with high synergy scores $(S_{T,V}, S_{T,A}, S_{V,A})$ should show high relevance and consistency between modalities, with visual and auditory elements enriching the text from various perspectives. More details on training the judge model can be found in Appendix: Evaluation.
Table 4. The performance of models in the eval-anything benchmark. Additional input modalities are masked for models that do not support all-modality input. Since most current open-source models lack support for all-modality output, (†) indicates that models are used as agents to invoke AudioLDM2-Large [66] and FLUX.1-schnell [53] for audio and image generation.

Figure 8. Performance of models in modality selection metric. We analyze the model’s modality selection performance across all instructions in AMG task, comparing it to the voting results on preferred modality combinations from 25 human annotators.
4.2. Evaluation Results & Analysis
Human and AI Agreement In the modality synergy task, after training on the 5k preference dataset, the experiment reveals a $66.4\%$ agreement rate between the judge model and human annotators. This figure is consistent with human agreement ratios reported in similar studies on modeling human preferences [113] in the multimodal large language model domain.
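The reported agreement rate is a simple exact-match fraction between the judge model's preferred response and the human-annotated preference on held-out pairs; a sketch (the per-pair labels below are hypothetical):

```python
def agreement_rate(judge_prefs, human_prefs):
    """Fraction of preference pairs on which the judge model and the
    human annotator pick the same response (e.g. 'A' or 'B' per pair)."""
    assert len(judge_prefs) == len(human_prefs)
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)
```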
Inputs vs. Outputs Most models in Tab. 4 support partial modality input and obtain baseline scores, but Gemini-1.5-Pro outperforms the others due to its ability to process all three modalities. In the AMG task, the average score is relatively low, with no model demonstrating a clear advantage across all sub-items. The results indicate that, compared to modality generation, models are more advanced in modality understanding, consistent with the developmental trends of all-modality models.
Truly All-Modality Model Current models still fall far short of truly all-modality models. For all-modality understanding, models using only a single modality score less than half the maximum points, as shown in Tab. 4. Even Gemini-1.5-Pro, which processes both visual and auditory inputs, fails to attain a perfect score. Unlike humans, who integrate information from multiple modalities, these models are limited by their inability to perceive different modalities in a fully integrated way, even though they perform nearly at the human level within individual modalities. Moreover, models align poorly with human choices when selecting output modalities, as illustrated in Fig. 8. Humans can adaptively choose the best combination of modalities based on the instruction. In contrast, models tend to output either only textual information, resulting in information loss, or all available modalities, causing redundancy. The limited multimodal capability, especially in models trained mainly on text, hampers their ability to synthesize information effectively, making it difficult to balance detail and conciseness.
5. Conclusion
In this work, we make the first exploration of fine-tuning all-modality models using human preference data across diverse modalities to ensure alignment with human intentions. We have open-sourced the align-anything dataset, incorporating 200k annotated human preference data across modalities. Our proposed alignment method leverages language feedback to capture complex, modality-specific human preferences, significantly enhancing the model's ability to follow instructions. To assess all-modality models, we developed an evaluation benchmark: eval-anything. All data, models, and code have been made openly available.
Ethics Impact Our data collection, approved by the Institutional Review Board, will be released under the CC BY-NC 4.0 license. The dataset integrates Q-A data from open-source and API-based models, offering the potential for developing AI aligned with human intentions. Although it could theoretically be misused for harmful purposes, we are committed to promoting safe AI technology.
Limitations and Future Work Despite having 45 experienced annotators, the team lacks sufficient diversity, which may limit the representativeness of human preferences. To address this, we plan to recruit a more varied group via platforms such as Amazon MTurk. Additionally, the absence of open-source all-modality foundation models restricts validation to individual modalities, so we encourage community efforts to develop such models and aim to expand experiments across more models. This is an ongoing effort, with the goal of scaling the dataset to millions.
References
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
Supplementary Material
6. Ethics Responsibility
Our data collection has been approved by an Institutional Review Board (IRB). The IRB file contains institutional information. To maintain anonymity in the double-blind review process, we did not upload the IRB documents alongside the supplementary materials. If needed, we are willing to discuss the IRB file further with the Ethics Reviewer, provided it does not compromise the double-blind review protocol.
7. Open-Source Assets and License
All dataset examples, code, and demos are attached to our supplementary material. In Sec. 6, we discuss potential risks and mitigation strategies related to model open-sourcing in detail. After the double-blind review process, we will actively engage with community feedback regarding the data, framework, and code and promptly address any issues related to version inconsistencies to further advance scientific research on large model alignment.
The following assets are planned for open-source release after the double-blind review process:
We have included samples from each modality of the dataset, along with additional related materials. After the double-blind review process, we will release all content as open source.
8. More Details of Related Works
Building on the success of large language models (LLMs) and the latest advancements in multimodal large language models (MLLMs), there is growing anticipation for integrating multiple modalities into a single model to achieve truly all-modality capabilities, i.e., all-modality models. However, numerous challenges must be overcome to reach this goal. We will introduce the related works from the aspects of dataset, algorithms, and evaluation.
Dataset Recently, several studies have proposed preference datasets for multimodal models, encompassing both understanding and generation tasks [6, 101]. However, current preference datasets are restricted to specific modalities and tasks, and they lack comprehensive cross-modal annotations. In the domain of Text-Image-to-Text (TI2T) understanding, existing preference datasets primarily aim to reduce hallucinations and enhance helpfulness in MLLMs [56, 95, 120, 134]. In the Text-to-Image (T2I) domain, most existing studies employ binary human ratings or preference rankings to reflect authentic human preferences [52, 109, 113]. Liang et al. [59], in contrast, gather more comprehensive feedback, focusing on misaligned regions and key terms. In the video and audio domains, aside from datasets such as LLaVA-Hound [127], SafeSora [27], and AudioAlpaca [73], which provide binary human preference data, further research on preferences remains relatively scarce.
Algorithms Recent all-modality alignment algorithms directly extend the RLHF pipeline to the corresponding modality, since both PPO [81] and DPO [87] are modality-agnostic. RLHF-V [120] has used human feedback to annotate image question-answering data, significantly reducing hallucinations in MLLMs. Similarly, Qwen2-Audio-7B-Instruct [24] has applied this approach to audio language models, improving audio comprehension and question-answering tasks. Similar strategies have also been practiced in video question-answering [3], image generation [11], and video generation [27] tasks. However, the preferences over coupled responses trade off across different dimensions, e.g., style and correctness. As the number of modalities increases, these dimensions become more complex, making it harder to identify misaligned behavior with binary preferences [46, 59]. Given the increased difficulty and cost of annotating all-modality data [121], there is an urgent need for an alignment algorithm that utilizes human feedback more efficiently.
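PPO and DPO are modality-agnostic because they consume only sequence log-probabilities, regardless of whether the output tokens encode text, image patches, or audio frames. For reference, the standard per-pair DPO objective can be written as:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy-vs-reference log-prob gaps on the
    chosen (w) and rejected (l) responses. Only log-probabilities are needed,
    which is why the same loss applies to any output modality."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))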
Evaluation Current benchmarks for all-modality models primarily focus on evaluating individual modalities. Numerous benchmarks are developed to assess the capabilities of large vision models across multiple dimensions, including image perception [35, 71, 93, 122], object hallucination [21, 50, 57], and safety performance [69, 70, 105]. For the video modality, Fu et al. [36], Li et al. [55], Ning et al. [79] develop various benchmarks to assess large vision models. Furthermore, in the audio domain, there are dedicated evaluation datasets for large audio models [107, 116]. However, these studies are not specifically designed to assess their cross-modal comprehensive understanding capabilities. Existing evaluation methods are primarily based on a series of tasks, such as Audio-Visual Question Answering [54, 115, 123], Audio-Visual Scene-aware Dialog [4, 92], and Audio-Visual Retrieval [17, 18]. Although involving multiple modalities, these tasks are not specifically designed to assess the all-modality capabilities of MLLMs.
9.1. Training Part
The align-anything framework integrates all-modality alignment algorithms (e.g., Reinforcement Learning from Human Feedback, RLHF), supporting SFT, RM, DPO, and PPO. Additionally, align-anything implements KTO [32], SimPO [76], and ORPO [42] in the text-to-text modality. Besides, align-anything offers a highly scalable model registration mechanism and currently supports training and deploying over 25 models. For more details, please refer to the align-anything-code/README.md file of our supplementary materials.

9. Align-Anything Framework

Figure 9. The open-sourced framework of align-anything.
Our work is built on the framework of align-anything, which is designed for training and evaluation across all modalities. As shown in Fig. 9, the align-anything framework aims to align all-modality large models, including large language models (LLMs), vision language models (VLMs), and others, with human intentions and values. Overall, this framework has the following characteristics:
9.2. Evaluation Part
The align-anything evaluation framework now supports over 30 commonly used benchmarks, covering all common modalities. For more details, please refer to the align-anything-code/README.md file of our supplementary materials.
10. Training Details
This section introduces the implementation details of learning from language feedback (LLF), hyper-parameter settings, case studies, and the computational devices involved in the experiments.
10.1. Implementation Details
LLF comprises two primary steps: feedback modeling and self-improving. The first step employs maximum likelihood to enable the model to learn from align-anything-200k how to generate language feedback for a given prompt and response. This includes evaluating the response (critique part) and providing suggestions for improvement (refinement part). During the self-improving phase, when optimizing the response, we incorporate the refinement along the specific dimensions into the prompt, thereby creating preference pairs with the original response.
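The two steps above can be sketched end-to-end. In this sketch, `policy` and `feedback_model` are hypothetical callables standing in for the initial model and the trained feedback model, and the refinement prompt template is illustrative, not the paper's actual template:

```python
def synthesize_preference_pair(prompt, policy, feedback_model):
    """LLF self-improving step: critique-and-refine the policy's own
    response, then pair the refined response (chosen) against the
    original one (rejected) for downstream DPO/RLHF training."""
    original = policy(prompt)
    # The feedback model returns both parts of the language feedback.
    feedback = feedback_model(prompt, original)  # {'critique': ..., 'refinement': ...}
    refined_prompt = f"{prompt}\n[Refinement]: {feedback['refinement']}"
    improved = policy(refined_prompt)
    return {"prompt": prompt, "chosen": improved, "rejected": original}

# Toy stand-ins showing the data flow only.
policy = lambda p: "more detailed answer" if "[Refinement]" in p else "short answer"
feedback_model = lambda p, r: {"critique": "too brief", "refinement": "add detail"}
pair = synthesize_preference_pair("Describe the image.", policy, feedback_model)
```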

Figure 10. Examples of LLF synthesized preference pairs. The responses from current models are often not perfect. Using language feedback to enhance prompts can improve responses in certain dimensions, synthesizing more learnable preference pairs.

Figure 11. Further experiment results of LLF on the TI2T, TV2T, TA2T, T2I, and TI2TI modalities. We compare LLF with common open-source [22–24, 28, 62, 64, 68, 78, 89, 94, 96, 98, 102, 108, 129] and closed-source models [7, 72, 80, 88].
In the RLHF phase, we synthesized a preference dataset of the same size as the baseline data (the detailed size for each modality can be found in Fig. 12). We will open-source this preference dataset synthesized by LLF. Fig. 10 provides a simple example.
10.2. Further Comparison
This section will compare LLF with other open-source and closed-source models. These comparison results reveal two surprising phenomena. For modalities with initially weaker models, such as TI2T, LLF can significantly enhance their performance. As shown in Fig. 11, on the MIA-Bench [84], although the initial model LLaVA-1.5-13B [62] is not outstanding, after fine-tuning with LLF-based PPO, it surpasses open-source models with much larger parameter sizes (e.g., LLaVA-NeXT-110B [64]) and even some closed-source models (e.g., Claude-3-Opus [7]). Meanwhile, for modalities with initially stronger models, like TA2T, LLF can still further improve their performance.
10.3. Models
The LLF pipeline requires multimodal models with language understanding capabilities. Considering popularity, performance, and open-source availability, we select the following models for our experiment.
• LLaVA-1.5-7B & LLaVA-1.5-13B [62] are visual language models whose performance has been validated across 11 benchmarks. Their base language models are Vicuna-7B-v1.5 and Vicuna-13B-v1.5 [82], with CLIP-ViT-L-336px [85] as the visual encoder and an MLP as the multimodal projection layer. The 13B version is trained on approximately 1.2M publicly available data. They excel at extracting effective information from inputs that combine language and images, and provide answers by integrating the world knowledge of the language model.
• Qwen2-Audio-7B-Instruct [24] is a large-scale audio-language model that can process diverse audio signal inputs to perform audio analysis or generate direct textual responses to speech instructions. Qwen2-Audio-7B-Instruct uses Whisper-large-v3 [86] as the audio encoder and Qwen-7B [9] as the language model. It was pretrained on over 520k audio QA data and completed SFT and DPO alignment, achieving outstanding performance on 9 benchmarks.
• Chameleon-7B [98] is an early-fusion, token-based multimodal model designed for unified image and text understanding and generation. Chameleon-7B employs a newly trained image tokenizer, based on [37], to encode $512\times512$ images into 1024 discrete tokens. The model was pre-trained using a Llama2-style language model architecture on text-image interleaved data with this encoder. The original paper introduced two model variants: Chameleon-7B and Chameleon-34B. Due to computational constraints, we focused on evaluating our method using Chameleon-7B. However, the full Chameleon-7B parameters with image generation capabilities were not open-sourced by the authors. To address this limitation, we used the Align-Anything framework to fine-tune the original Chameleon-7B model on the LAION-Art dataset [91]. This resulted in the AA-Chameleon-7B-base model, which integrates both image generation and understanding capabilities. For simplicity, we refer to this model as Chameleon-7B throughout our paper.
• Qwen2-VL-7B-Instruct [103] is a large multimodal model that employs a unified paradigm for processing both images and videos, representing a significant advancement in video understanding capabilities. Through its innovative approach to visual processing, the model can effectively analyze video content using the same architecture that handles static images, creating a seamless integration of temporal and spatial information for comprehensive video comprehension. Qwen2-VL-7B-Instruct uses a Vision Transformer [30] as the vision encoder and Qwen2-7B [114] as the language model. Trained on 1.4 trillion tokens and fine-tuned through SFT, the model demonstrates exceptional video comprehension capabilities.
10.4. Benchmarks
• LLaVA-Bench (COCO) [65] is a benchmark designed to evaluate the alignment behavior and capabilities of models with consistent visual inputs. It utilizes a subset of the COCO-Val-2014 dataset, selecting 30 images and generating three distinct types of questions for each image: conversation, detailed description, and complex reasoning. This results in a total of 90 questions, crafted using a specialized data generation pipeline. The benchmark aims to assess how effectively models can follow user instructions and handle diverse question types.
• MIA-Bench [84] is a pioneering benchmark designed to assess the capabilities of VLMs in adhering to complex, layered instructions. This benchmark features a diverse collection of 400 meticulously crafted image-prompt pairs, each intended to rigorously test the models' ability to generate responses that precisely follow intricate directives. Through comprehensive evaluations of various state-of-the-art VLMs, MIA-Bench uncovers significant disparities in performance, thereby identifying key areas for enhancement in instruction fidelity.
• Image Reward [113] is a general-purpose T2I human preference reward benchmark. Trained on 137k expert-annotated comparisons, the Image Reward model utilizes a systematic annotation pipeline emphasizing alignment, fidelity, and harmlessness in generated images. It significantly outperforms prior metrics such as CLIP and FID in capturing human preferences, achieving higher consistency with human rankings. The benchmark demonstrates robustness across various datasets, establishing itself as a promising metric for evaluating T2I generative models.
• Human Preference Score v2 (HPS v2) [109] is a benchmark designed to evaluate the alignment of T2I models with human preferences. It is built upon Human Preference Dataset v2 (HPD v2), a comprehensive dataset comprising 798k human-annotated pairwise comparisons across nine generative models and real images from the COCO dataset. HPS v2 trains a fine-tuned CLIP-based preference prediction model, demonstrating superior generalization and sensitivity to algorithmic improvements compared to prior metrics such as Inception Score and Fréchet Inception Distance (FID).
• Interleaved Bench [67] is a holistic benchmark designed for evaluating the interleaved generation of text and images. It encompasses diverse real-world use cases, including storytelling, marketing, and multimodal script generation, with 815 instances distributed across 10 categories. Unlike prior benchmarks, Interleaved Bench emphasizes arbitrary interleaving of text and images within both inputs and outputs. Its evaluation metric, Interleave dEval, powered by GPT-4, assesses five dimensions: text quality, perceptual quality, image coherence, text-image coherence, and helpfulness.
• Video-ChatGPT [72] contains a sophisticated framework for evaluating the text generation capabilities of video-based conversational models, leveraging the ActivityNet-200 [12] dataset's rich collection of videos with dense descriptive captions. This comprehensive evaluation system employs GPT-3.5 to assess model outputs on a 1-5 scale across five fundamental dimensions: Correctness of Information (verifying factual accuracy and alignment with video content), Detail Orientation (examining response completeness and specificity), Contextual Understanding (assessing comprehension of overall video context), Temporal Understanding (evaluating grasp of event sequences), and Consistency (measuring reliability across similar queries).
• AIR-Bench [116] is designed to assess the audio-centric interaction capabilities of Large Audio-Language Models (LALMs). Unlike previous benchmarks that primarily focus on fundamental tasks like automatic speech recognition, AIR-Bench offers a comprehensive evaluation of LALMs' ability to understand and interact with various audio signals, including human speech, natural sounds, and music, in a textual format. It comprises two main components: the foundation benchmark, which includes 19 tasks with approximately 19,000 single-choice questions to test basic single-task abilities, and the chat benchmark, featuring 2,000 instances of open-ended QA data to evaluate complex audio comprehension and instruction-following capabilities, utilizing advanced language models such as GPT-4 for scoring. Since Qwen2-Audio-Instruct is a chat model, our experiment results come from its open-ended QA part.
10.5. Hyper-Parameters
The hyper-parameters involved in our experiment are as follows. Their selection was based on community implementations [95, 130] and our experimental observations. They may not be optimal, but we have confirmed that they do not affect the correctness of the training. These hyperparameters vary across different modalities, but we ensure consistency between language feedback and binary feedback scenarios.
Table 5. Hyper-parameters for feedback modeling.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 3 | 3 | 3 | 3 | 3 |
| BatchSizePerDevice | 4 | 4 | 4 | 3 | 4 |
| Learning Rate | 1.e-6 | 1.e-6 | 1.e-6 | 1.e-7 | 1.e-6 |
| SchedulerType | cosine | cosine | cosine | cosine | cosine |
| WarmupRatio | 0.03 | 0.03 | 0.03 | 0.1 | 0.03 |
| GradientAccumulation | 1 | 2 | 2 | 1 | 1 |
| WeightDecay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MaxTokenLength | 4096 | 4096 | 4096 | 4096 | 4096 |
| BFloat16 | True | True | True | True | True |
Table 6. Hyper-parameters for reward modeling.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 3 | 3 | 3 | 3 | 3 |
| BatchSizePerDevice | 4 | 4 | 4 | 1 | 4 |
| Learning Rate | 1.e-6 | 5.e-6 | 5.e-7 | 1.e-6 | 1.e-6 |
| SchedulerType | cosine | cosine | cosine | cosine | cosine |
| WarmupRatio | 0.03 | 0.03 | 0.03 | 0.01 | 0.03 |
| GradientAccumulation | 1 | 2 | 2 | 1 | 1 |
| WeightDecay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MaxTokenLength | 2048 | 4096 | 4096 | 4096 | 2048 |
| BFloat16 | True | True | True | True | True |
Table 7. Hyper-parameters for DPO.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 2 | 2 | 2 | 2 | 2 |
| BatchSizePerDevice | 4 | 4 | 2 | 3 | 4 |
| Learning Rate | 1.e-6 | 5.e-7 | 5.e-7 | 1.e-7 | 1.e-6 |
| SchedulerType | cosine | cosine | cosine | cosine | cosine |
| WarmupRatio | 0.03 | 0.03 | 0.03 | 0.1 | 0.03 |
| GradientAccumulation | 1 | 2 | 2 | 1 | 1 |
| WeightDecay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MaxTokenLength | 2048 | 4096 | 4096 | 4096 | 2048 |
| BFloat16 | True | True | True | True | True |
| ScaleCoefficient | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 |
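The ScaleCoefficient row of Table 7 corresponds to the β temperature in the standard DPO objective. Below is a minimal sketch of the per-pair loss, assuming sequence-level log-probabilities under the policy and the frozen reference model are already computed; the function and argument names are illustrative, not taken from the training code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.10):
    # Implicit reward margin between the chosen and rejected responses,
    # scaled by beta (the ScaleCoefficient of 0.10 in Table 7).
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss shrinks as the policy assigns relatively more probability to the chosen response.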
Table 8. Hyper-parameters for PPO.
| Hyper-parameter | TI2T | T2I | TI2TI | TV2T | TA2T |
|---|---|---|---|---|---|
| Epochs | 3 | 3 | 3 | 3 | 3 |
| BatchSizePerDevice | 4 | 4 | 4 | 1 | 4 |
| ActorLearningRate | 1.e-5 | 1.e-5 | 1.e-5 | 5.e-8 | 1.e-6 |
| ActorSchedulerType | cosine | cosine | cosine | cosine | cosine |
| ActorWarmupRatio | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
| CriticLearningRate | 5.e-6 | 5.e-6 | 5.e-6 | 5.e-8 | 5.e-6 |
| CriticSchedulerType | constant | constant | constant | constant | constant |
| CriticWarmupRatio | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
| GradientAccumulation | 1 | 4 | 4 | 1 | 1 |
| WeightDecay | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 |
| MaxTokenLength | 2048 | 4096 | 4096 | 2048 | 2048 |
| BFloat16 | True | True | True | True | True |
| SamplingTemperature | 1.0 | 0.2 | 0.2 | 1.0 | 1.0 |
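The cosine SchedulerType and WarmupRatio entries appearing throughout Tables 5-8 describe a linear-warmup-then-cosine-decay learning-rate schedule. The sketch below is an illustrative reimplementation of that schedule, not the exact training code (the constant schedule used for the PPO critic would simply return the base rate after warmup).

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    # Linear warmup over the first `warmup_ratio` fraction of training.
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with a base rate of 1.e-6 and a 0.03 warmup ratio over 1,000 steps, the rate ramps up linearly for the first 30 steps, peaks at 1.e-6, and decays to zero by the final step.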
11. Dataset Card
11.1. Dataset Overview
As the number of modalities increases, current all-modality models encounter significant challenges in effectively following instructions. These challenges include difficulties in comprehending multimodal instructions and generating outputs that align with the intended directives [73, 118, 120]. While RLHF has demonstrated its effectiveness in addressing such issues for specific modalities, such as text and images [120], its applicability to all-modality scenarios remains uncertain.
Furthermore, existing preference datasets predominantly focus on single-modal tasks, lacking the comprehensive information required to capture the intricacies of all-modality features. In response, we introduce align-anything-200k, the first all-modality human preference dataset designed to enhance the instruction-following capabilities of all-modality models. This dataset encompasses eight subtasks across text, image, audio, and video modalities (see Fig. 12). For examples, please refer to the dataset/dataset examples included in the supplementary materials.
To ensure consistent modality preference modeling, we categorize targets into modality-agnostic and modality-specific types, which act as evaluation metrics for instruction-following (see Sec. 11.2). A comprehensive dataset collection and annotation pipeline is outlined in Sec. 11.3, while Sec. 11.4 provides a detailed analysis of dataset features, including prompt semantics, preference features, etc. Additionally, examples from the align-anything-200k dataset are presented in Sec. 11.5.
11.2. Instruction-Following Dimensions
In the context of text modality, instruction-following refers to the ability of LLMs to effectively execute human-provided instructions, such as answering questions or summarizing text. This capability enables them to function as helpful, harmless, and honest assistants [81, 99]. However,
| Task | Dataset | Model |
|---|---|---|
| T2T | UltraFeedback [26], PKU-SafeRLHF [48], Evol-Instruct [112] | UltraLM [29], WizardLM [111], Alpaca-7B, Alpaca2-7B, Alpaca3-8B, LLaMA2-7B/13B/70B-Chat [99] |
| TI2T | RLHF-V [120], ShareGPT4V [15], ART500K [74], MovieNet [45], LLaVA-Instruct-150K [62] | Llama 3.2-Vision-11B [31], LLaVA-v1.5-7B [62], GPT-4o [80], LLaVA-v1.6-Vicuna-13B [63] |
| T2I | HPDv2 [109], Pick-a-Pic-v2 [52], MS COCO [20], DiffusionDB [106] | StableDiffusionXL [83], StableDiffusion v2-1 [89], FLUX.1-schnell, Chameleon-7B [98] |
| TI2TI | RLHF-V [120], LLaVA-Instruct-150K [62], BDD100K [119], Ego4D [40] | Chameleon-7B [98], Gemini 1.5 Pro [88] |
| TV2T | NExTQA [110], Panda-70M [19], ShareGPT4Video [16], VidProm [104] | Qwen2-VL-7B [103], CogVideoX-5B [117] |
| T2V | mixkit [77], ShareGPT4Video [16], Wavcaps [75], GigaSpeech [13] | CogVideo [43], Open-Sora [131], Pika [8], Gemini 1.5 Pro [88] |
| TA2T | OpenAQA [39], MusicCaps [2], Wavcaps [75] | Qwen2-Audio-7B [24], AudioLDM 2-large [66] |
| T2A | Audio-Alpaca [73], AudioCaps [51] | Tango 2 [73], Stable Audio Open 1.0 [33] |
Table 9. Related datasets and models for each subtask. We extensively collect datasets of various modalities and enhance the prompt sources for these datasets, resulting in the final prompt sources of align-anything-200k. We also widely gather generation results from both open-source and closed-source models. Llama2-7B and Llama3-8B are fine-tuned with Alpaca-52K [97], resulting in Alpaca2-7B and Alpaca3-8B.
as the number of modalities increases, establishing a unified metric for instruction-following across all modalities becomes increasingly challenging. While the goal is for an all-modality model to effectively perform instructions across various modalities, each specific modality (e.g., video) may have unique requirements, such as ensuring temporal consistency in video outputs.
To address the diversity inherent in the concept of instruction-following within an all-modality alignment setting, we decompose instruction-following into modality-agnostic and modality-specific dimensions. The modality-agnostic dimensions are universally applicable across all modalities, while the modality-specific dimensions represent preference criteria tailored to the unique characteristics of each modality. For the annotation prompts of each subtask, please refer to dataset/annotation prompts included in the supplementary materials.
11.2.1 Modality-Agnostic Dimensions
In this section, we provide a detailed overview of modality-agnostic dimensions for evaluating instruction-following in all-modality scenarios.
Prompt adherence Prompt adherence refers to the extent to which responses align with the given input prompts, accurately reflecting the specified elements, themes, or instructions. This ensures that the output remains faithful to the user’s intent and maintains relevance to the multimodal content of prompts.

Figure 12. Composition of the align-anything-200k preference dataset. The composition of enhanced prompts encompasses text, image, video, and audio modalities, aiming to improve the instruction-following capabilities of foundational all-modality models. In addition, we visualize the overall text prompt distribution across all tasks, demonstrating that the data covers different semantic embedding spaces.
Rule conformity Rule conformity refers to the compliance of responses with logical, physical, biological, or scientific principles applicable to the scenario or theme described in the prompt. This metric evaluates the response's consistency with established rules and modality-specific constraints, ensuring realism and plausibility.
Information richness Information richness evaluates the depth and detail provided in responses, emphasizing the thoroughness and comprehensiveness with which the problem or query is addressed. High-quality responses should offer nuanced insights, detailed explanations, and well-rounded coverage of the topic.
11.2.2 Modality-Specific Dimensions
In the following part, we provide a detailed overview of the instruction-following standards designed based on the unique characteristics of different modalities, along with the subtasks to which they apply.
Clarity Clarity refers to the degree to which responses are clear, comprehensible, and easy to understand. It emphasizes overall coherence, language quality, and grammatical correctness, ensuring that the output effectively communicates its intended meaning without ambiguity. This metric applies to subtasks such as T2T, TI2T, TI2TI, TA2T, and TV2T, where textual response quality plays a critical role in achieving successful multimodal communication.
Aesthetics Aesthetics refers to the evaluation of the visual and auditory appeal of multimodal outputs, including images, videos, and audio. It encompasses the sensory and emotional impact these outputs deliver to the audience.
For image-related subtasks (e.g., T2I, TI2TI), aesthetics involves analyzing aspects such as lighting, color harmony, richness of detail, creativity, and the overall emotional resonance or enjoyment the image evokes. In audio subtasks (e.g., T2A), the focus lies on sound quality, emphasizing clarity, smoothness, and the absence of disruptive elements like noise or distortion. For video subtasks (e.g., T2V), aesthetics includes the evaluation of visual appeal and aesthetic quality, taking into account factors like lighting, color harmony, the richness of detail, and the emotional or immersive experience the video provides.
Cross-modal Consistency Cross-modal consistency refers to the degree of alignment and harmony between all output modalities. For the TI2TI subtask, this involves ensuring that the text and image are consistent in terms of content, style, and message, so the modalities complement each other and produce a coherent, unified output.
Audio Consistency Audio consistency evaluates the coherence of an audio output’s acoustic qualities, focusing on smooth transitions along the time axis, natural flow, and the absence of abrupt changes. This metric ensures that the audio maintains a steady tone and rhythm, enhancing the listening experience which is applied in the T2A subtask.
Temporal Consistency Temporal consistency assesses the smoothness of transitions between video frames, ensuring a natural and fluid progression without sudden jumps or erratic changes in motion. This metric is critical for maintaining the visual flow and is applied in the T2V subtask.
Content Coherence Content coherence measures the semantic and narrative alignment within a video, ensuring that all elements logically work together to deliver a clear and cohesive message. This metric prevents disjointed or illogical segments and is used in the T2V subtask.
Motion Naturalness Motion naturalness evaluates the realism and fluidity of object and character movements in a video. It ensures that movements adhere to realistic physical laws, resulting in actions that appear natural, smooth, and believable. This metric is applied in the T2V subtask.
11.3. Details of Dataset Construction
11.3.1 Annotation Pipeline Overview
In this subsection, we detail the annotation pipeline of the dataset. For text and image modalities, we rely on GPT-4o [80] to support the human-AI joint annotation process, while for video and audio modalities, we utilize Gemini-1.5-Pro [88], as these models currently provide the highest-quality multimodal annotations.
Data Pairs Collection To comprehensively cover all-modality tasks and diverse prompt distributions, in constructing align-anything-200k we first collect open-source multimodal datasets, followed by a refined secondary screening of the corresponding text prompts, videos, and audio. Based on this carefully curated multimodal content, we then utilize advanced multimodal models [24, 80, 88, 103] to enhance prompts to fit all-modality alignment requirements, ensuring a strong focus on instruction-following capabilities across modalities. We then gather the responses from open-source and API-based models. The related datasets and models are listed in Tab. 9.
Fine-grained Preference Annotation We apply a detailed preference annotation process to each question-answer pair, combining insights from both GPT-4o and human annotators. The approach ensures a thorough assessment across modality-agnostic and modality-specific dimensions, with each dimension scored from 0 to 3 based on strict criteria. Annotators also provide justifications and rationales after assigning scores, significantly enhancing the consistency and reliability of the annotation process.
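Under this scheme, each response carries a 0-3 score on every annotated dimension, from which a binary preference label can be derived by comparison. A simplified illustration of that aggregation follows; the released data also records per-dimension rationales, and tie handling here is a simplifying assumption rather than the paper's procedure.

```python
def pairwise_preference(scores_a, scores_b):
    # scores_a / scores_b: dicts mapping dimension name -> score in [0, 3],
    # covering both modality-agnostic and modality-specific dimensions.
    total_a, total_b = sum(scores_a.values()), sum(scores_b.values())
    if total_a == total_b:
        return "tie"  # ties would need human adjudication in practice
    return "A" if total_a > total_b else "B"
```

For instance, a response scoring higher on prompt adherence with equal clarity would be labeled the preferred one.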
Language Feedback Annotation To refine response quality further, we implement a comprehensive language feedback annotation process. The process involves defining the critique scope for each response, conducting a structured critique, and offering targeted refinement suggestions. Both human annotators and AI models contribute to this feedback, which is organized into cohesive guidance that addresses specific improvement areas considering different modalities. This feedback process captures fine-grained, modality-related preferences and effectively serves as a rich source of natural human preference across all modalities.

Figure 13. Human-AI preference consistency analysis of modality-agnostic dimensions and overall preference.
11.3.2 Annotation Agreement Analysis
We evaluate the consistency between AI and human preferences across two dimensions: modality-agnostic and modality-specific. For each subtask, 100 data pairs are randomly selected, with 10 annotators performing pair-wise comparisons. We assess preference consistency for both modality-agnostic preferences and overall preferences. Additionally, we examine the consistency of modality-specific preferences for each subtask, observing that AI can partially replace human judgment and perform effectively in certain modality-specific tasks. However, our findings reveal that both consistency and accuracy between AI and human preferences decline in multimodal scenarios, suggesting that incorporating multimodal information heightens the complexity of preference judgment. In Fig. 13, we illustrate the modality-agnostic dimension consistency across 8 subtasks, as well as the overall consistency considering both modality-agnostic and modality-specific dimensions.
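Concretely, the per-subtask consistency can be computed as a match rate between the AI annotator's pairwise choice and the human majority vote over the 10 annotators. This is a hypothetical minimal formulation for illustration; the paper does not specify its exact aggregation.

```python
from collections import Counter

def majority_vote(votes):
    # Collapse the 10 annotators' pairwise choices ("A" or "B")
    # into a single human preference label.
    return Counter(votes).most_common(1)[0][0]

def agreement_rate(ai_prefs, human_vote_lists):
    # Fraction of data pairs where the AI picks the human-majority winner.
    human_prefs = [majority_vote(v) for v in human_vote_lists]
    matches = sum(a == h for a, h in zip(ai_prefs, human_prefs))
    return matches / len(ai_prefs)
```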
11.4. Statistical Properties of Preference Dataset
We analyze the composition of the dataset, including the tasks associated with each modality and the categories of prompts. Based on the characteristics of each modality, we enhance the prompts along instruction-following dimensions. These enhancements cover a variety of types, such as question-answering, multimodal generation, emotional expression, modality selection, and complex reasoning, leveraging advanced multimodal models [24, 80, 88, 103]. For more details, please refer to Fig. 12.
11.5. Examples of Align-anything-200k
In Fig. 14, we showcase the training data across 8 subtasks. For each task, we provide detailed fine-grained preferences and evaluation rationales. Furthermore, we include language feedback annotations for each response. Each response is accompanied by specific critique and refinement feedback, focusing on modality information and directions for improvement. Additionally, the total language feedback further specifies how each response should be revised. For more examples of our dataset, please refer to the dataset folder in the supplementary materials.
12. Evaluation
12.1. All-Modality Understanding
12.1.1 Dataset Composition
The AMU dataset is partially derived from the test set of VGGSound [14], supplemented with additional data collected from the internet and generated using generative models. The data is categorized into perception, reasoning, instruction following, and safety, with each category encompassing various types of content. For perception, reasoning, and instruction following, image+audio and video+audio pairs are sourced from VGGSound, while other materials, including manually curated multimedia such as videos, audio, and images, are gathered from the internet. For the safety category, the dataset integrates both internet-sourced content and data generated using generative models.
12.1.2 Annotation Details
To collect human responses for evaluating the AMU task, we instruct crowd workers to provide detailed answers and annotations for each test sample. As shown in Fig. 15, each test instance in our dataset consists of one piece of visual data (image or video), one piece of auditory data (audio or speech), and one text question. For video, audio, and speech content, we maintain a consistent duration of approximately 10 seconds. Each test instance is annotated by 10 human annotators who are instructed to provide detailed responses according to the following requirements:
For data in the instruction following category, annotators are required to follow the instructions strictly when providing responses. In contrast, for data in the safety category, annotators are encouraged to make their own judgments based on their personal values regarding whether the content in the given scenario is appropriate to respond to. If the content is deemed sensitive and unsuitable for a response, annotators should politely decline the request to answer.
12.1.3 Evaluation Details
The default evaluation system prompt consists of two sections. The first section contains instructions for GPT-4, outlining the content it should consider during evaluation and specifying the expected output format. The last section includes reference answers and keywords for each question.
To assess response scores, we guide GPT-4 to evaluate from multiple perspectives, using keywords to determine if it correctly interprets multimodal information. Additionally, multiple manually annotated reference answers are provided to evaluate the alignment between the model’s responses and human response distributions.
Due to the unique nature of the safety and instruction-following components, we design additional system prompts to evaluate response safety and instruction following. For safety evaluation, GPT-4 checks whether the model should refuse to answer based on the distribution of human references. For instruction following, GPT-4 examines whether the response follows the special instructions in the test cases by checking whether the reference answers have a special output format. All instruction-following related guidelines and evaluation methods are referenced from IFEval [133] and FollowBench [49], respectively. The prompts for the AMU task are shown in the evaluation/prompt directory included in the supplementary materials.
12.2. All-Modality Generation
Due to the limitations of current all-modality generation models, we employ LLMs as agents to perform multimodal tool calls during experiments, allowing models to generate outputs in non-textual modalities. The prompt for LLM agents is shown in the evaluation/prompt directory included in the supplementary materials.
12.2.1 Instruction Following
To evaluate the instruction-following capability of multimodal models in different modality generation tasks, we design a pipeline for evaluating instruction-following across various modalities. For the four modalities (text, image, video, and audio), we consider various tasks and dimensions of instruction-following and meticulously design 100 instruction-following prompts for each modality. During the evaluation phase, since different modalities require different evaluation models, we use GPT-4 [1] and GPT-4o [80] for the text and image modalities to directly assess the instruction-following level of the generated content with a scalar score from 0 to 10. For the audio and video modalities, inspired by the TIFA [44] pipeline, we design multiple-choice questions closely related to the instruction to capture the information in the prompts in more detail. We use Qwen2-Audio [24] and Qwen2-VL [103] as evaluation models to assess the generated audio and video, scoring based on the accuracy of the multiple-choice answers, with scores ranging from 0 to 10. The system prompts involved in the entire scoring process can be found in the evaluation/prompt directory included in the supplementary materials.
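For the audio and video tracks, the multiple-choice accuracy is mapped onto the same 0-10 scale used by the text and image graders. A sketch of that conversion is below; the helper name is ours, not from the evaluation code.

```python
def mc_score(answers, references):
    # Fraction of multiple-choice questions answered correctly,
    # rescaled to the paper's 0-10 scoring range.
    correct = sum(a == r for a, r in zip(answers, references))
    return 10.0 * correct / len(references)
```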

Figure 14. Examples from the align-anything-200k dataset. We present the training data across 8 subtasks, showcasing fine-grained preferences, evaluation rationales, and detailed language feedback (critique and refinement) for each response. The feedback highlights modality-specific suggestions and directions for improvement.
12.2.2 Modality Selection
To assess the model's flexibility in choosing different modality combinations based on textual instructions, we engage human crowd workers to annotate the expected output modality for each of the 100 prompts. Specifically, each instruction is annotated through multiple-choice selections by 25 human annotators, with options covering modality combinations such as text, image, and audio, or text, video, and audio. This process produces a distribution of votes for each instruction, reflecting preferences for the expected output modalities. Modality combinations that closely align with human preferences earn additional points in the modality selection metric. The human annotation criteria are as follows:
• Information Richness: Annotators should specify the desired modality of the model output for a given textual instruction to maximize information gain in the response.

Figure 15. Example of AMU task. The top left corner shows a sample of test data, which includes a piece of visual information, a piece of auditory information, and a question. The right side presents an example of annotations provided by annotators. For an open-ended question, each annotator’s response may differ. However, as long as the answer correctly addresses the multimodal information, it will be accepted as a reference answer. The bottom left corner displays the responses of various models to the test data. The more modalities a model can recognize, the more details its reply can incorporate, resulting in higher quality and, consequently, a higher score.
• Necessity & Conciseness: When choosing modality combinations, necessity and conciseness should guide the selection. If a modality adds no significant information, it should be excluded based on the principle of conciseness.
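One way to turn the 25-way vote into a score is to credit the model with the vote share of its chosen modality combination, treating combinations as unordered sets. This is an illustrative formulation only; the paper states that closely aligned combinations earn additional points without fixing an exact formula.

```python
from collections import Counter

def modality_selection_score(model_choice, human_votes):
    # model_choice: an iterable of modality names, e.g. ("text", "image").
    # human_votes: one combination per annotator (25 in the paper).
    # Combinations are order-insensitive, so normalize to frozensets.
    votes = Counter(frozenset(v) for v in human_votes)
    return votes[frozenset(model_choice)] / len(human_votes)
```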
Modality synergy is a complex attribute, and currently, no simple or objective method exists to uniformly evaluate information across different modalities. While human evaluation or the use of AI under strictly defined annotation guidelines are feasible solutions, they introduce additional costs and complexities. Therefore, we meticulously construct training data for the Modality Synergy task and train a judge model. The judge model can evaluate the synergy metrics of multi-modal information generated according to the task instructions.
• Data Collection Due to the current immaturity of end-to-end multimodal generation models, generating high-quality training data remains a challenge. To address this, we develop a data generation system centered on an LLM agent. Upon receiving user instructions, the agent calls image, audio, or video generation models through tool usage, producing the corresponding modality information and responding to the user's instructions in conjunction with text.
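The tool-calling loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the keyword router stands in for the LLM agent's tool-selection step, and the lambdas stand in for real generation models (e.g., Stable-Diffusion-v1-5, AudioLDM, CogVideoX-2B).

```python
# Hypothetical sketch of the LLM-agent data-generation loop.
# Routing heuristic and tool stubs are illustrative assumptions.

def route_modalities(instruction: str) -> list[str]:
    """Naive keyword router standing in for the LLM's tool-selection step."""
    keywords = {
        "image": ["draw", "picture", "image", "photo"],
        "audio": ["sound", "audio", "music", "speech"],
        "video": ["video", "clip", "animation"],
    }
    return [m for m, kws in keywords.items()
            if any(k in instruction.lower() for k in kws)]

def generate_response(instruction: str, tools: dict) -> dict:
    """Call one generator per selected modality and pair outputs with text."""
    outputs = {"text": f"Response to: {instruction}"}
    for modality in route_modalities(instruction):
        outputs[modality] = tools[modality](instruction)
    return outputs

tools = {
    "image": lambda p: f"<image generated for '{p}'>",  # e.g. a diffusion model
    "audio": lambda p: f"<audio generated for '{p}'>",  # e.g. an audio LDM
    "video": lambda p: f"<video generated for '{p}'>",  # e.g. a video model
}

result = generate_response("Draw a picture of a temple with ambient sound", tools)
```

The agent thus emits only the modalities the instruction calls for, which is what the modality-selection metric later evaluates.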
• Data Annotation As there is currently no fully automated annotation tool for all-modality information, we develop a comprehensive human annotation pipeline specifically for the modality synergy task. Human annotators receive data pairs consisting of one instruction and outputs from two models, where the model outputs involve any two modalities among text, image, video, and audio. Annotators evaluate the two responses based on relevance and consistency, assigning preference annotations in line with the annotation guidelines. For detailed annotation guidelines, please refer to Sec. 12.2.4.
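A single annotation record from this pipeline might be modeled as below; the field names and score range are assumptions for illustration, not the released dataset's schema.

```python
from dataclasses import dataclass

# Illustrative record for one modality-synergy preference annotation.
# Field names are hypothetical, not the actual dataset schema.
@dataclass
class SynergyAnnotation:
    instruction: str
    modalities: tuple   # any two of ("text", "image", "video", "audio")
    relevance_a: int    # relevance/consistency score for response A
    relevance_b: int    # relevance/consistency score for response B
    preferred: str      # "A" or "B", per the annotation guidelines

    def preference_pair(self):
        """Return (chosen, rejected) labels for reward-model training."""
        return ("A", "B") if self.preferred == "A" else ("B", "A")

ann = SynergyAnnotation("Describe a beach with waves", ("text", "audio"), 4, 2, "A")
```

Records in this chosen/rejected form feed directly into the judge-model (RM) training described next.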
• RM Training Following reward model training in the text-only modality, the base model of our judge model is Llama3-8B, equipped with a score head. ImageBind [38] and independent projectors for each modality enable the model to achieve multimodal understanding. The training process has two stages: first, the parameters of the LLM and ImageBind are frozen, and only the projectors are trained; second, both the LLM and projectors are updated simultaneously.
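The two-stage schedule can be sketched as a freeze/unfreeze policy over the named components. This is a minimal sketch using plain dicts in place of real modules; the treatment of the score head in both stages is an assumption not stated in the text.

```python
# Sketch of the two-stage judge-model training schedule.
# Component names follow the text; real training would toggle
# requires_grad on module parameters instead of returning flags.

def set_trainable(model: dict, stage: int) -> dict:
    """Stage 1: only projectors train; stage 2: LLM + projectors train.
    ImageBind stays frozen throughout, per the described recipe."""
    trainable = {}
    for name in model:
        if name.startswith("projector"):
            trainable[name] = True
        elif name == "imagebind":
            trainable[name] = False
        elif name == "llm":
            trainable[name] = (stage == 2)
        else:  # e.g. the score head trains in both stages (assumption)
            trainable[name] = True
    return trainable

judge = {"llm": ..., "imagebind": ..., "projector_image": ...,
         "projector_audio": ..., "score_head": ...}
stage1 = set_trainable(judge, stage=1)
stage2 = set_trainable(judge, stage=2)
```

Freezing the backbone first lets the projectors learn to map ImageBind embeddings into the LLM's representation space before full fine-tuning.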
12.2.3 More Experiment Results
To evaluate agent performance across various multimodal generative models, we conduct supplementary experiments. Specifically, the LLM-based agent invokes Stable-Diffusion-v1-5 [89], AudioLDM-Small-full-v2 [61], and CogVideoX-2B [117] for image, audio, and video generation, respectively. The results of these experiments are presented in Tab. 10 and Tab. 11.
The two tables align with the conclusions in Sec. 4.2, reinforcing that current models remain limited and that no single model exhibits a clear advantage in the AMG task.
| Initial Model | All-Modality Understanding | All-Modality Generation | Overall |
|---|---|---|---|
| Category | AMU | Modality | |
| Perception | Reasoning | IF | Safety |
| LLaVA-v1.5-7B† | 2.66 | 2.67 | 2.50 |
| Qwen2-VL† | 2.76 | 3.07 | 2.40 |
| Qwen2-Audio† | 3.58 | 4.53 | 3.40 |
| Chameleon-7B† | 1.44 | 2.97 | 2.80 |
| Llama3.1-8B-Instruct† | 1.05 | 1.20 | 1.20 |
| Gemini-1.5-Pro† | 5.36 | 5.67 | 6.70 |
| GPT-4o† | 2.66 | 3.48 | 4.20 |

Table 10. Model performance on the eval-anything benchmark. Model performance on the AMU task is the same as in Table 4. (†) indicates the model is used as an agent to call AudioLDM2-Large [66] and CogVideoX-2B [117] for audio and video generation.
| Initial Model | All-Modality Understanding | All-Modality Generation | |
|---|---|---|---|
| Category | Modality Instruction Following | | |
| Perception | Reasoning | IF | Safety |
| LLaVA-v1.5-7B† | 2.66 | 2.67 | 2.50 |
| Qwen2-VL† | 2.76 | 3.07 | 2.40 |
| Qwen2-Audio† | 3.58 | 4.53 | 3.40 |
| Chameleon-7B† | 1.44 | 2.97 | 2.80 |
| Llama3.1-8B-Instruct† | 1.05 | 1.20 | 1.20 |
| Gemini-1.5-Pro† | 5.36 | 5.67 | 6.70 |
| GPT-4o† | 2.66 | 3.48 | 4.20 |
As for the instruction-following metric, the performance of closed-source LLM-based agents declines significantly as the capability of the tool models decreases. Analysis of specific test cases shows that, for identical test prompts, closed-source models often generate more fine-grained instructions than open-source models. Stronger tool models use these fine-grained instructions effectively to produce outputs that meet the test requirements. In contrast, weaker tool models struggle with overly complex and detailed instructions, leading to reduced output quality and lower instruction-following scores. This issue is especially evident in tasks such as video and audio generation, which rely heavily on temporal sequencing. The two closed-source models often provide instructions refined down to the temporal dimension, but the corresponding video and audio generation models cannot fully adhere to the required temporal sequences, leading to the loss of critical information. Conversely, open-source models, through more general instructions, better guide the tool models to produce outputs that closely follow the test instructions.
12.2.4 Human Annotation for Modality Synergy Task
This guide outlines the process for evaluating multimodal outputs generated by two different models. Your task is to assess the correlation and consistency between the outputs of modality-1 and modality-2 for each model and assign a score to reflect how effectively these outputs complement one another. When scoring, it is important to distinguish between the two models' outputs, avoiding identical scores for both models unless absolutely necessary.
Evaluation Criteria
Scoring Guidelines

Figure 16. Examples of the AMG task. We create two distinct instruction test sets for evaluating the three metrics of the AMG task, dividing the instruction-following section into four modality-based subsets. The rationale is that in multimodal scenarios, evaluating instruction-following requires specific instructions to assess the model's ability to generate content that aligns with the instructions. Conversely, for the modality selection and modality synergy tasks, the instructions should enable open-ended responses to assess the model's autonomous use of modalities. For detailed example analysis, please refer to Sec. 12.2.5.
• Score 3: The outputs display basic correlation and consistency. Both modalities describe the same topic or object but do not offer additional insights or value when combined. The information is redundant, and no significant gain is observed from their interaction.
• Score 4: The outputs demonstrate strong correlation and consistency, addressing the same topic or object. While minor variations in presentation may exist, they do not conflict but instead provide additional perspectives that moderately enhance understanding. The interaction between modalities is effective, though not as impactful as in the 5-point case.
• Score 5: The outputs exhibit an exceptional level of correlation and consistency. Each modality significantly enhances the other, providing unique yet complementary information. Together, they form a cohesive and synergistic combination, offering insights beyond the sum of their individual contributions.
Important Notes
12.2.5 Evaluation Examples
Evaluation examples for the AMG task are shown in Fig. 16. In the Instruction-Following module, we evaluate the model's ability to follow instructions across the text, visual (images and videos), and audio modalities. For the text and image modalities, the generated content aligns well with the requirements in the instructions, resulting in high evaluation scores. However, the generated content fails to match the instructions for the video and audio modalities: for example, the video does not depict the action of a woman standing up, and the audio does not reflect the described sound transition. As a result, the model can only correctly complete part of the multiple-choice questions derived from the original instructions, leading to lower scores.
In the left part of the Modality Selection & Synergy module, the text on the left describes elements such as sunlight, a magnificent temple, and glowing symbols, all of which are effectively mirrored in the corresponding image. In contrast, the image on the right fails to depict key features like the hidden entrance and seaweed forest mentioned in the text. This highlights the superior modality synergy of the left response.
In the right part of the Modality Selection & Synergy module, the left text aligns with its corresponding image, while the right image depicts a scene similar to the left one; however, the lack of visual description in the right text creates a significant disparity in their modality synergy scores. In terms of text and audio synergy, both responses feature nearly identical audio content, but the absence of rain and wave sound descriptions in the left text results in a lower score than the right. Regarding image and audio synergy, the information conveyed by the images and audio is nearly identical in both responses, resulting in comparable scores.
