# Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
Figure 1: Realistic image Quality and Aesthetic (RealQA) dataset (b), including 10 rich fine-grained attributes (a). Based on these attributes, we can (1) reconstruct public datasets in a CoT manner and (2) directly apply the fine-grained attributes and composite scores in real-world applications. We show two kinds of CoT forms on the AVA dataset in (c), and the real-world applications in (d), where Q-Align [44] trained on the AVA dataset [28] assigns an inappropriate rank (scores scaled to 1–10 for fairness), while the model trained on the RealQA dataset gives a correct rank with the rich fine-grained attributes. We show only part of the predicted fine-grained attributes for clearer viewing.
Abstract
The rapid expansion of mobile internet has resulted in a substantial increase in user-generated content (UGC) images, thereby making the thorough assessment of UGC images both urgent and essential. Recently, multimodal large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA). Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) A single score is inadequate to capture the hierarchical human perception. 2) How to use MLLMs to output numerical scores, such as mean opinion scores (MOS), remains an open question. To address these challenges, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA), including 14,715 UGC images, each of which is annotated with 10 fine-grained attributes. These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity) and high level (e.g., composition). Besides, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next token paradigm can achieve SOTA performance. Furthermore, with the help of chain of thought (CoT) [41] combined with the learnt fine-grained attributes, the proposed method outperforms SOTA methods on five public datasets for IQA and IAA with superior interpretability, and shows strong zero-shot generalization for video quality assessment (VQA). The code and dataset will be released.
1. Introduction
The widespread adoption of mobile internet enables users to effortlessly upload images, leading to the large-scale generation of UGC images. Scoring UGC images in a way that closely aligns with human perception is becoming increasingly important in real-world applications.
Recently, numerous advancements [42, 44, 45, 51, 61, 62] have been made in leveraging MLLMs for IQA and IAA, due to their exceptional capabilities in visual and linguistic understanding. Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) A single score (e.g., MOS) is inadequate to capture the hierarchical human perception. Simply learning one score restricts the capability of MLLMs to capture the underlying rationale behind judgments, thereby limiting effective alignment with human perception [24, 37, 59]. We believe that incorporating fine-grained attribute-level content can promote more robust consistency with human perception and better interpretability. 2) How to utilize MLLMs to predict the final score (e.g., MOS) is still an open question. Current approaches typically either employ pre-defined textual labels (e.g., “poor”, “bad”) [42, 44] associated with a limited set of discrete scores, or directly regress a numerical score from hidden layers [16]. Nevertheless, the pre-defined textual labels pose challenges for further refinement: if we want a word between “poor” and “bad”, it is hard to find one accurately in the predefined vocabulary. Meanwhile, scores predicted by the regression method cannot be seamlessly integrated with the next-token prediction paradigm. This motivates us to ask: Is it possible for MLLMs to directly predict the final score numerically?
To address these challenges, we first propose a novel dataset, named the Realistic image Quality and Aesthetic (RealQA) dataset. To align with human perception, as shown in Fig.1 (a), we decompose human perception into three perspectives: low-level attributes, middle-level attributes and high-level attributes, which include 10 fine-grained human perceptual attributes in total. Specifically, the high-level attributes offer a more comprehensive assessment of an image, such as composition and how eye-catching the image is. The middle-level attributes differentiate the layering between foreground and background and emphasize the expression of neatness and integrity. The low-level attributes pertain to the fundamental quality of images, including clarity, exposure, and saturation. The RealQA dataset collects 14,715 images from AutoNavi, taken across various industries, including tourist attractions, restaurants, hotels, leisure and entertainment venues, and other user-active areas. To ensure applicability to real-world UGC image scenarios, we collect feedback from the real application and determine the weights of the various attributes by partial least squares. The weighted attributes are aggregated into a composite score, facilitating the comprehensive evaluation of the fine-grained attributes. Although the composite score does not follow the MOS collection protocol, it relies on the online results to better fit real applications.
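Concretely, the weighting step can be reproduced with off-the-shelf PLS regression. The sketch below is a minimal illustration, assuming an attribute matrix `X` and a feedback vector `y` (both random placeholders here, not the released pipeline):

```python
# Minimal sketch: deriving per-attribute weights with partial least squares.
# X holds the 10 fine-grained attribute values per image and y holds the
# real-application feedback signal; both are random placeholders here.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

X = np.random.rand(14715, 10)
y = np.random.rand(14715)

pls = PLSRegression(n_components=3)  # component count is an assumption
pls.fit(X, y)

weights = pls.coef_.ravel()     # fitted coefficients act as attribute weights
composite_scores = X @ weights  # aggregate into one composite score per image
```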
Figure 2: Top 3 Pearson correlation coefficients between predicted attributes and MOS on the IQA (KonIQ-10k [18]) and IAA (AVA [28]) datasets, respectively. Some attributes already have a high correlation with the MOS for IQA and IAA, which can be utilized to improve the ability of MLLMs to score image quality and aesthetics.
Second, since a numerical score is generally composed of multiple tokens in MLLMs (e.g., for 3.99, the tokens of Qwen2-VL [39] are “3”, “.”, “9” and “9”), it is not a simple one-time token classification problem. Based on the above observation, we conduct a series of in-depth and comprehensive investigations into how to use MLLMs to directly predict scores numerically. Interestingly, we reach several conclusions: a) For simple numerical sorting, even MLLMs of similar sizes vary greatly in their performance. b) When training with the next token paradigm, predicting two extra significant digits can greatly improve model performance compared to directly predicting the integer. c) We propose the Numerical Continuity Metric (NCM) and its variant $\mathrm{NCM^{*}}$, which verify that well-trained MLLMs understand numerical scores as a whole rather than memorizing tokens at different positions. Although previous methods [16, 44] report inferior performance when MLLMs directly predict the final numerical score, we argue in this paper that it is entirely feasible to employ MLLMs for accurate numerical score prediction within the next token prediction paradigm.
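As a quick check of this multi-token structure, one can inspect how a score decomposes under the public Qwen2-VL tokenizer (a sketch assuming the Hugging Face checkpoint; exact tokens may differ by version):

```python
# Inspect how a numerical score is tokenized (assumes network access to the
# public Qwen2-VL checkpoint on Hugging Face).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
print(tok.tokenize("3.99"))  # expected: ['3', '.', '9', '9'], one token per digit
```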
Furthermore, relying on the fine-grained attributes labeled on the RealQA dataset, we can utilize a cold-start strategy to empower MLLMs with the capability to extract these attributes. Then, naturally, we can self-label the attributes of images in public datasets (e.g., KonIQ-10k and AVA). As shown in Fig.2, some of the attributes already have a high correlation with the MOS on the IQA and IAA datasets. Thus, we organize the attributes and the final scores using CoT for the public datasets and retrain the final model to simultaneously predict the attributes and the final score. With the help of CoT, the proposed method outperforms the SOTA methods on five public IQA and IAA datasets. For example, the proposed method surpasses Q-Align with a PLCC higher by +1.8% on the KonIQ-10k dataset and +4.5% on the cross-domain LIVE Challenge [12] dataset. Furthermore, we demonstrate strong generalization on the video quality assessment (VQA) dataset KoNViD [17]: the model trained only with image data achieves a +36.4% improvement in SRCC.
Our core contributions can be summarized as three-fold:
• We introduce a novel UGC assessment dataset, named RealQA, a collection of 14,715 UGC images. Each image is annotated with 10 fine-grained human-perceptual attributes, reflecting common human perception of image quality and aesthetics.
• We conduct a series of in-depth studies of the next token paradigm for directly predicting numerical scores with MLLMs. We find that, for models extensively trained on large-scale data, directly predicting numerical scores with two extra significant digits achieves superior performance.
• By leveraging CoT to integrate fine-grained attributes, the proposed method surpasses SOTA performance on five IQA and IAA benchmarks and demonstrates strong zero-shot generalization for the VQA task.
2. Related Work
2.1. Connection between IQA and IAA
In recent years, IQA has gradually shifted towards deep learning-based full-reference or no-reference methods [4, 6, 9, 32, 38, 44, 50, 55]. By integrating multi-scale feature fusion [57] and decomposing quality attributes into quantifiable dimensions, such as synthetic degradations (e.g., Gaussian noise) [26, 31] and real-world degradations (e.g., sensor noise) [11, 12, 18], these studies have improved performance. IAA has experienced an evolution from hand-crafted features (e.g., color histograms [10]) to data-driven models [33, 47]. Although both IQA and IAA aim to simulate human subjective perception, most existing methods treat them independently, overlooking their complementary nature at the visual perception level. For example, aesthetic features like color balance may significantly affect quality ratings, while quality defects like noise may reduce aesthetic appeal [58, 60]. Recent studies on AI-Generated Content (AIGC) images indicate that integrating IQA and IAA by jointly learning shared representations can improve the ability of models to comprehend image perception [53, 56].
AGIN [5] enhances naturalness attributes (e.g., brightness) and evaluates images based on presence, color, and layout, addressing the limitations of sparse single-task data and noisy annotations. The proposed RealQA dataset integrates IQA and IAA via the fine-grained attributes, creating a more effective and user-centric assessment system.
2.2. Quality and Aesthetic Scoring with MLLMs
Recently, MLLMs [3, 8, 13, 22, 39] have seamlessly integrated visual and linguistic information, exhibiting significant potential, and notable progress has been made in scoring for IQA and IAA with MLLMs [20, 21, 27, 40]. DepictQA [52] uses MLLMs to generate reasoning descriptions for IQA to explain image quality, but does not output a final score. When utilizing MLLMs, how to model numerical scores is an open problem. Q-Bench [42] and Q-Align [44] suggest using discrete text-defined levels. The most representative method is Q-Align [44], which predicts discrete text-defined levels (“excellent”, “good”, “fair”, “poor”, and “bad”) that are more suitable for MLLMs to predict. In implementation, these discrete text-defined levels represent different intervals of MOS. However, although the text-defined levels simulate human language habits, further refinement remains challenging: if we want a word between “poor” and “bad”, it is difficult to find one accurately in the predefined vocabulary. Additionally, VideoScore [16] predicts numerical scores by regression from a linear layer. However, the regression result cannot be naturally output together with the next token prediction paradigm and lacks interpretability. Thus, we conduct a series of in-depth investigations into how to effectively predict numerical scores using MLLMs.
3. Method
In the following sections, we thoroughly describe the detailed procedures involved in creating the RealQA dataset in Sec.3.1. Then, Sec.3.2 describes numerical ambiguity, the proposed NCM and its variant $\mathbf{NCM^{*}}$ . Next, Sec.3.3 outlines the comprehensive training recipe, where we present the fine-grained attributes training and CoT training with multiple forms. Finally, we show the model architecture.
3.1. Realistic Image Quality and Aesthetic Dataset
Data Curation. Adequate image quality and aesthetic scoring generally depends on a wide range of image sources. To reflect this variability, we collect 14,715 UGC images from AutoNavi, including tourist attractions, restaurants, hotels and other user-active areas. These images undergo processing steps, including subtitle and watermark filtering and manual image stitching filtering. After the filtering processes, we divide the dataset into a training set containing 13,712 images and a test set containing 1,003 images.
Stage 1. Multi-level Attributes Training
Stage 2. CoT Training with Multiple Forms
Figure 3: Training pipeline. In stage 1, we utilize a cold-start strategy to empower MLLMs with the capability to extract the fine-grained attributes. Then, we self-label the fine-grained attributes of images in public datasets. In stage 2, we retrain the model by combining the fine-grained attributes and numerical scores for public datasets in the CoT manner.
Image Attributes Selection. Human perception of image quality and aesthetics relies on multiple attributes. As shown in Fig.1 (a), we meticulously categorize image quality and aesthetics into three perspectives: high-level attributes, middle-level attributes, and low-level attributes. From the high-level perspective, the eye-catching score describes how attractive the content of the image is, and the composition score represents how well the arrangement and organization of elements within a work create a harmonious and aesthetically pleasing whole. From the middle-level perspective, we first identify the subject and the background. Based on these, we choose subject integrity, subject clutter and background clutter. Subject integrity describes how complete the subject is. Subject clutter describes the degree of clutter of the subject, which is usually useful in images related to food, hotels, groups of subjects and so on. Similarly, background clutter examines the visual noise in the background that might distract from or compete with the subject, influencing the overall coherence and balance of the image. Besides, the level shot attribute determines whether the image is taken straight or non-horizontally. From the low-level perspective, we select image clarity, exposure and saturation, which pertain to the fundamental quality of images.
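For downstream tooling, this taxonomy can be transcribed directly; the sketch below lists the attributes by level (key and attribute names are ours, not a released schema):

```python
# The fine-grained attributes of RealQA grouped by level (a plain
# transcription of the taxonomy above; key/attribute names are ours).
# Identifying the subject and background precedes the middle-level judgments.
REALQA_ATTRIBUTES = {
    "high":   ["eye_catching", "composition"],
    "middle": ["subject_integrity", "subject_clutter",
               "background_clutter", "level_shot"],
    "low":    ["image_clarity", "exposure", "saturation"],
}
```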
Annotation. For high-level attributes, we assess a score ranging from 1 to 10 and provide appropriate reasons to support the evaluations. For attributes at other levels, we employ multi-tiered classifications; for example, subject and background clutter are categorized as cluttered, moderately cluttered and uncluttered (see Appendix B.2 for all attribute details). However, annotating fine-grained attributes is challenging. Initially, we employ professionally trained annotators to annotate the fine-grained attributes, but there is significant variability in their annotations. To address this issue, we assign each tag to specific annotators. Nevertheless, as individual annotators handle more items, their annotations tend to become more neutral and conservative. Ultimately, using prompt engineering, we extensively adopt closed-source MLLMs (e.g., GPT-4o [22]) to label the images. Subsequently, we instruct the annotators to refine the outputs generated by the closed-source MLLMs and correct obvious errors when the MLLM predictions are wrong.
3.2. Next Token Prediction Paradigm
Numerical Ambiguity. When using MLLMs to predict numerical scores in the next token prediction paradigm, numerical scores consist of multiple tokens. For Qwen2-VL, the score 3.99 is typically represented by the tokens “3”, “.”, “9”, and “9”. During training with teacher-forcing [25], the cross-entropy loss $\mathcal{L}_{ce}$ can be formulated as follows:
$$
\mathcal{L}_{ce}=-\sum_{i=1}^{T}\log P(t_{i}\mid t_{1},t_{2},\ldots,t_{i-1}),
$$
where $t_{i}$ denotes the $i$-th token and $T$ denotes the sequence length. The cross-entropy loss only maximizes the probability of the logits for the corresponding GT tokens, leaving other negative tokens unsupervised. For numerical scores composed of multiple tokens, a natural question is whether MLLMs memorize the tokens at different positions or understand them as a whole. For example, during inference, if the first digit is predicted incorrectly, it may cause numerical ambiguity, as shown in Fig.4: the prediction with the larger numerical error can nevertheless incur the smaller cross-entropy loss.
Figure 4: The numerical ambiguity of the standard cross-entropy loss $\mathcal{L}_{ce}$ for predicting numerical scores. “4.01” is a more accurate prediction than “4.99”, but its $\mathcal{L}_{ce}$ is larger. The proposed NCM and its variant $\mathrm{NCM^{*}}$ transform the discrete tokens into continuous expectations to monitor the influence of the numerical ambiguity.
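The effect in Fig.4 can be reproduced with a toy calculation. In the sketch below, the per-position probabilities are invented purely for illustration; each list holds the probability assigned to the GT digits of “3.99” at the three digit positions:

```python
# Toy illustration of numerical ambiguity (made-up probabilities).
# GT score is "3.99"; model A greedily decodes "4.01", model B decodes "4.99".
import math

p_gt_model_a = [0.40, 0.10, 0.10]  # decodes 4.01: numerically close to 3.99
p_gt_model_b = [0.45, 0.90, 0.90]  # decodes 4.99: numerically far from 3.99

ce_a = -sum(math.log(p) for p in p_gt_model_a)  # teacher-forced CE on GT digits
ce_b = -sum(math.log(p) for p in p_gt_model_b)
print(f"CE(model A) = {ce_a:.2f} > CE(model B) = {ce_b:.2f}")  # 5.52 > 1.01
```

The more accurate model A is penalized more, because cross-entropy sees only the GT tokens, not numerical distance.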
where $M$ denotes the number of digits. We normalize scores from different datasets to this range (e.g., for score 3.9845, $S_{GT}{=}4.0$ when $M{=}2$ and $S_{GT}{=}3.98$ when $M{=}3$). The function $\mathrm{round}(\frac{k}{10^{M-1}},M-1)$ rounds the number $\frac{k}{10^{M-1}}$ to $M-1$ decimal places. Let $z^{n}\in\mathbb{R}^{M\times d}$ denote the predicted logits corresponding to the digits, where $d$ is 10, since only the digits 0 through 9 are considered. We can calculate the mathematical expectation $\mathbb{E}[z_{i}^{n}]$ corresponding to the $i$-th digit logits $z_{i}^{n}\in\mathbb{R}^{d}$ of the final numerical score, which is defined as follows:
$$
\mathbb{E}[z_{i}^{n}]=\mathbf{v}\cdot\mathrm{Softmax}(z_{i}^{n})=\frac{\mathbf{v}\cdot e^{z_{i}^{n}}}{\sum_{j=1}^{d}e^{z_{i,j}^{n}}},
$$
where $\mathbf{v}=(0,1,2,\ldots,9)$ denotes the digits vector. Furthermore, the expectation $\mathbb{E}[S_{pred}]$ of the predicted numerical score $S_{pred}$ can be formulated as follows:
$$
\mathbb{E}[S_{pred}]=w_{1}\mathbb{E}[z_{1}^{n}]+w_{2}\mathbb{E}[z_{2}^{n}]+\cdots+w_{M}\mathbb{E}[z_{M}^{n}]=\mathbf{w}\cdot\frac{\mathbf{v}\cdot e^{z^{n}}}{\sum_{j=1}^{d}e^{z_{j}^{n}}}.
$$
The numerical weight vector can be formulated as $\mathbf{w}=(1,0.1,\ldots,10^{1-M})$. Suppose the ground-truth score is $S_{GT}\in[0,10)$; the final NCM can be formulated as follows:
$$
\mathrm{NCM}=\mathrm{MSE}\big(\mathbb{E}[S_{pred}],S_{GT}\big).
$$
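For concreteness, a NumPy sketch of the expectation and NCM follows; the $M\times 10$ logits array is hypothetical, and in practice the logits are read off at the digit positions of the decoded score (skipping the “.” token):

```python
import numpy as np

def expected_score(logits: np.ndarray) -> float:
    """E[S_pred] from per-digit logits (shape M x 10, for digits 0-9)."""
    v = np.arange(10)                            # digit values 0..9
    w = 10.0 ** -np.arange(logits.shape[0])      # place weights (1, 0.1, ...)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(w @ (probs @ v))

def ncm(logits: np.ndarray, s_gt: float) -> float:
    """Numerical Continuity Metric: squared error of E[S_pred] vs S_GT."""
    return (expected_score(logits) - s_gt) ** 2

# Toy check with M=2 digits: mass split between "3"/"4" and "9"/"0", GT 4.0.
logits = np.full((2, 10), -20.0)
logits[0, [3, 4]] = np.log([0.4, 0.6])
logits[1, [9, 0]] = np.log([0.1, 0.9])
print(round(ncm(logits, 4.0), 4))  # ~0.0961: E[S_pred] is about 3.69
```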
To further investigate how MLLMs generalize over numerical scores, we propose a variant metric, $\mathrm{NCM^{*}}$, which calculates the expectation when the GT digits are excluded. The insight is that a well-trained MLLM predicts digits near the GT digits even when the first-ranked tokens (i.e., the GT tokens) are excluded, because adjacent tokens have high probabilities (e.g., for the GT digit $t_{i}{=}3$, the adjacent tokens are 2 and 4). In contrast, poorly converged MLLMs tend to produce randomly distributed expectations.
The new expectation $\mathbb{E}[S_{pred}]^{*}$ of the predicted numerical score $S_{pred}$ can be formulated as follows:
$$
\mathbb{E}[S_{pred}]^{*}=\mathbf{w}\cdot\frac{\mathbf{v}\cdot e^{m\odot z^{n}}}{\sum_{j=1}^{d}e^{z_{j}^{n}}},
$$
where $m\in\mathbb{R}^{d}$ denotes the binary mask: the position corresponding to the GT digit is set to $-\infty$, and the other positions are set to 1. The final $\mathrm{NCM^{*}}$, which ignores the GT digits, can be formulated as follows:
$$
\mathrm{NCM^{*}}=\mathrm{MSE}\big(\mathbb{E}[S_{pred}]^{*},S_{GT}\big).
$$
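A matching sketch for $\mathrm{NCM^{*}}$ follows; here the probabilities are renormalized over the remaining digits after masking out the GT digit, which is one natural reading of the masked expectation (an assumption of this sketch):

```python
import numpy as np

def ncm_star(logits: np.ndarray, gt_digits: list, s_gt: float) -> float:
    """NCM*: expectation computed after masking out the GT digit per position,
    renormalizing over the remaining digits (an assumption of this sketch)."""
    masked = logits.astype(float).copy()
    for i, d in enumerate(gt_digits):
        masked[i, d] = -np.inf                  # exclude the first-ranked GT token
    v = np.arange(10)                           # digit values 0..9
    w = 10.0 ** -np.arange(masked.shape[0])     # place weights (1, 0.1, ...)
    probs = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)
    return float((w @ (probs @ v) - s_gt) ** 2)
```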
3.3. Training Recipe
Training Pipeline. As shown in Fig.3, the training process can be divided into two stages: fine-grained attributes training and CoT training with multiple forms. The first stage utilizes a cold-start strategy to empower MLLMs with the capability to extract the fine-grained attributes; the corresponding conversation formats are described below. The fine-grained attributes align with human perception and increase interpretability, which can further improve the understanding of MLLMs. Then, we self-label the fine-grained attributes of images in public datasets. During the second stage, we retrain the model by combining the fine-grained attributes and the numerical scores for public datasets (e.g., AVA and KonIQ-10k) in the CoT manner.
Conversation Formats. During the first stage, to accommodate diverse requirements and improve the capability of extracting the fine-grained attributes, we provide three specific conversation templates: 1) Q&A for individual items (“Attributes”), 2) Q&A for a single level (“Levels”), and 3) Q&A for all items (“Mix”). “Attributes” denotes that we predict one attribute per conversation, “Levels” that we predict the attributes of a certain level per conversation, and “Mix” that we predict all attributes in one conversation. See Appendix B.3 for the specific templates. During the second stage, according to different purposes, we organize several conversation formats: 1) the direct answer with the numerical score (Q1-R1), 2) CoT in a natural human language format (Q2-R2) and 3) CoT in a format conducive to regular expression extraction (Q3-R3). We can get direct or CoT responses based on the different questions. Fig.1 shows examples of the two CoT formats; the Q1-R1 format directly requests and returns the numerical score.
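For illustration only, a hypothetical Q1-R1 exchange might look as follows; the actual template wording is given in Appendix B.3:

```python
# Hypothetical Q1-R1 conversation (direct numerical answer). The wording is
# ours; the released templates live in Appendix B.3 of the paper.
q1_r1 = [
    {"role": "user",
     "content": "Please rate the overall quality and aesthetics of this "
                "image with a score from 1.0 to 10.0."},
    {"role": "assistant", "content": "The score is 4.36."},
]
```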
Model Architecture. As illustrated in Fig.3, we utilize LoRA [19] to fine-tune Qwen2-VL-7B. Referring to Q-Align [44], we unfreeze the vision encoder, allowing the model to learn different levels of granularity in the input images adaptively, and set the LoRA rank to 128.
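In PEFT terms, the setup corresponds roughly to the sketch below; the paper specifies only rank 128 and an unfrozen vision encoder, so the target modules, alpha, and attribute paths are assumptions:

```python
# Sketch of the fine-tuning setup: LoRA rank 128 on the language layers,
# vision encoder fully unfrozen. Target modules and alpha are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
lora_cfg = LoraConfig(r=128, lora_alpha=256,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_cfg)

# Unfreeze the vision encoder so it adapts to different granularities
# (attribute path assumes the Hugging Face Qwen2-VL implementation).
for p in model.base_model.model.visual.parameters():
    p.requires_grad = True
```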
Table 1: Evaluation results of the fine-grained attributes on the RealQA dataset. The metrics are SRCC/PLCC.
Attributes are grouped by level: Eye-Catch. and Comp. are high-level; Subj. Int., Subj. Clut., Back. Clut. and Lvl. Shot are middle-level; Img. Clarity, Exposure and Saturation are low-level.

| MLLMs | Eye-Catch. | Comp. | Subj. Int. | Subj. Clut. | Back. Clut. | Lvl. Shot | Img. Clarity | Exposure | Saturation | Average | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL-7B | 0.627/0.692 | 0.616/0.663 | N.A. | N.A. | 0.269/0.265 | N.A. | 0.444/0.518 | 0.171/0.191 | 0.507/0.559 | N.A. | 7 |
| Ours | 0.711/0.755 | 0.825/0.854 | 0.493/0.491 | 0.552/0.641 | 0.380/0.380 | 0.405/0.405 | 0.640/0.702 | 0.535/0.553 | 0.752/0.774 | 0.588/0.617 | 1 |
| GPT-4o [22] | 0.702/0.722 | 0.795/0.812 | 0.232/0.231 | 0.337/0.372 | 0.416/0.412 | 0.238/0.238 | 0.625/0.676 | 0.520/0.513 | 0.633/0.655 | 0.500/0.515 | 2 |
| Qwen-VL-Max [2] | 0.648/0.694 | 0.701/0.749 | 0.231/0.232 | 0.461/0.488 | 0.381/0.376 | 0.273/0.273 | 0.581/0.622 | 0.348/0.361 | 0.683/0.686 | 0.479/0.498 | 3 |
| GPT-4o-mini [1] | 0.678/0.707 | 0.753/0.791 | 0.164/0.167 | 0.327/0.341 | 0.354/0.356 | 0.266/0.266 | 0.525/0.549 | 0.293/0.313 | 0.736/0.763 | 0.455/0.473 | 4 |
| GPT-4V [29] | 0.656/0.708 | 0.713/0.755 | 0.212/0.213 | 0.362/0.397 | 0.363/0.368 | 0.319/0.319 | 0.500/0.576 | 0.282/0.311 | 0.585/0.600 | 0.444/0.472 | 5 |
| Qwen2-VL-72B | 0.633/0.703 | 0.615/0.677 | 0.263/0.264 | 0.455/0.476 | 0.123/0.122 | 0.248/0.248 | 0.587/0.645 | 0.336/0.351 | 0.652/0.665 | 0.435/0.461 | 6 |
4. Experiments
In the following sections, we first present the dataset and implementation details in Sec.4.1 and Sec.4.2. Then, we demonstrate the ability of the cold-start model to extract attributes compared to general MLLMs in Sec.4.3 and provide an ablation study on attribute prediction in Sec.4.4. Next, we conduct in-depth research on using MLLMs to directly predict numerical scores in Sec.4.5 and compare this approach with the regression method in Sec.4.6. Finally, we present the comprehensive quantitative results in Sec.4.7.
4.1. Datasets and Metrics
We use six realistic public datasets covering the IAA, IQA and VQA tasks. For IAA, we adopt AVA [28] and TAD66K [14]. For IQA, we utilize KonIQ-10k [18], SPAQ [11] and LIVE Challenge [12]. For VQA, we use KoNViD [17]. Although there is related IQA work using synthetic datasets, it is beyond the scope of our discussion. Following previous methods [15, 44], we use the Pearson linear correlation coefficient (PLCC) and Spearman rank correlation coefficient (SRCC) to evaluate the numerical scores and the fine-grained attributes. To assess the fine-grained attributes, we convert qualitative attributes into quantitative values by assigning numerical codes (e.g., cluttered $=1$, moderately cluttered $=2$, uncluttered $=3$).
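The coding and both correlation metrics are straightforward to compute; a small sketch with dummy labels:

```python
# SRCC/PLCC over coded qualitative attributes (dummy data for illustration).
from scipy.stats import pearsonr, spearmanr

coding = {"cluttered": 1, "moderately cluttered": 2, "uncluttered": 3}
gt = [coding[x] for x in ["cluttered", "uncluttered", "moderately cluttered"]]
pred = [coding[x] for x in ["cluttered", "moderately cluttered", "uncluttered"]]

plcc, _ = pearsonr(gt, pred)
srcc, _ = spearmanr(gt, pred)
print(f"PLCC={plcc:.3f}, SRCC={srcc:.3f}")
```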
4.2. Implementation Details
Based on Qwen2-VL-7B, we explore the capabilities of MLLMs in quality and aesthetic scoring. We uniformly train the model for 2 epochs. When CoT data is involved, we train for 6 epochs to achieve full convergence. We finetune the model with 4 NVIDIA A6000 GPUs.
4.3. Comparison with General MLLMs
We evaluate the proposed method against general MLLMs on extracting the fine-grained attributes on the RealQA dataset. The general MLLMs adopt the carefully designed prompts used when annotating with the closed-source MLLMs. As shown in Tab.1, the proposed method significantly outperforms competitors in both SRCC and PLCC. For Qwen2-VL-7B, we mark results as N.A. (i.e., not applicable) due to failures to follow certain instructions. The proposed model consistently achieves the highest average rank of 1, outperforming other competitive models, such as GPT-4o, Qwen-VL-Max, and GPT-4V.
Table 2: Ablation of the fine-grained attributes training.
| Training Granularity | Level Order | SRCC | PLCC |
|---|---|---|---|
| Attributes | N.A. | 0.541 | 0.571 |
| Levels | N.A. | 0.567 | 0.595 |
| Mix | High → Middle → Low | 0.570 | 0.593 |
| Mix | Low → Middle → High | 0.564 | 0.592 |
| Mix + Levels + Attributes | High → Middle → Low | 0.588 | 0.617 |
Table 3: Comparison of numerical sorting capabilities.
| Method | Accuracy | Recall | Hallucination |
|---|---|---|---|
| LLaVA-1.5-7B [13] | 0.075 | 0.953 | 0.020 |
| mPLUG-Owl2-7B [49] | 0.399 | 0.951 | 0.024 |
| Qwen2-VL-2B | 0.753 | 0.981 | 0.004 |
| MiniCPM-V2.5-8B [48] | 0.855 | 0.995 | 0.003 |
| InternVL2-8B [7] | 0.957 | 0.997 | 0.003 |
| Qwen2-VL-7B | 0.968 | 0.998 | 0.003 |
4.4. Ablation of Fine-grained Attributes Training
To accommodate diverse requirements and improve the capability of extracting the fine-grained attributes, we split the training granularity of the fine-grained attributes into “Attributes”, “Levels” and “Mix”, as described in Sec.3.3. We show the ablation results in Tab.2. Comparing the first three rows, we find that aggregating attributes into one conversation yields better performance. Considering the mutual influence between attributes at different levels, we further conduct an ablation study on the order of levels in the “Mix” mode. Predicting high-level attributes first gives similar results to predicting low-level attributes first, and we choose the former by default. Finally, it can be observed that combining data from the different training granularities (“Mix + Levels + Attributes”) yields the best performance.
4.5. Directly Predict Numerical Scores by MLLMs
In this section, we conduct a series of in-depth investigations into how to effectively predict numerical scores using MLLMs. To this end, a) first, we test the zero-shot capability of common 7B-sized MLLMs for numerical sorting. b) Second, we perform an ablation study on the number of digits that should be predicted in the next token paradigm. c) Third, we monitor NCM and its variant $\mathrm{NCM^{*}}$ during training to determine whether MLLMs memorize the tokens at different positions or understand them as a whole.
Zero-shot Capability for Numerical Sorting. We conduct a toy example to test the ability of popular open-source MLLMs. Specifically, we randomly generate 10 decimals with at most two decimal places, ranging from 1 to 10, and prompt the MLLMs to sort them from low to high. We repeat the experiment 200 times and show the average results in Tab.3. We adopt 3 metrics: accuracy, recall and hallucination. Accuracy represents the accuracy of the predicted sequence compared to the GT sorted sequence. Recall indicates the proportion of the predicted sequence that uses the numbers to be sorted. Hallucination indicates the proportion of the predicted sequence that uses numbers not in the sequence to be sorted. Interestingly, MLLMs perform quite differently on numerical sorting. Qwen2-VL-7B achieves the best results. However, basic models with less training data, such as LLaVA-1.5-7B, perform poorly on the numerical sorting task and are almost guessing. The base model of Q-Align (i.e., mPLUG-Owl2-7B) also obtains poor results.
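Scoring one trial is mechanical. The sketch below shows the evaluation side under one plausible reading of the three metrics (the MLLM prompting itself is elided; `predicted` stands in for the parsed model response):

```python
import random

# One sorting trial; `predicted` stands in for the parsed MLLM response.
numbers = [round(random.uniform(1, 10), 2) for _ in range(10)]
predicted = sorted(numbers)  # a perfect answer, for illustration

accuracy = float(predicted == sorted(numbers))                # exact-match reading
recall = sum(n in predicted for n in numbers) / len(numbers)  # source numbers reused
hallucination = (sum(p not in numbers for p in predicted)
                 / max(len(predicted), 1))                    # invented numbers
print(accuracy, recall, hallucination)
```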
Table 4: Ablation of t