[论文翻译]下一个Token就够了:基于多模态大语言模型的真实图像质量与美学评分


原文地址:https://arxiv.org/pdf/2503.06141v1


# Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model

下一个Token就够了:基于多模态大语言模型的真实图像质量与美学评分


Figure 1: Realistic image Quality and Aesthetic (RealQA) dataset (b), including 10 rich fine-grained attributes (a). Based on these attributes, we can (1) reconstruct public datasets in a CoT manner and (2) directly apply the fine-grained attributes and composite scores to real-world applications. We show two kinds of CoT forms on the AVA dataset in (c). And we show the real-world applications in (d), where Q-Align [44] trained on the AVA dataset [28] assigns an inappropriate rank (scores scaled to 1-10 for fairness), while the model trained on the RealQA dataset gives the correct rank with the rich fine-grained attributes. Here, we show only part of the predicted fine-grained attributes for clearer viewing.

图 1: 真实图像质量与美学 (RealQA) 数据集 (b), 包含10个丰富的细粒度属性 (a)。基于这些属性, 我们可以 (1) 以思维链 (CoT) 方式重构公共数据集, (2) 直接将细粒度属性和综合评分应用于现实场景。我们在 (c) 中展示了AVA数据集上的两种CoT形式, 并在 (d) 中展示了实际应用案例: 基于AVA数据集 [28] 训练的Q-Align [44] 给出了不合理的评分 (分数按1-10标准化以保持公平性), 而基于RealQA数据集训练的模型则通过丰富的细粒度属性给出了正确排序。为便于观察, 此处仅显示部分预测的细粒度属性。

Abstract

摘要

The rapid expansion of mobile internet has resulted in a substantial increase in user-generated content (UGC) images, thereby making the thorough assessment of UGC images both urgent and essential. Recently, multimodal large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA). Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) A single score is inadequate to capture the hierarchical human perception. 2) How to use MLLMs to output numerical scores, such as mean opinion scores (MOS), remains an open question. To address these challenges, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA), including 14,715 UGC images, each of which is annotated with 10 fine-grained attributes. These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity) and high level (e.g., composition). Besides, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next token paradigm can achieve SOTA performance. Furthermore, with the help of chain of thought (CoT) [41] combined with the learnt fine-grained attributes, the proposed method can outperform SOTA methods on five public datasets for IQA and IAA with superior interpretability and show strong zero-shot generalization for video quality assessment (VQA). The code and dataset will be released.

移动互联网的迅猛发展导致用户生成内容(UGC)图像数量激增,这使得对UGC图像的全面评估变得既紧迫又必要。近期,多模态大语言模型(MLLM)在图像质量评估(IQA)和图像美学评估(IAA)领域展现出巨大潜力。尽管取得进展,有效评估UGC图像的质量与美学仍面临两大挑战:1) 单一分数难以捕捉人类感知的层次性;2) 如何利用MLLM输出数值分数(如平均意见得分MOS)仍是待解难题。为解决这些问题,我们提出了名为Realistic image Quality and Aesthetic (RealQA)的新数据集,包含14,715张UGC图像,每张图像标注了10个细粒度属性。这些属性涵盖三个层次:低层次(如图像清晰度)、中层次(如主体完整性)和高层次(如构图)。此外,我们针对如何利用MLLM预测数值分数开展了一系列深入全面的研究。令人惊讶的是,仅需预测两个额外有效数字,next token范式就能达到SOTA性能。进一步地,通过思维链(CoT) [41]结合学习到的细粒度属性,所提方法在五个公开IQA和IAA数据集上超越SOTA方法,具有更优的可解释性,并在视频质量评估(VQA)中展现出强大的零样本泛化能力。代码和数据集将公开。

1. Introduction

1. 引言

The widespread adoption of mobile internet enables users to effortlessly upload images, leading to the large-scale generation of UGC images. Scoring UGC images in a way that closely aligns with human perception becomes increasingly important in real-world applications.

移动互联网的广泛普及使用户能够轻松上传图像,导致大规模用户生成内容(UGC)图像的涌现。在实际应用中,以符合人类感知的方式对UGC图像进行评分变得愈发重要。

Recently, numerous advancements [42, 44, 45, 51, 61, 62] have been made in leveraging MLLMs for IQA and IAA, due to their exceptional capabilities in visual and linguistic understanding. Despite this progress, effectively scoring the quality and aesthetics of the UGC images still faces two main challenges: 1) A single score (e.g., MOS) is inadequate to capture the hierarchical human perception. Simply learning one score restricts the capability of MLLMs to capture the underlying rationale behind judgments, thereby limiting the effective alignment with human perception [24, 37, 59]. We believe that incorporating fine-grained attribute-level content can promote more robust consistency with human perception and better interpretability. 2) How to utilize MLLMs to predict the final score (e.g., MOS) is still an open question. Current approaches typically employ either the pre-defined textual labels (e.g., “poor”, “bad”) [42, 44] to associate with the limited discrete scores or direct regression to a numerical score from hidden layers [16]. Nevertheless, the pre-defined textual labels pose challenges for further refinements. For example, if we define a word between “poor” and “bad”, it is hard to find it accurately in the predefined vocabulary. Meanwhile, scores predicted by the regression method cannot be seamlessly integrated with the next-token prediction paradigm. This motivates us to ask: Is it possible for MLLMs to directly predict the final score numerically?

最近,多项研究[42, 44, 45, 51, 61, 62]利用多模态大语言模型(MLLM)在视觉与语言理解方面的卓越能力,推动了图像质量评估(IQA)和图像美学评估(IAA)的进展。然而,在有效评估用户生成内容(UGC)图像的质量与美学时仍面临两大挑战:1) 单一评分(如MOS)难以捕捉人类感知的层次性。仅学习单一分数会限制MLLM捕捉判断背后逻辑的能力,从而影响与人类感知的有效对齐[24, 37, 59]。我们认为引入细粒度属性级内容可增强与人类感知的一致性并提升可解释性。2) 如何利用MLLM预测最终分数(如MOS)仍是开放问题。现有方法通常采用预定义文本标签(如"poor"、"bad")[42, 44]关联有限离散分数,或通过隐藏层直接回归数值分数[16]。但预定义标签难以灵活调整——例如若需在"poor"与"bad"之间新增描述,很难在既定词表中准确定位;而回归方法预测的分数无法与下一token预测范式无缝衔接。这促使我们思考:能否让MLLM直接以数值形式预测最终分数?

To address these challenges, we first propose a novel dataset, named Realistic image Quality and Aesthetic (RealQA) dataset. To align with the human perception, as shown in Fig.1 (a), we decompose human perception into three perspectives: low-level attributes, middle-level attributes and high-level attributes, which includes 10 finegrained human perceptual attributes in total. Specifically, the high-level attributes offer a more comprehensive assessment of an image, such as composition and the degree of the eye-catching. The middle-level attributes differentiate the layering between foreground and background and emphasize the expression of neatness and integrity. The low-level attributes pertain to the fundamental quality of images, including clarity, exposure, and saturation. The RealQA dataset collects 14,715 images from AutoNavi, which are taken in various industries, including tourist attractions, restaurants, hotels, leisure and entertainment venues, and other user-active areas. To ensure the application for the real-world UGC image scenarios, we collect the feedback in the real application to determine the weights of various attributes by partial least squares. These weights are aggregated into a composite score, facilitating the comprehensive evaluation of the fine-grained attributes. Although the composite score does not follow the MOS collection, it relies on the online results to better fit the real applications.

为了解决这些挑战,我们首先提出了一个名为真实图像质量与美学 (RealQA) 数据集的新颖数据集。为了与人类感知保持一致,如图 1 (a) 所示,我们将人类感知分解为三个层次:低层属性、中层属性和高层属性,共包含10个细粒度的人类感知属性。具体而言,高层属性提供了对图像更全面的评估,例如构图和吸睛程度。中层属性区分前景与背景的层次感,并强调整洁性与完整性的表达。低层属性涉及图像的基础质量,包括清晰度、曝光和饱和度。RealQA数据集从高德地图收集了14,715张图像,这些图像拍摄于旅游景点、餐厅、酒店、休闲娱乐场所等用户活跃区域的不同行业。为确保适用于真实世界的用户生成内容 (UGC) 图像场景,我们通过偏最小二乘法收集实际应用中的反馈来确定各属性的权重。这些权重被汇总为一个综合评分,便于对细粒度属性进行全面评估。尽管综合评分不遵循平均意见得分 (MOS) 的收集方式,但它依赖于在线结果以更好地适应实际应用。
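To make the aggregation step concrete, the sketch below fits attribute weights with partial least squares and combines them into a composite score. It is a minimal illustration under assumptions, not the authors' pipeline: the placeholder data, the component count, and all variable names are our own.

为使聚合步骤更具体,下面给出一个用偏最小二乘拟合属性权重并合成综合评分的最小示例;占位数据、成分数与变量名均为假设,并非作者的实现。

```python
# A minimal sketch (not the authors' code) of deriving attribute weights with
# partial least squares and aggregating them into a composite score.
# `attributes` (N x 10) are the numerically coded fine-grained attributes and
# `feedback` (N,) stands in for the online user-feedback signal; both are assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
attributes = rng.uniform(1, 10, size=(200, 10))               # placeholder data
feedback = attributes @ rng.uniform(0, 1, 10) + rng.normal(0, 0.5, 200)

pls = PLSRegression(n_components=3)                           # component count is a guess
pls.fit(attributes, feedback)
weights = pls.coef_.ravel()                                   # one weight per attribute

def composite_score(attr_row: np.ndarray) -> float:
    """Weighted aggregation of the 10 fine-grained attributes into one score."""
    return float(attr_row @ weights)

print(composite_score(attributes[0]))
```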


Figure 2: Top 3 Pearson correlation coefficients between predicted attributes and MOS on the IQA (Koniq-10k [18]) and IAA (AVA [28]) datasets, respectively. Some attributes already have a high correlation with the MOS for IQA and IAA, which can be utilized to improve the ability of MLLMs to score image quality and aesthetics.

图 2: 预测属性与MOS在IQA (Koniq-10k [18]) 和IAA (AVA [28]) 数据集上的前3个皮尔逊相关系数。部分属性已与IQA和IAA的MOS呈现较高相关性,可用于提升MLLMs对图像质量和美学评分的能力。

Second, since a numerical score is generally composed of multiple tokens in MLLMs (e.g., for 3.99, the tokens of Qwen2-VL [39] are “3”, “.”, “9” and “9”), it is not a simple one-time token classification problem. Based on the above observation, we conduct a series of in-depth and comprehensive research on how to use MLLMs to directly predict scores numerically. Interestingly, we find several conclusions: a) For simple numerical sorting, even MLLMs of similar sizes vary greatly in their performance. b) Training with the next token paradigm, predicting two extra significant digits can greatly improve model performance compared to directly predicting the integer. c) We propose the Numerical Continuity Metric (NCM) and its variant $\mathrm{NCM^{*}}$, which verify that well-trained MLLMs understand numerical scores as a whole rather than memorizing tokens in different positions. Although previous methods [16, 44] report inferior performance when MLLMs directly predict the final numerical score, in this paper we argue that it is entirely feasible to employ MLLMs for accurate numerical score prediction within the next token prediction paradigm.

其次,由于多模态大语言模型(MLLM)中的数值分数通常由多个token组成(例如Qwen2-VL [39]对3.99的token化结果为"3"、"."、"9"和"9"),这并非简单的单次token分类问题。基于上述观察,我们针对如何利用MLLM直接预测数值分数展开了一系列深入全面的研究。有趣的是,我们发现了以下结论:a) 对于简单数值排序任务,即使规模相近的MLLM也表现出显著性能差异;b) 采用下一个token预测范式训练时,相比直接预测整数,额外预测两位有效数字能显著提升模型性能;c) 我们提出数值连续性度量(NCM)及其变体$\mathrm{NCM^{*}}$,验证了训练良好的MLLM会将数值分数作为整体理解,而非记忆不同位置的token。尽管先前方法[16,44]显示MLLM直接预测最终数值分数时性能欠佳,但本文论证了在下个token预测范式中使用MLLM实现精确数值分数预测完全可行。

Furthermore, relying on the fine-grained attributes labeled on the RealQA dataset, we can utilize a cold-start strategy to empower MLLMs with the capability to extract these attributes. Then, naturally, we can self-label the attributes of images in the public datasets (e.g., Koniq-10k and AVA). As shown in Fig.2, some of the attributes already have a high correlation with the MOS on the IQA and IAA datasets. Thus, we organize the attributes and the final scores using CoT for the public datasets. We retrain the final model to simultaneously predict the attributes and the final score. With the help of CoT, the proposed method outperforms the SOTA methods on five public IQA and IAA datasets. For example, the proposed method surpasses Q-Align with a PLCC higher by $+1.8%$ on the Koniq-10k dataset and $+4.5%$ on the cross-domain LIVE Challenge [12] dataset. Furthermore, we demonstrate strong generalization on the video quality assessment (VQA) dataset KoNViD [17]. The model trained only with image data achieves a $+36.4%$ improvement in SRCC.

此外,依托RealQA数据集中标注的细粒度属性,我们可以利用冷启动策略赋予多模态大语言模型提取这些属性的能力。随后,自然可以自标注公开数据集(如Koniq-10k和AVA)中图像的属性。如图2所示,部分属性已与IQA和IAA数据集中的平均意见分(MOS)呈现高度相关性。因此,我们采用思维链(CoT)技术对公开数据集的属性和最终得分进行组织,并重新训练最终模型以同步预测属性和最终得分。借助CoT技术,所提方法在五个公开IQA和IAA数据集上超越了当前最优(SOTA)方法。例如,在Koniq-10k数据集上,所提方法以$+1.8%$的PLCC优势超越Q-Align;在跨域LIVE Challenge[12]数据集上优势达$+4.5%$。此外,我们在视频质量评估(VQA)数据集KoNViD[17]上展现出强大泛化能力——仅使用图像数据训练的模型就实现了SRCC指标$+36.4%$的提升。

Our core contributions can be summarized as three-fold:

我们的核心贡献可总结为以下三点:

• We introduce a novel UGC assessment dataset, named RealQA, a collection of 14,715 UGC images. Each image is annotated with 10 fine-grained human-perceptual attributes, reflecting the common perception about image quality and aesthetics.

• 我们推出了一个名为RealQA的新型用户生成内容(UGC)评估数据集,包含14,715张UGC图像。每张图像都标注了10个细粒度的人类感知属性,反映了对图像质量和美学的普遍认知。

• We conduct a series of in-depth investigations of the next token paradigm on how to use MLLMs to directly predict numerical scores. We find that, based on models trained extensively on large-scale data, directly predicting the numerical scores with two extra significant digits can achieve superior performance.

• 我们针对如何利用多模态大语言模型(MLLM)直接预测数值分数这一问题,对下一token范式展开了一系列深入研究。研究发现,基于模型在大规模数据上的广泛训练,通过额外保留两位有效数字直接预测数值分数可获得更优性能。

• By leveraging CoT to integrate fine-grained attributes, the proposed method surpasses SOTA performance on five IQA and IAA benchmarks and demonstrates strong zero-shot generalization for the VQA task.

• 通过利用思维链 (CoT) 整合细粒度属性,所提出的方法在五个图像质量评估 (IQA) 和图像美学评估 (IAA) 基准上超越了当前最优 (SOTA) 性能,并在视频质量评估 (VQA) 任务中展现出强大的零样本泛化能力。

2. Related Work

2. 相关工作

2.1. Connection between IQA and IAA

2.1. IQA与IAA的关联

In recent years, IQA has gradually shifted towards deep learning-based full-reference or no-reference methods [4, 6, 9, 32, 38, 44, 50, 55]. By integrating multi-scale feature fusion [57], these studies have improved performance by decomposing quality attributes into quantifiable dimensions such as synthetic degradations (e.g., Gaussian noise) [26, 31] and real-world degradations (e.g., sensor noise) [11, 12, 18]. IAA has experienced an evolution from artificial features (e.g., color histogram [10]) to data-driven models [33, 47]. Although both IQA and IAA aim to simulate human subjective perception, most existing methods treat them independently, overlooking their complementary nature at the visual perception level. For example, aesthetic features like color balance may significantly affect quality ratings, while quality defects like noise may reduce aesthetic appeal [58, 60]. Recent studies on AI-Generated Content (AIGC) images indicate that integrating IQA and IAA by jointly learning shared representations can improve model ability to comprehend image perception [53, 56].

近年来,图像质量评估 (IQA) 逐渐转向基于深度学习的全参考或无参考方法 [4, 6, 9, 32, 38, 44, 50, 55]。通过整合多尺度特征融合 [57],这些研究通过将质量属性分解为可量化维度(如合成退化 (synthetic degradation)(例如高斯噪声)[26, 31] 和真实世界退化 (real-world degradation)(例如传感器噪声)[11, 12, 18])提升了性能。图像美学评估 (IAA) 经历了从人工特征(例如颜色直方图 [10])到数据驱动模型 [33, 47] 的演变。尽管 IQA 和 IAA 都旨在模拟人类主观感知,但现有方法大多将它们独立处理,忽视了它们在视觉感知层面的互补性。例如,色彩平衡等美学特征可能显著影响质量评分,而噪点等质量缺陷可能降低美学吸引力 [58, 60]。最近关于生成式 AI (AIGC) 图像的研究表明,通过联合学习共享表征来整合 IQA 和 IAA 可以提升模型对图像感知的理解能力 [53, 56]。

AGIN [5] enhances naturalness attributes (e.g., brightness) and evaluates images based on presence, color, and layout, addressing the limitations of sparse single-task data and noisy annotations. The proposed RealQA dataset integrates IQA and IAA via the fine-grained attributes, creating a more effective and user-centric assessment system.

AGIN [5] 增强了自然度属性(如亮度),并根据存在感、色彩和布局评估图像,解决了稀疏单任务数据和嘈杂标注的局限性。提出的 RealQA 数据集通过细粒度属性整合了图像质量评估 (IQA) 和图像美学评估 (IAA),创建了一个更有效且以用户为中心的评估系统。

2.2. Quality and Aesthetic Scoring with MLLMs

2.2. 基于多模态大语言模型的质量与美学评分

Recently, MLLMs [3, 8, 13, 22, 39] seamlessly integrate visual and linguistic information, exhibiting significant potential. There has been significant progress in research on scoring for IQA and IAA utilizing MLLMs [20, 21, 27, 40]. DepictQA [52] uses MLLMs to generate reasoning description language for IQA to explain the image quality, but does not output the final score. When utilizing MLLMs, how to model numerical scores is an open problem. Q-Bench [42] and Q-Align [44] suggest using the discrete text-defined levels. Among them, the most representative method is Q-Align [44]. Q-Align suggests predicting the discrete text-defined levels (“excellent”, “good”, “fair”, “poor”, and “bad”), which are more suitable for MLLMs to predict. In implementation, these discrete text-defined levels represent different intervals of MOS. However, although the text-defined levels simulate human language habits, further refinement remains challenging. For example, if we define a word between “poor” and “bad”, it is difficult to find it accurately in the predefined vocabulary. Additionally, Videoscore [16] predicts the numerical scores by regression from a linear layer. However, the regression result cannot be naturally output together with the next token prediction paradigm and lacks interpretability. Thus, we conduct a series of in-depth investigations into how to effectively predict numerical scores using MLLMs.

近年来,多模态大语言模型(MLLMs) [3, 8, 13, 22, 39] 实现了视觉与语言信息的无缝融合,展现出巨大潜力。在利用MLLMs进行图像质量评估(IQA)和图像美学评估(IAA)打分的研究领域 [20, 21, 27, 40] 已取得显著进展。DepictQA [52] 使用MLLMs为IQA生成解释图像质量的推理描述语言,但未输出最终分数。如何用MLLMs建模数值分数仍是一个开放性问题。Q-Bench [42] 和 Q-Align [44] 提出使用离散的文本定义等级。其中最具代表性的是Q-Align [44] ,该方法建议预测"优秀"、"良好"、"一般"、"较差"和"差"五个离散文本等级,这种形式更适合MLLMs预测。实际实现中,这些离散文本等级对应不同区间的平均意见分数(MOS)。然而,虽然文本定义等级模拟了人类语言习惯,但进一步细化仍具挑战性。例如若需定义介于"较差"和"差"之间的词汇,很难在预定义词表中准确找到对应表述。此外,Videoscore [16] 通过线性层回归预测数值分数,但回归结果无法与下一token预测范式自然结合,且缺乏可解释性。因此,我们针对如何有效利用MLLMs预测数值分数展开了一系列深入研究。

3. Method

3. 方法

In the following sections, we thoroughly describe the detailed procedures involved in creating the RealQA dataset in Sec.3.1. Then, Sec.3.2 describes numerical ambiguity, the proposed NCM and its variant $\mathbf{NCM^{*}}$ . Next, Sec.3.3 outlines the comprehensive training recipe, where we present the fine-grained attributes training and CoT training with multiple forms. Finally, we show the model architecture.

在以下章节中,我们详细描述了创建RealQA数据集的完整流程(见第3.1节)。随后,第3.2节阐述了数值模糊性、提出的NCM及其变体$\mathbf{NCM^{*}}$。接着,第3.3节概述了综合训练方案,包括细粒度属性训练和多种形式的思维链(CoT)训练。最后,我们展示了模型架构。

3.1. Realistic Image Quality and Aesthetic Dataset

3.1. 真实图像质量与美学数据集

Data Curation. Adequate image quality and aesthetic scoring generally depend on a wide range of image sources. To reflect this variability, we collect 14,715 UGC images from AutoNavi, including tourist attractions, restaurants, hotels and other user-active areas. These images undergo processing steps, including subtitle and watermark filtering and manual image stitching filtering. After the filtering processes, we divide the dataset into a training set containing 13,712 images and a test set containing 1,003 images.

数据整理。良好的图像质量和美学评分通常依赖于广泛的图片来源。为反映这种多样性,我们从高德地图收集了14,715张用户生成内容(UGC)图像,涵盖旅游景区、餐厅、酒店等用户活跃区域。这些图像经过字幕水印过滤、人工拼接筛选等处理步骤。最终将数据集划分为包含13,712张图像的训练集和包含1,003张图像的测试集。

(Figure 3 panel titles / 图 3 面板标题: Stage 1. Multi-level Attributes Training 阶段1:多层次属性训练; Stage 2. CoT Training with Multiple Forms 阶段2:多形式思维链 (CoT) 训练)


Figure 3: Training pipeline. In stage 1, we utilize a cold-start strategy to empower MLLMs with the capability to extract the fine-grained attributes. Then, we self-label the fine-grained attributes of images in public datasets. In stage 2, we retrain the model by combining the fine-grained attributes and numerical scores for public datasets in the CoT manner.

图 3: 训练流程。在第一阶段,我们采用冷启动策略使多模态大语言模型 (MLLM) 具备提取细粒度属性的能力。随后,我们对公开数据集中图像的细粒度属性进行自标注。在第二阶段,我们以思维链 (CoT) 方式结合细粒度属性和公开数据集的数值评分对模型进行重新训练。

Image Attributes Selection. Human perception of image quality and aesthetics relies on multiple attributes. As shown in Fig.1 (a), we meticulously categorize image quality and aesthetics into three perspectives: high-level attributes, middle-level attributes, and low-level attributes. From the high-level perspective, the eye-catching score of the image describes how attractive the content of the image is. The composition score represents how well the arrangement and organization of elements within a work create a harmonious and aesthetically pleasing whole. From the middle-level perspective, we first identify the subject and the background. Based on the subject and the background, we choose the subject integrity, subject clutter and background clutter. The subject integrity describes how complete the subject is. The subject clutter describes the degree of clutter of the subject, which is usually useful in an image related to food, hotels, a group of subjects and so on. Similarly, the background clutter examines the visual noise in the background that might distract from or compete with the subject, influencing the overall coherence and balance of the image. Besides, the level shot determines whether the image is taken straight or non-horizontally. From the low-level perspective, we select image clarity, exposure and saturation, which pertain to the fundamental quality of images.

图像属性选择。人类对图像质量和美感的感知依赖于多重属性。如图1(a)所示,我们将图像质量和美学精心划分为三个维度:高层级属性、中层级属性和低层级属性。从高层级视角来看,图像的吸睛度(eye-catching score)描述了内容吸引力,构图分(composition score)则反映画面元素的排布如何达成和谐美观的整体效果。中层级属性首先识别主体与背景,基于此选取主体完整性(subject integrity)、主体杂乱度(subject clutter)和背景杂乱度(background clutter)。主体完整性衡量主体的完整程度,主体杂乱度适用于餐饮、酒店、群体对象等场景的视觉评估;背景杂乱度则检测可能干扰主体的背景视觉噪声,影响图像的整体协调性。此外,水平拍摄(level shot)判断图像是否为平直取景。低层级属性选择涉及图像基础质量的清晰度(clarity)、曝光度(exposure)和饱和度(saturation)。

Annotation. For high-level attributes, we assess a score ranging from 1 to 10 and provide appropriate reasons to support the evaluations. For attributes at other levels, we employ multi-tiered classifications. For example, subject and background clutter are categorized as cluttered, moderately cluttered and uncluttered. (See Appendix B.2 for all attribute details.) However, annotating fine-grained attributes is challenging. Initially, we employed professionally trained annotators to annotate the fine-grained attributes. However, there is significant variability in the annotation of fine-grained attributes. To address this issue, we assign each tag to some specific annotators. Nevertheless, as individual annotators handle more items, their annotations tend to become more neutral and conservative. Ultimately, using prompt engineering, we extensively adopt closed-source MLLMs (e.g., GPT-4o [22]) to label the images. Subsequently, we instruct the annotators to refine the outputs generated by the closed-source MLLMs and correct the obvious errors when the prediction of MLLMs is wrong.

标注说明。对于高级属性,我们采用1到10分的评分体系,并提供相应的评估依据。其他层级属性则采用多级分类法,例如主体与背景杂乱度分为杂乱、中度杂乱和无杂乱三类 (详见附录B.2)。但细粒度属性的标注具有挑战性:我们首先安排专业培训的标注员进行标注,但发现细粒度属性标注存在显著差异性。为此,我们为每个标签分配特定标注人员。然而随着标注量增加,标注结果会趋向中性保守。最终通过提示工程 (prompt engineering),我们大规模采用闭源多模态大语言模型 (如GPT-4o [22]) 进行图像标注,随后指导标注人员对模型输出进行精细化修正,并在模型预测错误时纠正明显偏差。

3.2. Next Token Prediction Paradigm

3.2. 下一Token预测范式

Numerical Ambiguity. When using MLLMs to predict numerical scores in the next token prediction paradigm, numerical scores consist of multiple tokens. For Qwen2-VL, the score 3.99 is typically represented by the tokens “3”, “.”, “9” and “9”. During training with teacher-forcing [25], the cross-entropy loss $\mathcal{L}_{ce}$ can be formulated as follows:

数值模糊性。当使用多模态大语言模型在下一个token预测范式中预测数值分数时,数值分数由多个token组成。对于Qwen2-VL模型,分数3.99通常由token"3"、"."、"9"和"9"表示。在使用教师强制[25]进行训练时,交叉熵损失$\mathcal{L}_{ce}$可表述如下:

$$
\mathcal{L}_{ce}=-\sum_{i=1}^{T}\log P(t_{i}\mid t_{1},t_{2},\ldots,t_{i-1}),
$$

where $t_{i}$ denotes the $i$-th token and $T$ denotes the sequence length. The cross-entropy loss only maximizes the probability of the logits for the corresponding GT tokens, leaving other negative tokens unsupervised. For numerical scores composed of multiple tokens, a natural question is whether MLLMs memorize the tokens at different positions or understand them as a whole. For example, during inference, if the first digit is predicted incorrectly, it may cause numerical ambiguity, as shown in Fig.4: even when the difference between the predicted number and the GT number is larger, the cross-entropy loss can be smaller.

其中 $t_{i}$ 表示第 $i$ 个 token,$T$ 表示序列长度。交叉熵损失仅最大化对应真实标签 (GT) token 的 logits 概率,而未对其他负样本 token 进行监督。对于由多个 token 组成的数字分数,一个自然的问题是:多模态大语言模型 (MLLM) 究竟是在记忆不同位置的 token,还是将其作为一个整体来理解。例如在推理过程中,若首位数字预测错误,就可能引发数值歧义(如图 4 所示)。当预测数字与真实数字差异越大时,交叉熵损失反而越小。


Figure 4: The numerical ambiguity of the standard cross-entropy loss $\mathcal{L}_{ce}$ for predicting numerical scores. “4.01” is a more accurate prediction than “4.99”, but its $\mathcal{L}_{ce}$ is larger. The proposed NCM and its variant $\mathrm{NCM^{*}}$ transform the discrete tokens into continuous expectations to monitor the influence of the numerical ambiguity.

图 4: 标准交叉熵损失 $\mathcal{L}_{ce}$ 在预测数值分数时的数值模糊性。"4.01"是比"4.99"更准确的预测,但其 $\mathcal{L}_{ce}$ 值更大。提出的 NCM 及其变体 $\mathrm{NCM^{*}}$ 将离散的 token 转换为连续期望值,以监测数值模糊性的影响。
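The toy computation below reproduces this ambiguity numerically under an assumed per-position digit distribution; it is our own construction, not the paper's code, and the GT value “3.99” is an assumed example consistent with Fig.4: the numerically closer “4.01” incurs a larger token-level cross-entropy than “4.99”.

下面的玩具计算在假设的逐位数字分布下复现这一歧义(为本文示意,非论文代码;真值"3.99"为与图4一致的假设示例):数值上更接近的"4.01"反而产生比"4.99"更大的token级交叉熵。

```python
# A toy illustration of the numerical ambiguity in Fig.4 (our construction).
import numpy as np

def digit_dist(intended: int, conf: float = 0.9) -> np.ndarray:
    """Assumed per-position distribution: mass `conf` on the intended digit."""
    p = np.full(10, (1 - conf) / 9)
    p[intended] = conf
    return p

def token_ce(gt_digits, intended_digits) -> float:
    """Teacher-forced cross-entropy summed over the digit positions only."""
    return -sum(np.log(digit_dist(i)[g]) for g, i in zip(gt_digits, intended_digits))

gt = [3, 9, 9]                      # assumed GT "3.99"
print(token_ce(gt, [4, 0, 1]))      # "4.01": ~13.5 (larger loss, closer value)
print(token_ce(gt, [4, 9, 9]))      # "4.99": ~4.7  (smaller loss, farther value)
```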

where $M$ denotes the number of significant digits. We normalize scores from different datasets to the range $[0,10)$ (e.g., for the score 3.9845, $S_{GT}{=}4.0$ when $M{=}2$ and $S_{GT}{=}3.98$ when $M{=}3$). The function $\mathrm{round}(\frac{k}{10^{M-1}}, M-1)$ rounds the number $\frac{k}{10^{M-1}}$ to $M-1$ decimal places. Let $z^{n}\in\mathbb{R}^{M\times d}$ denote the predicted logits corresponding to the digits, where $d$ is 10, since only the digits 0 through 9 are considered. We can calculate the mathematical expectation $\mathbb{E}[z_{i}^{n}]$ corresponding to the $i$-th digit logits $z_{i}^{n}\in\mathbb{R}^{d}$ of the final numerical score, which is defined as follows:

其中 $M$ 表示有效数字位数。我们将不同数据集的分数归一化到 $[0,10)$ 区间(例如,分数3.9845在 $M{=}2$ 时 $S_{GT}{=}4.0$,在 $M{=}3$ 时 $S_{GT}{=}3.98$)。函数 $\mathrm{round}(\frac{k}{10^{M-1}}, M-1)$ 表示将数字 $\frac{k}{10^{M-1}}$ 四舍五入到 $M-1$ 位小数。设 $z^{n}\in\mathbb{R}^{M\times d}$ 表示各数字位对应的预测logits,其中 $d$ 为10,因为仅考虑数字0到9。我们可以计算最终数值分数第 $i$ 位的logits $z_{i}^{n}\in\mathbb{R}^{d}$ 对应的数学期望 $\mathbb{E}[z_{i}^{n}]$,其定义如下:
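A minimal sketch of this label construction, assuming raw MOS values are first min-max scaled into $[0,10)$; the scaling bounds themselves are our assumption.

该标签构造的最小示例,假设原始MOS先经min-max缩放到 $[0,10)$(缩放边界为假设):

```python
# A minimal sketch of building S_GT with M significant digits (scaling assumed).
def make_sgt(mos: float, lo: float, hi: float, M: int) -> float:
    """Normalize a raw MOS to [0, 10) and keep M significant digits."""
    s = 10.0 * (mos - lo) / (hi - lo + 1e-8)
    return round(s, M - 1)   # round(k / 10**(M-1), M-1) in the paper's notation

print(make_sgt(3.9845, 0.0, 10.0, M=2))  # 4.0
print(make_sgt(3.9845, 0.0, 10.0, M=3))  # 3.98
```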

$$
\mathbb{E}[z_{i}^{n}]=\mathbf{v}\cdot\mathrm{Softmax}(z_{i}^{n})=\frac{\sum_{j=1}^{d}v_{j}\,e^{z_{i,j}^{n}}}{\sum_{j=1}^{d}e^{z_{i,j}^{n}}},
$$

where $\mathbf{v}=(0,1,2,\ldots,9)$ denotes the digits vector. Furthermore, the expectation $\mathbb{E}[S_{p r e d}]$ of the predicted numerical score $S_{p r e d}$ can be formulated as follows:

其中 $\mathbf{v}=(0,1,2,\ldots,9)$ 表示数字向量。此外,预测数值分数 $S_{p r e d}$ 的期望 $\mathbb{E}[S_{p r e d}]$ 可表示为:

$$
\mathbb{E}[S_{pred}]=w_{1}\mathbb{E}[z_{1}^{n}]+w_{2}\mathbb{E}[z_{2}^{n}]+\cdots+w_{M}\mathbb{E}[z_{M}^{n}]=\sum_{i=1}^{M}w_{i}\,\mathbb{E}[z_{i}^{n}].
$$

The numerical weights can be formulated as $\mathbf{w}=(1,0.1,\ldots,10^{1-M})$. Suppose the ground-truth score is $S_{GT}\in[0,10)$; the final NCM can be formulated as follows:

数值权重可以表示为 $\mathbf{w}=(1,0.1,\ldots,10^{1-M})$。假设真实分数为 $S_{GT}\in[0,10)$,最终 NCM 可表示如下:

$$
\mathrm{NCM}=\mathrm{MSE}\big(\mathbb{E}[S_{pred}],\,S_{GT}\big).
$$

To further investigate the generalization of numerical scores by MLLMs, we propose a variant metric, $\mathrm{NCM^{*}}$, which calculates the expectation when the GT digits are excluded. The insight is that well-trained MLLMs predict digits near the GT digits even when the first-ranked tokens (i.e., the GT tokens) are excluded, since adjacent tokens have high probabilities (e.g., for the GT digit $t_{i}{=}3$, the adjacent tokens are 2 and 4). In contrast, poorly converged MLLMs tend to predict randomly distributed expectations.

为了进一步研究多模态大语言模型 (MLLM) 对数值分数的泛化能力,我们提出了一种变体指标 $\mathrm{NCM^{*}}$,该指标在排除真实数字 (GT digits) 的情况下计算期望值。其核心思想是:训练良好的 MLLM 即使在排除排名最高的 Token (即真实 Token) 时,仍能预测接近真实数字的数值,这是因为相邻 Token 具有较高概率 (例如真实数字为 $t_{i}{=}3$ 时,相邻 Token 为 2 和 4)。相反,收敛性较差的 MLLM 往往会预测随机分布的期望值。

The new expectation $\mathbb{E}[S_{p r e d}]^{*}$ of the predicted numerical score $S_{p r e d}$ can be formulated as follows:

预测数值分数 $S_{pred}$ 的新期望 $\mathbb{E}[S_{pred}]^{*}$ 可表示为:

$$
\mathbb{E}[S_{p r e d}]^{*}=\pmb{w}\cdot\frac{\mathbf{v}\cdot e^{m\odot z^{n}}}{\sum_{j=1}^{d}e^{z_{j}^{n}}},
$$

where $\pmb{m}\in\mathbb{R}^{d}$ denotes the binary mask: the position corresponding to the GT digit is set to $-\infty$, and the other positions are 1. The final $\mathrm{NCM^{*}}$, which ignores the GT digits, can be formulated as follows:

其中 $\pmb{m}\in\mathbb{R}^{d}$ 表示二元掩码。对应真实数字(ground truth digit)的位置为 $-\infty$,其他位置为1。最终忽略真实数字的 $\mathrm{NCM^{*}}$ 可表示为:

$$
\mathrm{NCM}^{*}=\mathrm{MSE}\big(\mathbb{E}[S_{pred}]^{*},\,S_{GT}\big).
$$
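Putting the pieces together, the sketch below implements NCM and NCM* for a logits tensor of shape $(M, d)$ with $d{=}10$. The tensor layout is our assumption, and for NCM* we renormalize over the remaining digits after masking, which may differ slightly from the paper's exact normalization.

综合上述公式,下面的示例针对形状为 $(M, d)$、$d{=}10$ 的logits张量实现NCM与NCM*;张量布局为假设,且NCM*中我们在掩码后重新归一化,可能与论文的具体归一化略有差异。

```python
# A sketch of NCM / NCM* under the stated assumptions (not the authors' code).
import torch

v = torch.arange(10, dtype=torch.float32)                 # digit values 0..9

def ncm(z: torch.Tensor, s_gt: float, mask_gt: bool = False) -> torch.Tensor:
    """NCM (mask_gt=False) or NCM* (mask_gt=True) for digit logits z of shape (M, 10)."""
    M = z.shape[0]
    w = 10.0 ** (-torch.arange(M, dtype=torch.float32))   # (1, 0.1, ..., 10^(1-M))
    if mask_gt:                                           # NCM*: exclude the GT digit
        gt_digits = [int(c) for c in f"{s_gt:.{M-1}f}".replace(".", "")]
        z = z.clone()
        z[torch.arange(M), gt_digits] = float("-inf")     # renormalized by softmax below
    expectation = (torch.softmax(z, dim=-1) * v).sum(-1)  # E[z_i^n] per digit position
    s_pred = (w * expectation).sum()                      # E[S_pred]
    return (s_pred - s_gt) ** 2                           # MSE against the GT score

z = torch.randn(3, 10)                                    # M=3 digits, e.g. "3.98"
print(ncm(z, 3.98), ncm(z, 3.98, mask_gt=True))           # NCM and NCM*
```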

3.3. Training Recipe

3.3. 训练方案

Training Pipeline. As shown in Fig.3, the training process can be divided into two stages: fine-grained attributes training and CoT training with multiple forms. The first stage aims to utilize a cold-start strategy to empower MLLMs with the capability to extract the fine-grained attributes. The training conversation formats are described below in Sec.3.3. The fine-grained attributes align with human perception and increase interpretability, which can further improve the understanding of MLLMs. Then, we self-label the fine-grained attributes of images in public datasets. During the second stage, we retrain the model by combining the fine-grained attributes and the numerical scores for public datasets (e.g., AVA and KonIQ-10k) in the CoT manner.

训练流程。如图3所示,训练过程可分为两个阶段:细粒度属性训练和多种形式的思维链(CoT)训练。第一阶段采用冷启动策略,使多模态大语言模型(MLLMs)具备提取细粒度属性的能力。训练对话可参考3.3节。这些细粒度属性符合人类感知并增强可解释性,能进一步提升MLLMs的理解能力。随后,我们对公开数据集中的图像进行细粒度属性的自动标注。在第二阶段,我们以思维链形式结合细粒度属性和公开数据集(如AVA和KonIQ-10k)的数值评分对模型进行重新训练。

Conversation Formats. During the first stage, to accommodate diverse requirements and improve the capability of extracting the fine-grained attributes, we provide three specific conversation templates: 1) Q&A for individual items (“Attributes”), 2) Q&A for a single level (“Levels”), and 3) Q&A for all items (“Mix”). “Attributes” denotes that we predict an attribute in one conversation. “Levels” denotes that we predict certain level attributes in one conversation. “Mix” denotes that we predict all attributes in one conversation. See Appendix B.3 for specific templates. During the second stage, according to different purposes, we organize several conversation formats: 1) the direct answer to numerical scores (Q1-R1), 2) the CoT with a natural human language format (Q2-R2) and 3) the CoT with a format conducive to regular expression extraction (Q3-R3). We can get direct or CoT responses based on different questions. As shown in Fig.1, we show examples of the two CoT formats. Besides, the Q1-R1 format is as follows:

对话格式。在第一阶段,为适应多样化需求并提升细粒度属性提取能力,我们提供了三种特定对话模板:1) 单项问答 ("Attributes"),2) 单层级问答 ("Levels"),3) 全项问答 ("Mix")。"Attributes"表示单次对话预测一个属性,"Levels"表示单次对话预测某层级的若干属性,"Mix"表示单次对话预测所有属性。具体模板参见附录B.3。在第二阶段,根据不同目的我们组织了几种对话格式:1) 直接回答数值分数 (Q1-R1),2) 自然语言格式的思维链 (CoT) (Q2-R2),3) 便于正则表达式提取的思维链格式 (Q3-R3)。基于不同提问方式可获得直接回答或思维链响应。如图1所示,我们展示了两种思维链格式的示例。此外,Q1-R1格式如下:
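The Q1-R1 box itself did not survive extraction; a plausible minimal reconstruction is sketched below, where the exact wording is our assumption rather than the paper's template.

Q1-R1模板框在抽取中丢失;下面给出一个可能的最小重构,具体措辞为假设,并非论文原模板。

```python
# A plausible reconstruction of the Q1-R1 (direct numerical answer) format;
# the prompt wording is assumed, not quoted from the paper.
q1_r1 = [
    {"role": "user",
     "content": "<image>\nPlease rate the overall quality and aesthetics "
                "of this image with a score from 1 to 10."},
    {"role": "assistant", "content": "7.36"},   # direct numerical answer, M=3
]
```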

Model Architecture. As illustrated in Fig.3, we utilize LoRA [19] to fine-tune Qwen2-VL-7B. Referring to Q-Align [44], we unfreeze the vision encoder, allowing the model to adaptively learn different levels of granularity in the input images, and set the LoRA rank to 128.

模型架构。如图3所示,我们采用LoRA [19]对Qwen2-VL-7B进行微调。参考Q-Align [44],我们解冻视觉编码器,使模型能自适应学习输入图像的不同粒度层级,并将LoRA秩设为128。
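A configuration sketch consistent with this description (assumed, not the authors' training script) using the public Qwen2-VL checkpoint and the `peft` library; the LoRA alpha and the vision-encoder attribute path are assumptions.

与上述描述一致的配置示例(为假设,并非作者的训练脚本),使用公开的Qwen2-VL权重与 `peft` 库;LoRA alpha与视觉编码器的属性路径均为假设。

```python
# A sketch matching the text: LoRA rank 128 on Qwen2-VL-7B,
# with the vision encoder left trainable.
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
lora_cfg = LoraConfig(r=128, lora_alpha=256,              # alpha is an assumption
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_cfg)

# Unfreeze the vision encoder so image features adapt as well.
for p in model.base_model.model.visual.parameters():      # attribute path assumed
    p.requires_grad = True
```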

Table 1: Evaluation results of the fine-grained attributes on the RealQA dataset. The metrics are SRCC/PLCC.

| 方法 | 吸睛度 (Eye-Catch) | 构图 (Comp.) | 主体完整性 (Subj.Int.) | 主体杂乱度 (Subj.Clut.) | 背景杂乱度 (Back.Clut.) | 水平拍摄 (Lvl.Shot) | 图像清晰度 (Img.Clarity) | 曝光 (Exposure) | 饱和度 (Saturation) | 平均 | 排名 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B | 0.627/0.692 | 0.616/0.663 | N.A. | N.A. | 0.269/0.265 | N.A. | 0.444/0.518 | 0.171/0.191 | 0.507/0.559 | N.A. | 7 |
| Ours | 0.711/0.755 | 0.825/0.854 | 0.493/0.491 | 0.552/0.641 | 0.380/0.380 | 0.405/0.405 | 0.640/0.702 | 0.535/0.553 | 0.752/0.774 | 0.588/0.617 | 1 |
| GPT-4o [22] | 0.702/0.722 | 0.795/0.812 | 0.232/0.231 | 0.337/0.372 | 0.416/0.412 | 0.238/0.238 | 0.625/0.676 | 0.520/0.513 | 0.633/0.655 | 0.500/0.515 | 2 |
| Qwen-VL-Max [2] | 0.648/0.694 | 0.701/0.749 | 0.231/0.232 | 0.461/0.488 | 0.381/0.376 | 0.273/0.273 | 0.581/0.622 | 0.348/0.361 | 0.683/0.686 | 0.479/0.498 | 3 |
| GPT-4o-mini [1] | 0.678/0.707 | 0.753/0.791 | 0.164/0.167 | 0.327/0.341 | 0.354/0.356 | 0.266/0.266 | 0.525/0.549 | 0.293/0.313 | 0.736/0.763 | 0.455/0.473 | 4 |
| GPT-4V [29] | 0.656/0.708 | 0.713/0.755 | 0.212/0.213 | 0.362/0.397 | 0.363/0.368 | 0.319/0.319 | 0.500/0.576 | 0.282/0.311 | 0.585/0.600 | 0.444/0.472 | 5 |
| Qwen2-VL-72B | 0.633/0.703 | 0.615/0.677 | 0.263/0.264 | 0.455/0.476 | 0.123/0.122 | 0.248/0.248 | 0.587/0.645 | 0.336/0.351 | 0.652/0.665 | 0.435/0.461 | 6 |

(吸睛度、构图为高层级属性;主体完整性、主体杂乱度、背景杂乱度、水平拍摄为中层级属性;图像清晰度、曝光、饱和度为低层级属性。)

表 1: RealQA数据集上细粒度属性的评估结果。指标为SRCC/PLCC。

4. Experiments

4. 实验

In the following sections, we first present the dataset and implementation details in Sec.4.1 and Sec.4.2. Then, we demonstrate the ability of the cold-start model to extract attributes compared to general MLLMs in Sec.4.3 and provide an ablation study on attribute prediction in Sec.4.4. Next, we conduct in-depth research on using MLLMs to directly predict numerical scores in Sec.4.5 and compare this approach with the regression method in Sec.4.6. Finally, we present the comprehensive quantitative results in Sec.4.7.

在以下章节中,我们首先在第4.1节和第4.2节介绍数据集和实现细节。接着,在第4.3节展示冷启动模型相比通用多模态大语言模型(MLLM)的属性提取能力,并在第4.4节提供属性预测的消融研究。随后,在第4.5节深入探讨使用MLLM直接预测数值分数的方法,并在第4.6节将该方法与回归方法进行对比。最后,我们在第4.7节呈现全面的定量结果。

4.1. Datasets and Metrics

4.1. 数据集与评估指标

We use six realistic public datasets catering to the IAA, IQA and VQA tasks. For IAA, we adopt AVA [28] and TAD66K [14]. For IQA, we utilize KonIQ-10k [18], SPAQ [11] and LIVE Challenge [12]. For VQA, we use KoNViD [17]. Although there are related IQA works using synthetic datasets, they are beyond the scope of our discussion. Following the previous methods [15, 44], we use the Pearson linear correlation coefficient (PLCC) and Spearman rank correlation coefficient (SRCC) to evaluate the numerical scores and the fine-grained attributes. To assess the fine-grained attributes, we convert qualitative attributes into quantitative values by assigning numerical codes (e.g., cluttered $=1$, moderately cluttered $=2$, uncluttered $=3$).

我们采用了六个针对IAA、IQA和VQA任务的真实公开数据集。对于IAA任务,选用AVA [28]和TAD66K [14];IQA任务使用KonIQ-10k [18]、SPAQ [11]和LIVE Challenge [12];VQA任务则采用KoNViD [17]。尽管存在基于合成数据集的IQA相关研究,但不在本文讨论范围内。沿用先前方法 [15, 44],我们使用皮尔逊线性相关系数(PLCC)和斯皮尔曼秩相关系数(SRCC)来评估数值分数与细粒度属性。为量化评估细粒度属性,我们将定性属性转换为定量值(例如杂乱=1、中等杂乱=2、整洁=3)。
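A minimal sketch of this evaluation protocol: numerically code a qualitative attribute and compute SRCC/PLCC with `scipy`; the label strings and toy data are illustrative only.

该评估流程的最小示例:将定性属性数值编码后用 `scipy` 计算SRCC/PLCC;标签字符串与玩具数据仅作示意。

```python
# Qualitative-to-quantitative coding plus SRCC/PLCC, as described in the text.
from scipy.stats import spearmanr, pearsonr

CLUTTER = {"cluttered": 1, "moderately cluttered": 2, "uncluttered": 3}

pred = ["uncluttered", "cluttered", "moderately cluttered", "uncluttered"]
gt   = ["uncluttered", "moderately cluttered", "cluttered", "uncluttered"]
x = [CLUTTER[p] for p in pred]
y = [CLUTTER[g] for g in gt]

srcc, _ = spearmanr(x, y)   # rank correlation (handles ties)
plcc, _ = pearsonr(x, y)    # linear correlation
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```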

4.2. Implementation Details

4.2. 实现细节

Based on Qwen2-VL-7B, we explore the capabilities of MLLMs in quality and aesthetic scoring. We uniformly train the model for 2 epochs. When CoT data is involved, we train for 6 epochs to achieve full convergence. We finetune the model with 4 NVIDIA A6000 GPUs.

基于Qwen2-VL-7B,我们探索了多模态大语言模型(MLLM)在质量与美学评分方面的能力。模型统一训练2个周期,当涉及思维链(CoT)数据时,我们训练6个周期以实现完全收敛。使用4块NVIDIA A6000 GPU进行模型微调。

4.3. Comparison with General MLLMs

4.3. 与通用多模态大语言模型的对比

We evaluate the proposed method against general MLLMs on extracting the fine-grained attributes on the RealQA dataset. The general MLLMs are given the carefully designed annotation prompts used for the closed-source MLLMs. As shown in Tab.1, the proposed method significantly outperforms competitors in both SRCC and PLCC. For Qwen2-VL-7B, we annotate results as N.A. (i.e., not applicable) due to failures to follow certain instructions. The proposed model consistently achieves the highest average rank of 1, outperforming other competitive models, such as GPT-4o, Qwen-VL-Max, and GPT-4V.

我们在RealQA数据集上评估了所提方法与通用多模态大语言模型(MLLM)在提取细粒度属性方面的表现。通用MLLM采用了经过精心设计的闭源MLLM标注提示词。如表1所示,所提方法在SRCC和PLCC指标上均显著优于竞争对手。对于Qwen2-VL-7B模型,由于未能遵循特定指令,我们将其结果标注为N.A.(即不适用)。所提模型始终保持着1的最高平均排名,优于GPT-4o、Qwen-VL-Max和GPT-4V等其他竞争模型。

Table 2: Ablation of the fine-grained attributes training.

表 2: 细粒度属性训练的消融实验

| 训练粒度 | 层级顺序 | SRCC | PLCC |
| --- | --- | --- | --- |
| 属性 (Attributes) | N.A. | 0.541 | 0.571 |
| 层级 (Levels) | N.A. | 0.567 | 0.595 |
| 混合 (Mix) | 高→中→低 | 0.570 | 0.593 |
| 混合 (Mix) | 低→中→高 | 0.564 | 0.592 |
| 混合+层级+属性 (Mix+Levels+Attributes) | 高→中→低 | 0.588 | 0.617 |

Table 3: Comparison of numerical sorting capabilities.

表 3: 数值排序能力对比

| 方法 | 准确率 | 召回率 | 幻觉率 |
| --- | --- | --- | --- |
| LLaVA-1.5-7B [13] | 0.075 | 0.953 | 0.020 |
| mPLUG-Owl2-7B [49] | 0.399 | 0.951 | 0.024 |
| Qwen2-VL-2B | 0.753 | 0.981 | 0.004 |
| MiniCPM-V2.5-8B [48] | 0.855 | 0.995 | 0.003 |
| InternVL2-8B [7] | 0.957 | 0.997 | 0.003 |
| Qwen2-VL-7B | 0.968 | 0.998 | 0.003 |

4.4. Ablation of Fine-grained Attributes Training

4.4. 细粒度属性训练的消融研究

To accommodate diverse requirements and improve the capability of extracting the fine-grained attributes, we split the training granularity of the fine-grained attributes into “Attributes”, “Levels” and “Mix”, as described in Sec.3.3. We show the ablation results in Tab.2. Referring to the first three rows, we find that aggregating attributes into one conversation achieves better performance. Considering the mutual influence between different levels of attributes, we further conduct an ablation study on the order of levels using the “Mix” mode. Predicting high-level attributes first gives similar results to predicting low-level attributes first, and we choose the former by default. Finally, it can be observed that the “Mix” mode, which combines data from different training granularities, yields the best performance.

为满足多样化需求并提升细粒度属性提取能力,我们将细粒度属性的训练粒度划分为"属性"、"层级"和"混合"三种模式(详见第3.3节)。消融实验结果如 表2 所示。观察前三行数据可知,将属性聚合到对话中能获得更优性能。考虑到不同层级属性间的相互影响,我们进一步采用"混合"模式对层级顺序进行消融实验。先预测高层级属性与先预测低层级属性结果相近,默认采用前者方案。最终可见,整合不同训练粒度数据的"混合"模式表现最佳。

4.5. Directly Predict Numerical Scores by MLLMs

4.5 通过大语言模型直接预测数值分数

In this section, we conduct a series of in-depth investigations into how to effectively predict numerical scores using MLLMs. To this end, a) first, we test the zero-shot capability for numerical sorting of common 7B-sized MLLMs. b) Second, we perform an ablation study on the number of digits that should be predicted in the next token paradigm. c) Third, we monitor NCM and its variant $\mathrm{NCM^{*}}$ during training to determine whether MLLMs memorize the tokens at different positions or understand them as a whole.

在本节中,我们针对如何利用大语言模型 (MLLM) 有效预测数值分数展开了一系列深入研究。具体包括:
a) 首先,测试常见7B规模大语言模型在数值排序任务中的零样本能力;
b) 其次,对下一token预测范式下应预测的数字位数进行消融实验;
c) 最后,通过监测训练过程中NCM及其变体 $\mathbf{NCM^{*}}$ 的表现,验证大语言模型是记忆不同位置的token还是整体理解其语义。

Zero-shot Capability for Numerical Sorting. We conduct a toy example to test the ability of popular open-sourced MLLMs. Specifically, we randomly generate 10 decimals with at most two decimal places, ranging from 1 to 10, and then prompt MLLMs to sort them from low to high. We repeat the experiment 200 times and show the average results in Tab.3. We take 3 metrics: accuracy, recall and hallucination. Accuracy represents the accuracy of the predicted sequence compared to the GT sorted sequence. Recall indicates the proportion of the predicted sequence that uses the numbers to be sorted. Hallucination indicates the proportion of the predicted sequence that uses numbers that are not in the sequence to be sorted. Interestingly, MLLMs perform quite differently on numerical sorting. Qwen2-VL-7B achieves the best results. However, basic models with less training data, such as LLaVA-1.5-7B, perform poorly on the numerical sorting task and are almost guessing. The base model of Q-Align (i.e., mPLUG-Owl2-7B) also obtains poor results.

零样本 (Zero-shot) 数值排序能力。我们设计了一个简单实验来测试主流开源多模态大语言模型 (MLLM) 的表现。具体而言,我们随机生成10个1到10范围内、最多保留2位有效数字的小数,并要求MLLM将其按从小到大排序。实验重复200次后,平均结果如 表3 所示。我们采用准确率 (accuracy)、召回率 (recall) 和幻觉率 (hallucination) 三项指标:准确率表示预测序列与标准排序序列的匹配程度;召回率反映预测序列中使用待排序数字的比例;幻觉率则衡量预测序列中出现非待排序数字的比例。有趣的是,不同MLLM在数值排序任务上表现差异显著。Qwen2-VL-7B取得了最佳效果,而训练数据较少的基线模型(如LLaVA-1.5-7B)表现接近随机猜测水平。Q-Align的基线模型(即mPLUGOwl2-7B)同样表现欠佳。

Table 4: Ablation of the number of digits $M$ trained with the next token prediction paradigm on the AVA and KonIQ-10k datasets. Predicting two extra digits significantly improves SRCC and PLCC on the AVA dataset.

表 4: 数字位数 $M$ 在 AVA 和 KonIQ-10k 数据集上采用下一token预测范式的消融实验。预测额外两位数字显著提升了 AVA 数据集上的 SRCC 和 PLCC。

| 数字位数 M | AVA SRCC | AVA PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC |
| --- | --- | --- | --- | --- |
| M=1 | 0.712 | 0.705 | 0.940 | 0.950 |
| M=2 | 0.806 (+0.094) | 0.805 (+0.100) | 0.938 (-0.002) | 0.948 (-0.002) |
| M=3 | 0.808 (+0.096) | 0.806 (+0.101) | 0.939 (-0.001) | 0.949 (-0.001) |

Ablation of the Number of Predicted Digits. Intuitively, we believe that the larger the $M$ in the GT numerical score $S_{GT}$, the better the effect. This is because the numerical scores (i.e., MOS) are fine-grained labels. When the numerical score is used as a label, the quantization error is small. For example, when $M{=}3$ for the GT numerical score $S_{GT}$, the quantization error is only $0.5%$ of that of the text-defined labels in Q-Align. As shown in Tab.4, predicting one more decimal place raises the SRCC on the AVA dataset by 0.094 compared with directly predicting the integer. Adding yet another decimal place improves the effect only slightly. Thus, we use $M{=}3$ by default in the experiments.

预测数字位数的消融实验。直观上,我们认为GT数值分数$S_{GT}$中的$M$越大效果越好,因为数值分数(即MOS)是细粒度标签。当数值分数作为标签时,量化误差较小。例如在QAlign中,当GT数值分数$S_{GT}$的$M{=}3$时,量化误差仅为文本定义标注的$0.5%$。如表4所示,若直接多预测一位小数,AVA数据集上的SRCC比直接预测整数高出0.094;当再增加一位小数时,效果提升幅度很小。因此实验中默认采用$M{=}3$。

NCM and Its Variant $\mathrm{NCM^{*}}$ During Training. In this paragraph, we show two interesting findings. Finding 1: The numerical ambiguity of the cross-entropy occurs when training a naive model. As shown in Fig.5(a), training Qwen2-VL-7B from scratch leads to a substantial reduction of the cross-entropy $\mathcal{L}_{ce}$. Nevertheless, the NCM and $\mathrm{NCM^{*}}$ do not converge and remain unstable. In other words, as shown in Tab.6, the convergence ratio of $\mathcal{L}_{ce}$ is $-91.37%$, while the convergence ratio of the NCM is just $+6.02%$, i.e., not converged at all. This verifies that the naive model tends to ignore the intrinsic relationship between numerical scores and only memorizes the tokens at different positions. Finding 2: Well-trained model regularization avoids numerical ambiguity. As shown in Fig.5(b) and Tab.6, during the LoRA FT, NCM, $\mathrm{NCM^{*}}$ and $\mathcal{L}_{ce}$ converge together. In summary, MLLMs trained extensively on large-scale data have adequate intrinsic capabilities to accurately predict numerical scores with only the next token prediction paradigm, which indicates that MLLMs understand numerical scores as a whole.

NCM及其变体$\mathrm{NCM^{*}}$在训练过程中。本段展示两个有趣的发现。发现一:交叉熵的数值歧义性出现在训练朴素模型时。如图5(a)所示,从头训练Qwen2-VL-7B会导致交叉熵$\mathcal{L}_{ce}$显著下降,但NCM和$\mathrm{NCM^{*}}$始终无法收敛且保持不稳定。具体而言,如表6所示,$\mathcal{L}_{ce}$的收敛率为$-91.37%$,而NCM的收敛率仅为$+6.02%$,完全未达到收敛。这验证了朴素模型倾向于忽略数值分数间的内在关联,仅记忆不同位置的token。发现二:经过良好训练的模型正则化可避免数值歧义性。如图5(b)和表6所示,在LoRA微调阶段,NCM、$\mathrm{NCM^{*}}$与$\mathcal{L}_{ce}$能同步收敛。综上表明,通过海量数据充分训练的多模态大语言模型(MLLMs)具备仅凭下一token预测范式即可准确预测数值分数的内在能力,说明MLLMs将数值分数作为整体理解。

Table 5: Comparison between the next token prediction (NTP) paradigm and regression method for Qwen2-VL-7B to predict numerical scores directly using MLLMs.

表 5: Qwen2-VL-7B 使用大语言模型直接预测数值分数时,下一词预测 (NTP) 范式与回归方法的对比

| 方法 | 训练轮数 | SRCC | PLCC |
| --- | --- | --- | --- |
| 回归 | 2 | 0.882 | 0.890 |
| 回归 | 6 | 0.908 (+0.026) | 0.923 (+0.033) |
| NTP | 2 | 0.939 (+0.057) | 0.949 (+0.059) |

4.6. Comparison with Regression Method

4.6. 与回归方法的对比

Another way to use MLLMs to directly predict numerical scores is regression. For the regression method, the numerical score is regressed through a linear layer; this approach appears in AIGC video assessment [16]. Referring to Eq.1, the last token of the question text contains the entire input information during training, including the image and the texts. We extract the logits of the last token, directly map them to one dimension through a linear layer, normalize the one-dimensional feature with the sigmoid function, and supervise it with MSE. As shown in Tab.5, we compare the next token prediction paradigm with regression. In the same number of epochs, the NTP outperforms the regression method. With more training steps, the performance of the regression method improves, but it is still not as good as the NTP. We infer that the regression method needs to re-establish the inner distribution to adjust to a different objective, so it takes a longer training time to improve further. Furthermore, the NTP offers greater advantages by comparison, since the output numerical scores can be used as the context of MLLMs to support other tasks and have a wider range of application scenarios.

另一种直接使用多模态大语言模型 (MLLM) 预测数值分数的方法是回归。对于回归方法,数值分数可以通过线性层进行回归预测。该方法已应用于AIGC视频评估领域 [16]。如公式1所示,在训练过程中,问题文本的最后一个token包含了全部输入信息(包括图像和文本)。我们提取该token的logits,通过线性层直接映射为一维特征,最后用sigmoid函数归一化处理,并使用均方误差 (MSE) 作为监督信号。如表5所示,我们将下一token预测 (NTP) 范式与回归方法进行了对比:在相同训练周期下,NTP表现优于回归方法;即使增加训练步数,回归方法的性能仍不及NTP。我们推测回归方法需要重新建立内部分布以适应不同目标,因此需要更长的训练时间来实现性能提升。此外,NTP具有更显著的优势——其输出的数值分数可作为MLLM的上下文支持其他任务,拥有更广泛的应用场景。
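A sketch of the regression baseline as we read it (names, the hidden size, and the 0-10 rescaling are assumptions): map the final question token's representation to one dimension, squash with a sigmoid, and supervise with MSE.

按我们的理解给出回归基线的示例(命名、隐藏维度与0-10缩放均为假设):将问题最后一个token的表征线性映射到一维,经sigmoid压缩后用MSE监督。

```python
# A sketch of the regression head described in the text (our reconstruction).
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    def __init__(self, hidden_size: int = 3584):   # Qwen2-VL-7B hidden size
        super().__init__()
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, last_token: torch.Tensor) -> torch.Tensor:
        # last_token: (batch, hidden) representation of the final question token
        return torch.sigmoid(self.fc(last_token)).squeeze(-1) * 10.0  # scale assumed

head = ScoreHead()
h = torch.randn(4, 3584)
loss = nn.functional.mse_loss(head(h), torch.tensor([3.98, 7.2, 5.5, 8.9]))
```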

4.7. Quantitative results

4.7. 定量结果

Result on IQA. For the IQA task, we train the model on KonIQ-10k [18] and cross-evaluate the performance on the test sets of SPAQ [11] and LIVE Challenge [12]. As shown in Tab.7, the proposed method consistently outperforms previous methods across all datasets. The NTP achieves SOTA performance, especially in the cross-dataset scenario. With the CoT training, the proposed method improves the SRCC/PLCC by $0.9%/1.0%$ on $\mathrm{KonIQ_{\mathrm{test}}}$. In addition, the proposed method also boosts cross-dataset performance. For example, our method outperforms Q-Align by achieving a $4.5%$ higher PLCC on the cross-domain LIVE Challenge dataset.

IQA任务结果。在IQA任务中,我们在KonIQ-10k [18]上训练模型,并在SPAQ [11]和LIVE Challenge [12]的测试集上进行交叉评估。如表7所示,所提方法在所有数据集上均优于先前方法。NTP实现了SOTA性能,尤其在跨数据集场景下。通过CoT训练,该方法在$\mathrm{KonIQ_{\mathrm{test}}}$上将SRCC/PLCC提升了$0.9%/1.0%$。此外,该方法还提升了跨数据集性能。例如,在跨域LIVE Challenge数据集上,我们的方法以PLCC高出$4.5%$的优势超越Q-Align。


Figure 5: NCM, $\mathrm{NCM^{*}}$ and $\mathcal{L}_{ce}$ in the different training stages with the supervision of $\mathcal{L}_{ce}$. 1) During the different training stages, $\mathcal{L}_{ce}$ converges. 2) Only during LoRA fine-tuning, the NCM and $\mathrm{NCM^{*}}$ also converge.

图 5: 在不同训练阶段中,由 $\mathcal{L}_{ce}$ 监督的 NCM、$\mathrm{NCM^{*}}$ 和 $\mathcal{L}_{ce}$ 的表现。1) 在不同训练阶段,$\mathcal{L}_{ce}$ 均收敛。2) 仅在 LoRA (Low-Rank Adaptation) 微调期间,NCM 和 $\mathrm{NCM^{*}}$ 也随之收敛。

| 阶段 | 训练方法 | NCM | NCM* | $\mathcal{L}_{ce}$ |
| --- | --- | --- | --- | --- |
| 前100次迭代 | 从头全量微调 (Full FT) | 2.16 | 3.19 | 5.91 |
| 前100次迭代 | LoRA微调 | 1.43 | 1.96 | 1.42 |
| 最后100次迭代 | 从头全量微调 (Full FT) | 2.29 | 3.08 | 0.51 |
| 最后100次迭代 | LoRA微调 | 0.24 | 0.71 | 0.38 |
| 收敛比率 ((最后-最初)/最初) | 从头全量微调 (Full FT) | +6.02% | -3.45% | -91.37% |
| 收敛比率 ((最后-最初)/最初) | LoRA微调 | -83.22% | -63.78% | -73.24% |

Table 6: The average of the NCM, $\mathrm{NCM^{*}}$ and $\mathcal{L}_{c e}$ in the different training stages. FT denotes fine-tuning.

表 6: 不同训练阶段中 NCM、$\mathrm{NCM^{*}}$ 和 $\mathcal{L}_{c e}$ 的平均值。FT 表示微调 (fine-tuning)。

Table 7: Results on the IQA datasets trained with the KonIQ-10k dataset. The cross-set evaluations are labeled with ∗. † denotes that we mix the training data organized by CoT.

表 7: 使用KonIQ-10k数据集训练后在IQA数据集上的结果。带∗标记的为跨数据集评估结果。†表示我们混合了由CoT组织的训练数据。

| 方法 | KonIQtest SRCC | KonIQtest PLCC | SPAQtest∗ SRCC | SPAQtest∗ PLCC | LIVE Challengetest∗ SRCC | LIVE Challengetest∗ PLCC |
| --- | --- | --- | --- | --- | --- | --- |
| HyperIQA (CVPR 2020) [36] | 0.906 | 0.917 | 0.788 | 0.791 | 0.749 | 0.772 |
| CLIP-IQA+ (AAAI 2023) [38] | 0.895 | 0.909 | 0.864 | 0.866 | 0.805 | 0.832 |
| LIQE (CVPR 2023) [57] | 0.928 | 0.912 | 0.833 | 0.846 | 0.870 | 0.830 |
| DSMix (ECCV 2024) [34] | 0.915 | 0.925 | - | - | 0.791 | - |
| Q-Align (ICML 2024) [44] | 0.940 | 0.941 | 0.887 | 0.886 | 0.860 | 0.853 |
| LoDa (CVPR 2024) [46] | 0.932 | 0.944 | - | - | 0.811 | - |
| QCN (CVPR 2024) [35] | 0.934 | 0.945 | - | - | 0.840 | - |
| Ours | 0.939 | 0.949 | 0.878 | 0.875 | 0.878 | 0.883 |
| Ours† | 0.948 | 0.959 | 0.904 | 0.906 | 0.885 | 0.898 |

Table 8: Results on the IAA dataset trained with the AVA dataset. The cross-set evaluations are labeled with ∗. † denotes that we mix the training data organized by CoT.

表 8: 使用AVA数据集训练的IAA数据集结果。带∗标记的为跨数据集评估结果。†表示我们混合了由CoT组织的训练数据。

AVAtest:

| 方法 | SRCC | PLCC |
| --- | --- | --- |
| HLA-GCN (CVPR 2021) [30] | 0.665 | 0.687 |
| MUSIQ (ICCV 2021) [23] | 0.726 | 0.738 |
| TANet (IJCAI 2022) [15] | 0.758 | 0.765 |
| VILA (CVPR 2023) [24] | 0.774 | 0.774 |
| Yun et al. (ECCV 2024) [54] | 0.808 | 0.804 |
| Q-Align (ICML 2024) [44] | 0.822 | 0.817 |
| Ours | 0.809 | 0.807 |
| Ours† | 0.828 | 0.825 |

TAD66Ktest∗:

| 方法 | SRCC | PLCC |
| --- | --- | --- |
| Q-Instruct [43] | 0.137 | 0.160 |
| mPLUG-Owl2 [49] | 0.215 | 0.198 |
| VILA [24] | 0.350 | 0.372 |
| UNIAA [59] | 0.411 | 0.425 |
| Ours | 0.410 | 0.440 |
| Ours† | 0.413 | 0.444 |

Result on IAA. For the IAA task, we show the comparison with the previous methods in Tab.8. The NTP has competitive performance on the AVA dataset. With the CoT training, the proposed method improves the SRCC/PLCC by $1.9%/1.8%$. For the cross-dataset TAD66K, UNIAA trains on 5 datasets including AVA, while the proposed method only trains on AVA. The proposed method also achieves a $1.9%$ higher PLCC compared with UNIAA.

IAA任务结果。对于IAA任务,我们在表8中展示了与先前方法的对比。NTP在AVA数据集上表现出竞争力。通过CoT训练,所提方法将SRCC/PLCC提升了1.9%/1.8%。在跨数据集TAD66K上,UNIAA使用包括AVA在内的5个数据集进行训练,而本方法仅使用AVA训练。与UNIAA相比,所提方法的PLCC也高出1.9%。

Result on VQA. For the VQA task, we first test the capability of Qwen2-VL-7B itself. On the KoNViD [17] dataset, even after prompt engineering, the SRCC of the original Qwen2-VL-7B is only 0.305. Second, considering Qwen2-VL-7B employs 3D convolution to extract image and video features, a natural idea is to apply the image-pretrained model to the video task. To this end, based on the model trained on the IQA dataset (Koniq-10k), we directly test the performance on the VQA dataset (KoNViD). In the implementation, we replace "image" with "video" in the prompt. As shown in Tab.9, the score generated by CoT can boost the SRCC by $36.4%$. It demonstrates that the proposed method has strong generalization for VQA.

VQA任务结果。在VQA任务中,我们首先测试了Qwen2-VL-7B自身的能力。在KoNViD[17]数据集上,即使经过提示工程优化,原始Qwen2-VL-7B的SRCC仅为0.305。其次,考虑到Qwen2-VL-7B采用3D卷积提取图像和视频特征,我们自然想到将图像预训练模型应用于视频任务。为此,基于在IQA数据集(Koniq-10k)上训练的模型,我们直接测试了其在VQA数据集(KoNViD)上的表现。具体实现中,我们将提示词中的"image"替换为"video"。如表9所示,通过思维链(CoT)生成的分数可将SRCC提升$36.4%$,这表明所提方法对VQA任务具有强泛化能力。

Table 9: Zero-shot ability on the KoNViD dataset after training with the IQA dataset, Koniq-10k. Note that the training process involves no video data.

表 9: 使用 IQA 数据集 Koniq-10k 训练后在 KoNViD 数据集上的零样本能力。注意训练过程中没有视频数据。

| 基线 | CoT训练 | SRCC |
| --- | --- | --- |
| ✓ | | 0.305 |
| ✓ | ✓ | 0.669 (+0.364) |

Table 10: Results on the RealQA dataset. † denotes that we mix the training data organized by CoT.

表 10: RealQA数据集上的结果。†表示我们混合了由CoT组织的训练数据。

| 指标 | Q-Align | Ours | Ours† |
| --- | --- | --- | --- |
| PLCC | 0.716 | 0.792 | 0.802 |
| SRCC | 0.741 | 0.730 | 0.746 |

Results on RealQA. On the RealQA dataset, we follow the division of the pre-defined textual labels for Q-Align to split the composite scores. Then, we train Q-Align and evaluate it using its evaluation method. As shown in Tab.10, the proposed method still outperforms Q-Align, proving that MLLMs can directly predict numerical scores effectively.

RealQA实验结果。在RealQA数据集上,我们沿用Q-Align预定义的文本标签划分方式来拆分综合分数。随后训练Q-Align模型并采用其评估方法进行测试。如表10所示,本方法依然优于Q-Align,证明多模态大语言模型(MLLM)能够直接有效地预测数值评分。

5. Conclusion

5. 结论

In this paper, we introduce RealQA, a novel dataset of 14,715 real-world images, each of which is annotated with 10 fine-grained attributes. Besides, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next token paradigm can achieve SOTA performance. With the help of CoT combined with the fine-grained attributes, the proposed method can outperform SOTA methods on five public datasets for IQA and IAA with superior interpretability and show strong zero-shot generalization for VQA. We sincerely hope the proposed method can inspire future quality and aesthetic scoring methods based on MLLMs.

本文介绍了RealQA数据集,包含14,715张真实世界图像,每张图像标注了10个细粒度属性。此外,我们深入系统地研究了如何利用大语言模型有效预测数值分数。令人惊讶的是,仅需预测额外两位有效数字,next token范式即可实现SOTA性能。通过将思维链(CoT)与细粒度属性相结合,所提方法在五个公开的IQA和IAA数据集上超越SOTA方法,兼具优异的可解释性,并在VQA任务中展现出强大的零样本泛化能力。我们期望该方法能为基于大语言模型的质量与美学评分研究提供新思路。

References

参考文献

Appendix A. Details of Toy Example

附录 A. 玩具示例细节

The toy example aims to judge whether MLLMs can sort numerical scores. Therefore, we randomly generate 10 decimals 200 times, each with at most two decimal places (e.g., 3, 3.8, 3.99), and calculate the metrics on average. We take 3 metrics: accuracy, recall and hallucination. Accuracy represents the accuracy of the predicted sequence compared to the GT sorted sequence. Recall indicates the proportion of the predicted sequence that uses the numbers to be sorted. Hallucination indicates the proportion of the predicted sequence that uses numbers that are not in the sequence to be sorted. The implementation of these metrics is shown in Algorithm 1. Tab.1 shows some examples of the numerical sorting experiment. It can be seen that Qwen2-VL-7B can sort the numbers correctly regardless of whether there are repeated numbers or numbers with different significant digits. Nevertheless, Qwen2-VL-7B exhibits a few instances of errors, such as the hallucinated red numbers in the final line.

该玩具示例旨在判断多模态大语言模型 (MLLMs) 能否对数值分数进行排序。为此,我们随机生成10个最多保留2位有效数字的小数(例如3、3.8、3.99),重复200次并计算平均指标。我们采用准确率、召回率和幻觉率三个指标:准确率表示预测序列与真实排序序列的匹配程度;召回率反映预测序列中使用待排序数字的比例;幻觉率则衡量预测序列中出现非待排序数字的比例。具体指标实现如算法1所示。如表1所示,数值排序实验的部分案例表明,Qwen2-VL-7B无论面对重复数字还是不同有效位数的数字都能正确排序,但在最后一行仍出现了幻觉生成的红色数字等少量错误案例。

Algorithm 1 Numerical Sorting Metrics

算法 1: 数值排序指标

Appendix B. More Results of RealQA

附录 B. RealQA 更多结果

B.1. Discrepancy with IQA/IAA

B.1. 与IQA/IAA的差异

As shown in Tab.2, we evaluate the SOTA methods trained on the KonIQ-10k and AVA dataset separately on the test set of the RealQA dataset. The proposed method demonstrates better generalization capability than Q-Align, achieving a $2%$ higher SRCC when trained on the IQA dataset and tested on the RealQA dataset. Furthermore, if training on the RealQA dataset, the SRCC/PLCC increases significantly. It verifies that there is a discrepancy between IQA/IAA tasks and UGC image assessment.

如表 2 所示,我们在 RealQA 数据集的测试集上分别评估了在 KonIQ-10k 和 AVA 数据集上训练的 SOTA 方法。所提出的方法展现出比 Q-Align 更好的泛化能力,当在 IQA 数据集上训练并在 RealQA 数据集上测试时,SRCC 提高了 2%。此外,如果在 RealQA 数据集上训练,SRCC/PLCC 会显著提升。这验证了 IQA/IAA 任务与 UGC 图像评估之间存在差异。

B.2. Annotation Granularity

B.2. 标注粒度

As shown in Tab.3, we show the annotation granularity of the fine-grained attributes. These fine-grained attributes include the composition score and the eye-catching score, judged against natural human language principles and scored from 1 to 10. Background clutter is assessed across three levels: cluttered, moderately cluttered, and uncluttered. Subject integrity is evaluated as incomplete, partially complete, or fully complete. The presence of a level shot (yes/no), image clarity (very blurry, moderately blurry, moderately clear, very clear), exposure (overexposed, slightly overexposed, properly exposed, slightly underexposed, underexposed), and saturation (ultra-low, low, medium, high, ultra-high) are also annotated. The fine-grained attributes provide a detailed understanding of images and assist the image quality and aesthetics assessment in a more fine-grained manner, offering interpretable explanations using natural language.

如表3所示,我们展示了细粒度属性的标注粒度。这些细粒度属性包括构图分数和吸睛分数,依据自然人类语言原则进行评判,评分范围为1至10分。背景杂乱度分为三个等级:杂乱、中等杂乱、不杂乱。主体完整性评估为不完整、部分完整或完全完整。同时标注了水平拍摄情况(是/否)、图像清晰度(非常模糊、中等模糊、中等清晰、非常清晰)、曝光(过曝、轻微过曝、正常曝光、轻微欠曝、欠曝)以及饱和度(极低、低、中、高、极高)。细粒度属性为图像提供了详细理解,并以更精细的方式辅助图像质量和美学评估,通过自然语言提供可解释性说明。

B.3. Conversations for Attributes Training

B.3. 属性训练对话

Fig.1 lists the specific conversation templates used in the fine-grained attributes training.

图1列出了细粒度属性训练中使用的具体对话模板。

B.4. Limitations

B.4. 局限性

Despite data cleaning and corrections of the MLLM outputs, we find that the proposed RealQA dataset still contains ambiguous answers, for example, judging whether an image is horizontally shot. Intuitively, this task is not difficult. However, Fig.2 presents four distinct examples showing that judging whether an image is horizontally shot requires in-depth thinking about the shooting posture and the relative relationship of the objects in the image to get the correct answer. In addition, we focus on common and general attributes in this work and do not separate out artistic attributes. This is because artistic attributes require a certain understanding of the layout and color of the image, and even the historical stories and current politics in the image, and such data are difficult to collect and annotate. We will continue to explore the above issues in future work.

尽管对多模态大语言模型 (MLLM) 的结果进行了数据清洗和校正,我们发现提出的 RealQA 数据集仍存在答案模糊的情况。例如,判断一张图片是否为横拍。直觉上这个任务并不困难。但如图 2 所示,其中展示了四个不同示例。可以观察到,判断图片是否为横拍需要深入思考该图片的拍摄姿态以及画面中物体的相对关系才能得出正确答案。此外,本研究重点关注常见且通用的属性,并未单独划分艺术类属性。这是因为艺术类属性需要对画面的构图、色彩,甚至其中蕴含的历史故事和时事政治有一定理解,而这些数据难以收集和标注。我们将在未来工作中持续探索上述问题。

Appendix C. Discussion

附录 C. 讨论

C.1. Numerical Scoring for MLLMs

C.1. 多模态大语言模型的数值评分

Recently, there have been several hot discussions in the community: 13.11 and 13.8, which one is larger?

最近社区里有几个热门讨论:13.11和13.8,哪个更大?

| 提示 | 回答 |
| --- | --- |
| 将这些数字按从低到高排序并直接返回结果!数字:[5.9, 4.0, 8.88, 5.52, 3.5, 4.8, 9.2, 9.375, 7.26, 7.8]。返回格式:[数字1, 数字2, ..., 数字n] | [3.5, 4.0, 4.8, 5.52, 5.9, 7.26, 7.8, 8.88, 9.2, 9.375] |
| 将这些数字按从低到高排序并直接返回结果!数字:[5.0, 9.1, 8.6, 1.5, 5.345, 2.0, 7.7, 8.929, 6.0, 1.48]。返回格式:[数字1, 数字2, ..., 数字n] | [1.48, 1.5, 2.0, 5.0, 5.345, 6.0, 7.7, 8.6, 8.929, 9.1] |
| 将这些数字按从低到高排序并直接返回结果!数字:[9.0, 2.0, 2.0, 8.0, 1.51, 8.0, 3.6, 9.72, 6.0, 8.3]。返回格式:[数字1, 数字2, ..., 数字n] | [1.51, 2.0, 2.0, 3.6, 6.0, 8.0, 8.0, 8.3, 8.72, 9.0, 9.72](其中 8.72 不在待排序数字中,即原文以红色标出的幻觉数字) |

Table 1: Examples of the numerical sorting experiment. It can be seen that Qwen2-VL-7B can sort the numbers correctly regardless of whether there are repeated numbers or numbers with different significant digits. Nevertheless, Qwen2-VL-7B exhibits a few instances of errors, such as hallucinated red numbers in the final line.

表 1: 数值排序实验示例。可以看出,无论是否存在重复数字或不同有效位数的数字,Qwen2-VL-7B 都能正确排序。不过 Qwen2-VL-7B 仍存在少量错误案例,例如最后一行中出现了幻觉生成的红色数字。

Figure 1: Conversations templates for the fine-grained attributes training, where the answers are omitted for ease of viewing. “Attributes” denotes that we predict an attribute in one conversation. “Levels” denotes that we predict certain level attributes in one conversation. “Mix” denotes that we predict all attributes in one conversation.

图 1: 用于细粒度属性训练的对话模板,为便于查看省略了回答内容。"Attributes"表示我们在一次对话中预测一个属性。"Levels"表示我们在一次对话中预测特定层级的属性。"Mix"表示我们在一次对话中预测所有属性。

- Mix:
  - 用户:"请分析并直接给出这张图片的高层级、中层级和低层级美学属性"
- Levels:
  - 用户:"简要评价这张图像的高层级属性"
  - 用户:"简要评价这张图像的中层级属性"
  - 用户:"简要评价这张图像的低层级属性"
- Attributes:
  - 构图分数/吸睛分数 — 用户:"请分析这张图像的构图分数(1-10分)并给出理由"
  - 主体完整性/主体杂乱度/背景杂乱度 — 用户:"主体是什么?主体是否完整且定义明确?";用户:"主体是什么?是否显得杂乱?";用户:"背景是什么?是否显得杂乱?"
  - 水平拍摄
  - 清晰度 — 用户:"清晰度如何?"
  - 曝光 — 用户:"曝光情况如何?"
  - 饱和度 — 用户:"饱和度如何?"
| 方法 | KonIQtrain SRCC | KonIQtrain PLCC | AVAtrain SRCC | AVAtrain PLCC | RealQAtrain SRCC | RealQAtrain PLCC |
| --- | --- | --- | --- | --- | --- | --- |
| Q-Align | 0.577 | 0.576 | 0.519 | 0.568 | 0.741 | 0.716 |
| Ours† | 0.597 | 0.604 | 0.521 | 0.564 | 0.746 | 0.802 |

Table 2: Zero-shot evaluation of the IQA (i.e., Koniq-10k) and IAA (i.e., AVA) methods on the RealQA dataset. † denotes that we mix the training data organized by CoT.

表 2: 在 RealQA 数据集上对 IQA (即 Koniq-10k) 和 IAA (即 AVA) 方法进行的零样本评估。† 表示我们混合了由 CoT 组织的训练数据。

Even though LLMs are capable of solving Olympiad-level mathematical problems, they still commit errors when tackling similar types of questions. Please note that as of March 7, 2025, the community has not reached a consensus on why this phenomenon occurs. In this paper, we conduct numerical sorting on common 7B-sized MLLMs, and the results are surprising. Although these MLLMs perform well on more complex tasks, such as VQA and GQA, their capabilities vary greatly on simple numerical sorting. The results are similar to the phenomenon discussed in the community. Although the underlying principles and explanations are beyond the scope of this paper, the proposed method can provide support for subsequent MLLMs to perform quality and aesthetic scoring tasks: it is completely feasible to predict numerical scores directly with well-trained MLLMs.

尽管大语言模型能够解决奥林匹克级别的数学问题,但在处理同类题型时仍会出现错误。截至2025年3月7日,学界尚未就该现象成因达成共识。本文对常见的7B规模多模态大语言模型(MLLM)进行数值排序测试,结果令人惊讶:这些模型在VQA、GQA等复杂任务上表现优异,却在简单数值排序任务上能力参差不齐。该结果与社区讨论的现象高度相似。虽然其底层原理和解释超出本文研究范围,但所提方法能为后续MLLM执行质量评估与美学评分任务提供支持:直接在训练良好的MLLM上预测数值评分完全可行。

Table 3: Annotation granularity of the fine-grained attributes.

表 3: 细粒度属性的标注粒度。

| 属性 | 选项 |
| --- | --- |
| 构图/吸睛原因 | 自然人类语言 |
| 构图/吸睛分数 | 1-10 |
| 主体/背景杂乱度 | 杂乱;中等杂乱;无杂乱 |
| 主体完整性 | 不完整 (大部分被截断或遮挡);部分完整 (轻微截断或轻微遮挡);完全完整 (无任何截断或遮挡) |
| 水平拍摄 | 是;否 |
| 图像清晰度 | 非常模糊;中等模糊;中等清晰;非常清晰 |
| 曝光 | 过曝;轻微过曝;曝光正常;轻微欠曝;欠曝 |
| 饱和度 | 超低饱和度;低饱和度;中等饱和度;高饱和度;超高饱和度 |

C.2. NCM for Qwen2-VL

C.2. Qwen2-VL 的 NCM

MLLMs employ a variety of tokenization methods, among which Byte Pair Encoding (BPE) is widely adopted. Even when the same BPE algorithm is applied, different MLLMs can produce varying tokenization results. This discrepancy arises from factors such as the BPE training corpus, vocabulary size, and other influencing elements. In this paper, we design NCM and its variant $\mathrm{NCM^{*}}$ for the tokenizer of Qwen2-VL, and monitor the ability of MLLMs to understand numbers during training for numerical prediction. We can thereby verify whether MLLMs memorize the tokens at different positions or understand them as a whole. For different MLLMs, the organization of numbers will change with their tokenization methods. For example, "3.99" may be split into "3", ".9", "9", which does not occur in Qwen2-VL. The designed NCM and its variant $\mathrm{NCM^{*}}$ then need to adapt to the new tokenization, which means that Eq. 3 and Eq. 6 in the main paper for calculating the mathematical expectation need to be adjusted accordingly. However, we emphasize that despite such changes in implementation details, the core ideas and theoretical basis of the proposed NCM and its variant $\mathrm{NCM^{*}}$ remain unchanged.

多模态大语言模型 (MLLM) 采用了多种token化方法,其中字节对编码 (BPE) 是一种被广泛采用的常见方法。即使应用相同的BPE算法,不同的多模态大语言模型也可能产生不同的token化结果。这种差异源于BPE训练语料库、词汇量大小等因素的影响。本文针对Qwen2-VL的tokenizer设计了NCM及其变体 $\mathrm{NCM^{*}}$,并在数值预测训练过程中监测多模态大语言模型对数字的理解能力。我们可以验证多模态大语言模型是记忆了不同位置的token,还是将它们作为一个整体来理解。对于不同的多模态大语言模型,数字的组织方式会因其token化方法的不同而变化。例如,"3.99"可能被分割为"3"、".9"、"9",而这种情况在Qwen2-VL中并不会出现。设计的NCM及其变体 $\mathrm{NCM^{*}}$ 需要适应新的token化方式,这意味着主论文中用于计算数学期望的公式3和公式6需要进行相应调整。然而,我们强调尽管实现细节有所变化,所提出的NCM及其变体 $\mathrm{NCM^{*}}$ 的核心思想和理论基础保持不变。
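A quick check of how a given tokenizer splits a numerical score, which determines how the digit positions in NCM/NCM* must be indexed; the checkpoint name refers to the public Qwen2-VL release and should be adapted to the model at hand.

快速检查某个tokenizer如何切分数值分数,这决定了NCM/NCM*中数字位置的索引方式;权重名取自公开的Qwen2-VL发布版,应按实际模型调整。

```python
# Inspect the tokenization of a score string (checkpoint name assumed).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
ids = tok("3.99", add_special_tokens=False).input_ids
print([tok.decode([i]) for i in ids])   # Qwen2-VL splits digit-by-digit: ['3', '.', '9', '9']
```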


Figure 2: Visualization of real-world level shots. Note that all image annotations are non-level shots. (a) It is not possible to clearly determine whether the shot is level. (b) The photo is taken from a bird’s-eye view, and the dining table is tilted, making it difficult to tell whether the photo is taken level. (c) Non-level shots. (d) For entertainment facilities captured from a low angle, it is impossible to determine whether the shot is level due to the perspective.

图 2: 现实世界水平镜头的可视化。注意所有图像标注均为非水平镜头。(a) 无法明确判断拍摄是否水平。(b) 照片采用鸟瞰视角拍摄,且餐桌存在倾斜,难以判断拍摄是否水平。(c) 非水平镜头。(d) 从低角度拍摄的娱乐设施,由于透视关系无法判断拍摄是否水平。

Appendix D. Visualization Results

附录 D. 可视化结果

As shown in Fig. 3 to Fig. 8, we compare the visualization results between the SOTA method (Q-Align) trained on the AVA dataset and the proposed method trained on the RealQA dataset. For images taken by daily users, the proposed method will have better ranking results of quality and aesthetics.

如图 3 至图 8 所示,我们比较了在 AVA 数据集上训练的 SOTA 方法 (Q-Align) 与在 RealQA 数据集上训练的所提方法的可视化结果。对于日常用户拍摄的图像,所提方法在质量和美学排名上会取得更好的结果。

Figure 3: Real-world visualization results. We scale the predicted scores of Q-Align to a range of 1-10 for comparison.

图 3: 真实场景可视化结果。我们将 Q-Align 的预测分数缩放至 1-10 范围以便对比。

Figure 4: Real-world visualization results. We scale the predicted scores of Q-Align to a range of 1-10 for comparison.

图 4: 真实场景可视化结果。我们将 Q-Align 的预测分数缩放至 1-10 范围以便比较。

Figure 5: Real-world visualization results. We scale the predicted scores of Q-Align to a range of 1-10 for comparison.

图 5: 真实场景可视化结果。我们将 Q-Align 的预测分数缩放至 1-10 范围以便比较。

Figure 6: Real-world visualization results. We scale the predicted scores of Q-Align to a range of 1-10 for comparison.

图 6: 真实场景可视化结果。我们将 Q-Align 的预测分数缩放至 1-10 范围以便对比。

Figure 7: Real-world visualization results. We scale the predicted scores of Q-Align to a range of 1-10 for comparison.

图 7: 真实场景可视化结果。我们将 Q-Align 的预测分数缩放至 1-10 范围以便比较。

Figure 8: Real-world visualization results. We scale the predicted scores of Q-Align to a range of 1-10 for comparison.

图 8: 真实场景可视化结果。我们将 Q-Align 的预测分数缩放至 1-10 范围以便比较。
