FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs’ face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding.
1. Introduction
Recent progress in Large Language Models (LLMs) [2, 7, 99] has showcased their impressive ability to understand, reason, and generate text across diverse open-ended tasks. Building on these advancements, multimodal large language models (MLLMs) [49, 50, 57, 63, 115, 135] have rapidly emerged, using LLMs as a brain to process multimodal inputs including images, videos, and audio. These models [1, 12, 26, 47, 54] demonstrate remarkable performance and excel in challenging visual question answering (VQA) benchmarks such as MMMU [119], MME [117], MM-Bench [60], SEEDBench [46], MMAU [116], MMIE [107], and MathVista [62]. These benchmarks evaluate MLLMs in various domains, including college-level subject knowledge, mathematical reasoning, commonsense reasoning, planning, coding, diagram understanding, and spatio-temporal understanding. Despite this progress, prominent MLLMs such as GPT-4o [30] and GeminiPro 1.5 [96] struggle with basic face understanding tasks, as illustrated in Figure 1.
Before exploring MLLMs’ face understanding capabilities, it is essential to address a fundamental question: “Why should MLLMs be proficient in face understanding?” MLLMs are increasingly deployed as central processors in various advanced applications, including virtual-reality headsets [38], embodied AI [20, 29], driving safety [31, 89], authentication [18], human-computer interaction [100], and sports analysis [106]. In these applications, face images appear frequently and require accurate face understanding for appropriate responses. However, the face understanding capabilities of existing MLLMs are limited; they often fail to answer basic questions such as “What is the expression of the person in this image?” or “Which of the following regions is not present in the face image?” These shortcomings indicate significant scope for improvement. Recent works, such as EMO-LLaMA [109], develop an instruction-tuning set for supervised fine-tuning to enhance MLLMs’ capabilities in understanding facial expressions. Face-MLLM [91] proposes a three-stage training pipeline to equip MLLMs with face perception capabilities. Nonetheless, the research community currently lacks a standardized benchmark to quantify and compare MLLMs’ performance in face understanding. We believe that a comprehensive benchmark incorporating various aspects of face understanding is a crucial foundational step for monitoring progress and advancing MLLMs’ performance in this domain.
To this end, we propose FaceXBench, a comprehensive benchmark designed for complex face understanding and related tasks. We identify six broad categories, each encompassing distinct tasks essential for complete face understanding within the context of MLLMs:
- Bias and Fairness: Age Estimation (Age), Gender Prediction (Gender), and Race Estimation (Race);

Figure 1. Left: Failure cases of leading MLLMs, such as LLaVA-OV [47], Qwen2-VL [103], GPT-4o [30] and GeminiPro 1.5 [96], on basic questions related to face understanding. Right: Performance comparison of top models across the 14 tasks included in the benchmark.
Recent works [75, 93] emphasize that, although existing open-source MLLMs have achieved advanced capabilities, they still lack the sophistication to perform complex tasks. To address this, these approaches equip MLLMs with tool use, extending their capabilities by leveraging external tools to fulfill complex human instructions that require task-specific processing. Motivated by this, we introduce FaceXAPI, a new dataset that forms part of FaceXBench. It is designed to evaluate MLLMs’ ability to select the appropriate API and function calls for handling complex tasks in face understanding. FaceXBench comprises 5,000 carefully and manually filtered VQA-type questions derived from 25 public datasets, covering 14 tasks. It includes 10,441 unique face images representing a diverse range of age groups, genders, races, resolutions, head poses, expressions, and attributes, reflecting the diversity of faces encountered in real-world scenarios.
We conduct extensive experiments and benchmark 26 open-source MLLMs and 2 advanced proprietary MLLMs, GPT-4o [30] and GeminiPro-1.5 [96]. We analyze the models across three evaluation settings: (1) zero-shot, (2) in-context task description, and (3) chain-of-thought (CoT) prompting. Our findings reveal two main takeaways: First, existing MLLMs struggle with tasks like deepfake detection and crowd counting, which require fine-grained visual analysis to detect subtle inconsistencies and the ability to recognize and adapt to faces at varying scales. The performance of top models across various tasks is illustrated in Figure 1. Second, attempts to leverage the face knowledge of MLLMs through in-context prompting lead to performance drops, indicating a struggle to utilize contextual information effectively. Chain-of-thought prompting similarly fails to improve performance, suggesting that while these models have reasoning capabilities, they cannot apply them to face-related tasks, limiting their ability to make nuanced interpretations or adjustments. Finally, we present experiments highlighting the potential of diverse supervised fine-tuning and tool use as promising directions for enhancing MLLMs’ face understanding capabilities.
Our key contributions are as follows:
2. Related Work
Face Understanding: Face understanding concerns the analysis and interpretation of human facial features, encompassing tasks such as age estimation [10, 37, 44, 90], gender prediction [5, 39, 55], race estimation [3, 4], face recognition [19, 71, 101], face anti-spoofing [59, 69, 113], deepfake detection [70, 97, 128], facial attributes prediction [66, 123, 136], facial expression recognition [42, 102, 120], headpose estimation [25, 80, 122], crowd counting [76, 77, 121] and face parsing [73, 83, 94, 130]. Early methods focused on developing task-specific models for each facial understanding task, achieving promising results. With the advent of transformers, works such as FaceXFormer [72], Q-Face [92], and Faceptor [74] aim to unify multiple tasks within a single architecture to improve generalization performance. The recent rise of MLLMs has enabled new avenues of research in face understanding by integrating information across text, visual, and other modalities. Recent works [15, 86, 91, 109, 112, 129] leverage the reasoning and zero-shot capabilities of MLLMs to approach traditional face-related tasks. However, the field still lacks a standardized benchmark to monitor and regulate the development of these models. Our proposed work addresses this gap by introducing a comprehensive benchmark, FaceXBench, which covers a diverse range of face understanding tasks.
Multimodal Large Language Models and Benchmarks: Following the success of Large Language Models [2, 8, 99], recent research has focused on MLLMs to enhance multimodal comprehension and generation by leveraging the strong generalization capabilities of LLMs. With the rapid development of MLLMs [8, 9, 13, 32, 49, 54, 57, 85, 98, 104, 115, 135], several works propose benchmarks to evaluate MLLMs across different aspects. MMMU [119] meticulously collects multimodal questions from college exams, quizzes, and textbooks to assess college-level subject knowledge. MathVista [62] combines challenges from diverse mathematical and visual tasks, focusing on evaluating models’ fine-grained, deep visual understanding and compositional reasoning. MMBench [60] proposes a bilingual objective benchmark and a CircularEval strategy for models with limited instruction-following capabilities. SEED-Bench [45], MMAU [82], and MMIE [107] introduce benchmarks for generative comprehension and generation, audio understanding, and interleaved multimodal comprehension and generation, respectively. Some benchmarks, such as SWE-Bench [34] and OSWorld [108], focus on agentic reasoning and multimodal agents for open-ended tasks. SHIELD [86] is an early attempt to benchmark MLLMs on face anti-spoofing and face forgery detection but overlooks several other aspects of face understanding. Recent works, such as EMO-LLaMA [109] and Face-MLLM [91], aim to enhance MLLMs’ capabilities in facial expression recognition and face perception; however, the field lacks a comprehensive benchmark to objectively evaluate these tasks. In this work, we introduce FaceXBench, a comprehensive benchmark containing 5,000 multiple-choice questions (MCQs) that evaluate 6 broad categories and 14 different tasks.
3. The FaceXBench Benchmark
3.1. Overview of FaceXBench
We introduce FaceXBench, a novel and comprehensive benchmark encompassing multiple aspects of face understanding. Additionally, we propose FaceXAPI, a manually curated dataset within FaceXBench that aims to evaluate tool-use capabilities. FaceXBench is the first benchmark specifically designed to evaluate MLLMs’ performance on face-related tasks. It assesses MLLMs across six broad categories: bias and fairness, face authentication, face recognition, face analysis, face localization, and facial tool use. Fig. 2 presents a detailed taxonomy of these categories, outlining their respective tasks and corresponding numbers of questions. The designed questions test an MLLM’s capabilities in areas such as visual grounding, fine-grained feature extraction, anomaly detection, emotion recognition, fairness, contextual understanding, spatial understanding, and agentic reasoning.
Benchmark Statistics. FaceXBench consists of 5,000 questions, divided into 6 categories covering 14 tasks, and is derived from 25 public datasets along with one proposed dataset (i.e., FaceXAPI). There are 2,750 questions with multiple images, 2,150 single-image questions, and 100 text-only questions. The questions are generated using 757 unique question templates and contain 10,441 unique images. The correct answer options are approximately equally distributed among A, B, C, and D to avoid bias. The key statistics of the benchmark are summarized in Tab. 1.
3.2. Data Collection
Our data collection pipeline consists of three steps. Step 1. In the first step, we iterate through the identified tasks in each category and select the datasets corresponding to each task. We collect the test sets of existing standard datasets for each task to avoid data leakage. We also create a new dataset, FaceXAPI, specifically for the tool-retrieval task. In total, our collection includes 25 public datasets along with this newly proposed dataset. Step 2. In this step, we manually create question templates for each task, involving single and multiple images. While creating these templates, we carefully frame questions to encourage the model to reason, compare, and think critically to find the correct answer. We include a mix of easy and hard questions to maintain a balanced distribution. For example, a hard question is, “<image1> <image2> <image3> How many images show a person in the age range of 30 to 39?”, while an easy question is, “<image1> What is the age range of the person shown in the image?”. We prompt GPT-4o with the manually curated question templates as in-context examples and generate additional templates that address similar questions but with varied phrasing, perspective, and/or complexity. For example, “<image1> <image2> Which among these two appears to be older?”. We manually filter the question templates for each task to ensure diversity and task relevance. Additional details about the question templates and their generation can be found in Appendix C.1. Step 3. The final step involves generating answer options. Each question includes four options, with one correct answer. We ensure that distractor options are close to the correct answer but distinct enough to require critical thinking and reasoning when selecting the correct response.
For example, for the question “<image1> What is the age range of the person shown in the image?”, the designed options are (A) 20 to 29, (B) 10 to 19, (C) 30 to 39, and (D) 40 to 49, with the correct answer being (C) 30 to 39. Thoughtfully chosen distractor options close to the actual answer encourage the model to reason, compare, and carefully select the correct choice. This approach raises the difficulty for the model, making the questions more challenging. To validate this, we conducted a mini experiment using llava-onevision-qwen2-7b-ov for age estimation on FairFace dataset questions that include a single image. The model achieved 88% accuracy with randomly chosen options, compared to 53% when strategically designed options were used. This result highlights the importance of identifying effective distractor options. Further details on generating distractor options for each task are provided in Appendix C.2. Overall, we observed that the model’s performance varies significantly depending on the complexity of the questions and distractor options.
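The adjacent-range distractor strategy described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual generation code (Appendix C.2): the function name `make_age_mcq`, the decade-bucket format, and the neighbouring-decade pool are all assumptions.

```python
import random

def make_age_mcq(true_age: int, rng: random.Random) -> dict:
    """Build one age-estimation MCQ with adjacent-decade distractors.

    Hypothetical sketch for mid-range ages (10-89); the real pipeline
    may use different bucketing and sampling rules.
    """
    decade = (true_age // 10) * 10
    correct = f"{decade} to {decade + 9}"
    # Distractors are neighbouring decades: close enough to the true
    # answer to force reasoning, distinct enough to keep one answer correct.
    neighbours = [decade - 20, decade - 10, decade + 10, decade + 20]
    pool = [f"{d} to {d + 9}" for d in neighbours if 0 <= d <= 90]
    options = [correct] + rng.sample(pool, 3)
    rng.shuffle(options)
    letters = ["A", "B", "C", "D"]
    answer = letters[options.index(correct)]
    return {
        "question": "<image1> What is the age range of the person shown in the image?",
        "options": dict(zip(letters, options)),
        "answer": answer,
    }

q = make_age_mcq(34, random.Random(0))
```

For `true_age = 34` this yields the "30 to 39" correct option surrounded by the nearby decades, mirroring the worked example above.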
Table 1. Key statistics of questions in FaceXBench.
| Statistic | Number |
|---|---|
| Total questions | 5,000 |
| Total categories | 6 |
| Total tasks | 14 |
| Public datasets used | 25 |
| Newly proposed datasets (FaceXAPI) | 1 |
| Multiple-image questions | 2,750 (55%) |
| Single-image questions | 2,150 (43%) |
| Text-only questions | 100 (2%) |
| Total images across all questions | 11,266 |
| Unique images | 10,441 |
| Unique question templates | 757 |
| Maximum question length | 676 |
| Maximum option length | 207 |
| Average question length | 64.34 |
| Average option length | 11.04 |
| Options per question | 4 |
| Frequency of A as the correct option | 1,278 (25.56%) |
| Frequency of B as the correct option | 1,332 (26.64%) |
| Frequency of C as the correct option | 1,189 (23.78%) |
| Frequency of D as the correct option | 1,201 (24.02%) |

Figure 2. Distribution of questions across different categories and tasks in FaceXBench.
3.3. FaceXAPI
We believe that face understanding is an application domain that can be better addressed by equipping MLLMs with tools rather than relying solely on supervised fine-tuning. We provide a detailed discussion, supported by experimental validation, in the Discussion and Future Directions section (Section 6). To test MLLMs’ ability to select the correct sequence of APIs and function calls for successful task completion, we created FaceXAPI, a dataset of 100 text-only questions referencing 13 APIs and 32 function calls. The questions are designed to ensure diversity in scenarios, reflecting a broad range of real-world applications, and require a sequence of 3 to 5 API calls to solve. The dataset includes a total of 88 unique function call sequences. The questions were generated using GPT-4o, followed by manual filtering. A sample from the dataset is presented in Fig. 3, with the complete prompt provided in Appendix C.3.
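To make the question format concrete, a FaceXAPI-style item might look like the sketch below. The scenario and all function names here are illustrative assumptions; the actual 13 APIs, 32 function calls, and a real sample appear in Fig. 3 and Appendix C.3.

```python
# A hypothetical FaceXAPI-style item: a text-only MCQ whose options are
# candidate sequences of function calls. Every identifier below is
# invented for illustration; the dataset's real APIs differ.
sample = {
    "question": (
        "A security system must verify whether the person at the door "
        "matches an enrolled employee and log their apparent emotion. "
        "Which sequence of function calls solves the task?"
    ),
    "options": {
        "A": ["detect_faces", "estimate_headpose", "count_crowd"],
        "B": ["detect_faces", "extract_embedding", "match_identity",
              "recognize_expression"],
        "C": ["parse_face", "estimate_age", "match_identity"],
        "D": ["detect_spoof", "estimate_gender", "recognize_expression"],
    },
    # Only option B both verifies identity and reads the expression,
    # and its length (4 calls) falls in the dataset's 3-to-5 range.
    "answer": "B",
}
```

Evaluating a model on such an item reduces to the same MCQ accuracy used for the image-based tasks, so tool retrieval needs no separate scoring protocol.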
3.4. Quality Control
We implement quality control checks at each step of the data collection process to prevent error propagation. In the first stage of data collection, we convert established, manually annotated datasets for each task into VQA format. This approach ensures a high level of initial accuracy compared to benchmarks with AI-generated ground truths. In the second stage, we focus on achieving diversity and variety in question generation while maintaining appropriate difficulty levels, which ensures that the generated questions are of mixed difficulty and relevant to the task. We manually verify the correctness and relevance of the 757 unique question templates in the benchmark. Finally, we systematically design answer options that are high-quality and suitably difficult.

Figure 3. FaceXBench examples cover a total of 14 tasks, addressing various aspects of face understanding. Each question may consist of single or multiple images. Every question includes four options, with only one correct answer. The options are strategically designed to prompt the model to analyze carefully before selecting an option.
4. Experiments
In this section, we detail the experiments conducted to benchmark and analyze various MLLMs on face understanding. We evaluate 2 proprietary and 26 open-source models, as listed in Section 4.1. For a fair comparison, all models are evaluated in a zero-shot setting using the same base prompt. We further analyze selected MLLMs under in-context and chain-of-thought settings. The evaluation settings are described in Section 4.2 and the results are presented in Section 5. All experiments are performed on 8 NVIDIA A6000 GPUs.
4.1. Models
The 2 proprietary models used are GPT-4o [30] and GeminiPro 1.5 [96]. We divide the 26 open-source models into three major categories based on parameter size: (a) Open Source MLLMs (<4B parameters): PaliGemma [9], LLaVA-OneVision-0.5b-OV [47], and VILA 1.5-3b [54]; (b) Open Source MLLMs (4B-13B parameters): Chameleon-7b [95], Eagle-X4-8B-Plus [85], Idefics2-8b [41], Idefics-9b-Instruct [40], LLaVA-v1.5-7b [57], Monkey-Chat [53], MiniCPM-Llama3-v2.5 [114],
LLaVA-OneVision-7b-SI [47], LLaVA-NeXT-Interleave-7b [48], Mantis-SIGLIP-8b [32], Phi-3.5-Vision [1], LLaVA-OneVision-7b-OV [47], Qwen2-VL-7b-Instruct [103], and InternVL2-8b [12]; (c) Open Source MLLMs (>13B parameters): CogVLM2-19b [26], Idefics-80b-Instruct [40], LLaVA-v1.5-13b [57], VILA 1.5-13b [54], InternVL-Chat-v1.5 [12], VILA 1.5-40b [54], LLaVA-OneVision-72b-OV [47], Qwen2-VL-72b-Instruct [103], and InternVL2-76b [12]. In Appendix D.1, we provide detailed information regarding the architecture and parameter size of all open-source MLLMs evaluated in this paper, along with additional results under different settings.
4.2. Evaluation Settings
We evaluate the models under three settings: (a) zero-shot, (b) in-context task description, and (c) chain-of-thought prompting. In the zero-shot setting, we pass the input with only the base prompt. In the in-context task description setting, we prepend the prompt with a brief description of the specific task required by the question. In the chain-of-thought setting, we prompt the model to reason step-by-step before selecting the correct option. The task-specific prepended text and additional details of the evaluation settings are provided in Appendix C.6.
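The three settings differ only in how the prompt is assembled around the question, which can be sketched as below. The exact prompt wording used by the benchmark is given in Appendix C.6; the strings here are illustrative placeholders.

```python
def build_prompt(question: str, setting: str, task_desc: str = "") -> str:
    """Assemble the evaluation prompt for one of the three settings.

    Illustrative sketch: the benchmark's actual base prompt and
    task descriptions (Appendix C.6) are worded differently.
    """
    base = (f"{question}\n"
            "Answer with the letter of the correct option (A, B, C, or D).")
    if setting == "zero-shot":
        # Base prompt only.
        return base
    if setting == "in-context":
        # Prepend a brief description of the task the question tests.
        return f"Task: {task_desc}\n{base}"
    if setting == "cot":
        # Ask the model to reason step-by-step before choosing.
        return f"{base}\nLet's think step by step before answering."
    raise ValueError(f"unknown setting: {setting}")
```

Keeping the base prompt identical across settings isolates the effect of the prepended task description or the reasoning instruction, which is what the comparison in Section 5 relies on.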
The proposed benchmark consists of multiple-choice questions (MCQs), which standardize the evaluation across different models. We empirically construct a diverse set of regular expressions and employ a three-step evaluation strategy to extract the model's chosen option in cases where intermediate reasoning or calculations are present in the model’s output. Our evaluation strategy is as follows: First, we use a regular expression to match the chosen option (A, B, C, or D) at the beginning of the model’s prediction. If this attempt fails, we then search for the chosen option within the entire prediction text. As a final fallback, we compare parts of the predicted output with the option values to find a match. If none of these steps succeed, we categorize the prediction as incorrect. The complete function is provided in Appendix C.4. Alternatively, we could implement a random-choice or frequent-choice strategy instead of labeling such responses as incorrect. In the random-choice approach, a random option is selected as the prediction, while in the frequent-choice approach, we select the most common correct option in the dataset. However, these strategies can lead to misleading results, especially when the model fails to respond due to content moderation measures or poor instruction following.
5. Results
The performance of various models on FaceXBench is shown in Table 2. The results emphasize the challenging nature of the benchmark, with no model achieving more than 60% accuracy. We observe that Qwen2-VL-72b-Instruct [103] achieves the best overall performance of 57.86%. InternVL2-76b [12], GeminiPro 1.5 [96], Qwen2-VL-72b-Instruct [103], LLaVA-OneVision-72b-OV [47], Qwen2-VL-72b-Instruct [103], and GeminiPro 1.5 [96] achieve the highest performance in the categories of Bias & Fairness, Face Recognition, Face Authentication, Face Analysis, Face Localization, and Face Tools Use, with accuracies of 69.53%, 70.00%, 41.14%, 63.25%, 55.45%, and 57.00%, respectively.
Results across different evaluation categories. The performance of various models on FaceXBench, as shown in Table 2, highlights the limited face understanding capabilities of current models. Specifically, they perform poorly on tasks such as face authentication and face localization, which require fine-grained facial feature extraction and spatial understanding of facial structure. The low performance in the “Bias & Fairness” category suggests that existing MLLMs exhibit biases toward certain age, gender, and racial groups, which need to be mitigated prior to deployment. Surprisingly, GPT-4o performs poorly in the “Bias & Fairness” category, but on examination, we found that the model often chose not to answer due to its safety alignment. The top-performing models achieve around 70% accuracy in face recognition tasks; however, their performance drops significantly on low-resolution face recognition. This decline can be attributed to training data predominantly comprising high-resolution images, which limits models’ effectiveness on low-resolution inputs. Some models may have been trained on image-text datasets with face images, such as FFHQ-Text [134], CelebA-Dialog [33], LAION-Face [133], and FaceCaption-15M [17], as part of large-scale pretraining. Although these datasets often contain attribute and expression information in text, the models still perform poorly in the “Face Analysis” category, indicating substantial room for improvement.
Open Source vs Closed Source Models. We observe that open-source models, such as InternVL2 and Qwen2-VL, outperform proprietary models like GPT-4o and GeminiPro 1.5, achieving accuracies of 57.80% and 57.86%, compared to 50.50% and 56.96%, respectively. Generally, MLLMs have shown a trend where closed-source models outperform open-source ones. However, in the sensitive domain of face analysis, proprietary models are safety-aligned before deployment. The relatively poor performance of proprietary models in “Bias & Fairness” and “Face Analysis” is primarily due to content moderation constraints. Notably, GeminiPro 1.5 demonstrates comparatively better performance than other models on “Face Tools Use”, showcasing its ability to leverage specialized tools for complex scenarios involving multiple face-related tasks.
Performance across various tasks. To gauge the difficulty of the various tasks, we plot the performance of the top-5 models (4B-13B parameters) on each task, as shown in Figure 4(a). A detailed table showing the performance of all models across all tasks is provided in Appendix D.2 (Table D.2). We observe that, on average, models struggle with tasks such as crowd counting, deepfake detection, head pose estimation, and low-resolution face recognition, highlighting areas where MLLMs need improvement. In contrast, gender prediction is one of the easier tasks, with an average performance of approximately 80%. Additionally, we plot the average performance of these top-5 models on multiple-image and single-image questions. Figure 4(c) shows that models generally perform worse on questions with multiple images as input than on single-image questions. This is primarily because the models must process more visual information and compare and contrast facial features across multiple images, making the questions more challenging. Furthermore, models such as Qwen2-VL [103] and LLaVA-OneVision [47], which use dynamic resolution to handle arbitrary image resolutions, outperform other models on the segmentation task. Mapping images to a dynamic number of visual tokens more closely resembles human-like visual processing and leads to improved performance on face understanding tasks.
Effect of the LLM and its size on performance. To analyze the impact of the LLM on model performance, we plot performance curves for different models that share the same SigLIP SO400M/14@384 vision encoder but differ in their LLM backbone. From Figure 4(b), we observe that performance improves as the LLM size increases. Furthermore, when comparing LLMs of different sizes within the same family, such as Qwen2, we see that the shape of the curve remains consistent, with performance values shifting upward across all dimensions. This indicates that as the LLM size increases, the model's capabilities improve proportionately across various dimensions, maintaining a similar pattern of performance enhancement.
Table 2. Results of different models on FaceXBench. We categorize the open-source models into three groups based on parameter size: (a) open-source MLLMs (<4B parameters), (b) open-source MLLMs (4B-13B parameters), and (c) open-source MLLMs (>13B parameters). Additionally, we evaluate (d) proprietary models. The best-performing model in each category is highlighted in bold.
| Model (28) | Overall (5,000) | Bias & Fairness (800) | Face Recognition (1,500) | Face Authentication (1,100) | Face Analysis (800) | Face Localization (700) | Face Tools Use (100) |
|---|---|---|---|---|---|---|---|
| Random choice | 25.10 | 24.73 | 26.88 | 22.71 | 24.75 | 25.64 | 30.00 |
| Frequent choice | 32.22 | 30.73 | 29.50 | 40.14 | 33.25 | 29.73 | 40.00 |
| Open-source MLLMs (<4B parameters) | | | | | | | |
| PaliGemma [9] | 32.22 | 35.67 | 26.50 | 28.00 | 37.62 | 32.27 | 12.00 |
| LLaVA-OneVision-0.5b-OV [47] | 34.00 | 34.93 | 28.12 | 30.29 | 44.62 | 32.91 | 20.00 |
| **VILA 1.5-3b [54]** | 35.80 | 38.27 | 33.25 | 30.86 | 44.50 | 31.82 | 28.00 |
| Open-source MLLMs (4B-13B parameters) | | | | | | | |
| Chameleon-7b [95] | 17.04 | 10.27 | 17.12 | 6.86 | 20.25 | 28.91 | 33.00 |
| Eagle-X4-8B-Plus [85] | 31.44 | 25.00 | 23.12 | 30.00 | 35.62 | 43.64 | 37.00 |
| Idefics-9b-Instruct [40] | 34.58 | 37.93 | 28.62 | 34.43 | 37.38 | 34.18 | 15.00 |
| LLaVA-v1.5-7b [57] | 36.22 | 41.20 | 33.12 | 30.14 | 43.50 | 32.18 | 15.00 |
| Monkey-Chat [53] | 37.40 | 39.00 | 31.50 | 26.00 | 44.00 | 41.73 | 40.00 |
| MiniCPM-Llama3-v2.5 [114] | 40.70 | 45.80 | 29.88 | 32.86 | 52.38 | 40.45 | 15.00 |
| LLaVA-NeXT-Interleave-7b [48] | 43.80 | 52.53 | 38.00 | 38.57 | 55.88 | 32.27 | 26.00 |
| LLaVA-OneVision-7b-SI [47] | 44.32 | 50.73 | 32.75 | 29.86 | 52.25 | 47.27 | 46.00 |
| Idefics2-8b [41] | 44.52 | 52.67 | 31.25 | 33.57 | 53.25 | 43.91 | 42.00 |
| Mantis-SIGLIP-8b [32] | 44.60 | 56.13 | 45.12 | 36.86 | 48.00 | 31.64 | 37.00 |
| Phi-3.5-Vision [1] | 45.16 | 52.47 | 50.12 | 40.00 | 51.00 | 31.64 | 34.00 |
| LLaVA-OneVision-7b-OV [47] | 48.98 | 61.40 | 38.38 | 35.57 | 55.12 | 44.82 | 38.00 |
| Qwen2-VL-7b-Instruct [103] | 51.58 | 57.47 | 57.88 | 34.00 | 57.50 | 47.09 | 38.00 |
| **InternVL2-8b [12]** | 53.24 | 62.40 | 61.75 | 35.43 | 55.38 | 45.09 | 45.00 |
| Open-source MLLMs (>13B parameters) | | | | | | | |
| Idefics-80b-Instruct [40] | 35.86 | 39.87 | 35.12 | 27.71 | 35.12 | 38.55 | 15.00 |
| LLaVA-v1.5-13b [57] | 39.88 | 44.60 | 34.88 | 34.14 | 44.75 | 37.27 | 39.00 |
| VILA 1.5-13b [54] | 40.00 | 45.07 | 40.00 | 28.43 | 49.25 | 34.18 | 35.00 |
| CogVLM2-19b [26] | 40.46 | 43.13 | 33.88 | 35.71 | 45.62 | 41.91 | 29.00 |
| InternVL-Chat-v1.5 [12] | 49.18 | 59.73 | 41.38 | 33.00 | 55.12 | 46.73 | 46.00 |
| VILA 1.5-40b [54] | 55.48 | 64.00 | 57.63 | 33.14 | 60.50 | 54.36 | 39.00 |
| LLaVA-OneVision-72b-OV [47] | 56.42 | 66.53 | 52.00 | 37.43 | 63.25 | 53.73 | 48.00 |
| InternVL2-76b [12] | 57.80 | 69.53 | 66.62 | 36.14 | 62.00 | 47.18 | 46.00 |
| **Qwen2-VL-72b-Instruct [103]** | 57.86 | 62.20 | 69.12 | 41.14 | 57.88 | 55.45 | 46.00 |
| Proprietary MLLMs | | | | | | | |
| GPT-4o [30] | 50.50 | 46.93 | 55.62 | 40.00 | 62.25 | 50.36 | 44.00 |
| **GeminiPro 1.5 [96]** | 56.96 | 67.40 | 70.00 | 35.00 | 58.13 | 46.36 | 57.00 |
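The "Random choice" and "Frequent choice" rows in Table 2 can be approximated as follows. This sketch assumes four answer options per question, whereas the actual benchmark mixes questions with different option counts (which is why per-category random baselines deviate from 25%):

```python
import random
from collections import Counter

def random_choice_baseline(answers, options="ABCD", trials=1000, seed=0):
    # Expected accuracy of uniformly guessing one option per question,
    # estimated by simulation (assumes every question has the same options).
    rng = random.Random(seed)
    hits = sum(
        rng.choice(options) == a for _ in range(trials) for a in answers
    )
    return 100.0 * hits / (trials * len(answers))

def frequent_choice_baseline(answers):
    # Always predict the most common answer letter in the answer key.
    _, count = Counter(answers).most_common(1)[0]
    return 100.0 * count / len(answers)

answers = ["A", "B", "A", "C", "A", "D"]  # toy answer key
print(frequent_choice_baseline(answers))
```

On the toy key the frequent-choice baseline is 50.0, while the simulated random baseline hovers near 25.0.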
Performance under different evaluation settings. We evaluate three selected models under the in-context task description and chain-of-thought settings, with results summarized in Table 3. In the in-context setting, we observe some performance improvements in face authentication and face tools use. However, overall performance drops, indicating that the models struggle to utilize in-context information effectively. In the chain-of-thought setting, we observe a substantial decline in performance, suggesting that, although these models exhibit reasoning capabilities, those capabilities do not transfer effectively to face understanding.
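The three evaluation settings differ only in how a question is wrapped into a prompt. A rough sketch of the wrappers, with wording that is illustrative rather than the paper's exact prompts:

```python
def build_prompt(question, options, setting="zero-shot", task_description=""):
    """Sketch of the three evaluation settings; the exact FaceXBench
    prompt wording is not reproduced here."""
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    base = f"{question}\n{opts}\n"
    if setting == "zero-shot":
        return base + "Answer with the option letter only."
    if setting == "in-context":
        # Prepend a description of the task before the question.
        return f"Task: {task_description}\n\n" + base + "Answer with the option letter only."
    if setting == "chain-of-thought":
        return base + "Think step by step, then give the option letter."
    raise ValueError(f"unknown setting: {setting}")

p = build_prompt("Are these two images of the same person?", ["Yes", "No"],
                 setting="chain-of-thought")
```

The same question and option list are reused across all three settings, so score differences isolate the effect of the prompting strategy.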
6. Discussion and Future Directions
In this section, we discuss possible future directions for improving the face understanding capabilities of MLLMs. The MLLM community has developed numerous supervised fine-tuning (SFT) datasets, such as COCO Caption [11],

Figure 4. (a) Performance of the top-5 models (4B-13B parameters) across various tasks. (b) Effect of the LLM and its size on model performance. (c) Average performance of the top-5 models (4B-13B parameters) on multiple-image and single-image questions.
ScienceQA [81], Vision-FLAN [110], ChartQA [64], FigureQA [35], Geometry3K [23], MAVIS MCollect [124], MATHQA [6], TextOCR [88], OCR-VQA [65], and MagpiePro [111]. These supervised fine-tuning sets target various skills, including document and chart understanding, mathematics, fine-grained perception, grounding, reasoning, and general OCR. Developing these general skills helps models perform better on existing tasks and enhances their reasoning and visual processing capabilities for improved zero-shot task performance. However, to answer the question, “Will supervised fine-tuning improve the performance of MLLMs on face understanding?”, we conducted multiple experiments by fine-tuning LLaVA-1.5 [56] on a random 70k subset of FaceCaption-15M [17], which comprises image-text pairs containing attribute, age, and gender information. We fine-tuned the vision projector and the LLM backbone using LoRA [27] and summarize the results for various data compositions in Table 3(c). Our findings reveal that naively fine-tuning the MLLM on Face SFT data alone results in poor performance (Table 3(c), row 1), as the model tends to lose its general reasoning and perception capabilities. We then incorporated the complete LLaVA SFT dataset (665k samples) alongside the Face SFT data, and the model achieved a score of 35.18 (Table 3(c), row 2). Finally, we randomly sampled 200k examples from the full 665k LLaVA-1.5 dataset and combined them with the 70k Face SFT data for fine-tuning, resulting in an improved score of 35.24 (Table 3(c), row 3). This experiment demonstrates the benefit of integrating Face SFT data in the right proportion with existing reasoning and instruction-tuning datasets.
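The data-composition experiments above amount to sampling a subset of the general instruction-tuning set and shuffling it together with the face data. A sketch of that mixing step (dataset contents here are placeholders, not the actual SFT records):

```python
import random

def mix_sft_data(general_sft, face_sft, n_general=200_000, seed=0):
    """Mix a random subset of a general SFT set with face SFT data.

    Mirrors the composition experiment described above: sample
    n_general examples from the general instruction-tuning set,
    then shuffle them together with the face-captioning subset.
    """
    rng = random.Random(seed)
    subset = rng.sample(general_sft, min(n_general, len(general_sft)))
    mixed = subset + list(face_sft)
    rng.shuffle(mixed)
    return mixed
```

With the paper's proportions this would be `mix_sft_data(llava_sft, face_sft, n_general=200_000)`, giving a roughly 200k + 70k mixture; the key design choice is keeping enough general data so reasoning and perception skills are not forgotten.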
We explored an alternative research direction to improve MLLMs’ performance on face understanding through the use of specialized tools. To investigate the benefits of tool use, we selected specific datasets and converted predictions from state-of-the-art models into text, which we then provided as context to the MLLM. As shown in Table 3(b), this approach resulted in a significant performance boost.
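Serializing specialist-model predictions into the MLLM's prompt can be as simple as prepending labeled text lines. A sketch, with a hypothetical tool name and output format:

```python
def prompt_with_tool_context(question, tool_outputs):
    """Prepend specialist-model predictions, rendered as text, to a question.

    `tool_outputs` maps a tool name to its textual prediction; the tool
    names and wording here are illustrative, not the paper's exact format.
    """
    context = "\n".join(f"[{tool}] {out}" for tool, out in tool_outputs.items())
    return f"Context from specialized tools:\n{context}\n\nQuestion: {question}"

p = prompt_with_tool_context(
    "What is the head pose in the image?",
    {"HeadPoseNet": "yaw=12.3, pitch=-4.1, roll=0.8"},  # hypothetical tool
)
```

The MLLM then only has to read the serialized prediction rather than perform the fine-grained estimation itself, which is consistent with the gains reported in Table 3(b).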
| Model | Overall | B&F | FR | F Auth. | F Anlys. | FL | Tools |
|---|---|---|---|---|---|---|---|
| In-context description | | | | | | | |
| Phi-3.5-Vision [1] | -3.68 | -5.00 | -7.74 | -3.86 | -3.12 | +0.00 | +5.00 |
| Qwen2-VL-7B [103] | -1.26 | -2.40 | -2.50 | +5.00 | -2.38 | -2.45 | +4.00 |
| InternVL2-8B [12] | -1.10 | -3.13 | -0.50 | +6.00 | -2.01 | -3.00 | +3.00 |
| Chain-of-thought | | | | | | | |
| Phi-3.5-Vision [1] | -11.80 | -20.20 | -19.87 | -4.71 | -13.50 | 1.72 | -6.00 |
| Qwen2-VL-7B [103] | -11.96 | -13.74 | -16.76 | -6.86 | -12.12 | -10.27 | +0.00 |
| InternVL2-8B [12] | -4.36 | -3.00 | -3.75 | -4.86 | -5.50 | -5.91 | +0.00 |

(a)

| Model | Segmentation (CelebMask) | HPE (BIWI) | FER (AffectNet) |
|---|---|---|---|
| Phi-3.5-Vision [1] | +7.00 | +12.00 | +13.00 |
| Qwen2-VL-7B [103] | +5.00 | +6.67 | +30.00 |
| InternVL2-8B [12] | +4.00 | +7.33 | +23.00 |

(b)

| Data | Accuracy |
|---|---|
| FaceSFT | 33.58 |
| LLaVA SFT (665k) + FaceSFT | 35.18 |
| LLaVA SFT (200k) + FaceSFT | 35.24 |

(c)
Table 3. (a) Change in performance of selected models under different evaluation settings. (b) Performance improvement from leveraging tool use. (c) Results of fine-tuning experiments on LLaVA-1.5 [56] using different data compositions.
We believe that equipping MLLMs with agentic behavior through specialized tools is a promising way to enhance their face understanding capabilities.
For future work: we advocate the development of diverse supervised fine-tuning sets covering various aspects of face understanding and training MLLMs to leverage specialized tools. Following this direction, we believe that future MLLMs will be capable of advanced face understanding, with FaceXBench serving as a critical resource for monitoring progress by benchmarking models across multiple dimensions of face understanding.
7. Conclusion
We propose FaceXBench, a comprehensive benchmark for face understanding that consists of 5,000 multiple-choice questions. It covers 14 tasks across six broad categories and is derived from 26 datasets. We employed a three-step data collection process to convert existing datasets into a VQA format, implementing quality control checks at each step. We conducted a thorough evaluation of 26 open-source models and two proprietary models, GPT-4o and GeminiPro 1.5, revealing the limitations of these models in face understanding. Our analysis examines performance across various dimensions and tasks, identifying factors that impact model performance and providing valuable insights. Finally, we discuss possible future directions, advocating for the development of instruction-tuning datasets, with FaceXBench serving as a catalyst for further research.
References
FaceXBench: Evaluating Multimodal LLMs on Face Understanding Appendix
In the appendix, we provide a discussion of the limitations of our work. Along with this, we present detailed information regarding the benchmark, including the broad categories, the tasks, the source datasets used, and the dataset statistics. Additionally, we focus on its implementation and provide extensive details about the prompts used for dataset collection and the evaluation strategy. Furthermore, we expand on the results presented in the main paper, providing more details about the baselines, showcasing failure cases of the models, and concluding with an ethics statement.
Table of Contents in Appendix
A. Limitations
E. Ethical Considerations
A. Limitations
Our benchmark has a limited number of questions that address scenarios where multiple faces appear within a single image, limiting its applicability in such contexts. Additionally, generative models, such as diffusion models, cannot be evaluated using the current benchmark, restricting its applicability for assessing performance in image generation tasks. Future work will address these limitations by extending the benchmark to include questions designed to evaluate image generation capabilities.
B. FaceXBench
In this section of the appendix, we will provide more information on the broad categories and tasks included in the FaceXBench. Additionally, we will provide details on the source datasets and the dataset statistics, with examples of images used in the benchmark.
B.1. FaceXBench Categories
Bias and Fairness: In this category, we evaluate the model’s ability to estimate age, predict gender, and identify race as key indicators of its understanding and analysis of demographic attributes. The focus extends beyond the accuracy of these predictions to ensuring fairness and inclusivity across diverse groups. By analyzing the model’s performance, we aim to uncover potential biases and ensure that predictions remain consistent and unbiased regardless of age group, gender, or race. This assessment is crucial for promoting fairness, mitigating the amplification of societal biases, and enhancing the model’s generalization capabilities for real-world applications.
Face Recognition: In this category, we evaluate the model’s ability to perform accurate face recognition across various contexts, including high-resolution face recognition, low-resolution face recognition, and celebrity identification. These tasks test the model’s proficiency in feature extraction, spatial awareness, and handling variations in image quality, lighting, and pose. High-resolution face recognition assesses the model’s ability to leverage fine-grained details for precise identification, while low-resolution face recognition challenges its capability to generalize from limited information. Celebrity identification evaluates its knowledge base and contextual understanding of well-known individuals.
Face Authentication: In this category, we evaluate the model’s ability to perform robust face authentication, with a focus on critical tasks such as face anti-spoofing and deepfake detection. These tasks assess the model’s capability to distinguish bonafide facial data from spoofing attempts, thereby ensuring the security and reliability of authentication systems. They are essential for safeguarding sensitive applications like identity verification, access control, and fraud prevention. By analyzing the model’s performance, we aim to confirm that it demonstrates high sensitivity to subtle cues, generalizes effectively across diverse attack methods, and minimizes both false positives and false negatives. This assessment is vital for building trust in face authentication systems and addressing emerging threats in a rapidly evolving technological landscape.
Face Analysis: In this category, we evaluate the model’s ability to analyze and interpret facial features through tasks such as facial attribute prediction and facial expression recognition. This evaluation focuses on the model’s ability to accurately identify static attributes, such as physical traits (e.g., glasses, hair color, or beard), which are essential for applications like targeted content delivery and user profiling. It also emphasizes the dynamic understanding of emotions and subtle expressions, including micro-expressions, which play a key role in applications related to human-computer interaction, mental health assessment, and sentiment analysis. These skills are critical for capturing nuanced characteristics and ensuring effective interaction across diverse real-world scenarios.
Face Localization: In this category, we evaluate the model’s ability to accurately locate and analyze facial regions through tasks such as head pose estimation, face parsing, and crowd counting. Head pose estimation assesses the model’s spatial awareness and its ability to interpret the orientation of faces in three-dimensional space, which is crucial for applications like gaze tracking, augmented reality, and driver monitoring systems. Face parsing focuses on the precise segmentation and labeling of facial regions, such as eyes, nose, and mouth, enabling fine-grained analysis for tasks like virtual makeup, medical diagnosis, and personalized user interfaces. Crowd counting evaluates the model’s proficiency in detecting and quantifying multiple faces in dense or cluttered environments, ensuring robust performance in scenarios such as public safety monitoring, event analysis, and resource planning. Together, these tasks test the model’s ability to generalize across varying scales, perspectives, and levels of complexity. This assessment is vital for enhancing real-world applications that rely on face localization.
Face Tools Use: In this category, we evaluate the model’s ability to leverage external tools for face understanding tasks, reflecting a shift from traditional supervised fine-tuning to tool-based problem-solving. Using the FaceXAPI dataset, we assess the model’s proficiency in selecting and sequencing the correct APIs and function calls to solve complex tasks. This evaluation emphasizes the model’s ability to interpret detailed task requirements, identify relevant tools, and construct accurate operational workflows. The significance of this task lies in its alignment with real-world applications, where equipping MLLMs with tools enhances both scalability and adaptability.
B.2. Tasks
Age Estimation: This task involves determining an individual’s age or age range based on their facial features.
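As an illustration of how a labeled face dataset can be recast into the benchmark's multiple-choice format (the paper's actual three-step pipeline and distractor strategy may differ), an age label might be wrapped as follows:

```python
import random

def make_age_mcq(true_age, n_options=4, spread=20, seed=0):
    """Turn an age label into a multiple-choice question (illustrative).

    Distractors are ages sampled near the ground truth; this is an
    assumed strategy, not the benchmark's documented one.
    """
    rng = random.Random(seed)
    distractors = set()
    while len(distractors) < n_options - 1:
        cand = true_age + rng.randint(-spread, spread)
        if cand != true_age and cand >= 0:
            distractors.add(cand)
    options = list(distractors) + [true_age]
    rng.shuffle(options)
    answer = chr(65 + options.index(true_age))  # option letter of the truth
    question = "What is the approximate age of the person in the image?"
    return question, options, answer
```

Pairing such a question with its source image yields one VQA item; quality-control checks would then filter ambiguous or near-duplicate options.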
Gender Prediction: Gender prediction is the process of identifying a person’s gender from facial images by analyzing visual characteristics of the face.
Race Estimation: Race estimation involves predicting an individual’s racial background by analyzing their facial features.
High-resolution Face Recognition: High-resolution face recognition is the identification or verification of individuals using detailed facial images captured at high resolutions, which enhances the accuracy in distinguishing fine facial features.
Low-resolution Face Recognition: Low-resolution Face Recognition refers to performing face recognition tasks on images with limited detail, such as surveillance footage, which poses challenges due to reduced image clarity.
Celebrity Identification: Celebrity identification involves recognizing and naming well-known individuals in images or videos by comparing their facial features against a database of celebrity faces.
Face Anti-spoofing: This task focuses on detecting attempts to deceive facial recognition systems using methods such as photos, videos, or masks, ensuring that the face presented is genuine.
Deepfake Detection: Deepfake detection is the process of identifying synthetic media where a person’s likeness has been digitally altered or replaced, aiming to detect manipulated or fabricated content.
Attributes Prediction: Attributes prediction involves inferring various facial characteristics, such as the presence of glasses, facial hair, or specific expressions, from images.
Facial Expression Recognition: This task involves analyzing facial movements to determine a person’s emotional state, such as happiness, sadness, or anger.
Headpose Estimation: Head pose estimation is the process of determining the orientation of a person’s head (e.g., pitch, yaw, roll) relative to the camera, and is useful in applications like gaze tracking.
Face Parsing: This task refers to segmenting a facial image into distinct regions (e.g., eyes, nose, mouth) to facilitate detailed analysis or manipulation of facial components.
Crowd Counting: Crowd counting involves estimating the number of individuals present in an image or video frame, often used in surveillance and event monitoring.
Face Tools Retrieval: It refers to predicting the correct sequence of API calls that MLLMs need to execute to complete complex face-related scenarios requiring multiple tasks.
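A natural way to score face tools retrieval is ordered exact match over the predicted call sequence; the sketch below assumes that convention and uses hypothetical API names:

```python
def api_sequence_correct(predicted, ground_truth):
    """Exact-match check for a predicted API call sequence.

    Scoring here is an ordered exact match of call names; the actual
    FaceXAPI scoring protocol may differ.
    """
    return list(predicted) == list(ground_truth)

# Hypothetical API names for illustration.
gt = ["detect_faces", "estimate_head_pose", "recognize_face"]
assert api_sequence_correct(["detect_faces", "estimate_head_pose", "recognize_face"], gt)
assert not api_sequence_correct(["detect_faces", "recognize_face"], gt)
```

Ordered matching penalizes both missing calls and correct calls issued in the wrong order, which matters when one tool's output feeds the next.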
B.3. Source Datasets
FairFace [36]: FairFace is a face image dataset designed to address racial bias in facial recognition systems. It contains 108,501 images with annotations for race, gender, and age. The dataset includes seven race categories: White, Black, East Asian, Southeast Asian, Indian, Middle Eastern, and Latino. The images are sourced primarily from the YFCC-100M Flickr dataset and are balanced across different demographic groups.
UTKFace [127]: UTKFace is a large-scale face dataset comprising over 20,000 images with annotations for age, gender, and ethnicity. The age range spans from 0 to 116 years. The images exhibit variations in pose, facial expression, illumination, occlusion, and resolution, making it suitable for tasks like face detection, age estimation, and landmark localization.
WMCA (Wide Multi-Channel Presentation Attack) [24]: The WMCA dataset consists of 1,941 short video recordings from 72 different identities, including both bona fide and presentation attacks. Data is recorded across multiple channels: color, depth, infrared, and thermal. The dataset is designed for research in face anti-spoofing and presentation attack detection.
MSU MFSD [105]: It contains 280 video recordings of genuine and attack faces from 35 individuals. Each individual has two real-access videos captured with laptop cameras and Android devices. Attack videos include high-definition replays and photo attacks. The dataset is divided into 120 training videos from 15 subjects and 160 testing videos from 20 subjects.
CASIA MFSD [126]: It is a face anti-spoofing dataset containing 600 video recordings from 50 subjects. Each subject has
12 videos under different resolutions and lighting conditions. The dataset includes three spoof attack types: replay, warp print, and cut print attacks. It is divided into 240 training videos from 20 subjects and 360 testing videos from 30 subjects.
Replay Attack [16]: The Replay Attack dataset is designed for evaluating face anti-spoofing systems. It includes videos of both genuine access attempts and various spoofing attacks, such as printed photos and video replays. The dataset provides a diverse set of conditions to test the robustness of anti-spoofing algorithms.
CelebDF [52]: CelebDF is a large-scale dataset for deepfake detection, containing 590 real videos of celebrities and 5,639 corresponding deepfake videos. The dataset is designed to evaluate the performance of deepfake detection algorithms under real-world conditions.
FF++ [78]: FaceForensics++ is a dataset for evaluating facial manipulation detection methods. It consists of over 1,000 original video sequences and more than 4,000 videos manipulated using four different face manipulation techniques. Annotations include manipulation methods and compression levels.
TinyFace [14]: TinyFace is a dataset focused on detecting small faces in images. It contains images with a wide range of face sizes, particularly emphasizing faces that occupy a small number of pixels. Annotations include bounding boxes for each face.
LFW [28]: LFW is a dataset of 13,000 labeled face images in the wild, collected from the internet. It includes annotations for the identity of the person in each image, with 1,680 individuals having two or more images. The dataset is widely used for studying face recognition in unconstrained environments.
AgeDB [68]: AgeDB is a manually collected dataset containing 16,488 images of 568 distinct subjects. It provides age annotations and is used for evaluating age-invariant face verification and recognition algorithms.
CFP-FF and CFP-FP [84]: The CFP dataset consists of two subsets: CFP-FF (Frontal-Frontal) and CFP-FP (FrontalProfile). Each contains 7,000 images of 500 subjects. CFP-FF includes frontal face pairs, while CFP-FP includes frontal and profile face pairs, facilitating the study of face recognition across different poses.
CFP-FF 和 CFP-FP [84]:CFP 数据集由两个子集组成:CFP-FF(正面-正面)和 CFP-FP(正面-侧面)。每个子集包含 500 个对象的 7,000 张图像。CFP-FF 包含正面人脸对,而 CFP-FP 包含正面和侧面人脸对,便于研究不同姿态下的人脸识别。
CALFW [132]: CALFW is a dataset derived from LFW, focusing on cross-age face verification. It contains 4,025 image pairs with age differences, aiming to evaluate the performance of face recognition systems under age variation.
CPLFW [131]: CPLFW is another extension of LFW, emphasizing cross-pose face verification. It includes 3,000 image pairs with pose variations, challenging face recognition models to handle different facial orientations.
IMDB [79]: The IMDB dataset comprises 460,723 face images of 20,284 celebrities, collected from the Internet Movie Database (IMDb). Annotations include age, gender, and name, making it suitable for age estimation and gender classification tasks.
IMDB [79]: IMDB 数据集包含从互联网电影数据库 (IMDb) 收集的 20,284 位名人的 460,723 张面部图像。标注信息包括年龄、性别和姓名,适用于年龄估计和性别分类任务。
CelebA [61]: CelebA is a large-scale face attributes dataset with more than 200,000 celebrity images, each annotated with 40 attribute labels. The dataset covers a wide range of poses and backgrounds, supporting tasks like attribute prediction and face detection.
CelebA [61]: CelebA 是一个大规模的人脸属性数据集,包含超过 20 万张名人图像,每张图像都标注了 40 个属性标签。该数据集涵盖了广泛的姿态和背景,支持属性预测和人脸检测等任务。
RAF-DB [51]: RAF-DB contains 29,672 facial images with annotations for basic and compound emotions. The dataset is used for studying facial expression recognition in real-world scenarios.
RAF-DB [51]: RAF-DB 包含 29,672 张面部图像,标注了基本和复合情绪。该数据集用于研究真实场景中的面部表情识别。
AffectNet [67]: AffectNet is a comprehensive facial expression dataset with over 1 million images collected from the internet. Annotations include seven discrete facial expressions (anger, contempt, disgust, fear, happiness, sadness, and surprise) and the intensity of valence and arousal. It is widely used for emotion recognition and affective computing research.
AffectNet [67]: AffectNet 是一个全面的面部表情数据集,包含从互联网收集的超过 100 万张图像。标注包括七种离散的面部表情(愤怒、轻蔑、厌恶、恐惧、快乐、悲伤和惊讶)以及效价和唤醒的强度。它广泛用于情感识别和情感计算研究。
AFLW2000 [118]: AFLW2000 contains 2,000 face images annotated with 68 facial landmarks. The images are diverse, covering various poses, expressions, and occlusions. It is often used for face alignment and landmark localization tasks.
AFLW2000 [118]: AFLW2000 包含 2,000 张标注了 68 个面部关键点的人脸图像。这些图像具有多样性,涵盖了各种姿态、表情和遮挡情况。它通常用于人脸对齐和关键点定位任务。
BIWI [22]: The BIWI dataset includes 15,678 images of 20 subjects, captured using a Kinect camera. Annotations consist of 6D head poses (yaw, pitch, roll, and translation vectors) and 3D face models. It is designed for head pose estimation research.
BIWI [22]: BIWI 数据集包含 20 名受试者的 15,678 张图像,使用 Kinect 相机拍摄。标注包括 6D 头部姿态(偏航、俯仰、滚动和平移向量)和 3D 面部模型。该数据集专为头部姿态估计研究设计。
JHUCrowd++ [87]: JHUCrowd++ is a large-scale dataset for crowd counting, containing 4,372 images with over 1.5 million annotated heads. The annotations include head locations, crowd density maps, and visibility levels, making it suitable for crowd analysis and density estimation.
JHUCrowd++ [87]: JHUCrowd++ 是一个用于人群计数的大规模数据集,包含 4,372 张图像,标注了超过 150 万个头部。标注内容包括头部位置、人群密度图和可见度级别,适用于人群分析和密度估计。
ShanghaiTech [125]: The ShanghaiTech dataset contains two parts: Part A, with 482 images captured in crowded scenes, and Part B, with 716 images from less dense environments. It includes over 330,000 annotated head locations, making it a benchmark for crowd counting and density estimation.
ShanghaiTech [125]: ShanghaiTech 数据集包含两部分:A 部分包含 482 张在拥挤场景中拍摄的图像,B 部分包含 716 张来自密度较低环境的图像。该数据集包含超过 330,000 个标注的头部位置,使其成为人群计数和密度估计的基准。
CelebAMask-HQ [43]: CelebAMask-HQ is an extension of the CelebA dataset with 30,000 high-resolution face images and fine-grained segmentation masks for 19 facial attributes (e.g., eyes, nose, hair, and skin). It supports tasks like face parsing and image editing.
CelebAMask-HQ [43]: CelebAMask-HQ 是 CelebA 数据集的扩展,包含 30,000 张高分辨率人脸图像和 19 种面部属性(如眼睛、鼻子、头发和皮肤)的细粒度分割掩码。它支持人脸解析和图像编辑等任务。
LaPa [58]: LaPa includes 22,000 facial images with high-quality annotations for 11 facial regions. It offers various attributes like pose, expression, and occlusion, making it suitable for face parsing and semantic segmentation tasks.
LaPa [58]: LaPa 包含 22,000 张面部图像,具有 11 个面部区域的高质量标注。它提供了各种属性,如姿态、表情和遮挡,适用于面部解析和语义分割任务。
FaceXAPI: FaceXAPI is a dataset consisting of 100 text-only questions, each with four options and one correct answer. It is designed to assess the capabilities of MLLMs in predicting the correct sequence of API calls needed to accomplish complex scenarios involving multiple face-related tasks.
FaceXAPI: FaceXAPI 是一个由 100 个纯文本问题组成的数据集,每个问题有四个选项和一个正确答案。它旨在评估多模态大语言模型 (MLLMs) 在预测完成涉及多个面部相关任务的复杂场景所需的正确 API 调用序列方面的能力。
B.4. Dataset Statistics
B.4. 数据集统计
The FaceXBench dataset is derived from 25 public datasets and one newly created dataset. The number of questions sourced from each dataset, along with the type of questions (multiple images, single images, or text-only), as well as the associated tasks and categories, are summarized in Table B.1.
FaceXBench 数据集来源于 25 个公开数据集和一个新创建的数据集。每个数据集中问题的数量、问题的类型(多图像、单图像或纯文本)以及相关的任务和类别总结在表 B.1 中。
Table B.1. Question distribution of FaceXBench across datasets
表 B.1: FaceXBench 在各数据集上的问题分布
| 数据集 | 问题数量 | 多张图像 | 单张图像 | 仅文本 | 任务 | 类别 |
|---|---|---|---|---|---|---|
| FairFace [36] | 300 | 200 | 100 | 0 | 年龄估计 | 偏见与公平 |
| UTKFace [127] | 200 | 150 | 50 | 0 | 年龄估计 | 偏见与公平 |
| FairFace [36] | 300 | 200 | 100 | 0 | 性别预测 | 偏见与公平 |
| UTKFace [127] | 200 | 150 | 50 | 0 | 性别预测 | 偏见与公平 |
| FairFace [36] | 300 | 200 | 100 | 0 | 种族估计 | 偏见与公平 |
| UTKFace [127] | 200 | 150 | 50 | 0 | 种族估计 | 偏见与公平 |
| LFW [28] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| AgeDB [68] | 100 | 100 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CFP-FF [84] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CFP-FP [84] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CALFW [132] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CPLFW [131] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| TinyFace [14] | 100 | 100 | 0 | 0 | 低分辨率人脸识别 | 人脸识别 |
| IMDB [79] | 300 | 150 | 150 | 0 | 名人识别 | 人脸识别 |
| WMCA [24] | 250 | 100 | 150 | 0 | 人脸反欺骗 | 人脸认证 |
| MSU-MFSD [105] | 50 | 50 | 0 | 0 | 人脸反欺骗 | 人脸认证 |
| CASIA-MFSD [126] | 50 | 50 | 0 | 0 | 人脸反欺骗 | 人脸认证 |
| ReplayAttack [16] | 50 | 50 | 0 | 0 | 人脸反欺骗 | 人脸认证 |
| CelebDF [52] | 150 | 150 | 0 | 0 | 深度伪造检测 | 人脸认证 |
| FF++ [78] | 150 | 150 | 0 | 0 | 深度伪造检测 | 人脸认证 |
| CelebA [61] | 400 | 200 | 200 | 0 | 属性预测 | 人脸分析 |
| RAF-DB [51] | 200 | 100 | 100 | 0 | 面部表情识别 | 人脸分析 |
| AffectNet [67] | 200 | 100 | 100 | 0 | 面部表情识别 | 人脸分析 |
| AFLW2000 [118] | 200 | 50 | 150 | 0 | 头部姿态估计 | 人脸分析 |
| BIWI [22] | 200 | 50 | 150 | 0 | 头部姿态估计 | 人脸分析 |
| JHUCrowd++ [87] | 200 | 0 | 200 | 0 | 人群计数 | 人脸定位 |
| ShanghaiTech [125] | 100 | 0 | 100 | 0 | 人群计数 | 人脸定位 |
| CelebAMask-HQ [43] | 200 | 0 | 200 | 0 | 人脸解析 | 人脸定位 |
| LaPa [58] | 200 | 0 | 200 | 0 | 人脸解析 | 人脸定位 |
| FaceXAPI | 100 | 0 | 0 | 100 | 人脸工具检索 | 人脸工具使用 |
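As a quick sanity check, the per-dataset question counts transcribed from Table B.1 can be tallied programmatically; they should sum to the 5,000 questions in FaceXBench:

```python
# Per-dataset question counts transcribed from Table B.1.
# Keys combine dataset and task where a dataset appears more than once.
counts = {
    "FairFace/age": 300, "UTKFace/age": 200,
    "FairFace/gender": 300, "UTKFace/gender": 200,
    "FairFace/race": 300, "UTKFace/race": 200,
    "LFW": 60, "AgeDB": 100, "CFP-FF": 60, "CFP-FP": 60,
    "CALFW": 60, "CPLFW": 60, "TinyFace": 100, "IMDB": 300,
    "WMCA": 250, "MSU-MFSD": 50, "CASIA-MFSD": 50, "ReplayAttack": 50,
    "CelebDF": 150, "FF++": 150,
    "CelebA": 400, "RAF-DB": 200, "AffectNet": 200,
    "AFLW2000": 200, "BIWI": 200,
    "JHUCrowd++": 200, "ShanghaiTech": 100,
    "CelebAMask-HQ": 200, "LaPa": 200,
    "FaceXAPI": 100,
}
total = sum(counts.values())
print(total)  # 5000
```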
B.5. Images used in the dataset
B.5. 数据集中使用的图像
Figure B.1 displays a subset of the facial images used in the dataset, highlighting the diversity of faces included in the benchmark. The benchmark consists of face images with varying backgrounds, resolutions, and head pose orientations, as well as a wide range of facial expressions. It includes individuals from different age groups, genders, and racial backgrounds, with each face characterized by a versatile set of attributes.
图 B.1 展示了数据集中使用的一部分面部图像,突显了基准中包含的面部多样性。该基准由具有不同背景、分辨率和头部姿态方向的面部图像组成,涵盖了广泛的面部表情。它包括来自不同年龄组、性别和种族背景的个体,每个面部都具有多样化的属性集。
C. Dataset Collection
C. 数据集收集
In this section, we provide additional implementation details on the dataset curation process, including information about question templates, distractor options, the evaluation strategy, and the dataset format.
在本节中,我们提供了关于数据集整理过程的额外实现细节,包括问题模板、干扰选项、评估策略和数据集格式的信息。

Figure B.1. Collage of a subset of images from the dataset, showcasing the diversity of images used in FaceXBench.
图 B.1: 数据集中部分图像的拼贴图,展示了 FaceXBench 中使用的图像多样性。
C.1. Question Templates
C.1. 问题模板
We provide a selection of question templates for each task used in creating the dataset. The options for the questions may vary, and the examples provided below are intended as samples.
我们为创建数据集的每个任务提供了一系列问题模板。问题的选项可能有所不同,以下提供的示例仅供参考。
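To make the template mechanism concrete, the following sketch shows how a single-image age-estimation template could be instantiated into a four-option question. The age-group bins and the option-filling logic are our own assumptions for illustration, not the benchmark's actual code:

```python
import random

TEMPLATE = "What is the most appropriate age group for the person in this image?"
# Hypothetical age-group bins used only for this illustration.
AGE_GROUPS = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"]

def build_question(correct: str) -> dict:
    """Fill the <correct option> slot and sample three <distractor option> slots."""
    distractors = random.sample([g for g in AGE_GROUPS if g != correct], 3)
    options = distractors + [correct]
    random.shuffle(options)
    return {
        "question": TEMPLATE,
        "options": dict(zip("ABCD", options)),
        "answer": next(label for label, opt in zip("ABCD", options) if opt == correct),
    }
```

A call like `build_question("21-30")` yields one multiple-choice item whose answer key points at the ground-truth age group.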
Age Estimation | Multiple Images
年龄估计 | 多张图像
Which image shows a person in {age_group} age group?
哪张图片显示的是{age_group}年龄段的人?
Age Estimation | Multiple Images
年龄估计 | 多张图像
Who among these two appears to be older?
这两人中谁看起来更年长?
Age Estimation | Multiple Images
年龄估计 | 多图像
Arrange the following images in ascending order of age.
按年龄升序排列以下图像。
Age Estimation | Multiple Images
年龄估计 | 多图像
Arrange the following images in descending order of age.
将以下图像按年龄从大到小排列。
Age Estimation | Multiple Images
年龄估计 | 多张图像
How many images have a person of age between {age_start} and {age_end}?
有多少张图片中的人的年龄在 {age_start} 到 {age_end} 岁之间?
Age Estimation | Multiple Images
年龄估计 | 多张图像
Among these images, which image shows the oldest person?
在这些图像中,哪张图像显示的是最年长的人?
Age Estimation | Single Image
年龄估计 | 单张图像
What is the most appropriate age group for the person in this image?
这张图片中的人最适合的年龄组是什么?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Age Estimation | Single Image
年龄估计 | 单张图像
Approximately how old is the person in this image?
这张图片中的人大约多大年纪?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Age Estimation | Single Image
年龄估计 | 单张图像
Select the age group that best describes the person in this image.
选择最能描述图中人物年龄的年龄段。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Age Estimation | Single Image
年龄估计 | 单张图像
Estimate the age group of the person in this image.
估计这张图片中人物的年龄组。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Gender Prediction | Multiple Images
性别预测 | 多张图像
Which image shows a male person?
哪张图片显示的是男性?
Gender Prediction | Multiple Images
性别预测 | 多张图像
Which image shows a female person?
哪张图片显示的是女性?
Gender Prediction | Multiple Images
性别预测 | 多张图像
Identify which image shows a person whose gender is male.
识别哪张图片显示的是男性。
Gender Prediction | Multiple Images
性别预测 | 多张图像
Which images appear to have a person with male gender?
哪些图像中似乎有男性性别的人?
A. Image 1, Image 2 B. Image 3, Image 1 C. Image 1 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 1
C. 图像 1
D. 以上都不是
Gender Prediction | Multiple Images
性别预测 | 多张图像
How many images show female individuals?
有多少张图片展示了女性个体?
Gender Prediction | Single Image
性别预测 | 单张图像
What is the gender of the person in this image?
这张图片中的人的性别是什么?
Gender Prediction | Single Image
性别预测 | 单张图像
Identify the gender of the person in this image.
识别此图像中人物的性别。
Gender Prediction | Single Image
性别预测 | 单张图像
Which gender category best describes the person in this image?
这张图片中的人物最符合哪个性别类别?
Gender Prediction | Single Image
性别预测 | 单张图像
Select the most appropriate gender for the person in this image.
选择此图像中人物最合适的性别。
Gender Prediction | Single Image
性别预测 | 单张图像
Determine the gender of the person shown in this image.
确定图中所示人物的性别。
Race Estimation | Multiple Images
种族估计 | 多张图像
Between the two images, who seems more likely to belong to the Black race?
在这两张图片中,谁看起来更有可能属于黑人种族?
Race Estimation | Multiple Images
种族估计 | 多张图像
Which individual appears to be of Latino Hispanic origin?
哪位个体看起来是拉丁裔西班牙裔?
Race Estimation | Multiple Images
种族估计 | 多张图像
Which images depict individuals of Indian origin?
哪些图像描绘了印度裔个体?
A. Image 1, Image 2 B. Image 3 C. None of the above D. Image 2, Image 3
A. 图像 1, 图像 2
B. 图像 3
C. 以上都不是
D. 图像 2, 图像 3
Race Estimation | Multiple Images
种族估计 | 多张图像
Which images appear to have individuals of Southeast Asian descent?
哪些图像中似乎有东南亚裔的人?
A. Image 1, Image 2 B. Image 1 C. None of the above D. Image 2, Image 3
A. 图像 1, 图像 2
B. 图像 1
C. 以上都不是
D. 图像 2, 图像 3
Race Estimation | Multiple Images
种族估计 | 多张图像
How many images show people of Black race?
有多少张图片展示了黑人种族的人?
Race Estimation | Single Image
种族估计 | 单张图像
What is the race of the person in this image?
这张图片中的人是什么种族?
Race Estimation | Single Image
种族估计 | 单张图像
Identify the race of the person in this image.
识别此图像中人物的种族。
Race Estimation | Single Image
种族估计 | 单张图像
Which race category best describes the person in this image?
这张图片中的人属于哪个种族类别?
Race Estimation | Single Image
种族估计 | 单张图像
Select the most appropriate race for the person in this image.
选择此图像中人物最合适的种族。
Race Estimation | Single Image
种族估计 | 单张图像
Determine the race of the person shown in this image.
确定这张图片中显示的人的种族。
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of a person A. The same person A is present in which of the other images?
第一张图片是人物 A。人物 A 还出现在哪张其他图片中?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of a person A. The same person A is present in how many of the remaining images?
第一张图片是人物 A。人物 A 在剩余的图片中出现了多少次?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
How many unique identities are present in these images?
这些图像中存在多少个独特的身份?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of a person A. Which of the other images show person A?
第一张图片是人物A。其他图片中哪一张显示的是人物A?
A. Image 2, Image 3 B. Image 2, Image 4 C. Image 3, Image 4 D. None of the above
A. 图像 2, 图像 3
B. 图像 2, 图像 4
C. 图像 3, 图像 4
D. 以上都不是
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
Which two images show the same person?
哪两张图片显示的是同一个人?
A. Image 1, Image 2 B. Image 3, Image 4 C. Image 2, Image 5 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 4
C. 图像 2, 图像 5
D. 以上都不是
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of person A. Which images are not person A?
第一张图片是人物 A。哪些图片不是人物 A?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多张图像
There are two images of the same person. Which images are of people different from that person?
有两张同一个人的图像。哪些图像中的人与这个人不同?
A. Image 1, Image 2 B. Image 2, Image 3 C. Image 3, Image 4 D. None of the above
A. 图像 1, 图像 2
B. 图像 2, 图像 3
C. 图像 3, 图像 4
D. 以上都不是
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
How many pairs of images show the same person?
有多少对图像显示的是同一个人?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which image is of the celebrity {celebrity name}?
哪张图片是名人 {celebrity name} 的?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
How many images have the celebrity {celebrity name}?
有多少张图片中有名人 {celebrity name}?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which images have the celebrity {celebrity name}?
哪些图片中有名人 {celebrity name}?
A. Image 1, Image 2 B. Image 3, Image 2 C. Image 1 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 2
C. 图像 1
D. 以上都不是
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which images do not have the celebrity {celebrity name}?
哪些图片中没有名人 {celebrity name}?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which pair of images share the same celebrity?
哪对图像中的名人是同一位?
A. Image 1, Image 2 B. Image 3, Image 4 C. Image 2, Image 3 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 4
C. 图像 2, 图像 3
D. 以上都不是
Celebrity Identification | Multiple Images
名人识别 | 多张图像
How many unique celebrities are present in these images?
这些图像中有多少位独特的名人?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
What is the name of the most frequently occurring celebrity in these images?
这些图像中出现频率最高的名人名字是什么?
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Celebrity Identification | Single Image
名人识别 | 单张图像
What is the name of this celebrity?
这位名人的名字是什么?
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Celebrity Identification | Single Image
名人识别 | 单张图像
Who is this person?
这个人是谁?
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Celebrity Identification | Single Image
名人识别 | 单张图像
Name the celebrity shown in the image.
识别图中名人的姓名。
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which image shows a bonafide person?
哪张图片显示的是真实的人?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which image is spoof attacked?
哪张图片受到了欺骗攻击?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which statement best describes the images?
哪句话最能描述这些图像?
A. Both images are bonafide B. Both images are attacks C. Image 1 is bonafide and Image 2 is attack D. Image 1 is attack and Image 2 is bonafide
A. 两张图片均为真实图像
B. 两张图片均为攻击图像
C. 图片1为真实图像,图片2为攻击图像
D. 图片1为攻击图像,图片2为真实图像
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
How many images are bonafide?
有多少图像是真实的?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
How many images are spoof attacked?
有多少图像受到了欺骗攻击?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which images are not bonafide?
哪些图像不是真实的?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 3 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 3
D. 以上都不是
Face Anti-spoofing | Single Image
人脸反欺骗 | 单张图像
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which deepfakes belong to the same identity?
哪些深度伪造属于同一身份?
A. Image 1 and Image 2 B. Image 1 and Image 3 C. Image 2 and Image 4 D. None of the above
A. 图像1和图像2
B. 图像1和图像3
C. 图像2和图像4
D. 以上都不是
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
How many deepfakes belong to the same identity?
同一身份有多少个深度伪造?
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
How many images are real?
有多少图像是真实的?
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which images are deepfakes? A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1, Image 2, and Image 3 D. None of the above
哪些图像是深度伪造 (deepfake) 的?
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1、图像 2 和图像 3
D. 以上都不是
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
How many images are deepfakes?
有多少图像是深度伪造 (deepfakes)?
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which images are real and which are fake? A. Images 1 and 2 are real, Image 3 is fake B. Images 2 and 3 are real, Image 1 is fake C. Images 1 and 3 are real, Image 2 is fake D. All images are fake
哪些图像是真实的,哪些是假的?
A. 图像 1 和 2 是真实的,图像 3 是假的
B. 图像 2 和 3 是真实的,图像 1 是假的
C. 图像 1 和 3 是真实的,图像 2 是假的
D. 所有图像都是假的
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which images are not deepfakes?
哪些图像不是深度伪造 (deepfake)?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1
D. 以上都不是
Attribute Prediction | Single Image
属性预测 | 单张图像
Identify which of the following attributes is present in the image.
识别图像中存在的以下属性。
A. 5 o'Clock Shadow B. Blond Hair C. Eyeglasses D. None of the above
A. 五点钟胡茬 (5 o'Clock Shadow)
B. 金发
C. 眼镜
D. 以上都不是
Attribute Prediction | Single Image
属性预测 | 单张图像
Which of the following attributes is NOT present in the image?
以下哪个属性不在图像中?
A. Heavy Makeup B. Bald C. Black Hair D. None of the above
A. 浓妆
B. 秃头
C. 黑发
D. 以上都不是
Attribute Prediction | Single Image
属性预测 | 单张图像
Which attribute is the most prominent in the image?
图像中最突出的属性是什么?
Attribute Prediction | Multiple Images
属性预测 | 多图像
Which of the following attributes do both images have in common?
以下哪个属性是两张图像共有的?
Attribute Prediction | Multiple Images
属性预测 | 多张图像
How many images have a person with the attribute 'Smiling'?
有多少张图片中的人物具有“微笑”属性?
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which images have a person with the attribute {attribute type}?
哪些图像中的人物具有 {attribute type} 属性?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1
D. 以上都不是
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which of the following attributes are common to all images?
以下哪个属性是所有图像共有的?
A. Bald B. Bushy Eyebrows C. Black Hair D. None of the above
A. 秃头 B. 浓眉 C. 黑发 D. 以上都不是
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which attribute is unique to one of the images?
哪一项属性是其中一张图片独有的?
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which pair of images shares the most attributes?
哪对图像共享最多的属性?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 and Image 3 D. None of the above
A. 图像1和图像2
B. 图像2和图像3
C. 图像1和图像3
D. 以上都不是
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which attribute is most frequent among the images?
哪个属性在图像中出现频率最高?
Facial Expression Recognition | Multiple Images
面部表情识别 | 多张图像
How many images have a person with the expression {expression type}?
有多少张图片中的人表情为 {expression type}?
Facial Expression Recognition | Multiple Images
面部表情识别 | 多图像
Which images have a person with the expression {expression type}?
哪些图像中有表情为 {expression type} 的人?
Facial Expression Recognition | Multiple Images
面部表情识别 | 多张图像
Which images do not have a person with the expression {expression type}?
哪些图像中没有表情为 {expression type} 的人?
A. Image 1 and Image 2 B. Image 3 and Image 4 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 3 和图像 4
C. 图像 1
D. 以上都不是
Facial Expression Recognition | Multiple Images
面部表情识别 | 多图像
Which pair of images share the same expression?
哪对图像共享相同的表情?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 and Image 3 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1 和图像 3
D. 以上都不是
Facial Expression Recognition | Multiple Images
面部表情识别 | 多张图像
Which image has a person with the expression {expression type}?
哪张图片中的人表情为 {expression type}?
Facial Expression Recognition | Single Image
面部表情识别 | 单张图像
Which expression is the person showing in the image?
图中人物展示的是哪种表情?
A. Surprise B. Happy C. Sad D. Neutral
A. 惊讶
B. 开心
C. 悲伤
D. 中性
Facial Expression Recognition | Single Image
面部表情识别 | 单张图像
Identify the primary expression shown by the person in the image.
识别图像中人物表现的主要表情。
A. Fear B. Disgust C. Anger D. Neutral
A. 恐惧
B. 厌恶
C. 愤怒
D. 中性
Facial Expression Recognition | Single Image
面部表情识别 | 单张图像
What is the main expression of the person in the image?
图中人物的主要表情是什么?
A. Happy B. Sad C. Surprise D. Anger
A. 快乐 B. 悲伤 C. 惊讶 D. 愤怒
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the pitch angle of headpose orientation in range {pitch_range}?
哪些图像中的人物头部姿态的俯仰角在 {pitch_range} 范围内?
A. Image 1 and Image 2 B. Image 3 and Image 4 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 3 和图像 4
C. 图像 1
D. 以上都不是
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the roll angle of headpose orientation in range {roll_range}?
哪些图像中的人物头部姿态方向的滚转角在 {roll_range} 范围内?
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which pair of images have a person that share the same yaw angle bin index of headpose orientation, given that the total yaw angle range is from -100 to 100 degrees, with bins of 10 degrees each?
给定总偏航角范围为 -100 到 100 度,每个区间为 10 度,哪对图像中的人具有相同的头部姿态方向的偏航角区间索引?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 3 and Image 4 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 3 和图像 4
D. 以上都不是
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the yaw, pitch, and roll angles of headpose orientation between {yaw_range}, {pitch_range}, and {roll_range} degrees respectively?
哪些图像中的人物头部姿态的偏航角 (yaw)、俯仰角 (pitch) 和翻滚角 (roll) 分别在 {yaw_range}、{pitch_range} 和 {roll_range} 度之间?
Headpose Estimation | Multiple Images
头部姿态估计 | 多图像
Which images have a person with the yaw and pitch angles of headpose orientation between {yaw_range} and {pitch_range} degrees respectively?
哪些图像中的人物头部姿态方向的偏航角和俯仰角分别在 {yaw_range} 和 {pitch_range} 度之间?
A. Image 1 and Image 2 B. Image 3 and Image 4 C. Image 2 D. None of the above
A. 图像 1 和图像 2
B. 图像 3 和图像 4
C. 图像 2
D. 以上都不是
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the pitch and roll angles of headpose orientation between {pitch_range} and {roll_range} degrees respectively?
哪些图像中的人物头部姿态方向的俯仰角和横滚角分别在 {pitch_range} 和 {roll_range} 度之间?
Headpose Estimation | Single Image
头部姿态估计 | 单张图像
What is the yaw angle range of headpose orientation for the person in this image?
这张图片中人物的头部姿态方向的偏航角范围是多少?
A. -30 to -20 degrees B. -10 to 10 degrees C. 20 to 30 degrees D. None of the above
A. -30 到 -20 度
B. -10 到 10 度
C. 20 到 30 度
D. 以上都不是
Headpose Estimation | Single Image
头部姿态估计 | 单张图像
What is the pitch angle range of headpose orientation for the person in this image?
这张图片中人物的头部姿态方向的俯仰角范围是多少?
A. -15 to -5 degrees B. 0 to 10 degrees C. 15 to 25 degrees D. None of the above
A. -15 至 -5 度
B. 0 至 10 度
C. 15 至 25 度
D. 以上都不是
Headpose Estimation | Single Image
头部姿态估计 | 单张图像
What is the roll angle range of headpose orientation for the person in this image?
这张图片中人物的头部姿态旋转角度范围是多少?
A. -25 to -15 degrees B. -5 to 5 degrees C. 10 to 20 degrees D. None of the above
A. -25 至 -15 度
B. -5 至 5 度
C. 10 至 20 度
D. 以上都不是
Crowd Counting | Single Image
人群计数 | 单张图像
How many people are present in this image?
这张图片中有多少人?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
What is the number of individuals shown in this image?
这张图片中显示了多少人?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
Determine the number of people in this picture.
确定这张图片中的人数。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
Estimate the count of people present in this image.
估计此图像中存在的人数。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
Please identify the number of individuals in this image.
请识别此图像中的人数。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Face Parsing | Single Image
人脸解析 | 单张图像
Which of the following regions is not present or is segmented out with white color?
以下哪个区域不存在或被白色分割?
A. left eyebrow B. glasses C. right eye D. left ear
A. 左眉 B. 眼镜 C. 右眼 D. 左耳
Face Parsing | Single Image
人脸解析 | 单张图像
Which region is segmented out with white color?
哪个区域被白色分割出来?
Tools Retrieval | Text Only
工具检索 | 仅文本
A video conferencing platform needs to ensure user engagement by tracking head pose, verifying expressions over time, and confirming that the detected face is real and not a spoof. Expression tracking should continue only if head pose confidence is high. What is the correct sequence of API calls?
一个视频会议平台需要通过跟踪头部姿态、随时间验证表情以及确认检测到的面部是真实的而非伪造的来确保用户参与度。只有在头部姿态置信度高的情况下,才应继续表情跟踪。正确的 API 调用顺序是什么?
A. api_4-detect spoofing, api_11-estimate head pose, api_11-pose confidence score, api_10-track expression over time B. api_11-estimate head pose, api_11-pose confidence score, api_10-track expression over time, api_4-detect spoofing C. api_11-estimate head pose, api_11-pose confidence score, api_10-track expression over time, api_4-detect spoofing D. api_4-detect spoofing, api_10-track expression over time, api_11-estimate head pose, api_11-pose confidence score
A. api_4-检测欺骗, api_11-估计头部姿态, api_11-姿态置信度得分, api_10-随时间跟踪表情
B. api_11-估计头部姿态, api_11-姿态置信度得分, api_10-随时间跟踪表情, api_4-检测欺骗
C. api_11-估计头部姿态, api_11-姿态置信度得分, api_10-随时间跟踪表情, api_4-检测欺骗
D. api_4-检测欺骗, api_10-随时间跟踪表情, api_11-估计头部姿态, api_11-姿态置信度得分
Tools Retrieval | Text Only
工具检索 | 仅文本
An AR app segments users’ faces into regions, detects head pose, and applies filters based on gender and expressions. Expression analysis is performed only if gender is classified with high confidence. Which sequence of API calls is appropriate?
一个 AR 应用将用户的面部分割成多个区域,检测头部姿态,并根据性别和表情应用滤镜。只有在性别分类具有高置信度时,才会进行表情分析。哪种 API 调用序列是合适的?
Tools Retrieval | Text Only
工具检索 | 仅文本
A high-security facility verifies individuals’ identities, estimates age, and monitors expressions. Age estimation is only done if expression confidence is above a threshold. What is the appropriate API function sequence?
高安全性设施验证个人身份、估计年龄并监控表情。仅当表情置信度高于阈值时,才会进行年龄估计。合适的 API 函数调用顺序是什么?
A. api_7-identify high res face, api_1-predict age, api_10-detect expression, api_10-get emotion probabilities B. api_7-identify high res face, api_10-get emotion probabilities, api_10-detect expression, api_1-predict age C. api_7-identify high res face, api_10-detect expression, api_10-get emotion probabilities, api_1-predict age D. api_10-detect expression, api_7-identify high res face, api_1-predict age, api_10-get emotion probabilities
A. api_7-识别高分辨率人脸, api_1-预测年龄, api_10-检测表情, api_10-获取情绪概率
B. api_7-识别高分辨率人脸, api_10-获取情绪概率, api_10-检测表情, api_1-预测年龄
C. api_7-识别高分辨率人脸, api_10-检测表情, api_10-获取情绪概率, api_1-预测年龄
D. api_10-检测表情, api_7-识别高分辨率人脸, api_1-预测年龄, api_10-获取情绪概率
Tools Retrieval | Text Only
工具检索 | 仅文本
For a security checkpoint, the system detects deepfakes, checks for spoofing, and estimates head pose. Spoof detection is mandatory before head pose estimation if deepfake confidence is high. What is the correct sequence?
对于安全检查点,系统会检测深度伪造 (deepfake)、检查欺骗行为,并估计头部姿态。如果深度伪造置信度高,则在头部姿态估计之前必须进行欺骗检测。正确的顺序是什么?
A. api_11-estimate head pose, api_5-detect deep fake, api_4-detect spoofing B. api_5-detect deep fake, api_11-estimate head pose, api_4-spoof confidence score C. api_5-detect deep fake, api_4-detect spoofing, api_11-estimate head pose D. api_4-detect spoofing, api_11-estimate head pose, api_5-detect deep fake
A. api_11-估计头部姿态, api_5-检测深度伪造, api_4-检测欺骗
B. api_5-检测深度伪造, api_11-估计头部姿态, api_4-欺骗置信度评分
C. api_5-检测深度伪造, api_4-检测欺骗, api_11-估计头部姿态
D. api_4-检测欺骗, api_11-估计头部姿态, api_5-检测深度伪造
Tools Retrieval | Text Only
工具检索 | 仅文本
An interactive museum exhibit uses head pose tracking and demographic analysis (age, gender, race) to tailor content for visitors. Race prediction is skipped if age confidence falls below a certain threshold. Which API sequence would you use?
一个互动博物馆展览使用头部姿态跟踪和人口统计(年龄、性别、种族)分析来为访客定制内容。如果年龄置信度低于某个阈值,则跳过种族预测。你会使用哪个 API 序列?
C.2. Generating Distractor Options
C.2. 生成干扰选项
We strategically design the distractor options to encourage MLLMs to carefully choose an option as the prediction. The following methods are employed for different types of tasks:
我们策略性地设计干扰选项,以鼓励多模态大语言模型 (MLLMs) 仔细选择一个选项作为预测。针对不同类型的任务,采用了以下方法:
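For numeric tasks such as age estimation or crowd counting, one plausible strategy is to offset the ground-truth value so that distractors stay close enough to be tempting yet never equal the correct answer. The helper below is an illustrative sketch under that assumption, not the rules actually used to build FaceXBench:

```python
import random

def numeric_distractors(correct: int, step: int = 10, k: int = 3) -> list:
    """Hypothetical helper: build k distractors by offsetting the correct
    value with non-zero multiples of `step`, keeping results non-negative."""
    offsets = [m * step for m in (-3, -2, -1, 1, 2, 3)]
    candidates = [correct + o for o in offsets if correct + o >= 0]
    return random.sample(candidates, k)

options = numeric_distractors(25) + [25]  # three distractors plus the answer
```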
C.3. FaceXAPI dataset prompt
C.3. FaceXAPI 数据集提示
To generate the FaceXAPI questions, we design a detailed prompt incorporating 13 API calls and a total of 32 functions. Additional guidelines are provided to ensure scenario realism, functional complexity, and diversity in the questions. We also include instructions for generating distractor options, emphasizing logical plausibility and ensuring they are sufficiently close yet distinct from the correct answer. A total of 150 questions are generated, out of which 100 are carefully selected to maintain diversity. To further enhance the quality, the options are manually reviewed to ensure the presence of one correct answer alongside logically plausible distractors. The detailed prompt for question generation is provided below.
为了生成 FaceXAPI 问题,我们设计了一个详细的提示,包含 13 个 API 调用和总共 32 个函数。提供了额外的指导方针,以确保场景的真实性、功能的复杂性和问题的多样性。我们还包含了生成干扰选项的说明,强调逻辑上的合理性,并确保它们与正确答案足够接近但又有所区别。总共生成了 150 个问题,其中 100 个经过精心挑选以保持多样性。为了进一步提高质量,选项经过人工审查,以确保存在一个正确答案以及逻辑上合理的干扰项。以下是问题生成的详细提示。
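Part of that manual review can be automated. As a hypothetical sketch (the `Answer:` line format is our assumption about how generated items might be serialized), each generated item can be checked for exactly four A-D options and a single marked answer before selection:

```python
import re

def validate_mcq(item: str) -> bool:
    """Hypothetical filter for generated FaceXAPI items: require exactly
    four options labelled A-D and exactly one 'Answer: X' line."""
    labels = re.findall(r"^([A-D])\.", item, flags=re.MULTILINE)
    answers = re.findall(r"^Answer:\s*([A-D])\s*$", item, flags=re.MULTILINE)
    return sorted(labels) == ["A", "B", "C", "D"] and len(answers) == 1

sample = ("What is the correct sequence of API calls?\n"
          "A. api_1\nB. api_2\nC. api_3\nD. api_4\nAnswer: C\n")
```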
You are an AI tasked with generating complex, real-world scenario questions to assess a model’s ability to select the correct API and function calls to accomplish nuanced tasks. Use the list of APIs and functions provided below.
你是一个AI,任务是生成复杂的现实场景问题,以评估模型选择正确API和函数调用来完成细致任务的能力。使用下面提供的API和函数列表。
1. Age Estimation:
1. 年龄估计:
api_name: api_1
• predict age:
Description: Predicts the age of the person in the input face image.
Input: np.ndarray or str - The input face image.
Output: int - The estimated age.
• age confidence score:
Description: Provides a confidence score for the age estimation.
Input: dict - Output from the age estimation model.
Output: float - Confidence score of the age prediction.
api_name: api_1
• 预测年龄 (predict age):
描述: 预测输入人脸图像中人物的年龄。
输入: np.ndarray 或 str - 输入的人脸图像。
输出: int - 估计的年龄。
• 年龄置信度分数 (age confidence score):
描述: 提供年龄估计的置信度分数。
输入: dict - 年龄估计模型的输出。
输出: float - 年龄预测的置信度分数。
2. Gender Prediction:
api_name: api_2
• classify gender: Description: Classifies the gender of the person in the face image. Input: np.ndarray or str - The input face image. Output: str - The predicted gender ('male' or 'female').
• get gender probabilities: Description: Returns probabilities for each gender class. Input: np.ndarray - The input face image. Output: dict - Probabilities for 'male' and 'female' classes.
3. Race Detection:
api_name: api_3
• predict race: Description: Predicts the race of the individual(s) in the input image(s). Input: str, bytes, np.ndarray, or list - The input image or batch of images. Output: list - Predicted races for each detected face, including race labels and probabilities.
• get race probabilities: Description: Returns probability distribution over races for detected faces. Input: str, bytes, np.ndarray, or list - The input image or batch of images. Output: list - For each detected face, a dictionary of race probabilities.
• race confidence score: Description: Provides a confidence score for race prediction for each detected face. Input: str, bytes, np.ndarray, or list - The input image(s). Output: list - Confidence scores for race predictions.
4. Face Anti-Spoofing:
api_name: api_4
• detect spoofing: Description: Detects if the face in the image is real or a spoof. Input: np.ndarray or str - The input face image. Output: bool - True if spoof detected, False if face is real.
• spoof confidence score: Description: Provides a confidence score indicating the likelihood of spoofing. Input: dict - Output from the anti-spoofing model. Output: float - Spoofing confidence score.
5. Deepfake Detection:
api_name: api_5
6. Low-Resolution Face Recognition:
api_name: api_6
7. High-Resolution Face Recognition:
api_name: api_7
8. Celebrity Identification:
api_name: api_8
9. Attributes Prediction:
api_name: api_9
• detect attributes: Description: Detects various attributes of the face in the image. Input: np.ndarray or str - The input face image. Output: dict - Detected attributes and their values.
• list face attributes: Description: Lists all possible face attributes that can be predicted. Input: None Output: list - List of attribute names.
• attribute confidence score: Description: Provides confidence scores for each predicted attribute. Input: dict - Output from the attribute detection model. Output: dict - Confidence scores for each attribute.
10. Facial Expression Recognition:
api_name: api_10
• detect expression: Description: Detects facial expressions in the input image. Input: np.ndarray or str - The input face image. Output: str - The detected expression label.
• get emotion probabilities: Description: Provides probabilities for each emotion class. Input: np.ndarray - The input face image. Output: dict - Probabilities for each emotion class.
• track expression over time: Description: Tracks facial expressions over a sequence of frames. Input: list of np.ndarray - List of frames from a video. Output: list - Sequence of detected expressions.
11. Headpose Estimation:
api_name: api_11
• estimate head pose: Description: Estimates the head pose angles (yaw, pitch, roll) from the face image. Input: np.ndarray or str - The input face image. Output: tuple - Estimated angles (yaw, pitch, roll).
• pose confidence score: Description: Provides a confidence score for the estimated head pose. Input: dict - Output from the head pose estimation model. Output: float - Confidence score for the head pose estimation.
12. Crowd Counting:
api_name: api_12
• estimate crowd size: Description: Estimates the size of the crowd based on face detection. Input: np.ndarray - The input image. Output: int - Estimated number of people in the crowd.
• aggregate counting data: Description: Aggregates counting data over multiple images or frames. Input: list of np.ndarray - List of images. Output: dict - Aggregated counting results.
13. Face Segmentation:
api_name: api_13
• segment face regions: Description: Segments different regions of the face in the image. Input: np.ndarray - The input face image. Output: np.ndarray - Segmentation mask of the face regions.
• classify face parts: Description: Classifies different parts of the face into categories. Input: np.ndarray - The input face image. Output: dict - Dictionary of face parts and their classifications.
• get segmentation mask part: Description: Generates a segmentation mask for a particular face part. Input: np.ndarray - The input face image. Output: np.ndarray - Binary mask of the segmented face part.
Each scenario should require the model to accurately retrieve and execute 3 to 5 function calls across multiple APIs, simulating the complexity and sequential decision-making needed in real-world applications.
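To make the retrieval setting concrete, the sketch below shows one way such a chain of API function calls could be dispatched by name. The stub implementations and their return values are hypothetical placeholders for illustration, not part of FaceXAPI.

```python
# Hypothetical stubs standing in for two of the api_11 functions; real
# implementations would run head-pose models on the input image.
def estimate_head_pose(face_image):
    return (5.0, 2.0, 0.5)  # stub (yaw, pitch, roll)

def pose_confidence_score(pose_output):
    return 0.95  # stub confidence for the pose estimate

# Registry mapping FaceXAPI-style call names to callables.
REGISTRY = {
    "api_11-estimate head pose": estimate_head_pose,
    "api_11-pose confidence score": pose_confidence_score,
}

def run_chain(chain, face_image):
    """Execute calls in order, feeding each call the previous result
    (a simplification of how outputs flow between the listed functions)."""
    result = face_image
    outputs = []
    for call_name in chain:
        result = REGISTRY[call_name](result)
        outputs.append(result)
    return outputs
```

Running the two api_11 calls in sequence yields the pose tuple followed by its confidence score, mirroring the sequential decision-making the questions are meant to probe.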
Guidelines for Generating Questions:
Guidelines for Generating Options:
• Complete API Chains: Provide four option chains, each specifying a complete sequence of API function calls in the correct order necessary to solve the task. One of these sequences should be the correct answer, while the other three should be close but logically incorrect.
• Logical Plausibility of Distractors: Ensure distractors are constructed logically and appear plausible. Avoid overly obvious incorrect answers; the distractors should require reasoning to eliminate, demanding attention to function descriptions and careful thought to arrive at the correct answer.
• Randomized Answer Positioning: Shuffle the options to ensure the correct answer appears randomly in position A, B, C, or D.
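The randomized-positioning guideline can be sketched as a small helper; the function below and its labeling scheme are illustrative assumptions, not the exact generation code.

```python
import random

def shuffle_options(option_texts, correct_index, rng=None):
    """Shuffle answer options so the correct one lands uniformly in
    position A, B, C, or D, and return the new correct label."""
    rng = rng or random.Random()
    order = list(range(len(option_texts)))
    rng.shuffle(order)
    labels = ["A", "B", "C", "D"]
    shuffled = [f"({labels[i]}) {option_texts[j]}" for i, j in enumerate(order)]
    correct_label = labels[order.index(correct_index)]
    return shuffled, correct_label
```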
We provide an example question and output template, which must be strictly followed when providing output.
Example Question:
In an airport security system, faces are checked for deepfakes, head poses are verified, and age is estimated. Age is analyzed only if head pose confidence is high. Which API sequence should be applied?
A. api_5-detect deepfake, api_1-predict age, api_11-pose confidence score, api_11-estimate head pose
B. api_11-estimate head pose, api_11-pose confidence score, api_5-detect deepfake, api_1-predict age
C. api_5-detect deepfake, api_11-estimate head pose, api_11-pose confidence score, api_1-predict age
D. api_11-pose confidence score, api_5-detect deepfake, api_11-estimate head pose, api_1-predict age
Correct Answer: C. api_5-detect deepfake, api_11-estimate head pose, api_11-pose confidence score, api_1-predict age
Provide 150 questions. Remember to follow the JSON structure and guidelines strictly. Ensure each question and answer chain is unique and fits the described requirements.


C.4. Answer Extraction
We design an extract option label function to robustly extract an option label from a given model output, using a three-step fallback mechanism to ensure accurate identification. The first step cleans the model output by removing leading and trailing whitespace. A series of regular expression (regex) patterns are then applied to match common formats of option labels, such as A, (A), A), Option A, A:, A-, specifically at the start of the text. If a match is found, the label is validated against the provided list of option labels and returned if valid. If no valid label is identified at the start, the second fallback step searches the entire model output for matches to these regex patterns and again validates the extracted labels against the option labels, allowing flexibility in label positioning. If both regex-based approaches fail, the third and final fallback step compares the model output directly with the exact text of the provided options. It checks whether the entire model output or any substring within it matches an option text and maps it to the corresponding label. If no label is found after these steps, the function returns None. This layered process minimizes false positives by progressively narrowing down matches using robust regex and exact comparisons.
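A minimal sketch of such a three-step fallback is shown below; the function name, exact regex patterns, and validation details are illustrative assumptions rather than the benchmark's exact implementation.

```python
import re

def extract_option_label(model_output, option_labels, options):
    """Three-step fallback: (1) match a label at the start of the cleaned
    text, (2) search the whole text, (3) compare against option texts.
    Returns the matched label, or None. Illustrative sketch only."""
    text = model_output.strip()  # step 0: clean whitespace
    # Patterns for common label formats: "(A)", "A)/A:/A-/A.", "Option A", bare "A".
    start_patterns = [
        r"^\(([A-Z])\)",
        r"^([A-Z])[\)\:\-\.]",
        r"^Option\s+([A-Z])\b",
        r"^([A-Z])\b",
    ]
    # Step 1: look for a valid label at the start of the output.
    for pat in start_patterns:
        m = re.match(pat, text)
        if m and m.group(1) in option_labels:
            return m.group(1)
    # Step 2: fall back to searching anywhere in the output.
    for pat in start_patterns:
        for m in re.finditer(pat.lstrip("^"), text):
            if m.group(1) in option_labels:
                return m.group(1)
    # Step 3: compare the output against the exact option texts.
    lowered = text.lower()
    for label, option_text in zip(option_labels, options):
        if option_text and option_text.lower() in lowered:
            return label
    return None
```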
C.5. Dataset Format
The benchmark consists of multiple JSON files, with each dataset represented in a JSON file named in the format <dataset_name>_<multiple/single>.json. These files include detailed metadata and questions for evaluation, enabling easy analysis and comprehensive testing. The JSON dataset format is designed to enable the evaluation of MLLMs on specific tasks or subsets of tasks. A sample structure of the JSON file is provided below.
The components in the JSON file are as follows:
This JSON format allows the community to evaluate MLLMs on a subset of categories or tasks, providing flexibility and consistency in evaluation. The inclusion of both the correct answer option and the answer text supports diverse evaluation strategies.
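To illustrate how the per-dataset JSON files can drive task-level evaluation, the loader below filters entries by a task field; the field names used here ("task", "question") are assumptions for this sketch, not the benchmark's exact schema.

```python
import json

def load_questions(json_path, task=None):
    """Load a FaceXBench-style JSON file and optionally keep only the
    entries for one task. Field names are illustrative assumptions."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if task is not None:
        data = [entry for entry in data if entry.get("task") == task]
    return data
```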
C.6. Other Evaluation Settings
In the in-context evaluation setting, we prepend the task-description before the question to provide relevant context to the model. The prepend text for each task is as follows:
answer.’, to encourage the model to reason before making its final prediction.
答案。', 以鼓励模型在做出最终预测之前进行推理。
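The three evaluation settings differ only in how the question text is wrapped; a sketch of that composition is below, where the chain-of-thought instruction string is a stand-in, not the exact prompt used in the benchmark.

```python
def build_prompt(question, setting, task_description=None):
    """Compose the evaluation prompt for one of the three settings.
    The chain-of-thought wording here is a placeholder."""
    if setting == "in-context" and task_description:
        return f"{task_description}\n{question}"  # prepend task description
    if setting == "chain-of-thought":
        # Appended instruction encouraging reasoning before the final answer.
        return f"{question} Think step by step before giving the final answer."
    return question  # zero-shot: the question alone
```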
C.7. Implementation Details
To ensure reproducibility and efficiency, we utilized the open-source VLMEvalKit [21] for all experiments. The VLMEvalKit provides a comprehensive framework for evaluating vision-language models, streamlining the experimental workflow. Below, we outline the key libraries and parameter settings used in our implementation:
D. Results
In this section, we provide additional results as an extension of the main paper. We also showcase several failure cases and include more details on the baselines to facilitate easier reproducibility.
D.1. Baselines
Table D.1. Vision encoder, LLM and LLM size of the baseline models
| Model | Vision Encoder | LLM | LLM Size |
|---|---|---|---|
| PaliGemma [9] | SigLIP-So400m | Gemma-2B | 2B |
| LLaVA-OneVision-0.5b-OV [47] | SigLIP-So400m | Qwen2-0.5b | 0.5B |
| VILA 1.5-3b [54] | SigLIP-So400m | Sheared LLaMA-2.7b | 2.7B |
| Eagle-X4-8B-Plus [85] | Hybrid vision encoders | Llama3-8b-instruct | 8B |
| Idefics2-8b [41] | SigLIP-So400m | Mistral-7B-v0.1 | 7B |
| Idefics-9b-instruct [40] | CLIP-ViT-H-14-laion2B-s32B-b79K | Llama-7b | 7B |
| LLaVA-v1.5-7b [57] | CLIP ViT-L-14 | Vicuna-7b | 7B |
| Monkey-Chat [53] | ViT-BigG-2b | Qwen-7b | 7B |
| MiniCPM-Llama3-v-2.5 [114] | SigLIP-So400m | Llama3-8B-Instruct | 8B |
| LLaVA-OneVision-7b-SI [47] | SigLIP-So400m | Qwen-7b | 7B |
| LLaVA-NeXT-Interleave-7b [48] | SigLIP-So400m | Qwen1.5-7b | 7B |
| Phi-3.5-Vision [1] | CLIP ViT-L/14 | Phi-3.5-mini | 3.8B |
| LLaVA-OneVision-7b-OV [47] | SigLIP-So400m | Llama3-8B-Instruct | 8B |
| Qwen2-VL-7b-Instruct [103] | ViT-H | Qwen2 | 7B |
| InternVL2-8B [12] | InternViT-300M-448px | InternLM2_5-7b-chat | 7B |
| CogVLM2-19b [26] | EVA-CLIP | Llama-3-8B-Instruct | 8B |
| Idefics-80b-instruct [40] | CLIP-ViT-H-14-laion2B-s32B-b79K | Llama-65b | 65B |
| LLaVA-v1.5-13b [57] | CLIP ViT-L-14 | Vicuna-13b | 13B |
| VILA 1.5-13b [54] | SigLIP-So400m | Vicuna-13b | 13B |
| Mantis-SIGLIP-8b [32] | SigLIP-So400m | Llama-8b | 8B |
| InternVL-Chat-v1.5 [12] | InternViT-6B-448px-V1-5 | InternLM2-Chat-20B | 20B |
| LLaVA-OneVision-72b-OV [47] | SigLIP-So400m | Qwen-72B | 72B |
| Qwen2-VL-72b-Instruct [103] | ViT-H | Qwen2 | 72B |
| InternVL2-76B [12] | InternViT-6B-448px-V1-5 | Hermes-2-Theta-Llama-3-70B | 70B |
D.2. Results Across Tasks
To provide additional insights, we summarize the task-wise performance of various models in Table D.2. We observe that models struggle with tasks such as head pose estimation, crowd counting, and deepfake detection. Conversely, tasks like gender prediction, facial expression recognition, and high-resolution face recognition are comparatively easier. Additionally, we present the results of selected models under the "in-context" and "chain-of-thought" evaluation settings in Table D.3 and Table D.4, respectively. We observe a consistent drop in performance, indicating that the models are not capable of effectively processing in-context information and fail to reason in tasks related to face understanding.
Table D.2. Performance of various models across all evaluation tasks of FaceXBench.
[Table D.2 reports per-task accuracy (%) for the Random Choice and Frequent Choice baselines and all evaluated MLLMs, grouped into open-source models (< 4B, 4B - 13B, and > 13B parameters) and proprietary models, across all 14 tasks together with overall scores.]
| Task | Qwen2-VL-7B-Instruct | Phi-3.5-Vision | InternVL2-8B |
|---|---|---|---|
| Deepfake Detection | 27.00 | 31.00 | 32.00 |
| Face Parsing | 40.00 | 78.75 | 66.25 |
| Headpose Estimation | 24.25 | 27.25 | 28.75 |
| Gender Prediction | 62.80 | 72.80 | 74.00 |
| Attributes Prediction | 42.75 | 58.50 | 48.75 |
| Tools Retrieval | 39.00 | 42.00 | 48.00 |
| Face Anti-spoofing | 43.00 | 45.00 | 48.50 |
| Celebrity Identification | 51.33 | 55.67 | 57.67 |
| Crowd Counting | 30.33 | 22.33 | 27.67 |
| Race Estimation | 46.40 | 54.40 | 63.20 |
| Age Estimation | 33.20 | 38.00 | 40.60 |
| Expression Recognition | 53.00 | 51.75 | 58.00 |
| Overall | 41.48 | 50.32 | 52.14 |
Table D.4. (b) Performance of models in the chain-of-thought evaluation setting.
Table D.3. (a) Performance of models in the in-context evaluation setting.
D.3. Failure Cases
We conducted a failure case analysis using top-performing models, namely GPT-4o, Qwen2-VL, and InternVL2. As part of this analysis, we present one question from each task to illustrate the failure cases. Our observations reveal that these models struggle to answer basic questions related to face understanding, highlighting the need for a dedicated benchmark to monitor the progress of MLLMs in this domain.
Age Estimation
Question: <image 1><image 2> Identify the person who is most likely in the 20-29 age group. Option:


GPT-4o: (A) None of the above Prediction: ✗
Qwen2-VL: C Prediction: ✗
InternVL2: (B) Image 1 Prediction: ✗
Ground Truth: (D) Both Images
Gender Prediction

Question: <image 1><image 2><image 3> Which images depict female individuals? Option:


GPT-4o: (D) Image 1 Prediction: ✗
Qwen2-VL: C Prediction: ✓
InternVL2: (C) Image 3, Image 1 Prediction: ✓
Ground Truth: (C) Image 3, Image 1
Race Estimation

Question: <image 1><image 2><image 3> Which images depict people of the same race? Option:
(A) Image 2, Image 3 (B) Image 1, Image 2 (C) None of the above (D) Image 1, Image 2, Image 3


GPT-4o: (C) None of the above Prediction: ✓
Qwen2-VL: B Prediction: ✗
InternVL2: (B) Image 1, Image 2 Prediction: ✗
Ground Truth: (C) None of the above
High-Resolution Face Recognition
Question: <image 1><image 2><image 3><image 4> How many unique identities are present in these images?


GPT-4o: B Prediction: ✗
Qwen2-VL: C Prediction: ✗
InternVL2: (C) 4 Prediction: ✗
Ground Truth: (A) 3
Low-Resolution Face Recognition
Question: <image 1><image 2><image 3><image 4><image 5> The first image is of person A. The same person A is present in which of these images?
Option:


GPT-4o: I can’t determine if the same person is in multiple images.
Prediction: ✗
Qwen2-VL: D Prediction: ✗
InternVL2: (D) Image 2 and Image 5 Prediction: ✗
Ground Truth: (A) Image 3 and Image 5

Celebrity Identification

<image 1>
GPT-4o: I don’t know who this person is. Prediction: ✗
Qwen2-VL: D Prediction: ✗
InternVL2: (D) Ashley Eckstein Prediction: ✗
Ground Truth: (A) Alison Brie
Face Anti-Spoofing

Question: <image 1><image 2> Which statement best describes the images? Option:
(A) Image 1 is bonafide and Image 2 is attack (B) Image 1 is attack and Image 2 is bonafide (C) Both images are bonafide (D) Both images are attacks


GPT-4o: B Prediction: ✗
Qwen2-VL: B Prediction: ✗
InternVL2: (B) Image 1 is attack and Image 2 is bonafide Prediction: ✗
Ground Truth: (D) Both images are attacks
Deepfake Detection
Question: <image 1><image 2><image 3> How many images are real? Option:


GPT-4o: (A) 2 Prediction: ✓
Qwen2-VL: C Prediction: ✗
InternVL2: (C) 3 Prediction: ✗
Ground Truth: (A) 2
Attributes Prediction

Question: <image 1><image 2><image 3> Which images have a person with the attribute 'Pointy Nose'?


GPT-4o: (B) Image 1, Image 3 Prediction: ✗
Qwen2-VL: B Prediction: ✗
InternVL2: (C) Image 1 Prediction: ✓
Ground Truth: (C) Image 1


GPT-4o: A Prediction: ✗
Qwen2-VL: A Prediction: ✗
InternVL2: (D) Image 2 Prediction: ✓
Ground Truth: (D) Image 2
Question: <image 1><image 2><image 3><image 4> How many images have a person with the yaw angle of head pose orientation in range 0 to 10?

Face Parsing

Question: <image 1> Which of the following regions is not present or is segmented out with white color? Option:
(A) nose (B) left eye (C) hair (D) face

<image 1>

GPT-4o: A Prediction: ✗
Qwen2-VL: B Prediction: ✓
InternVL2: (C) hair Prediction: ✗
Ground Truth: (B) left eye
Crowd Counting

Question: <image 1> Estimate the count of people present in this image. Option:


GPT-4o: C) 35 to 39 Prediction: ✗
Qwen2-VL: D Prediction: ✗
InternVL2: (B) 25 to 29 Prediction: ✓
Ground Truth: (B) 25 to 29
Face Tools Retrieval

Question: At a large gathering, the system identifies celebrities and tracks their expressions. Expressions are ignored if a spoof attempt is detected. Which sequence of API function calls is suitable?
Option:
(A) api_8-identify celebrity, api_10-track expression over time, api_4-detect spoofing
(B) api_8-identify celebrity, api_10-detect spoofing, api_4-track expression over time
(C) api_8-identify celebrity, api_4-spoof confidence score, api_10-track expression over time
(D) api_8-identify celebrity, api_4-spoof confidence score, api_10-detect expression
GPT-4o: A Prediction: ✗
Qwen2-VL: A Prediction: ✗
InternVL2: A Prediction: ✗
Ground Truth: (B) api_8-identify celebrity, api_10-detect spoofing, api_4-track expression over time
E. Ethical Considerations
In this work, we have ensured that all datasets were obtained exclusively from their official repositories to maintain their integrity, authenticity, and alignment with the original intent of the data creators. This practice mitigates risks associated with data tampering or unauthorized use. Wherever necessary, we have reviewed, acknowledged, and signed the appropriate license agreements to fully comply with the terms and conditions specified by the data and model providers. By doing so, we aim to respect intellectual property rights and uphold the highest ethical standards. Additionally, to promote transparency and reproducibility, we provide source links to all open-source models utilized in our study.
