FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs’ face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization, and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding.
1. Introduction
Recent progress in Large Language Models (LLMs) [2, 7, 99] has showcased their impressive ability to understand, reason, and generate text across diverse open-ended tasks. Building on these advancements, multimodal large language models (MLLMs) [49, 50, 57, 63, 115, 135] have rapidly emerged, using LLMs as a brain to process multimodal inputs including images, videos, and audio. These models [1, 12, 26, 47, 54] demonstrate remarkable performance and excel in challenging visual question answering (VQA) benchmarks such as MMMU [119], MME [117], MM-Bench [60], SEEDBench [46], MMAU [116], MMIE [107], and MathVista [62]. These benchmarks evaluate MLLMs in various domains, including college-level subject knowledge, mathematical reasoning, commonsense reasoning, planning, coding, diagram understanding, and spatio-temporal understanding. Despite this progress, prominent MLLMs such as GPT-4o [30] and GeminiPro 1.5 [96] struggle with basic face understanding tasks, as illustrated in Figure 1.
Before exploring MLLMs’ face understanding capabilities, it is essential to address a fundamental question: “Why should MLLMs be proficient in face understanding?” MLLMs are increasingly deployed as central processors in various advanced applications, including virtual-reality headsets [38], embodied AI [20, 29], driving safety [31, 89], authentication [18], human-computer interaction [100], and sports analysis [106]. In these applications, face images appear frequently and require accurate face understanding for appropriate responses. However, the face understanding capabilities of existing MLLMs are limited; they often fail to answer basic questions such as “What is the expression of the person in this image?” or “Which of the following regions is not present in the face image?” These shortcomings indicate significant scope for improvement. Recent works, such as EMO-LLaMA [109], develop an instruction-tuning set for supervised fine-tuning to enhance MLLMs’ capabilities in understanding facial expressions. Face-MLLM [91] proposes a three-stage training pipeline to equip MLLMs with face perception capabilities. Nonetheless, the research community currently lacks a standardized benchmark to quantify and compare MLLMs’ performance in face understanding. We believe that a comprehensive benchmark incorporating various aspects of face understanding is a crucial foundational step for monitoring progress and advancing MLLMs’ performance in this domain.
To this end, we propose FaceXBench, a comprehensive benchmark designed for complex face understanding and related tasks. We identify six broad categories, each encompassing distinct tasks essential for complete face understanding within the context of MLLMs:
- Bias and Fairness: Age Estimation (Age), Gender Prediction (Gender), and Race Estimation (Race);

Figure 1. Left: Failure cases of leading MLLMs, such as LLaVA-OV [47], Qwen2-VL [103], GPT-4o [30] and GeminiPro 1.5 [96], on basic questions related to face understanding. Right: Performance comparison of top models across the 14 tasks included in the benchmark.
Recent works [75, 93] emphasize that, although existing open-source MLLMs have achieved advanced capabilities, they still lack the sophistication to perform complex tasks. To address this, these approaches equip MLLMs with tool use, extending their capabilities by leveraging external tools to fulfill complex human instructions that require task-specific processing. Motivated by this, we introduce FaceXAPI, a new dataset that forms part of FaceXBench. It is designed to evaluate MLLMs’ ability to select the appropriate API and function calls for handling complex tasks in face understanding. FaceXBench comprises 5,000 carefully and manually filtered VQA-type questions derived from 25 public datasets, covering 14 tasks. It includes 10,441 unique face images representing a diverse range of age groups, genders, races, resolutions, head poses, expressions, and attributes, reflecting the diversity of faces encountered in real-world scenarios.
We conduct extensive experiments and benchmark 26 open-source MLLMs and 2 advanced proprietary MLLMs, GPT-4o [30] and GeminiPro-1.5 [96]. We analyze the models across three evaluation settings: (1) zero-shot, (2) in-context task description, and (3) chain-of-thought (CoT) prompting. Our findings reveal two main takeaways: First, existing MLLMs struggle with tasks like deepfake detection and crowd counting, which require fine-grained visual analysis to detect subtle inconsistencies and the ability to recognize and adapt to faces at varying scales. The performance of top models across various tasks is illustrated in Figure 1. Second, attempts to leverage the face knowledge of MLLMs through in-context prompting lead to performance drops, indicating a struggle to utilize contextual information effectively. Chain-of-thought prompting similarly fails to improve performance, suggesting that while these models have reasoning capabilities, they cannot apply them to face-related tasks, limiting their ability to make nuanced interpretations or adjustments. Finally, we present experiments highlighting the potential of diverse supervised fine-tuning and tool use as promising directions for enhancing MLLMs’ face understanding capabilities.
Our key contributions are as follows:
2. Related Work
Face Understanding: Face understanding concerns the analysis and interpretation of human facial features, encompassing tasks such as age estimation [10, 37, 44, 90], gender prediction [5, 39, 55], race estimation [3, 4], face recognition [19, 71, 101], face anti-spoofing [59, 69, 113], deepfake detection [70, 97, 128], facial attributes prediction [66, 123, 136], facial expression recognition [42, 102, 120], headpose estimation [25, 80, 122], crowd counting [76, 77, 121] and face parsing [73, 83, 94, 130]. Early methods focused on developing task-specific models for each facial understanding task, achieving promising results. With the advent of transformers, works such as FaceXFormer [72], Q-Face [92], and Faceptor [74] aim to unify multiple tasks within a single architecture to improve generalization performance. The recent rise of MLLMs has enabled new avenues of research in face understanding by integrating information across text, visual, and other modalities. Recent works [15, 86, 91, 109, 112, 129] leverage the reasoning and zero-shot capabilities of MLLMs to approach traditional face-related tasks. However, the field still lacks a standardized benchmark to monitor and regulate the development of these models. Our proposed work addresses this gap by introducing a comprehensive benchmark, FaceXBench, which covers a diverse range of face understanding tasks.
Multimodal Large Language Models and Benchmarks: Following the success of Large Language Models [2, 8, 99], recent research has focused on MLLMs to enhance multimodal comprehension and generation by leveraging the strong generalization capabilities of LLMs. With the rapid development of MLLMs [8, 9, 13, 32, 49, 54, 57, 85, 98, 104, 115, 135], several works propose benchmarks to evaluate MLLMs across different aspects. MMMU [119] meticulously collects multimodal questions from college exams, quizzes, and textbooks to assess college-level subject knowledge. MathVista [62] combines challenges from diverse mathematical and visual tasks, focusing on evaluating models’ fine-grained, deep visual understanding and compositional reasoning. MMBench [60] proposes a bilingual objective benchmark and a CircularEval strategy for models with limited instruction-following capabilities. SEED-Bench [45], MMAU [82], and MMIE [107] introduce benchmarks for generative comprehension and generation, audio understanding, and interleaved multimodal comprehension and generation, respectively. Some benchmarks, such as SWE-Bench [34] and OSWorld [108], focus on agentic reasoning and multimodal agents for open-ended tasks. SHIELD [86] is an early attempt to benchmark MLLMs on face anti-spoofing and face forgery detection but overlooks several other aspects of face understanding. Recent works, such as EMO-LLaMA [109] and Face-MLLM [91], aim to enhance MLLMs’ capabilities in facial expression recognition and face perception; however, the field lacks a comprehensive benchmark to objectively evaluate these tasks. In this work, we introduce FaceXBench, a comprehensive benchmark containing 5,000 multiple-choice questions (MCQs) that evaluate 6 broad categories and 14 different tasks.
3. The FaceXBench Benchmark
3.1. Overview of FaceXBench
We introduce FaceXBench, a novel and comprehensive benchmark encompassing multiple aspects of face understanding. Additionally, we propose FaceXAPI, a manually curated dataset within FaceXBench that aims to evaluate tool-use capabilities. FaceXBench is the first benchmark specifically designed to evaluate MLLMs’ performance on face-related tasks. It assesses MLLMs across six broad categories: bias and fairness, face authentication, face recognition, face analysis, face localization, and facial tool use. Fig. 2 presents a detailed taxonomy of these categories, outlining their respective tasks and corresponding numbers of questions. The designed questions test an MLLM’s capabilities in areas such as visual grounding, fine-grained feature extraction, anomaly detection, emotion recognition, fairness, contextual understanding, spatial understanding, and agentic reasoning.
Benchmark Statistics. FaceXBench consists of 5,000 questions, divided into 6 categories covering 14 tasks, and is derived from 25 public datasets along with one proposed dataset (i.e., FaceXAPI). There are 2,750 questions with multiple images, 2,150 single-image questions, and 100 text-only questions. The questions are generated using 757 unique question templates and contain 10,441 unique images. The correct answer options are approximately equally distributed among A, B, C, and D to avoid bias. The key statistics of the benchmark are summarized in Tab. 1.
3.2. Data Collection
Our data collection pipeline consists of three steps. Step 1. In the first step, we iterate through the identified tasks in each category and select the datasets corresponding to each task. We collect the test sets of existing standard datasets for each task to avoid data leakage. We also create a new dataset, FaceXAPI, specifically for the tool-retrieval task. In total, our collection includes 25 public datasets along with this newly proposed dataset. Step 2. In this step, we manually create question templates for each task, involving single and multiple images. While creating these templates, we carefully frame questions to encourage the model to reason, compare, and think critically to find the correct answer. We include a mix of easy and hard questions to maintain a balanced distribution. For example, a hard question is, “<image1> <image2> <image3> How many images show a person in the age range of 30 to 39?”, while an easy question is, “<image1> What is the age range of the person shown in the image?”. We prompt GPT-4o with the manually curated question templates as in-context examples and generate additional templates that address similar questions but with varied phrasing, perspective, and/or complexity. For example, “<image1> <image2> Which among these two appears to be older?”. We manually filter the question templates for each task to ensure diversity and task relevance. Additional details about the question templates and their generation can be found in Appendix C.1. Step 3. The final step involves generating answer options. Each question includes four options, with one correct answer. We ensure that distractor options are close to the correct answer but distinct enough to require critical thinking and reasoning when selecting the correct response.
For example, for the question “<image1> What is the age range of the person shown in the image?”, the designed options are (A) 20 to 29, (B) 10 to 19, (C) 30 to 39, and (D) 40 to 49, with the correct answer being (C) 30 to 39. Thoughtfully chosen distractor options close to the actual answer encourage the model to reason, compare, and carefully select the correct choice. This approach raises the difficulty for the model, making the questions more challenging. To validate this, we conducted a mini experiment using llava-onevision-qwen2-7b-ov for age estimation on FairFace dataset questions that include a single image. The model achieved 88% accuracy with randomly chosen options, compared to 53% when strategically designed options were used. This result highlights the importance of identifying effective distractor options. Further details on generating distractor options for each task are provided in Appendix C.2. Overall, we observed that the model’s performance varies significantly depending on the complexity of the questions and distractor options.
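The adjacent-range distractor strategy described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual generation code (Appendix C.2): the function name `make_age_mcq`, the decade-bucket format, and the neighbouring-decade pool are all assumptions.

```python
import random

def make_age_mcq(true_age: int, rng: random.Random) -> dict:
    """Build one age-estimation MCQ with adjacent-decade distractors.

    Hypothetical sketch for mid-range ages (10-89); the real pipeline
    may use different bucketing and sampling rules.
    """
    decade = (true_age // 10) * 10
    correct = f"{decade} to {decade + 9}"
    # Distractors are neighbouring decades: close enough to the true
    # answer to force reasoning, distinct enough to keep one answer correct.
    neighbours = [decade - 20, decade - 10, decade + 10, decade + 20]
    pool = [f"{d} to {d + 9}" for d in neighbours if 0 <= d <= 90]
    options = [correct] + rng.sample(pool, 3)
    rng.shuffle(options)
    letters = ["A", "B", "C", "D"]
    answer = letters[options.index(correct)]
    return {
        "question": "<image1> What is the age range of the person shown in the image?",
        "options": dict(zip(letters, options)),
        "answer": answer,
    }

q = make_age_mcq(34, random.Random(0))
```

For `true_age = 34` this yields the "30 to 39" correct option surrounded by the nearby decades, mirroring the worked example above.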
Table 1. Key statistics of questions in FaceXBench.
| Statistic | Number |
|---|---|
| Total questions | 5,000 |
| Total categories | 6 |
| Total tasks | 14 |
| Public datasets used | 25 |
| Newly proposed datasets (FaceXAPI) | 1 |
| Multiple-image questions | 2,750 (55%) |
| Single-image questions | 2,150 (43%) |
| Text-only questions | 100 (2%) |
| Total images across all questions | 11,266 |
| Unique images | 10,441 |
| Unique question templates | 757 |
| Maximum question length | 676 |
| Maximum option length | 207 |
| Average question length | 64.34 |
| Average option length | 11.04 |
| Options per question | 4 |
| Frequency of A as the correct option | 1,278 (25.56%) |
| Frequency of B as the correct option | 1,332 (26.64%) |
| Frequency of C as the correct option | 1,189 (23.78%) |
| Frequency of D as the correct option | 1,201 (24.02%) |

Figure 2. Distribution of questions across different categories and tasks in FaceXBench.
3.3. FaceXAPI
We believe that face understanding is an application domain that can be better addressed by equipping MLLMs with tools rather than relying solely on supervised fine-tuning. We provide a detailed discussion, supported by experimental validation, in the Discussion and Future Directions section (Section 6). To test MLLMs’ ability to select the correct sequence of APIs and function calls for successful task completion, we created FaceXAPI, a dataset of 100 text-only questions referencing 13 APIs and 32 function calls. The questions are designed to ensure diversity in scenarios, reflecting a broad range of real-world applications, and require a sequence of 3 to 5 API calls to solve. The dataset includes a total of 88 unique function call sequences. The questions were generated using GPT-4o, followed by manual filtering. A sample from the dataset is presented in Fig. 3, with the complete prompt provided in Appendix C.3.
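To make the question format concrete, a FaceXAPI-style item might look like the sketch below. The scenario and all function names here are illustrative assumptions; the actual 13 APIs, 32 function calls, and a real sample appear in Fig. 3 and Appendix C.3.

```python
# A hypothetical FaceXAPI-style item: a text-only MCQ whose options are
# candidate sequences of function calls. Every identifier below is
# invented for illustration; the dataset's real APIs differ.
sample = {
    "question": (
        "A security system must verify whether the person at the door "
        "matches an enrolled employee and log their apparent emotion. "
        "Which sequence of function calls solves the task?"
    ),
    "options": {
        "A": ["detect_faces", "estimate_headpose", "count_crowd"],
        "B": ["detect_faces", "extract_embedding", "match_identity",
              "recognize_expression"],
        "C": ["parse_face", "estimate_age", "match_identity"],
        "D": ["detect_spoof", "estimate_gender", "recognize_expression"],
    },
    # Only option B both verifies identity and reads the expression,
    # and its length (4 calls) falls in the dataset's 3-to-5 range.
    "answer": "B",
}
```

Evaluating a model on such an item reduces to the same MCQ accuracy used for the image-based tasks, so tool retrieval needs no separate scoring protocol.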
3.4. Quality Control
We implement quality control checks at each step of the data collection process to prevent error propagation. In the first stage of data collection, we convert established, manually annotated datasets for each task into VQA format. This approach ensures a high level of initial accuracy compared to benchmarks with AI-generated ground truths. In the second stage, we focus on achieving diversity and variety in question generation while maintaining appropriate difficulty levels, which ensures that the generated questions are of mixed difficulty and relevant to the task. We manually verify the correctness and relevance of the 757 unique question templates in the benchmark. Finally, we systematically design answer options that are high-quality and suitably difficult.

Figure 3. FaceXBench examples cover a total of 14 tasks, addressing various aspects of face understanding. Each question may consist of single or multiple images. Every question includes four options, with only one correct answer. The options are strategically designed to prompt the model to analyze carefully before selecting an option.
4. Experiments
In this section, we detail the experiments conducted to benchmark and analyze various MLLMs on face understanding. We evaluate 2 proprietary and 26 open-source models, as listed in Section 4.1. For a fair comparison, all models are evaluated in a zero-shot setting using the same base prompt. We further analyze selected MLLMs under in-context and chain-of-thought settings. The evaluation settings are described in Section 4.2 and the results are presented in Section 5. All experiments are performed on 8 NVIDIA A6000 GPUs.
4.1. Models
The 2 proprietary models used are GPT-4o [30] and GeminiPro 1.5 [96]. We divide the 26 open-source models into three major categories based on parameter size: (a) Open Source MLLMs (<4B parameters): PaliGemma [9], LLaVA-OneVision-0.5b-OV [47], and VILA 1.5-3b [54]; (b) Open Source MLLMs (4B-13B parameters): Chameleon-7b [95], Eagle-X4-8B-Plus [85], Idefics2-8b [41], Idefics-9b-Instruct [40], LLaVA-v1.5-7b [57], Monkey-Chat [53], MiniCPM-Llama3-v2.5 [114],
LLaVA-OneVision-7b-SI [47], LLaVA-NeXT-Interleave-7b [48], Mantis-SIGLIP-8b [32], Phi-3.5-Vision [1], LLaVA-OneVision-7b-OV [47], Qwen2-VL-7b-Instruct [103], and InternVL2-8b [12]; (c) Open Source MLLMs (>13B parameters): CogVLM2-19b [26], Idefics-80b-Instruct [40], LLaVA-v1.5-13b [57], VILA 1.5-13b [54], InternVL-Chat-v1.5 [12], VILA 1.5-40b [54], LLaVA-OneVision-72b-OV [47], Qwen2-VL-72b-Instruct [103], and InternVL2-76b [12]. In Appendix D.1, we provide detailed information regarding the architecture and parameter size of all open-source MLLMs evaluated in this paper, along with additional results under different settings.
4.2. Evaluation Settings
We evaluate the models under three settings: (a) zero-shot, (b) in-context task description, and (c) chain-of-thought prompting. In the zero-shot setting, we pass the input with only the base prompt. In the in-context task description setting, we prepend the prompt with a brief description of the specific task required by the question. In the chain-of-thought setting, we prompt the model to reason step-by-step before selecting the correct option. The task-specific prepended text and additional details of the evaluation settings are provided in Appendix C.6.
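The three settings differ only in how the prompt is assembled around the question, which can be sketched as below. The exact prompt wording used by the benchmark is given in Appendix C.6; the strings here are illustrative placeholders.

```python
def build_prompt(question: str, setting: str, task_desc: str = "") -> str:
    """Assemble the evaluation prompt for one of the three settings.

    Illustrative sketch: the benchmark's actual base prompt and
    task descriptions (Appendix C.6) are worded differently.
    """
    base = (f"{question}\n"
            "Answer with the letter of the correct option (A, B, C, or D).")
    if setting == "zero-shot":
        # Base prompt only.
        return base
    if setting == "in-context":
        # Prepend a brief description of the task the question tests.
        return f"Task: {task_desc}\n{base}"
    if setting == "cot":
        # Ask the model to reason step-by-step before choosing.
        return f"{base}\nLet's think step by step before answering."
    raise ValueError(f"unknown setting: {setting}")
```

Keeping the base prompt identical across settings isolates the effect of the prepended task description or the reasoning instruction, which is what the comparison in Section 5 relies on.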
The proposed benchmark consists of multiple-choice questions (MCQs), which standardize the evaluation across different models. We empirically construct a diverse set of regular expressions and employ a three-step evaluation strategy to extract the model's chosen option in cases where intermediate reasoning or calculations are present in the model’s output. Our evaluation strategy is as follows: First, we use a regular expression to match the chosen option (A, B, C, or D) at the beginning of the model’s prediction. If this attempt fails, we then search for the chosen option within the entire prediction text. As a final fallback, we compare parts of the predicted output with the option values to find a match. If none of these steps succeed, we categorize the prediction as incorrect. The complete function is provided in Appendix C.4. Alternatively, we could implement a random-choice or frequent-choice strategy instead of labeling such responses as incorrect. In the random-choice approach, a random option is selected as the prediction, while in the frequent-choice approach, we select the most common correct option in the dataset. However, these strategies can lead to misleading results, especially when the model fails to respond due to content moderation measures or poor instruction following.
5. Results
The performance of various models on FaceXBench is shown in Table 2. The results emphasize the challenging nature of the benchmark, with no model achieving more than 60% accuracy. We observe that Qwen2-VL-72b-Instruct [103] achieves the best overall performance of 57.86%. InternVL2-76b [12], GeminiPro 1.5 [96], Qwen2-VL-72b-Instruct [103], LLaVA-OneVision-72b-OV [47], Qwen2-VL-72b-Instruct [103], and GeminiPro 1.5 [96] achieve the highest performance in the categories of Bias & Fairness, Face Recognition, Face Authentication, Face Analysis, Face Localization, and Face Tools Use, with accuracies of 69.53%, 70.00%, 41.14%, 63.25%, 55.45%, and 57.00%, respectively.
Results across different evaluation categories. The performance of various models on FaceXBench, as shown in Table 2, highlights the limited face understanding capabilities of current models. Specifically, they perform poorly on tasks such as face authentication and face localization, which require fine-grained facial feature extraction and spatial understanding of facial structure. The low performance in the “Bias & Fairness” category suggests that existing MLLMs exhibit biases toward certain age, gender, and racial groups, which need to be mitigated prior to deployment. Surprisingly, GPT-4o performs poorly in the “Bias & Fairness” category, but on examination, we found that the model often chose not to answer due to its safety alignment. The top-performing models achieve around 70% accuracy in face recognition tasks; however, their performance drops significantly on low-resolution face recognition. This decline can be attributed to training data predominantly comprising high-resolution images, which limits models’ effectiveness on low-resolution inputs. Some models may have been trained on image-text datasets with face images, such as FFHQ-Text [134], CelebA-Dialog [33], LAION-Face [133], and FaceCaption-15M [17], as part of large-scale pretraining. Although these datasets often contain attribute and expression information in text, the models still perform poorly in the “Face Analysis” category, indicating substantial room for improvement.
Open Source vs Closed Source Models. We observe that open-source models, such as InternVL2 and Qwen2-VL, outperform proprietary models like GPT-4o and GeminiPro 1.5, achieving accuracies of 57.80% and 57.86%, compared to 50.50% and 56.96%, respectively. Generally, MLLMs have shown a trend where closed-source models outperform open-source ones. However, in the sensitive domain of face analysis, proprietary models are safety-aligned before deployment. The relatively poor performance of proprietary models in “Bias & Fairness” and “Face Analysis” is primarily due to content moderation constraints. Notably, GeminiPro 1.5 demonstrates comparatively better performance than other models on “Face Tools Use”, showcasing its ability to leverage specialized tools for complex scenarios involving multiple face-related tasks.
Performance across various tasks. To gauge the difficulty of the various tasks, we plot the performance of the top-5 models (4B-13B parameters) on each task, as shown in Figure 4(a). A detailed table showing the performance of all models across all tasks is provided in Appendix D.2 (Table D.2). We observe that, on average, models struggle with tasks such as crowd counting, deepfake detection, head pose estimation, and low-resolution face recognition, highlighting areas where MLLMs need improvement. In contrast, gender prediction is one of the easier tasks, with an average performance of approximately 80%. Additionally, we plot the average performance of these top-5 models on multiple-image and single-image questions. Figure 4(c) shows that models generally perform worse on questions with multiple images as input than on single-image questions. This is primarily because the models must process more visual information and compare and contrast facial features across multiple images, making the questions more challenging. Furthermore, models such as Qwen2-VL [103] and LLaVA-OneVision [47], which use dynamic resolution to handle arbitrary image resolutions, outperform other models on the segmentation task. Mapping images to a dynamic number of visual tokens more closely resembles human-like visual processing and leads to improved performance on face understanding tasks.
Effect of the LLM and its size on performance. To analyze the impact of the LLM on model performance, we plot performance curves for different models that share the same SigLIP SO400M/14@384 vision encoder but differ in their LLM backbone. From Figure 4(b), we observe that performance improves as the LLM size increases. Furthermore, when comparing LLMs of different sizes within the same family, such as Qwen2, we see that the shape of the curve remains consistent, with performance values shifting upward across all dimensions. This indicates that as the LLM size increases, the model's capabilities improve proportionately across various dimensions, maintaining a similar pattern of performance enhancement.
Table 2. Results of different models on FaceXBench. We categorize the open-source models into three groups based on parameter size: (a) open-source MLLMs (<4B parameters), (b) open-source MLLMs (4B-13B parameters), and (c) open-source MLLMs (>13B parameters). Additionally, we evaluate (d) proprietary models. The best-performing model in each category is highlighted in bold.
| Model (28) | Overall (5,000) | Bias & Fairness (800) | Face Recognition (1,500) | Face Authentication (1,100) | Face Analysis (800) | Face Localization (700) | Face Tools Use (100) |
|---|---|---|---|---|---|---|---|
| Random choice | 25.10 | 24.73 | 26.88 | 22.71 | 24.75 | 25.64 | 30.00 |
| Frequent choice | 32.22 | 30.73 | 29.50 | 40.14 | 33.25 | 29.73 | 40.00 |
| Open-source MLLMs (<4B parameters) | | | | | | | |
| PaliGemma [9] | 32.22 | 35.67 | 26.50 | 28.00 | 37.62 | 32.27 | 12.00 |
| LLaVA-OneVision-0.5b-OV [47] | 34.00 | 34.93 | 28.12 | 30.29 | 44.62 | 32.91 | 20.00 |
| **VILA 1.5-3b [54]** | 35.80 | 38.27 | 33.25 | 30.86 | 44.50 | 31.82 | 28.00 |
| Open-source MLLMs (4B-13B parameters) | | | | | | | |
| Chameleon-7b [95] | 17.04 | 10.27 | 17.12 | 6.86 | 20.25 | 28.91 | 33.00 |
| Eagle-X4-8B-Plus [85] | 31.44 | 25.00 | 23.12 | 30.00 | 35.62 | 43.64 | 37.00 |
| Idefics-9b-Instruct [40] | 34.58 | 37.93 | 28.62 | 34.43 | 37.38 | 34.18 | 15.00 |
| LLaVA-v1.5-7b [57] | 36.22 | 41.20 | 33.12 | 30.14 | 43.50 | 32.18 | 15.00 |
| Monkey-Chat [53] | 37.40 | 39.00 | 31.50 | 26.00 | 44.00 | 41.73 | 40.00 |
| MiniCPM-Llama3-v2.5 [114] | 40.70 | 45.80 | 29.88 | 32.86 | 52.38 | 40.45 | 15.00 |
| LLaVA-NeXT-Interleave-7b [48] | 43.80 | 52.53 | 38.00 | 38.57 | 55.88 | 32.27 | 26.00 |
| LLaVA-OneVision-7b-SI [47] | 44.32 | 50.73 | 32.75 | 29.86 | 52.25 | 47.27 | 46.00 |
| Idefics2-8b [41] | 44.52 | 52.67 | 31.25 | 33.57 | 53.25 | 43.91 | 42.00 |
| Mantis-SIGLIP-8b [32] | 44.60 | 56.13 | 45.12 | 36.86 | 48.00 | 31.64 | 37.00 |
| Phi-3.5-Vision [1] | 45.16 | 52.47 | 50.12 | 40.00 | 51.00 | 31.64 | 34.00 |
| LLaVA-OneVision-7b-OV [47] | 48.98 | 61.40 | 38.38 | 35.57 | 55.12 | 44.82 | 38.00 |
| Qwen2-VL-7b-Instruct [103] | 51.58 | 57.47 | 57.88 | 34.00 | 57.50 | 47.09 | 38.00 |
| **InternVL2-8b [12]** | 53.24 | 62.40 | 61.75 | 35.43 | 55.38 | 45.09 | 45.00 |
| Open-source MLLMs (>13B parameters) | | | | | | | |
| Idefics-80b-Instruct [40] | 35.86 | 39.87 | 35.12 | 27.71 | 35.12 | 38.55 | 15.00 |
| LLaVA-v1.5-13b [57] | 39.88 | 44.60 | 34.88 | 34.14 | 44.75 | 37.27 | 39.00 |
| VILA 1.5-13b [54] | 40.00 | 45.07 | 40.00 | 28.43 | 49.25 | 34.18 | 35.00 |
| CogVLM2-19b [26] | 40.46 | 43.13 | 33.88 | 35.71 | 45.62 | 41.91 | 29.00 |
| InternVL-Chat-v1.5 [12] | 49.18 | 59.73 | 41.38 | 33.00 | 55.12 | 46.73 | 46.00 |
| VILA 1.5-40b [54] | 55.48 | 64.00 | 57.63 | 33.14 | 60.50 | 54.36 | 39.00 |
| LLaVA-OneVision-72b-OV [47] | 56.42 | 66.53 | 52.00 | 37.43 | 63.25 | 53.73 | 48.00 |
| InternVL2-76b [12] | 57.80 | 69.53 | 66.62 | 36.14 | 62.00 | 47.18 | 46.00 |
| **Qwen2-VL-72b-Instruct [103]** | 57.86 | 62.20 | 69.12 | 41.14 | 57.88 | 55.45 | 46.00 |
| Proprietary MLLMs | | | | | | | |
| GPT-4o [30] | 50.50 | 46.93 | 55.62 | 40.00 | 62.25 | 50.36 | 44.00 |
| **GeminiPro 1.5 [96]** | 56.96 | 67.40 | 70.00 | 35.00 | 58.13 | 46.36 | 57.00 |
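The "Random choice" and "Frequent choice" rows in Table 2 can be approximated as follows. This sketch assumes four answer options per question, whereas the actual benchmark mixes questions with different option counts (which is why per-category random baselines deviate from 25%):

```python
import random
from collections import Counter

def random_choice_baseline(answers, options="ABCD", trials=1000, seed=0):
    # Expected accuracy of uniformly guessing one option per question,
    # estimated by simulation (assumes every question has the same options).
    rng = random.Random(seed)
    hits = sum(
        rng.choice(options) == a for _ in range(trials) for a in answers
    )
    return 100.0 * hits / (trials * len(answers))

def frequent_choice_baseline(answers):
    # Always predict the most common answer letter in the answer key.
    _, count = Counter(answers).most_common(1)[0]
    return 100.0 * count / len(answers)

answers = ["A", "B", "A", "C", "A", "D"]  # toy answer key
print(frequent_choice_baseline(answers))
```

On the toy key the frequent-choice baseline is 50.0, while the simulated random baseline hovers near 25.0.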
Performance under different evaluation settings. We evaluate three selected models under the in-context task description and chain-of-thought settings, with results summarized in Table 3. In the in-context setting, we observe some performance improvements in face authentication and face tools use. However, overall performance drops, indicating that the models struggle to utilize in-context information effectively. In the chain-of-thought setting, we observe a substantial decline in performance, suggesting that, although these models exhibit reasoning capabilities, those capabilities do not transfer effectively to face understanding.
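The three evaluation settings differ only in how a question is wrapped into a prompt. A rough sketch of the wrappers, with wording that is illustrative rather than the paper's exact prompts:

```python
def build_prompt(question, options, setting="zero-shot", task_description=""):
    """Sketch of the three evaluation settings; the exact FaceXBench
    prompt wording is not reproduced here."""
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    base = f"{question}\n{opts}\n"
    if setting == "zero-shot":
        return base + "Answer with the option letter only."
    if setting == "in-context":
        # Prepend a description of the task before the question.
        return f"Task: {task_description}\n\n" + base + "Answer with the option letter only."
    if setting == "chain-of-thought":
        return base + "Think step by step, then give the option letter."
    raise ValueError(f"unknown setting: {setting}")

p = build_prompt("Are these two images of the same person?", ["Yes", "No"],
                 setting="chain-of-thought")
```

The same question and option list are reused across all three settings, so score differences isolate the effect of the prompting strategy.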
6. Discussion and Future Directions
In this section, we discuss possible future directions for improving the face understanding capabilities of MLLMs. The MLLM community has developed numerous supervised fine-tuning (SFT) datasets, such as COCO Caption [11],

Figure 4. (a) Performance of the top-5 models (4B-13B parameters) across various tasks. (b) Effect of the LLM and its size on model performance. (c) Average performance of the top-5 models (4B-13B parameters) on multiple-image and single-image questions.
ScienceQA [81], Vision-FLAN [110], ChartQA [64], FigureQA [35], Geometry3K [23], MAVIS MCollect [124], MATHQA [6], TextOCR [88], OCR-VQA [65], and MagpiePro [111]. These supervised fine-tuning sets target various skills, including document and chart understanding, mathematics, fine-grained perception, grounding, reasoning, and general OCR. Developing these general skills helps models perform better on existing tasks and enhances their reasoning and visual processing capabilities for improved zero-shot task performance. However, to answer the question, “Will supervised fine-tuning improve the performance of MLLMs on face understanding?”, we conducted multiple experiments by fine-tuning LLaVA-1.5 [56] on a random 70k subset of FaceCaption-15M [17], which comprises image-text pairs containing attribute, age, and gender information. We fine-tuned the vision projector and the LLM backbone using LoRA [27] and summarize the results for various data compositions in Table 3(c). Our findings reveal that naively fine-tuning the MLLM on Face SFT data alone results in poor performance (Table 3(c), row 1), as the model tends to lose its general reasoning and perception capabilities. We then incorporated the complete LLaVA SFT dataset (665k samples) alongside the Face SFT data, and the model achieved a score of 35.18 (Table 3(c), row 2). Finally, we randomly sampled 200k examples from the full 665k LLaVA-1.5 dataset and combined them with the 70k Face SFT data for fine-tuning, resulting in an improved score of 35.24 (Table 3(c), row 3). This experiment demonstrates the benefit of integrating Face SFT data in the right proportion with existing reasoning and instruction-tuning datasets.
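The data-composition experiments above amount to sampling a subset of the general instruction-tuning set and shuffling it together with the face data. A sketch of that mixing step (dataset contents here are placeholders, not the actual SFT records):

```python
import random

def mix_sft_data(general_sft, face_sft, n_general=200_000, seed=0):
    """Mix a random subset of a general SFT set with face SFT data.

    Mirrors the composition experiment described above: sample
    n_general examples from the general instruction-tuning set,
    then shuffle them together with the face-captioning subset.
    """
    rng = random.Random(seed)
    subset = rng.sample(general_sft, min(n_general, len(general_sft)))
    mixed = subset + list(face_sft)
    rng.shuffle(mixed)
    return mixed
```

With the paper's proportions this would be `mix_sft_data(llava_sft, face_sft, n_general=200_000)`, giving a roughly 200k + 70k mixture; the key design choice is keeping enough general data so reasoning and perception skills are not forgotten.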
We explored an alternative research direction to improve MLLMs’ performance on face understanding through the use of specialized tools. To investigate the benefits of tool use, we selected specific datasets and converted predictions from state-of-the-art models into text, which we then provided as context to the MLLM. As shown in Table 3(b), this approach resulted in a significant performance boost.
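Serializing specialist-model predictions into the MLLM's prompt can be as simple as prepending labeled text lines. A sketch, with a hypothetical tool name and output format:

```python
def prompt_with_tool_context(question, tool_outputs):
    """Prepend specialist-model predictions, rendered as text, to a question.

    `tool_outputs` maps a tool name to its textual prediction; the tool
    names and wording here are illustrative, not the paper's exact format.
    """
    context = "\n".join(f"[{tool}] {out}" for tool, out in tool_outputs.items())
    return f"Context from specialized tools:\n{context}\n\nQuestion: {question}"

p = prompt_with_tool_context(
    "What is the head pose in the image?",
    {"HeadPoseNet": "yaw=12.3, pitch=-4.1, roll=0.8"},  # hypothetical tool
)
```

The MLLM then only has to read the serialized prediction rather than perform the fine-grained estimation itself, which is consistent with the gains reported in Table 3(b).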
| Model | Overall | B&F | FR | F Auth. | F Anlys. | FL | Tools |
|---|---|---|---|---|---|---|---|
| In-context description | | | | | | | |
| Phi-3.5-Vision [1] | -3.68 | -5.00 | -7.74 | -3.86 | -3.12 | +0.00 | +5.00 |
| Qwen2-VL-7B [103] | -1.26 | -2.40 | -2.50 | +5.00 | -2.38 | -2.45 | +4.00 |
| InternVL2-8B [12] | -1.10 | -3.13 | -0.50 | +6.00 | -2.01 | -3.00 | +3.00 |
| Chain-of-thought | | | | | | | |
| Phi-3.5-Vision [1] | -11.80 | -20.20 | -19.87 | -4.71 | -13.50 | 1.72 | -6.00 |
| Qwen2-VL-7B [103] | -11.96 | -13.74 | -16.76 | -6.86 | -12.12 | -10.27 | +0.00 |
| InternVL2-8B [12] | -4.36 | -3.00 | -3.75 | -4.86 | -5.50 | -5.91 | +0.00 |

(a)

| Model | Segmentation (CelebMask) | HPE (BIWI) | FER (AffectNet) |
|---|---|---|---|
| Phi-3.5-Vision [1] | +7.00 | +12.00 | +13.00 |
| Qwen2-VL-7B [103] | +5.00 | +6.67 | +30.00 |
| InternVL2-8B [12] | +4.00 | +7.33 | +23.00 |

(b)

| Data | Accuracy |
|---|---|
| FaceSFT | 33.58 |
| LLaVA SFT (665k) + FaceSFT | 35.18 |
| LLaVA SFT (200k) + FaceSFT | 35.24 |

(c)
Table 3. (a) Change in performance of selected models under different evaluation settings. (b) Performance improvement from leveraging tool use. (c) Results of fine-tuning experiments on LLaVA-1.5 [56] using different data compositions.
We believe that equipping MLLMs with agentic behavior through specialized tools is a promising way to enhance their face understanding capabilities.
For future work: we advocate the development of diverse supervised fine-tuning sets covering various aspects of face understanding and training MLLMs to leverage specialized tools. Following this direction, we believe that future MLLMs will be capable of advanced face understanding, with FaceXBench serving as a critical resource for monitoring progress by benchmarking models across multiple dimensions of face understanding.
7. Conclusion
We propose FaceXBench, a comprehensive benchmark for face understanding that consists of 5,000 multiple-choice questions. It covers 14 tasks across six broad categories and is derived from 26 datasets. We employed a three-step data collection process to convert existing datasets into a VQA format, implementing quality control checks at each step. We conducted a thorough evaluation of 26 open-source models and two proprietary models, GPT-4o and GeminiPro 1.5, revealing the limitations of these models in face understanding. Our analysis examines performance across various dimensions and tasks, identifying factors that impact model performance and providing valuable insights. Finally, we discuss possible future directions, advocating for the development of instruction-tuning datasets, with FaceXBench serving as a catalyst for further research.
References
FaceXBench: Evaluating Multimodal LLMs on Face Understanding Appendix
In the appendix, we provide a discussion of the limitations of our work. Along with this, we present detailed information regarding the benchmark, including the broad categories, the tasks, the source datasets used, and the dataset statistics. Additionally, we focus on its implementation and provide extensive details about the prompts used for dataset collection and the evaluation strategy. Furthermore, we expand on the results presented in the main paper, providing more details about the baselines, showcasing failure cases of the models, and concluding with an ethics statement.
Table of Contents in Appendix
A. Limitations
E. Ethical Considerations
A. Limitations
Our benchmark has a limited number of questions that address scenarios where multiple faces appear within a single image, limiting its applicability in such contexts. Additionally, generative models, such as diffusion models, cannot be evaluated using the current benchmark, restricting its applicability for assessing performance in image generation tasks. Future work will address these limitations by extending the benchmark to include questions designed to evaluate image generation capabilities.
B. FaceXBench
In this section of the appendix, we will provide more information on the broad categories and tasks included in the FaceXBench. Additionally, we will provide details on the source datasets and the dataset statistics, with examples of images used in the benchmark.
B.1. FaceXBench Categories
Bias and Fairness: In this category, we evaluate the model’s ability to estimate age, predict gender, and identify race as key indicators of its understanding and analysis of demographic attributes. The focus extends beyond the accuracy of these predictions to ensuring fairness and inclusivity across diverse groups. By analyzing the model’s performance, we aim to uncover potential biases and ensure that predictions remain consistent and unbiased regardless of age group, gender, or race. This assessment is crucial for promoting fairness, mitigating the amplification of societal biases, and enhancing the model’s generalization capabilities for real-world applications.
Face Recognition: In this category, we evaluate the model’s ability to perform accurate face recognition across various contexts, including high-resolution face recognition, low-resolution face recognition, and celebrity identification. These tasks test the model’s proficiency in feature extraction, spatial awareness, and handling variations in image quality, lighting, and pose. High-resolution face recognition assesses the model’s ability to leverage fine-grained details for precise identification, while low-resolution face recognition challenges its capability to generalize from limited information. Celebrity identification evaluates its knowledge base and contextual understanding of well-known individuals.
Face Authentication: In this category, we evaluate the model’s ability to perform robust face authentication, with a focus on critical tasks such as face anti-spoofing and deepfake detection. These tasks assess the model’s capability to distinguish bonafide facial data from spoofing attempts, thereby ensuring the security and reliability of authentication systems. They are essential for safeguarding sensitive applications like identity verification, access control, and fraud prevention. By analyzing the model’s performance, we aim to confirm that it demonstrates high sensitivity to subtle cues, generalizes effectively across diverse attack methods, and minimizes both false positives and false negatives. This assessment is vital for building trust in face authentication systems and addressing emerging threats in a rapidly evolving technological landscape.
Face Analysis: In this category, we evaluate the model’s ability to analyze and interpret facial features through tasks such as facial attribute prediction and facial expression recognition. This evaluation focuses on the model’s ability to accurately identify static attributes, such as physical traits (e.g., glasses, hair color, or beard), which are essential for applications like targeted content delivery and user profiling. It also emphasizes the dynamic understanding of emotions and subtle expressions, including micro-expressions, which play a key role in applications related to human-computer interaction, mental health assessment, and sentiment analysis. These skills are critical for capturing nuanced characteristics and ensuring effective interaction across diverse real-world scenarios.
Face Localization: In this category, we evaluate the model’s ability to accurately locate and analyze facial regions through tasks such as head pose estimation, face parsing, and crowd counting. Head pose estimation assesses the model’s spatial awareness and its ability to interpret the orientation of faces in three-dimensional space, which is crucial for applications like gaze tracking, augmented reality, and driver monitoring systems. Face parsing focuses on the precise segmentation and labeling of facial regions, such as eyes, nose, and mouth, enabling fine-grained analysis for tasks like virtual makeup, medical diagnosis, and personalized user interfaces. Crowd counting evaluates the model’s proficiency in detecting and quantifying multiple faces in dense or cluttered environments, ensuring robust performance in scenarios such as public safety monitoring, event analysis, and resource planning. Together, these tasks test the model’s ability to generalize across varying scales, perspectives, and levels of complexity. This assessment is vital for enhancing real-world applications that rely on face localization.
Face Tools Use: In this category, we evaluate the model’s ability to leverage external tools for face understanding tasks, reflecting a shift from traditional supervised fine-tuning to tool-based problem-solving. Using the FaceXAPI dataset, we assess the model’s proficiency in selecting and sequencing the correct APIs and function calls to solve complex tasks. This evaluation emphasizes the model’s ability to interpret detailed task requirements, identify relevant tools, and construct accurate operational workflows. The significance of this task lies in its alignment with real-world applications, where equipping MLLMs with tools enhances both scalability and adaptability.
B.2. Tasks
Age Estimation: This task involves determining an individual’s age or age range based on their facial features.
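As an illustration of how a labeled face dataset can be recast into the benchmark's multiple-choice format (the paper's actual three-step pipeline and distractor strategy may differ), an age label might be wrapped as follows:

```python
import random

def make_age_mcq(true_age, n_options=4, spread=20, seed=0):
    """Turn an age label into a multiple-choice question (illustrative).

    Distractors are ages sampled near the ground truth; this is an
    assumed strategy, not the benchmark's documented one.
    """
    rng = random.Random(seed)
    distractors = set()
    while len(distractors) < n_options - 1:
        cand = true_age + rng.randint(-spread, spread)
        if cand != true_age and cand >= 0:
            distractors.add(cand)
    options = list(distractors) + [true_age]
    rng.shuffle(options)
    answer = chr(65 + options.index(true_age))  # option letter of the truth
    question = "What is the approximate age of the person in the image?"
    return question, options, answer
```

Pairing such a question with its source image yields one VQA item; quality-control checks would then filter ambiguous or near-duplicate options.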
Gender Prediction: Gender prediction is the process of identifying a person’s gender from facial images by analyzing visual characteristics of the face.
Race Estimation: Race estimation involves predicting an individual’s racial background by analyzing their facial features.
High-resolution Face Recognition: High-resolution face recognition is the identification or verification of individuals using detailed facial images captured at high resolutions, which enhances the accuracy in distinguishing fine facial features.
Low-resolution Face Recognition: Low-resolution Face Recognition refers to performing face recognition tasks on images with limited detail, such as surveillance footage, which poses challenges due to reduced image clarity.
Celebrity Identification: Celebrity identification involves recognizing and naming well-known individuals in images or videos by comparing their facial features against a database of celebrity faces.
Face Anti-spoofing: This task focuses on detecting attempts to deceive facial recognition systems using methods such as photos, videos, or masks, ensuring that the face presented is genuine.
Deepfake Detection: Deepfake detection is the process of identifying synthetic media where a person’s likeness has been digitally altered or replaced, aiming to detect manipulated or fabricated content.
Attributes Prediction: Attributes prediction involves inferring various facial characteristics, such as the presence of glasses, facial hair, or specific expressions, from images.
Facial Expression Recognition: This task involves analyzing facial movements to determine a person’s emotional state, such as happiness, sadness, or anger.
Headpose Estimation: Head pose estimation is the process of determining the orientation of a person’s head (e.g., pitch, yaw, roll) relative to the camera, and is useful in applications like gaze tracking.
Face Parsing: This task refers to segmenting a facial image into distinct regions (e.g., eyes, nose, mouth) to facilitate detailed analysis or manipulation of facial components.
Crowd Counting: Crowd counting involves estimating the number of individuals present in an image or video frame, often used in surveillance and event monitoring.
Face Tools Retrieval: It refers to predicting the correct sequence of API calls that MLLMs need to execute to complete complex face-related scenarios requiring multiple tasks.
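A natural way to score face tools retrieval is ordered exact match over the predicted call sequence; the sketch below assumes that convention and uses hypothetical API names:

```python
def api_sequence_correct(predicted, ground_truth):
    """Exact-match check for a predicted API call sequence.

    Scoring here is an ordered exact match of call names; the actual
    FaceXAPI scoring protocol may differ.
    """
    return list(predicted) == list(ground_truth)

# Hypothetical API names for illustration.
gt = ["detect_faces", "estimate_head_pose", "recognize_face"]
assert api_sequence_correct(["detect_faces", "estimate_head_pose", "recognize_face"], gt)
assert not api_sequence_correct(["detect_faces", "recognize_face"], gt)
```

Ordered matching penalizes both missing calls and correct calls issued in the wrong order, which matters when one tool's output feeds the next.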
B.3. Source Datasets
FairFace [36]: FairFace is a face image dataset designed to address racial bias in facial recognition systems. It contains 108,501 images with annotations for race, gender, and age. The dataset includes seven race categories: White, Black, East Asian, Southeast Asian, Indian, Middle Eastern, and Latino. The images are sourced primarily from the YFCC-100M Flickr dataset and are balanced across different demographic groups.
UTKFace [127]: UTKFace is a large-scale face dataset comprising over 20,000 images with annotations for age, gender, and ethnicity. The age range spans from 0 to 116 years. The images exhibit variations in pose, facial expression, illumination, occlusion, and resolution, making it suitable for tasks like face detection, age estimation, and landmark localization.
WMCA (Wide Multi-Channel Presentation Attack) [24]: The WMCA dataset consists of 1,941 short video recordings from 72 different identities, including both bona fide and presentation attacks. Data is recorded across multiple channels: color, depth, infrared, and thermal. The dataset is designed for research in face anti-spoofing and presentation attack detection.
MSU MFSD [105]: It contains 280 video recordings of genuine and attack faces from 35 individuals. Each individual has two real-access videos captured with laptop cameras and Android devices. Attack videos include high-definition replays and photo attacks. The dataset is divided into 120 training videos from 15 subjects and 160 testing videos from 20 subjects.
CASIA MFSD [126]: It is a face anti-spoofing dataset containing 600 video recordings from 50 subjects. Each subject has
12 videos under different resolutions and lighting conditions. The dataset includes three spoof attack types: replay, warp print, and cut print attacks. It is divided into 240 training videos from 20 subjects and 360 testing videos from 30 subjects.
Replay Attack [16]: The Replay Attack dataset is designed for evaluating face anti-spoofing systems. It includes videos of both genuine access attempts and various spoofing attacks, such as printed photos and video replays. The dataset provides a diverse set of conditions to test the robustness of anti-spoofing algorithms.
CelebDF [52]: CelebDF is a large-scale dataset for deepfake detection, containing 590 real videos of celebrities and 5,639 corresponding deepfake videos. The dataset is designed to evaluate the performance of deepfake detection algorithms under real-world conditions.
FF++ [78]: FaceForensics++ is a dataset for evaluating facial manipulation detection methods. It consists of over 1,000 original video sequences and more than 4,000 videos manipulated using four different face manipulation techniques. Annotations include manipulation methods and compression levels.
TinyFace [14]: TinyFace is a dataset focused on detecting small faces in images. It contains images with a wide range of face sizes, particularly emphasizing faces that occupy a small number of pixels. Annotations include bounding boxes for each face.
LFW [28]: LFW is a dataset of 13,000 labeled face images in the wild, collected from the internet. It includes annotations for the identity of the person in each image, with 1,680 individuals having two or more images. The dataset is widely used for studying face recognition in unconstrained environments.
AgeDB [68]: AgeDB is a manually collected dataset containing 16,488 images of 568 distinct subjects. It provides age annotations and is used for evaluating age-invariant face verification and recognition algorithms.
CFP-FF and CFP-FP [84]: The CFP dataset consists of two subsets: CFP-FF (Frontal-Frontal) and CFP-FP (FrontalProfile). Each contains 7,000 images of 500 subjects. CFP-FF includes frontal face pairs, while CFP-FP includes frontal and profile face pairs, facilitating the study of face recognition across different poses.
CFP-FF 和 CFP-FP [84]:CFP 数据集由两个子集组成:CFP-FF(正面-正面)和 CFP-FP(正面-侧面)。每个子集包含 500 个对象的 7,000 张图像。CFP-FF 包含正面人脸对,而 CFP-FP 包含正面和侧面人脸对,便于研究不同姿态下的人脸识别。
CALFW [132]: CALFW is a dataset derived from LFW, focusing on cross-age face verification. It contains 4,025 image pairs with age differences, aiming to evaluate the performance of face recognition systems under age variation.
CPLFW [131]: CPLFW is another extension of LFW, emphasizing cross-pose face verification. It includes 3,000 image pairs with pose variations, challenging face recognition models to handle different facial orientations.
IMDB [79]: The IMDB dataset comprises 460,723 face images of 20,284 celebrities, collected from the Internet Movie Database (IMDb). Annotations include age, gender, and name, making it suitable for age estimation and gender classification tasks.
IMDB [79]: IMDB 数据集包含从互联网电影数据库 (IMDb) 收集的 20,284 位名人的 460,723 张面部图像。标注信息包括年龄、性别和姓名,适用于年龄估计和性别分类任务。
CelebA [61]: CelebA is a large-scale face attributes dataset with more than 200,000 celebrity images, each annotated with 40 attribute labels. The dataset covers a wide range of poses and backgrounds, supporting tasks like attribute prediction and face detection.
CelebA [61]: CelebA 是一个大规模的人脸属性数据集,包含超过 20 万张名人图像,每张图像都标注了 40 个属性标签。该数据集涵盖了广泛的姿态和背景,支持属性预测和人脸检测等任务。
RAF-DB [51]: RAF-DB contains 29,672 facial images with annotations for basic and compound emotions. The dataset is used for studying facial expression recognition in real-world scenarios.
RAF-DB [51]: RAF-DB 包含 29,672 张面部图像,标注了基本和复合情绪。该数据集用于研究真实场景中的面部表情识别。
AffectNet [67]: AffectNet is a comprehensive facial expression dataset with over 1 million images collected from the internet. Annotations include seven discrete facial expressions (anger, contempt, disgust, fear, happiness, sadness, and surprise) and the intensity of valence and arousal. It is widely used for emotion recognition and affective computing research.
AffectNet [67]: AffectNet 是一个全面的面部表情数据集,包含从互联网收集的超过 100 万张图像。标注包括七种离散的面部表情(愤怒、轻蔑、厌恶、恐惧、快乐、悲伤和惊讶)以及效价和唤醒的强度。它广泛用于情感识别和情感计算研究。
AFLW2000 [118]: AFLW2000 contains 2,000 face images annotated with 68 facial landmarks. The images are diverse, covering various poses, expressions, and occlusions. It is often used for face alignment and landmark localization tasks.
AFLW2000 [118]: AFLW2000 包含 2,000 张标注了 68 个面部关键点的人脸图像。这些图像具有多样性,涵盖了各种姿态、表情和遮挡情况。它通常用于人脸对齐和关键点定位任务。
BIWI [22]: The BIWI dataset includes 15,678 images of 20 subjects, captured using a Kinect camera. Annotations consist of 6D head poses (yaw, pitch, roll, and translation vectors) and 3D face models. It is designed for head pose estimation research.
BIWI [22]: BIWI 数据集包含 20 名受试者的 15,678 张图像,使用 Kinect 相机拍摄。标注包括 6D 头部姿态(偏航、俯仰、滚动和平移向量)和 3D 面部模型。该数据集专为头部姿态估计研究设计。
JHUCrowd++ [87]: JHUCrowd++ is a large-scale dataset for crowd counting, containing 4,372 images with over 1.5 million annotated heads. The annotations include head locations, crowd density maps, and visibility levels, making it suitable for crowd analysis and density estimation.
JHUCrowd++ [87]: JHUCrowd++ 是一个用于人群计数的大规模数据集,包含 4,372 张图像,标注了超过 150 万个头部。标注内容包括头部位置、人群密度图和可见度级别,适用于人群分析和密度估计。
ShanghaiTech [125]: The ShanghaiTech dataset contains two parts: Part A, with 482 images captured in crowded scenes, and Part B, with 716 images from less dense environments. It includes over 330,000 annotated head locations, making it a benchmark for crowd counting and density estimation.
ShanghaiTech [125]: ShanghaiTech 数据集包含两部分:A 部分包含 482 张在拥挤场景中拍摄的图像,B 部分包含 716 张来自密度较低环境的图像。该数据集包含超过 330,000 个标注的头部位置,使其成为人群计数和密度估计的基准。
CelebAMask-HQ [43]: CelebAMask-HQ is an extension of the CelebA dataset with 30,000 high-resolution face images and fine-grained segmentation masks for 19 facial attributes (e.g., eyes, nose, hair, and skin). It supports tasks like face parsing and image editing.
CelebAMask-HQ [43]: CelebAMask-HQ 是 CelebA 数据集的扩展,包含 30,000 张高分辨率人脸图像和 19 种面部属性(如眼睛、鼻子、头发和皮肤)的细粒度分割掩码。它支持人脸解析和图像编辑等任务。
LaPa [58]: LaPa includes 22,000 facial images with high-quality annotations for 11 facial regions. It offers various attributes like pose, expression, and occlusion, making it suitable for face parsing and semantic segmentation tasks.
LaPa [58]: LaPa 包含 22,000 张面部图像,具有 11 个面部区域的高质量标注。它提供了各种属性,如姿态、表情和遮挡,适用于面部解析和语义分割任务。
FaceXAPI: FaceXAPI is a dataset consisting of 100 text-only questions, each with four options and one correct answer. It is designed to assess the capabilities of MLLMs in predicting the correct sequence of API calls needed to accomplish complex scenarios involving multiple face-related tasks.
FaceXAPI: FaceXAPI 是一个由 100 个纯文本问题组成的数据集,每个问题有四个选项和一个正确答案。它旨在评估多模态大语言模型 (MLLMs) 在预测完成涉及多个面部相关任务的复杂场景所需的正确 API 调用序列方面的能力。
B.4. Dataset Statistics
B.4. 数据集统计
The FaceXBench dataset is derived from 25 public datasets and one newly created dataset. The number of questions sourced from each dataset, along with the type of questions (multiple images, single images, or text-only), as well as the associated tasks and categories, are summarized in Table B.1.
FaceXBench 数据集来源于 25 个公开数据集和一个新创建的数据集。每个数据集中问题的数量、问题的类型(多图像、单图像或纯文本)以及相关的任务和类别总结在表 B.1 中。
Table B.1. Question distribution of FaceXBench across datasets
表 B.1: FaceXBench 在各数据集上的问题分布
| 数据集 | 问题数量 | 多张图像 | 单张图像 | 仅文本 | 任务 | 类别 |
|---|---|---|---|---|---|---|
| FairFace [36] | 300 | 200 | 100 | 0 | 年龄估计 | 偏见与公平 |
| UTKFace [127] | 200 | 150 | 50 | 0 | 年龄估计 | 偏见与公平 |
| FairFace [36] | 300 | 200 | 100 | 0 | 性别预测 | 偏见与公平 |
| UTKFace [127] | 200 | 150 | 50 | 0 | 性别预测 | 偏见与公平 |
| FairFace [36] | 300 | 200 | 100 | 0 | 种族估计 | 偏见与公平 |
| UTKFace [127] | 200 | 150 | 50 | 0 | 种族估计 | 偏见与公平 |
| LFW [28] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| AgeDB [68] | 100 | 100 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CFP-FF [84] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CFP-FP [84] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CALFW [132] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| CPLFW [131] | 60 | 60 | 0 | 0 | 高分辨率人脸识别 | 人脸识别 |
| TinyFace [14] | 100 | 100 | 0 | 0 | 低分辨率人脸识别 | 人脸识别 |
| IMDB [79] | 300 | 150 | 150 | 0 | 名人识别 | 人脸识别 |
| WMCA [24] | 250 | 100 | 150 | 0 | 人脸反欺骗 | 人脸认证 |
| MSU-MFSD [105] | 50 | 50 | 0 | 0 | 人脸反欺骗 | 人脸认证 |
| CASIA-MFSD [126] | 50 | 50 | 0 | 0 | 人脸反欺骗 | 人脸认证 |
| ReplayAttack [16] | 50 | 50 | 0 | 0 | 人脸反欺骗 | 人脸认证 |
| CelebDF [52] | 150 | 150 | 0 | 0 | 深度伪造检测 | 人脸认证 |
| FF++ [78] | 150 | 150 | 0 | 0 | 深度伪造检测 | 人脸认证 |
| CelebA [61] | 400 | 200 | 200 | 0 | 属性预测 | 人脸分析 |
| RAF-DB [51] | 200 | 100 | 100 | 0 | 面部表情识别 | 人脸分析 |
| AffectNet [67] | 200 | 100 | 100 | 0 | 面部表情识别 | 人脸分析 |
| AFLW2000 [118] | 200 | 50 | 150 | 0 | 头部姿态估计 | 人脸分析 |
| BIWI [22] | 200 | 50 | 150 | 0 | 头部姿态估计 | 人脸分析 |
| JHUCrowd++ [87] | 200 | 0 | 200 | 0 | 人群计数 | 人脸定位 |
| ShanghaiTech [125] | 100 | 0 | 100 | 0 | 人群计数 | 人脸定位 |
| CelebAMask-HQ [43] | 200 | 0 | 200 | 0 | 人脸解析 | 人脸定位 |
| LaPa [58] | 200 | 0 | 200 | 0 | 人脸解析 | 人脸定位 |
| FaceXAPI | 100 | 0 | 0 | 100 | 人脸工具检索 | 人脸工具使用 |
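As a quick sanity check, the per-dataset question counts transcribed from Table B.1 can be tallied programmatically; they should sum to the 5,000 questions in FaceXBench:

```python
# Per-dataset question counts transcribed from Table B.1.
# Keys combine dataset and task where a dataset appears more than once.
counts = {
    "FairFace/age": 300, "UTKFace/age": 200,
    "FairFace/gender": 300, "UTKFace/gender": 200,
    "FairFace/race": 300, "UTKFace/race": 200,
    "LFW": 60, "AgeDB": 100, "CFP-FF": 60, "CFP-FP": 60,
    "CALFW": 60, "CPLFW": 60, "TinyFace": 100, "IMDB": 300,
    "WMCA": 250, "MSU-MFSD": 50, "CASIA-MFSD": 50, "ReplayAttack": 50,
    "CelebDF": 150, "FF++": 150,
    "CelebA": 400, "RAF-DB": 200, "AffectNet": 200,
    "AFLW2000": 200, "BIWI": 200,
    "JHUCrowd++": 200, "ShanghaiTech": 100,
    "CelebAMask-HQ": 200, "LaPa": 200,
    "FaceXAPI": 100,
}
total = sum(counts.values())
print(total)  # 5000
```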
B.5. Images used in the dataset
B.5. 数据集中使用的图像
Figure B.1 displays a subset of the facial images used in the dataset, highlighting the diversity of faces included in the benchmark. The benchmark consists of face images with varying backgrounds, resolutions, and head pose orientations, as well as a wide range of facial expressions. It includes individuals from different age groups, genders, and racial backgrounds, with each face characterized by a versatile set of attributes.
图 B.1 展示了数据集中使用的一部分面部图像,突显了基准中包含的面部多样性。该基准由具有不同背景、分辨率和头部姿态方向的面部图像组成,涵盖了广泛的面部表情。它包括来自不同年龄组、性别和种族背景的个体,每个面部都具有多样化的属性集。
C. Dataset Collection
C. 数据集收集
In this section, we provide additional implementation details on the dataset curation process, including information about question templates, distractor options, the evaluation strategy, and the dataset format.
在本节中,我们提供了关于数据集整理过程的额外实现细节,包括问题模板、干扰选项、评估策略和数据集格式的信息。

Figure B.1. Collage of a subset of images from the dataset, showcasing the diversity of images used in FaceXBench.
图 B.1: 数据集中部分图像的拼贴图,展示了 FaceXBench 中使用的图像多样性。
C.1. Question Templates
C.1. 问题模板
We provide a selection of question templates for each task used in creating the dataset. The options for the questions may vary, and the examples provided below are intended as samples.
我们为创建数据集的每个任务提供了一系列问题模板。问题的选项可能有所不同,以下提供的示例仅供参考。
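To make the template mechanism concrete, the following sketch shows how a single-image age-estimation template could be instantiated into a four-option question. The age-group bins and the option-filling logic are our own assumptions for illustration, not the benchmark's actual code:

```python
import random

TEMPLATE = "What is the most appropriate age group for the person in this image?"
# Hypothetical age-group bins used only for this illustration.
AGE_GROUPS = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"]

def build_question(correct: str) -> dict:
    """Fill the <correct option> slot and sample three <distractor option> slots."""
    distractors = random.sample([g for g in AGE_GROUPS if g != correct], 3)
    options = distractors + [correct]
    random.shuffle(options)
    return {
        "question": TEMPLATE,
        "options": dict(zip("ABCD", options)),
        "answer": next(label for label, opt in zip("ABCD", options) if opt == correct),
    }
```

A call like `build_question("21-30")` yields one multiple-choice item whose answer key points at the ground-truth age group.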
Age Estimation | Multiple Images
年龄估计 | 多张图像
Which image shows a person in {age_group} age group?
哪张图片显示的是{age_group}年龄段的人?
Age Estimation | Multiple Images
年龄估计 | 多张图像
Who among these two appears to be older?
这两人中谁看起来更年长?
Age Estimation | Multiple Images
年龄估计 | 多图像
Arrange the following images in ascending order of age.
按年龄升序排列以下图像。
Age Estimation | Multiple Images
年龄估计 | 多图像
Arrange the following images in descending order of age.
将以下图像按年龄从大到小排列。
Age Estimation | Multiple Images
年龄估计 | 多张图像
How many images have a person of age between {age_start} and {age_end}?
有多少张图片中的人的年龄在 {age_start} 到 {age_end} 岁之间?
Age Estimation | Multiple Images
年龄估计 | 多张图像
Among these images, which image shows the oldest person?
在这些图像中,哪张图像显示的是最年长的人?
Age Estimation | Single Image
年龄估计 | 单张图像
What is the most appropriate age group for the person in this image?
这张图片中的人最适合的年龄组是什么?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Age Estimation | Single Image
年龄估计 | 单张图像
Approximately how old is the person in this image?
这张图片中的人大约多大年纪?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Age Estimation | Single Image
年龄估计 | 单张图像
Select the age group that best describes the person in this image.
选择最能描述图中人物年龄的年龄段。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Age Estimation | Single Image
年龄估计 | 单张图像
Estimate the age group of the person in this image.
估计这张图片中人物的年龄组。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Gender Prediction | Multiple Images
性别预测 | 多张图像
Which image shows a male person?
哪张图片显示的是男性?
Gender Prediction | Multiple Images
性别预测 | 多张图像
Which image shows a female person?
哪张图片显示的是女性?
Gender Prediction | Multiple Images
性别预测 | 多张图像
Identify which image shows a person whose gender is male.
识别哪张图片显示的是男性。
Gender Prediction | Multiple Images
性别预测 | 多张图像
Which images appear to have a person with male gender?
哪些图像中似乎有男性性别的人?
A. Image 1, Image 2 B. Image 3, Image 1 C. Image 1 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 1
C. 图像 1
D. 以上都不是
Gender Prediction | Multiple Images
性别预测 | 多张图像
How many images show female individuals?
有多少张图片展示了女性个体?
Gender Prediction | Single Image
性别预测 | 单张图像
What is the gender of the person in this image?
这张图片中的人的性别是什么?
Gender Prediction | Single Image
性别预测 | 单张图像
Identify the gender of the person in this image.
识别此图像中人物的性别。
Gender Prediction | Single Image
性别预测 | 单张图像
Which gender category best describes the person in this image?
这张图片中的人物最符合哪个性别类别?
Gender Prediction | Single Image
性别预测 | 单张图像
Select the most appropriate gender for the person in this image.
选择此图像中人物最合适的性别。
Gender Prediction | Single Image
性别预测 | 单张图像
Determine the gender of the person shown in this image.
确定图中所示人物的性别。
Race Estimation | Multiple Images
种族估计 | 多张图像
Between the two images, who seems more likely to belong to the Black race?
在这两张图片中,谁看起来更有可能属于黑人种族?
Race Estimation | Multiple Images
种族估计 | 多张图像
Which individual appears to be of Latino Hispanic origin?
哪位个体看起来是拉丁裔西班牙裔?
Race Estimation | Multiple Images
种族估计 | 多张图像
Which images depict individuals of Indian origin?
哪些图像描绘了印度裔个体?
A. Image 1, Image 2 B. Image 3 C. None of the above D. Image 2, Image 3
A. 图像 1, 图像 2
B. 图像 3
C. 以上都不是
D. 图像 2, 图像 3
Race Estimation | Multiple Images
种族估计 | 多张图像
Which images appear to have individuals of Southeast Asian descent?
哪些图像中似乎有东南亚裔的人?
A. Image 1, Image 2 B. Image 1 C. None of the above D. Image 2, Image 3
A. 图像 1, 图像 2
B. 图像 1
C. 以上都不是
D. 图像 2, 图像 3
Race Estimation | Multiple Images
种族估计 | 多张图像
How many images show people of Black race?
有多少张图片展示了黑人种族的人?
Race Estimation | Single Image
种族估计 | 单张图像
What is the race of the person in this image?
这张图片中的人是什么种族?
Race Estimation | Single Image
种族估计 | 单张图像
Identify the race of the person in this image.
识别此图像中人物的种族。
Race Estimation | Single Image
种族估计 | 单张图像
Which race category best describes the person in this image?
这张图片中的人属于哪个种族类别?
Race Estimation | Single Image
种族估计 | 单张图像
Select the most appropriate race for the person in this image.
选择此图像中人物最合适的种族。
Race Estimation | Single Image
种族估计 | 单张图像
Determine the race of the person shown in this image.
确定这张图片中显示的人的种族。
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of a person A. The same person A is present in which of the other images?
第一张图片是人物 A。人物 A 还出现在哪张其他图片中?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of a person A. The same person A is present in how many of the remaining images?
第一张图片是人物 A。人物 A 在剩余的图片中出现了多少次?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
How many unique identities are present in these images?
这些图像中存在多少个独特的身份?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of a person A. Which of the other images show person A?
第一张图片是人物A。其他图片中哪一张显示的是人物A?
A. Image 2, Image 3 B. Image 2, Image 4 C. Image 3, Image 4 D. None of the above
A. 图像 2, 图像 3
B. 图像 2, 图像 4
C. 图像 3, 图像 4
D. 以上都不是
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
Which two images show the same person?
哪两张图片显示的是同一个人?
A. Image 1, Image 2 B. Image 3, Image 4 C. Image 2, Image 5 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 4
C. 图像 2, 图像 5
D. 以上都不是
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
The first image is of person A. Which images are not person A?
第一张图片是人物 A。哪些图片不是人物 A?
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多张图像
There are two images of the same person. Which images are of people different from that person?
有两张同一个人的图像。哪些图像中的人与这个人不同?
A. Image 1, Image 2 B. Image 2, Image 3 C. Image 3, Image 4 D. None of the above
A. 图像 1, 图像 2
B. 图像 2, 图像 3
C. 图像 3, 图像 4
D. 以上都不是
High-Resolution Face Recognition/Low-Resolution Face Recognition | Multiple Images
高分辨率人脸识别/低分辨率人脸识别 | 多图像
How many pairs of images show the same person?
有多少对图像显示的是同一个人?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which image is of the celebrity {celebrity name}?
哪张图片是名人 {celebrity name} 的?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
How many images have the celebrity {celebrity name}?
有多少张图片中有名人 {celebrity name}?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which images have the celebrity {celebrity name}?
哪些图片中有名人 {celebrity name}?
A. Image 1, Image 2 B. Image 3, Image 2 C. Image 1 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 2
C. 图像 1
D. 以上都不是
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which images do not have the celebrity {celebrity name}?
哪些图片中没有名人 {celebrity name}?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
Which pair of images share the same celebrity?
哪对图像中的名人是同一位?
A. Image 1, Image 2 B. Image 3, Image 4 C. Image 2, Image 3 D. None of the above
A. 图像 1, 图像 2
B. 图像 3, 图像 4
C. 图像 2, 图像 3
D. 以上都不是
Celebrity Identification | Multiple Images
名人识别 | 多张图像
How many unique celebrities are present in these images?
这些图像中有多少位独特的名人?
Celebrity Identification | Multiple Images
名人识别 | 多张图像
What is the name of the most frequently occurring celebrity in these images?
这些图像中出现频率最高的名人名字是什么?
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Celebrity Identification | Single Image
名人识别 | 单张图像
What is the name of this celebrity?
这位名人的名字是什么?
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Celebrity Identification | Single Image
名人识别 | 单张图像
Who is this person?
这个人是谁?
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Celebrity Identification | Single Image
名人识别 | 单张图像
Name the celebrity shown in the image.
识别图中名人的姓名。
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
A. <celebrity_A> B. <celebrity_B> C. <celebrity_C> D. <celebrity_D>
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which image shows a bonafide person?
哪张图片显示的是真实的人?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which image is spoof attacked?
哪张图片受到了欺骗攻击?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which statement best describes the images?
哪句话最能描述这些图像?
A. Both images are bonafide B. Both images are attacks C. Image 1 is bonafide and Image 2 is attack D. Image 1 is attack and Image 2 is bonafide
A. 两张图片均为真实图像
B. 两张图片均为攻击图像
C. 图片1为真实图像,图片2为攻击图像
D. 图片1为攻击图像,图片2为真实图像
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
How many images are bonafide?
有多少图像是真实的?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
How many images are spoof attacked?
有多少图像受到了欺骗攻击?
Face Anti-spoofing | Multiple Images
人脸反欺骗 | 多图像
Which images are not bonafide?
哪些图像不是真实的?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 3 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 3
D. 以上都不是
Face Anti-spoofing | Single Image
人脸反欺骗 | 单张图像
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which deepfakes belong to the same identity?
哪些深度伪造属于同一身份?
A. Image 1 and Image 2 B. Image 1 and Image 3 C. Image 2 and Image 4 D. None of the above
A. 图像1和图像2
B. 图像1和图像3
C. 图像2和图像4
D. 以上都不是
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
How many deepfakes belong to the same identity?
同一身份有多少个深度伪造?
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
How many images are real?
有多少图像是真实的?
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which images are deepfakes? A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1, Image 2, and Image 3 D. None of the above
哪些图像是深度伪造 (deepfake) 的?
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1、图像 2 和图像 3
D. 以上都不是
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
How many images are deepfakes?
有多少图像是深度伪造 (deepfakes)?
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which images are real and which are fake? A. Images 1 and 2 are real, Image 3 is fake B. Images 2 and 3 are real, Image 1 is fake C. Images 1 and 3 are real, Image 2 is fake D. All images are fake
哪些图像是真实的,哪些是假的?
A. 图像 1 和 2 是真实的,图像 3 是假的
B. 图像 2 和 3 是真实的,图像 1 是假的
C. 图像 1 和 3 是真实的,图像 2 是假的
D. 所有图像都是假的
Deepfake Detection | Multiple Images
深度伪造检测 | 多图像
Which images are not deepfakes?
哪些图像不是深度伪造 (deepfake)?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1
D. 以上都不是
Attribute Prediction | Single Image
属性预测 | 单张图像
Identify which of the following attributes is present in the image.
识别图像中存在的以下属性。
A. 5 o'Clock Shadow B. Blond Hair C. Eyeglasses D. None of the above
A. 五点钟胡茬 (5 o'Clock Shadow)
B. 金发
C. 眼镜
D. 以上都不是
Attribute Prediction | Single Image
属性预测 | 单张图像
Which of the following attributes is NOT present in the image?
以下哪个属性不在图像中?
A. Heavy Makeup B. Bald C. Black Hair D. None of the above
A. 浓妆
B. 秃头
C. 黑发
D. 以上都不是
Attribute Prediction | Single Image
属性预测 | 单张图像
Which attribute is the most prominent in the image?
图像中最突出的属性是什么?
Attribute Prediction | Multiple Images
属性预测 | 多图像
Which of the following attributes do both images have in common?
以下哪个属性是两张图像共有的?
Attribute Prediction | Multiple Images
属性预测 | 多张图像
How many images have a person with the attribute 'Smiling'?
有多少张图片中的人物具有“微笑”属性?
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which images have a person with the attribute {attribute type}?
哪些图像中的人物具有 {attribute type} 属性?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1
D. 以上都不是
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which of the following attributes are common to all images?
以下哪个属性是所有图像共有的?
A. Bald B. Bushy Eyebrows C. Black Hair D. None of the above
A. 秃头 B. 浓眉 C. 黑发 D. 以上都不是
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which attribute is unique to one of the images?
哪一项属性是其中一张图片独有的?
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which pair of images shares the most attributes?
哪对图像共享最多的属性?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 and Image 3 D. None of the above
A. 图像1和图像2
B. 图像2和图像3
C. 图像1和图像3
D. 以上都不是
Attribute Prediction | Multiple Images
属性预测 | 多张图像
Which attribute is most frequent among the images?
哪个属性在图像中出现频率最高?
Facial Expression Recognition | Multiple Images
面部表情识别 | 多张图像
How many images have a person with the expression {expression type}?
有多少张图片中的人表情为 {expression type}?
Facial Expression Recognition | Multiple Images
面部表情识别 | 多图像
Which images have a person with the expression {expression type}?
哪些图像中有表情为 {expression type} 的人?
Facial Expression Recognition | Multiple Images
面部表情识别 | 多张图像
Which images do not have a person with the expression {expression type}?
哪些图像中没有表情为 {expression type} 的人?
A. Image 1 and Image 2 B. Image 3 and Image 4 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 3 和图像 4
C. 图像 1
D. 以上都不是
Facial Expression Recognition | Multiple Images
面部表情识别 | 多图像
Which pair of images share the same expression?
哪对图像共享相同的表情?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 1 and Image 3 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 1 和图像 3
D. 以上都不是
Facial Expression Recognition | Multiple Images
面部表情识别 | 多张图像
Which image has a person with the expression {expression type}?
哪张图片中的人表情为 {expression type}?
Facial Expression Recognition | Single Image
面部表情识别 | 单张图像
Which expression is the person showing in the image?
图中人物展示的是哪种表情?
A. Surprise B. Happy C. Sad D. Neutral
A. 惊讶
B. 开心
C. 悲伤
D. 中性
Facial Expression Recognition | Single Image
面部表情识别 | 单张图像
Identify the primary expression shown by the person in the image.
识别图像中人物表现的主要表情。
A. Fear B. Disgust C. Anger D. Neutral
A. 恐惧
B. 厌恶
C. 愤怒
D. 中性
Facial Expression Recognition | Single Image
面部表情识别 | 单张图像
What is the main expression of the person in the image?
图中人物的主要表情是什么?
A. Happy B. Sad C. Surprise D. Anger
A. 快乐 B. 悲伤 C. 惊讶 D. 愤怒
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the pitch angle of headpose orientation in range {pitch_range}?
哪些图像中的人物头部姿态的俯仰角在 {pitch_range} 范围内?
A. Image 1 and Image 2 B. Image 3 and Image 4 C. Image 1 D. None of the above
A. 图像 1 和图像 2
B. 图像 3 和图像 4
C. 图像 1
D. 以上都不是
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the roll angle of headpose orientation in range {roll_range}?
哪些图像中的人物头部姿态方向的滚转角在 {roll_range} 范围内?
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which pair of images have a person that share the same yaw angle bin index of headpose orientation, given that the total yaw angle range is from -100 to 100 degrees, with bins of 10 degrees each?
给定总偏航角范围为 -100 到 100 度,每个区间为 10 度,哪对图像中的人具有相同的头部姿态方向的偏航角区间索引?
A. Image 1 and Image 2 B. Image 2 and Image 3 C. Image 3 and Image 4 D. None of the above
A. 图像 1 和图像 2
B. 图像 2 和图像 3
C. 图像 3 和图像 4
D. 以上都不是
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the yaw, pitch, and roll angles of headpose orientation between {yaw_range}, {pitch_range}, and {roll_range} degrees respectively?
哪些图像中的人物头部姿态的偏航角 (yaw)、俯仰角 (pitch) 和翻滚角 (roll) 分别在 {yaw_range}、{pitch_range} 和 {roll_range} 度之间?
Headpose Estimation | Multiple Images
头部姿态估计 | 多图像
Which images have a person with the yaw and pitch angles of headpose orientation between {yaw_range} and {pitch_range} degrees respectively?
哪些图像中的人物头部姿态方向的偏航角和俯仰角分别在 {yaw_range} 和 {pitch_range} 度之间?
A. Image 1 and Image 2 B. Image 3 and Image 4 C. Image 2 D. None of the above
A. 图像 1 和图像 2
B. 图像 3 和图像 4
C. 图像 2
D. 以上都不是
Headpose Estimation | Multiple Images
头部姿态估计 | 多张图像
Which images have a person with the pitch and roll angles of headpose orientation between {pitch_range} and {roll_range} degrees respectively?
哪些图像中的人物头部姿态方向的俯仰角和横滚角分别在 {pitch_range} 和 {roll_range} 度之间?
Headpose Estimation | Single Image
头部姿态估计 | 单张图像
What is the yaw angle range of headpose orientation for the person in this image?
这张图片中人物的头部姿态方向的偏航角范围是多少?
A. -30 to -20 degrees B. -10 to 10 degrees C. 20 to 30 degrees D. None of the above
A. -30 到 -20 度
B. -10 到 10 度
C. 20 到 30 度
D. 以上都不是
Headpose Estimation | Single Image
头部姿态估计 | 单张图像
What is the pitch angle range of headpose orientation for the person in this image?
这张图片中人物的头部姿态方向的俯仰角范围是多少?
A. -15 to -5 degrees B. 0 to 10 degrees C. 15 to 25 degrees D. None of the above
A. -15 至 -5 度
B. 0 至 10 度
C. 15 至 25 度
D. 以上都不是
Headpose Estimation | Single Image
头部姿态估计 | 单张图像
What is the roll angle range of headpose orientation for the person in this image?
这张图片中人物的头部姿态旋转角度范围是多少?
A. -25 to -15 degrees B. -5 to 5 degrees C. 10 to 20 degrees D. None of the above
A. -25 至 -15 度
B. -5 至 5 度
C. 10 至 20 度
D. 以上都不是
Crowd Counting | Single Image
人群计数 | 单张图像
How many people are present in this image?
这张图片中有多少人?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
What is the number of individuals shown in this image?
这张图片中显示了多少人?
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
Determine the number of people in this picture.
确定这张图片中的人数。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
Estimate the count of people present in this image.
估计此图像中存在的人数。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Crowd Counting | Single Image
人群计数 | 单张图像
Please identify the number of individuals in this image.
请识别此图像中的人数。
A. <correct option> B. <distractor option> C. <distractor option> D. <distractor option>
A. <正确选项> B. <干扰选项> C. <干扰选项> D. <干扰选项>
Face Parsing | Single Image
人脸解析 | 单张图像
Which of the following regions is not present or is segmented out with white color?
以下哪个区域不存在或被白色分割?
A. left eyebrow B. glasses C. right eye D. left ear
A. 左眉 B. 眼镜 C. 右眼 D. 左耳
Face Parsing | Single Image
人脸解析 | 单张图像
Which region is segmented out with white color?
哪个区域被白色分割出来?
Tools Retrieval | Text Only
工具检索 | 仅文本
A video conferencing platform needs to ensure user engagement by tracking head pose, verifying expressions over time, and confirming that the detected face is real and not a spoof. Expression tracking should continue only if head pose confidence is high. What is the correct sequence of API calls?
一个视频会议平台需要通过跟踪头部姿态、随时间验证表情以及确认检测到的面部是真实的而非伪造的来确保用户参与度。只有在头部姿态置信度高的情况下,才应继续表情跟踪。正确的 API 调用顺序是什么?
A. api_4-detect spoofing, api_11-estimate head pose, api_11-pose confidence score, api_10-track expression over time B. api_11-estimate head pose, api_11-pose confidence score, api_10-track expression over time, api_4-detect spoofing C. api_11-estimate head pose, api_11-pose confidence score, api_10-track expression over time, api_4-detect spoofing D. api_4-detect spoofing, api_10-track expression over time, api_11-estimate head pose, api_11-pose confidence score
A. api_4-检测欺骗, api_11-估计头部姿态, api_11-姿态置信度得分, api_10-随时间跟踪表情
B. api_11-估计头部姿态, api_11-姿态置信度得分, api_10-随时间跟踪表情, api_4-检测欺骗
C. api_11-估计头部姿态, api_11-姿态置信度得分, api_10-随时间跟踪表情, api_4-检测欺骗
D. api_4-检测欺骗, api_10-随时间跟踪表情, api_11-估计头部姿态, api_11-姿态置信度得分
Tools Retrieval | Text Only
工具检索 | 仅文本
An AR app segments users’ faces into regions, detects head pose, and applies filters based on gender and expressions. Expression analysis is performed only if gender is classified with high confidence. Which sequence of API calls is appropriate?
一个 AR 应用将用户的面部分割成多个区域,检测头部姿态,并根据性别和表情应用滤镜。只有在性别分类具有高置信度时,才会进行表情分析。哪种 API 调用序列是合适的?
Tools Retrieval | Text Only
工具检索 | 仅文本
A high-security facility verifies individuals’ identities, estimates age, and monitors expressions. Age estimation is only done if expression confidence is above a threshold. What is the appropriate API function sequence?
高安全性设施验证个人身份、估计年龄并监控表情。仅当表情置信度高于阈值时,才会进行年龄估计。合适的 API 函数调用顺序是什么?
A. api_7-identify high res face, api_1-predict age, api_10-detect expression, api_10-get emotion probabilities B. api_7-identify high res face, api_10-get emotion probabilities, api_10-detect expression, api_1-predict age C. api_7-identify high res face, api_10-detect expression, api_10-get emotion probabilities, api_1-predict age D. api_10-detect expression, api_7-identify high res face, api_1-predict age, api_10-get emotion probabilities
A. api_7-识别高分辨率人脸, api_1-预测年龄, api_10-检测表情, api_10-获取情绪概率
B. api_7-识别高分辨率人脸, api_10-获取情绪概率, api_10-检测表情, api_1-预测年龄
C. api_7-识别高分辨率人脸, api_10-检测表情, api_10-获取情绪概率, api_1-预测年龄
D. api_10-检测表情, api_7-识别高分辨率人脸, api_1-预测年龄, api_10-获取情绪概率
Tools Retrieval | Text Only
工具检索 | 仅文本
For a security checkpoint, the system detects deepfakes, checks for spoofing, and estimates head pose. Spoof detection is mandatory before head pose estimation if deepfake confidence is high. What is the correct sequence?
对于安全检查点,系统会检测深度伪造 (deepfake)、检查欺骗行为,并估计头部姿态。如果深度伪造置信度高,则在头部姿态估计之前必须进行欺骗检测。正确的顺序是什么?
A. api_11-estimate head pose, api_5-detect deep fake, api_4-detect spoofing B. api_5-detect deep fake, api_11-estimate head pose, api_4-spoof confidence score C. api_5-detect deep fake, api_4-detect spoofing, api_11-estimate head pose D. api_4-detect spoofing, api_11-estimate head pose, api_5-detect deep fake
A. api_11-估计头部姿态, api_5-检测深度伪造, api_4-检测欺骗
B. api_5-检测深度伪造, api_11-估计头部姿态, api_4-欺骗置信度评分
C. api_5-检测深度伪造, api_4-检测欺骗, api_11-估计头部姿态
D. api_4-检测欺骗, api_11-估计头部姿态, api_5-检测深度伪造
Tools Retrieval | Text Only
工具检索 | 仅文本
An interactive museum exhibit uses head pose tracking and demographic analysis (age, gender, race) to tailor content for visitors. Race prediction is skipped if age confidence falls below a certain threshold. Which API sequence would you use?
一个互动博物馆展览使用头部姿态跟踪和人口统计(年龄、性别、种族)分析来为访客定制内容。如果年龄置信度低于某个阈值,则跳过种族预测。你会使用哪个 API 序列?
C.2. Generating Distractor Options
C.2. 生成干扰选项
We strategically design the distractor options to encourage MLLMs to carefully choose an option as the prediction. The following methods are employed for different types of tasks:
我们策略性地设计干扰选项,以鼓励多模态大语言模型 (MLLMs) 仔细选择一个选项作为预测。针对不同类型的任务,采用了以下方法:
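For numeric tasks such as age estimation or crowd counting, one plausible strategy is to offset the ground-truth value so that distractors stay close enough to be tempting yet never equal the correct answer. The helper below is an illustrative sketch under that assumption, not the rules actually used to build FaceXBench:

```python
import random

def numeric_distractors(correct: int, step: int = 10, k: int = 3) -> list:
    """Hypothetical helper: build k distractors by offsetting the correct
    value with non-zero multiples of `step`, keeping results non-negative."""
    offsets = [m * step for m in (-3, -2, -1, 1, 2, 3)]
    candidates = [correct + o for o in offsets if correct + o >= 0]
    return random.sample(candidates, k)

options = numeric_distractors(25) + [25]  # three distractors plus the answer
```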
C.3. FaceXAPI dataset prompt
C.3. FaceXAPI 数据集提示
To generate the FaceXAPI questions, we design a detailed prompt incorporating 13 API calls and a total of 32 functions. Additional guidelines are provided to ensure scenario realism, functional complexity, and diversity in the questions. We also include instructions for generating distractor options, emphasizing logical plausibility and ensuring they are sufficiently close yet distinct from the correct answer. A total of 150 questions are generated, out of which 100 are carefully selected to maintain diversity. To further enhance the quality, the options are manually reviewed to ensure the presence of one correct answer alongside logically plausible distractors. The detailed prompt for question generation is provided below.
为了生成 FaceXAPI 问题,我们设计了一个详细的提示,包含 13 个 API 调用和总共 32 个函数。提供了额外的指导方针,以确保场景的真实性、功能的复杂性和问题的多样性。我们还包含了生成干扰选项的说明,强调逻辑上的合理性,并确保它们与正确答案足够接近但又有所区别。总共生成了 150 个问题,其中 100 个经过精心挑选以保持多样性。为了进一步提高质量,选项经过人工审查,以确保存在一个正确答案以及逻辑上合理的干扰项。以下是问题生成的详细提示。
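Part of that manual review can be automated. As a hypothetical sketch (the `Answer:` line format is our assumption about how generated items might be serialized), each generated item can be checked for exactly four A-D options and a single marked answer before selection:

```python
import re

def validate_mcq(item: str) -> bool:
    """Hypothetical filter for generated FaceXAPI items: require exactly
    four options labelled A-D and exactly one 'Answer: X' line."""
    labels = re.findall(r"^([A-D])\.", item, flags=re.MULTILINE)
    answers = re.findall(r"^Answer:\s*([A-D])\s*$", item, flags=re.MULTILINE)
    return sorted(labels) == ["A", "B", "C", "D"] and len(answers) == 1

sample = ("What is the correct sequence of API calls?\n"
          "A. api_1\nB. api_2\nC. api_3\nD. api_4\nAnswer: C\n")
```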
You are an AI tasked with generating complex, real-world scenario questions to assess a model’s ability to select the correct API and function calls to accomplish nuanced tasks. Use the list of APIs and functions provided below.
你是一个AI,任务是生成复杂的现实场景问题,以评估模型选择正确API和函数调用来完成细致任务的能力。使用下面提供的API和函数列表。
1. Age Estimation:
1. 年龄估计:
api_name: api_1
• predict age:
Description: Predicts the age of the person in the input face image.
Input: np.ndarray or str - The input face image.
Output: int - The estimated age.
• age confidence score:
Description: Provides a confidence score for the age estimation.
Input: dict - Output from the age estimation model.
Output: float - Confidence score of the age prediction.
api_name: api_1
• 预测年龄 (predict age):
描述: 预测输入人脸图像中人物的年龄。
输入: np.ndarray 或 str - 输入的人脸图像。
输出: int - 估计的年龄。
• 年龄置信度分数 (age confidence score):
描述: 提供年龄估计的置信度分数。
输入: dict - 年龄估计模型的输出。
输出: float - 年龄预测的置信度分数。
2. Gender Prediction:
api_name: api_2
• classify gender: Description: Classifies the gender of the person in the face image. Input: np.ndarray or str - The input face image. Output: str - The predicted gender ('male' or 'female').
• get gender probabilities: Description: Returns probabilities for each gender class. Input: np.ndarray - The input face image. Output: dict - Probabilities for 'male' and 'female' classes.
3. Race Detection:
api_name: api_3
• predict race: Description: Predicts the race of the individual(s) in the input image(s). Input: str, bytes, np.ndarray, or list - The input image or batch of images. Output: list - Predicted races for each detected face, including race labels and probabilities.
• get race probabilities: Description: Returns probability distribution over races for detected faces. Input: str, bytes, np.ndarray, or list - The input image or batch of images. Output: list - For each detected face, a dictionary of race probabilities.
• race confidence score: Description: Provides a confidence score for race prediction for each detected face. Input: str, bytes, np.ndarray, or list - The input image(s). Output: list - Confidence scores for race predictions.
4. Face Anti-Spoofing:
api_name: api_4
• detect spoofing: Description: Detects if the face in the image is real or a spoof. Input: np.ndarray or str - The input face image. Output: bool - True if spoof detected, False if face is real.
• spoof confidence score: Description: Provides a confidence score indicating the likelihood of spoofing. Input: dict - Output from the anti-spoofing model. Output: float - Spoofing confidence score.
5. Deepfake Detection:
api_name: api_5
6. Low-Resolution Face Recognition:
api_name: api_6
7. High-Resolution Face Recognition:
api_name: api_7
8. Celebrity Identification:
api_name: api_8
9. Attributes Prediction:
api_name: api_9
• detect attributes: Description: Detects various attributes of the face in the image. Input: np.ndarray or str - The input face image. Output: dict - Detected attributes and their values.
• list face attributes: Description: Lists all possible face attributes that can be predicted. Input: None Output: list - List of attribute names.
• attribute confidence score: Description: Provides confidence scores for each predicted attribute. Input: dict - Output from the attribute detection model. Output: dict - Confidence scores for each attribute.
10. Facial Expression Recognition:
api_name: api_10
• detect expression: Description: Detects facial expressions in the input image. Input: np.ndarray or str - The input face image. Output: str - The detected expression label.
• get emotion probabilities: Description: Provides probabilities for each emotion class. Input: np.ndarray - The input face image. Output: dict - Probabilities for each emotion class.
• track expression over time: Description: Tracks facial expressions over a sequence of frames. Input: list of np.ndarray - List of frames from a video. Output: list - Sequence of detected expressions.
11. Headpose Estimation:
api_name: api_11
• estimate head pose: Description: Estimates the head pose angles (yaw, pitch, roll) from the face image. Input: np.ndarray or str - The input face image. Output: tuple - Estimated angles (yaw, pitch, roll).
• pose confidence score: Description: Provides a confidence score for the estimated head pose. Input: dict - Output from the head pose estimation model. Output: float - Confidence score for the head pose estimation.
12. Crowd Counting:
api_name: api_12
• estimate crowd size: Description: Estimates the size of the crowd based on face detection. Input: np.ndarray - The input image. Output: int - Estimated number of people in the crowd.
• aggregate counting data: Description: Aggregates counting data over multiple images or frames. Input: list of np.ndarray - List of images. Output: dict - Aggregated counting results.
13. Face Segmentation:
api_name: api_13
• segment face regions: Description: Segments different regions of the face in the image. Input: np.ndarray - The input face image. Output: np.ndarray - Segmentation mask of the face regions.
• classify face parts: Description: Classifies different parts of the face into categories. Input: np.ndarray - The input face image. Output: dict - Dictionary of face parts and their classifications.
• get segmentation mask part: Description: Generates a segmentation mask for a particular face part. Input: np.ndarray - The input face image. Output: np.ndarray - Binary mask of the segmented face part.
Each scenario should require the model to accurately retrieve and execute 3 to 5 function calls across multiple APIs, simulating the complexity and sequential decision-making needed in real-world applications.
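To make the retrieval setting concrete, the sketch below shows one way such a chain of API function calls could be dispatched by name. The stub implementations and their return values are hypothetical placeholders for illustration, not part of FaceXAPI.

```python
# Hypothetical stubs standing in for two of the api_11 functions; real
# implementations would run head-pose models on the input image.
def estimate_head_pose(face_image):
    return (5.0, 2.0, 0.5)  # stub (yaw, pitch, roll)

def pose_confidence_score(pose_output):
    return 0.95  # stub confidence for the pose estimate

# Registry mapping FaceXAPI-style call names to callables.
REGISTRY = {
    "api_11-estimate head pose": estimate_head_pose,
    "api_11-pose confidence score": pose_confidence_score,
}

def run_chain(chain, face_image):
    """Execute calls in order, feeding each call the previous result
    (a simplification of how outputs flow between the listed functions)."""
    result = face_image
    outputs = []
    for call_name in chain:
        result = REGISTRY[call_name](result)
        outputs.append(result)
    return outputs
```

Running the two api_11 calls in sequence yields the pose tuple followed by its confidence score, mirroring the sequential decision-making the questions are meant to probe.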
Guidelines for Generating Questions:
Guidelines for Generating Options:
• Complete API Chains: Provide four option chains, each specifying a complete sequence of API function calls in the correct order necessary to solve the task. One of these sequences should be the correct answer, while the other three should be close but logically incorrect.
• Logical Plausibility of Distractors: Ensure distractors are constructed logically and appear plausible. Avoid overly obvious incorrect answers; the distractors should require reasoning to eliminate, demanding attention to function descriptions and careful thought to arrive at the correct answer.
• Randomized Answer Positioning: Shuffle the options to ensure the correct answer appears randomly in position A, B, C, or D.
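The randomized-positioning guideline can be sketched as a small helper; the function below and its labeling scheme are illustrative assumptions, not the exact generation code.

```python
import random

def shuffle_options(option_texts, correct_index, rng=None):
    """Shuffle answer options so the correct one lands uniformly in
    position A, B, C, or D, and return the new correct label."""
    rng = rng or random.Random()
    order = list(range(len(option_texts)))
    rng.shuffle(order)
    labels = ["A", "B", "C", "D"]
    shuffled = [f"({labels[i]}) {option_texts[j]}" for i, j in enumerate(order)]
    correct_label = labels[order.index(correct_index)]
    return shuffled, correct_label
```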
We provide an example question and output template, which must be strictly followed when providing output.
Example Question:
In an airport security system, faces are checked for deepfakes, head poses are verified, and age is estimated. Age is analyzed only if head pose confidence is high. Which API sequence should be applied?
A. api_5-detect deepfake, api_1-predict age, api_11-pose confidence score, api_11-estimate head pose
B. api_11-estimate head pose, api_11-pose confidence score, api_5-detect deepfake, api_1-predict age
C. api_5-detect deepfake, api_11-estimate head pose, api_11-pose confidence score, api_1-predict age
D. api_11-pose confidence score, api_5-detect deepfake, api_11-estimate head pose, api_1-predict age
Correct Answer: C. api_5-detect deepfake, api_11-estimate head pose, api_11-pose confidence score, api_1-predict age
Provide 150 questions. Remember to follow the JSON structure and guidelines strictly. Ensure each question and answer chain is unique and fits the described requirements.


C.4. Answer Extraction
We design an extract option label function to robustly extract an option label from a given model output, using a three-step fallback mechanism to ensure accurate identification. The first step cleans the model output by removing leading and trailing whitespace. A series of regular expression (regex) patterns are then applied to match common formats of option labels, such as A, (A), A), Option A, A:, A-, specifically at the start of the text. If a match is found, the label is validated against the provided list of option labels and returned if valid. If no valid label is identified at the start, the second fallback step searches the entire model output for matches to these regex patterns and again validates the extracted labels against the option labels, allowing flexibility in label positioning. If both regex-based approaches fail, the third and final fallback step compares the model output directly with the exact text of the provided options. It checks whether the entire model output or any substring within it matches an option text and maps it to the corresponding label. If no label is found after these steps, the function returns None. This layered process minimizes false positives by progressively narrowing down matches using robust regex and exact comparisons.
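A minimal sketch of such a three-step fallback is shown below; the function name, exact regex patterns, and validation details are illustrative assumptions rather than the benchmark's exact implementation.

```python
import re

def extract_option_label(model_output, option_labels, options):
    """Three-step fallback: (1) match a label at the start of the cleaned
    text, (2) search the whole text, (3) compare against option texts.
    Returns the matched label, or None. Illustrative sketch only."""
    text = model_output.strip()  # step 0: clean whitespace
    # Patterns for common label formats: "(A)", "A)/A:/A-/A.", "Option A", bare "A".
    start_patterns = [
        r"^\(([A-Z])\)",
        r"^([A-Z])[\)\:\-\.]",
        r"^Option\s+([A-Z])\b",
        r"^([A-Z])\b",
    ]
    # Step 1: look for a valid label at the start of the output.
    for pat in start_patterns:
        m = re.match(pat, text)
        if m and m.group(1) in option_labels:
            return m.group(1)
    # Step 2: fall back to searching anywhere in the output.
    for pat in start_patterns:
        for m in re.finditer(pat.lstrip("^"), text):
            if m.group(1) in option_labels:
                return m.group(1)
    # Step 3: compare the output against the exact option texts.
    lowered = text.lower()
    for label, option_text in zip(option_labels, options):
        if option_text and option_text.lower() in lowered:
            return label
    return None
```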
C.5. Dataset Format
The benchmark consists of multiple JSON files, with each dataset represented in a JSON file named in the format <dataset_name>_<multiple/single>.json. These files include detailed metadata and questions for evaluation, enabling easy analysis and comprehensive testing. The JSON dataset format is designed to enable the evaluation of MLLMs on specific tasks or subsets of tasks. A sample structure of the JSON file is provided below.
The components in the JSON file are as follows:
This JSON format allows the community to evaluate MLLMs on a subset of categories or tasks, providing flexibility and consistency in evaluation. The inclusion of both the correct answer option and the answer text supports diverse evaluation strategies.
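To illustrate how the per-dataset JSON files can drive task-level evaluation, the loader below filters entries by a task field; the field names used here ("task", "question") are assumptions for this sketch, not the benchmark's exact schema.

```python
import json

def load_questions(json_path, task=None):
    """Load a FaceXBench-style JSON file and optionally keep only the
    entries for one task. Field names are illustrative assumptions."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if task is not None:
        data = [entry for entry in data if entry.get("task") == task]
    return data
```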
C.6. Other Evaluation Settings
In the in-context evaluation setting, we prepend the task-description before the question to provide relevant context to the model. The prepend text for each task is as follows:
answer.’, to encourage the model to reason before making its final prediction.
答案。', 以鼓励模型在做出最终预测之前进行推理。
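The three evaluation settings differ only in how the question text is wrapped; a sketch of that composition is below, where the chain-of-thought instruction string is a stand-in, not the exact prompt used in the benchmark.

```python
def build_prompt(question, setting, task_description=None):
    """Compose the evaluation prompt for one of the three settings.
    The chain-of-thought wording here is a placeholder."""
    if setting == "in-context" and task_description:
        return f"{task_description}\n{question}"  # prepend task description
    if setting == "chain-of-thought":
        # Appended instruction encouraging reasoning before the final answer.
        return f"{question} Think step by step before giving the final answer."
    return question  # zero-shot: the question alone
```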
C.7. Implementation Details
To ensure reproducibility and efficiency, we utilized the open-source VLMEvalKit [21] for all experiments. The VLMEvalKit provides a comprehensive framework for evaluating vision-language models, streamlining the experimental workflow. Below, we outline the key libraries and parameter settings used in our implementation:
D. Results
In this section, we provide additional results as an extension of the main paper. We also showcase several failure cases and include more details on the baselines to facilitate easier reproducibility.
D.1. Baselines
Table D.1. Vision encoder, LLM and LLM size of the baseline models
| Model | Vision Encoder | LLM | LLM Size |
|---|---|---|---|
| PaliGemma [9] | SigLIP-So400m | Gemma-2B | 2B |
| LLaVA-OneVision-0.5b-OV [47] | SigLIP-So400m | Qwen2-0.5b | 0.5B |
| VILA 1.5-3b [54] | SigLIP-So400m | Sheared LLaMA-2.7b | 2.7B |
| Eagle-X4-8B-Plus [85] | Hybrid vision encoders | Llama3-8b-instruct | 8B |
| Idefics2-8b [41] | SigLIP-So400m | Mistral-7B-v0.1 | 7B |
| Idefics-9b-instruct [40] | CLIP-ViT-H-14-laion2B-s32B-b79K | Llama-7b | 7B |
| LLaVA-v1.5-7b [57] | CLIP ViT-L-14 | Vicuna-7b | 7B |
| Monkey-Chat [53] | ViT-BigG-2b | Qwen-7b | 7B |
| MiniCPM-Llama3-v-2.5 [114] | SigLIP-So400m | Llama3-8B-Instruct | 8B |
| LLaVA-OneVision-7b-SI [47] | SigLIP-So400m | Qwen-7b | 7B |
| LLaVA-NeXT-Interleave-7b [48] | SigLIP-So400m | Qwen1.5-7b | 7B |
| Phi-3.5-Vision [1] | CLIP ViT-L/14 | Phi-3.5-mini | 3.8B |
| LLaVA-OneVision-7b-OV [47] | SigLIP-So400m | Llama3-8B-Instruct | 8B |
| Qwen2-VL-7b-Instruct [103] | ViT-H | Qwen2 | 7B |
| InternVL2-8B [12] | InternViT-300M-448px | InternLM2_5-7b-chat | 7B |
| CogVLM2-19b [26] | EVA-CLIP | Llama-3-8B-Instruct | 8B |
| Idefics-80b-instruct [40] | CLIP-ViT-H-14-laion2B-s32B-b79K | Llama-65b | 65B |
| LLaVA-v1.5-13b [57] | CLIP ViT-L-14 | Vicuna-13b | 13B |
| VILA 1.5-13b [54] | SigLIP-So400m | Vicuna-13b | 13B |
| Mantis-SIGLIP-8b [32] | SigLIP-So400m | Llama-8b | 8B |
| InternVL-Chat-v1.5 [12] | InternViT-6B-448px-V1-5 | InternLM2-Chat-20B | 20B |
| LLaVA-OneVision-72b-OV [47] | SigLIP-So400m | Qwen-72B | 72B |
| Qwen2-VL-72b-Instruct [103] | ViT-H | Qwen2 | 72B |
| InternVL2-76B [12] | InternViT-6B-448px-V1-5 | Hermes-2-Theta-Llama-3-70B | 70B |
D.2. Results Across Tasks
To provide additional insights, we summarize the task-wise performance of various models in Table D.2. We observe that models struggle with tasks such as head pose estimation, crowd counting, and deepfake detection. Conversely, tasks like gender prediction, facial expression recognition, and high-resolution face recognition are comparatively easier. Additionally, we present the results of selected models under the "in-context" and "chain-of-thought" evaluation settings in Table D.3 and Table D.4, respectively. We observe a consistent drop in performance, indicating that the models are not capable of effectively processing in-context information and fail to reason in tasks related to face understanding.
Table D.2. Performance of various models across all evaluation tasks of FaceXBench.
[Table D.2 reports per-task accuracy (%) for the Random Choice and Frequent Choice baselines and all evaluated MLLMs, grouped into open-source models (< 4B, 4B - 13B, and > 13B parameters) and proprietary models, across all 14 tasks together with overall scores.]
| Task | Qwen2-VL-7B-Instruct | Phi-3.5-Vision | InternVL2-8B |
|---|---|---|---|
| Deepfake Detection | 27.00 | 31.00 | 32.00 |
| Face Parsing | 40.00 | 78.75 | 66.25 |
| Headpose Estimation | 24.25 | 27.25 | 28.75 |
| Gender Prediction | 62.80 | 72.80 | 74.00 |
| Attributes Prediction | 42.75 | 58.50 | 48.75 |
| Tools Retrieval | 39.00 | 42.00 | 48.00 |
| Face Anti-spoofing | 43.00 | 45.00 | 48.50 |
| Celebrity Identification | 51.33 | 55.67 | 57.67 |
| Crowd Counting | 30.33 | 22.33 | 27.67 |
| Race Estimation | 46.40 | 54.40 | 63.20 |
| Age Estimation | 33.20 | 38.00 | 40.60 |
| Expression Recognition | 53.00 | 51.75 | 58.00 |
| Overall | 41.48 | 50.32 | 52.14 |
Table D.4. (b) Performance of models in the chain-of-thought evaluation setting.
Table D.3. (a) Performance of models in the in-context evaluation setting.
D.3. Failure Cases
We conducted a failure case analysis using top-performing models, namely GPT-4o, Qwen2-VL, and InternVL2. As part of this analysis, we present one question from each task to illustrate the failure cases. Our observations reveal that these models struggle to answer basic questions related to face understanding, highlighting the need for a dedicated benchmark to monitor the progress of MLLMs in this domain.
Age Estimation
Question: <image 1><image 2> Identify the person who is most likely in the 20-29 age group. Option:


GPT-4o: (A) None of the above Prediction: ✗
Qwen2-VL: C Prediction: ✗
InternVL2: (B) Image 1 Prediction: ✗
Ground Truth: (D) Both Images
Gender Prediction

Question: <image 1><image 2><image 3> Which images depict female individuals? Option:


GPT-4o: (D) Image 1 Prediction: ✗
Qwen2-VL: C Prediction: ✓
InternVL2: (C) Image 3, Image 1 Prediction: ✓
Ground Truth: (C) Image 3, Image 1
Race Estimation

Question: <image 1><image 2><image 3> Which images depict people of the same race? Option:
(A) Image 2, Image 3 (B) Image 1, Image 2 (C) None of the above (D) Image 1, Image 2, Image 3


GPT-4o: (C) None of the above Prediction: ✓
Qwen2-VL: B Prediction: ✗
InternVL2: (B) Image 1, Image 2 Prediction: ✗
Ground Truth: (C) None of the above
High-Resolution Face Recognition
Question: <image 1><image 2><image 3><image 4> How many unique identities are present in these images?


GPT-4o: B Prediction: ✗
Qwen2-VL: C Prediction: ✗
InternVL2: (C) 4 Prediction: ✗
Ground Truth: (A) 3
Low-Resolution Face Recognition
Question: <image 1><image 2><image 3><image 4><image 5> The first image is of person A. The same person A is present in which of these images?
Option:


GPT-4o: I can’t determine if the same person is in multiple images.
Prediction: ✗
Qwen2-VL: D Prediction: ✗
InternVL2: (D) Image 2 and Image 5 Prediction: ✗
Ground Truth: (A) Image 3 and Image 5

Celebrity Identification

<image 1>
GPT-4o: I don’t know who this person is. Prediction: ✗
Qwen2-VL: D Prediction: ✗
InternVL2: (D) Ashley Eckstein Prediction: ✗
Ground Truth: (A) Alison Brie
Face Anti-Spoofing

Question: <image 1><image 2> Which statement best describes the images? Option:
(A) Image 1 is bonafide and Image 2 is attack (B) Image 1 is attack and Image 2 is bonafide (C) Both images are bonafide (D) Both images are attacks


GPT-4o: B Prediction: ✗
Qwen2-VL: B Prediction: ✗
InternVL2: (B) Image 1 is attack and Image 2 is bonafide Prediction: ✗
Ground Truth: (D) Both images are attacks
Deepfake Detection
Question: <image 1><image 2><image 3> How many images are real? Option:


GPT-4o: (A) 2 Prediction: ✓
Qwen2-VL: C Prediction: ✗
InternVL2: (C) 3 Prediction: ✗
Ground Truth: (A) 2
Attributes Prediction

Question: <image 1><image 2><image 3> Which images have a person with the attribute 'Pointy Nose'?


GPT-4o: (B) Image 1, Image 3 Prediction: ✗
Qwen2-VL: B Prediction: ✗
InternVL2: (C) Image 1 Prediction: ✓
Ground Truth: (C) Image 1


GPT-4o: A Prediction: ✗
Qwen2-VL: A Prediction: ✗
InternVL2: (D) Image 2 Prediction: ✓
Ground Truth: (D) Image 2
Question: <image 1><image 2><image 3><image 4> How many images have a person with the yaw angle of head pose orientation in range 0 to 10?

Face Parsing

Question: <image 1> Which of the following regions is not present or is segmented out with white color? Option:
(A) nose (B) left eye (C) hair (D) face

<image 1>

GPT-4o: A Prediction: ✗
Qwen2-VL: B Prediction: ✓
InternVL2: (C) hair Prediction: ✗
Ground Truth: (B) left eye
Crowd Counting

Question: <image 1> Estimate the count of people present in this image. Option:


GPT-4o: C) 35 to 39 Prediction: ✗
Qwen2-VL: D Prediction: ✗
InternVL2: (B) 25 to 29 Prediction: ✓
Ground Truth: (B) 25 to 29
Face Tools Retrieval

Question: At a large gathering, the system identifies celebrities and tracks their expressions. Expressions are ignored if a spoof attempt is detected. Which sequence of API function calls is suitable?
Option:
(A) api_8-identify celebrity, api_10-track expression over time, api_4-detect spoofing
(B) api_8-identify celebrity, api_10-detect spoofing, api_4-track expression over time
(C) api_8-identify celebrity, api_4-spoof confidence score, api_10-track expression over time
(D) api_8-identify celebrity, api_4-spoof confidence score, api_10-detect expression
GPT-4o: A Prediction: ✗
Qwen2-VL: A Prediction: ✗
InternVL2: A Prediction: ✗
Ground Truth: (B) api_8-identify celebrity, api_10-detect spoofing, api_4-track expression over time
E. Ethical Considerations
In this work, we have ensured that all datasets were obtained exclusively from their official repositories to maintain their integrity, authenticity, and alignment with the original intent of the data creators. This practice mitigates risks associated with data tampering or unauthorized use. Wherever necessary, we have reviewed, acknowledged, and signed the appropriate license agreements to fully comply with the terms and conditions specified by the data and model providers. By doing so, we aim to respect intellectual property rights and uphold the highest ethical standards. Additionally, to promote transparency and reproducibility, we provide source links to all open-source models utilized in our study.
