[论文翻译]使用多视角图像和负指令缓解对象属性幻觉


原文地址:https://arxiv.org/pdf/2501.10011


Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions

使用多视角图像和负指令缓解对象属性幻觉

Abstract—Current popular Large Vision-Language Models (LVLMs) are suffering from Hallucinations on Object Attributes (HoOA), leading to incorrect determination of fine-grained attributes in the input images. Leveraging significant advancements in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. This method utilizes multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing more visual information from other viewpoints. Furthermore, we observe the input order of multiple multiview images significantly affects the performance of LVLMs. Consequently, we have devised Multiview Image Augmented VLM (MIAVLM), incorporating a Multiview Attributes Perceiver (MAP) submodule capable of simultaneously eliminating the influence of input image order and aligning visual information from multiview images with Large Language Models (LLMs). Besides, we designed and employed negative instructions to mitigate LVLMs’ bias towards “Yes” responses. Comprehensive experiments demonstrate the effectiveness of our method.

摘要—当前流行的大视觉语言模型(LVLMs)在对象属性幻觉(HoOA)方面存在问题,导致对输入图像中细粒度属性的错误判断。本文利用单图像3D生成的显著进展,提出了一种新颖的方法来缓解LVLMs中的HoOA。该方法利用从生成的3D表示中采样的多视图图像作为LVLMs的视觉提示,从而提供来自其他视角的更多视觉信息。此外,我们观察到多个多视图图像的输入顺序显著影响LVLMs的性能。因此,我们设计了多视图图像增强视觉语言模型(MIAVLM),其中包含一个多视图属性感知器(MAP)子模块,能够同时消除输入图像顺序的影响,并将多视图图像的视觉信息与大语言模型(LLMs)对齐。此外,我们设计并使用了负面指令来减轻LVLMs对“是”回答的偏见。综合实验证明了我们方法的有效性。

Index Terms—hallucinations, LLM, LVLM

索引术语—幻觉、大语言模型 (LLM)、视觉语言大模型 (LVLM)

I. INTRODUCTION

I. 引言

Current popular Large Vision-Language Models (LVLMs) [1]–[6] are suffering from hallucinations [7], [8]. These hallucinations manifest as inconsistencies between the textual responses generated by LVLMs and the semantic content of input images [9]. Specifically, these hallucinations can be categorized into three types [7]: a.) Hallucination on Object Existence (HoOE), wherein errors occur in judgments regarding the presence of objects, such as when non-existent objects are included in the descriptions generated by LVLMs; b.) Hallucination on Object Attributes (HoOA), wherein errors arise in describing the attributes of objects, including shape and color attributes, as exemplified by LVLMs describing a red apple as green; and c.) Hallucination on Object Relationships (HoOR), wherein errors occur in describing relationships between different objects, such as describing a person in front of a sofa as being behind it [7]. Notably, benchmarks designed for assessing LVLMs' hallucinations, such as M-HalDetect [10], MMHal-Bench [11], and AMBER [12], include multiple objects that exhibit issues related to existence, attributes, and relationships simultaneously.

当前流行的大视觉语言模型 (LVLMs) [1]–[6] 正遭受幻觉问题 [7], [8]。这些幻觉表现为 LVLMs 生成的文本响应与输入图像的语义内容之间的不一致 [9]。具体而言,这些幻觉可以分为三种类型 [7]: a.) 对象存在性幻觉 (HoOE),即在判断对象是否存在时出现错误,例如 LVLMs 生成的描述中包含不存在的对象;b.) 对象属性幻觉 (HoOA),即在描述对象属性时出现错误,包括形状和颜色属性,例如 LVLMs 将红苹果描述为绿色;以及 c.) 对象关系幻觉 (HoOR),即在描述不同对象之间的关系时出现错误,例如将沙发前的人描述为在沙发后面 [7]。值得注意的是,用于评估 LVLMs 幻觉的基准测试,如 M-HalDetect [10], MMHal-Bench [11] 和 AMBER [12],包含多个同时存在存在性、属性和关系问题的对象。

Q: Do people with glasses wear red clothes? Please respond with either "Yes" or "No". A: Yes.

Q: 戴眼镜的人穿红色衣服吗?请回答“是”或“否”。
A: 是。

Q: Do people with glasses wear black clothes? Please respond with either "Yes" or "No". A: Yes.

Q: 戴眼镜的人穿黑色衣服吗?请回答“是”或“否”。
A: 是。


Fig. 1. Illustration of the HoOE Problem.

图 1: HoOE 问题示意图。

Consequently, these three types of problems are strongly coupled in current benchmarks, posing significant challenges for analyzing their individual causes. For instance, in addressing HoOA, the presence of multiple objects introduces hallucinations related to HoOE and HoOR, thereby complicating the analysis. More specifically, as illustrated in Figure 1, LLaVA-1.5 [13] provides correct answers to the questions within the red box above the dashed line. However, for the image below the dashed line, LLaVA-1.5 determines that there is a person wearing glasses and dressed in black. This constitutes a HoOE problem. However, such hallucinations might be caused by the LVLM failing to correctly understand the “black” attribute, which corresponds to a HoOA problem. In such complex test scenarios, it is challenging to decouple these different types of hallucinations and address them separately, making it difficult to accurately assess the true capabilities of LVLMs.

因此,这三种类型的问题在当前基准测试中紧密耦合,给分析其各自原因带来了显著挑战。例如,在解决 HoOA 时,多个对象的存在会引入与 HoOE 和 HoOR 相关的幻觉,从而使分析变得复杂。更具体地说,如图 1 所示,LLaVA-1.5 [13] 对虚线上方红色框内的问题给出了正确答案。然而,在虚线下方的图像中,LLaVA-1.5 判断出有一个戴眼镜、穿黑色衣服的人。这构成了 HoOE 问题。然而,这种幻觉可能是由于 LVLM 未能正确理解“黑色”属性所致,这对应着 HoOA 问题。在这种复杂的测试场景中,很难将这些不同类型的幻觉解耦并分别处理,从而难以准确评估 LVLMs 的真实能力。

Therefore, it is necessary to design individual benchmarks for each type of hallucination that can exclude interference from other hallucinations. This article demonstrates how to utilize face captioning as a foundational task to design a benchmark for the HoOA problem. Face captioning is a crucial multimodal task widely employed in downstream applications such as facial recognition [14] and text-to-face applications [15]. The CelebAText-HQ [16] dataset, manually annotated with facial attributes, provides detailed descriptions for each face, including shape, color, and other facial attributes. CelebAText-HQ exclusively offers detailed descriptions for individual objects (faces), thereby allowing us to design a benchmark that excludes issues related to HoOE and HoOR, facilitating a more accurate evaluation of the HoOA problem. In constructing this benchmark, we employ common techniques used in standard evaluation metrics, such as POPE [8], CIEM [17], and NOPE [18], to transform the generative task into a discriminative task. Each manually annotated description is converted into a question posed to LVLMs, with occurrences of “Yes” responses tallied to calculate accuracy. It is noteworthy that all converted questions yield “Yes” answers. However, as mentioned in previous studies [17], [19], there exists a tendency in current LVLMs to favor “Yes” responses disproportionately. Consequently, to assess whether LVLMs recognize the attributes of the images, we designed questions for which the answer is “No” for the same image. For clarity, we term questions with correct “Yes” answers as positive questions, and those with “No” answers as negative questions. Ultimately, we observed a near-opposite performance of LVLMs on positive and negative questions.

因此,有必要为每种幻觉类型设计单独的基准,以排除其他幻觉的干扰。本文展示了如何利用面部描述作为基础任务来设计一个针对 HoOA 问题的基准。面部描述是一项重要的多模态任务,广泛应用于面部识别 [14] 和文本到面部应用 [15] 等下游应用中。CelebAText-HQ [16] 数据集通过人工标注面部属性,为每张脸提供了详细的描述,包括形状、颜色和其他面部属性。CelebAText-HQ 专门为单个对象(面部)提供详细描述,因此我们可以设计一个排除 HoOE 和 HoOR 问题的基准,从而更准确地评估 HoOA 问题。在构建该基准时,我们采用了标准评估指标中常用的技术,如 POPE [8]、CIEM [17] 和 NOPE [18],将生成任务转化为判别任务。每个手动标注的描述都被转化为向 LVLMs 提出的问题,统计“是”回答的次数以计算准确率。值得注意的是,所有转化后的问题都应得到“是”的回答。然而,如之前的研究 [17]、[19] 所述,当前的 LVLMs 存在过度倾向于“是”回答的趋势。因此,为了评估 LVLMs 是否识别了图像的属性,我们为同一图像设计了答案为“否”的问题。为了清晰起见,我们将正确答案为“是”的问题称为正面问题,而答案为“否”的问题称为负面问题。最终,我们观察到 LVLMs 在正面和负面问题上的表现几乎相反。
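
As a minimal sketch of this conversion (the question template, attribute pairs, and the `ask_lvlm` call below are hypothetical placeholders, not the exact construction used in the paper), the annotated attribute descriptions can be turned into positive/negative Yes-No questions and scored as follows:

```python
# Hypothetical sketch: converting attribute annotations into positive/negative
# Yes-No questions and scoring an LVLM by its "Yes"/"No" answers.
# `ask_lvlm(image, question)` is a placeholder for any LVLM inference call.

def make_questions(true_attribute: str, wrong_attribute: str):
    template = 'Does the person in the image have {}? Please respond with either "Yes" or "No".'
    return template.format(true_attribute), template.format(wrong_attribute)

def evaluate(samples, ask_lvlm):
    """samples: list of (image, true_attribute, wrong_attribute) triples."""
    pos_correct = neg_correct = 0
    for image, attr, wrong_attr in samples:
        pos_q, neg_q = make_questions(attr, wrong_attr)
        # Positive question: the ground-truth answer is "Yes".
        pos_correct += ask_lvlm(image, pos_q).strip().lower().startswith("yes")
        # Negative question: the ground-truth answer is "No".
        neg_correct += ask_lvlm(image, neg_q).strip().lower().startswith("no")
    n = len(samples)
    return pos_correct / n, neg_correct / n  # accuracy on positive / negative questions
```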

Training high-quality LVLMs requires addressing both training data and model design aspects. Therefore, we analyze potential causes of HoOA from both data and model perspectives. From the data perspective, the emergence of the HoOA problem can be ascribed to two causes: a.) Insufficient information in single images to enable LVLMs to generate correct responses. b.) Popular LVLMs often undergo instruction tuning with a high proportion of positive visual instructions.

训练高质量的大视觉语言模型(LVLM)需要同时解决训练数据和模型设计两方面的问题。因此,我们从数据和模型两个角度分析了对象属性幻觉(HoOA)问题的潜在原因。从数据角度来看,HoOA问题的出现可以归因于两个原因:a) 单张图像中的信息不足以让LVLM生成正确的响应。b) 流行的LVLM通常会在指令微调中使用高比例的正面视觉指令。

In response to cause a.), prior studies have found that introducing richer image descriptions [10] or spatial information [20] can effectively mitigate hallucinations in LVLMs. An intuitive approach is to introduce additional depth maps to significantly improve HoOR problems. Consider the relationship between a person and a sofa: introducing a depth map as additional information can effectively resolve HoOR problems based on the different depths of the sofa and the person. However, when considering the HoOA problem, solely utilizing semantic segmentation or depth maps from the current viewpoint would evidently overlook fine-grained attribute information from other viewpoints. This loss of attribute information leads to two results. Firstly, certain fine-grained details from the current viewpoint may be incomplete. In cases where questions are posed about these potentially incomplete details, the possibility of LVLMs producing hallucinations exists. Secondly, fine-grained attribute information from other viewpoints is almost certainly incomplete. When questions are posed about these inherently incomplete details, LVLMs are highly likely to produce hallucinations. Therefore, generating 3D representations for current objects can effectively mitigate such HoOA problems. Benefiting from the considerable advancements in generating 3D representations from single images, images from other viewpoints can be sampled from the 3D representation [21]–[23]. From the perspective of visual prompt learning [24]–[26], these sampled images can be regarded as visual prompts. These visual prompts provide more visual information for the same attributes, thereby enhancing the robustness of LVLMs’ responses. As for cause b.), prior works have introduced additional negative visual instructions during instruction tuning to enhance overall performance [19], [20]. Drawing inspiration from these approaches, we adopt the aforementioned negative questions as negative instructions to teach the LVLMs to respond “No” to HoOA problems.

针对原因 a),先前的研究发现,引入更丰富的图像描述 [10] 或空间信息 [20] 可以有效缓解 LVLMs 中的幻觉问题。一种直观的方法是引入额外的深度图来显著改善 HoOR 问题。以人与沙发的关系为例:引入深度图作为附加信息,可以根据沙发和人的不同深度有效解决 HoOR 问题。然而,当考虑 HoOA 问题时,仅利用当前视角的语义分割或深度图显然会忽略其他视角的细粒度属性信息。这种属性信息的丢失会导致两个结果。首先,当前视角的某些细粒度细节可能不完整。当问题涉及这些可能不完整的细节时,LVLMs 有可能产生幻觉。其次,其他视角的细粒度属性信息几乎肯定是不完整的。当问题涉及这些本质上不完整的细节时,LVLMs 极有可能产生幻觉。因此,为当前对象生成 3D 表示可以有效缓解此类 HoOA 问题。得益于从单张图像生成 3D 表示技术的显著进步,可以从 3D 表示中采样其他视角的图像 [21]–[23]。从视觉提示学习 [24]–[26] 的角度来看,这些采样图像可以被视为视觉提示。这些视觉提示为相同属性提供了更多的视觉信息,从而增强了 LVLMs 响应的鲁棒性。至于原因 b.),先前的工作在指令微调过程中引入了额外的负面视觉指令,以增强整体性能 [19], [20]。受这些方法的启发,我们采用上述负面问题作为负面指令,教导 LVLMs 对 HoOA 问题回答“否”。
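
For illustration only (the field names and wording below are assumptions made for clarity, not the paper's actual instruction format), a positive instruction derived from an annotated attribute and its negative counterpart might be paired like this:

```python
# Hypothetical illustration of a positive/negative visual-instruction pair;
# field names and wording are assumptions, not the paper's actual format.
positive_instruction = {
    "images": "multiview renderings of face_001",
    "question": 'Is the person\'s hair brown? Please respond with either "Yes" or "No".',
    "answer": "Yes",  # attribute taken from the human annotation
}
negative_instruction = {
    "images": "multiview renderings of face_001",
    "question": 'Is the person\'s hair green? Please respond with either "Yes" or "No".',
    "answer": "No",   # attribute deliberately contradicts the annotation
}
```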

From the model perspective, we have observed cause c.): directly inputting multiple images into LVLMs may lead to HoOA problems. Similar to LLMs’ sensitivity to the order of multiple prompts [27]–[29], LVLMs also exhibit sensitivity to the order of multiview images. To address cause c.), we designed a submodule named Multiview Attributes Perceiver (MAP) and integrated it into a model called Multiview Image Augmented VLM (MIAVLM) to align the multiview images with the LLM and mitigate the impact of the input order.

从模型的角度来看,我们观察到原因 c):直接将多张图像输入到 LVLMs 可能会导致 HoOA 问题。类似于大语言模型对多个提示顺序的敏感性 [27]–[29],LVLMs 也对多视角图像的顺序表现出敏感性。针对原因 c),我们设计了一个名为多视角属性感知器 (Multiview Attributes Perceiver, MAP) 的子模块,并将其集成到一个名为多视角图像增强 VLM (Multiview Image Augmented VLM, MIAVLM) 的模型中,以对齐多视角图像与大语言模型,并减轻输入顺序的影响。

In summary, the contributions of this paper are as follows: 1. To ascertain the presence of the HoOA problem while eliminating interference from HoOE and HoOR problems, we propose the HoOA benchmark. 2. To mitigate the HoOA problem, we propose utilizing multiview images of current objects as visual prompts. Furthermore, we design a novel network module called MIAVLM, integrating a MAP submodule capable of eliminating the influence of input image order and aligning visual information from multiview images with LLMs. Additionally, we designed and employed negative instructions to mitigate LVLMs’ bias towards “Yes” responses. 3. To validate the effectiveness of our algorithms, we conducted comprehensive experiments on the HoOA benchmark.

本文的贡献总结如下:

  1. 为了确定 HoOA 问题的存在,同时消除 HoOE 和 HoOR 问题的干扰,我们提出了 HoOA 基准。
  2. 为了缓解 HoOA 问题,我们提出利用当前物体的多视角图像作为视觉提示。此外,我们设计了一个名为 MIAVLM 的新型网络模块,集成了一个能够消除输入图像顺序影响的 MAP 子模块,并将多视角图像的视觉信息与大语言模型对齐。同时,我们设计并使用了负向指令来减轻 LVLMs 对“是”回答的偏见。
  3. 为了验证我们算法的有效性,我们在 HoOA 基准上进行了全面的实验。


Fig. 2. An overview of the MIAVLM model. Frozen parts are blue and marked with a snowflake, while trainable parts are red and marked with a flame.

图 2: MIAVLM 模型概览。冻结部分为蓝色并标有雪花,可训练部分为红色并标有火焰。

II. METHOD

II. 方法

We propose the Multiview Image Augmented Vision-Language Model (MIAVLM), a model for generating more comprehensive and reliable results from multiple inputs.

我们提出了多视图图像增强视觉-语言模型 (Multiview Image Augmented Vision-Language Model, MIAVLM),该模型能够从多个输入中生成更全面和可靠的结果。

A. Model Architecture

A. 模型架构

The overview of the MIAVLM model is shown in Figure 2, in which we propose a Multiview Attributes Perceiver (MAP) to bridge the gap between a frozen image encoder and a frozen LLM (Flan-T5-large [30]). Firstly, the input image is processed by a Multiview Generator (HFGI3D [31]) to generate multiview images of the input. Secondly, the image encoder (ViT-L/16 [32]) encodes the multiview images into dense embeddings and passes the projected embeddings through the MAP to obtain an aggregated representation of all the inputs. Finally, the frozen LLM accepts the output from the MAP and produces the final text outputs. The inner structure of the MAP is shown in Figure 4, including a Visual Extractor and a Multihead Sampler, whose details are discussed below.

MIAVLM 模型的概览如图 2 所示,其中我们提出了一个多视图属性感知器 (Multiview Attributes Perceiver, MAP) 来弥合冻结的图像编码器和冻结的大语言模型 (Flan-T5-large [30]) 之间的差距。首先,输入图像通过多视图生成器 (HFGI3D [31]) 处理,生成输入图像的多视图图像。其次,图像编码器 (ViT-L/16 [32]) 将多视图图像编码为密集嵌入,并通过 MAP 传递投影嵌入,以获得所有输入的聚合表示。最后,冻结的大语言模型接受来自 MAP 的输出,并生成最终的文本输出。MAP 的内部结构如图 4 所示,包括一个视觉提取器和一个多头采样器。细节将在后续进一步讨论。
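
The following PyTorch-style sketch summarizes this pipeline under the assumption of simple wrapper interfaces; the class and argument names are illustrative, not the authors' released code:

```python
import torch.nn as nn

class MIAVLMPipeline(nn.Module):
    """Schematic forward pass of MIAVLM as described above. The submodule
    interfaces (multiview_generator, image_encoder, map_module, llm) are
    hypothetical wrappers, not the authors' released code."""

    def __init__(self, multiview_generator, image_encoder, map_module, llm):
        super().__init__()
        self.multiview_generator = multiview_generator  # e.g. HFGI3D; run frozen
        self.image_encoder = image_encoder              # e.g. ViT-L/16; frozen
        self.map_module = map_module                    # Multiview Attributes Perceiver; trainable
        self.llm = llm                                  # e.g. Flan-T5-large; frozen
        # Only the MAP is updated during training; encoder and LLM stay frozen.
        for module in (self.image_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, image, instruction_tokens):
        # 1) Sample multiview images from the generated 3D representation.
        views = self.multiview_generator(image)             # list of view images
        # 2) Encode each view into dense patch embeddings.
        view_embeds = [self.image_encoder(v) for v in views]
        # 3) Aggregate the views into order-invariant soft tokens for the LLM.
        soft_tokens = self.map_module(view_embeds)           # (l, d) representation
        # 4) Condition the frozen LLM on the soft tokens plus the instruction.
        return self.llm(visual_prefix=soft_tokens, text=instruction_tokens)
```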


Fig. 3. An overview of the Multihead Sampler.

图 3: Multihead Sampler 的概览。

B. Visual Extractor

B. 视觉提取器

Following the vanilla Transformer, we design the Visual Extractor as a stack of six transformer decoder blocks. In cross-attention, the input soft prompts are regarded as the queries, and the image embeddings are regarded as keys and values to inject the visual information into the soft prompts.

在 Vanilla Transformer 的基础上,我们将视觉提取器设计为由 6 个 Transformer 解码器块堆叠而成。在交叉注意力中,输入的软提示被视为查询,图像嵌入被视为键和值,以将视觉信息注入软提示中。

Engaging soft prompts in cross-attention with image embeddings encourages the prompts to interact with vision information and better extract the information for downstream tasks. Since there are multiple input image embeddings, the Visual Extractor performs cross-attention on the soft prompts with each image embedding separately. Formally, we denote the multiple input image embeddings as $E=\{e_{1},e_{2},...,e_{n}\}$, where $e_{i}$ is the $i$-th input embedding. We denote the soft prompts as $P$ containing $l$ soft tokens, $P\in R^{l\times d}$, where $d$ is the LLM's model dimension ($l=32$ and $d=1024$). Denote the output from the Visual Extractor as $O_{VE}$ and the mapping matrices as $W_{Q}$, $W_{K}$, and $W_{V}$; then $O_{VE}$ can be computed as follows:

在跨注意力机制中引入软提示与图像嵌入的交互,能够促使提示更好地与视觉信息互动,从而为下游任务更有效地提取信息。由于存在多个输入图像嵌入,视觉提取器分别对每个图像嵌入与软提示进行跨注意力计算。形式上,我们将多个输入图像嵌入表示为 $E=\{e_{1},e_{2},...,e_{n}\}$,其中 $e_{i}$ 是第 $i$ 个输入嵌入。软提示表示为 $P$,包含 $l$ 个软 Token,$P\in R^{l\times d}$,其中 $d$ 是大语言模型的维度($l=32$ 且 $d=1024$)。设视觉提取器的输出为 $O_{VE}$,映射矩阵为 $W_{Q}$、$W_{K}$ 和 $W_{V}$,则 $O_{VE}$ 的计算方式如下:

$$
\begin{array}{l}
O_{VE}=\{\mathrm{softmax}(\frac{(PW_{Q})(e_{i}W_{K})^{T}}{\sqrt{d}})\,e_{i}W_{V} \mid e_{i}\in E\} \\
E=\{e_{1},e_{2},...,e_{n}\};\quad W_{Q},W_{K},W_{V}\in R^{d\times d}
\end{array}
\tag{1}
$$

In Equation (1), $P$ denotes the soft prompts and $e_{i}$ is the image embedding of the $i$-th input. Since different inputs contribute differently to the final output, the outputs in $O_{VE}$ are weighted and summed according to weights computed by the Multihead Sampler. Besides, we compute each cross-attention output in parallel and separately instead of using the previous output as the next query in Equation (1). This is because, in most cases, the input images have no inherent order, and we should compute their relation to the soft prompts separately.

在公式 (1) 中,$P$ 表示软提示 (soft prompts),$e_{i}$ 是第 $i$ 个输入的图像嵌入 (image embedding)。注意到不同的输入对最终输出的贡献不同,$O_{VE}$ 中的输出根据多头采样器 (Multihead Sampler) 计算的权重进行加权求和。此外,我们并行且独立地计算每个交叉注意力输出,而不是使用前一个输出作为公式 (1) 中的下一个查询。这是因为在大多数情况下,输入图像没有顺序,我们应该分别计算它们与软提示的关系。
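
A minimal PyTorch sketch of the per-view cross-attention in Equation (1), using the stated sizes ($l=32$, $d=1024$); a single attention head and uniform placeholder view weights are used for clarity, whereas the actual Visual Extractor stacks six decoder blocks and takes its weights from the Multihead Sampler:

```python
import math
import torch
import torch.nn as nn

class ViewCrossAttention(nn.Module):
    """Single-head version of Equation (1): the soft prompts attend to each
    view embedding independently (the real Visual Extractor stacks six
    decoder blocks with multi-head attention)."""

    def __init__(self, d: int = 1024, l: int = 32):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(l, d))  # learnable soft prompts P
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, view_embeds):
        """view_embeds: list of (num_patches_i, d) tensors, one per view.
        Returns one (l, d) output per view, computed independently of the
        other views, so the result does not depend on the input order."""
        q = self.W_Q(self.prompts)                         # (l, d)
        outputs = []
        for e_i in view_embeds:                            # views carry no inherent order
            k, v = self.W_K(e_i), self.W_V(e_i)            # (n_i, d) each
            attn = torch.softmax(q @ k.T / math.sqrt(self.d), dim=-1)
            outputs.append(attn @ v)                       # (l, d)
        return outputs

def aggregate(per_view_outputs):
    """Weighted sum of the per-view outputs O_VE. Uniform weights are a
    placeholder; in MIAVLM the weights come from the Multihead Sampler."""
    w = 1.0 / len(per_view_outputs)
    return sum(w * o for o in per_view_outputs)
```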