Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions
Abstract—Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA), leading to incorrect determination of fine-grained attributes in input images. Leveraging significant advancements in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. The method uses multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing visual information from additional viewpoints. Furthermore, we observe that the input order of multiview images significantly affects the performance of LVLMs. Consequently, we devise the Multiview Image Augmented VLM (MIAVLM), which incorporates a Multiview Attributes Perceiver (MAP) submodule capable of simultaneously eliminating the influence of input image order and aligning visual information from multiview images with Large Language Models (LLMs). In addition, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.
Index Terms—hallucinations, LLM, LVLM
I. INTRODUCTION
Current popular Large Vision-Language Models (LVLMs) [1]–[6] are suffering from hallucinations [7], [8]. These hallucinations manifest as inconsistencies between the textual responses generated by LVLMs and the semantic content of input images [9]. Specifically, these hallucinations can be categorized into three types [7]: a.) Hallucination on Object Existence (HoOE), wherein errors occur in judgments regarding the presence of objects, such as when non-existent objects are included in the descriptions generated by LVLMs; b.) Hallucination on Object Attributes (HoOA), wherein errors arise in describing the attributes of objects, including shape and color attributes, as exemplified by LVLMs describing a red apple as green; and c.) Hallucination on Object Relationships (HoOR), wherein errors occur in describing relationships between different objects, such as describing a person in front of a sofa as being behind it [7]. Notably, benchmarks designed for assessing LVLMs' hallucinations, such as M-HalDetect [10], MMHal-Bench [11], and AMBER [12], include multiple objects that exhibit issues related to existence, attributes, and relationships simultaneously. Consequently, these three types


Q: Do people with glasses wear red clothes? Please respond with either "Yes" or "No". A: Yes.
Q: Do people with glasses wear black clothes? Please respond with either "Yes" or "No". A: Yes.

Fig. 1. Illustration of the HoOE Problem.
of problems are strongly coupled in current benchmarks, posing significant challenges for analyzing their individual causes. For instance, in addressing HoOA, the presence of multiple objects introduces hallucinations related to HoOE and HoOR, thereby complicating the analysis. More specifically, as illustrated in Figure 1, LLaVA-1.5 [13] provides correct answers to the questions within the red box above the dashed line. However, in the image below the dashed line, LLaVA-1.5 determines that there is a person wearing glasses and dressed in black, which constitutes a HoOE problem. Yet such hallucinations might also be caused by the LVLM failing to correctly understand the "black" attribute, which corresponds to a HoOA problem. In such complex test scenarios, it is challenging to decouple these different types of hallucinations and address them separately, making it difficult to accurately assess the true capabilities of LVLMs.
Therefore, it is necessary to design individual benchmarks for each type of hallucination that can exclude interference from other hallucinations. This article demonstrates how to utilize face captioning as a foundational task to design a benchmark for the HoOA problem. Face captioning is a crucial multimodal task widely employed in downstream applications such as facial recognition [14] and text-to-face generation [15]. The CelebAText-HQ [16] dataset, manually annotated with facial attributes, provides detailed descriptions for each face, including shape, color, and other facial attributes. CelebAText-HQ exclusively offers detailed descriptions for individual objects (faces), thereby allowing us to design a benchmark that excludes issues related to HoOE and HoOR, facilitating a more accurate evaluation of the HoOA problem. In constructing this benchmark, we employ common techniques used in standard evaluation metrics, such as POPE [8], CIEM [17], and NOPE [18], to transform the generative task into a discriminative task. Each manually annotated description is converted into a question posed to LVLMs, with occurrences of "Yes" responses tallied to calculate accuracy. It is noteworthy that all converted questions yield "Yes" answers. However, as mentioned in previous studies [17], [19], there exists a tendency in current LVLMs to favor "Yes" responses disproportionately. Consequently, to assess whether LVLMs recognize the attributes of the images, we designed questions for which the answer is "No" for the same image. For clarity, we term questions with correct "Yes" answers as positive questions, and those with "No" answers as negative questions. Ultimately, we observed a near-opposite performance of LVLMs on positive and negative questions.
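As a minimal sketch of this question conversion and scoring (in the benchmark itself the rewriting is performed by an LLM; the question template, the toy antonym table, and all function names below are our own illustrative stand-ins):

```python
# Hypothetical sketch of POPE-style question conversion and scoring.
# The antonym table and template are illustrative, not from the benchmark.
ANTONYMS = {"round": "pointed", "red": "green", "black": "white"}

def to_positive_question(attribute_phrase):
    # "round ears" -> a yes/no question whose correct answer is "Yes"
    return (f"Does the person have {attribute_phrase}? "
            'Please respond with either "Yes" or "No".')

def to_negative_question(attribute_phrase):
    # Swap the attribute for its opposite so the correct answer becomes "No".
    word, rest = attribute_phrase.split(" ", 1)
    return to_positive_question(f"{ANTONYMS[word]} {rest}")

def accuracy(answers, expected):
    # Fraction of model answers matching the expected "Yes"/"No" labels.
    return sum(a == e for a, e in zip(answers, expected)) / len(answers)

def hooa_metric(pos_acc, neg_acc):
    # Average accuracy over positive and negative questions.
    return (pos_acc + neg_acc) / 2
```

A model biased towards "Yes" scores high on positive questions but collapses on negative ones, which the averaged metric exposes.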
Training high-quality LVLMs requires addressing both training data and model design aspects. Therefore, we analyze potential causes of HoOA from both data and model perspectives. From the data perspective, the emergence of the HoOA problem can be ascribed to two causes: a.) Insufficient information in single images to enable LVLMs to generate correct responses. b.) Popular LVLMs often undergo instruction tuning with a high proportion of positive visual instructions.
In response to cause a.), prior studies have found that introducing richer image descriptions [10] or spatial information [20] can effectively mitigate hallucinations in LVLMs. An intuitive approach is to introduce additional depth maps to significantly improve HoOR problems. Consider the relationship between a person and a sofa: introducing a depth map as additional information can effectively resolve HoOR problems based on the different depths of the sofa and the person. However, when considering the HoOA problem, solely utilizing semantic segmentation or depth maps from the current viewpoint would evidently overlook fine-grained attribute information from other viewpoints. This loss of attribute information leads to two results. Firstly, certain fine-grained details from the current viewpoint may be incomplete. When questions are posed about these potentially incomplete details, LVLMs may produce hallucinations. Secondly, fine-grained attribute information from other viewpoints is almost certainly incomplete. When questions are posed about these inherently incomplete details, LVLMs are highly likely to produce hallucinations. Therefore, generating 3D representations for current objects can effectively mitigate such HoOA problems. Benefiting from the considerable advancements in generating 3D representations from single images, images from other viewpoints can be sampled from the 3D representation [21]–[23]. From the perspective of visual prompt learning [24]–[26], these sampled images can be regarded as visual prompts. These visual prompts provide more visual information for the same attributes, thereby enhancing the robustness of LVLMs' responses. As for cause b.), prior works have introduced additional negative visual instructions during instruction tuning to enhance overall performance [19], [20].
Drawing inspiration from these approaches, we adopt the aforementioned negative questions as negative instructions to teach the LVLMs to respond “No” to HoOA problems.
From the model perspective, we observe cause c.): directly inputting multiple images into LVLMs may itself lead to HoOA problems. Similar to LLMs' sensitivity to the order of multiple prompts [27]–[29], LVLMs also exhibit sensitivity to the order of multiview images. To address cause c.), we design a submodule named Multiview Attributes Perceiver (MAP) and integrate it into a model called Multiview Image Augmented VLM (MIAVLM) to align the multiview images with the LLM and mitigate the impact of input order.
In summary, the contributions of this paper are as follows:
1. To ascertain the presence of the HoOA problem while eliminating interference from the HoOE and HoOR problems, we propose the HoOA benchmark.
2. To mitigate the HoOA problem, we propose utilizing multiview images of the current object as visual prompts. Furthermore, we design a novel network called MIAVLM, integrating a MAP submodule capable of eliminating the influence of input image order and aligning visual information from multiview images with LLMs. Additionally, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses.
3. To validate the effectiveness of our algorithms, we conduct comprehensive experiments on the HoOA benchmark.

Fig. 2. An overview of the MIAVLM model. Frozen parts are blue and marked with a snowflake, while trainable parts are red and marked with a flame.
II. METHOD
We propose the Multiview Image Augmented Vision-Language Model (MIAVLM), a model for generating more comprehensive and reliable results from multiple image inputs.
A. Model Architecture
The overview of the MIAVLM model is shown in Figure 2, in which we propose a Multiview Attributes Perceiver (MAP) to bridge the gap between a frozen image encoder and a frozen LLM (Flan-T5-large [30]). Firstly, the input image is processed by a Multiview Generator (HFGI3D [31]) to generate multiview images of the input. Secondly, the image encoder (ViT-L/16 [32]) encodes multiview images into dense embeddings and passes the projected embeddings through the MAP to get the aggregated representation of all the inputs. Finally, the frozen LLM then accepts the output from the MAP and produces the final text outputs. The inner structure of the MAP is shown in Figure 4, including a Visual Extractor and a Multihead Sampler. The details will be further discussed.
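At a high level, the forward pass described above can be sketched as follows; every callable parameter is an illustrative stand-in for the concrete modules named in the text (HFGI3D as the multiview generator, ViT-L/16 as the frozen encoder, Flan-T5-large as the frozen LLM):

```python
# Schematic forward pass of MIAVLM; each module here is a placeholder
# callable standing in for the frozen/trainable components in Figure 2.
def miavlm_forward(image, question, multiview_generator, image_encoder,
                   map_module, llm, n_views=9):
    views = multiview_generator(image, n_views)      # lift to 3D, resample views
    embeddings = [image_encoder(v) for v in views]   # frozen ViT embeddings
    visual_tokens = map_module(embeddings)           # MAP: order-invariant fusion
    return llm(visual_tokens, question)              # frozen LLM generates text
```

Only the MAP in the middle is trainable; the generator, encoder, and LLM stay frozen.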

Fig. 3. An overview of the Multihead Sampler.
B. Visual Extractor
Following the vanilla Transformer, we design the Visual Extractor as a stack of six transformer decoder blocks. In the cross-attention, the input soft prompts serve as the queries, and the image embeddings serve as the keys and values, injecting the visual information into the soft prompts.
Engaging soft prompts in cross-attention with image embeddings encourages the prompts to interact with vision information and better extract it for downstream tasks. Since there are multiple input image embeddings, the Visual Extractor performs cross-attention on the soft prompts with each image embedding separately. Formally, we denote the multiple input image embeddings as $E=\{e_{1},e_{2},...,e_{n}\}$, where $e_{i}$ is the $i$-th input embedding. We denote the soft prompts as $P$ containing $l$ soft tokens, $P\in R^{l\times d}$, where $d$ is the LLM's model dimension ($l=32$ and $d=1024$). Denoting the output of the Visual Extractor as $O_{VE}$ and the mapping matrices as $W_{Q}$, $W_{K}$, and $W_{V}$, $O_{VE}$ can be computed as follows:
\begin{array}{l}{O_{VE}=\left\{\mathrm{softmax}\left(\frac{(PW_{Q})(e_{i}W_{K})^{T}}{\sqrt{d}}\right)e_{i}W_{V}\,\middle|\,e_{i}\in E\right\}}\\{E=\{e_{1},e_{2},...,e_{n}\};\quad W_{Q},W_{K},W_{V}\in R^{d\times d}}\end{array}
In Equation (1), $P$ denotes the soft prompts and $e_{i}$ is the image embedding of the $i$-th input. Since different inputs contribute differently to the final output, the outputs in $O_{VE}$ are weighted and summed according to the weights computed by the Multihead Sampler. Besides, we compute each cross-attention output in parallel and separately instead of using the previous output as the next query in Equation (1), because in most cases the input images have no inherent order, so their relations to the soft prompts should be computed independently.
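A minimal numpy sketch of Equation (1) (shapes follow the stated $l=32$ and $d=1024$; the number of views, the patch-token count, and the random weights are illustrative stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, n, k = 32, 1024, 9, 257   # prompt tokens, model dim, views, patch tokens
P = rng.standard_normal((l, d))                      # soft prompts
E = [rng.standard_normal((k, d)) for _ in range(n)]  # one embedding per view
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(P, e_i):
    scores = (P @ W_Q) @ (e_i @ W_K).T / np.sqrt(d)  # (l, k)
    return softmax(scores) @ (e_i @ W_V)             # (l, d)

# Each view is attended independently -- no chaining between views.
O_VE = [cross_attend(P, e_i) for e_i in E]
```

Each element of `O_VE` is one view's visual summary in the prompt space; the Multihead Sampler then weights these summaries.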
C. Multihead Sampler
The Multihead Sampler computes the weights for the weighted sum of the Visual Extractor's outputs $O_{VE}$ from Equation (1). To further decompose the visual information in the input image embeddings, a Decomposer consisting of a 2-layer MLP maps the [CLS] token of each input embedding into $m$ ($m=4$) extra tokens, and the same number of attention heads is applied to compute the attention weights over each decomposed token and the soft prompts. This design introduces multiple experts, in the form of attention heads, that focus on different features of the inputs.
As shown in Figure 3, the soft prompts serve as the queries and the decomposed image embeddings serve as the keys to compute the attention weights. Note that only the attention weights are computed, and the means over the query dimension are taken as the output of each head. Denoting $e_{i}^{j}$ as the $j$-th decomposed token of the $i$-th input embedding, each head's output $weights_{j}$ can be written as follows:
\begin{array}{r l}&{score_{j}=head_{j}(P,E^{j});\quad P\in R^{l\times d},\ E^{j}=[e_{1}^{j},e_{2}^{j},...,e_{n}^{j}]\in R^{n\times d}}\\&{weights_{j}=mean(score_{j});\quad score_{j}\in R^{l\times n},\ weights_{j}\in R^{n}}\end{array}
In Equation (2), $d$ is the model dimension, $P$ is the soft prompts, and $head_{j}$ denotes the computation of attention scores between $P$ and $E^{j}$. $l$ is the number of tokens in the soft prompts. The mean operation averages over the $l$ tokens of $P$. The average of the weights from each head is taken as the output of the Multihead Sampler:
w_{MS}=\frac{1}{m}\sum_{j=1}^{m}weights_{j};\quad w_{MS}\in R^{n}
In Equation (3), $m$ ($m=4$) is the number of decomposed extra tokens and the number of attention heads in the Multihead Sampler (MS).
MS aims to further capture the fine-grained visual features in the input image embeddings by applying different attention heads for different decomposed embeddings of the inputs.
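Equations (2) and (3) can be sketched in numpy as follows (a small model dimension and random weights are used purely for illustration, and the ReLU MLP is an assumed form of the 2-layer Decomposer):

```python
import numpy as np

rng = np.random.default_rng(1)
l, d, n, m = 32, 64, 9, 4       # prompt tokens, (toy) dim, views, heads/tokens
P = rng.standard_normal((l, d))                 # soft prompts
cls = rng.standard_normal((n, d))               # [CLS] vector of each view
W1 = rng.standard_normal((d, d)) * d**-0.5      # Decomposer layer 1
W2 = rng.standard_normal((d, m * d)) * d**-0.5  # Decomposer layer 2

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Decomposer: map each [CLS] token to m decomposed tokens, (n, d) -> (n, m, d).
decomposed = (np.maximum(cls @ W1, 0.0) @ W2).reshape(n, m, d)

weights_per_head = []
for j in range(m):
    # head_j: attention scores of the l prompt tokens over the n views
    score_j = softmax(P @ decomposed[:, j, :].T / np.sqrt(d))  # (l, n)
    weights_per_head.append(score_j.mean(axis=0))              # mean over l
w_MS = np.mean(weights_per_head, axis=0)                       # Equation (3)
```

Because every row of `score_j` is a softmax over the $n$ views, `w_MS` is a proper weighting that sums to one.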

Fig. 4. The structure of the Multiview Attributes Perceiver.
D. Multiview Attributes Perceiver
As shown in Figure 4, after getting the weights from the Multihead Sampler, the final output of the MAP is computed as the weighted sum of $O_{VE}$ under $w_{MS}$. Assume $w_{MS}=\{w_{1},w_{2},...,w_{n}\}$ and the output of the Visual Extractor $O_{VE}=\{o_{1},o_{2},...,o_{n}\}$; the output of the MAP can be formulated as follows:
\begin{array}{r l}&{O_{MAP}=\displaystyle\sum_{i=1}^{n}w_{i}\cdot o_{i};\quad o_{i}\in O_{VE},\ w_{i}\in w_{MS}}\\&{O_{VE}=\{o_{1},o_{2},...,o_{n}\},\quad w_{MS}=\{w_{1},w_{2},...,w_{n}\}}\end{array}
In Equation (4), $w_{i}$ is the weight in $w_{MS}$ corresponding to the $i$-th input. Note that the design of the MAP does not restrict the number of image inputs, which enables the proposed MIAVLM model to accept any number of images. The weighted-sum form of the output also ensures that the input order has no influence on the final output, making the model more robust and reliable in practice.
III. EXPERIMENTS
TABLE I MAIN RESULTS ON OUR HOOA BENCHMARK. BOLD IS THE BEST.
| Input mode | Model | Params (B) | Inference time (s) | Pos. Acc. | Neg. Acc. | HoOA |
|---|---|---|---|---|---|---|
| Original / 9-in-1 image | BLIP3 | 3.9 | 6.021/6.149 | 0.823/0.831 | 0.312/0.267 | 0.568/0.549 |
| | OPERA | 7.0 | 41.572/41.584 | 0.934/0.937 | 0.152/0.107 | 0.543/0.522 |
| | LLaVA-UHD | 7.0 | 2.545/2.736 | 0.933/0.921 | 0.157/0.113 | 0.545/0.517 |
| Original / 9 multiview images | OpenFlamingo1 | 2.5 | 0.807/1.332 | 0.734/0.768 | 0.385/0.397 | 0.560/0.582 |
| | OpenFlamingo2 | 2.5 | 0.881/1.457 | **0.963**/0.960 | 0.210/0.223 | 0.606/0.591 |
| | OpenFlamingo3 | 4.9 | 1.112/3.152 | 0.740/0.761 | 0.472/0.486 | 0.606/0.623 |
| | OpenFlamingo4 | 4.9 | 1.247/1.793 | 0.624/0.632 | 0.483/0.501 | 0.553/0.565 |
| | Idefics2 | 8.0 | 1.294/6.482 | 0.847/0.852 | 0.421/0.432 | 0.634/0.642 |
| | MIAVLM (Ours) | 1.0 | 0.071/0.105 | 0.752/0.762 | **0.797/0.812** | **0.775/0.787** |

Each cell reports results for the original-image input / the mode-2 input (9-in-1 or nine multiview images).
A. Benchmark Settings and Implementation Details
Benchmark Settings. The HoOA benchmark is generated from the CelebAText-HQ [16] dataset. In the original dataset, each image contains manually annotated descriptions of facial attributes such as ear shapes, colors, and various other attributes. Based on these descriptions, we used the Yi-CHAT34B [33] model to rewrite them into general questions. These questions are called positive questions since all their answers are ‘Yes’. To generate negative questions, we use Yi-CHAT34B [33] to replace the original attributes in the questions with their opposite words to generate adversarial question sets. Finally, we sampled 1,430 images and obtained 14,291 positive questions and 14,291 negative questions. During instruction tuning, they were respectively employed as 14,291 positive instructions and 14,266 negative instructions for MIAVLM. Throughout the instruction tuning process, these instructions were divided into training and testing sets in a 9:1 ratio. We define the model’s average accuracy on positive and negative questions as the HoOA metric.
Implementation Details. The language modeling loss is used for training MIAVLM, and we apply the Adam optimizer with $lr=0.001$. The whole model is trained for 20 epochs with a cosine annealing scheduler. A single NVIDIA 3090 GPU was used for training.
B. The Performance of LVLMs on the HoOA Benchmark
We compared MIAVLM (ours) with BLIP3 [34], four versions of Open Flamingo [35], OPERA [36], Idefics2 [37], and LLaVA-UHD [2] on the HoOA benchmark. Among these LVLMs, both LLaVA-UHD and OPERA claim improvements specifically targeting the hallucination problem on top of LLaVA-1.5 [13]. All LVLMs are evaluated with two input modes: 1. using only the original image; 2. using the original image along with eight generated images as input. For LVLMs such as BLIP3, LLaVA-UHD, and OPERA, which only support single-image input, mode 2 combines the nine images into a single image (9-in-1). Overall, we observed that current popular LVLMs generally tend to respond "Yes" to questions, whereas our model is more balanced. Given that our MAP can efficiently process multiple images simultaneously and we use a lightweight LLM, our model also has a significant advantage in efficiency. Comparing the two input modes, we observed that using the 9-in-1 image as input did not improve results, possibly because the 9-in-1 image is harder to interpret than the original image. In contrast, models that take nine separate multiview images as input showed overall performance improvements.
TABLE II ABLATION ON NEGATIVE INSTRUCTIONS (NI).

| Model | Pos. Acc. | Neg. Acc. | HoOA |
|---|---|---|---|
| MIAVLM (w/o NI) | 0.790 | 0.540 | 0.665 |
| MIAVLM (w/ NI) | 0.762 | 0.812 | 0.787 |
To demonstrate the importance of negative instructions, we tuned MIAVLM using only positive instructions. The results are shown in Table II. It can be observed that using negative instructions effectively enhances the model's performance on negative questions, at the cost of some performance degradation on positive questions.
C. The Influence of Multiview Image Input Order on LVLMs

Fig. 5. The influence of multiview image input order on Open Flamingo [35] and MIAVLM (ours). Outliers are marked individually; the yellow line denotes the median. OF: Open Flamingo.

We compared Open Flamingo [35] with our MIAVLM, both taking the 9 images from positive questions as input. We shuffled the order of these 9 images five times and recorded the results of both models. As shown in Figure 5, MIAVLM is not affected by the input order, whereas every version of Open Flamingo [35] is influenced by it.
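The order-invariance observed for MIAVLM follows directly from the weighted-sum form of Equation (4): permuting the views permutes each weight together with its output, leaving the sum unchanged. A toy numpy check, with random stand-ins for the learned per-view outputs and sampler weights:

```python
import numpy as np

rng = np.random.default_rng(2)
n, l, d = 9, 32, 64
O_VE = rng.standard_normal((n, l, d))  # per-view Visual Extractor outputs
w = rng.random(n)
w /= w.sum()                           # stand-in Multihead Sampler weights

out = np.einsum("n,nld->ld", w, O_VE)  # Equation (4): sum_i w_i * o_i

perm = rng.permutation(n)              # shuffle the view order
out_shuffled = np.einsum("n,nld->ld", w[perm], O_VE[perm])
assert np.allclose(out, out_shuffled)  # identical aggregate for any order
```

The same argument does not apply to models that concatenate image tokens in sequence, which is why their outputs drift under shuffling.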
IV. CONCLUSION
In this paper, we introduce a new benchmark that confirms the significant presence of HoOA problems in popular LVLMs. To mitigate HoOA, we propose MIAVLM, an LVLM that leverages multiview images of the current object as input and employs a novel MAP module to eliminate the influence of input image order. Additionally, negative instructions are utilized to suppress LVLMs' tendency to answer "Yes" excessively.
