A Survey on Multimodal Large Language Models
Shukang Yin*, Chaoyou Fu*†, Sirui Zhao*, Ke Li, Xing Sun, Tong Xu, and Enhong Chen, Fellow, IEEE
Abstract—Recently, the Multimodal Large Language Model (MLLM), represented by GPT-4V, has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLMs has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Index Terms—Multimodal Large Language Model, Vision Language Model, Large Language Model.
1 INTRODUCTION
Recent years have seen the remarkable progress of Large Language Models (LLMs). With the continual scaling up of data and model size, these LLMs raise extraordinary emergent abilities, typically including instruction following [5], [6], In-Context Learning (ICL) [7], and Chain of Thought (CoT) [8]. Although LLMs have demonstrated surprising zero/few-shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently “blind” to vision since they can only understand discrete text. Concurrently, Large Vision Models (LVMs) can see clearly [9], [10], [11], [12], but commonly lag behind in reasoning.
In light of this complementarity, LLM and LVM run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to an LLM-based model with the ability to receive, reason over, and output multimodal information. Prior to MLLM, there have been a lot of works devoted to multimodality, which can be divided into discriminative [13], [14], [15] and generative [16], [17], [18] paradigms. CLIP [13], as a representative of the former, projects visual and textual information into a unified representation space, building a bridge for downstream multimodal tasks. In contrast, OFA [16] is a representative of the latter, which unifies multimodal tasks in a sequence-to-sequence manner. MLLM can be classified as the latter according to its sequence operation, but it manifests two representative traits compared with the traditional counterparts: (1) MLLM is based on an LLM with billion-scale parameters, which was not available in previous models. (2) MLLM uses new training paradigms to unleash its full potential, such as multimodal instruction tuning [19], [20] to encourage the model to follow new instructions. Armed with the two traits, MLLM exhibits new capabilities, such as writing website code based on images [21], understanding the deep meaning of a meme [22], and OCR-free math reasoning [23].
Ever since the release of GPT-4 [3], there has been a research frenzy over MLLMs because of the amazing multimodal examples it shows. Rapid development is fueled by efforts from both academia and industry. Preliminary research on MLLMs focuses on text content generation grounded in text prompts and images [20], [24], videos [25], [26], or audio [27]. Subsequent works have expanded the capabilities or the usage scenarios, including: (1) Better granularity support. Finer control over user prompts is developed to support specific regions through boxes [28] or a certain object through a click [29]. (2) Enhanced support for input and output modalities [30], [31], such as image, video, audio, and point cloud. Besides input, projects like NExT-GPT [32] further support output in different modalities. (3) Improved language support. Efforts have been made to extend the success of MLLMs to other languages (e.g. Chinese) with relatively limited training corpora [33], [34]. (4) Extension to more realms and usage scenarios. Some studies transfer the strong capabilities of MLLMs to other domains, such as medical image understanding [35], [36], [37] and document parsing [38], [39], [40]. Moreover, multimodal agents are developed to assist in real-world interaction, e.g. embodied agents [41], [42] and GUI agents [43], [44], [45]. An MLLM timeline is illustrated in Fig. 1.
In view of such rapid progress and the promising results of this field, we write this survey to provide researchers with a grasp of the basic ideas, main methods, and current progress of MLLMs. Note that we mainly focus on the visual and language modalities, but also include works involving other modalities like video and audio. Specifically, we cover the most important aspects of MLLMs with corresponding summaries and open a GitHub page that is updated in real time. To the best of our knowledge, this is the first survey on MLLMs.

Fig. 1: A timeline of representative MLLMs. We are witnessing rapid growth in this field. More works can be found in our released GitHub page, which is updated daily.
The following parts of the survey are structured as follows: the survey starts with a comprehensive review of the essential aspects of MLLMs, including (1) Mainstream architectures (§2); (2) A full recipe of training strategy and data (§3); (3) Common practices of performance evaluation (§4). Then, we delve into a deeper discussion on some important topics about MLLMs, each focusing on a main problem: (1) What aspects can be further improved or extended (§5)? (2) How to relieve the multimodal hallucination issue (§6)? The survey continues with the introduction of three key techniques (§7), each specialized in a specific scenario: M-ICL (§7.1) is an effective technique commonly used at the inference stage to boost few-shot performance. Another important technique is M-CoT (§7.2), which is typically used in complex reasoning tasks. Afterward, we delineate a general idea to develop LLM-based systems that solve composite reasoning tasks or address common user queries (§7.3). Finally, we finish our survey with a summary and potential research directions.
2 ARCHITECTURE
A typical MLLM can be abstracted into three modules, i.e. a pre-trained modality encoder, a pre-trained LLM, and a modality interface to connect them. Drawing an analogy to humans, modality encoders such as image/audio encoders are human eyes/ears that receive and pre-process optical/acoustic signals, while LLMs are like human brains that understand and reason with the processed signals. In between, the modality interface serves to align different modalities. Some MLLMs also include a generator to output other modalities apart from text. A diagram of the architecture is plotted in Fig. 2. In this section, we introduce each module in sequence.
2.1 Modality encoder
The encoders compress raw information, such as images or audio, into a more compact representation. Rather than training from scratch, a common approach is to use a pretrained encoder that has been aligned to other modalities. For example, CLIP [13] incorporates a visual encoder semantically aligned with the text through large-scale pretraining on image-text pairs. Therefore, it is easier to use such initially pre-aligned encoders to align with LLMs through alignment pre-training (see §3.1).
Commonly used image encoders are summarized in Table 1. Apart from vanilla CLIP image encoders [13], some works also explore using other variants. For example, MiniGPT-4 [21] adopts an EVA-CLIP [47], [48] (ViT-G/14) encoder, which is trained with improved training techniques. In contrast, Osprey [29] introduces a convolution-based ConvNext-L encoder [46] to utilize higher resolution and multi-level features. Some works also explore encoder-free architectures. For instance, the image patches of Fuyu-8b [49] are directly projected before being sent to the LLM. Thus, the model naturally supports flexible image resolution input.
TABLE 1: A summary of commonly used image encoders.
| Variants | Pretraining Corpus | Resolution | Samples (B) | Parameter Size (M) |
|---|---|---|---|---|
| OpenCLIP-ConvNext-L [46] | LAION-2B | 320 | 29 | 197.4 |
| CLIP-ViT-L/14 [13] | OpenAI's WIT | 224/336 | 13 | 304.0 |
| EVA-CLIP-ViT-G/14 [47] | LAION-2B, COYO-700M | 224 | 11 | 1000.0 |
| OpenCLIP-ViT-G/14 [46] | LAION-2B | 224 | 34 | 1012.7 |
| OpenCLIP-ViT-bigG/14 [46] | LAION-2B | 224 | 34 | 1844.9 |

Fig. 2: An illustration of the typical MLLM architecture. It includes an encoder, a connector, and an LLM. An optional generator can be attached to the LLM to generate more modalities besides text. The encoder takes in images, audio, or videos and outputs features, which are processed by the connector so that the LLM can better understand them. There are broadly three types of connectors: projection-based, query-based, and fusion-based connectors. The former two types adopt token-level fusion, processing features into tokens to be sent along with text tokens, while the last type enables feature-level fusion inside the LLM.
When choosing encoders, one often considers factors like resolution, parameter size, and pre-training corpus. Notably, many works have empirically verified that using higher resolution can achieve remarkable performance gains [34], [50], [51], [52]. The approaches for scaling up input resolution can be categorized into direct scaling and patch-division methods. The direct scaling way inputs images of higher resolutions to the encoder, which often involves further tuning the encoder [34] or replacing a pre-trained encoder with a higher-resolution one [50]. Similarly, CogAgent [44] uses a dual-encoder mechanism, where two encoders process high- and low-resolution images, respectively. High-resolution features are injected into the low-resolution branch through cross-attention. Patch-division methods cut a high-resolution image into patches and reuse the low-resolution encoder. For example, Monkey [51] and SPHINX [53] divide a large image into smaller patches and send the sub-images together with a downsampled high-resolution image to the image encoder, where the sub-images and the low-resolution image capture local and global features, respectively. In contrast, parameter size and training data composition are of less importance compared with input resolution, as found by empirical studies [52].
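The patch-division scheme can be sketched in a few lines of NumPy. The sketch below is a toy illustration rather than the exact Monkey or SPHINX pipeline: the 448/224 sizes and nearest-neighbor downsampling are illustrative assumptions.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Cut an H x W x C image into non-overlapping patch_size x patch_size sub-images."""
    h, w, _ = image.shape
    return [image[top:top + patch_size, left:left + patch_size]
            for top in range(0, h, patch_size)
            for left in range(0, w, patch_size)]

def downsample(image, factor):
    """Crude nearest-neighbor downsampling to produce the global low-resolution view."""
    return image[::factor, ::factor]

# A 448x448 input reuses a 224-resolution encoder: four local sub-images
# capture detail, while one downsampled image provides global context.
image = np.zeros((448, 448, 3))
sub_images = split_into_patches(image, 224)
global_image = downsample(image, 2)
print(len(sub_images), global_image.shape)  # 4 (224, 224, 3)
```

Each sub-image and the global view are then encoded independently with the same low-resolution encoder, and the resulting visual tokens are concatenated before being passed onward.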
Similar encoders are also available for other modalities. For example, Pengi [27] uses CLAP [54] model as the audio encoder. ImageBind-LLM [30] uses the ImageBind [55] encoder, which supports encoding image, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data. Equipped with the strong encoder, ImageBind-LLM can respond to the input of multiple modalities.
2.2 Pre-trained LLM
Instead of training an LLM from scratch, it is more efficient and practical to start with a pre-trained one. Through tremendous pre-training on web corpus, LLMs have been embedded with rich world knowledge, and demonstrate strong generalization and reasoning capabilities.
We summarize the commonly used and publicly available LLMs in Table 2. Notably, most LLMs fall into the causal decoder category, following GPT-3 [7]. Among them, the Flan-T5 [56] series are relatively early LLMs used in works like BLIP-2 [59] and InstructBLIP [60]. The LLaMA series [5], [57] and the Vicuna family [4] are representative open-sourced LLMs that have attracted much academic attention. Since the two LLMs are predominantly pre-trained on English corpora, they are limited in multi-language support, such as Chinese. In contrast, Qwen [58] is a bilingual LLM that supports Chinese and English well.
It should be noted that scaling up the parameter size of LLMs also brings additional gains, similar to the case of increasing input resolution. Specifically, Liu et al. [50], [61] find that simply scaling up the LLM from 7B to 13B brings comprehensive improvement on various benchmarks. Furthermore, when using a 34B LLM, the model shows emergent zero-shot Chinese capability, given that only English multimodal data are used during training. Lu et al. [62] observe a similar phenomenon by scaling up LLMs from 13B to 35B and 65B/70B, where the larger model size brings consistent gains on benchmarks specifically designed for MLLMs. There are also works that use smaller LLMs to facilitate deployment on mobile devices. For example, the MobileVLM series [63], [64] use a downscaled LLaMA [5] (termed MobileLLaMA 1.4B/2.7B), enabling efficient inference on mobile processors.
Recently, explorations of Mixture of Experts (MoE) architecture for LLMs have garnered rising attention [65], [66], [67]. Compared with dense models, the sparse architecture enables scaling up total parameter size without increasing computational cost, by selective activation of the parameters. Empirically, MM1 [52] and MoE-LLaVA [68] find that MoE implementation achieves better performance than the dense counterpart on almost all the benchmarks.
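The selective-activation idea behind sparse MoE can be sketched as follows. This is a toy single-token router with made-up sizes; real MoE LLMs add load-balancing losses, batched expert dispatch, and per-layer routing, none of which are shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts, top_k = 32, 8, 2  # toy sizes

# Each expert is a small feed-forward layer; a linear router scores them per token.
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(num_experts)]
router = rng.normal(size=(d, num_experts)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    """Only the top-k experts run, so per-token compute stays roughly constant
    while the total parameter count grows with num_experts."""
    scores = token @ router
    chosen = np.argsort(scores)[-top_k:]          # indices of the k best experts
    weights = softmax(scores[chosen])
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.normal(size=d))
print(out.shape)  # (32,)
```

Here 2 of 8 experts are active per token, so roughly a quarter of the expert parameters are touched on any forward pass even though all eight contribute capacity.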
TABLE 2: A summary of commonly used open-sourced LLMs. en, zh, fr, and de stand for English, Chinese, French, and German, respectively.
| Model | Release Date | Pretrain Data Scale | Parameter Size (B) | Language Support | Architecture |
|---|---|---|---|---|---|
| Flan-T5-XL/XXL [56] | Oct-2022 | - | 3/11 | en, fr, de | Encoder-Decoder |
| LLaMA [5] | Feb-2023 | 1.4T tokens | 7/13/33/65 | en | Causal Decoder |
| Vicuna [4] | Mar-2023 | 1.4T tokens | 7/13/33 | en | Causal Decoder |
| LLaMA-2 [57] | Jul-2023 | 2T tokens | 7/13/70 | en | Causal Decoder |
| Qwen [58] | Sep-2023 | 3T tokens | 1.8/7/14/72 | en, zh | Causal Decoder |
2.3 Modality interface
Since LLMs can only perceive text, bridging the gap between natural language and other modalities is necessary. However, it would be costly to train a large multimodal model in an end-to-end manner. A more practical way is to introduce a learnable connector between the pre-trained visual encoder and the LLM. Another approach is to translate images into language with the help of expert models, and then send the language to the LLM.
Learnable Connector. It is responsible for bridging the gap between different modalities. Specifically, the module projects information into a space that the LLM can understand efficiently. Based on how multimodal information is fused, there are broadly two ways to implement such interfaces, i.e. token-level and feature-level fusion.
For token-level fusion, features output from encoders are transformed into tokens and concatenated with text tokens before being sent into LLMs. A common and feasible solution is to leverage a group of learnable query tokens to extract information in a query-based manner [69], which was first implemented in BLIP-2 [59] and subsequently inherited by a variety of works [26], [60], [70]. Such Q-Former-style approaches compress visual tokens into a smaller number of representation vectors. In contrast, some methods simply use an MLP-based interface to bridge the modality gap [20], [37], [71], [72]. For example, the LLaVA series adopts a one-/two-layer MLP [20], [50] to project visual tokens and align the feature dimension with word embeddings.
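Token-level fusion with an MLP connector can be sketched as follows. The dimensions, the ReLU nonlinearity, and the random weights are toy assumptions for illustration, not the actual LLaVA configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; LLaVA-scale models use hundreds of visual tokens and 4096-dim embeddings.
num_visual_tokens, vis_dim, llm_dim = 9, 64, 128

# A two-layer MLP projector mapping encoder features into the LLM embedding space.
W1 = rng.normal(size=(vis_dim, llm_dim)) * 0.1
W2 = rng.normal(size=(llm_dim, llm_dim)) * 0.1

def project(visual_features):
    hidden = np.maximum(visual_features @ W1, 0.0)  # ReLU stand-in for the real nonlinearity
    return hidden @ W2

visual_features = rng.normal(size=(num_visual_tokens, vis_dim))  # encoder output
text_embeddings = rng.normal(size=(5, llm_dim))                  # 5 text tokens

# Token-level fusion: projected visual tokens are concatenated before the text tokens.
llm_input = np.concatenate([project(visual_features), text_embeddings], axis=0)
print(llm_input.shape)  # (14, 128)
```

The LLM then treats the projected visual tokens exactly like word embeddings; a Q-Former-style connector would additionally compress the 9 visual tokens into fewer query vectors before concatenation.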
On a related note, MM1 [52] has ablated design choices for the connector and found that for token-level fusion, the type of modality adapter is far less important than the number of visual tokens and the input resolution. Nevertheless, Zeng et al. [73] compare the performance of token- and feature-level fusion, and empirically reveal that the token-level fusion variant performs better on VQA benchmarks. Regarding the performance gap, the authors suggest that cross-attention models might require a more complicated hyper-parameter searching process to achieve comparable performance.
As another line, feature-level fusion inserts extra modules that enable deep interaction and fusion between text features and visual features. For example, Flamingo [74] inserts extra cross-attention layers between frozen Transformer layers of LLMs, thereby augmenting language features with external visual cues. Similarly, CogVLM [75] plugs in a visual expert module in each Transformer layer to enable dual interaction and fusion between vision and language features. For better performance, the QKV weight matrix of the introduced module is initialized from the pre-trained LLM. Similarly, LLaMA-Adapter [76] introduces learnable prompts into Transformer layers. These prompts are first embedded with visual knowledge and then concatenated with text features as prefixes.
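The inserted cross-attention can be sketched as a single-head toy version of a Flamingo-style layer; gating, multiple heads, and layer normalization are omitted, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared feature dimension (toy size)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, visual_feats, Wq, Wk, Wv):
    """Text positions (queries) attend to visual positions (keys/values),
    augmenting language features with visual cues inside the LLM."""
    q, k, v = text_feats @ Wq, visual_feats @ Wk, visual_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return text_feats + attn @ v  # residual connection around the inserted layer

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
text_feats = rng.normal(size=(10, d))    # 10 text positions
visual_feats = rng.normal(size=(49, d))  # 49 visual positions (e.g. a 7x7 feature map)
out = cross_attention(text_feats, visual_feats, Wq, Wk, Wv)
print(out.shape)  # (10, 64)
```

Unlike token-level fusion, the visual positions never enter the LLM's token sequence; they only modulate the text features at the inserted layers.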
In terms of parameter size, learnable interfaces generally comprise a small portion compared with encoders and LLMs. Take Qwen-VL [34] as an example: the parameter size of its Q-Former is about 0.08B, accounting for less than 1% of the whole model, while the encoder and the LLM account for about 19.8% (1.9B) and 80.2% (7.7B), respectively.
Expert Model. Apart from the learnable interface, using expert models, such as an image captioning model, is also a feasible way to bridge the modality gap [77], [78], [79], [80]. The basic idea is to convert multimodal inputs into language without training. In this way, LLMs can understand multimodality through the converted language. For example, VideoChat-Text [25] uses pre-trained vision models to extract visual information such as actions and enriches the descriptions using a speech recognition model. Though using expert models is straightforward, it may not be as flexible as adopting a learnable interface. The conversion of foreign modalities into text would cause information loss. For example, transforming videos into textual descriptions distorts spatial-temporal relationships [25].
3 TRAINING STRATEGY AND DATA
A full-fledged MLLM undergoes three stages of training, i.e. pre-training, instruction-tuning, and alignment tuning. Each phase of training requires different types of data and fulfills different objectives. In this section, we discuss training objectives, as well as data collection and characteristics for each training stage.
3.1 Pre-training
3.1.1 Training Detail
As the first training stage, pre-training mainly aims to align different modalities and learn multimodal world knowledge. The pre-training stage generally entails large-scale text-paired data, e.g. caption data. Typically, the caption pairs describe images/audio/videos in natural language sentences.
Here, we consider a common scenario where MLLMs are trained to align vision with text. As illustrated in Table 3, given an image, the model is trained to predict autoregressively the caption of the image, following a standard cross-entropy loss. A common approach for pre-training is to keep pre-trained modules (e.g. visual encoders and LLMs) frozen and train a learnable interface [20], [35], [72]. The idea is to align different modalities without losing pre-trained knowledge. Some methods [34], [81], [82] also unfreeze more modules (e.g. the visual encoder) to enable more trainable parameters for alignment. It should be noted that the training scheme is closely related to the data quality. For short and noisy caption data, a lower resolution (e.g. 224) can be adopted to speed up the training process, while for longer and cleaner data, it is better to utilize higher resolutions (e.g. 448 or higher) to mitigate hallucinations. Besides, ShareGPT4V [83] finds that with high-quality caption data in the pre-training stage, unlocking the vision encoder promotes better alignment.
| Input: <image> |
| Response: {caption} |
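The objective amounts to standard next-token cross-entropy over the caption positions only. A toy sketch with random logits follows; the vocabulary size and caption length are made up for illustration.

```python
import numpy as np

def caption_loss(logits, target_ids):
    """Mean autoregressive cross-entropy over caption tokens; the image and the
    prompt serve only as conditioning context and receive no loss."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
vocab_size, caption_len = 100, 6
logits = rng.normal(size=(caption_len, vocab_size))  # model outputs at caption positions
target_ids = rng.integers(0, vocab_size, size=caption_len)
print(caption_loss(logits, target_ids))
```

In a real MLLM the loss mask simply zeroes out the positions belonging to the image tokens and the instruction, leaving the same per-token cross-entropy on the caption.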
3.1.2 Data
Pre-training data mainly serve two purposes, i.e. (1) aligning different modalities and (2) providing world knowledge. The pre-training corpora can be divided into coarse-grained and fine-grained data according to granularity, which we will introduce sequentially. We summarize commonly used pre-training datasets in Table 4.
Coarse-grained caption data share some typical traits in common: (1) The data volume is large, since samples are generally sourced from the internet. (2) Because of the web-crawled nature, the captions are usually short and noisy, since they originate from the alt-text of web images. These data can be cleaned and filtered via automatic tools, for example, using the CLIP [13] model to filter out image-text pairs whose similarities are lower than a pre-defined threshold. In what follows, we introduce some representative coarse-grained datasets.
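The similarity-based filtering can be sketched as follows; the embeddings here are random stand-ins for real CLIP features, and the 0.3 threshold is an illustrative choice rather than a value from any specific dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def filter_pairs(image_embs, text_embs, threshold=0.3):
    """Keep indices of image-text pairs whose embedding similarity clears the threshold."""
    return [i for i, (img, txt) in enumerate(zip(image_embs, text_embs))
            if cosine_similarity(img, txt) >= threshold]

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))                  # well-aligned pairs: identical embeddings
image_embs = list(aligned) + [rng.normal(size=8)]
text_embs = list(aligned) + [rng.normal(size=8)]   # last pair is unrelated noise
kept = filter_pairs(image_embs, text_embs)
print(kept)
```

The four aligned pairs always survive; the unrelated pair is dropped whenever its random embeddings happen to have low cosine similarity, mirroring how noisy alt-text pairs are pruned at scale.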
TABLE 4: Common datasets used for pre-training.
| Dataset | Samples | Date |
|---|---|---|
| *Coarse-grained Image-Text* | | |
| CC-3M [84] | 3.3M | 2018 |
| CC-12M [85] | 12.4M | 2020 |
| SBU Captions [86] | 1M | 2011 |
| LAION-5B [87] | 5.9B | Mar-2022 |
| LAION-2B [87] | 2.3B | Mar-2022 |
| LAION-COCO [88] | 600M | Sep-2022 |
| COYO-700M [90] | 747M | Aug-2022 |
| *Fine-grained Image-Text* | | |
| ShareGPT4V-PT [83] | 1.2M | Nov-2023 |
| LVIS-Instruct4V [91] | 111K | Nov-2023 |
| ALLaVA [92] | 709K | Feb-2024 |
| *Video-Text* | | |
| MSR-VTT [93] | 200K | 2016 |
| *Audio-Text* | | |
| WavCaps [94] | 24K | Mar-2023 |
CC. CC-3M [84] is a web-scale caption dataset of 3.3M image-caption pairs, where the raw descriptions are derived from the alt-text associated with images. The authors design a complicated pipeline to clean the data: (1) For images, those with inappropriate content or aspect ratio are filtered. (2) For text, NLP tools are used to obtain text annotations, with samples filtered according to the designed heuristics. (3) For image-text pairs, images are assigned labels via classifiers. If the text annotations do not overlap with the image labels, the corresponding samples are dropped.
CC-12M [85] is a following work of CC-3M and contains 12.4M image-caption pairs. Compared with the previous work, CC-12M relaxes and simplifies the data-collection pipeline, thus collecting more data.
SBU Captions [86]. It is a captioned photo dataset containing 1M image-text pairs, with images and descriptions sourced from Flickr. Specifically, an initial set of images is acquired by querying the Flickr website with a large number of query terms. The descriptions attached to the images thus serve as captions. Then, to ensure that descriptions are relevant to the images, the retained images fulfill these requirements: (1) Descriptions of the images are of satisfactory length, as decided by observation. (2) Descriptions of the images contain at least 2 words from the predefined term lists and a prepositional word (e.g. “on”, “under”) that generally suggests spatial relationships.
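These heuristics can be sketched as a simple caption filter. The term list, preposition set, and minimum length below are hypothetical stand-ins; the actual SBU lists are much larger.

```python
# Toy SBU-style heuristics: keep a caption only if it is long enough, mentions at
# least two terms from a (hypothetical) term list, and contains a preposition that
# suggests spatial relationships.
TERMS = {"dog", "cat", "table", "chair", "beach", "tree"}
PREPOSITIONS = {"on", "under", "in", "near", "beside"}

def keep_caption(caption, min_words=6):
    words = caption.lower().split()
    if len(words) < min_words:
        return False                                  # too short to be descriptive
    if sum(w in TERMS for w in words) < 2:
        return False                                  # not visually grounded enough
    return any(w in PREPOSITIONS for w in words)      # requires a spatial relation

print(keep_caption("a dog sleeping under the table at home"))  # True
print(keep_caption("nice view"))                               # False
```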
LAION. This series comprises large web-scale datasets, with images crawled from the internet and the associated alt-text as captions. To filter the image-text pairs, the following steps are performed: (1) Text with short lengths or images with too small or too big sizes are dropped. (2) Image deduplication based on URL. (3) CLIP [13] embeddings are extracted for images and text, and used to drop possibly illegal content and image-text pairs with low cosine similarity between embeddings. Here we offer a brief summary of some typical variants:
• LAION-5B [87]: It is a research-purpose dataset of 5.85B image-text pairs. The dataset is multilingual, with a 2B English subset.
• LAION-COCO [88]: It contains 600M images extracted from the English subset of LAION-5B. The captions are synthetic: BLIP [89] generates various candidate captions and CLIP [13] picks the best fit for each image.
COYO-700M [90]. It contains 747M image-text pairs extracted from Common Crawl. For data filtering, the authors design the following strategies: (1) For images, those with inappropriate size, content, format, or aspect ratio are filtered. Moreover, the images are filtered based on their pHash values to remove images overlapping with public datasets such as ImageNet and MS-COCO. (2) For text, only English text with satisfactory length, noun forms, and appropriate words is kept. Whitespace before and after each sentence is removed, and consecutive whitespace characters are collapsed into a single space. Moreover, text appearing more than 10 times (e.g. "image for") is dropped. (3) For image-text pairs, duplicated samples are removed based on the (image pHash, text) tuple.
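The CLIP-based filtering step used by LAION-style pipelines reduces to a cosine-similarity threshold on paired embeddings. Below is a minimal numpy sketch under the assumptions that the embeddings have already been extracted and that the 0.28 threshold (the value reported for LAION's English data) is appropriate; the function name is illustrative.

```python
import numpy as np

def clip_similarity_filter(img_embs, txt_embs, threshold=0.28):
    """Keep image-text pairs whose CLIP embedding cosine similarity
    exceeds a threshold. Embeddings are assumed pre-extracted."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sims = np.sum(img * txt, axis=1)  # row-wise cosine similarity
    return sims >= threshold, sims

# Toy check: one aligned pair (same direction) and one mismatched pair.
img = np.array([[1.0, 0.0], [1.0, 0.0]])
txt = np.array([[1.0, 0.0], [0.0, 1.0]])
keep, sims = clip_similarity_filter(img, txt)
```

In practice the same embeddings are reused for NSFW screening, which is why the pipeline extracts them once and applies several filters on top.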
Fig. 3: Comparison of three typical learning paradigms. The image is from [19].

Recently, more works [83], [91], [92] have explored generating high-quality fine-grained data by prompting strong MLLMs (e.g. GPT-4V). Compared with coarse-grained data, these data generally contain longer and more accurate descriptions of the images, thus enabling finer-grained alignment between the image and text modalities. However, since this approach generally requires calling commercial MLLMs, the cost is higher and the data volume is relatively smaller. Notably, ShareGPT4V [83] strikes a balance by first training a captioner on 100K GPT-4V-generated samples, then scaling the data volume up to 1.2M using the pre-trained captioner.
3.2 Instruction-tuning
3.2.1 Introduction
Instruction refers to the description of tasks. Intuitively, instruction tuning aims to teach models to better understand the instructions from users and fulfill the demanded tasks. Tuned in this way, LLMs can generalize to unseen tasks by following new instructions, thus boosting zero-shot performance. This simple yet effective idea has sparked the success of subsequent NLP works, such as ChatGPT [2], InstructGPT [95], FLAN [19], [56], and OPT-IML [96].
The comparisons between instruction tuning and related typical learning paradigms are illustrated in Fig. 3. The supervised fine-tuning approach usually requires a large amount of task-specific data to train a task-specific model. The prompting approach reduces the reliance on large-scale data and can fulfill a specialized task via prompt engineering. In such a case, though the few-shot performance has been improved, the zero-shot performance is still quite average [7]. Differently, instruction tuning learns how to generalize to unseen tasks rather than fitting specific tasks like the two counterparts. Moreover, instruction tuning is highly related to multi-task prompting [97].
In this section, we delineate the format of instruction samples, the training objectives, typical ways to gather instruction data, and corresponding commonly used datasets.
3.2.2 Training Detail
A multimodal instruction sample often includes an optional instruction and an input-output pair. The instruction is typically a natural language sentence describing the task, such as "Describe the image in detail." The input can be an image-text pair, as in the VQA task [99], or only an image, as in the image captioning task [100]. The output is the answer to the instruction conditioned on the input. The instruction template is flexible and subject to manual design [20], [25], [98], as exemplified in Table 5. Note that the instruction template can also be generalized to the case of multi-round conversations [20], [37], [71], [98].

TABLE 5: A simplified template to structure the multimodal instruction data. <instruction> is a textual description of the task.
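A minimal sketch of how such a single-turn sample might be serialized in code; the USER/ASSISTANT role tags and the `<image>` placeholder are illustrative assumptions, since each MLLM defines its own special tokens.

```python
def build_prompt(instruction, question=None):
    """Serialize one single-turn multimodal instruction sample in a
    chat-style template. Role tags and <image> token are illustrative."""
    user_turn = f"<image>\n{instruction}"
    if question is not None:  # e.g. a VQA sample also carries a question
        user_turn += f"\n{question}"
    return f"USER: {user_turn}\nASSISTANT:"

prompt = build_prompt("Answer the question based on the image.",
                      "How many dogs are in the picture?")
```

The model is then trained to continue the string after "ASSISTANT:" with the ground-truth response.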
Formally, a multimodal instruction sample can be denoted in a triplet form, i.e. $(\mathcal{I},\mathcal{M},\mathcal{R})$, where $\mathcal{I},\mathcal{M},\mathcal{R}$ represent the instruction, the multimodal input, and the ground-truth response, respectively. The MLLM predicts an answer given the instruction and the multimodal input:
$$
\mathcal{A}=f(\mathcal{I},\mathcal{M};\theta)
$$
Here, $\mathcal{A}$ denotes the predicted answer, and $\theta$ are the parameters of the model. The training objective is typically the original auto-regressive objective used to train LLMs [20], [37], [71], [101], based on which the MLLM is encouraged to predict the next token of the response. The objective can be expressed as:
$$
\mathcal{L}(\theta)=-\sum_ {i=1}^{N}\log p(\mathcal{R}_ {i}\mid\mathcal{I},\mathcal{M},\mathcal{R}_ {<i};\theta)
$$
where $N$ is the length of the ground-truth response.
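The objective can be illustrated numerically: given per-token log-probabilities from the model, the loss sums the negative log-probability over response tokens only, masking out the instruction and multimodal input positions. This is a simplified numpy sketch, not any model's actual training code.

```python
import numpy as np

def response_nll(token_logprobs, response_mask):
    """Autoregressive instruction-tuning loss: sum of negative
    log-probabilities over ground-truth response tokens only;
    instruction/input positions are masked out of the loss."""
    lp = np.asarray(token_logprobs, dtype=float)
    mask = np.asarray(response_mask, dtype=float)
    return float(-np.sum(lp * mask))

# 5-token sequence: the first 3 tokens belong to the instruction/input,
# the last 2 to the response R (so N = 2).
loss = response_nll([-0.1, -0.2, -0.3, -0.5, -0.7], [0, 0, 0, 1, 1])
```

In a real implementation the same effect is achieved by setting label ids of non-response tokens to an ignore index in the cross-entropy loss.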
3.2.3 Data Collection
Since instruction data are more flexible in formats and varied in task formulations, it is usually trickier and more costly to collect data samples. In this section, we summarize three typical ways to harvest instruction data at scale, i.e. data adaptation, self-instruction, and data mixture.
Data Adaptation. Task-specific datasets are rich sources of high-quality data. Hence, abundant works [60], [70], [76], [82], [101], [102], [103], [104] have utilized existing high-quality datasets to construct instruction-formatted datasets. Take the transformation of VQA datasets as an example: the original sample is an input-output pair, where the input comprises an image and a natural language question, and the output is the textual answer to the question conditioned on the image. The input-output pairs of these datasets can naturally serve as the multimodal input and response of the instruction sample (see $\S3.2.2$). The instructions, i.e. the descriptions of the tasks, can either derive from manual design or from semi-automatic generation aided by GPT. Specifically, some works [21], [35], [60], [70], [102], [105] hand-craft a pool of candidate instructions and sample one of them during training. We offer an example of instruction templates for the VQA datasets in Table 6. Other works manually design some seed instructions and use these to prompt GPT to generate more [25], [82], [98].
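The template-pool sampling described above can be sketched as follows; the template wordings here are illustrative rather than the exact ones from [60].

```python
import random

# A small hand-crafted pool of candidate instructions, in the spirit of
# Table 6; the exact wordings are illustrative assumptions.
TEMPLATES = [
    "<image> {question} A short answer to the question is",
    "<image> Given the image, answer the following question. {question}",
    "<image> Question: {question} Short answer:",
]

def vqa_to_instruction(question, answer, rng=random):
    """Adapt one VQA (image, question, answer) sample to instruction
    format by pairing it with a randomly sampled template."""
    template = rng.choice(TEMPLATES)
    return {"input": template.format(question=question), "response": answer}

sample = vqa_to_instruction("What color is the bus?", "red")
```

Sampling a fresh template per training step is what injects the prompt diversity discussed in §3.2.4.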
Note that since the answers of existing VQA and caption datasets are usually concise, directly using these datasets for instruction tuning may limit the output length of MLLMs. There are two common strategies to tackle this problem. The first is to specify the desired length explicitly in the instructions. For example, ChatBridge [104] explicitly declares "short" and "brief" for short-answer data, as well as "a sentence" and "single sentence" for conventional coarse-grained caption data. The second is to extend the length of existing answers [105]. For example, $\mathrm{M}^{3}\mathrm{IT}$ [105] proposes to rephrase the original answer by prompting ChatGPT with the original question, answer, and contextual information of the image (e.g. caption and OCR).

TABLE 6: Instruction templates for VQA datasets, cited from [60]. and {Question} are the image and the question in the original VQA datasets, respectively.

TABLE 7: A summary of popular datasets generated by self-instruction. For input/output modalities, I: Image, T: Text, V: Video, A: Audio. For data composition, M-T and S-T denote multi-turn and single-turn, respectively.

| Dataset | Sample | Modality | Source | Composition |
|---|---|---|---|---|
| LLaVA-Instruct | 158K | I+T→T | MS-COCO | 23K caption + 58K M-T QA + 77K reasoning |
| LVIS-Instruct | 220K | I+T→T | LVIS | 110K caption + 110K M-T QA |
| ALLaVA | 1.4M | I+T→T | VFlan, LAION | 709K caption + 709K S-T QA |
| Video-ChatGPT | 100K | V+T→T | ActivityNet | 7K description + 4K M-T QA |
| VideoChat | 11K | V+T→T | WebVid | description + summarization + creation |
| Clotho-Detail | 3.9K | A+T→T | Clotho | caption |
Self-Instruction. Although existing multi-task datasets can contribute a rich source of data, they usually do not meet human needs well in real-world scenarios, such as multiple rounds of conversations. To tackle this issue, some works collect samples through self-instruction [106], which utilizes LLMs to generate textual instruction-following data using a few hand-annotated samples. Specifically, some instruction-following samples are hand-crafted as demonstrations, after which ChatGPT/GPT-4 is prompted to generate more instruction samples with the demonstrations as guidance. LLaVA [20] extends the approach to the multimodal field by translating images into texts of captions and bounding boxes, and prompting text-only GPT-4 to generate new data under the guidance of requirements and demonstrations. In this way, a multimodal instruction dataset, LLaVA-Instruct-150k, is constructed. Following this idea, subsequent works such as MiniGPT-4 [21], ChatBridge [104], GPT4Tools [107], and DetGPT [72] develop different datasets catering to different needs. Recently, with the release of the more powerful multimodal model GPT-4V, many works have adopted GPT-4V to generate data of higher quality, as exemplified by LVIS-Instruct4V [91] and ALLaVA [92]. We summarize the popular datasets generated through self-instruction in Table 7.
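A simplified sketch of the LLaVA-style pipeline: the image is "translated" into text (captions plus bounding boxes) and combined with hand-crafted demonstrations into a prompt for a text-only LLM. The prompt wording and function name are hypothetical, not LLaVA's exact prompt.

```python
def build_self_instruct_prompt(captions, boxes, demonstrations):
    """Render an image as text (captions + labeled boxes) and prepend
    demonstrations, so a text-only LLM can generate conversation data.
    Prompt wording is an illustrative assumption."""
    context = "\n".join(captions)
    context += "\n" + "\n".join(f"{name}: {coords}" for name, coords in boxes)
    demos = "\n\n".join(demonstrations)
    return (f"{demos}\n\nImage context:\n{context}\n\n"
            "Design a multi-turn conversation between a user asking about "
            "this image and an assistant answering.")

prompt = build_self_instruct_prompt(
    captions=["A man walking a dog in the park."],
    boxes=[("person", [0.12, 0.30, 0.45, 0.92]),
           ("dog", [0.50, 0.60, 0.70, 0.90])],
    demonstrations=["<hand-crafted example conversation>"],
)
```

The returned string would then be sent to the LLM's chat API, and the generated conversations parsed back into instruction samples.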
Data Mixture. Apart from the multimodal instruction data, language-only user-assistant conversation data can also be used to improve conversational proficiencies and instruction-following abilities [81], [98], [101], [103]. LaVIN [101] directly constructs each minibatch by randomly sampling from both language-only and multimodal data. MultiInstruct [102] probes different strategies for training with a fusion of single-modal and multimodal data, including mixed instruction tuning (combine both types of data and randomly shuffle) and sequential instruction tuning (text data followed by multimodal data).
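The LaVIN-style random mixing can be sketched in a few lines; sampling here is with replacement for simplicity, whereas a real implementation may sample without replacement or weight the two sources.

```python
import random

def mixed_minibatch(text_only, multimodal, batch_size, seed=0):
    """Mixed instruction tuning: draw each minibatch at random from the
    union of language-only and multimodal instruction samples
    (with replacement, for simplicity)."""
    rng = random.Random(seed)
    pool = list(text_only) + list(multimodal)
    return [rng.choice(pool) for _ in range(batch_size)]

batch = mixed_minibatch(["t1", "t2"], ["m1", "m2", "m3"], batch_size=4)
```

Sequential instruction tuning would instead iterate over the text-only pool to completion before switching to the multimodal pool.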
3.2.4 Data Quality
Recent research has revealed that the data quality of instruction-tuning samples is no less important than quantity. Lynx [73] finds that models pre-trained on large-scale but noisy image-text pairs do not perform as well as models pre-trained with smaller but cleaner datasets. Similarly, Wei et al. [108] finds that less instruction-tuning data with higher quality can achieve better performance. For data filtering, the work proposes some metrics to evaluate data quality and, correspondingly, a method to automatically filter out inferior vision-language data. Here we discuss two important aspects regarding data quality.
Prompt Diversity. The diversity of instructions has been found to be critical for model performance. Lynx [73] empirically verifies that diverse prompts help improve model performance and generalization ability.
Task Coverage. In terms of tasks involved in training data, Du et al. [109] perform an empirical study and find that the visual reasoning task is superior to captioning and QA tasks for boosting model performance. Moreover, the study suggests that enhancing the complexity of instructions might be more beneficial than increasing task diversity and incorporating fine-grained spatial annotations.
3.3 Alignment tuning
3.3.1 Introduction
Alignment tuning is more often used in scenarios where models need to be aligned with specific human preferences, e.g. response with fewer hallucinations (see §6). Currently, Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) are two main techniques for alignment tuning. In this section, we introduce the main ideas of the two techniques in sequence and offer some examples of how they are utilized in addressing practical problems, and finally, give a compilation of the related datasets.
3.3.2 Training Detail
RLHF [110], [111]. This technique aims to utilize reinforcement learning algorithms to align LLMs with human preferences, with human annotations as supervision in the training loop. As exemplified in InstructGPT [95], RLHF incorporates three key steps:
- Supervised fine-tuning. This step aims to fine-tune a pre-trained model to present the preliminary desired output behavior. The fine-tuned model in the RLHF setting is called a policy model. Note that this step might be skipped since the supervised policy model $\pi^{\mathrm{{SFT}}}$ can be initialized from an instruction-tuned model (see §3.2).
- Reward modeling. A reward model is trained using preference pairs in this step. Given a multimodal prompt (e.g. image and text) $x$ and a response pair $(y_ {w}, y_ {l})$, the reward model $r_ {\theta}$ learns to give a higher reward to the preferred response $y_ {w}$ and a lower one to $y_ {l}$, according to the following objective:
$$
\mathcal{L}(\theta)=-\mathbb{E}_ {(x,y_ {w},y_ {l})\sim\mathcal{D}}\Big[\log\sigma\Big(r_ {\theta}(x,y_ {w})-r_ {\theta}(x,y_ {l})\Big)\Big]
$$
where $\mathcal{D}=\{(x,y_ {w},y_ {l})\}$ is the comparison dataset labeled by human annotators. In practice, the reward model $r_ {\theta}$ shares a similar structure with the policy model.
- Reinforcement learning. In this step, the Proximal Policy Optimization (PPO) algorithm is adopted to optimize the RL policy model $\pi_ {\phi}^{\mathrm{RL}}$ . A per-token KL penalty is often added to the training objective to avoid deviating too far from the original policy [95], resulting in the objective:
$$
\mathcal{L}(\phi)=-\mathbb{E}_ {x\sim\mathcal{D},\,y\sim\pi_ {\phi}^{\mathrm{RL}}(y|x)}\Big[r_ {\theta}(x,y)-\beta\cdot\mathbb{D}_ {\mathrm{KL}}\Big(\pi_ {\phi}^{\mathrm{RL}}(y|x)\,\Big\|\,\pi^{\mathrm{REF}}(y|x)\Big)\Big]
$$
where $\beta$ is the coefficient for the $\mathrm{KL}$ penalty term. Typically, both the RL policy $\pi_ {\phi}^{\mathrm{RL}}$ and the reference model $\pi^{\mathrm{REF}}$ are initialized from the supervised model $\pi^{\mathrm{SFT}}$. The obtained RL policy model is expected to align with human preferences through this tuning process.
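Both the reward-modeling objective and the KL-penalized RL reward above can be illustrated with small numpy functions. This is a sketch of the math only, not an RLHF training loop; the sequence-level KL term is the usual sample-based estimate from token log-probabilities.

```python
import numpy as np

def reward_loss(r_w, r_l):
    """Reward-modeling objective: -log sigmoid(r(x,y_w) - r(x,y_l)),
    averaged over the human-labeled comparison dataset D."""
    margin = np.asarray(r_w, dtype=float) - np.asarray(r_l, dtype=float)
    # logaddexp(0, -m) == log(1 + exp(-m)) == -log sigmoid(m), stable form
    return float(np.mean(np.logaddexp(0.0, -margin)))

def ppo_reward(r, logp_rl, logp_ref, beta=0.1):
    """KL-penalized reward for the RL step: the scalar reward from
    r_theta minus beta times a sample-based KL estimate between the
    current policy and the frozen reference (SFT) policy."""
    kl = float(np.sum(np.asarray(logp_rl, dtype=float)
                      - np.asarray(logp_ref, dtype=float)))
    return r - beta * kl

l_zero = reward_loss([0.0], [0.0])   # no margin: loss = log 2
l_big = reward_loss([2.0], [0.0])    # larger margin: smaller loss
shaped = ppo_reward(r=1.0, logp_rl=[-0.5, -0.4],
                    logp_ref=[-0.7, -0.6], beta=0.1)
```

The shaped reward is what PPO maximizes, so the policy is pulled toward preferred responses while staying close to the SFT model.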
Researchers have explored using the RLHF techniques for better multimodal alignment. For example, LLaVA-RLHF [112] collects human preference data and tunes a model with fewer hallucinations based on LLaVA [20].
DPO [113]. It learns from human preference labels utilizing a simple binary classification loss. Compared with the PPO-based RLHF algorithm, DPO is exempt from learning an explicit reward model, thus simplifying the whole pipeline to two steps, i.e. human preference data collection and preference learning. The learning objective is as follows:
$$
\mathcal{L}(\phi)=-\mathbb{E}_ {(x,y_ {w},y_ {l})\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\frac{\pi_ {\phi}^{\mathrm{RL}}(y_ {w}|x)}{\pi^{\mathrm{REF}}(y_ {w}|x)}-\beta\log\frac{\pi_ {\phi}^{\mathrm{RL}}(y_ {l}|x)}{\pi^{\mathrm{REF}}(y_ {l}|x)}\Big)\Big]
$$
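The DPO objective can likewise be computed directly from sequence log-probabilities, with no reward model in the loop. A numerical sketch, assuming the four log-probabilities have already been obtained from the policy and reference models:

```python
import numpy as np

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO objective: binary classification on the implicit reward margin
    beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
            - (log pi(y_l|x) - log pi_ref(y_l|x))]."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return float(np.logaddexp(0.0, -margin))  # -log sigmoid(margin)

# When the policy equals the reference, the margin is 0 and the loss is log 2.
loss = dpo_loss(-10.0, -10.0, -12.0, -12.0)
```

Gradient descent on this loss raises the log-ratio of the chosen response relative to the rejected one, which is why no explicit reward model or PPO loop is needed.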
RLHF-V [114] collects fine-grained (segment-level) preference data pairs by correcting hallucinations in the model response and uses the obtained data to perform dense DPO. Silkie [115] instead collects preference data via prompting GPT-4V and distills the preference supervision into an instruction-tuned model through DPO.
TABLE 8: A summary of datasets for alignment-tuning. For input/output modalities, I: Image, T: Text.
| Dataset | Sample | Modality | Source |
|---|---|---|---|
| LLaVA-RLHF [112] | 10K | I+T→T | Human |
| RLHF-V [114] | 5.7K | I+T→T | Human |
| VLFeedback [115] | 380K | I+T→T | GPT-4V |
3.3.3 Data
The gist of data collection for alignment-tuning is to collect feedback for model responses, i.e. to decide which response is better. It is generally more expensive to collect such data, and the amount of data used for this phase is typically even less than that used in previous stages. In this part, we introduce some datasets and summarize them in Table 8.
LLaVA-RLHF [112]. It contains 10K preference pairs collected from human feedback in terms of honesty and helpfulness. The dataset mainly serves to reduce hallucinations in model responses.
RLHF-V [114]. It has 5.7K fine-grained human feedback data collected by segment-level hallucination corrections.
VLFeedback [115]. It utilizes AI to provide feedback on model responses. The dataset contains more than 380K comparison pairs scored by GPT-4V in terms of helpfulness, faithfulness, and ethical concerns.
4 EVALUATION
Evaluation is an essential part of developing MLLMs since it provides feedback for model optimization and helps to compare the performance of different models. Compared with evaluation methods of traditional multimodal models, the evaluation of MLLMs exhibits several new traits: (1) Since MLLMs are generally versatile, it is important to evaluate MLLMs comprehensively. (2) MLLMs exhibit many emergent capabilities that require special attention (e.g. OCR-free math reasoning) and thus require new evaluation schemes. The evaluation of MLLMs can be broadly categorized into two types according to the question genres, including closed-set and open-set.
4.1 Closed-set
Closed-set questions refer to a type of question where the possible answer options are predefined and limited to a finite set. The evaluation is usually performed on taskspecific datasets. In this case, the responses can be naturally judged by benchmark metrics [20], [60], [70], [76], [101], [102], [103], [104]. For example, InstructBLIP [60] reports the accuracy on ScienceQA [116], as well as the CIDEr score [117] on NoCaps [118] and Flickr30K [119]. The evaluation settings are typically zero-shot [60], [102], [104], [105] or finetuning [20], [35], [60], [70], [76], [101], [103], [105]. The first setting often selects a wide range of datasets covering different general tasks and splits them into held-in and held-out datasets. After tuning on the former, zero-shot performance is evaluated on the latter with unseen datasets or even unseen tasks. In contrast, the second setting is often observed in the evaluation of domain-specific tasks. For example, LLaVA [20] and LLaMA-Adapter [76] report finetuned performance on ScienceQA [116]. LLaVA-Med [35] reports results on biomedical VQA [120], [121], [122].
The above evaluation methods are usually limited to a small range of selected tasks or datasets, lacking a comprehensive quantitative comparison. To this end, some efforts have endeavored to develop new benchmarks specially designed for MLLMs [123], [124], [125], [126], [127], [128], [129]. For example, Fu et al. [123] construct a comprehensive evaluation benchmark MME that includes a total of 14 perception and cognition tasks. All instruction-answer pairs in MME are manually designed to avoid data leakage. MMBench [124] is a benchmark specifically designed for evaluating multiple dimensions of model capabilities, using ChatGPT to match open responses with pre-defined choices. Video-ChatGPT [130] and Video-Bench [131] focus on video domains and propose specialized benchmarks as well as evaluation tools for assessment. There are also evaluation strategies designed to evaluate a specific aspect of the model [102], as exemplified by POPE [132] for assessment of hallucination degree.
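For closed-set benchmarks with short ground-truth answers, the benchmark metric can be as simple as normalized exact match. The sketch below is a toy illustration; real benchmarks such as ScienceQA or MME apply their own matching rules and metrics (e.g. option letters, CIDEr, or ChatGPT-based matching).

```python
def closed_set_accuracy(predictions, answers):
    """Normalized exact-match accuracy over a closed answer set.
    Normalization (case, whitespace, trailing period) is illustrative."""
    def norm(s):
        return s.strip().lower().rstrip(".")
    correct = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

acc = closed_set_accuracy(["Yes.", "two", "blue"], ["yes", "Two", "red"])
```

Held-in/held-out evaluation then reduces to computing such metrics on datasets the model was, or was not, tuned on.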
4.2 Open-set
In contrast to closed-set questions, the responses to open-set questions can be more flexible, where MLLMs usually play a chatbot role. Because the content of the chat can be arbitrary, it is trickier to judge than closed-ended output. The criteria can be classified into manual scoring, GPT scoring, and case study. Manual scoring requires humans to assess the generated responses. This kind of approach often involves hand-crafted questions designed to assess specific dimensions. For example, mPLUG-Owl [81] collects a visually related evaluation set to judge capabilities like natural image understanding and diagram and flowchart understanding. Similarly, GPT4Tools [107] builds two sets for finetuning and zero-shot performance, respectively, and evaluates the responses in terms of thought, action, arguments, and the whole.
Since manual assessment is labor intensive, some researchers have explored rating with GPT, namely GPT scoring. This approach is often used to evaluate performance on multimodal dialogue. LLaVA [20] proposes to score the responses via text-only GPT-4 in terms of different aspects, such as helpfulness and accuracy. Specifically, 30 images are sampled from the COCO [133] validation set, each associated with a short question, a detailed question, and a complex reasoning question via self-instruction on GPT-4. The answers generated by both the model and GPT-4 are sent to GPT-4 for comparison. Subsequent works follow this idea and prompt ChatGPT [81] or GPT-4 [35], [70], [101], [104], [105] to rate results [35], [70], [81], [101], [104] or judge which one is better [103].
A main issue of applying text-only GPT-4 as an evaluator is that the judge is only based on image-related text content, such as captions or bounding box coordinates, without accessing the image [35]. Thus, it may be questionable to set GPT-4 as the performance upper bound in this case. With the release of the vision interface of GPT, some works [77], [134] exploit a more advanced GPT-4V model to assess the performance of MLLMs. For example, Woodpecker [77] adopts GPT-4V to judge the response quality of model answers based on the image. The evaluation is expected to be more accurate than using text-only GPT-4 since GPT-4V has direct access to the image.
A supplementary approach is to compare the different capabilities of MLLMs through case studies. For instance, some studies evaluate two typical advanced commercial models, GPT-4V and Gemini. Yang et al. [135] perform an in-depth qualitative analysis of GPT-4V by crafting a series of samples across various domains and tasks, spanning from preliminary skills, such as captioning and object counting, to complex tasks that require world knowledge and reasoning, such as joke understanding and indoor navigation as an embodied agent. Wen et al. [136] make a more focused evaluation of GPT-4V by designing samples targeting autonomous driving scenarios. Fu et al. [137] carry out a comprehensive evaluation of Gemini-Pro by comparing the model against GPT-4V. The results suggest that GPT-4V and Gemini exhibit comparable visual reasoning abilities despite different response styles.
5 EXTENSIONS
Recent studies have made significant strides in extending the capabilities of MLLMs, spanning from more potent foundational abilities to broader coverage of scenarios. We trace the principal development of MLLMs in this regard.
Granularity Support. To facilitate better interaction between agents and users, researchers have developed MLLMs with finer-grained support in terms of model inputs and outputs. On the input side, models that support finer control from user prompts have been developed progressively, evolving from image-level to region-level [28], [138], [139] and even pixel-level [29], [140], [141] prompts. Specifically, Shikra [28] supports region-level input and understanding. Users may interact with the assistant more flexibly by referring to specific regions, which are represented as bounding boxes in natural language form. Ferret [141] takes a step further and supports more flexible referring by devising a hybrid representation scheme. The model supports different forms of prompts, including point, box, and sketch. Similarly, Osprey [29] supports point input by utilizing a segmentation model [9]. Aided by the exceptional capabilities of the pre-trained segmentation model, Osprey enables specifying a single entity or part of it with a single click. On the output side, grounding capabilities have improved in line with the development of input support. Shikra [28] supports responses grounded in the image with box annotations, resulting in higher precision and a finer referring experience. LISA [142] further supports mask-level understanding and reasoning, which makes pixel-level grounding possible.
粒度支持。为促进AI智能体与用户之间更好的交互,研究人员开发了在模型输入输出方面具备更精细粒度支持的多模态大语言模型(MLLM)。输入侧逐步发展出支持用户提示更精细控制的模型,从图像演进到区域[28][138][139]乃至像素级别[29][140][141]。具体而言,Shikra[28]支持区域级输入与理解,用户可通过自然语言形式的边界框指定特定区域进行更灵活的交互。Ferret[141]采用混合表征方案,进一步支持点、框、草图等多种提示形式实现更灵活的指代。Osprey[29]则利用分割模型[9]实现点击指定单个实体或其部件的点输入功能。输出侧则与输入支持同步提升定位能力:Shikra[28]通过框标注实现图像响应定位,提供更高精度的细粒度指代体验;LISA[142]进一步支持掩码级理解与推理,使像素级定位成为可能。
Modality Support. Increasing support for more modalities is a trend in MLLM research. On the one hand, researchers have explored adapting MLLMs to support the input of more multimodal content, such as 3D point clouds [41], [143], [144], [145]. On the other hand, MLLMs have also been extended to generate responses in more modalities, such as image [32], [146], [147], [148], audio [32], [147], [149], [150], and video [32], [151]. For example, NExT-GPT [32] proposes a framework that supports inputs and outputs of mixed modalities, specifically, combinations of text, image, audio, and video, with the help of diffusion models [152], [153] attached to the MLLM. The framework applies an encoder-decoder architecture and uses the LLM as a pivot for understanding and reasoning.
Language Support. Current models are predominantly monolingual, probably due to the scarcity of high-quality non-English training corpora. Some works have been devoted to developing multilingual models so that a broader range of users can be covered. VisCPM [33] transfers model capabilities to the multilingual setting by designing a multi-stage training scheme. Specifically, the scheme takes English, with its abundant training corpus, as a pivotal language. Utilizing a pre-trained bilingual LLM, the multimodal capabilities are transferred to Chinese by adding some translated samples during instruction tuning. Taking a similar approach, Qwen-VL [34] is developed from the bilingual LLM Qwen [58] and supports both Chinese and English. During pre-training, Chinese data is mixed into the training corpus to preserve the bilingual capabilities of the model, taking up 22.7% of the whole data volume.
Scenario/Task Extension. Apart from developing common general-purpose assistants, some studies have focused on more specific scenarios where practical conditions should be considered, while others extend MLLMs to downstream tasks with specific expertise.
A typical tendency is to adapt MLLMs to more specific real-life scenarios. MobileVLM [63] explores developing small-size variants of MLLMs for resource-limited scenarios. Several designs and techniques are utilized for deployment on mobile devices, such as LLMs of smaller size and quantization techniques to speed up computation. Other works develop agents that interact with the real world [41], [154], [155], e.g. user-friendly assistants specially designed for Graphical User Interfaces (GUIs), as exemplified by CogAgent [44], AppAgent [43], and Mobile-Agent [45]. These assistants excel in planning and guiding through each step to fulfill a task specified by users, acting as helpful agents for human-machine interaction. Another line is to augment MLLMs with specific skills for solving tasks in different domains, e.g. document understanding [38], [39], [156], [157] and the medical domain [35], [36], [37]. For document understanding, mPLUG-DocOwl [38] utilizes various forms of document-level data for tuning, resulting in an enhanced model for OCR-free document understanding. TextMonkey [39] incorporates multiple tasks related to document understanding to improve model performance. Apart from conventional document image and scene text datasets, position-related tasks are added to reduce hallucinations and help models learn to ground responses in the visual information. MLLMs can also be extended to the medical domain by instilling medical knowledge. For example, LLaVA-Med [158] injects medical knowledge into vanilla LLaVA [20] and develops an assistant specialized in medical image understanding and question answering.
6 MULTIMODAL HALLUCINATION
Multimodal hallucination refers to the phenomenon of responses generated by MLLMs being inconsistent with the image content [77]. As a fundamental and important problem, the issue has received increased attention. In this section, we briefly introduce some related concepts and research development.
6.1 Preliminaries
Current research on multimodal hallucinations can be further categorized into three types [159]:
• Existence hallucination, where models incorrectly claim that certain objects are present in the image;
• Attribute hallucination, where models describe the attributes of objects, such as color or shape, incorrectly;
• Relationship hallucination, where models describe the relationships between objects, such as relative positions or interactions, incorrectly.
In what follows, we first introduce some specific evaluation methods (§6.2), which are useful to gauge the performance of methods for mitigating hallucinations (§6.3). Then, we will discuss in detail the current methods for reducing hallucinations, according to the main categories each method falls into.
6.2 Evaluation Methods
CHAIR [160] is an early metric that evaluates hallucination levels in open-ended captions. The metric measures the proportion of sentences containing hallucinated objects, as well as the proportion of hallucinated objects among all the objects mentioned. In contrast, POPE [132] is a method that evaluates closed-set choices. Specifically, multiple prompts with binary choices are formulated, each querying if a specific object exists in the image. The method also covers more challenging settings to evaluate the robustness of MLLMs, with data statistics taken into consideration. The final evaluation uses a simple keyword-matching mechanism, i.e. detecting the keywords "yes"/"no", to convert open-ended responses into closed-set binary choices. With a similar evaluation approach, MME [123] provides a more comprehensive evaluation, covering aspects of existence, count, position, and color, as exemplified in [77].
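To make the two CHAIR variants concrete (a per-sentence rate and a per-object rate), here is a minimal sketch. The object-extraction function passed in is a stand-in assumption; the original metric matches captions against a fixed MSCOCO object vocabulary with a synonym list.

```python
def chair_scores(captions, extract_objects, gt_objects):
    """Compute CHAIR-style hallucination rates for open-ended captions.

    captions: list of generated caption strings
    extract_objects: maps a caption to the set of object names it mentions
    gt_objects: list of sets, the ground-truth objects in each image
    """
    hallucinated_sentences = 0
    total_mentions = 0
    hallucinated_mentions = 0
    for caption, gt in zip(captions, gt_objects):
        mentioned = extract_objects(caption)
        fake = mentioned - gt          # objects mentioned but absent from the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(fake)
        if fake:
            hallucinated_sentences += 1
    chair_s = hallucinated_sentences / max(len(captions), 1)  # per-sentence rate
    chair_i = hallucinated_mentions / max(total_mentions, 1)  # per-object rate
    return chair_s, chair_i
```

On two toy captions where one mentions a nonexistent cat, this yields CHAIR_s = 0.5 (one of two captions hallucinates) and CHAIR_i = 1/3 (one of three mentions is hallucinated).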
Different from previous approaches that use matching mechanisms to detect and decide hallucinations, HaELM [161] proposes using text-only LLMs as a judge to automatically decide whether MLLMs' captions are correct against reference captions. In light of the fact that text-only LLMs can only access limited image context and require reference annotations, Woodpecker [77] uses GPT-4V to directly assess model responses grounded in the image. FaithScore [162] is a more fine-grained metric based on a routine that breaks down descriptive sub-sentences and evaluates each sub-sentence separately. Based on previous studies, AMBER [163] is an LLM-free benchmark that encompasses both discriminative tasks and generative tasks and involves three types of possible hallucinations (see §6.1).
6.3 Mitigation Methods
In terms of high-level ideas, current methods can be roughly divided into three categories: pre-correction, in-process-correction, and post-correction.
Pre-correction. An intuitive and straightforward solution for hallucination is to collect specialized data (e.g. negative data) and use the data for fine-tuning, thus resulting in models with fewer hallucinated responses.
LRV-Instruction [164] introduces a visual instruction tuning dataset. Apart from common positive instructions, the dataset incorporates delicately designed negative instructions at different semantic levels to encourage responses faithful to the image content. LLaVA-RLHF [112] collects human-preference pairs and finetunes models with reinforcement learning techniques, leading to models better aligned with human preferences and fewer hallucinated answers.
In-process-correction. Another line is to make improvements in architectural design or feature representation. These works try to explore the reasons for hallucinations and design corresponding remedies to mitigate them in the generation process.
HallE-Switch [159] performs an empirical analysis of possible factors of object existence hallucinations and hypothesizes that existence hallucinations derive from objects not grounded by visual encoders, and they are actually inferred based on knowledge embedded in the LLM. Based on the assumption, a continuous controlling factor and corresponding training scheme are introduced to control the extent of imagination in model output during inference.
VCD [165] suggests that object hallucinations derive from two primary causes, i.e. statistical bias in training corpus and strong language prior embedded in LLMs. The authors take notice of the phenomenon that when injecting noise into the image, MLLMs tend to lean towards language prior rather than the image content for response generation, leading to hallucinations. Correspondingly, this work designs an amplify-then-contrast decoding scheme to offset the false bias.
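The amplify-then-contrast idea can be sketched in a few lines: the logits conditioned on the clean image are amplified, and the logits conditioned on a noise-distorted copy are subtracted, so that tokens favored purely by the language prior are down-weighted. The toy distributions and the α value below are illustrative assumptions, not VCD's exact configuration.

```python
import math

def contrast_logits(logits_clean, logits_distorted, alpha=1.0):
    """Amplify the clean-image prediction, then subtract the prediction
    conditioned on a noise-distorted image."""
    return [(1.0 + alpha) * c - alpha * d
            for c, d in zip(logits_clean, logits_distorted)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token distributions over a 3-word vocabulary: the clean image
# supports token 0, while the noised image drifts toward token 1 (the
# language prior). Contrasting sharpens the image-grounded choice.
clean = [math.log(p) for p in (0.5, 0.3, 0.2)]
noisy = [math.log(p) for p in (0.2, 0.5, 0.3)]
probs = softmax(contrast_logits(clean, noisy, alpha=1.0))
```

In this toy run the contrasted distribution keeps token 0 (the image-grounded token) on top while suppressing token 1, which the noisy branch had promoted.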
HACL [166] investigates the joint embedding space of vision and language. Based on its observations of this space, a contrastive learning scheme is devised to pull paired cross-modal representations closer while pushing apart hallucinated text representations from non-hallucinated ones.
Post-correction. Different from the previous paradigms, post-correction mitigates hallucinations in a post-remedy way, correcting hallucinations after output generation. Woodpecker [77] is a training-free general framework for hallucination correction. Specifically, the method incorporates expert models to supplement contextual information about the image and crafts a pipeline to correct hallucinations step by step. The method is interpretable in that the intermediate results of each step can be checked, and objects are grounded in the image. Another method, LURE [167], trains a specialized revisor to mask objects with high uncertainty in the descriptions and then regenerate the responses.
7 EXTENDED TECHNIQUES
7.1 Multimodal In-Context Learning
ICL is one of the important emergent abilities of LLMs. There are two good traits of ICL: (1) Different from traditional supervised learning paradigms that learn implicit patterns from abundant data, the crux of ICL is to learn from analogy [168]. Specifically, in the ICL setting, LLMs learn from a few examples along with an optional instruction and extrapolate to new questions, thereby solving complex and unseen tasks in a few-shot manner [22], [169], [170]. (2) ICL is usually implemented in a training-free manner [168] and thus can be flexibly integrated into different frameworks at the inference stage. A closely related technique to ICL is instruction tuning (see §3.2), which has been shown empirically to enhance the ICL ability [19].

TABLE 9: A simplified example of the template to structure an M-ICL query, adapted from [98]:

<BOS> Below are some examples and an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: {instruction}
### Image: <image>
### Response: {response}
### Instruction: {instruction}
### Image: <image>
### Response: {response}
- - - - - - - - - - - - - - - - - - - -
### Instruction: {instruction}
### Image: <image>
### Response:

For illustration, two in-context examples and a query are listed, divided by a dashed line. {instruction} and {response} are texts from the data sample. <image> is a placeholder representing the multimodal input (an image in this case). <BOS> and <EOS> are tokens denoting the start and the end of the input to the LLM, respectively.
In the context of MLLM, ICL has been extended to more modalities, leading to Multimodal ICL (M-ICL). Building upon the setting in §3.2, at inference time, M-ICL can be implemented by adding a demonstration set, i.e. a set of in-context samples, to the original sample. In this case, the template can be extended as illustrated in Table 9. Note that we list two in-context examples for illustration, but the number and the ordering of examples can be flexibly adjusted. In fact, models are commonly sensitive to the arrangement of demonstrations [168], [171].
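Programmatically, assembling such a query amounts to concatenating the demonstrations before the test sample. The sketch below follows the spirit of the Table 9 template; the special tokens and the exact separator format are assumptions that vary across models.

```python
def build_micl_prompt(demos, query_instruction, image_token="<image>"):
    """Assemble an M-ICL query: a preamble, N (instruction, response)
    demonstrations each paired with an image placeholder, then the query
    left open for the model to complete."""
    parts = ["<BOS> Below are some examples and an instruction that "
             "describes a task. Write a response that appropriately "
             "completes the request."]
    for instruction, response in demos:
        parts.append(f"### Instruction: {instruction}")
        parts.append(f"### Image: {image_token}")
        parts.append(f"### Response: {response}")
    parts.append(f"### Instruction: {query_instruction}")
    parts.append(f"### Image: {image_token}")
    parts.append("### Response:")
    return "\n".join(parts)
```

At inference, each `<image>` placeholder would be replaced by the corresponding image's visual tokens, and reordering the `demos` list directly changes the demonstration arrangement the model sees.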
7.1.1 Improvement on ICL capabilities
Recently, a growing amount of work has focused on enhancing ICL performance under various scenarios. In this section, we trace the development of this field and summarize some relevant works.
MIMIC-IT [172] combines in-context learning with instruction tuning by building an instruction dataset formatted with multimodal context. The model instruction-tuned on the introduced dataset shows improved few-shot performance on the captioning task. Emu [173] extends the idea of Flamingo [74] by introducing extra modalities in model generation along with the corresponding training corpus. Aided by the introduced vision decoder, i.e. Stable Diffusion, the model learns from extra vision supervision and supports more flexibility in output format and in-context reasoning. Specifically, apart from answering in pure text, the model can also give responses in the form of images. Sheng et al. [174] adopt a similar idea and try to extend the output modalities to both text and image. Instead of adopting a specialized encoder for images, the work adopts a unified quantization scheme with a shared embedding layer.
Some other works explore improving few-shot learning performance under specific settings. Link-context learning [175] focuses on strengthening the causal link between image-label pairs and casts a contrastive training scheme by formulating positive and negative image-description pairs. MMICL [176] aims to augment the capability of reasoning with multiple related images. To strengthen the link between image and text, the work proposes a context scheme to transform interleaved image-text data into a uniform format. Jeong [177] finds that when a small fraction of incoherent images/text is inserted as noise, MLLMs can be misled into giving responses inconsistent with the context. Based on this observation, the work proposes a pre-filtering method to remove irrelevant context and facilitate more coherent responses.
7.1.2 Applications
In terms of applications in multimodality, M-ICL is mainly used in two scenarios: (1) solving various visual reasoning tasks [22], [74], [178], [179], [180] and (2) teaching LLMs to use external tools [169], [170], [181]. The former usually involves learning from a few task-specific examples and generalizing to a new but similar question. From the information provided in instructions and demonstrations, LLMs get a sense of what the task is doing and what the output template is, and finally generate expected answers. In contrast, examples of tool usage are more fine-grained. They typically comprise a chain of steps that can be sequentially executed to fulfill the task. Thus, the second scenario is closely related to CoT (see §7.2).
7.2 Multimodal Chain of Thought
As the pioneer work [8] points out, CoT is “a series of intermediate reasoning steps”, which has been proven to be effective in complex reasoning tasks [8], [182], [183]. The main idea of CoT is to prompt LLMs to output not only the final answer but also the reasoning process that leads to the answer, resembling the cognitive process of humans.
Inspired by the success in NLP, multiple works [184], [185], [186], [187] have been proposed to extend the unimodal CoT to Multimodal CoT (M-CoT). We first introduce different paradigms for acquiring the M-CoT ability (§7.2.1). Then, we delineate more specific aspects of M-CoT, including the chain configuration (§7.2.2) and the pattern (§7.2.3).
7.2.1 Learning Paradigms
The learning paradigm is also an aspect worth investigating. There are broadly three ways to acquire the M-CoT ability, i.e. through finetuning, training-free few-shot learning, and training-free zero-shot learning. The sample size requirements of the three ways are in descending order.
Intuitively, the finetuning approach often involves curating specific datasets for M-CoT learning. For example, Lu et al. [116] construct a scientific question-answering dataset ScienceQA with lectures and explanations, which can serve as sources of learning CoT reasoning, and finetune the model on this proposed dataset. Multimodal-CoT [185] also uses the ScienceQA benchmark but generates the output in a two-step fashion, i.e. the rationale (chain of reasoning steps) and the final answer based on the rationale. CoT-PT [187] learns an implicit chain of reasoning through a combination of prompt tuning and step-specific visual bias.
Compared with finetuning, few/zero-shot learning is more computationally efficient. The main difference between them is that few-shot learning typically requires hand-crafting some in-context examples so that the model can learn to reason step by step more easily. In contrast, zero-shot learning does not require any specific example for CoT learning. In this case, models learn to use the embedded knowledge and the reasoning ability without explicit guidance, by prompting with designed instructions like "Let's think frame by frame" or "What happened between these two keyframes" [184], [186]. Similarly, some works [22], [188] prompt models with descriptions of the task and tool usage to decompose complex tasks into sub-tasks.
7.2.2 Chain Configuration
Structure and length are two critical aspects of reasoning chains. In terms of structure, current methods can be divided into single-chain and tree-shaped methods. Reasoning with a single chain is a paradigm widely used in various methods [116], [185]. Specifically, the step-by-step reasoning process forms a single question-rationale-answer chain. Recently, some methods have explored using a more complicated scheme, i.e. a tree-shaped chain, for reasoning. Specifically, DDCoT [189] breaks down a question into multiple sub-questions, each of which is solved by the LLM itself or by visual experts to generate rationales. The LLM then aggregates and reasons over the rationales to form the final answer. With respect to chain length, configurations can be categorized into adaptive and pre-defined. The former requires LLMs to decide on their own when to halt the reasoning chains [22], [116], [169], [170], [185], [188], while the latter stops the chains at a pre-defined length [79], [184], [186], [187].
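The tree-shaped scheme reduces to a decompose-answer-aggregate skeleton. The sketch below is purely illustrative: the three callables stand in for LLM and visual-expert calls, and the names are our own, not DDCoT's API.

```python
def tree_shaped_cot(question, decompose, answer_subq, aggregate):
    """Decompose the question into sub-questions, collect a rationale for
    each (from the LLM itself or a visual expert), then let the LLM
    aggregate the rationales into the final answer."""
    sub_questions = decompose(question)
    rationales = [answer_subq(sq) for sq in sub_questions]
    return aggregate(question, rationales)

# Toy run with stand-in callables replacing real model calls.
answer = tree_shaped_cot(
    "Is the animal on the sofa a pet?",
    decompose=lambda q: ["What animal is on the sofa?",
                         "Is that animal a pet?"],
    answer_subq=lambda sq: "a cat" if "What animal" in sq else "cats are pets",
    aggregate=lambda q, rs: "yes" if "pets" in " ".join(rs) else "no",
)
```

In an adaptive-length configuration, `decompose` (or a halting check inside the loop) would also decide when no further sub-questions are needed.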
7.2.3 Generation Patterns
How the chain is constructed is a question worth studying. We summarize current works into (1) an infilling-based pattern and (2) a predicting-based pattern. Specifically, the infilling-based pattern demands deducing steps between the surrounding context (previous and following steps) to fill the logical gaps [184], [186]. In contrast, the predicting-based pattern requires extending the reasoning chain given conditions such as instructions and previous reasoning history [22], [116], [169], [170], [185], [188]. The two types of patterns share the requirement that the generated steps should be consistent and correct.
7.3 LLM-Aided Visual Reasoning
7.3.1 Introduction
Inspired by the success of tool-augmented LLMs [190], [191], [192], [193], some research has explored the possibility of invoking external tools [22], [107], [169], [170] or vision foundation models [22], [79], [80], [188], [194], [195], [196] for visual reasoning tasks. Taking LLMs as helpers with different roles, these works build task-specific [79], [197], [198] or general-purpose [22], [169], [170], [181], [188] visual reasoning systems.
Compared with conventional visual reasoning models [199], [200], [201], these works manifest several good traits: (1) Strong generalization abilities. Equipped with rich open-world knowledge learned from large-scale pretraining, these systems can easily generalize to unseen objects or concepts with remarkable zero/few-shot performance [169], [170], [195], [197], [198], [202]. (2) Emergent abilities. Aided by strong reasoning abilities of LLMs, these systems can perform complex tasks. For example, given an image, MMREACT [22] can interpret the meaning beneath the surface, such as explaining why a meme is funny. (3) Better interactivity and control. Traditional models typically allow a limited set of control mechanisms and often entail expensive curated datasets [203], [204]. In contrast, LLM-based systems have the ability to make fine control in a user-friendly interface (e.g. click and natural language queries) [79].
For this part, we start with introducing different training paradigms employed in the construction of LLM-Aided Visual Reasoning systems (§7.3.2). Then, we delve into the primary roles that LLMs play within these systems (§7.3.3).
7.3.2 Training Paradigms
According to the training paradigm, LLM-Aided Visual Reasoning systems can be divided into two types, i.e. training-free and finetuning.
Training-free. With abundant prior knowledge stored in pre-trained LLMs, an intuitive and simple way is to freeze the pre-trained models and directly prompt LLMs to fulfill various needs. According to the setting, the reasoning systems can be further categorized into few-shot models [22], [169], [170], [181] and zero-shot models [79], [197]. The few-shot models entail a few hand-crafted in-context samples (see §7.1) to guide LLMs to generate a program or a sequence of execution steps. These programs or execution steps serve as instructions for corresponding foundation models or external tools/modules. The zero-shot models take a step further by directly utilizing LLMs' linguistic/semantic knowledge or reasoning abilities. For example, PointCLIP V2 [197] prompts GPT-3 to generate descriptions with 3D-related semantics for better alignment with corresponding images. In CAT [79], LLMs are instructed to refine captions according to user queries.
Finetuning. Some works adopt further finetuning to improve the planning abilities with respect to tool usage [107] or to improve localization capabilities [142], [205] of the system. For example, GPT4Tools [107] introduces the instruction-tuning approach (see §3.2). Accordingly, a new tool-related instruction dataset is collected and used to finetune the model.
7.3.3 Functions
In order to further inspect exactly what roles LLMs play in LLM-Aided Visual Reasoning systems, existing related works are divided into three types:
• LLM as a Controller
• LLM as a Decision Maker
• LLM as a Semantics Refiner
The first two roles are related to CoT (see §7.2). It is frequently involved because complex tasks need to be broken down into intermediate simpler steps. When LLMs act as controllers, the systems often finish the task in a single round, while multiple rounds are more common in the case of decision makers. We delineate how LLMs serve these roles in the following parts.
LLM as a Controller. In this case, LLMs act as a central controller that (1) breaks down a complex task into simpler sub-tasks/steps and (2) assigns these tasks to appropriate tools/modules. The first step is often finished by leveraging the CoT ability of LLMs. Specifically, LLMs are prompted explicitly to output task plans [181] or, more directly, the modules to call [107], [169], [170]. For example, VisProg [170] prompts GPT-3 to output a visual program, where each program line invokes a module to perform a sub-task. In addition, LLMs are required to output argument names for the module inputs. To handle these complex requirements, some hand-crafted in-context examples are used as references [169], [170], [181]. This is closely related to the optimization of reasoning chains (see §7.2), or more specifically, the least-to-most prompting technique [206]. In this way, complex problems are broken down into sub-problems that are solved sequentially.
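The controller pattern amounts to an interpreter loop over an LLM-emitted program, with each line dispatched to a registered module. The sketch below is a toy illustration of this idea, not VisProg's actual program syntax or module set; the module names, the stand-in implementations, and the simple parser are all assumptions.

```python
# Stand-in modules: a real system would call a detector, a VQA model, etc.
MODULES = {
    "LOC": lambda image, obj: [f"box:{obj}"],   # pretend detector: one box
    "COUNT": lambda boxes: len(boxes),          # count detected boxes
}

def run_program(program, image):
    """Execute an LLM-emitted, line-by-line visual program.

    Each line has the form TARGET = MODULE(arg1, arg2, ...); arguments
    naming earlier results (or IMAGE) are looked up in the environment,
    anything else is passed through as a literal string.
    """
    env = {"IMAGE": image}
    for line in program.strip().splitlines():
        target, call = [s.strip() for s in line.split("=", 1)]
        name, argstr = call.split("(", 1)
        args = [env.get(a.strip(), a.strip())
                for a in argstr.rstrip(")").split(",")]
        env[target] = MODULES[name.strip()](*args)
    return env

# A two-line program the controller LLM might emit for the question
# "How many dogs are in the image?"
program = """BOXES = LOC(IMAGE, dog)
ANSWER = COUNT(BOXES)"""
result = run_program(program, "image_0")
```

Because every intermediate value stays in `env`, each step of the plan can be inspected after execution, which is part of what makes the controller paradigm interpretable.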
LLM as a Decision Maker. In this case, complex tasks are solved in a multi-round manner, often in an iterative way [195]. Decision makers often fulfill the following responsibilities: (1) summarize the current context and the history information, and decide if the information available at the current step is sufficient to answer the question or complete the task; (2) organize and summarize the answer to present it in a user-friendly way.
LLM as a Semantics Refiner. When LLM is used as a Semantics Refiner, researchers mainly utilize its rich linguistics and semantics knowledge. Specifically, LLMs are often instructed to integrate information into consistent and fluent natural language sentences [202] or generate texts according to different specific needs [79], [197], [198].
8 CHALLENGES AND FUTURE DIRECTIONS
The development of MLLMs is still in a rudimentary stage and thus leaves much room for improvement, which we summarize below:
• Developing embodied agents based on MLLMs is a heated topic. It would be meaningful to develop such agents that can interact with the real world. Such endeavors require models with critical capabilities, including perception, reasoning, planning, and execution.
• Safety issues. Similar to LLMs, MLLMs can be vulnerable to crafted attacks [177], [207], [208]. In other words, MLLMs can be misled to output biased or undesirable responses. Thus, improving model safety will be an important topic.
9 CONCLUSION
In this paper, we perform a survey of the existing MLLM literature and offer a broad view of its main directions, including the basic recipe and related extensions. Moreover, we underscore the current research gaps that need to be filled and point out some promising research directions. We hope this survey can offer readers a clear picture of the current progress of MLLM and inspire more works.

