A Survey on Multimodal Large Language Models
Shukang Yin*, Chaoyou Fu*†, Sirui Zhao*, Ke Li, Xing Sun, Tong Xu, and Enhong Chen, Fellow, IEEE
Abstract—Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a new rising research hotspot, which use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Index Terms—Multimodal Large Language Model, Vision Language Model, Large Language Model.
1 INTRODUCTION
Recent years have seen remarkable progress in Large Language Models (LLMs). By scaling up data and model size, these LLMs exhibit extraordinary emergent abilities, typically including instruction following [5], [6], In-Context Learning (ICL) [7], and Chain of Thought (CoT) [8]. Although LLMs have demonstrated surprising zero/few-shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently "blind" to vision since they can only understand discrete text. Concurrently, Large Vision Models (LVMs) can see clearly [9], [10], [11], [12], but commonly lag in reasoning.
In light of this complementarity, LLMs and LVMs run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to the LLM-based model with the ability to receive, reason about, and output multimodal information. Prior to MLLM, there have been a lot of works devoted to multimodality, which can be divided into discriminative [13], [14], [15] and generative [16], [17], [18] paradigms. CLIP [13], as a representative of the former, projects visual and textual information into a unified representation space, building a bridge for downstream multimodal tasks. In contrast, OFA [16] is a representative of the latter, which unifies multimodal tasks in a sequence-to-sequence manner. MLLM can be classified as the latter according to the sequence operation, but it manifests two representative traits compared with its traditional counterparts: (1) MLLM is based on LLMs with billion-scale parameters, which are not available in previous models. (2) MLLM uses new training paradigms to unleash its full potential, such as multimodal instruction tuning [19], [20] to encourage the model to follow new instructions. Armed with the two traits, MLLM exhibits new capabilities, such as writing website code based on images [21], understanding the deep meaning of a meme [22], and OCR-free math reasoning [23].
Ever since the release of GPT-4 [3], there has been a research frenzy over MLLMs because of the amazing multimodal examples it shows. Rapid development is fueled by efforts from both academia and industry. Preliminary research on MLLMs focuses on text content generation grounded in text prompts and image [20], [24]/video [25], [26]/audio [27]. Subsequent works have expanded the capabilities or the usage scenarios, including: (1) Better granularity support. Finer control over user prompts is developed to support specific regions through boxes [28] or a certain object through a click [29]. (2) Enhanced support for input and output modalities [30], [31], such as image, video, audio, and point cloud. Besides input, projects like NExT-GPT [32] further support output in different modalities. (3) Improved language support. Efforts have been made to extend the success of MLLMs to other languages (e.g. Chinese) with relatively limited training corpus [33], [34]. (4) Extension to more realms and usage scenarios. Some studies transfer the strong capabilities of MLLMs to other domains such as medical image understanding [35], [36], [37] and document parsing [38], [39], [40]. Moreover, multimodal agents are developed to assist in real-world interaction, e.g. embodied agents [41], [42] and GUI agents [43], [44], [45]. An MLLM timeline is illustrated in Fig. 1.
In view of such rapid progress and the promising results of this field, we write this survey to provide researchers with a grasp of the basic idea, main methods, and current progress of MLLMs. Note that we mainly focus on visual and language modalities, but also include works involving other modalities like video and audio. Specifically, we cover the most important aspects of MLLMs with corresponding summaries and open a GitHub page that is updated in real time. To the best of our knowledge, this is the first survey on MLLM.
Fig. 1: A timeline of representative MLLMs. We are witnessing rapid growth in this field. More works can be found in our released GitHub page, which is updated daily.
The following parts of the survey are structured as follows: the survey starts with a comprehensive review of the essential aspects of MLLMs, including (1) Mainstream architectures (§2); (2) A full recipe of training strategy and data (§3); (3) Common practices of performance evaluation (§4). Then, we delve into a deeper discussion of some important topics about MLLMs, each focusing on a main problem: (1) What aspects can be further improved or extended (§5)? (2) How to relieve the multimodal hallucination issue (§6)? The survey continues with the introduction of three key techniques (§7), each specialized in a specific scenario: M-ICL (§7.1) is an effective technique commonly used at the inference stage to boost few-shot performance. Another important technique is M-CoT (§7.2), which is typically used in complex reasoning tasks. Afterward, we delineate a general idea to develop LLM-based systems to solve composite reasoning tasks or to address common user queries (§7.3). Finally, we finish our survey with a summary and potential research directions.
2 ARCHITECTURE
A typical MLLM can be abstracted into three modules, i.e. a pre-trained modality encoder, a pre-trained LLM, and a modality interface to connect them. Drawing an analogy to humans, modality encoders such as image/audio encoders are human eyes/ears that receive and pre-process optical/acoustic signals, while LLMs are like human brains that understand and reason with the processed signals. In between, the modality interface serves to align different modalities. Some MLLMs also include a generator to output other modalities apart from text. A diagram of the architecture is plotted in Fig. 2. In this section, we introduce each module in sequence.
2.1 Modality encoder
The encoders compress raw information, such as images or audio, into a more compact representation. Rather than training from scratch, a common approach is to use a pretrained encoder that has been aligned to other modalities. For example, CLIP [13] incorporates a visual encoder semantically aligned with the text through large-scale pretraining on image-text pairs. Therefore, it is easier to use such initially pre-aligned encoders to align with LLMs through alignment pre-training (see §3.1).
The series of commonly used image encoders are summarized in Table 1. Apart from vanilla CLIP image encoders [13], some works also explore using other variants. For example, MiniGPT-4 [21] adopts an EVA-CLIP [47], [48] (ViT-G/14) encoder, which is trained with improved training techniques. In contrast, Osprey [29] introduces a convolution-based ConvNext-L encoder [46] to utilize higher resolution and multi-level features. Some works also explore encoder-free architecture. For instance, the image patches of Fuyu-8b [49] are directly projected before sending to LLMs. Thus, the model naturally supports flexible image resolution input.
TABLE 1: A summary of commonly used image encoders.
| Variants | Pretraining Corpus | Resolution | Samples (B) | Parameter Size (M) |
|---|---|---|---|---|
| OpenCLIP-ConvNext-L [46] | LAION-2B | 320 | 29 | 197.4 |
| CLIP-ViT-L/14 [13] | OpenAI's WIT | 224/336 | 13 | 304.0 |
| EVA-CLIP-ViT-G/14 [47] | LAION-2B, COYO-700M | 224 | 11 | 1000.0 |
| OpenCLIP-ViT-G/14 [46] | LAION-2B | 224 | 34 | 1012.7 |
| OpenCLIP-ViT-bigG/14 [46] | LAION-2B | 224 | 34 | 1844.9 |
Fig. 2: An illustration of the typical MLLM architecture. It includes an encoder, a connector, and an LLM. An optional generator can be attached to the LLM to generate more modalities besides text. The encoder takes in images, audio, or videos and outputs features, which are processed by the connector so that the LLM can better understand them. There are broadly three types of connectors: projection-based, query-based, and fusion-based connectors. The former two types adopt token-level fusion, processing features into tokens to be sent along with text tokens, while the last type enables feature-level fusion inside the LLM.
When choosing encoders, one often considers factors like resolution, parameter size, and pre-training corpus. Notably, many works have empirically verified that using higher resolution can achieve remarkable performance gains [34], [50], [51], [52]. The approaches for scaling up input resolution can be categorized into direct scaling and patch-division methods. The direct scaling way inputs images of higher resolutions to the encoder, which often involves further tuning the encoder [34] or replacing it with a pre-trained encoder of higher resolution [50]. Similarly, CogAgent [44] uses a dual-encoder mechanism, where two encoders process high- and low-resolution images, respectively. High-resolution features are injected into the low-resolution branch through cross-attention. Patch-division methods cut a high-resolution image into patches and reuse the low-resolution encoder. For example, Monkey [51] and SPHINX [53] divide a large image into smaller patches and send the sub-images together with a downsampled high-resolution image to the image encoder, where the sub-images and the low-resolution image capture local and global features, respectively. In contrast, parameter size and training data composition are of less importance compared with input resolution, as found by empirical studies [52].
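To make the patch-division scheme concrete, the sketch below cuts a high-resolution image into encoder-sized tiles plus a downsampled global view. It is only a minimal illustration under an assumed tile size, not the exact cropping logic of Monkey or SPHINX.

```python
# A minimal sketch of patch-division pre-processing (assumed tile size of 448):
# the high-resolution image is split into sub-images at the encoder's native
# resolution, plus one downsampled global view for coarse context.
from PIL import Image

def split_image(image: Image.Image, tile_size: int = 448):
    """Return (sub_images, global_image) for a high-resolution input."""
    w, h = image.size
    cols, rows = max(w // tile_size, 1), max(h // tile_size, 1)
    # resize so the tile grid divides the image exactly
    image = image.resize((cols * tile_size, rows * tile_size))
    sub_images = [
        image.crop((c * tile_size, r * tile_size,
                    (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows) for c in range(cols)
    ]
    global_image = image.resize((tile_size, tile_size))  # low-resolution view
    return sub_images, global_image
```

The sub-images would then be encoded independently and their tokens concatenated with those of the global view before being passed to the connector.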
Similar encoders are also available for other modalities. For example, Pengi [27] uses CLAP [54] model as the audio encoder. ImageBind-LLM [30] uses the ImageBind [55] encoder, which supports encoding image, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data. Equipped with the strong encoder, ImageBind-LLM can respond to the input of multiple modalities.
2.2 Pre-trained LLM
Instead of training an LLM from scratch, it is more efficient and practical to start with a pre-trained one. Through tremendous pre-training on web corpus, LLMs have been embedded with rich world knowledge, and demonstrate strong generalization and reasoning capabilities.
We summarize the commonly used and publicly available LLMs in Table 2. Notably, most LLMs fall into the causal decoder category, following GPT-3 [7]. Among them, the Flan-T5 [56] series are relatively early LLMs used in works like BLIP-2 [59] and InstructBLIP [60]. The LLaMA series [5], [57] and the Vicuna family [4] are representative open-sourced LLMs that have attracted much academic attention. Since the two LLMs are predominantly pre-trained on English corpora, they are limited in multi-language support, such as Chinese. In contrast, Qwen [58] is a bilingual LLM that supports Chinese and English well.
It should be noted that scaling up the parameter size of LLMs also brings additional gains, similar to the case of increasing input resolution. Specifically, Liu et al. [50], [61] find that simply scaling up the LLM from 7B to 13B brings comprehensive improvement on various benchmarks. Furthermore, when using a 34B LLM, the model shows emergent zero-shot Chinese capability, given that only English multimodal data are used during training. Lu et al. [62] observe a similar phenomenon by scaling up LLMs from 13B to 35B and 65B/70B, where the larger model size brings consistent gains on benchmarks specifically designed for MLLMs. There are also works that use smaller LLMs to facilitate deployment on mobile devices. For example, the MobileVLM series [63], [64] use downscaled LLaMA [5] (termed MobileLLaMA 1.4B/2.7B), enabling efficient inference on mobile processors.
Recently, explorations of Mixture of Experts (MoE) architecture for LLMs have garnered rising attention [65], [66], [67]. Compared with dense models, the sparse architecture enables scaling up total parameter size without increasing computational cost, by selective activation of the parameters. Empirically, MM1 [52] and MoE-LLaVA [68] find that MoE implementation achieves better performance than the dense counterpart on almost all the benchmarks.
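To illustrate the sparse-activation idea, the toy module below routes each token to its top-k experts so that only a fraction of the total parameters is used per token. The expert sizes, gating scheme, and top-k value are illustrative assumptions rather than the MM1 or MoE-LLaVA design.

```python
# A toy sketch of top-k MoE routing: the gate scores all experts, but each
# token only runs through its k highest-scoring experts.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                       # x: (num_tokens, dim)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities
        topv, topi = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # combine the selected experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```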
TABLE 2: A summary of commonly used open-sourced LLMs. en, zh, fr, and de stand for English, Chinese, French, and German, respectively.
| Model | Release Date | Pre-train Data Scale | Parameter Size (B) | Language Support | Architecture |
|---|---|---|---|---|---|
| Flan-T5-XL/XXL [56] | Oct-2022 | - | 3/11 | en, fr, de | Encoder-Decoder |
| LLaMA [5] | Feb-2023 | 1.4T tokens | 7/13/33/65 | en | Causal Decoder |
| Vicuna [4] | Mar-2023 | 1.4T tokens | 7/13/33 | en | Causal Decoder |
| LLaMA-2 [57] | Jul-2023 | 2T tokens | 7/13/70 | en | Causal Decoder |
| Qwen [58] | Sep-2023 | 3T tokens | 1.8/7/14/72 | en, zh | Causal Decoder |
2.3 Modality interface
Since LLMs can only perceive text, bridging the gap between natural language and other modalities is necessary. However, it would be costly to train a large multimodal model in an end-to-end manner. A more practical way is to introduce a learnable connector between the pre-trained visual encoder and LLM. The other approach is to translate images into languages with the help of expert models, and then send the language to LLM.
Learnable Connector. It is responsible for bridging the gap between different modalities. Specifically, the module projects information into a space that the LLM can understand efficiently. Based on how multimodal information is fused, there are broadly two ways to implement such interfaces, i.e. token-level and feature-level fusion.
For token-level fusion, features output from encoders are transformed into tokens and concatenated with text tokens before being sent into LLMs. A common and feasible solution is to leverage a group of learnable query tokens to extract information in a query-based manner [69], which was first implemented in BLIP-2 [59] and subsequently inherited by a variety of works [26], [60], [70]. Such Q-Former-style approaches compress visual tokens into a smaller number of representation vectors. In contrast, some methods simply use an MLP-based interface to bridge the modality gap [20], [37], [71], [72]. For example, the LLaVA series adopts a one/two-layer MLP [20], [50] to project visual tokens and align the feature dimension with word embeddings.
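A minimal sketch of token-level fusion with an MLP connector is given below. The dimensions, module names, and two-layer design are assumptions for illustration in the spirit of LLaVA, not the released code.

```python
# A minimal sketch of an MLP connector: visual features are projected into the
# LLM embedding space and prepended to the text token embeddings.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, visual_feats):            # (B, num_patches, vision_dim)
        return self.proj(visual_feats)          # (B, num_patches, llm_dim)

def build_llm_inputs(visual_feats, text_embeds, projector):
    """Concatenate projected visual tokens with text token embeddings."""
    visual_tokens = projector(visual_feats)                 # (B, P, llm_dim)
    return torch.cat([visual_tokens, text_embeds], dim=1)   # (B, P + T, llm_dim)
```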
On a related note, MM1 [52] has ablated design choices for the connector and found that for token-level fusion, the type of modality adapter is far less important than the number of visual tokens and the input resolution. Nevertheless, Zeng et al. [73] compare the performance of token- and feature-level fusion, and empirically reveal that the token-level fusion variant performs better on VQA benchmarks. Regarding the performance gap, the authors suggest that cross-attention models might require a more complicated hyper-parameter searching process to achieve comparable performance.
As another line, feature-level fusion inserts extra modules that enable deep interaction and fusion between text features and visual features. For example, Flamingo [74] inserts extra cross-attention layers between frozen Transformer layers of LLMs, thereby augmenting language features with external visual cues. Similarly, CogVLM [75] plugs in a visual expert module in each Transformer layer to enable dual interaction and fusion between vision and language features. For better performance, the QKV weight matrix of the introduced module is initialized from the pre-trained LLM. Similarly, LLaMA-Adapter [76] introduces learnable prompts into Transformer layers. These prompts are first embedded with visual knowledge and then concatenated with text features as prefixes.
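The sketch below illustrates the feature-level fusion idea with a single gated cross-attention block in the spirit of Flamingo. The zero-initialized tanh gate and the exact placement between frozen layers are simplified assumptions rather than the original implementation.

```python
# A simplified gated cross-attention block: text hidden states attend to visual
# features, and the gated residual starts at zero so the frozen LLM is
# undisturbed at the beginning of training.
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    def __init__(self, dim: int = 4096, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, initialized to 0

    def forward(self, text_hidden, visual_feats):
        # text_hidden: (B, T, dim), visual_feats: (B, P, dim)
        attended, _ = self.attn(self.norm(text_hidden), visual_feats, visual_feats)
        return text_hidden + torch.tanh(self.gate) * attended
```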
In terms of parameter size, learnable interfaces generally comprise a small portion compared with encoders and LLMs. Take Qwen-VL [34] as an example: the parameter size of the Q-Former is about 0.08B, accounting for less than 1% of the whole parameters, while the encoder and the LLM account for about 19.8% (1.9B) and 80.2% (7.7B), respectively.
Expert Model. Apart from the learnable interface, using expert models, such as an image captioning model, is also a feasible way to bridge the modality gap [77], [78], [79], [80]. The basic idea is to convert multimodal inputs into language without training. In this way, LLMs can understand multimodality through the converted language. For example, VideoChat-Text [25] uses pre-trained vision models to extract visual information such as actions and enriches the descriptions using a speech recognition model. Though using expert models is straightforward, it may not be as flexible as adopting a learnable interface. The conversion of foreign modalities into text would cause information loss. For example, transforming videos into textual descriptions distorts spatial-temporal relationships [25].
3 TRAINING STRATEGY AND DATA
A full-fledged MLLM undergoes three stages of training, i.e. pre-training, instruction-tuning, and alignment tuning. Each phase of training requires different types of data and fulfills different objectives. In this section, we discuss training objectives, as well as data collection and characteristics for each training stage.
3.1 Pre-training
3.1.1 Training Detail
As the first training stage, pre-training mainly aims to align different modalities and learn multimodal world knowledge. The pre-training stage generally entails large-scale text-paired data, e.g. caption data. Typically, the caption pairs describe images/audio/videos in natural language sentences.
Here, we consider a common scenario where MLLMs are trained to align vision with text. As illustrated in Table 3, given an image, the model is trained to predict the caption of the image auto-regressively, following a standard cross-entropy loss. A common approach for pre-training is to keep pre-trained modules (e.g. visual encoders and LLMs) frozen and train a learnable interface [20], [35], [72]. The idea is to align different modalities without losing pre-trained knowledge. Some methods [34], [81], [82] also unfreeze more modules (e.g. the visual encoder) to enable more trainable parameters for alignment. It should be noted that the training scheme is closely related to the data quality. For short and noisy caption data, a lower resolution (e.g. 224) can be adopted to speed up the training process, while for longer and cleaner data, it is better to utilize higher resolutions (e.g. 448 or higher) to mitigate hallucinations. Besides, ShareGPT4V [83] finds that with high-quality caption data in the pre-training stage, unlocking the vision encoder promotes better alignment.
Input: <image>
Response: {caption}
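Under this common recipe, parameter freezing can be sketched as follows. The module names are generic placeholders; only the connector receives gradients during alignment pre-training.

```python
# A minimal sketch of the alignment pre-training setup: freeze the vision
# encoder and the LLM, and optimize only the connector with the caption loss.
import torch

def build_alignment_optimizer(vision_encoder, connector, llm, lr=1e-3):
    for p in vision_encoder.parameters():
        p.requires_grad = False                 # keep the encoder frozen
    for p in llm.parameters():
        p.requires_grad = False                 # keep the LLM frozen
    trainable = [p for p in connector.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```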
3.1.2 Data
Pre-training data mainly serve two purposes, i.e. (1) aligning different modalities and (2) providing world knowledge. The pre-training corpora can be divided into coarse-grained and fine-grained data according to granularities, which we will introduce sequentially. We summarize commonly used pre-training datasets in Table 4.
Coarse-grained caption data share some typical traits in common: (1) The data volume is large since samples are generally sourced from the internet. (2) Because of the web-crawled nature, the captions are usually short and noisy since they originate from the alt-text of web images. These data can be cleaned and filtered via automatic tools, for example, using the CLIP [13] model to filter out image-text pairs whose similarities are lower than a pre-defined threshold. In what follows, we introduce some representative coarse-grained datasets.
TABLE 4: Common datasets used for pre-training.
| Dataset | Samples | Date |
|---|---|---|
| Coarse-grained Image-Text | | |
| CC-3M [84] | 3.3M | 2018 |
| CC-12M [85] | 12.4M | 2020 |
| SBU Captions [86] | 1M | 2011 |
| LAION-5B [87] | 5.9B | Mar-2022 |
| LAION-2B [87] | 2.3B | Mar-2022 |
| LAION-COCO [88] | 600M | Sep-2022 |
| COYO-700M [90] | 747M | Aug-2022 |
| Fine-grained Image-Text | | |
| ShareGPT4V-PT [83] | 1.2M | Nov-2023 |
| LVIS-Instruct4V [91] | 111K | Nov-2023 |
| ALLaVA [92] | 709K | Feb-2024 |
| Video-Text | | |
| MSR-VTT [93] | 200K | 2016 |
| Audio-Text | | |
| WavCaps [94] | 24K | Mar-2023 |
CC. CC-3M [84] is a web-scale caption dataset of 3.3M image-caption pairs, where the raw descriptions are derived from the alt-text associated with images. The authors design a complicated pipeline to clean the data: (1) For images, those with inappropriate content or aspect ratio are filtered. (2) For text, NLP tools are used to obtain text annotations, with samples filtered according to designed heuristics. (3) For image-text pairs, images are assigned labels via classifiers. If text annotations do not overlap with image labels, the corresponding samples are dropped.
CC-12M [85] is a follow-up work of CC-3M and contains 12.4M image-caption pairs. Compared with the previous work, CC-12M relaxes and simplifies the data-collection pipeline, thus collecting more data.
SBU Captions [86]. It is a captioned photo dataset containing 1M image-text pairs, with images and descriptions sourced from Flickr. Specifically, an initial set of images is acquired by querying the Flickr website with a large number of query terms. The descriptions attached to the images thus serve as captions. Then, to ensure that descriptions are relevant to the images, the retained images fulfill these requirements: (1) Descriptions of the images are of satisfactory length, decided by observation. (2) Descriptions of the images contain at least 2 words in the predefined term lists and a prepositional word (e.g. "on", "under") that generally suggests spatial relationships.
LAION. This series comprises large web-scale datasets, with images crawled from the internet and the associated alt-text as captions. To filter the image-text pairs, the following steps are performed: (1) Text with short lengths or images with too small or too big sizes are dropped. (2) Image deduplication based on URL. (3) Extract CLIP [13] embeddings for images and text, and use the embeddings to drop possibly illegal content and image-text pairs with low cosine similarity between embeddings. Here we offer a brief summary of some typical variants:
• LAION-5B [87]: It is a research-purpose dataset of 5.85B image-text pairs. The dataset is multilingual with a 2B English subset.
• LAION-COCO [88]: It contains 600M images extracted from the English subset of LAION-5B. The captions are synthetic, using BLIP [89] to generate various image captions and using CLIP [13] to pick the best fit for the image.

COYO-700M [90]. It contains 747M image-text pairs, which are extracted from Common Crawl. For data filtering, the authors design the following strategies: (1) For images, those with inappropriate size, content, format, or aspect ratio are filtered. Moreover, the images are filtered based on the pHash value to remove images overlapping with public datasets such as ImageNet and MS-COCO. (2) For text, only English text with satisfactory length, noun forms, and appropriate words is kept. Whitespace before and after the sentence is removed, and consecutive whitespace characters are replaced with a single whitespace. Moreover, text appearing more than 10 times (e.g. "image for") is dropped. (3) For image-text pairs, duplicated samples are removed based on the (image pHash, text) tuple.
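The CLIP-based similarity filtering used by such pipelines can be sketched as follows. This is a minimal illustration rather than the actual LAION or COYO tooling; the Hugging Face checkpoint name and the 0.28 threshold (roughly the value reported for LAION's English subset) are assumptions for the example.

```python
# A minimal sketch of CLIP-similarity filtering: embed each image and caption,
# then keep only pairs whose cosine similarity exceeds a chosen threshold.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_pairs(pairs, threshold=0.28):
    """pairs: list of (PIL.Image, caption) tuples; returns the kept subset."""
    kept = []
    for image, caption in pairs:
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        if (img * txt).sum(-1).item() >= threshold:  # cosine similarity
            kept.append((image, caption))
    return kept
```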
Recently, more works [83], [91], [92] have explored generating high-quality fine-grained data by prompting strong MLLMs (e.g. GPT-4V). Compared with coarse-grained data, these data generally contain longer and more accurate descriptions of the images, thus enabling finer-grained alignment between image and text modalities. However, since this approach generally requires calling commercial-use MLLMs, the cost is higher, and the data volume is relatively smaller. Notably, ShareGPT4V [83] strikes a balance by first training a captioner with 100K GPT-4V-generated data, then scaling up the data volume to 1.2M using the pre-trained captioner.

Fig. 3: Comparison of three typical learning paradigms. The image is from [19].
3.2 Instruction-tuning
3.2.1 Introduction
Instruction refers to the description of tasks. Intuitively, instruction tuning aims to teach models to better understand the instructions from users and fulfill the demanded tasks. Tuning in this way, LLMs can generalize to unseen tasks by following new instructions, thus boosting zero-shot performance. This simple yet effective idea has sparked the success of subsequent NLP works, such as ChatGPT [2], InstructGPT [95], FLAN [19], [56], and OPT-IML [96].
The comparisons between instruction tuning and related typical learning paradigms are illustrated in Fig. 3. The supervised fine-tuning approach usually requires a large amount of task-specific data to train a task-specific model. The prompting approach reduces the reliance on large-scale data and can fulfill a specialized task via prompt engineering. In such a case, though the few-shot performance has been improved, the zero-shot performance is still quite average [7]. Differently, instruction tuning learns how to generalize to unseen tasks rather than fitting specific tasks like the two counterparts. Moreover, instruction tuning is highly related to multi-task prompting [97].
In this section, we delineate the format of instruction samples, the training objectives, typical ways to gather instruction data, and corresponding commonly used datasets.
3.2.2 Training Detail
A multimodal instruction sample often includes an optional instruction and an input-output pair. The instruction is typically a natural language sentence describing the task, such as, "Describe the image in detail." The input can be an image-text pair like the VQA task [99] or only an image like the image caption task [100]. The output is the answer to the instruction conditioned on the input. The instruction template is flexible and subject to manual designs [20], [25], [98], as exemplified in Table 5. Note that the instruction template can also be generalized to the case of multi-round conversations [20], [37], [71], [98].

TABLE 5: A simplified template to structure the multimodal instruction data. <instruction> is a textual description of the task.
Formally, a multimodal instruction sample can be denoted in a triplet form, i.e. $(\mathcal{I},\mathcal{M},\mathcal{R})$, where $\mathcal{I},\mathcal{M},\mathcal{R}$ represent the instruction, the multimodal input, and the ground-truth response, respectively. The MLLM predicts an answer given the instruction and the multimodal input:
$$
\mathcal{A}=f(\mathcal{I},\mathcal{M};\theta)
$$
Here, $\mathcal{A}$ denotes the predicted answer, and $\theta$ are the parameters of the model. The training objective is typically the original auto-regressive objective used to train LLMs [20], [37], [71], [101], based on which the MLLM is encouraged to predict the next token of the response. The objective can be expressed as:
$$
\mathcal{L}(\theta)=-\sum_{i=1}^{N}\log p(\mathcal{R}_{i}|\mathcal{I},\mathcal{R}_{<i};\theta)
$$
where $N$ is the length of the ground-truth response.
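In a PyTorch-style implementation, this objective is usually realized by masking out all non-response positions so that only the response tokens $\mathcal{R}_{i}$ contribute to the cross-entropy loss. The sketch below is a minimal illustration with assumed tensor shapes, not a specific MLLM codebase.

```python
# A minimal sketch of the auto-regressive instruction-tuning loss: labels for
# the instruction/multimodal prompt are set to -100 so they are ignored.
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, input_ids, response_mask):
    """
    logits:        (B, L, V) next-token predictions from the MLLM.
    input_ids:     (B, L) token ids of the concatenated prompt and response.
    response_mask: (B, L) 1 for response tokens, 0 for instruction/input tokens.
    """
    labels = input_ids.clone()
    labels[response_mask == 0] = -100              # ignore non-response positions
    shift_logits = logits[:, :-1, :].contiguous()  # position t predicts token t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```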
3.2.3 Data Collection
Since instruction data are more flexible in formats and varied in task formulations, it is usually trickier and more costly to collect data samples. In this section, we summarize three typical ways to harvest instruction data at scale, i.e. data adaptation, self-instruction, and data mixture.
Data Adaptation. Task-specific datasets are rich sources of high-quality data. Hence, abundant works [60], [70], [76], [82], [101], [102], [103], [104] have utilized existing high-quality datasets to construct instruction-formatted datasets. Take the transformation of VQA datasets as an example: the original sample is an input-output pair, where the input comprises an image and a natural language question, and the output is the textual answer to the question conditioned on the image. The input-output pairs of these datasets could naturally comprise the multimodal input and response of the instruction sample (see §3.2.2). The instructions, i.e. the descriptions of the tasks, can either derive from manual design or from semi-automatic generation aided by GPT. Specifically, some works [21], [35], [60], [70], [102], [105] hand-craft a pool of candidate instructions and sample one of them during training. We offer an example of instruction templates for the VQA datasets as shown in Table 6. Other works manually design some seed instructions and use these to prompt GPT to generate more [25], [82], [98].
Note that since the answers of existing VQA and caption datasets are usually concise, directly using these datasets for instruction tuning may limit the output length of MLLMs. There are two common strategies to tackle this problem. The first one is to specify the desired length explicitly in the instruction. For example, ChatBridge [104] explicitly declares "short" and "brief" for short-answer data, as well as "a sentence" and "single sentence" for conventional coarse-grained caption data. The second one is to extend the length of existing answers [105]. For example, $\mathrm{M}^{3}\mathrm{IT}$ [105] proposes to rephrase the original answer by prompting ChatGPT with the original question, answer, and contextual information of the image (e.g. caption and OCR).
TABLE 6: Instruction templates for VQA datasets, cited from [60]. <image> and {Question} are the image and the question in the original VQA datasets, respectively.

TABLE 7: A summary of popular datasets generated by self-instruction. For input/output modalities, I: Image, T: Text, V: Video, A: Audio. For data composition, M-T and S-T denote multi-turn and single-turn, respectively.

| Dataset | Sample | Modality | Source | Composition |
|---|---|---|---|---|
| LLaVA-Instruct | 158K | I+T→T | MS-COCO | 23K caption + 58K M-T QA + 77K reasoning |
| LVIS-Instruct | 220K | I+T→T | LVIS | 110K caption + 110K M-T QA |
| ALLaVA | 1.4M | I+T→T | VFlan, LAION | 709K caption + 709K S-T QA |
| Video-ChatGPT | 100K | V+T→T | ActivityNet | 7K description + 4K M-T QA |
| VideoChat | 11K | V+T→T | WebVid | description + summarization + creation |
| Clotho-Detail | 3.9K | A+T→T | Clotho | caption |
Self-Instruction. Although existing multi-task datasets can contribute a rich source of data, they usually do not meet human needs well in real-world scenarios, such as multiple rounds of conversations. To tackle this issue, some works collect samples through self-instruction [106], which utilizes LLMs to generate textual instruction-following data using a few hand-annotated samples. Specifically, some instruction-following samples are hand-crafted as demonstrations, after which ChatGPT/GPT-4 is prompted to generate more instruction samples with the demonstrations as guidance. LLaVA [20] extends the approach to the multimodal field by translating images into text of captions and bounding boxes, and prompting text-only GPT-4 to generate new data with the guidance of requirements and demonstrations. In this way, a multimodal instruction dataset is constructed, called LLaVA-Instruct-150k. Following this idea, subsequent works such as MiniGPT-4 [21], ChatBridge [104], GPT4Tools [107], and DetGPT [72] develop different datasets catering to different needs. Recently, with the release of the more powerful multimodal model GPT-4V, many works have adopted GPT-4V to generate data of higher quality, as exemplified by LVIS-Instruct4V [91] and ALLaVA [92]. We summarize the popular datasets generated through self-instruction in Table 7.
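A rough sketch of this self-instruction recipe is given below. The system prompt wording, the model name, and the helper function are hypothetical stand-ins rather than the original LLaVA pipeline; only the overall flow (render the image as text, provide hand-written demonstrations, ask a text-only GPT-4 for new samples) follows the description above.

```python
# A rough sketch of self-instruction: the image is represented by captions and
# bounding boxes, and a text-only model is prompted with few-shot demonstrations
# to generate a new multimodal instruction sample.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (  # hypothetical prompt, not the original one
    "You are an AI visual assistant. You will be given image captions and "
    "object bounding boxes. Generate a multi-turn conversation about the image."
)

def generate_instruction_sample(captions, boxes, demonstrations, model="gpt-4"):
    context = "Captions:\n" + "\n".join(captions) + "\nBoxes:\n" + "\n".join(boxes)
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for demo_input, demo_output in demonstrations:   # few-shot guidance
        messages.append({"role": "user", "content": demo_input})
        messages.append({"role": "assistant", "content": demo_output})
    messages.append({"role": "user", "content": context})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```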
Data Mixture. Apart from the multimodal instruction data, language-only user-assistant conversation data can also be used to improve conversational proficiencies and instruction-following abilities [81], [98], [101], [103]. LaVIN [101] directly constructs a minibatch by randomly sampling from both language-only and multimodal data. MultiInstruct [102] probes different strategies for training with a fusion of single-modal and multimodal data, including mixed instruction tuning (combine both types of data and randomly shuffle) and sequential instruction tuning (text data followed by multimodal data).
3.2.4 Data Quality
Recent research has revealed that the data quality of instruction-tuning samples is no less important than quantity. Lynx [73] finds that models pre-trained on large-scale but noisy image-text pairs do not perform as well as models pre-trained with smaller but cleaner datasets. Similarly, Wei et al. [108] find that less instruction-tuning data with higher quality can achieve better performance. For data filtering, the work proposes some metrics to evaluate data quality and, correspondingly, a method to automatically filter out inferior vision-language data. Here we discuss two important aspects regarding data quality.
Prompt Diversity. The diversity of instructions has been found to be critical for model performance. Lynx [73] empirically verifies that diverse prompts help improve model performance and generalization ability.
Task Coverage. In terms of tasks involved in training data, Du et al. [109] perform an empirical study and find that the visual reasoning task is superior to captioning and QA tasks for boosting model performance. Moreover, the study suggests that enhancing the complexity of instructions might be more beneficial than increasing task diversity and incorporating fine-grained spatial annotations.
3.3 Alignment tuning
3.3.1 Introduction
Alignment tuning is more often used in scenarios where models need to be aligned with specific human preferences, e.g. responses with fewer hallucinations (see §6). Currently, Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) are two main techniques for alignment tuning. In this section, we introduce the main ideas of the two techniques in sequence, offer some examples of how they are utilized in addressing practical problems, and finally give a compilation of the related datasets.
3.3.2 Training Detail
RLHF [110], [111]. This technique aims to utilize reinforcement learning algorithms to align LLMs with human preferences, with human annotations as supervision in the training loop. As exemplified in InstructGPT [95], RLHF incorporates three key steps:
- Supervised fine-tuning. This step aims to fine-tune a pre-trained model to present the preliminary desired output behavior. The fine-tuned model in the RLHF setting is called a policy model. Note that this step might be skipped since the supervised policy model $\pi^{\mathrm{{SFT}}}$ can be initialized from an instruction-tuned model (see §3.2).
- Reward modeling. A reward model is trained using preference pairs in this step. Given a multimodal prompt (e.g. image and text) $x$ and a response pair $(y_{w}, y_{l})$, the reward model $r_{\theta}$ learns to give a higher reward to the preferred response $y_{w}$ than to the rejected response $y_{l}$, according to the following objective (a code sketch of this pairwise objective is given after this list):

$$
\mathcal{L}(\theta)=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\Big[\log\sigma\Big(r_{\theta}(x,y_{w})-r_{\theta}(x,y_{l})\Big)\Big]
$$

where $\mathcal{D}=\{(x,y_{w},y_{l})\}$ is the comparison dataset labeled by human annotators. In practice, the reward model $r_{\theta}$ shares a similar structure with the policy model.
- Reinforcement learning. In this step, the Proximal Policy Optimization (PPO) algorithm is adopted to optimize the RL policy model $\pi_ {\phi}^{\mathrm{RL}}$ . A per-token KL penalty is often added to the training objective to avoid deviating too far from the original policy [95], resulting in the objective:
$$
\mathcal{L}(\phi)=-\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\phi}^{\mathrm{RL}}(y|x)}\Big[r_{\theta}(x,y)-\beta\cdot\mathbb{D}_{\mathrm{KL}}\Big(\pi_{\phi}^{\mathrm{RL}}(y|x)\,\|\,\pi^{\mathrm{REF}}(y|x)\Big)\Big]
$$
where $\beta$ is the coefficient for the $\mathrm{KL}$ penalty term. Typically, both the RL policy $\pi_ {\phi}^{\mathrm{RL}}$ and the reference model πREF are initialized from the supervised model $\pi^{\mathrm{SFT}}$ . The obtained RL policy model is expected to align with human preferences through this tuning process.
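As referenced in the reward modeling step above, its pairwise objective reduces to a one-line loss. The sketch below assumes the scalar rewards have already been computed; it is not a full RLHF implementation.

```python
# A minimal sketch of the pairwise reward-modeling loss: the preferred response
# y_w should receive a higher reward than the rejected response y_l.
import torch.nn.functional as F

def reward_loss(r_preferred, r_rejected):
    """r_preferred, r_rejected: (B,) rewards r_theta(x, y_w) and r_theta(x, y_l)."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```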
Researchers have explored using RLHF techniques for better multimodal alignment. For example, LLaVA-RLHF [112] collects human preference data and tunes a model with fewer hallucinations based on LLaVA [20].
DPO [113]. It learns from human preference labels utilizing a simple binary classification loss. Compared with the PPO-based RLHF algorithm, DPO is exempt from learning an explicit reward model, thus simplifying the whole pipeline to two steps, i.e. human preference data collection and preference learning. The learning objective is as follows:
$$
\mathcal{L}(\phi)=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\frac{\pi_{\phi}^{\mathrm{RL}}(y_{w}|x)}{\pi^{\mathrm{REF}}(y_{w}|x)}-\beta\log\frac{\pi_{\phi}^{\mathrm{RL}}(y_{l}|x)}{\pi^{\mathrm{REF}}(y_{l}|x)}\Big)\Big]
$$
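A minimal sketch of this DPO objective is shown below, assuming the per-response log-probabilities have already been summed over response tokens for both the policy and the frozen reference model. It is an illustration of the loss itself, not the training code of any particular MLLM.

```python
# A minimal sketch of the DPO loss: push up the policy/reference log-ratio of
# the preferred response relative to that of the rejected response.
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """All inputs: (B,) summed log-probabilities of the responses."""
    ratio_w = logp_w_policy - logp_w_ref   # preferred response log-ratio
    ratio_l = logp_l_policy - logp_l_ref   # rejected response log-ratio
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```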
RLHF-V [114] collects fine-grained (segment-level) preference data pairs by correcting hallucinations in the model response and uses the obtained data to perform dense DPO. Silkie [115] instead collects preference data via prompting GPT-4V and distills the preference supervision into an instruction-tuned model through DPO.
TABLE 8: A summary of datasets for alignment-tuning. For input/output modalities, I: Image, T: Text.
| Dataset | Sample | Modality | Source |
|---|---|---|---|
| LLaVA-RLHF [112] | 10K | I+T→T | Human |
| RLHF-V [114] | 5.7K | I+T→T | Human |
| VLFeedback [115] | 380K | I+T→T | GPT-4V |
3.3.3 Data
The gist of data collection for alignment-tuning is to collect feedback for model responses, i.e. to decide which response is better. It is generally more expensive to collect such data, and the amount of data used for this phase is typically even less than that used in previous stages. In this part, we introduce some datasets and summarize them in Table 8.
LLaVA-RLHF [112]. It contains 10K preference pairs collected from human feedback in terms of honesty and helpfulness. The dataset mainly serves to reduce hallucinations in model responses.
RLHF-V [114]. It has 5.7K fine-grained human feedback data collected by segment-level hallucination corrections.
VLFeedback [115]. It utilizes AI to provide feedback on model responses. The dataset contains more than 380K comparison pairs scored by GPT-4V in terms of helpfulness, faithfulness, and ethical concerns.
4 EVALUATION
Evaluation is an essential part of developing MLLMs since it provides feedback for model optimization and helps to compare the performance of different models. Compared with evaluation methods of traditional multimodal models, the evaluation of MLLMs exhibits several new traits: (1) Since MLLMs are generally versatile, it is important to evaluate MLLMs comprehensively. (2) MLLMs exhibit many emergent capabilities that require special attention (e.g. OCR-free math reasoning) and thus require new evaluation schemes. The evaluation of MLLMs can be broadly categorized into two types according to the question genres, including closed-set and open-set.
4.1 Closed-set
Closed-set questions refer to a type of question where the possible answer options are predefined and limited to a finite set. The evaluation is usually performed on taskspecific datasets. In this case, the responses can be naturally judged by benchmark metrics [20], [60], [70], [76], [101], [102], [103], [104]. For example, InstructBLIP [60] reports the accuracy on ScienceQA [116], as well as the CIDEr score [117] on NoCaps [118] and Flickr30K [119]. The evaluation settings are typically zero-shot [60], [102], [104], [105] or finetuning [20], [35], [60], [70], [76], [101], [103], [105]. The first setting often selects a wide range of datasets covering different general tasks and splits them into held-in and held-out datasets. After tuning on the former, zero-shot performance is evaluated on the latter with unseen datasets or even unseen tasks. In contrast, the second setting is often observed in the evaluation of domain-specific tasks. For example, LLaVA [20] and LLaMA-Adapter [76] report finetuned performance on ScienceQA [116]. LLaVA-Med [35] reports results on biomedical VQA [120], [121], [122].
The above evaluation methods are usually limited to a small range of selected tasks o