Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Abstract
Establishing the long-context capability of large vision-language models is crucial for video understanding, high-resolution image understanding, multi-modal agents, and reasoning. We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks compared with recent cutting-edge models trained on internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
1. Introduction
In recent years, proprietary Large Multi-Modal Models (LMMs) have been undergoing rapid iteration and evolution [1–3], progressively extending large language models (LLMs) with multi-sensory skills, such as visual understanding. Beyond closed-source models, open-source models, including LLaVA series [4, 5], Qwen-VL series [6, 7], and VITA series [8, 9], are also making significant strides, trying to close the gap with their closed-source counterparts.
However, most of the above open-source works on visual understanding tasks typically focus on static images and short video inputs. Proprietary models show superior support for long-context inputs, while open-source models lag significantly behind. For example, Gemini 1.5 Pro [10] handles up to 1 million tokens in production and processes 1 hour of video information in one go. There is thus a clear need for high-performing, long-context vision-language models that are available for public use. Recently, many works [11–14] have been proposed to address the challenges of training and inference with long-context information. However, some of them [11, 12] aim mainly to improve comprehensive understanding of long videos, neglecting static image and short video input scenarios. On the other hand, other works [13, 15] rely on compressing visual tokens, which often comes at the expense of performance degradation.
To further push the limits of open-source model capabilities, we extend the context length to 1 million tokens and introduce Long-VITA, a strong open-source long-context visual language model. To this end, we employ a phased training approach that positions language as the pivot. Specifically, Long-VITA's ability to handle extended contexts is systematically augmented through a four-stage process. Beyond the conventional stages of vision-language alignment and supervised fine-tuning, our approach incorporates specialized stages for long-context supervised fine-tuning at 128K and 1M tokens. To achieve a good performance trade-off between long and short sequences, Long-VITA takes advantage of the existing abundance of open-source image-text and video-text data. We also introduce a multi-image summarization dataset, Comic-9K, comprising 9K comic books and the corresponding detailed synopses. This dataset has a total of 200K images, with an average of 20 high-resolution images per sample, and all synopses are manually written and collected from the web.
2. Related Work
2.1. Large Vision Language Models
Recent advancements have seen the creation of large vision language models (LVLMs), which usually enhance large language models with the capability to process and interpret visual information. Flamingo [16] performs various multi-modal tasks, e.g., image captioning, visual dialogue, classification, or visual question answering, from only a few input/output examples. BLIP-2 [17] leverages frozen pre-trained image encoders and large language models to bootstrap vision-language pre-training. LLaVA [18] uses language models to generate multi-modal instruction-following data and connects a vision encoder with large language models for general-purpose visual and language understanding. The Qwen-VL [6] and InternVL [19] series further perform various vision-language tasks, such as image captioning, question answering, text-oriented question answering, and visual grounding. These works showcase that LVLMs have achieved significant breakthroughs. Furthermore, significant progress in multi-modal model evaluation [20–22] has also contributed to the rapid improvement of large vision-language models. In this work, we introduce Long-VITA, a series of multi-modal, long-context models trained exclusively with fully open-source datasets for pre-training and supervised fine-tuning, demonstrating promising results on extensive benchmarks.
2.2. Long-Context Multi-Modal Model
LLMs are typically pre-trained with a pre-defined context length. Training LLMs with long context from scratch is prohibitively expensive for most researchers. Recently, several works, e.g., Position Interpolation [23], YaRN [24],
LongRoPE [25] and LongLoRA [26] have tried to extend the context length of LLMs by fine-tuning.
Many methods have also been proposed for large multi-modal models to handle long-context visual inputs. LongVILA [11] performs continued pre-training on the LLM to extend its context length to 256K, followed by long video training. LongLLaVA [12] integrates a hybrid of Mamba and Transformer blocks for long-context multi-modal understanding. LongVU [13] proposes a spatio-temporal adaptive compression scheme to reduce long video tokens by leveraging cross-modal queries and inter-frame similarities. LongVA [14] extends the language model on text data and then aligns the extended model with visual inputs, enabling it to perceive more than 200K visual tokens. Kangaroo [27] develops a curriculum training strategy that progressively equips the base LLM with the capacity to comprehend long videos.
However, the above methods mainly focus on video understanding and visual information retrieval, neglecting the trade-off between image and long-video understanding tasks. In this paper, we propose Long-VITA, a powerful LMM that obtains a long-context capacity of 1 million tokens and simultaneously achieves superior performance on both image and video understanding tasks.
2.3. Long-Context Visual Instruction Data
LLaVA-Video [28] synthesizes long video-language instruction data, covering various tasks such as captioning, open-ended QA, and multiple-choice QA. LongVILA [11] constructs instruction-following datasets from long videos, encompassing summarization and other queries relevant to a comprehensive understanding of long video content. Video-MME [29] incorporates a diverse range of video types with durations ranging from 11 seconds to 1 hour. LVBench [30] evaluates long-context LMMs with video-language interleaved inputs up to an hour long. LongVideoBench [31] introduces referring reasoning questions and presents significant challenges for the long-context multi-modal capabilities of both proprietary and open-source LMMs.
However, the above works only focus on long-context understanding in videos, which usually contain redundant visual frames and other modal information, such as subtitles and audio. In this paper, we explore comic-based long-context instruction learning and collect a high-quality, real-world dataset for comic book summarization.
3. Long-VITA
3.1. Architecture
This technical report presents Long-VITA, a new series of open-source vision-language models that explore long-context vision understanding without token compression or sparse local attention. Our multi-modal architecture is constructed around three core components: the Vision Encoder, the Projector, and the LLM.
Table 1. Summary of datasets used in Long-VITA for different stages. The stage columns give the sampling ratio / maximum number used in each training stage. 'Comic-9K' and 'MovieNet-Summary' are publicly available at https://huggingface.co/datasets/VITA-MLLM/Comic-9K and https://huggingface.co/datasets/VITA-MLLM/MovieNet-Summary, respectively.
| Type | Name | Total | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|---|---|
| Image-Text | LLaVA-ReCap [32] | 3.5M | 1.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | ALLaVA-4V [33] | 1.4M | 1.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | LVIS-Instruct4V [34] | 222K | 0.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | ShareGPT4V [35] | 1.3M | 1.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | the-cauldron [36] | 1.1M | 0.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | Docmatix [37] | 1.2M | 1.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | LLaVA-OneVision-Mid [5] | 444K | 1.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | LLaVA-OneVision [5] | 3.6M | 0.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | M4-Instruct [38] | 860K | 0.0 | 1.0 | 0.1 | 0.1 |
| Image-Text | Comic-9K | 9K | 0.0 | 0.0 | 1.0 | 1.0 |
| Video-Text | VideoGPT-plus [39] | 575K | 0.0 | 1.0 | 0.1 | 0.1 |
| Video-Text | ShareGemini-cap [40] | 323K | 0.0 | 1.0 | 0.1 | 0.1 |
| Video-Text | LLaVA-Video-178K [28] | 1.6M | 0.0 | 0.0 | 1.0 | 1.0 |
| Video-Text | MovieNet-Summary | 1K | 0.0 | 0.0 | 0.0 | 1.0 |
| Short Text | OpenHermes-2.5 [41] | 1.0M | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | LIMA [42] | 1K | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | databricks-dolly-15k [43] | 15K | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | MetaMathQA [44] | 395K | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | MathInstruct [45] | 262K | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | Orca-Math [46] | 200K | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | atlas-math-sets [47] | 17.8M | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | goat [48] | 1.7M | 0.0 | 1.0 | 0.1 | 0.1 |
| Short Text | camel-ai-math [49] | 50K | 0.0 | 1.0 | 0.1 | 0.1 |
| Long Text | Long-Instruction [50] | 16K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | LongForm [51] | 23K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | LongAlign-10k [52] | 10K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | LongCite-45k [53] | 45K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | LongWriter-6k [54] | 6K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | LongQLoRA [55] | 39K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | LongAlpaca [26] | 12K | 0.0 | 0.0 | 1.0 | 1.0 |
| Long Text | Long-Data-Collections [56] | 98K | 0.0 | 0.0 | 1.0 | 1.0 |
Large Language Model. We choose Qwen2.5-14B-Instruct [57] as our LLM.
Vision Encoder. We consider InternViT-300M [19] as the visual encoder. We introduce a dynamic tiling vision encoding strategy [58] that efficiently processes high-resolution images of varying aspect ratios.
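As a concrete illustration, the following is a minimal sketch of such a dynamic tiling scheme: the image is matched to the aspect-ratio grid that fits it best, resized, and split into 448×448 tiles, together with a low-resolution global thumbnail. The tile size follows Tab. 2; the grid-selection heuristic, function names, and the 12-tile cap are illustrative assumptions rather than the exact recipe of [58].

```python
from typing import List, Tuple
from PIL import Image

TILE = 448  # tile side length used by the vision encoder (Tab. 2)

def select_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the (cols, rows) tiling whose aspect ratio best matches the image."""
    aspect = width / height
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))

def dynamic_tile(image: Image.Image, max_tiles: int = 12) -> List[Image.Image]:
    """Return a global 448x448 thumbnail plus high-resolution 448x448 tiles."""
    cols, rows = select_grid(*image.size, max_tiles=max_tiles)
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    thumbnail = image.resize((TILE, TILE))  # low-resolution global view
    return [thumbnail] + tiles
```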
Vision-Language Projector. We employ a 2-layer MLP to project image features into the word embedding space. We also apply a simple pixel shuffle [58] to visual tokens and reduce the number of visual tokens to one-quarter.
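The token-reduction step can be sketched as follows: a space-to-depth pixel shuffle merges each 2×2 neighbourhood of visual tokens into one token, quartering the token count, and a 2-layer MLP then maps the result into the LLM embedding space. The feature dimensions and module layout below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Pixel-shuffle (space-to-depth) token reduction followed by a 2-layer MLP."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120, factor: int = 2):
        super().__init__()
        self.factor = factor
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * factor * factor, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, h*w, vision_dim) visual tokens from the encoder (square grid assumed)
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        f = self.factor
        x = x.view(b, h, w, c)
        # merge each f x f spatial neighbourhood into a single, wider token
        x = x.view(b, h // f, f, w // f, f, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // f) * (w // f), c * f * f)
        return self.mlp(x)  # (batch, n/4, llm_dim) when factor == 2
```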
3.2. Data Construction
Long-VITA is trained on open-source datasets only. As shown in Tab. 1, the training dataset encompasses a diverse range of sources.
Image-Text Data. The datasets employed can be categorized into three groups:
• Image Captioning. The visual caption data consist of LLaVA-ReCap [32], ALLaVA-4V [33], ShareGPT4V [35], and LLaVA-OneVision-Mid [5].
• Visual Question Answering. We combine general VQA data from LVIS-Instruct4V [34], the-cauldron [36], Docmatix [37], and LLaVA-OneVision [5].
• Interleaved Image-Text. To empower all-round multi-image capabilities, we employ M4-Instruct [38]. To further enhance multi-image understanding with more than 10 images, we collect public comic books with their corresponding detailed synopses from the web and build the Comic-9K dataset. Specifically, Comic-9K contains 200K images spanning 9K comic books, along with manually written synopses.
Table 2. Detailed configuration for each training stage of the Long-VITA model.
| | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Sequence Length | 32K | 16K | 128K | 1M |
| Batch Size | 528 | 528 | 64 | 8 |
| Training Iterations | 1,000 | 5,000 | 1,000 | 500 |
| Training Tokens | 16B | 40B | 8B | 4B |
| Sequence Packing | √ | √ | √ | √ |
| Parallelism (Tensor / Pipeline / Context / Data) | 8 / 1 / 1 / 8 | 8 / 1 / 1 / 8 | 8 / 1 / 2 / 4 | |
| Image Resolution | 448 + 448 × {1×2, 2×1, ..., 3×4, 4×3} | | | |
| Maximum Number of Images | 128 | 64 | 512 | 4,096 |
| Video Resolution | 448 | | | |
| Video Frame Rate | 1 | | | |
| Maximum Number of Video Frames | 128 | 64 | 512 | 4,096 |
| Vision Learning Rate | 0.0 | 1.0 × 10⁻⁶ | 1.0 × 10⁻⁶ | 1.0 × 10⁻⁶ |
| Projector Learning Rate | 1.0 × 10⁻³ | 1.0 × 10⁻⁵ | 1.0 × 10⁻⁵ | 1.0 × 10⁻⁵ |
| LLM Learning Rate | 0.0 | 1.0 × 10⁻⁵ | 1.0 × 10⁻⁵ | 1.0 × 10⁻⁵ |
| Vision Learning Rate Decay | 0.0 | 0.9 | | |
| Learning Rate Scheduler | Cosine | | | |
| Weight Decay | 0.0 | | | |
| Gradient Clipping | 1.0 | | | |
| Rotary Base | 1,000,000 | | | |
| Adam Beta1 | 0.9 | | | |
| Adam Beta2 | 0.999 | | | |
Video-Text Data. We construct our video understanding data using VideoGPT-plus [39], ShareGemini [40], and LLaVA-Video-178K [28]. To improve the long-context capability of movie-level video understanding, we build a MovieNet-Summary dataset, which consists of paired movies and synopses from MovieNet [59].
Short Text Data. Following [37], the pure text data is collected from OpenHermes-2.5 [41], LIMA [42], databricks-dolly-15k [43], MetaMathQA [44], MathInstruct [45], Orca-Math [46], atlas-math-sets [47], goat [48], and camel-ai-math [49].
Long Text Data. To transfer the context length of the language model to the modality-aligned multi-modal models [14], we gather several long text datasets, including Long-Instruction-with-Paraphrasing [50], LongForm [51], LongAlign-10k [52], LongCite-45k [53], LongWriter-6k [54], LongQLoRA [55], LongAlpaca [26], and Long-Data-Collections [56].
Note that Comic-9K and MovieNet-Summary are created by this work and are made publicly available. Therefore, Long-VITA is trained only on open data, and we do not use any data filtering methods.
3.3. Training Pipelines
Unlike other models, Long-VITA training is divided into four stages with varying sequence lengths.
Table 3. Maximal supported sequence length for inference and training.
| Name | # Param. | Device Type | # Devices | Inference | Training |
|---|---|---|---|---|---|
| LongVILA [11] | 7B | 80G GPU | 8 | 276K | 666K |
| Long-VITA | 14B | 64G NPU | 16 / 32 | 1,024K | 1,024K |
| Long-VITA | 14B | 96G GPU | 64 | 400K / 800K | 1,600K |
Stage 1: Vision-Language Alignment. Building upon pre-trained language models, our primary objective is to establish initial connections between visual features and language features. We freeze the LLM and the visual encoder, only training the visual projector. Therefore, we mainly use caption data for pre-training. We also add Docmatix [37] in this stage to improve document-based VQA.
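For illustration only, a minimal sketch of this Stage-1 setup, assuming a wrapper model with `vision_encoder`, `projector`, and `llm` submodules (hypothetical attribute names): the encoder and LLM are frozen and only the projector is optimized, using the Stage-1 learning rate, Adam betas, and weight decay listed in Tab. 2.

```python
import torch

def build_stage1_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Stage 1: freeze the vision encoder and the LLM, train only the projector."""
    for module in (model.vision_encoder, model.llm):
        for p in module.parameters():
            p.requires_grad_(False)
    # Learning rate and optimizer settings follow Tab. 2 (projector LR 1e-3 in Stage 1).
    return torch.optim.AdamW(model.projector.parameters(),
                             lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)
```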
Stage 2: General Knowledge Learning. After establishing the vision-language alignment in the embedding space, we dedicate most of our computational resources to vision-language general knowledge learning. This stage leverages all the image-text data for multiple tasks, including image captioning, common VQA, OCR, and multi-modal conversations. In this stage, we also add text-only general instructions, math problems, and arithmetic calculations. For video understanding, we only add VideoGPT-plus [39] and ShareGemini-cap [40]. In both Stage 1 and Stage 2, we pack all training data to a fixed sequence length, which effectively trains samples with different sequence lengths. Specifically, we randomly sample data items from the same source and concatenate them into one training sample with a token length of 32K and 16K for Stage 1 and Stage 2, respectively. We reset positional embeddings and attention masks for all packed samples so that each text-vision pair only attends to itself. This approach helps manage extensive datasets and ensures coverage of diverse data segments.
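A minimal sketch of this packing scheme, assuming token-id tensors as input (function and variable names are illustrative): samples are concatenated up to the fixed pack length, position ids restart at each sample boundary, and a block-diagonal causal mask keeps every sample attending only to itself. Setting `reset_positions=False` corresponds to the Stage 3/4 behaviour described later, where positions and attention are not reset.

```python
from typing import List
import torch

def pack_samples(samples: List[torch.Tensor], pack_len: int, pad_id: int,
                 reset_positions: bool = True):
    """Pack samples into one fixed-length sequence with per-sample masking."""
    ids, pos, doc_id = [], [], []
    for d, sample in enumerate(samples):
        sample = sample[: pack_len - len(ids)]      # truncate if the pack is almost full
        ids.extend(sample.tolist())
        start = 0 if reset_positions else len(pos)  # restart positions per sample (Stage 1/2)
        pos.extend(range(start, start + len(sample)))
        doc_id.extend([d] * len(sample))
        if len(ids) >= pack_len:
            break
    pad = pack_len - len(ids)
    ids += [pad_id] * pad
    pos += [0] * pad
    doc_id += [-1] * pad                            # padding belongs to no sample

    doc = torch.tensor(doc_id)
    causal = torch.tril(torch.ones(pack_len, pack_len, dtype=torch.bool))
    not_pad = (doc[:, None] >= 0) & (doc[None, :] >= 0)
    if reset_positions:                             # Stage 1/2: block-diagonal attention
        allowed = not_pad & (doc[:, None] == doc[None, :])
    else:                                           # Stage 3/4: attend across packed samples
        allowed = not_pad
    attention_mask = causal & allowed
    return torch.tensor(ids), torch.tensor(pos), attention_mask
```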
Stage 3: Long-Sequence Fine-Tuning. In this stage, we extend the context length to 128K. We reduce the sampling ratio of the Stage 2 data to 0.1 and incorporate additional long-context text instructions, comic book summaries, and video understanding datasets.
Stage 4: Long-Sequence Fine-Tuning. In this stage, we extend the context length to 1,024K and add additional movie summary data. In both Stage 3 and Stage 4, we also pack all training data to a fixed sequence length, but without resetting positional embeddings and attention masks. We thereby force the model to capture the correlations between the two modalities across long-context information.
We do not use any interpolation technique during training or testing; therefore, the context window of Long-VITA can be extended further when equipped with YaRN [24], LongRoPE [25], or NTK-based interpolation.
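As a hedged illustration of what such an extension could look like, the snippet below applies NTK-aware scaling to the rotary base (Long-VITA itself trains with a fixed base of 1,000,000 per Tab. 2; the `scale` factor and function name here are illustrative assumptions):

```python
import torch

def rope_inv_freq(dim: int, base: float = 1_000_000.0, scale: float = 1.0) -> torch.Tensor:
    """Inverse RoPE frequencies with NTK-aware base scaling.

    With scale > 1, the base is enlarged so that low frequencies are stretched,
    extending the usable context roughly by `scale` without fine-tuning.
    """
    ntk_base = base * scale ** (dim / (dim - 2))
    return 1.0 / (ntk_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Example: double the effective context window of a 128-dim rotary head.
inv_freq = rope_inv_freq(dim=128, scale=2.0)
```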
Note that we do not use any parameter-efficient methods such as LoRA [60] or approximate attention [26].
3.4. Hyper-parameters and Infrastructures
We initially implement training and inference with MindSpeed [78] and MindSpeed-LLM [79], which adapt Megatron-LM [80] to the Ascend NPU. We also port the training and inference code to the GPU platform. The detailed hyper-parameters of Long-VITA are listed in Tab. 2.
Training. We configure different distributed training strategies for each key module in Long-VITA.
Inference. We implement two new designs to scale up the number of tokens for model inference.
• Context-Parallelism Distributed Inference. We implement tensor parallelism together with context parallelism for model inference, thus supporting distributed attention over infinite-length input tokens. Similar to the training mode, the length of the inference sequence is fixed during the decoding phase. Specifically, we concatenate the input tokens with padding tokens up to the maximum output length. At each forward pass, the system extracts the newly predicted next token from the fixed-length output and terminates the generation process accordingly. A simplified sketch of this fixed-length decoding loop is given after this list.
• Logits-Masked Language Modeling Head. We observe that the output logits from the language modeling head induce an excessive memory footprint. For example, given 1M tokens and a vocabulary size of $10^{5}$, the output logit matrix has a shape of $10^{6}\times10^{5}$ and requires 400 GB of memory in the float32 data type. To address this memory issue, we mask out all hidden features and feed only the single hidden feature that predicts the next token to the language modeling head. With the above design, the memory consumption of the output logit matrix is 0.0004 GB, a $10^{6}\times$ reduction. Note that this design can also be applied to model training with long-context inputs, where the language modeling head only needs to predict the short-context outputs to reduce memory consumption. A small illustration of this masking and the memory arithmetic is given after this list.
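A simplified, single-device sketch of the fixed-length decoding loop described in the first item (the distributed attention itself is omitted; the model call, greedy decoding, and variable names are illustrative assumptions): the prompt is padded to a constant total length, so the same context-parallel sharding can be reused at every step, and each new token is written back into the padded buffer.

```python
import torch

@torch.no_grad()
def fixed_length_generate(model, input_ids: torch.Tensor, max_new_tokens: int,
                          pad_id: int, eos_id: int) -> torch.Tensor:
    """Greedy decoding with a fixed-length buffer: the sequence length never changes."""
    prompt_len = input_ids.size(1)
    padding = torch.full((input_ids.size(0), max_new_tokens), pad_id,
                         dtype=input_ids.dtype, device=input_ids.device)
    buffer = torch.cat([input_ids, padding], dim=1)      # fixed total length
    cur = prompt_len
    for _ in range(max_new_tokens):
        logits = model(buffer)                           # (batch, seq_len, vocab)
        next_token = logits[:, cur - 1].argmax(dim=-1)   # prediction at the last real token
        buffer[:, cur] = next_token
        cur += 1
        if (next_token == eos_id).all():                 # terminate generation
            break
    return buffer[:, prompt_len:cur]
```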
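The memory arithmetic and the masking trick of the second item can be illustrated as follows (the hidden size and demonstration shapes are illustrative; only the vocabulary size and token count follow the example in the text):

```python
import torch
import torch.nn as nn

# Memory arithmetic from the text: 1M tokens, 1e5 vocabulary, float32 (4 bytes).
seq_len, vocab = 1_000_000, 100_000
full_logits_gb = seq_len * vocab * 4 / 1e9    # 400.0 GB for the full logit matrix
masked_logits_gb = 1 * vocab * 4 / 1e9        # 0.0004 GB when only one position is kept

# Demonstration on small shapes.
hidden = 512
lm_head = nn.Linear(hidden, vocab, bias=False)
hidden_states = torch.randn(1, 4096, hidden)  # decoder output for a short sequence

# Logits-masked head: project only the hidden state that predicts the next token,
# instead of materializing the full (batch, seq_len, vocab) logit matrix.
last_hidden = hidden_states[:, -1:, :]        # (1, 1, hidden)
logits = lm_head(last_hidden)                 # (1, 1, vocab)
next_token = logits.argmax(dim=-1)
```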
Table 4. Comparison with state-of-the-art models under 20B parameters on the OpenCompass leaderboard. MMB: the test split of MMBench [20], MV: MathVista [61], HB: HallusionBench [62], OCR: OCRBench [63]. 'AVG (6)' denotes the average score of the six objective benchmarks, i.e., MMBench, MMStar, MMMU, HallusionBench, AI2D, and OCRBench, which do not use a judge LLM for evaluation. 'AVG' denotes the average of scores on all eight benchmarks. 'Internal Data' denotes whether the model is trained with in-house data that is not publicly available. Results are obtained from the OpenCompass leaderboard.
| Name | Internal Data | MMB | MMStar | MMMU | MV | HB |
|---|---|---|---|---|---|---|