Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
Zuyan Liu1,2* Yuhao Dong3,2* Jiahui Wang1 Ziwei Liu3 Winston Hu2 Jiwen Lu1† Yongming Rao2,1†
1 Tsinghua University 2 Tencent Hunyuan Research 3 S-Lab, NTU
https://ola-omni.github.io/
Figure 1. Ola pushes the frontiers of the omni-modal language model across image, video, and audio understanding benchmarks. We compare Ola with existing state-of-the-art open-source multimodal models and GPT-4o on mainstream image, video, and audio benchmarks. For fair comparison, we select around-7B versions of existing MLLMs. Thanks to our progressive alignment strategy, Ola outperforms both omni-modal and specialized MLLMs in all modalities. "×" indicates that the model is not capable of the task and "−" indicates the result is missing. The LibriSpeech score is inverted since lower is better for the WER metric.
Abstract
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy, which extends the supported modalities of the language model step by step. Our training pipeline begins with the most distinct modalities, image and text, then gradually expands the skill set of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to keep the cross-modal alignment data relatively small, making it easy and inexpensive to develop omni-modal models from existing vision-language models. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
1. Introduction
Multi-modal Large Language Models are drawing increasing attention owing to their strong instruction-following capabilities and abundant knowledge for handling complex inputs including texts, images, videos, and audio. Built on the strong performance of open-sourced large language models [56, 76], extensive research has been done on connecting specific modalities with language responses [10, 32, 36, 55, 57, 67]. Recently, the success of GPT-4o [53] and Gemini [24] in supporting more modalities in Large Language Models has inspired researchers to take an important step towards omni models that understand all inputs within one model.
Figure 2. Ola Architecture. Ola supports omni-modal inputs including text, image, video, and audio, processing them simultaneously with competitive performance on understanding tasks for all these modalities. Meanwhile, Ola supports user-friendly real-time streaming decoding for text and speech thanks to the text detokenizer and the speech decoder.
Figure 3. Progressive modality alignment helps to learn better omni-modal models. We compare our progressive alignment strategy with two baseline training pipelines on Image QA (MMBench [40]), Video QA (VideoMME [21]), and ASR (LibriSpeech [54]): 1) direct mixing, where all instruction tuning data is merged and trained in a single stage, and 2) balanced sampling, where we upsample certain sources to make the training data more balanced among modalities. The experiment is conducted on a subsampled training set for efficiency, and we train all models for the same number of steps for fair comparison. Scores are normalized by the score of progressive alignment to obtain relative scores, and the ASR score is inverted since lower is better for the WER metric.
The core challenges in training omni-modal Large Language Models lie in modeling modalities with very different distributions and in designing an effective learning pipeline that achieves competitive, balanced performance on all supported tasks. Several attempts have been made to overcome these difficulties [17, 22, 72]; we illustrate the mainstream works and the state-of-the-art MLLMs [11, 20, 32, 67] in specific domains in Tab. 2. In previous works, strong performance and modality breadth tend to contradict each other, and existing open-sourced omni-modal solutions still show a large performance gap to state-of-the-art specialized LLMs, forming a strong barrier between the concept of omni-modal models and real-world applications. Moreover, the lack of capability in specific domains or tasks, the massive data demand, the latency in user interaction, and the inadequate alignment across modalities reveal how suboptimal existing omni-modal models remain.
In this paper, we propose the Ola model, exploring a solution for training an omni-modal Large Language Model with performance comparable to state-of-the-art modality-specific LLMs, real-time interaction, and high efficiency in alignment data. The core design of the Ola model is the progressive modality alignment strategy. To build connections between language and vision, we start from two fundamental and separate modalities, image and text, to build the basic knowledge for omni-modal models. Subsequently, we gradually expand the training sets to equip the model with extended abilities, including video frames that strengthen visual understanding, speech data that connects language and audio knowledge, and video with audio that comprehensively mixes information from language, video, and audio. The progressive learning strategy makes omni-modal learning easier by decomposing the complex training procedure into small steps, therefore keeping the cross-modal alignment data small and making it easier to start from existing achievements in vision-language models. As shown in Fig. 3, the performance of Ola benefits largely from our progressive training pipeline, leading to more balanced and competitive results on all modalities.
To cooperate with the training strategy, important improvements have been made to the architecture and data. 1) The Ola architecture supports omni-modal inputs and streaming text and speech generation with an extendable and concise design. We design a joint alignment module for vision and audio, fuse the visual inputs with a Local-Global Attention Pooling layer, and allow free combinations of visual, audio, and text tokens.
Moreover, we integrate a sentence-wise streaming decoding module for high-quality voice synthesis. 2) Beyond the collected fine-tuning data for vision and audio, we deeply mine the relationships between videos and their corresponding audio to construct a bridge between the visual and audio modalities. Specifically, we collect raw videos from academic and open-ended web sources, design separate cleaning pipelines, and then utilize vision-language models to generate question-answering pairs based on the subtitles and video content.
We evaluate Ola on complete omni-modal benchmarks covering image, video, and audio. With only 7B parameters, Ola achieves competitive performance across mainstream multi-modal benchmarks. On image benchmarks, Ola excels at both general and task-specific understanding, with an overall mean accuracy of 72.6% on the challenging OpenCompass benchmark [16], an 84.3% average score on MMBench-1.1 [40], a 57.0% average score on MMMU [77], etc. On the challenging VideoMME [21] multiple-choice benchmark, which ranges from videos within 30 seconds to 1 hour, Ola achieves an impressive accuracy of 68.4% with video and audio inputs. Ola also excels at audio understanding tasks such as audio-speech recognition and chat evaluation, achieving a 3.1 mean WER on LibriSpeech [54] and a 6.41 GPT-eval score on AIR-Bench [74]. These results show a substantial improvement over existing omni-modal LLMs and even surpass the performance of state-of-the-art specialized LLMs.
2. Related Works
Large Vision-Language Models. Inspired by the success of AI assistants and large language models [50–52], research has increasingly focused on vision-language multimodal large language models. Significant advancements have been made in architecture design [3, 13, 33, 39, 44], training strategies [10, 43], model scaling [32], and data curation [31, 43, 45, 73]. Furthermore, models are evolving beyond static images to support video [7, 36, 42, 47], 3D [27, 37], and mixed visual inputs [57, 59]. However, extending these models to effectively integrate audio modalities while maintaining balanced and robust performance remains an area that has not been fully explored.
Large Audio-Text Models. Large Language Models, mainly focused on text inputs and outputs, have a foundational link to speech, with pioneering efforts integrating speech inputs through adapter-based modifications [4, 19, 70]. The challenge of LLM-based speech generation has been addressed with the development of speech decoders [61, 80], marking a significant step towards omni-modal models. Beyond speech, research is expanding into audio-based LLMs that encompass the understanding of music, events, and more. Notable examples include AudioGPT [28] and SALMONN [63], which explore these audio dimensions, while models like Qwen2-Audio [11] demonstrate enhanced understanding capabilities.
Towards Large Omni-Modal Models. Recent advancements in large language models [24, 53] have spurred interest in developing omni-modal models that can handle multiple modalities simultaneously. Notable examples include SpeechGPT [80] and LLaMA-Omni [17], which integrate audio-text understanding with speech generation. The Mini-Omni [72] series addresses challenges in speech streaming generation through parallel decoding techniques. VITA [22] extends this capability by unifying audio, image, video, and text understanding. While these models excel at understanding tasks, efforts are also being made to tackle both understanding and generation tasks [68, 71]. However, existing omni-modal models often fall short in managing the full spectrum of input modalities and output formats, or they suffer from significantly poorer performance. Ola aims to address these limitations by enhancing the capability and efficiency of omni-modal models with a better architecture, training strategy, and targeted data preparation.
3. Ola: Omni-Modal Understanding
We put our main efforts into three aspects to obtain omni-modal understanding for Ola, making it capable of reading, hearing, seeing, typing, and speaking arbitrarily. 1) The Ola architecture introduced in Sec. 3.1 supports omni-modal inputs and streaming outputs for both text and speech. 2) We design the progressive training strategy in Sec. 3.2 to bridge the modality gaps between language and vision from primary to periphery. 3) The effective omni-modal training data in Sec. 3.3 provides strong performance across all benchmarks, especially our proposed cross-modal video data, which emphasizes learning audio-relevant information from videos.
3.1. Ola Architecture
A general view of the Ola architecture is illustrated in Fig. 2. The encoding part accepts omni-modal inputs in text, image, video, and audio formats with modality-specific encoders or embeddings. Subsequently, the joint alignment operations process all inputs in a unified manner, fusing and concatenating all sequences into flattened tokens for the core Ola Large Language Model. The LLM generates text tokens serially, and we adopt the speech decoder for streaming speech decoding.
Omni-Modal Input Encoding. We encode visual, audio, and text inputs separately based on the successful practice of previous text-to-single-modality Large Language Models. For visual inputs that include images $I$ and videos $V_{1,2,\cdots,n}$ with $n$ frames, we follow vision-language models [33, 39] and use a multi-modal visual encoder $\mathcal{E}_{I,V}(I, V_{f_1,f_2,\cdots})$ for encoding. Note that we preserve the original aspect ratio of each image or frame for the arbitrary-resolution vision encoder OryxViT [42], initialized from SigLIP-400M [79], as it provides a more natural treatment of visual inputs. We obtain the image features $f_I$ and the per-frame video features $\left[f_{V_1}, f_{V_2}, \cdots, f_{V_n}\right]$ based on the image patches.
Figure 4. Illustration of the Ola Progressive Modality Alignment. We visualize the relationships among modalities in the left part. Speech acts as the connection between language and audio knowledge, while video constructs the bridge with highly relevant visual and audio information. Therefore, we design the progressive alignment training strategy from primary to peripheral modalities. Furthermore, we design the cross-modal video-audio data to better capture the relationships among modalities.
For audio encoding, we propose a dual-encoder approach. Specifically, we use Whisper-v3 [58] as the speech encoder and BEATs [8] as the music encoder for better alignment between audio and text, providing richer audio information. The music encoder takes the original waveform audio $A$ as input, while the speech encoder takes the waveform transformed into a Mel spectrogram representation $A_{(mel)}$ as input. Note that the Whisper encoder only supports a certain length of audio input, so we fix the sample rate at 16000 Hz, cut overlong audio into 30-second segments $A_1, A_2, \cdots, A_n$, and run the encoder in batches: $\left[f_{A_1}, f_{A_2}, \cdots, f_{A_n}\right] = \mathcal{E}_A([A_1, A_2, \cdots, A_n])$. The embedding features of the speech and music encoders are concatenated along the channel dimension to form the comprehensive audio features $f_A$.
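To make the segment-and-fuse scheme concrete, the sketch below shows one way to implement the 30-second batching and channel-wise concatenation described above. The `speech_encoder`, `music_encoder`, and `to_mel` callables are placeholders for the Whisper-style encoder, the BEATs-style encoder, and the Mel-spectrogram transform; their signatures are assumptions, not the actual library APIs.

```python
# Minimal sketch of Ola's dual-encoder audio path, under the assumptions above.
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000
SEGMENT_SECONDS = 30
SEGMENT_SAMPLES = SAMPLE_RATE * SEGMENT_SECONDS


def split_audio(wav: torch.Tensor) -> list[torch.Tensor]:
    """Cut an over-long 1-D waveform into 30-second chunks,
    zero-padding the last chunk so all segments share one length."""
    segments = []
    for start in range(0, wav.numel(), SEGMENT_SAMPLES):
        chunk = wav[start:start + SEGMENT_SAMPLES]
        if chunk.numel() < SEGMENT_SAMPLES:
            chunk = F.pad(chunk, (0, SEGMENT_SAMPLES - chunk.numel()))
        segments.append(chunk)
    return segments


def encode_audio(wav, speech_encoder, music_encoder, to_mel):
    """Run both encoders on the batched segments and concatenate
    their features along the channel dimension to form f_A."""
    segments = torch.stack(split_audio(wav))           # (n, SEGMENT_SAMPLES)
    speech_feat = speech_encoder(to_mel(segments))     # (n, T, C_speech), mel input
    music_feat = music_encoder(segments)               # (n, T, C_music), raw wav input
    return torch.cat([speech_feat, music_feat], dim=-1)
```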
For text inputs, we directly use the carefully designed tokenizer and the well-trained embedding layers of the pretrained Large Language Model to obtain the text tokens $t_T$.
Joint Alignment for Vision and Audio. The alignment modules act as converters from modality-specific spaces to the text embedding space, which is an essential part of omni-modal Large Language Models. To reduce the token length of visual features for higher efficiency, we go one step beyond the structural downsampling motivation of previous works [35] and propose the Local-Global Attention Pooling layer for better downsampled features with less information loss. Specifically, for an image or video-frame feature with spatial shape $H \times W$ and channel dimension $C$, we adopt bilinear interpolation for 2x downsampling to obtain $f^{\mathrm{global}}$, which contains the global information of each downsampled region. We combine the original and global features into local-global embeddings and use Softmax to predict the importance $\pi$ of each downsampled spatial region.
The downsampled feature is then computed by weighting each original region with its score $\pi$ via the Hadamard product.
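The following is a minimal PyTorch sketch of one plausible reading of the Local-Global Attention Pooling described above; the single linear scoring layer and the per-region Softmax over the 2x2 local positions are assumptions where the text does not fix the exact formulation.

```python
# Plausible sketch of Local-Global Attention Pooling (2x downsampling).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalAttentionPool(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # One importance logit per local position from [local, global] features.
        self.score = nn.Linear(2 * channels, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) features of an image or a video frame, H and W even.
        B, C, H, W = f.shape
        assert H % 2 == 0 and W % 2 == 0
        # Global feature of each 2x2 region via bilinear 2x downsampling.
        f_global = F.interpolate(f, scale_factor=0.5, mode="bilinear",
                                 align_corners=False)
        # Gather the 2x2 local features under each downsampled position.
        f_local = F.unfold(f, kernel_size=2, stride=2)           # (B, C*4, N)
        f_local = f_local.view(B, C, 4, -1).permute(0, 3, 2, 1)  # (B, N, 4, C)
        g = f_global.flatten(2).permute(0, 2, 1)                 # (B, N, C)
        g = g.unsqueeze(2).expand(-1, -1, 4, -1)                 # (B, N, 4, C)
        # Local-global embedding -> importance pi over the 4 local positions.
        pi = torch.softmax(self.score(torch.cat([f_local, g], dim=-1)), dim=2)
        # Hadamard-weighted aggregation of the local features.
        return (f_local * pi).sum(dim=2)  # (B, N, C) with N = (H/2)*(W/2)
```

In this reading, each downsampled token is a $\pi$-weighted combination of the four original features it covers, so salient sub-regions dominate the pooled token instead of being averaged away.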
We apply simple yet effective two-layer non-linear MLP connectors $\mathrm{MLP}_A$ and $\mathrm{MLP}_V$, following previous works [38, 39], to project the modality-specific features $[f_I, f_V, f_A]$ into unified tokens $[t_I, t_V, t_A]$. We define visual and audio start, separator, newline, and end tokens to indicate special positions in the inputs. The omni-modal tokens $[t_I, t_V, t_A]$ are concatenated with the text tokens $t_T$ in free combination for LLM decoding.
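A compact sketch of the two-layer MLP connectors and the free interleaving of modality tokens with text tokens is given below; the GELU activation and the special-token handling are illustrative assumptions, not the exact Ola implementation.

```python
# Sketch of the modality connectors and token interleaving, under the
# assumptions stated above.
import torch
import torch.nn as nn


def build_connector(in_dim: int, llm_dim: int) -> nn.Module:
    """Two-layer non-linear MLP projecting modality features into the LLM space."""
    return nn.Sequential(nn.Linear(in_dim, llm_dim),
                         nn.GELU(),
                         nn.Linear(llm_dim, llm_dim))


def interleave(text_embeds, image_tokens=None, audio_tokens=None,
               img_start=None, img_end=None, aud_start=None, aud_end=None):
    """Concatenate modality tokens with text tokens in free combination,
    wrapping each modality span with its start/end marker embeddings."""
    pieces = []
    if image_tokens is not None:
        pieces += [img_start, image_tokens, img_end]
    if audio_tokens is not None:
        pieces += [aud_start, audio_tokens, aud_end]
    pieces.append(text_embeds)
    return torch.cat(pieces, dim=0)  # (seq_len, llm_dim) fed to the LLM
```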
Streaming Speech Generation. We adopt CosyVoice [15] as a high-quality speech decoder for speech generation. To support user-friendly streaming decoding, we detect the generated text tokens in real time and truncate the sentence once punctuation is encountered. The completed sentence is then fed into the speech decoder for audio synthesis, so Ola does not need to wait for the whole response to finish while supporting streaming decoding. Although several attempts [17, 72] have been made to train the speech generation module end-to-end, an external text-to-speech decoder is a more efficient, high-quality, and training-free solution for omni-modal models.
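The sentence-wise streaming logic can be sketched as a simple generator that buffers streamed text tokens until sentence-ending punctuation is found and hands each completed sentence to the external speech decoder; `tts_decode` below is a placeholder for a CosyVoice-style synthesizer, not its real API.

```python
# Sketch of sentence-wise streaming speech decoding, under the assumptions above.
from typing import Callable, Iterable, Iterator

SENTENCE_END = {".", "!", "?", ";", "。", "！", "？", "；"}


def stream_speech(text_tokens: Iterable[str],
                  tts_decode: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield synthesized audio chunks sentence by sentence,
    without waiting for the full response to finish."""
    buffer = []
    for token in text_tokens:
        buffer.append(token)
        if token and token[-1] in SENTENCE_END:
            sentence = "".join(buffer).strip()
            buffer.clear()
            if sentence:
                yield tts_decode(sentence)
    if buffer:  # flush any trailing text without punctuation
        yield tts_decode("".join(buffer).strip())
```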
3.2. Progressive Omni-Modal Alignment
Rethinking Modality Gaps among Language, Vision, and Audio. From our exploration, we recognize two critical issues in omni-modal training. 1) Modality balancing. As illustrated in Fig. 3, directly combining data from all modalities negatively affects benchmark performance. Therefore, we propose a rational training procedure that progressively equips the Ola model with each sensory modality. We regard texts and images as the core modalities in omni-modal learning, while speech and video are variants of text and images, respectively. Learning to recognize texts and images ensures the model's basic cross-modal ability, so we prioritize these harder cases. Subsequently, we gradually incorporate video, audio, and speech into the training of the omni-modal LLM. 2) Connections between audio and vision. Another problem lies in building connections between audio and vision, which has been overlooked by previous works. However, jointly learning from audio and vision data can yield surprising benefits in omni-modal learning by providing a more comprehensive view across different modalities. For the Ola model, we consider video as the bridge between audio and vision, as videos contain natural, abundant, and highly relevant information between the frames and the accompanying audio. We test our hypothesis by optimizing the training pipeline and preparing targeted training data, as introduced below.
Stage 1: Text-Image Training. Ola training starts from a pre-trained Large Language Model, for which we use Qwen2.5-7B [64] in our implementation as a good trade-off between model size and performance. The Ola text-image training includes MLP alignment, large-scale pre-training, and supervised fine-tuning, following common practice in large-scale multi-modal learning [32, 67]. We initialize the visual MLP adapter and freeze all other parameters during MLP alignment with the image captioning task. Subsequently, we unfreeze all parameters, including the vision encoder, in the pre-training and supervised fine-tuning phases. The downsampling module is fully trained in the text-image stage to provide the 2x compression for visual data, including images and videos.
Stage 2: Continuous Training for Images and Videos. Based on the strong text-image multi-modal LLM, we continuously extend Ola's capability with video data. We keep most of the experimental settings from supervised fine-tuning while freezing the vision encoder in this stage, as the encoder is already fully trained beforehand. We mix the previous image data with the video data to preserve the original text-image performance.
Stage 3: Bridging Vision and Audio with Videos. The audio-relevant training takes place in stage 3. We follow the training strategy of the visual MLP adapter and initialize the audio MLP adapter with a basic automatic speech recognition (ASR) task. We then mix text & speech understanding, text & music understanding, joint audio & video comprehension, and the foremost text-image multi-modal tasks for the formal training. In this stage, Ola concentrates on learning audio recognition and on identifying the relationships between vision and audio, resulting in a comprehensive image, video, and audio understanding model after training.
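The three-stage schedule above can be summarized as a freeze/unfreeze plan over the model's modules. The sketch below is a condensed, assumption-laden rendering: the module names are illustrative, and which components remain trainable in the later stages beyond what the text states explicitly (e.g., whether the LLM stays unfrozen in stage 3) is our guess.

```python
# Condensed sketch of the progressive training schedule, under the
# assumptions stated above.
TRAINING_SCHEDULE = {
    "stage1_mlp_alignment":   {"trainable": ["visual_mlp"],
                               "data": ["image_captioning"]},
    "stage1_pretrain_sft":    {"trainable": ["llm", "vision_encoder", "visual_mlp"],
                               "data": ["text_image"]},
    "stage2_image_video":     {"trainable": ["llm", "visual_mlp"],  # encoder frozen
                               "data": ["image", "video"]},
    "stage3_audio_alignment": {"trainable": ["audio_mlp"],
                               "data": ["asr"]},
    "stage3_omni":            {"trainable": ["llm", "visual_mlp", "audio_mlp"],
                               "data": ["speech", "music", "video_audio", "image"]},
}


def apply_stage(model, stage: str) -> None:
    """Freeze everything, then unfreeze only the modules listed for `stage`."""
    trainable = TRAINING_SCHEDULE[stage]["trainable"]
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in trainable)
```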
3.3. Data
The training data of Ola includes general supervised fine-tuning data collected from open-source academic datasets in the image, video, and audio categories. Additionally, we design a pipeline to generate cross-modal video-audio data for omni-modal alignment.
Image Data. We follow the simple setting in [38] for image MLP alignment. The MLP alignment data includes 800k image captioning pairs from the LAION dataset [62]. For the large-scale pre-training phase, we collect around 20M text-image data pairs from open-sourced and in-house data to build the basic capability of the model. For text-image supervised fine-tuning data, we collect abundant data from various tasks including captioning, conversation, OCR, etc. The training data mixes LLaVA-OneVision [32], Cauldron [31], Cambrian-1 [67], MAmmoTH-VL [26], PixMo [12], etc., resulting in 7.3M image training samples in total.
Video Data. For text-video training data, we collect useful video datasets from LLaVA-Video-178K [84], VideoChatGPT-Plus [47], LLaVA-Hound [82], and Cinepile [60], with 1.9M video conversation pieces in total. We randomly sample 2/3 of the video-language data pairs from LLaVA-Video-178K, resulting in 1.2M high-quality training samples, and use the full set of the other data sources. In stage 2, for multi-image training, we randomly sample 0.8M image samples from stage 1 and mix them with the video datasets for continuous training to maintain the basic performance.
Audio Data. We prepare audio training data for comprehensive speech and music understanding. For text-speech understanding, we design multiple tasks including ASR from the LibriSpeech [54] and GigaSpeech [5] datasets, audio captioning from the AudioCaps [30] and Clotho [14] datasets, speech question answering from the LibriSpeech [54] dataset, and audio question answering from the WavCaps [49] and AudioCaps [30] datasets. For text-music understanding, we collect tasks including music captioning from MusicCaps [1] and music question answering from Million Song [48] and MusicNet [66]. The overall audio training data comprises 1.1M samples. The relevant text question-answering representations are collected from SALMONN [63].
Generating Cross-Modal Video Data. Most existing video training data are annotated or synthesized solely from frame inputs, often overlooking the valuable information in the accompanying audio. To address this, we design a pipeline to generate cross-modal video data, aiming to uncover the intrinsic relationships between video and audio and to guide an omni-modal large language model in learning cross-modality information. Specifically, we develop two tasks for cross-modal learning: video-audio question answering and video speech recognition. We collect videos from the academic dataset LLaVA-Video-178K [84] and the open-ended video dataset FineVideo [18]. Due to the lack of subtitles in the academic dataset, we use Whisper-v3 [58] to generate subtitles from the video audio and conduct a language-based cleaning procedure. We then employ a large language model to assess whether the subtitles are complete and informative. We gather 41k pure videos from LLaVA-Video-178K, along with the original 42k videos from FineVideo. Subsequently, we use Qwen2-VL-72B [57] to generate questions and answers based on the video and the corresponding subtitles, instructing the model to focus on the subtitle inputs while using the videos as supplementary information. We create three question-answer pairs for each video, resulting in 243k cross-modal video-audio data points. Additionally, we include the original video subtitling task with 83k training samples to help the model maintain its ASR ability in noisy environments. During training, the model processes multiple image frames, audio, and text inputs simultaneously, significantly enhancing its cross-modal capabilities.
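The data-generation pipeline can be read as three steps: subtitle generation, LLM-based filtering, and subtitle-grounded QA generation. The sketch below reflects that reading; `transcribe`, `is_informative`, and `generate_qa` are placeholders for the Whisper-based ASR step, the LLM subtitle filter, and the Qwen2-VL-72B QA generation, and their signatures are assumptions.

```python
# High-level sketch of the cross-modal video-audio data pipeline, under the
# assumptions stated above.
from typing import Callable


def build_cross_modal_samples(videos: list[dict],
                              transcribe: Callable[[str], str],
                              is_informative: Callable[[str], bool],
                              generate_qa: Callable[[str, str], list[dict]],
                              pairs_per_video: int = 3) -> list[dict]:
    samples = []
    for video in videos:
        # 1) Generate subtitles from the video's audio track.
        subtitle = video.get("subtitle") or transcribe(video["audio_path"])
        # 2) Language-based cleaning: keep only complete, informative subtitles.
        if not is_informative(subtitle):
            continue
        # 3) QA generation grounded mainly in the subtitle, with the frames as
        #    supplementary context; three pairs per video as in the text.
        for qa in generate_qa(video["video_path"], subtitle)[:pairs_per_video]:
            samples.append({"video": video["video_path"],
                            "audio": video["audio_path"],
                            "question": qa["question"],
                            "answer": qa["answer"]})
    return samples
```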
Table 1. Main Results across Image, Video, and Audio Understanding Benchmarks. We select representative image, video, and audio benchmarks, together with mainstream state-of-the-art open-source large language models in each modality. We also include open-source omni-modal LLMs for comparison. In the table, "−" indicates that the model is theoretically capable of solving the task but the result is missing, while "✗" indicates that the model is not capable of the task. $\downarrow$ indicates that a lower score is better. *LLaMA-Omni is not optimized for ASR and thus cannot produce reasonable results on this task.
The prepared 324k cross-modal video data is mixed with the audio data described above for stage 3 training: we combine all 1.1M pure text-audio training samples and the 324k cross-modal video-audio samples for the comprehensive training stage. Additionally, we sample 400k image samples from stage 1 to maintain the basic ability and create 200k image samples with voice instructions to equip the model with interaction capability.
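For reference, the stage-3 mixture described above can be written as a simple specification; the field names are ours and the counts are the approximate sample numbers stated in the text.

```python
# Stage-3 training mixture, written out as a plain specification.
STAGE3_MIXTURE = {
    "text_audio":              1_100_000,  # speech + music understanding data
    "cross_modal_video_audio":   324_000,  # generated QA + video ASR subsets
    "image_resampled":           400_000,  # sampled from stage 1 to retain ability
    "image_voice_instruction":   200_000,  # image data with voice instructions
}
```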
4. Experiments
We conduct comprehensive benchmarking in Sec. 4.2 to evaluate the Ola model on representative benchmarks for image, video, and audio understanding. Subsequently, we present detailed results on critical benchmarks in Sec. 4.3 to demonstrate the effectiveness of our design regarding motivation, training, and data preparation.
4.1. Implementation Details
The Ola model builds upon the Qwen2.5-7B [64] framework, incorporating OryxViT [42] (initialized from SigLIP-400M [79]) as the vision encoder, Whisper-V3-Large [58] as the speech encoder, and BEATs-AS2M(cpt2) [8] as the music encoder. Initially, we employ a relatively high learning rate of 1e-3 for MLP adapter pre-training. During supervised fine-tuning, the learning rate is gradually reduced from 2e-5 for text-image and multi-image training to 1e-5 for video-audio training. We use a batch size of 256 for fine-tuning, leveraging 64 NVIDIA A800 GPUs for training. We adopt a 10x downsampling rate for audio features to reduce the token length, resulting in 300 tokens per minute. During training and inference, the maximum token length is set to 16384 and the maximum number of audio chunks is set to 25.
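For convenience, the main settings listed above are collected into a single configuration sketch; the field names are illustrative and do not correspond to any particular training framework.

```python
# Summary of the Sec. 4.1 settings as a configuration dictionary.
OLA_TRAIN_CONFIG = {
    "base_llm": "Qwen2.5-7B",
    "vision_encoder": "OryxViT (init. from SigLIP-400M)",
    "speech_encoder": "Whisper-V3-Large",
    "music_encoder": "BEATs-AS2M(cpt2)",
    "lr_mlp_alignment": 1e-3,
    "lr_text_image_sft": 2e-5,
    "lr_video_audio_sft": 1e-5,
    "batch_size": 256,
    "gpus": "64x NVIDIA A800",
    "audio_downsample_rate": 10,   # roughly 300 audio tokens per minute
    "max_token_length": 16_384,
    "max_audio_chunks": 25,
}
```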
4.2. Results on Omni Understanding
Benchmarks. We conduct extensive comparisons across image, video, and audio understanding benchmarks to demonstrate the omni-modal capabilities of the Ola model. For image benchmarks, we utilize comprehensive understanding datasets including MMBench-1.1 [40], MMMU [77], MMStar [6], MathVista [46], HallusionBench [25], AI2D [29], and OCRBench [41]. In the video domain, we evaluate on VideoMME [21], which contains multiple-choice questions on videos of varying lengths, LongVideoBench [69] for assessing performance on extremely long video content, and MVBench [34] for general recognition ability. For audio benchmarks, we focus on two primary tasks relevant to audio LLMs. LibriSpeech [54] serves as a traditional audio-speech recognition (ASR) dataset, testing the model's ability to accurately transcribe spoken language. AIR-Bench [74] provides a comprehensive evaluation of audio question-answering capabilities, incorporating speech, sound, and music inputs; the responses are evaluated with a GPT-based [52] scorer against ground-truth answers. We report the mainstream evaluation metric on image and video benchmarks and, for simplicity, the mean metric for ASR and audio understanding tasks in Tab. 1.
Baselines. We select a range of state-of-the-art multi-modal large language models across different modalities for comparison and reference. We categorize vision-language models into three groups: image-centric LLMs, video-centric LLMs, and comprehensive LLMs capable of handling both images and videos. For image understanding, we use Cambrian-1 [67] and Pixtral-12B [2]. For video understanding, VideoCCAM [20] and LLaVA-Video [84] are employed. Comprehensive models include LLaVA-OneVision [32], MiniCPM-V 2.6 [75], InternVL2.5 [9], and Qwen2.5-VL [65], which excel across various visual benchmarks. In the audio domain, we compare our work with state-of-the-art models such as SALMONN [63] and Qwen2-Audio [11]. As an omni-modal LLM, Ola is compared with state-of-the-art open-source omni-modal LLMs such as Mini-Omni2 [72], VITA-1.5 [22], and InternLM-XComposer2.5-OmniLive [81], which support image, audio, and text inputs. Additionally, LLaMA-Omni [17], an audio-text omni model, is noted for its strong speech generation capabilities.
Table 2. Analysis Results on Audio Benchmarks. We report the WER on the test-clean, test-other, dev-clean, and dev-other subsets of LibriSpeech, and the scores on AIR-Bench. In the table, "−" indicates that the model is capable of solving the task but the result is missing, while "✗" indicates that the model is not capable of the task.
Results. We present the comprehensive results in Tab. 1, highlighting Ola's competitive performance across major multi-modal benchmarks when compared to state-of-the-art specialized LLMs. Specifically, on image benchmarks, Ola achieves 84.3% on MMBench-1.1 [40], 70.8% on MMStar [6], 57.0% on MMMU [77], 68.4% on MathVista [46], 86.1% on AI2D [29], and 827 on OCRBench [41], surpassing related multi-modal LLMs with a similar number of parameters. On video benchmarks, Ola attains an impressive 68.4% on VideoMME [21], showcasing its robust capability to handle video and audio inputs simultaneously and setting a new state-of-the-art among 7B models on the VideoMME benchmark. Ola also maintains a leading position on LongVideoBench [69] and MVBench [34] compared with mainstream video LLMs including LLaVA-Video [84] and VideoCCAM [20]. On audio benchmarks, Ola demonstrates strong audio-speech recognition and conversational abilities, with a 3.1% mean WER on LibriSpeech [54] and a 6.41 mean score on AIR-Bench [74], outperforming existing omni-modal LLMs, including LLaMA-Omni [17], which focuses on audio understanding. These results indicate a significant advancement over current omni-modal LLMs, underscoring the effectiveness of Ola's training approach.
4.3. Analysis
For the analysis, we report detailed results on audio benchmarks to illustrate fine-grained performance. We also validate our designs for training and cross-modal training data with ablations on critical benchmarks. Finally, we present qualitative showcases of Ola.
Analysis on Audio Benchmarks. To d