[论文翻译]OSUM: 在学术界有限资源下推动开放语音理解模型的发展


原文地址:https://arxiv.org/pdf/2501.13306v2


OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

OSUM: 在学术界有限资源下推动开放语音理解模型的发展

Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

音频、语音与语言处理组 (ASLP@NPU) 西北工业大学计算机学院

Abstract

摘要

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an $\mathrm{ASR+X}$ training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

大语言模型 (LLMs) 在各种下游任务中取得了显著进展,推动了语音理解语言模型 (SULMs) 的发展,以实现基于语音的全面交互。然而,大多数先进的 SULMs 是由行业开发的,利用了学术界难以获得的大规模数据集和计算资源。此外,训练细节缺乏透明度,进一步阻碍了创新。在本研究中,我们提出了 OSUM,一种开放语音理解模型,旨在探索在受限的学术资源下训练 SULMs 的潜力。OSUM 模型结合了 Whisper 编码器和 Qwen2 大语言模型,支持多种语音任务,包括语音识别 (ASR)、带时间戳的语音识别 (SRWT)、语音事件检测 (VED)、语音情感识别 (SER)、说话风格识别 (SSR)、说话者性别分类 (SGC)、说话者年龄预测 (SAP) 和语音到文本聊天 (STTC)。通过采用 $\mathrm{ASR+X}$ 训练策略,OSUM 通过同时优化 ASR 和目标任务,实现了高效稳定的多任务训练。除了提供强大的性能外,OSUM 还通过公开数据准备和训练方法,强调了透明度,为学术界提供了宝贵的见解和实用指导。通过这样做,我们旨在加速先进 SULM 技术的研究和创新。

1 Introduction

1 引言

Large language models (LLMs) have shown tremendous progress towards Artificial General Intelligence (AGI) in recent years. Given the inherent human preference for speech-based interaction, there has been growing interest in extending LLMs with speech capabilities to develop Speech LLMs. To generate fluent and expressive text or speech responses, Speech LLMs must fully comprehend input speech, including both its semantic content and paralinguistic information, such as emotion, speaking style, speaker gender, and age. Moreover, this comprehension ability is also crucial for audio data labeling. Currently, the mainstream multi-label generation approach is to use multiple models to label each task separately, which consumes extremely high computational resources. A labeling model capable of accurately generating multiple labels simultaneously holds broad application prospects.

大语言模型 (LLMs) 近年来在通用人工智能 (AGI) 领域取得了巨大进展。鉴于人类对基于语音的交互的天然偏好,扩展大语言模型的语音能力以开发语音大语言模型 (Speech LLMs) 的兴趣日益增长。为了生成流畅且富有表现力的文本或语音响应,语音大语言模型必须充分理解输入的语音,包括其语义内容以及副语言信息,如情感、说话风格、说话者性别和年龄。此外,这种理解能力对于音频数据标注也至关重要。目前,主流的多标签生成方法是使用多个模型分别标注每个任务,这会消耗极高的计算资源。能够同时准确生成多个标签的标注模型具有广泛的应用前景。

The area of Speech Understanding Language Models (SULMs) has seen notable advancements through projects such as Qwen-Audio (Chu et al., 2023), Qwen2-Audio (Chu et al., 2024), PandaGPT (Su et al., 2023), and SALMONN (Tang et al., 2024). Whisper (Radford et al., 2023) marks a pioneering exploration of speech understanding independent of LLMs, utilizing an encoder-decoder Transformer (Vaswani, 2017) architecture to tackle a variety of speech tasks, such as automatic speech recognition (ASR), speech translation (ST), language identification (LID), and voice activity detection (VAD). Building on Whisper's design, SenseVoice (An et al., 2024) and TouchASP (Song et al., 2024) extend to more tasks, such as speech emotion recognition (SER) and audio event detection (AED), further enriching their ability to process and comprehend human speech. Qwen-Audio integrates Whisper's encoder with the text-based Qwen LLM (Bai et al., 2023), enabling the latter to understand speech. Compared to Whisper, Qwen-Audio leverages a more powerful LLM decoder and performs over 30 speech-related tasks, making it a representative model in the field of SULMs. Its successor, Qwen2-Audio, further enhances these capabilities by supporting natural language prompts and achieving superior performance across various benchmarks (Chu et al., 2024).

专注于语音理解语言模型 (Speech Understanding Language Models, SULMs) 的领域,通过 Qwen-Audio (Chu et al., 2023)、Qwen2-Audio (Chu et al., 2024)、PandaGPT (Su et al., 2023) 和 SALMONN (Tang et al., 2024) 等项目取得了显著进展。Whisper (Radford et al., 2023) 标志着在不依赖大语言模型的情况下进行语音理解的开创性探索,它利用编码器-解码器 Transformer (Vaswani, 2017) 架构来处理多种语音任务,例如自动语音识别 (ASR)、语音翻译 (ST)、语言识别 (LID) 和语音活动检测 (VAD)。基于 Whisper 的设计,SenseVoice (An et al., 2024) 和 TouchASP (Song et al., 2024) 扩展了更多任务,如语音情感识别 (SER) 和音频事件检测 (AED),进一步丰富了它们处理和理解人类语音的能力。Qwen-Audio 将 Whisper 的编码器与基于文本的 Qwen 大语言模型 (Bai et al., 2023) 集成,使后者能够理解语音。与 Whisper 相比,Qwen-Audio 利用了更强大的大语言模型解码器,并执行了超过 30 项与语音相关的任务,使其成为 SULMs 领域的代表性模型。其继任者 Qwen2-Audio 通过支持自然语言提示并在各种基准测试中取得优异性能,进一步增强了这些能力 (Chu et al., 2024)。


Figure 1: Comparison of Qwen2-Audio and our OSUM model. In most tasks, OSUM achieves a better performance than Qwen2-Audio despite using significantly fewer computational resources and training data. The values for each model's task in the radar chart are based on the average results on the public and internal test sets, as shown in Table 4 and Table 5.

图 1: Qwen2-Audio 和我们的 OSUM 模型对比。在大多数任务中,OSUM 的表现优于 Qwen2-Audio,尽管其使用的计算资源和训练数据显著减少。雷达图中每个模型的任务值基于公开和内部测试集的平均结果,如表 4 和表 5 所示。

Although these advanced SULMs have achieved remarkable progress, most of them are developed by industry, leveraging millions of hours of training data and massive GPU resources. For instance, TouchASP and SenseVoice utilized 1,000,000 and 400,000 hours of training data, respectively. Such large-scale resources are typically beyond the reach of academic institutions. Furthermore, while inference models are often open-sourced, essential details regarding data preparation, training strategies, codebases, and hyperparameter configurations are rarely disclosed. These limitations hinder the academic community's efforts to further optimize and expand SULM research. Recently, a growing movement advocating for open science in Speech LLM research has emerged. This movement emphasizes the importance of releasing comprehensive training frameworks, datasets, and methodological details to promote research and innovation. A notable example is the Open Whisper-style Speech Model (OWSM) series (Peng et al., 2023), which replicates Whisper-style training using open-sourced tools and publicly available data, significantly advancing public understanding and research on speech understanding models.

尽管这些先进的 SULM 取得了显著进展,但大多数是由行业开发的,利用了数百万小时的训练数据和大量的 GPU 资源。例如,TouchASP 和 SenseVoice 分别使用了 1,000,000 和 400,000 小时的训练数据。如此大规模的资源通常是学术机构无法企及的。此外,虽然推理模型通常是开源的,但关于数据准备、训练策略、代码库和超参数配置的关键细节很少被披露。这些限制阻碍了学术界进一步优化和扩展 SULM 研究的努力。最近,在语音大语言模型研究中,倡导开放科学的运动逐渐兴起。该运动强调发布全面的训练框架、数据集和方法细节的重要性,以促进研究和创新。一个显著的例子是 Open Whisper-style Speech Model (OWSM) 系列 (Peng et al., 2023),它使用开源工具和公开数据复制了 Whisper 风格的训练,极大地推动了公众对语音理解模型的理解和研究。


Figure 2: The overview of the architecture and tasks of OSUM.

图 2: OSUM 的架构和任务概览。

In this study, we aim to foster broader academic exploration of SULMs with limited resource demands, encouraging wider research community participation. To this end, we introduce OSUM, an open SULM with its data processing pipeline and training details publicly available. The OSUM model integrates a Whisper speech encoder, fine-tuned on a multi-task dataset, with a Qwen2 LLM. It is capable of performing a wide range of speech tasks, including automatic speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). Notably, SSR is a distinctive feature of our OSUM model and serves as a vital component of speech understanding. It enhances the model's capability by improving contextual comprehension and boosting performance across various downstream speech tasks. Furthermore, it establishes a foundation for enabling more natural and context-aware speech-based interactions. We adopt an $\mathrm{ASR+X}$ training strategy to enhance training stability and reduce resource consumption for our SULM, wherein an auxiliary ASR task is optimized alongside the primary target task (denoted as "X"). For instance, during the training of the SER task, we concurrently train the ASR task (ASR+SER) by predicting both transcription and emotion labels for each speech sample. This multi-task training accelerates modality alignment, enabling the LLM to effectively utilize both textual and acoustic modalities. Our OSUM model utilizes only 50,500 hours of training data and achieves comparable or superior performance to other SULMs. The overall performance of OSUM is illustrated in Fig. 1. The model is trained on Nvidia A6000 GPUs and Huawei Ascend NPUs, supporting inference on both platforms. The goal of this study is to foster transparency and accelerate progress in the field of SULMs by providing accessible tools and resources for the broader research community.

在本研究中,我们旨在促进对资源需求有限的SULMs进行更广泛的学术探索,鼓励更广泛的研究社区参与。为此,我们引入了OSUM,一个开放的SULM,其数据处理流程和训练细节公开可用。OSUM模型集成了在多任务数据集上微调的Whisper语音编码器和Qwen2大语言模型。它能够执行多种语音任务,包括自动语音识别(ASR)、带时间戳的语音识别(SRWT)、语音事件检测(VED)、语音情感识别(SER)、说话风格识别(SSR)、说话者性别分类(SGC)、说话者年龄预测(SAP)和语音到文本聊天(STTC)。值得注意的是,SSR是我们OSUM模型的一个独特特征,也是语音理解的重要组成部分。它通过提高上下文理解能力提升了模型在各种下游语音任务中的表现,并为实现更自然和上下文感知的语音交互奠定了基础。我们采用 $\mathrm{ASR+X}$ 训练策略来增强我们SULM模型的训练稳定性并减少资源消耗,其中辅助ASR任务与主要目标任务(表示为 "X")一起优化。例如,在训练SER任务时,我们同时训练ASR任务 (ASR+SER),即预测每个语音样本的转录和情感标签。这种多任务训练加速了模态对齐,使大语言模型能够有效利用文本和声学模态。我们的OSUM模型仅使用了50,500小时的训练数据,并取得了与其他SULMs相当或更优的性能。OSUM的整体性能如图1所示。该模型在Nvidia A6000 GPU和华为Ascend NPU上训练,并支持在这两种平台上进行推理。本研究的目标是通过为更广泛的研究社区提供可访问的工具和资源,促进SULMs领域的透明度和加速进展。

2 Methodology

2 方法论

This section introduces our proposed OSUM, a model designed for comprehensive speech understanding. Section 2.1 presents its architecture; Section 2.2 details its multitask training process. Section 2.3 and Section 2.4 provide an overview of the training data and processing pipeline, respectively.

本节介绍我们提出的 OSUM 模型,该模型旨在实现全面的语音理解。2.1 节介绍其架构;2.2 节详细说明其多任务训练过程。2.3 节和 2.4 节分别概述了训练数据和处理流程。

2.1 Model Architecture

2.1 模型架构

As shown in Figure 2, our OSUM model comprises a speech encoder, an adaptor, and an LLM. During training, all of the parameters in the encoder and adaptor are updated, while the LLM is fine-tuned with LoRA (Hu et al., 2022). The input of our model consists of a speech signal and a natural language prompt. Unlike Whisper (Radford et al., 2023) and Qwen-audio (Bai et al., 2023), which rely on instruct tags, OSUM employs descriptive text prompts covering all eight supported tasks, as shown in Fig. 2. Currently, our model supports only text-based responses, but audio output capabilities are under active development. The following sections describe each sub-module in detail.

如图 2 所示,我们的 OSUM 模型由语音编码器、适配器和大语言模型组成。在训练过程中,编码器和适配器中的所有参数都会更新,而大语言模型则使用 LoRA (Hu et al., 2022) 进行微调。我们的模型输入包括语音和自然语言提示。与依赖指令标签的 Whisper (Radford et al., 2023) 和 Qwen-audio (Bai et al., 2023) 不同,OSUM 使用描述性文本,将如图 2 所示的所有八项支持任务进行转换。目前,我们的模型仅支持基于文本的响应,但音频输出功能正在积极开发中。以下部分将详细描述每个子模块。

Speech Encoder Our OSUM utilizes the Whisper-Medium model as its speech encoder, which consists of 2 one-dimensional convolutional layers with 2× downsampling and 24 Transformer layers with a hidden dimension of 1024 and 16-head self-attention. The encoder has approximately 300 million parameters, a size that balances speech comprehension ability and inference efficiency.

语音编码器 我们的OSUM使用Whisper-Medium模型作为语音编码器,它由2个一维卷积层(具有2倍下采样)和24个Transformer层(具有1024个隐藏状态维度和16头自注意力机制)组成。该编码器大约有3亿个参数,这使得它既能兼顾语音理解能力,又能保证推理效率。

Adaptor The adaptor module features a hybrid architecture combining three 1D convolutional (Conv1D) layers and four Transformer layers. The Conv1D layers use kernel widths of (3, 3, 3) and strides of (1, 2, 2), achieving an overall 4× downsampling. The Transformer layers have a model dimension of 1,280, an inner dimension of 2,560, and 4 attention heads. This architecture bridges the output of the speech encoder with the input requirements of the LLM, enabling efficient modality alignment.

适配器 (Adaptor)
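The downsampling figures quoted for the encoder and adaptor can be checked with the standard Conv1D output-length formula. The sketch below is only illustrative: the padding of 1 and the 3000-frame (30 s) mel input are assumptions carried over from the public Whisper design rather than values stated in this section, and the helper names are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def stack_out_len(length: int, kernels, strides, padding: int = 1) -> int:
    """Apply several Conv1D layers in sequence and return the final length."""
    for kernel, stride in zip(kernels, strides):
        length = conv1d_out_len(length, kernel, stride, padding)
    return length


# Assumption: 30 s of audio at a 10 ms hop gives 3000 mel frames.
MEL_FRAMES = 3000

# Encoder front end: two conv layers with an overall 2x downsampling.
encoder_frames = stack_out_len(MEL_FRAMES, kernels=(3, 3), strides=(1, 2))

# Adaptor: kernels (3, 3, 3) and strides (1, 2, 2), i.e. a further 4x.
adaptor_frames = stack_out_len(encoder_frames, kernels=(3, 3, 3), strides=(1, 2, 2))

print(encoder_frames, adaptor_frames)  # 1500 375
```

Under these assumptions the LLM receives 375 frames for a 30-second input, or one frame per 80 ms of speech.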

LLM with LoRA Qwen2-7B-Instruct is selected as our LLM. It is a general-purpose LLM with 7 billion parameters, specifically designed for multi-task instruction optimization. In our work, we fine-tune the Qwen2-7B-Instruct model using LoRA (Low-Rank Adaptation). The LoRA hyperparameters $\alpha$, rank, and dropout ratio are set to 32, 8, and 0.1, respectively.

使用 LoRA 的大语言模型
我们选择了 Qwen2-7B-Instruct 作为我们的大语言模型。Qwen2-7B-Instruct 是一个参数规模为 70 亿的通用大语言模型,专门为多任务指令优化设计。在我们的工作中,我们使用 LoRA (Low-Rank Adaptation) 技术对 Qwen2-7B-Instruct 模型进行了微调。LoRA 的超参数 $\alpha$、rank 和 dropout ratio 分别设置为 32、8 和 0.1。
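To make the LoRA hyperparameters concrete, the following sketch computes the update scaling factor and the number of extra trainable parameters for a single adapted weight matrix. It uses the standard LoRA formulation, in which the low-rank update is scaled by alpha / rank. The 4096 x 4096 matrix size is a hypothetical example, not a Qwen2 dimension, and the text does not say which LLM layers receive LoRA adapters.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


ALPHA, RANK, DROPOUT = 32, 8, 0.1  # values reported in the text
D = 4096  # hypothetical layer width, for illustration only

print(lora_scaling(ALPHA, RANK))       # 4.0
print(lora_extra_params(D, D, RANK))   # 65536
print(D * D)                           # 16777216 frozen weights in the same matrix
```

In this example the adapter trains well under 1% of the parameters of the matrix it modifies, which is what keeps LLM fine-tuning affordable in a limited-resource setting.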

2.2 Multitask Supervised Training

2.2 多任务监督训练

The training procedure includes two stages. First, we perform multi-task supervised fine-tuning on the original Whisper model without an LLM. Second, we integrate the fine-tuned Whisper encoder with the Qwen2 LLM to create the complete OSUM system, then conduct further supervised training using a larger dataset.

训练过程包括两个阶段。首先,我们在没有大语言模型的情况下对原始 Whisper 模型进行多任务监督微调。其次,我们将微调后的 Whisper 编码器与 Qwen2 大语言模型集成,创建完整的 OSUM 系统,然后使用更大的数据集进行进一步的监督训练。

Whisper Fine-tuning The original Whisper model supports a limited scope of speech-understanding tasks, which makes the direct integration of the Whisper with an LLM for multi-task training risky when data and computation resources are constrained. Therefore, we first fine-tune the Whisper via multi-task data to ensure faster convergence of the OSUM model. Furthermore, this stage allows us to verify the reliability of our multi-task data. Specifically, we expand Whisper's instruction tag set to accommodate more tasks. Each forward pass executes only a single task.

Whisper 微调

OSUM Training Training SULMs typically begins with pre-training on an ASR task, which serves as a foundation for incorporating additional speech tasks and enables the LLM to process semantic content from the speech encoder. Given computational constraints, we introduce an $\mathrm{ASR+X}$ paradigm for OSUM's multitask training. It concurrently trains ASR and a secondary task "X", accelerating training while allowing the "X" task to utilize both textual and acoustic features, thereby potentially improving performance. The $\mathrm{ASR+X}$ paradigm follows a two-step process: first, transcribing speech to text (ASR); then, integrating this transcription with acoustic features to execute the target task (X). This is achieved within the LLM's autoregressive framework by adjusting the predicted labels, without modifications to the model architecture or loss functions. We implement the $\mathrm{ASR+X}$ paradigm by prompting the LLM with natural language prompts. ChatGPT is used to generate 5 candidate prompts for each task, one of which is randomly selected during training. Table 1 shows examples of the prompts and $\mathrm{ASR+X}$ prediction labels.

OSUM 训练
训练 SULMs 通常从 ASR 任务的预训练开始,作为整合额外语音任务的基础,使大语言模型能够处理来自语音编码器的语义内容。考虑到计算限制,我们为 OSUM 的多任务训练引入了一个 $\mathrm{ASR+X}$ 范式。它同时训练 ASR 和一个次要任务 "X",加速训练的同时允许 "X" 任务利用文本和声学特征,从而可能提高性能。$\mathrm{ASR+X}$ 范式遵循两步过程:首先,将语音转录为文本(ASR);然后,将此转录与声学特征结合以执行目标任务(X)。这是在大语言模型的自回归框架内通过调整预测标签实现的,无需修改模型架构或损失函数。我们通过使用自然语言提示来实现 $\mathrm{ASR+X}$ 范式。ChatGPT 用于为每个任务生成 5 个候选提示,训练期间随机选择一个。表 1 展示了提示和 $\mathrm{ASR+X}$ 预测标签的示例。
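As a rough illustration of how an ASR+X training target could be assembled, the sketch below appends a task tag to the transcript and picks one of five candidate prompts at random. The label format follows the SSR example in Table 1. The function names and the placeholder prompt strings are hypothetical and are not taken from the OSUM codebase.

```python
import random
from typing import Optional, Sequence


def build_asr_x_label(transcript: str, x_label: Optional[str] = None) -> str:
    """Return the target text: the transcript alone for ASR, or
    'transcript<label>' for an ASR+X task such as SER or SSR."""
    return transcript if x_label is None else f"{transcript}<{x_label}>"


def sample_prompt(candidates: Sequence[str], rng: random.Random) -> str:
    """Pick one of the candidate prompts for this training sample."""
    return rng.choice(candidates)


# Placeholders standing in for the 5 generated prompts of one task.
ssr_prompts = [f"(hypothetical SSR prompt variant {i})" for i in range(1, 6)]

rng = random.Random(0)
sample = {
    "prompt": sample_prompt(ssr_prompts, rng),
    "label": build_asr_x_label("小猪皮革恍然大悟赶紧把这颗白色的小药片吞进肚子里", "童话故事"),
}
print(sample["label"])
```

Because the tag comes after the transcript, the autoregressive LLM has already produced the text by the time it predicts the X label, which is how the target task can draw on both textual and acoustic information without any architectural change.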

Table 1: Example of prompts and $\mathrm{ASR+X}$ labels in OSUM multitask training.

表 1: OSUM 多任务训练中的提示词和 $\mathrm{ASR+X}$ 标签示例。

| Task | Input prompt | Label |
| --- | --- | --- |
| ASR | 转录这段音频中的语音内容为文字。 | 播放小梦想大梦想 |
| SRWT | 请将以下音频进行转录,同时标记出每个英文单词及对应中文字符的起始与结束时间,时间单位精确到0.1秒,并用<>来表示这些时间范围。 | <0.21>父<0.47><0.46>母<0.60><0.60>的<60><980><980><0><0> |
| VED | 请将以下音频进行转录,并在结尾处给出<音频事件>标签,音频事件分为8类:laugh, cough, cry, screaming, sigh, throat clearing, sneeze, other。 | 我的属性为了完成我的使命everyday每个流血不止 |
| SER | 请将以下音频内容进行转录,并在结尾处给出<情感>标签,情感共分为8类:sad, anger, neutral, happy, surprise, fear, disgust, 以及 other。 | 你一个享受者你有什么跟这个复仇者去叫板呢 |
| SSR | 请将以下音频内容进行转录,并在结尾处给出<风格>标签,风格共分为8类:新闻科普,恐怖故事,童话故事,客服,诗歌散文,有声书,日常口语,其他。 | 小猪皮革恍然大悟赶紧把这颗白色的小药片吞进肚子里<童话故事> |
| SGC | 请将这段音频进行转录,并在文本末尾附加<性别>标签。性别分为两种:female, male。 | 帮我调大声音 |
| SAP | 请将以下音频内容转录成文字,并在文字的最后加上<年龄>标签,标明是child、adult还是old。 | 在黑板上写字的老师一转身 |
| STTC | 首先将语音转化为书面语,随后以<开始回答>作分隔,最后对语音内容做出答复。 | 我感觉不太满意<开始回答>抱歉,我们会让你满意的。 |
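The label formats in Table 1 are plain text, so model outputs need to be parsed back into structured fields at evaluation time. The sketch below shows one possible way to do this with regular expressions. The function names are hypothetical, and the SRWT example uses only the cleanly readable prefix of the label shown in the table.

```python
import re
from typing import List, Optional, Tuple

TAG_RE = re.compile(r"(.*)<([^<>]+)>", re.S)
SPAN_RE = re.compile(r"<(\d+(?:\.\d+)?)>([^<>]+)<(\d+(?:\.\d+)?)>")


def split_asr_x(output: str) -> Tuple[str, Optional[str]]:
    """Split 'transcript<label>' into (transcript, label).

    Returns (output, None) when no trailing tag is present.
    """
    match = TAG_RE.fullmatch(output)
    if match is None:
        return output, None
    return match.group(1), match.group(2)


def parse_srwt(output: str) -> List[Tuple[str, float, float]]:
    """Parse '<start>token<end>' spans into (token, start, end) tuples."""
    return [(tok, float(start), float(end)) for start, tok, end in SPAN_RE.findall(output)]


print(split_asr_x("小猪皮革恍然大悟赶紧把这颗白色的小药片吞进肚子里<童话故事>")[1])  # 童话故事
print(parse_srwt("<0.21>父<0.47><0.46>母<0.60>"))
```

The second call returns one tuple per character with its start and end time in seconds, which is the form needed to score timestamps against a reference alignment.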

2.3 Training Data

2.3 训练数据

Our OSUM is designed to perform multi-task training using diverse speech datasets, with the goal of building a unified model capable of comprehensively understanding input speech in conversational scenarios. The multi-task training process enables tasks to benefit from shared learning, enhancing overall model performance. Upon completion of training, OSUM can be utilized for speech data annotation or further extended into a conversational Speech LLM. Detailed information about the datasets used for training is provided in Table 2.

我们的 OSUM 旨在使用多样化的语音数据集进行多任务训练,目标是构建一个能够在对话场景中全面理解输入语音的统一模型。多任务训练过程使任务能够从共享学习中受益,从而提升整体模型性能。训练完成后,OSUM 可用于语音数据标注或进一步扩展为对话式语音大语言模型。训练所用数据集的详细信息见表 2。

2.4 Data Processing Pipeline

2.4 数据处理管道

The data processing pipeline is crucial for training multi-task SULMs. In this section, we reveal the data processing schemes used for each task in the OSUM project, with the aim of providing a valuable reference for academic research.

数据处理流程对于训练多任务 SULM 至关重要。在本节中,我们将揭示 OSUM 项目中每项任务所使用的数据处理方案,旨在为学术研究提供有价值的参考。

ASR The training data include publicly available resources such as WenetSpeech (Zhang et al., 2022), AISHELL-1 (Bu et al., 2017), AISHELL-2 (Du et al., 2018), and LibriSpeech (Panayotov et al., 2015), along with our internal ASR dataset, resulting in a total of 24,000 hours.

ASR 训练数据包括公开可用的资源,如 WenetSpeech (Zhang et al., 2022)、AISHELL-1 (Bu et al., 2017)、AISHELL-2 (Du et al., 2018) 和 LibriSpeech (Panayotov et al., 2015),以及我们内部的 ASR 数据集,总计 24,000 小时。

Table 2: Details of the multi-task training data for OSUM. The total duration is 50,500 hours. The total training data includes both open-sourced datasets and the data we processed.

SRWT For the SRWT task, a conventional ASR model based on a Gaussian Mixture Model - Hidden Markov Model (GMM-HMM) is used to perform forced alignment and obtain word-level timestamps. This model is trained on the 54,000-hour proprietary ASR dataset. To evaluate its performance, we establish an internal SRWT test set and assess alignment quality using the Average Alignment Score (AAS) metric (Shi et al., 2022). The GMM-HMM model achieves an AAS of 7.55, demonstrating its efficacy in generating reliable word-level timestamps.

SRWT任务中,采用基于高斯混合模型-隐马尔可夫模型 (GMM-HMM) 的传统 ASR 模型进行强制对齐,并获取词级时间戳。该模型在 54,000 小时的专有 ASR 数据集上训练。为评估其性能,我们建立了内部 SRWT 测试集,并使用平均对齐分数 (AAS) 指标 (Shi et al., 2022) 评估对齐质量。GMM-HMM 模型的 AAS 达到 7.55,表明其在生成可靠的词级时间戳方面具有高效性。

SSR Given the absence of open-sourced tools for annotating style labels directly from audio data, we leverage two text-based LLMs, Qwen2.5-14B and GLM-4-9B-Chat, to annotate speech transcriptions using carefully designed prompts. To enhance annotation accuracy and reliability, we retain only the intersection of labeling results from both models. This intersection-based approach ensures high-quality annotations for training the SSR task.

SSR 鉴于目前缺乏直接从音频数据中标注风格标签的开源工具,我们利用两个基于文本的大语言模型——Qwen2.5-14B 和 GLM-4-9B-Chat——通过精心设计的提示来标注语音转录。为了提高标注的准确性和可靠性,我们仅保留两个模型标注结果的交集。这种基于交集的方法确保了 SSR 任务训练所需的高质量标注。
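The intersection-based filtering can be sketched as a simple agreement check between two annotators. In this minimal example, the utterance IDs and the label assignments are made up for illustration, while the style names come from the SSR prompt in Table 1. The actual OSUM pipeline may differ in how it stores and matches labels.

```python
from typing import Dict


def intersect_labels(labels_a: Dict[str, str], labels_b: Dict[str, str]) -> Dict[str, str]:
    """Keep only utterances that both annotators labeled identically."""
    return {
        utt_id: label
        for utt_id, label in labels_a.items()
        if labels_b.get(utt_id) == label
    }


# Hypothetical outputs of two text-based LLM annotators.
model_a = {"utt1": "童话故事", "utt2": "客服", "utt3": "新闻科普"}
model_b = {"utt1": "童话故事", "utt2": "日常口语", "utt3": "新闻科普"}

kept = intersect_labels(model_a, model_b)
print(kept)  # {'utt1': '童话故事', 'utt3': '新闻科普'}
```

The trade-off is that disagreements are discarded rather than resolved, so label precision improves at the cost of a smaller training set.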

VED We have attempted to train a vocal event labeling tool; however, due to the limited availability of training data, its classification performance is suboptimal, especially when vocal events and speech occur within the same audio segment. Therefore, we employ a Voice Conversion (VC) tool to modify the timbre of vocal event audio and insert it randomly into speech audio, creating a dataset in $\mathrm{ASR+VED}$ format. We find that, with the assistance of VC, this approach effectively mitigates the overfitting caused by the scarcity of vocal event training data. The open-source vocal event datasets we use include AudioSet (Gemmeke et al., 2017), ESC-50 (Piczak, 2015), VocalSound (Gong et al., 2022), and Nonspeech7k (Rashid et al., 2023), while the ASR data consists solely of AISHELL-2 (Du et al., 2018).

VED:我们尝试训练了一个声音事件标注工具;然而,由于训练数据的有限性,其分类性能并不理想,特别是在声音事件和语音出现在同一音频片段时。因此,我们采用了语音转换(VC)工具来修改声音事件音频的音色,并将其随机插入到语音音频中,创建了一个 $\mathrm{ASR+VED}$ 格式的数据集。我们发现,这种方法在VC的帮助下有效缓解了由于声音事件训练数据稀缺导致的过拟合问题。我们使用的开源声音事件数据集包括 AudioSet (Gemmeke et al., 2017)、ESC-50 (Piczak, 2015)、VocalSound (Gong et al., 2022) 和 Nonspeech7k (Rashid et al., 2023),而ASR数据仅包括AISHELL-2 (Du et al., 2018)。
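The data construction step for ASR+VED can be pictured as splicing a vocal event clip into a speech waveform at a random position and tagging the transcript with the event class. The sketch below works on plain Python lists of samples. It assumes the event is spliced in rather than overlaid (the text only says "insert"), omits the voice conversion step entirely, and uses hypothetical names and toy data.

```python
import random
from typing import List, Tuple


def insert_event(speech: List[float], event: List[float],
                 rng: random.Random) -> Tuple[List[float], int]:
    """Splice an event clip into the speech samples at a random offset.

    Returns the mixed waveform and the insertion position in samples.
    """
    pos = rng.randint(0, len(speech))
    return speech[:pos] + event + speech[pos:], pos


rng = random.Random(0)
speech = [0.0] * 16      # toy stand-in for a speech waveform
laugh = [1.0, -1.0, 1.0]  # toy stand-in for a (voice-converted) event clip

mixed, pos = insert_event(speech, laugh, rng)
label = "帮我调大声音" + "<laugh>"  # ASR+VED target: transcript plus event tag

print(len(mixed), label)
```

A real pipeline would also need to match sample rates and loudness between the two signals before splicing.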

SER Emotion2Vec (Ma et al., 2024) is the first universal speech emotion representation model. Without additional fine-tuning, we directly apply the pre-trained Emotion2Vec+ Large model, which is trained on 40,000 hours of emotional speech data, to annotate the audio with emotion labels. Additionally, we leverage the GLM-4-9B-Chat model to generate emotion labels from the textual transcriptions of the speech. By intersecting these annotations, we generate high-quality emotional labels for the entire dataset.

SER Emotion2Vec (Ma 等, 2024) 是首个通用的语音情感表示模型。在不进行额外微调的情况下,我们直接应用预训练的 Emotion2Vec+ Large 模型,该模型基于 40,000 小时的语音情感数据进行训练,用于为音频标注情感标签。此外,我们利用 GLM-4-9B-Chat 模型从语音的文本转录中生成情感标签。通过交叉这些标注,我们为整个数据集生成了高质量的情感标签。

SGC Efforts to train a speaker gender classification model to label web-sourced data yielded unsatisfactory performance. Consequently, we discarded the pseudo-labeled data and relied exclusively on manually labeled datasets for training. For the SGC task, we select KeSpeech (Tang et al., 2021), Datatang-Kid (Datatang Tech Co Ltd, 2024), AISHELL-1 (Bu et al., 2017), AISHELL-2 (Du et al., 2018), LibriSpeech (Panayotov et al., 2015), Kaggle-Common Voice (Kaggle Community, 2017), and Magicdata-Read (OpenSLR, 2019) as training datasets, as they include reliable speaker gender labels. In addition, we employ a noisy data augmentation strategy.

SGC 训练说话人性别分类模型以标注网络来源数据的努力未能取得令人满意的表现。因此,我们放弃了伪标签数据,仅依赖手动标注的数据集进行训练。对于 SGC 任务,我们选择了 KeSpeech (Tang et al., 2021)、Datatang-Kid (Datatang Tech Co Ltd, 2024)、AISHELL-1 (Bu et al., 2017)、AISHELL-2 (Du et al., 2018)、LibriSpeech (Panayotov et al., 2015)、Kaggle-Common Voice (Kaggle Community, 2017) 和 Magicdata-Read (OpenSLR, 2019) 作为训练数据集,因为它们包含可靠的说话人性别标签。此外,我们采用了噪声数据增强策略。

SAP Similar to the SGC task, due to the poor performance of the automated labeling model, only manually labeled data is used for training. We use KeSpeech (Tang et al., 2021), Datatang-Kid (Datatang Tech Co Ltd, 2024), Magicdata-Read (OpenSLR, 2019), Kaggle-Common Voice (Kaggle Community, 2017), AISHELL-ASR0060 (AISHELL Tech Co Ltd, 2024), and AISHELL-ASR0018 (AISHELL Tech Co Ltd, 2024) as the training datasets for the SAP task, as these datasets provide reliable speaker age labels. Noisy data augmentation is also utilized in this task.

SAP 与 SGC 任务类似,由于自动标注模型表现不佳,仅使用手动标注数据进行训练。我们使用 KeSpeech (Tang et al., 2021)、Datatang-Kid (Datatang Tech Co Ltd, 2024)、Magicdata-Read (OpenSLR, 2019)、Kaggle-Common Voice (Kaggle Community, 2017)、AISHELL-ASR0060 (AISHELL Tech Co Ltd, 2024) 和 AISHELL-ASR0018 (AISHELL Tech Co Ltd, 2024) 作为 SAP 任务的训练数据集,因为这些数据集提供了可靠的说话者年龄标签。此任务中也使用了噪声数据增强。

STTC For the STTC task, we use three types of data. First, we utilize a human-recorded audio question-answer dataset, Databaker-Conversation (Databaker Tech Co Ltd, 2024). Second, we use a text-based dialogue dataset, LCCC (Wang et al., 2020), and the ChatTTS system with random speaker capabilities to generate the utterances of the questioner in the dialogue, thus obtaining speech-text pairs for the dialogue task. Finally, we filter sentences suitable for a response from the WenetSpeech (Zhang et al., 2022) dataset using Qwen2.5-7B and guide the LLM to generate text answers.

STTC
对于STTC任务,我们使用三种类型的数据。首先,我们利用了一个人类录制的音频问答数据集Databaker-Conversation(Databaker Tech Co Ltd, 2024)。然后,我们使用了一个基于文本的对话数据集LCCC(Wang et al., 2020)和具有随机说话者功能的ChatTTS系统来生成对话中提问者的语音,从而获得对话任务的语音-文本对。最后,我们从WenetSpeech(Zhang et al., 2022)数据集中筛选出合适的回应句子,使用Qwen2.5-7B引导大语言模型生成文本回答。

3 Experiments

3 实验

This section begins by presenting our training setup in Section 3.1. Subsequently, to conduct a more comprehensive evaluation of OSUM, we establish a series of complete internal evaluation sets, as detailed in Section 3.2. Finally, we report the performance of OSUM on both public and internal test sets, accompanied by an analysis in Section 3.3.

本节首先在第 3.1 节介绍我们的训练设置。随后,为了对 OSUM 进行更全面的评估,我们建立了一系列完整的内部评估集,详细内容见第 3.2 节。最后,我们在第 3.3 节报告了 OSUM 在公开和内部测试集上的表现,并附有分析。

3.1 Training Setup

3.1 训练设置

The two-stage training process for OSUM is detailed as follows:

OSUM 的两阶段训练过程如下:

Whisper Fine-tuning In the first stage, we fine-tune the Whisper-Medium model on the multi-task datasets described in Table 2. This stage is conducted on 8 Nvidia A6000 GPUs. A warm-up scheduler is employed to adjust the learning rate, peaking at 5e-5. The multitask Whisper is trained for 150,000 steps, which takes approximately 15.3 days.

Whisper微调:在第一阶段,我们在表2中描述的多任务数据集上对Whisper-Medium模型进行微调。该阶段在8个Nvidia A6000 GPU上进行。采用预热调度器来调整学习率,峰值学习率为5e-5。多任务Whisper训练了150,000步,大约需要15.3天。
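A warm-up schedule of the kind mentioned here can be sketched as a linear ramp to the peak learning rate. Only the peak value (5e-5) comes from the text. The number of warm-up steps and the choice to hold the rate constant afterwards are assumptions made for illustration, since the section does not specify them.

```python
def warmup_lr(step: int, peak_lr: float = 5e-5, warmup_steps: int = 2000) -> float:
    """Linear warm-up to peak_lr, then constant.

    warmup_steps=2000 is a hypothetical value; the post-warm-up behaviour
    (constant here) is also an assumption.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 999, 1999, 150_000):
    print(step, warmup_lr(step))
```

Starting from a small learning rate helps avoid disrupting the pre-trained Whisper weights during the first updates on the new multi-task data.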

OSUM Training In the second stage, we conduct experiments on 24 Huawei Ascend NPUs, using a learning rate of 5e-5. The process completes a total of 704,000 training steps and consumes 10 days.

OSUM 训练
在第二阶段,我们在 24 个华为 Ascend NPU 上进行实验,学习率为 5e-5。该过程共完成 704,000 个训练步骤,耗时 10 天。

Table 3: Details of the internal multi-task test sets. The categories in the classification task are balanced.

表 3: 内部多任务测试集的详细信息。分类任务中的类别是平衡的。

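| Task | Name | Description | Items |
| --- | --- | --- | --- |
| ASR | SpeechIO | A widely used Chinese ASR leaderboard covering diverse scenarios such as TV programs, lectures, and news. We use SpeechIO 0-4 as the ASR test set. | 11932 |
| SRWT | Test_srwt | Human recordings from ten speakers, covering adult male, adult female, and child voices. It includes storytelling and read-speech styles, with carefully annotated word-level timestamps. | 494 |
| VED | Test_ved | In-house human recordings from ten adult male and female speakers, all read in Mandarin, with vocal events inserted at random positions. | 400 |
| SER | Test_ser | Selected from five public emotion evaluation sets: IEMOCAP (Busso et al., 2008), MER2023 (Lian et al., 2023), M3ED (Zhao et al., 2022), MSP-IMPROV (Busso et al., 2017), and ESD (Zhou et al., 2021), in both Chinese and English, covering eight emotion types. | 3885 |
| SSR | Test_ssr | Two parts: web-crawled data annotated with style labels and highly expressive data used for TTS training. It covers eight speaking styles, mainly in Chinese. | 3000 |
| SGC | Test_sgc | Internal test set containing data from two age categories, with a female-to-male ratio of 2:3. All recordings come from real human conversation scenarios. | 3000 |
| SAP | Test_sap | Internal test set with three age categories (child, adult, and old) in a 1:1:1 ratio. All data come from real human conversation scenarios. | 3000 |
| STTC | Test_sttc | Generated from a Chinese text dialogue dataset: the question text is synthesized into audio with the Cosyvoice2 TTS model, while the answers remain in text form. The text dialogue data comes from the same source as the training set. | 200 |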

3.2 Internal Test Sets

3.2 内部测试集

Currently, most SULMs evaluate multi-task performance using publicly available English datasets (Chu et al., 2023, 2024; Radford et al., 2023; Song et al., 2024). However, as OSUM training incorporates a substantial amount of Chinese data, we have developed a series of internal multi-task test sets tailored for Chinese. These complement the publicly available English test sets, creating a more comprehensive evaluation framework. To support the $\mathrm{ASR+X}$ paradigm, we further supplement the test sets with speech transcripts. However, ASR metrics are used solely for internal reference to assess model convergence and will not be publicly reported. Table 3 presents a description of our internal multi-task test sets.

目前,大多数 SULM 使用公开的英文数据集来评估多任务性能 (Chu et al., 2023, 2024; Radford et al., 2023; Song et al., 2024)。然而,由于 OSUM 训练中包含了大量的中文数据,我们开发了一系列针对中文的内部多任务测试集。这些测试集与公开的英文测试集相辅相成,形成了一个更全面的评估框架。为了支持 $\mathrm{ASR+X}$ 范式,我们进一步增强了测试集的语音转录功能。然而,ASR 指标仅用于内部参考以评估模型收敛性,不会公开报告。表 3 描述了我们内部的多任务测试集。

3.3 Main Results

3.3 主要结果

Table 4 and Table 5 show the experimental results of our OSUM across various tasks. The results reveal that our approach achieves performance that is comparable to, and in many cases superior to, speech understanding models such as Qwen-audio, Qwen2-audio, Whisper, and SenseVoice. Furthermore, in this section, we will highlight the performance disparities between our model and other comparable approaches, while providing a detailed analysis of the challenges SULMs face in these tasks. We hope that these experiences can provide useful references for researchers.

表 4 和表 5 展示了我们的 OSUM 在各种任务中的实验结果。结果表明,我们的方法在性能上可与 Qwen-audio、Qwen2-audio、Whisper 和 SenseVoice 等语音理解模型相媲美,甚至在许多情况下表现更优。此外,在本节中,我们将重点分析我们的模型与其他类似方法之间的性能差异,同时详细探讨 SULMs 在这些任务中面临的挑战。我们希望这些经验能为研究人员提供有益的参考。

Table 4: Evaluation results of ASR tasks on public and internal test sets. The bold font represents the best result among the same test set. All internal results are inferred by ourselves.

ASR As illustrated in Table 4, our approach shows clear advantages in the ASR task on the Chinese test sets. Notably, the proposed OSUM consistently outperforms other models on the WenetSpeech test-meeting set, three AISHELL-2 sub-test sets, and four internally used SpeechIO test sets. While OSUM does not surpass the top-performing method on the English test set, its performance is comparable to that of SenseVoice-S. Remarkably, these results are achieved with substantially less training data. In addition, we find that OSUM exhibits a surprisingly strong ability to recognize Chinese-English code-mixed speech, even though such code-mixed datasets are not included during training. Specifically, the MER/CER/WER is 3.89%/2.41%/19.30% on the ASRU code-switching test set (Shi et al., 2020). Going forward, we will further enhance this capability. Overall, these results underscore that our $\mathrm{ASR+X}$ task paradigm effectively enhances model convergence in ASR tasks, significantly reducing the data and computational resources required for SULM training.

如表 4 所示,我们的方法在中文测试集上的 ASR 任务中展现了显著优势。值得注意的是,提出的 OSUM 在 WenetSpeech test-meeting 测试集、三个 AISHELL-2 子测试集以及四个内部使用的 SpeechIO 测试集上始终优于其他模型。尽管 OSUM 在英文测试集上未超越表现最佳的方法,但其性能与 SenseVoice-S 相当。值得注意的是,这些结果是在使用显著更少的训练数据的情况下取得的。此外,我们发现 OSUM 在识别中英文混合语音方面表现出令人惊讶的能力,尽管训练过程中并未包含此类混合数据集。具体而言,在 ASRU 语码转换测试集 (Shi et al., 2020) 上,MER/CER/WER 分别为 3.89%/2.41%/19.30%。未来,我们将增强这一功能。总体而言,这些结果表明我们的 $\mathrm{ASR+X}$ 任务范式有效提升了模型在 ASR 任务中的收敛性,显著减少了 SULMs 训练所需的数据和计算资源。
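For readers unfamiliar with the three metrics, the sketch below computes an error rate as the Levenshtein edit distance between reference and hypothesis tokens, divided by the reference length. CER uses characters, WER uses words, and MER is taken here to mean a mixed error rate over Chinese characters plus English words, which is the usual convention for code-switching test sets. That tokenization rule is an assumption about how the reported MER was computed, not something stated in the text.

```python
import re
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the number of reference tokens."""
    return edit_distance(ref, hyp) / len(ref)


def mixed_tokens(text: str) -> List[str]:
    """Chinese characters as single tokens, English words as whole tokens."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)


cer = error_rate(list("帮我调大声音"), list("帮我调小声音"))
wer = error_rate("turn the volume up".split(), "turn volume up".split())
mer = error_rate(mixed_tokens("我的属性为了完成我的使命everyday每个流血不止"),
                 mixed_tokens("我的属性为了完成我的使命every day每个流血不止"))
print(cer, wer, mer)
```

In the mixed example, splitting "everyday" into two words counts as one substitution plus one insertion against 19 reference tokens, whereas a pure character-level metric would score the same error very differently.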

SRWT Table 5 presents the SRWT evaluation results for our proposed OSUM model compared to Whisper-Large-v3, Qwen-Audio, and the GMM-HMM model used for generating annotated data in SRWT tasks. Our OSUM model significantly outperforms Whisper-Large-v3 by a relative 36.70% and also surpasses Qwen-Audio.

表 5 展示了我们提出的 OSUM 模型与 Whisper-Large-v3、Qwen-Audio 以及用于生成 SRWT 任务标注数据的 GMM-HMM 模型在 SRWT 评估中的结果。我们的 OSUM 模型显著优于 Whisper-Large-v3,相对提升了 36.70%,并且也超越了 Qwen-Audio。

Moreover, OSUM's performance in the SRWT task even slightly surpasses that of the GMM-HMM model, which is widely recognized for its high accuracy in timestamp prediction. These results underscore the effectiveness of OSUM in the SRWT task. OSUM's high performance in the SRWT task not only enables it to predict timestamps in an end-to-end manner but, more importantly, simplifies its integration with other tasks, such as speaker diarization.

此外,我们的OSUM在SRWT任务中的表现甚至略微超过了GMM-HMM模型,而GMM-HMM模型在时间戳预测方面以其高准确性而广受认可。这些结果突显了OSUM在SRWT任务中的有效性。此外,OSUM在SRWT任务中的高性能不仅使其能够以端到端的方式预测时间戳,更重要的是,简化了其与其他任务(如说话人分割)的集成。

VED. We first evaluate OSUM's performance on the public test sets ESC-50 and VocalSound. However, since the event categories in these two datasets do not completely align with those in OSUM, the comparison with other approaches should only serve as a rough assessment. Specifically, ESC-50 contains a substantial number of non-vocal audio events, which we categorize as "other." The experimental results on this test set show that our model successfully classifies these non-vocal audio events as "other." On the VocalSound set, we select the categories supported by OSUM and calculate the average accuracy across these categories. This result reveals a gap between OSUM and Qwen2-Audio, primarily because our training data consists of concatenated speech and vocal events, whereas the VocalSound test set includes only the latter, resulting in a significant mismatch. Nevertheless, OSUM reaches a reasonable level of accuracy, correctly identifying most standalone vocal events. On our internal human-recorded $\mathrm{ASR+VED}$ test set, PANNs become unusable due to similar mismatches, particularly because their design treats speech as a standalone event, which exacerbates the accuracy degradation. Qwen2-Audio performs relatively better but also suffers a performance decline on our test set, likely due to overfitting. In contrast, our model achieves balanced results on both the public and internal test sets, showing better generalization. This indicates that using VC to augment data for vocal events can effectively mitigate overfitting in VED tasks.
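The category handling described for ESC-50 and VocalSound can be sketched as follows. The example assumes that "average accuracy across these categories" means the unweighted mean of per-category accuracies, which is one possible reading, and the category names in `SUPPORTED` are a hypothetical subset, not OSUM's actual label set.

```python
from collections import defaultdict

# Hypothetical subset of vocal event categories; OSUM's real label set may differ.
SUPPORTED = {"laugh", "cough", "sigh"}


def map_label(label: str, supported: set[str] = SUPPORTED) -> str:
    """Map any event outside the supported set (e.g. non-vocal sounds) to 'other'."""
    return label if label in supported else "other"


def mean_per_category_accuracy(refs: list[str], hyps: list[str], categories: set[str]) -> float:
    """Unweighted mean of per-category accuracy over categories present in refs."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ref, hyp in zip(refs, hyps):
        if ref in categories:
            total[ref] += 1
            correct[ref] += int(ref == hyp)
    scored = sorted(c for c in categories if total[c] > 0)
    if not scored:
        raise ValueError("no reference sample belongs to the selected categories")
    return sum(correct[c] / total[c] for c in scored) / len(scored)


raw_refs = ["laugh", "laugh", "cough", "dog_bark"]
refs = [map_label(label) for label in raw_refs]  # 'dog_bark' becomes 'other'
hyps = ["laugh", "cough", "cough", "other"]
print(f"{mean_per_category_accuracy(refs, hyps, SUPPORTED | {'other'}):.4f}")  # prints 0.8333
```

Averaging per category rather than per sample keeps a frequent category from dominating the score, which matters when the selected categories are unevenly represented.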

SER. For the SER task, we extract the categories supported by OSUM from the public datasets MELD and MER2023 for testing, and then conduct a comprehensive evaluation on our internal test set. On the public datasets, OSUM performs best on the MER2023 test set, outperforming several recent public benchmark models. On the MELD dataset, OSUM's performance ranks just below the SenseVoice-L model, likely because the latter is additionally trained on a larger-scale speech emotion dataset. In addition, while OSUM's result on the internal test set is comparable to that of the EmoBox model, it significantly surpasses the other compared approaches. Furthermore, we observe that among the eight supported emotions, disgust and fear are particularly challenging to recognize, a difficulty partly attributable to the scarcity of training data for these two emotions. In future work, we plan to improve the model's performance and generalization capability by using OSUM for labeling, thereby obtaining a larger and more balanced emotion dataset.

SSR. The acoustic-text dual-modal style classification used by OSUM significantly outperforms the text-only classification of GLM-4-9B-Chat. It shows a strong ability to distinguish among eight styles: news and science reporting, horror stories, fairy tales, customer service, poetry and prose, audiobooks, spontaneous conversation, and others. Notably, the classification performance for the news and science reporting, audiobook, fairy tale, and customer service styles is commendable; however, there remains room for improvement on poetry and prose, horror stories, and other styles. Going forward, we plan to use OSUM to label additional data, aiming to improve data quality and balance the distribution across categories.

SGC. In the SGC task, we evaluate Qwen2-Audio and OSUM. OSUM achieves $100\%$ accuracy on the AISHELL-1 test set. Although this result is commendable, we suspect it may indicate some degree of overfitting. On the Kaggle test set, our approach slightly outperforms Qwen2-Audio, yet it falls short on our internal test set. This indicates that Qwen2-Audio is highly robust on the SGC task, possibly because it was trained on a wider range of data. As a follow-up, we plan to keep increasing the training data for this task. Overall, OSUM demonstrates its value on the SGC task.

Table 5: Evaluation results of multi-tasking on public and internal test sets. The best results for each test set are highlighted in bold. Results shown in blue, as well as those on the internal test sets, were obtained by running the originally released models ourselves.

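Task Model Metric Public test set Public result Internal test set Internal result
SRWT Whisper-L-v3 AAS (ms, ↓) (Shi et al., 2022) The test sets of the compared methods are not open-sourced. Testsrwt 11.09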
Qwen-Audio 9.17
GMM-HMM 7.55
OSUM 7.02
VED Qwen2-Audio ACC (%, ↑) ESC-50 VocalSound 93.3 Testved 33.25
TouchASP 85.7
PANNs 83.3 3.25
OSUM 96.60 82.58 78.25
SER Qwen2-Audio ACC (%, ↑) MELD 55.3 38.04
TouchASP 50.5
Sensevoice-L 63.1 69.2 57.8
Sensevoice-S test MER2023 test Testser
Emotion2Vec 68.3 51.88 40.77 51.14
EmoBox 51.89 65.23 74.54
OSUM 53.38 86.43 72.33
SSR GLM-4 ACC (%, ↑) The test sets of the compared methods are not open-sourced. 53.97 Testssr 67.34
OSUM
SGC Qwen2-Audio ACC (%, ↑) AISHELL-1 test Kaggle-CommonVoice valid-test 97.36 97.25 Testsgc 98.43
OSUM 100 96.79
SAP Qwen2-Audio ACC (%, ↑) Kaggle-CommonVoice valid-test 99.41 35.53 Testsap 49.52
76.52 67.36
STTC OSUM Qwen2-Audio ASLP-Audio GPT-3.5-Turbo AirBench speech 6.77 Teststtc 6.91
Scoring 5.98 7.04

SAP. We also compare OSUM with Qwen2-Audio on the SAP task. In our earlier experiments, we found that the acoustic similarity between teenagers and adults is remarkably high, which makes them difficult to tell apart. Consequently, we categorize age into three groups: child, adult, and old. Curiously, despite our efforts to tune the prompts, Qwen2-Audio shows lower age classification accuracy on both the Kaggle test set and our internal test set. This may stem from its overly fine-grained age categorization, which hinders the model's training accuracy. Our model significantly surpasses Qwen2-Audio on the Kaggle test set, achieving an accuracy of $76.52\%$. Although the classification accuracy declines on our internal test set, it still exceeds that of Qwen2-Audio. This indicates that our model generalizes well across different data.
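The three-way age grouping can be illustrated with a small mapping function. The boundaries `child_max=12` and `old_min=60` are hypothetical defaults chosen only for illustration, since this section does not state the thresholds; the one property taken from the text is that teenagers are not a separate class and, presumably, fall into the adult group.

```python
def to_age_group(age_years: int, child_max: int = 12, old_min: int = 60) -> str:
    """Map an age in years to one of three coarse groups.

    child_max and old_min are hypothetical thresholds, not values from the paper.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years <= child_max:
        return "child"
    if age_years >= old_min:
        return "old"
    return "adult"  # teenagers are folded into this group


def accuracy(refs: list[str], hyps: list[str]) -> float:
    """Fraction of samples whose predicted group matches the reference group."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


refs = [to_age_group(age) for age in (8, 16, 35, 70)]  # child, adult, adult, old
hyps = ["child", "adult", "old", "old"]
print(accuracy(refs, hyps))  # prints 0.75
```

Collapsing fine-grained labels before scoring makes the comparison fair only if every system is mapped to the same three groups, which is why the mapping is kept separate from the accuracy computation.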

STTC. In the STTC task, we follow AirBench's evaluation protocol on all test sets. The protocol provides the text of each audio query along with the text of two distinct answers and lets a text-based LLM assign subjective scores from 1 to 10. The two answers are a real response and the answer generated by the SULM. AirBench employs GPT-4 as the scoring LLM; since it is currently inaccessible to us, we use GPT-3.5-Turbo instead. The results in Table 5 show that, on AirBench's official speech sub-test set, OSUM's score is lower than that of Qwen2-Audio, suggesting that our model's capabilities in English conversation and audio description lag behind those of Qwen2-Audio. This is primarily because we do not use English conversational data for training, so the current score relies entirely on the LLM's own ability. However, on our internal Chinese conversational test set, OSUM outperforms Qwen2-Audio, which indicates that our strategy of performing ASR before chat is beneficial. Overall, OSUM is comparable to Qwen2-Audio in conversational ability. We are not content with this result, and our future work will mainly focus on the conversation task.
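The scoring protocol described above can be sketched without calling any model. The example below builds a judge prompt from the query text, the reference answer, and the model answer, and parses a reply that is expected to start with two scores between 1 and 10. The prompt wording, the reply format, and the helper names `build_judge_prompt`, `parse_scores`, and `mean_candidate_score` are all hypothetical; they are not AirBench's actual prompt or the authors' evaluation code.

```python
import re


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble a text-only prompt for the scoring LLM (hypothetical wording)."""
    return (
        "You are given the transcript of a spoken question and two answers.\n"
        f"Question: {question}\n"
        f"Answer 1 (reference): {reference}\n"
        f"Answer 2 (model): {candidate}\n"
        "Rate each answer from 1 to 10. Reply with the two scores on the first "
        "line, separated by a space, then explain briefly."
    )


def parse_scores(reply: str) -> tuple[int, int]:
    """Read '<reference score> <candidate score>' from the first line of a reply."""
    lines = reply.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.fullmatch(r"\s*(\d{1,2})\s+(\d{1,2})\s*", first_line)
    if match is None:
        raise ValueError("first line must contain exactly two integer scores")
    ref_score, cand_score = int(match.group(1)), int(match.group(2))
    if not (1 <= ref_score <= 10 and 1 <= cand_score <= 10):
        raise ValueError("scores must lie between 1 and 10")
    return ref_score, cand_score


def mean_candidate_score(replies: list[str]) -> float:
    """Average the model-answer score over all judged samples."""
    if not replies:
        raise ValueError("no replies to aggregate")
    return sum(parse_scores(reply)[1] for reply in replies) / len(replies)


print(mean_candidate_score(["8 6\nAnswer 1 is more complete.", "7 9\nAnswer 2 is clearer."]))  # prints 7.5
```

Parsing only a strictly formatted first line avoids picking up stray digits from the judge's explanation, at the cost of rejecting replies that ignore the format.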

4 Future Work

While OSUM demonstrates commendable performance, our research remains an ongoing endeavor to push the boundaries of academic exploration in Speech LLMs. In the coming months, we aim to address several key areas for improvement and innovation:

· Expanding OSUM's Functionalities. We plan to enhance OSUM with additional capabilities, such as language and accent identification, to broaden its applicability in multilingual and diverse speech scenarios.

· Enabling Multi-Task Capability. We aim to activate OSUM's ability to perform multiple tasks simultaneously, such as identifying the emotion, age, gender, and speaking style in a single inference pass. Leveraging this multi-task capability, we plan to develop a versatile data labeling tool to streamline audio data processing pipelines.

· Incorporating Full-Duplex Voice Interaction. To improve naturalness and responsiveness, we plan to integrate full-duplex voice interaction capabilities into OSUM. This enhancement will allow OSUM to generate context-aware, natural responses, such as matching the questioner's emotion or mimicking specific speaking styles, like that of a child.

· Open Science Contributions. As part of our commitment to advancing academic research, we will continue to share detailed training methodologies, data pipelines, and model updates. Our aim is to foster collaboration, provide valuable resources for researchers, and democratize access to cutting-edge Speech LLM technologies.

Through these efforts, we seek to extend OSUM's capabilities, establish new benchmarks for Speech LLMs, and contribute meaningfully to the academic study and practical applications of speech understanding.

5 Conclusion

In this study, we propose OSUM, an open-source Speech Understanding Language Model. OSUM integrates a Whisper encoder with a Qwen2 LLM and supports eight speech tasks. By employing an $\mathrm{ASR+X}$ training strategy, OSUM achieves efficient and stable multi-task training, simultaneously optimizing ASR alongside the target tasks. Beyond delivering robust performance, OSUM prioritizes transparency by providing openly accessible data preparation and training methodologies, offering valuable insights and practical guidance for the academic community.

6 Acknowledgment

We thank AISHELL$^{11}$, Databaker$^{12}$, MagicData$^{13}$, and DataTang$^{14}$ for generously providing part of the training data. Most of our model components are trained using Huawei$^{15}$ Ascend NPUs.

7 Authors

The ranking is from top to bottom and then from left to right. * indicates equal contribution. + indicates the corresponding author.

· Xuelong Geng · Kun Wei · Qijie Shao · Shuiyun Liu* · Zhennan Lin* · Zhixian Zhao* · Guojian Li*

· Wenjie Tian* · Peikun Chen · Yangze Li · Pengcheng Guo · Mingchen Shao · Shuiyuan Wang · Yuang Cao

References
