OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University
Abstract
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an $\mathrm{ASR+X}$ training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
1 Introduction
Large language models (LLMs) have shown tremendous progress towards Artificial General Intelligence (AGI) in recent years. Given the inherent human preference for speech-based interaction, there has been growing interest in extending LLMs with speech capabilities to develop Speech LLMs. To generate fluent and expressive text or speech responses, Speech LLMs must fully comprehend input speech, including both its semantic content and paralinguistic information, such as emotion, speaking style, speaker gender, and age. Moreover, this comprehension ability is also crucial for audio data labeling. Currently, the mainstream multi-label generation approach uses multiple models to label each task separately, which consumes extremely high computational resources. A labeling model capable of accurately generating multiple labels simultaneously therefore holds broad application prospects.
The area of Speech Understanding Language Models (SULMs) has seen notable advancements through projects such as Qwen-Audio (Chu et al., 2023), Qwen2-Audio (Chu et al., 2024), PandaGPT (Su et al., 2023), and SALMONN (Tang et al., 2024). Whisper (Radford et al., 2023) marks a pioneering exploration of speech understanding independent of LLMs, utilizing an encoder-decoder Transformer (Vaswani, 2017) architecture to tackle a variety of speech tasks, such as automatic speech recognition (ASR), speech translation (ST), language identification (LID), and voice activity detection (VAD). Building on Whisper's design, SenseVoice (An et al., 2024) and TouchASP (Song et al., 2024) support additional tasks such as speech emotion recognition (SER) and audio event detection (AED), further enriching their ability to process and comprehend human speech. Qwen-Audio integrates Whisper's encoder with the text-based Qwen LLM (Bai et al., 2023), enabling the latter to understand speech. Compared to Whisper, Qwen-Audio leverages a more powerful LLM decoder and performs over 30 speech-related tasks, making it a representative model in the field of SULMs. Its successor, Qwen2-Audio, further enhances these capabilities by supporting natural language prompts and achieving superior performance across various benchmarks (Chu et al., 2024).
Figure 1: Comparison of Qwen2-Audio and our OSUM model. In most tasks, OSUM achieves better performance than Qwen2-Audio despite using significantly fewer computational resources and less training data. The value for each model and task in the radar chart is based on the average results on the public and internal test sets, as shown in Table 4 and Table 5.
Although these advanced SULMs have achieved remarkable progress, most of them are developed by industry, leveraging millions of hours of training data and massive GPU resources. For instance, TouchASP and SenseVoice utilized 1,000,000 and 400,000 hours of training data, respectively. Such large-scale resources are typically beyond the reach of academic institutions. Furthermore, while inference models are often open-sourced, essential details regarding data preparation, training strategies, codebases, and hyper-parameter configurations are rarely disclosed. These limitations hinder the academic community's efforts to further optimize and expand SULM research. Recently, a growing movement advocating open science in Speech LLM research has emerged. This movement emphasizes the importance of releasing comprehensive training frameworks, datasets, and methodological details to promote research and innovation. A notable example is the Open Whisper-style Speech Model (OWSM) series (Peng et al., 2023), which replicates Whisper-style training using open-sourced tools and publicly available data, significantly advancing public understanding of and research on speech understanding models.
Figure 2: The overview of the architecture and tasks of OSUM.
In this study, we aim to foster broader academic exploration of SULMs with limited resource demands, encouraging wider research community participation. To this end, we introduce OSUM, an open SULM whose data processing pipeline and training details are publicly available. The OSUM model integrates a Whisper speech encoder, fine-tuned on a multi-task dataset, with a Qwen2 LLM. It is capable of performing a wide range of speech tasks, including automatic speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). Notably, SSR is a distinctive feature of our OSUM model and serves as a vital component of speech understanding. It enhances the model's capability by improving contextual comprehension and boosting performance across various downstream speech tasks. Furthermore, it establishes a foundation for enabling more natural and context-aware speech-based interactions. We adopt an $\mathrm{ASR+X}$ training strategy to enhance training stability and reduce resource consumption for our SULM, wherein an auxiliary ASR task is optimized alongside the primary target task (denoted as "X"). For instance, during the training of the SER task, we concurrently train the ASR task (ASR+SER) by predicting both the transcription and the emotion label for each speech sample. This multi-task training accelerates modality alignment, enabling the LLM to effectively utilize both textual and acoustic modalities. Our OSUM model utilizes only 50,500 hours of training data and achieves comparable or superior performance to other SULMs. The overall performance of OSUM is illustrated in Fig. 1. The model is trained on Nvidia A6000 GPUs and Huawei Ascend NPUs, and supports inference on both platforms. The goal of this study is to foster transparency and accelerate progress in the field of SULMs by providing accessible tools and resources for the broader research community.
2 Methodology
This section introduces our proposed OSUM, a model designed for comprehensive speech understanding. Section 2.1 presents its architecture; Section 2.2 details its multi-task training process. Section 2.3 and Section 2.4 provide an overview of the training data and the data processing pipeline, respectively.
2.1 Model Architecture
As shown in Figure 2, our OSUM model comprises a speech encoder, an adaptor, and an LLM. During training, all parameters of the encoder and adaptor are updated, while the LLM is fine-tuned with LoRA (Hu et al., 2022). The input to our model consists of speech and a natural language prompt. Unlike Whisper (Radford et al., 2023) and Qwen-Audio (Chu et al., 2023), which rely on instruction tags, OSUM employs descriptive text prompts covering all eight supported tasks, as shown in Fig. 2. Currently, our model supports only text-based responses, but audio output capabilities are under active development. The following sections describe each sub-module in detail.
Speech Encoder Our OSUM utilizes the Whisper-Medium model as its speech encoder, which consists of two one-dimensional convolutional layers with 2x downsampling, followed by 24 Transformer layers with a hidden dimension of 1024 and 16-head self-attention. The encoder has approximately 300 million parameters, balancing speech comprehension ability and inference efficiency.
Adaptor The adaptor module features a hybrid architecture combining three 1D convolutional layers (Conv1D) and a 4-layer Transformer. The Conv1D layers use kernel widths of (3, 3, 3) and strides of (1, 2, 2), achieving an overall 4x downsampling. The Transformer layers have a model dimension of 1,280, an inner dimension of 2,560, and 4 attention heads. This architecture bridges the output of the speech encoder with the input requirements of the LLM, enabling efficient modality alignment.
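A minimal sketch of such an adaptor, using the dimensions stated above, is given below. The exact layer ordering, activations, and the final projection to the LLM embedding size are assumptions rather than the released implementation.

```python
# Sketch of the Conv1D + Transformer adaptor (assumed details noted in comments).
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    def __init__(self, enc_dim=1024, d_model=1280, ffn_dim=2560, n_heads=4, n_layers=4):
        super().__init__()
        # Three Conv1D layers, kernel widths (3, 3, 3), strides (1, 2, 2) -> 4x downsampling overall.
        self.convs = nn.Sequential(
            nn.Conv1d(enc_dim, d_model, kernel_size=3, stride=1, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Assumed projection to the LLM embedding size (3584 for Qwen2-7B).
        self.proj = nn.Linear(d_model, 3584)

    def forward(self, x):  # x: (batch, frames, enc_dim) from the Whisper encoder
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)  # (batch, frames/4, d_model)
        x = self.transformer(x)
        return self.proj(x)  # (batch, frames/4, llm_dim)

if __name__ == "__main__":
    feats = torch.randn(2, 100, 1024)   # dummy Whisper-Medium encoder output
    print(Adaptor()(feats).shape)        # torch.Size([2, 25, 3584])
```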
LLM with LoRA Qwen2-7B-Instruct is selected as our LLM. Qwen2-7B-Instruct is a general-purpose LLM with a parameter scale of 7 billion, specifically designed for multi-task instruction following. In our work, we fine-tune the Qwen2-7B-Instruct model using LoRA (Low-Rank Adaptation). The LoRA hyperparameters $\alpha$, rank, and dropout ratio are set to 32, 8, and 0.1, respectively.
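Concretely, this setup can be sketched with Hugging Face PEFT as follows; the choice of target modules is an assumption, since the text only specifies the LoRA rank, alpha, and dropout ratio.

```python
# Sketch of attaching LoRA (rank 8, alpha 32, dropout 0.1) to Qwen2-7B-Instruct.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", torch_dtype="auto")
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the LoRA adapters are updated during training
```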
2.2 Multitask Supervised Training
The training procedure includes two stages. First, we perform multi-task supervised fine-tuning on the original Whisper model without an LLM. Second, we integrate the fine-tuned Whisper encoder with the Qwen2 LLM to create the complete OSUM system, then conduct further supervised training using a larger dataset.
Whisper Fine-tuning The original Whisper model supports a limited scope of speech-understanding tasks, which makes directly integrating Whisper with an LLM for multi-task training risky when data and computational resources are constrained. Therefore, we first fine-tune Whisper on multi-task data to ensure faster convergence of the OSUM model. This stage also allows us to verify the reliability of our multi-task data. Specifically, we expand Whisper's instruction tag set to accommodate more tasks. Each forward pass executes only a single task.
OSUM Training Training SULMs typically begins with pre-training on an ASR task, which serves as a foundation for incorporating additional speech tasks and enables the LLM to process semantic content from the speech encoder. Given computational constraints, we introduce an $\mathrm{ASR+X}$ paradigm for OSUM's multi-task training. It concurrently trains ASR and a secondary task "X", accelerating training while allowing the "X" task to utilize both textual and acoustic features, thereby potentially improving performance. The $\mathrm{ASR+X}$ paradigm follows a two-step process: first, transcribing speech to text (ASR); then, integrating this transcription with acoustic features to execute the target task (X). This is achieved within the LLM's autoregressive framework by adjusting the predicted labels, without modifying the model architecture or loss functions. We implement the $\mathrm{ASR+X}$ paradigm by prompting the LLM with natural language prompts. ChatGPT is used to generate five candidate prompts for each task, one of which is randomly selected during training. Table 1 shows examples of the prompts and $\mathrm{ASR+X}$ prediction labels.
Table 1: Examples of prompts and $\mathrm{ASR+X}$ labels in OSUM multi-task training.
Task | Prompt | Label
---|---|---
ASR | 转录这段音频中的语音内容为文字。 | 播放小梦想大梦想
SRWT | 请将以下音频进行转录,同时标记出每个英文单词及对应中文字符的起始与结束时间,时间单位精确到0.1秒,并用<>来表示这些时间范围。 | <0.21>父<0.47> <0.46>母<0.60> <0.60>的 …
VED | 请将以下音频进行转录,并在结尾处给出<音频事件>标签,音频事件分为8类:laugh, cough, cry, screaming, sigh, throat clearing, sneeze, other。 | 我的属性为了完成我的使命everyday每个流血不止
SER | 请将以下音频内容进行转录,并在结尾处给出<情感>标签,情感共分为8类:sad, anger, neutral, happy, surprise, fear, disgust, 以及 other。 | 你一个享受者你有什么跟这个复仇者去叫板呢
SSR | 请将以下音频内容进行转录,并在结尾处给出<风格>标签,风格共分为8类:新闻科普,恐怖故事,童话故事,客服,诗歌散文,有声书,日常口语,其他。 | 小猪皮革恍然大悟赶紧把这颗白色的小药片吞进肚子里<童话故事>
SGC | 请将这段音频进行转录,并在文本末尾附加<性别>标签。性别分为两种:female, male。 | 帮我调大声音
SAP | 请将以下音频内容转录成文字,并在文字的最后加上<年龄>标签,标明是child、adult还是old。 | 在黑板上写字的老师一转身
STTC | 首先将语音转化为书面语,随后以<开始回答>作分隔,最后对语音内容做出答复。 | 我感觉不太满意<开始回答>抱歉,我们会让你满意的。
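To make the $\mathrm{ASR+X}$ label format concrete, the sketch below assembles a training example in the style of Table 1: a prompt is drawn at random from the task's candidate pool, and the target text is the transcription followed by the task-specific tag. The prompt pool contents and exact serialization are illustrative assumptions; only the "transcription + tag" pattern and the random choice among five candidate prompts come from the description above.

```python
import random

# Hypothetical prompt pools; in OSUM each task has five ChatGPT-generated candidates.
# One abridged example from Table 1 is shown per task here.
PROMPTS = {
    "ASR": ["转录这段音频中的语音内容为文字。"],
    "SER": ["请将以下音频内容进行转录,并在结尾处给出<情感>标签。"],
}

def build_asr_x_sample(task: str, transcript: str, x_label: str | None = None) -> dict:
    """Assemble one ASR+X training example (prompt plus target text).

    The target is the ASR transcription, optionally followed by the task-specific
    tag (e.g. an emotion or style label), so the LLM first transcribes and then
    predicts the secondary task within a single autoregressive sequence.
    """
    prompt = random.choice(PROMPTS[task])
    target = transcript if x_label is None else f"{transcript}<{x_label}>"
    return {"prompt": prompt, "target": target}

# ASR+SER example: transcription followed by an emotion tag (tag value is illustrative).
print(build_asr_x_sample("SER", "你一个享受者你有什么跟这个复仇者去叫板呢", "anger"))
```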
2.3 Training Data
Our OSUM is designed to perform multi-task training using diverse speech datasets, with the goal of building a unified model capable of comprehensively understanding input speech in conversational scenarios. The multi-task training process enables tasks to benefit from shared learning, enhancing overall model performance. Upon completion of training, OSUM can be utilized for speech data annotation or further extended into a conversational Speech LLM. Detailed information about the datasets used for training is provided in Table 2.
2.4 Data Processing Pipeline
The data processing pipeline is crucial for training multi-task SULMs. In this section, we reveal the data processing schemes used for each task in the OSUM project, with the aim of providing a valuable reference for academic research.
ASR The training data include publicly available resources such as WenetSpeech (Zhang et al., 2022), AISHELL-1 (Bu et al., 2017), AISHELL-2 (Du et al., 2018), and LibriSpeech (Panayotov et al., 2015), along with our internal ASR dataset, resulting in a total of 24,000 hours.
Table 2: Details of the multi-task training data for OSUM. The total duration is 50,500 hours. The training data include both open-sourced datasets and data we processed ourselves.
SRWT For the SRWT task, a conventional ASR model based on a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) is used to conduct forced alignment and obtain word-level timestamps. This model is trained on the 54,000-hour proprietary ASR dataset. To evaluate its performance, we establish an internal SRWT test set and assess alignment quality using the Average Alignment Score (AAS) metric (Shi et al., 2022). The GMM-HMM model achieves an AAS of 7.55, demonstrating its efficacy in generating reliable word-level timestamps.
SSR Given the absence of open-sourced tools for annotating style labels directly from audio data, we leverage two text-based LLMs, Qwen2.5-14B and GLM-4-9B-Chat, to annotate speech transcriptions using carefully designed prompts. To enhance annotation accuracy and reliability, we retain only the intersection of the labeling results from both models. This intersection-based approach ensures high-quality annotations for training the SSR task.
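The intersection strategy (also used later for SER) can be summarized with the short sketch below. The annotator wrappers `label_with_qwen` and `label_with_glm` are hypothetical placeholders for calls to Qwen2.5-14B and GLM-4-9B-Chat; only samples on which both models agree on a valid style label are kept.

```python
from typing import Callable, Dict, List, Tuple

# The eight speaking-style classes used for SSR (from Table 1).
STYLES = {"新闻科普", "恐怖故事", "童话故事", "客服", "诗歌散文", "有声书", "日常口语", "其他"}

def intersect_style_labels(
    transcripts: Dict[str, str],
    label_with_qwen: Callable[[str], str],  # hypothetical wrapper around Qwen2.5-14B
    label_with_glm: Callable[[str], str],   # hypothetical wrapper around GLM-4-9B-Chat
) -> List[Tuple[str, str, str]]:
    kept = []
    for utt_id, text in transcripts.items():
        a, b = label_with_qwen(text), label_with_glm(text)
        # Keep a sample only when both annotators return the same valid style label.
        if a == b and a in STYLES:
            kept.append((utt_id, text, a))
    return kept
```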
VED We attempted to train a vocal event labeling tool; however, due to the limited availability of training data, its classification performance is suboptimal, especially when vocal events and speech occur within the same audio segment. Therefore, we employ a voice conversion (VC) tool to modify the timbre of vocal event audio and insert it randomly into speech audio, creating a dataset in the $\mathrm{ASR+VED}$ format. We find that this approach, with the assistance of VC, effectively mitigates the overfitting caused by the scarcity of vocal event training data. The open-source vocal event datasets we use include AudioSet (Gemmeke et al., 2017), ESC-50 (Piczak, 2015), VocalSound (Gong et al., 2022), and Nonspeech7k (Rashid et al., 2023), while the ASR data consist solely of AISHELL-2 (Du et al., 2018).
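A rough sketch of the insertion step is given below. It assumes both waveforms are mono NumPy arrays at the same sample rate and splices the (voice-converted) event into the utterance at a random offset; whether OSUM splices or overlays the event is not stated, so this is only one plausible reading.

```python
import numpy as np

def insert_vocal_event(speech: np.ndarray, event: np.ndarray,
                       rng: np.random.Generator | None = None) -> np.ndarray:
    """Splice a vocal-event clip into a speech waveform at a random position."""
    rng = rng or np.random.default_rng()
    pos = int(rng.integers(0, len(speech) + 1))
    return np.concatenate([speech[:pos], event, speech[pos:]])

speech = np.random.randn(16000 * 3).astype(np.float32)  # 3 s of dummy speech @ 16 kHz
cough = np.random.randn(16000 // 2).astype(np.float32)  # 0.5 s dummy vocal-event clip
print(insert_vocal_event(speech, cough).shape)           # (56000,)
```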
SER Emotion2Vec (Ma et al., 2024) is the first universal speech emotion representation model. Without additional fine-tuning, we directly apply the pre-trained Emotion2Vec+ Large model, which is trained on 40,000 hours of emotional speech data, to annotate the audio with emotion labels. Additionally, we leverage the GLM-4-9B-Chat model to generate emotion labels from the textual transcriptions of the speech. By intersecting these annotations, we generate high-quality emotion labels for the entire dataset.
SGC Efforts to train a speaker gender classification model to label web-sourced data yielded unsatisfactory performance. Consequently, we discard the pseudo-labeled data and rely exclusively on manually labeled datasets for training. For the SGC task, we select KeSpeech (Tang et al., 2021), Datatang-Kid (Datatang Tech Co Ltd, 2024), AISHELL-1 (Bu et al., 2017), AISHELL-2 (Du et al., 2018), LibriSpeech (Panayotov et al., 2015), Kaggle-Common Voice (Kaggle Community, 2017), and Magicdata-Read (OpenSLR, 2019) as the training datasets, as they include reliable speaker gender labels. In addition, we employ a noisy data augmentation strategy.
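The noisy data augmentation used for SGC (and SAP below) can be sketched as mixing a noise clip into the speech at a random signal-to-noise ratio; the 5-20 dB SNR range below is an assumption, since the text only states that noise augmentation is applied.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-10
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech = np.random.randn(16000).astype(np.float32)        # 1 s of dummy speech @ 16 kHz
noise = np.random.randn(8000).astype(np.float32)          # shorter dummy noise clip
noisy = add_noise(speech, noise, snr_db=float(np.random.uniform(5, 20)))  # random SNR per utterance
```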
SAP Similar to the SGC task, due to the poor performance of the automated labeling model, only manually labeled data are used for training. We use KeSpeech (Tang et al., 2021), Datatang-Kid (Datatang Tech Co Ltd, 2024), Magicdata-Read (OpenSLR, 2019), Kaggle-Common Voice (Kaggle Community, 2017), AISHELL-ASR0060 (AISHELL Tech Co Ltd, 2024), and AISHELL-ASR0018 (AISHELL Tech Co Ltd, 2024) as the training datasets for the SAP task, as these datasets provide reliable speaker age labels. Noisy data augmentation is also utilized in this task.
STTC For the STTC task, we use three types of data. First, we utilize a human-recorded audio question-answer dataset, Databaker-Conversation (Databaker Tech Co Ltd, 2024). Then, we use the text-based dialogue dataset LCCC (Wang et al., 2020) and the ChatTTS system with random speaker capabilities to generate the utterances of the questioner in each dialogue, thus obtaining speech-text pairs for the dialogue task. Finally, we filter suitable sentences from the WenetSpeech (Zhang et al., 2022) dataset using Qwen2.5-7B and guide the LLM to generate text answers.
3 Experiments
This section begins by presenting our training setup in Section 3.1. Subsequently, to conduct a more comprehensive evaluation of OSUM, we establish a series of complete internal evaluation sets, as detailed in Section 3.2. Finally, we report the performance of OSUM on both public and internal test sets, accompanied by an analysis in Section 3.3.
3.1 Training Setup
The two-stage training process for OSUM is detailed as follows:
Whisper Fine-tuning In the first stage, we fine-tune the Whisper-Medium model on the multi-task datasets described in Table 2. This stage is conducted on 8 Nvidia A6000 GPUs. A warm-up scheduler is employed to adjust the learning rate, peaking at 5e-5. The multi-task Whisper is trained for 150,000 steps, which takes approximately 15.3 days.
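The warm-up schedule can be sketched as follows; the warm-up length and the inverse-square-root decay after the peak are assumptions, since the text only specifies a warm-up scheduler with a 5e-5 peak learning rate.

```python
import torch

model = torch.nn.Linear(8, 8)                               # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # peak learning rate from the paper
warmup_steps = 10_000                                       # assumed warm-up length

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return (step + 1) / warmup_steps                    # linear ramp up to the 5e-5 peak
    return (warmup_steps / (step + 1)) ** 0.5               # assumed inverse-sqrt decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# training loop: call optimizer.step() and then scheduler.step() once per training step
```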
OSUM Training In the second stage, we conduct experiments on 24 Huawei Ascend NPUs, using a learning rate of 5e-5. This stage completes a total of 704,000 training steps and takes 10 days.
Table 3: Details of the internal multi-task test sets. The categories in the classification task are balanced.
Task | Name | Description | #Items
---|---|---|---
ASR | | SpeechIO is a widely used Chinese ASR leaderboard covering diverse scenarios such as TV programs, speeches, and news. We use SpeechIO 0-4 as the ASR test sets. | 11932
SRWT | Testsrwt | This test set contains recordings of ten real speakers, covering adult male and female voices as well as children's voices. It includes both storytelling and reading styles, with carefully annotated word-level timestamps. | 494
VED | Testved | This test set consists of internal recordings of ten adult male and female speakers, all read in Mandarin, with vocal events inserted at random positions. | 400
SER | Testser | The test data are selected from five public emotion evaluation sets: IEMOCAP (Busso et al., 2008), MER2023 (Lian et al., 2023), M3ED (Zhao et al., 2022), MSP-IMPROV (Busso et al., 2017), and ESD (Zhou et al., 2021), covering Chinese and English and eight emotion types. | 3885
SSR | Testssr | This test set consists of two parts: web-crawled data annotated with style labels and highly expressive data used for TTS training. It covers eight speaking styles, mainly in Chinese. | 3000
SGC | Testsgc | The internal test set contains data from two age categories, with a female-to-male ratio of 2:3. All recordings |