[论文翻译]Slamming: 在一张 GPU 上一天内训练一个语音语言模型


原文地址:https://arxiv.org/pdf/2502.15814v1


Slamming: Training a Speech Language Model on One GPU in a Day

Slamming: 在一张 GPU 上一天内训练一个语音语言模型

Abstract

摘要

We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform the predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models, and samples at https://pages.cs.huji.ac.il/adiyosslab/slamming.

我们介绍了 Slam,这是一种在单张学术 GPU 上 24 小时内高质量训练语音语言模型(SLM)的配方。我们通过模型初始化和架构的实证分析、合成训练数据、合成数据的偏好优化以及调整所有其他组件来实现这一目标。我们通过实证证明,这种训练配方在更多计算资源下也能很好地扩展,以一小部分计算成本获得与领先 SLM 相当的结果。我们希望这些见解能使 SLM 训练和研究更加普及。在 SLM 扩展定律的背景下,我们的结果远远超出了预测的计算最优性能,为 SLM 的可行性提供了乐观的展望。参见代码、数据、模型、样本 - https://pages.cs.huji.ac.il/adiyosslab/slamming。


Figure 1: Comparing Topic-StoryCloze performance of different SLMs as a function of training compute. Model size is indicated by the size of the circle.

图 1: 不同 SLM 在 Topic-StoryCloze 任务上的性能与训练计算量的关系。模型大小由圆圈的大小表示。

1 Introduction

1 引言

Speech Language Models (SLMs) have gained significant interest from researchers (Peng et al., 2024a; Cui et al., 2024; Ji et al., 2024; Latif et al., 2023), demonstrating remarkable performance in traditional speech tasks (Wang et al., 2023; Elmakies et al., 2025), diverse generative applications (Yang et al., 2023, 2024b), and reasoning over speech and audio signals (Tang et al., 2024; Chu et al., 2023).

语音语言模型 (Speech Language Models, SLMs) 引起了研究者的广泛兴趣 (Peng et al., 2024a; Cui et al., 2024; Ji et al., 2024; Latif et al., 2023),在传统语音任务 (Wang et al., 2023; Elmakies et al., 2025)、多样化生成应用 (Yang et al., 2023, 2024b) 以及对语音和音频信号的推理 (Tang et al., 2024; Chu et al., 2023) 中展现了显著的性能。

SLMs can generally be classified into two main categories: (i) generative speech Language Models (LMs) (which can also incorporate text) and (ii) speech-aware LMs. The first category follows a similar pre-training approach to text-based Large Language Models (LLMs), directly maximising the likelihood of speech considering both input and output, typically by representing audio as a sequence of discrete tokens. The second category consists of pre-trained text LMs adapted to process speech inputs. In this work, we focus on the first.

SLMs通常可以分为两大类:(i) 生成式语音语言模型 (LMs)(也可以包含文本)和 (ii) 语音感知 LMs。第一类遵循与基于文本的大语言模型 (LLMs) 类似的预训练方法,直接最大化语音的似然性,同时考虑输入和输出,通常通过将音频表示为一组离散的 Token。第二类由适应处理语音输入的预训练文本 LMs 组成。在本工作中,我们重点关注第一类。

Training high-quality SLMs can be highly resource intensive (Hassid et al., 2024; Cuervo and Marxer, 2024; Zeng et al., 2024; Nguyen et al., 2025; Défossez et al., 2024). For example, Nguyen et al. (2025) trained their SLM on approximately 570k hours of speech data, while Défossez et al. (2024) utilised around 7M hours. Additionally, Cuervo and Marxer (2024) proposed SLM scaling laws, suggesting that training high-quality SLMs requires ~3x more data compared to text-based counterparts. These computational demands restrict the fundamental research needed to enhance SLMs, such as advancements in speech tokenisation, efficient acoustic modelling, etc.

训练高质量的语音语言模型 (SLM) 可能需要大量资源 (Hassid et al., 2024; Cuervo and Marxer, 2024; Zeng et al., 2024; Nguyen et al., 2025; Défossez et al., 2024)。例如,Nguyen et al. (2025) 使用了大约 570k 小时的语音数据来训练他们的 SLM,而 Défossez et al. (2024) 使用了约 7M 小时。此外,Cuervo and Marxer (2024) 提出了 SLM 的扩展规律,表明训练高质量的 SLM 需要比基于文本的模型多约 3 倍的数据。这些计算需求限制了旨在提升 SLM 的基础研究,例如语音 token 化、高效声学建模等方面的进展。

In the Natural Language Processing (NLP) community, numerous studies have investigated efficient model training techniques, including masked language models such as Cramming (Geiping and Goldstein, 2023) and ModernBERT (Warner et al., 2024), along with next-token prediction LLMs such as MobileLLM (Liu et al., 2024b). These methods include implementation efficiencies, architectural improvements, data selection strategies, and enhancements to the overall training pipeline.

在自然语言处理 (Natural Language Processing, NLP) 领域,许多研究探索了高效的模型训练技术,包括掩码语言模型(如 Cramming (Geiping and Goldstein, 2023) 和 ModernBERT (Warner et al., 2024)),以及基于下一个 Token 预测的大语言模型(如 MobileLLM (Liu et al., 2024b))。这些方法涵盖了实现效率、架构改进、数据选择策略以及整体训练流程的优化。

Inspired by Cramming (Geiping and Goldstein, 2023) in text, we investigate compute-limited SLM training, which we term Slamming. We pose the question: Is it possible to train high-quality SLMs using a single GPU within 24 hours? For that, we conduct an extensive empirical analysis exploring how different training components influence performance. From this, we derive a training recipe that maximises model performance within a fixed compute budget. Specifically, we investigate the impact of model initialisation and architecture, various optimisers and learning rate schedulers, and data selection strategies, including the role of synthetic data, text interleaving, and preference optimisation.

受文本领域 Cramming (Geiping and Goldstein, 2023) 的启发,我们研究了计算受限的 SLM 训练,并将其称为 Slamming。我们提出一个问题:是否可以在 24 小时内使用单个 GPU 训练出高质量的 SLM?为此,我们进行了广泛的实证分析,探讨不同训练组件如何影响性能。由此,我们得出了一个在固定计算预算下最大化模型性能的训练方案。具体来说,我们研究了模型初始化和架构、各种优化器和学习率调度器、数据选择策略(包括合成数据的作用、文本交错和偏好优化)的影响。

We believe that developing these training strategies and proving their feasibility will empower the speech and audio research community to advance SLMs beyond the scope of large, well-funded academic and industrial labs. Figure 1 illustrates the performance of various SLMs relative to their training compute budget, with circle sizes representing model size. Furthermore, we compare our results with the scaling performance predicted by Cuervo and Marxer (2024). Although the authors present a somewhat pessimistic view of the computational resources needed to train high-quality SLMs, we empirically show that reality is more promising, demonstrating that it is possible to significantly exceed the predicted performance per unit of compute. We encourage the community to refine and expand scaling laws specifically tailored for SLM training across various settings.

我们相信,开发这些训练策略并证明其可行性将赋能语音和音频研究社区,推动SLMs超越大型、资金充足的学术和工业实验室的范畴。图 1 展示了各种SLMs相对于其训练计算预算的性能,圆圈大小代表模型的规模。此外,我们将我们的结果与Cuervo和Marxer (2024) 预测的扩展性能进行了对比。尽管作者对训练高质量SLMs所需的计算资源持较为悲观的态度,但我们通过实证表明,现实更为乐观,证明了在每单位计算量上显著超越预测性能是可能的。我们鼓励社区针对不同环境下的SLM训练,进一步完善和扩展特定的扩展定律。

Our main contributions are:

我们的主要贡献是:

We open-source all code, models, training recipes, and synthetic datasets.

我们开源了所有代码、模型、训练方法和合成数据集。

2 Related Work

2 相关研究

Efficient training. Enhancing the efficiency of neural network training has been extensively studied (Shen et al., 2023). Hajimolahoseini et al. (2023) and Wang et al. (2024) examined the impact of data selection on Large Language Model (LLM) training and introduced efficient data selection methods. Muhamed et al. (2024) proposed using structured sparse gradients to enhance compute efficiency in LLM training, while Rawat et al. (2024) explored the potential of leveraging smaller language models to improve the training efficiency of larger LLMs. Lv et al. (2024) investigated the use of low-dimensional projections for attention parameters to enhance training efficiency. Meanwhile, Neiterman and Ben-Artzi (2024) proposed applying LayerDrop as a technique to optimise neural network training.

高效训练。提升神经网络训练效率已被广泛研究 (Shen et al., 2023)。Hajimolahoseini et al. (2023) 和 Wang et al. (2024) 研究了数据选择对大语言模型 (LLM) 训练的影响,并提出了高效的数据选择方法。Muhamed et al. (2024) 提出了使用结构化稀疏梯度来提高 LLM 训练的计算效率,而 Rawat et al. (2024) 探索了利用较小语言模型来提高较大 LLM 训练效率的潜力。Lv et al. (2024) 研究了使用低维投影来提升注意力参数训练效率的方法。同时,Neiterman 和 Ben-Artzi (2024) 提出了应用 LayerDrop 作为优化神经网络训练的技术。

More closely related to our work, Li et al. (2023) propose a training strategy for developing LLMs within a $100K budget. Warner et al. (2024) introduce ModernBERT, an efficient training pipeline for optimising BERT models, while Izsak et al. (2021) outline a method for training a BERT model in 24 hours using 8 GPUs. The most relevant work to ours is Cramming (Geiping and Goldstein, 2023), where the authors conduct an in-depth analysis of masked LM training on a single GPU in one day.

与我们的工作更为相关的是,Li 等人 (2023) 提出了一种在 10 万美元预算内开发大语言模型的训练策略。Warner 等人 (2024) 介绍了 ModernBERT,一种用于优化 BERT 模型的高效训练管道,而 Izsak 等人 (2021) 则概述了一种在 24 小时内使用 8 个 GPU 训练 BERT 模型的方法。与我们的工作最相关的是 Cramming (Geiping 和 Goldstein, 2023),作者在一天内使用单个 GPU 对掩码语言模型训练进行了深入分析。

While these studies offer valuable insights, they primarily focus on training text models, such as LLMs and masked LMs. In the speech domain, similar research has been conducted on self-supervised representation models (Liu et al., 2024a), but not on SLMs. In this work, we address this gap by focusing on efficient SLM training.

虽然这些研究提供了宝贵的见解,但它们主要专注于训练文本模型,例如大语言模型和掩码语言模型。在语音领域,已有类似研究针对自监督表示模型 (Liu et al., 2024a),但尚未涉及语音语言模型 (SLM)。在本研究中,我们通过专注于高效的 SLM 训练来填补这一空白。

Generative speech language models were explored under various setups (Lakhotia et al., 2021; Kharitonov et al., 2021). Lakhotia et al. (2021) were the first to show how raw and uncurated speech data can be leveraged into building a Generative Spoken Language Modeling (GSLM) system. Next, Borsos et al. (2023) proposed a cascaded version using both coarse and fine speech tokens. Such a modelling framework opened up a new and promising research direction for processing and modelling spoken data, such as speech resynthesis (Polyak et al., 2021), speaking style conversion (Kreuk et al., 2021; Maimon and Adi,

生成式语音语言模型在各种设置下被探索(Lakhotia等,2021;Kharitonov等,2021)。Lakhotia等人(2021)首次展示了如何将原始且未经过整理的语音数据用于构建生成式语音语言建模(GSLM)系统。随后,Borsos等人(2023)提出了一个使用粗粒度与细粒度语音Token的级联版本。这种建模框架为处理和建模语音数据开辟了一个新的、有前景的研究方向,例如语音重新合成(Polyak等,2021)、语音风格转换(Kreuk等,2021;Maimon和Adi,

2023), dialogue modelling (Nguyen et al., 2022), speech-to-speech translation (Popuri et al., 2022; Peng et al., 2024b), etc. Nachmani et al. (2024) proposed augmenting a text Language Model (LM) with continuous speech data to improve spoken question-answering tasks. Recently, Park et al. (2024) proposed SLM based on state-space models (Gu et al., 2021) to further push long context- efficient modelling, while Lin et al. (2024) proposed to fine-tune SLMs using direct preference optimisation (Rafailov et al., 2024) obtained from text LLM rankings.

2023), 对话建模 (Nguyen et al., 2022), 语音到语音翻译 (Popuri et al., 2022; Peng et al., 2024b) 等。Nachmani et al. (2024) 提出了通过增强文本语言模型 (LM) 与连续语音数据来改进口语问答任务。最近,Park et al. (2024) 提出了基于状态空间模型 (Gu et al., 2021) 的 SLM 以进一步推动长上下文高效建模,而 Lin et al. (2024) 提出了使用从文本大语言模型排名中获得的直接偏好优化 (Rafailov et al., 2024) 来微调 SLM。

Similar to text LLMs, training SLMs often demands large-scale datasets. For instance, Moshi (Défossez et al., 2024) was trained on 7 million hours of speech data, SpiritLM (Nguyen et al., 2025) utilised 560k hours, and TWIST (Hassid et al., 2024) was trained on approximately 150k hours. Recently, Cuervo and Marxer (2024) introduced the first scaling laws for SLMs, suggesting that achieving comparable performance to text LMs requires three times more tokens. In this work, we focus on reducing the computational demands while maintaining performance comparable to leading SLMs.

类似于文本大语言模型,训练 SLM 通常需要大规模数据集。例如,Moshi (Défossez et al., 2024) 使用了 700 万小时的语音数据进行训练,SpiritLM (Nguyen et al., 2025) 使用了 560k 小时,而 TWIST (Hassid et al., 2024) 则训练了大约 150k 小时的数据。最近,Cuervo 和 Marxer (2024) 提出了 SLM 的第一个扩展法则,表明要达到与文本语言模型相当的性能,需要三倍的 Token。在这项工作中,我们专注于在保持与领先 SLM 相当性能的同时,减少计算需求。

3 Setup

3 设置

In this study, we explore decoder-only generative SLMs, which aim at maximising the likelihood of speech samples represented as discrete tokens. We examine both purely speech-based SLMs trained on speech tokens and joint speech-text SLMs using interleaving strategies (Nguyen et al., 2025). Similarly to Hassid et al. (2024); Lakhotia et al. (2021), we obtain speech tokens by quantising continuous latent representations of a self-supervised speech representation model using the k-means algorithm; these are often known as semantic tokens. Specifically, we utilise a multilingual HuBERT (Hsu et al., 2021) model running at 25 Hz, as employed in Hassid et al. (2024). We then train SLMs by minimising the negative log-likelihood of the input segments.

在本研究中,我们探索了仅解码器的生成式 SLM(Speech Language Model),其目标是最大化表示为离散 Token 的语音样本的似然。我们研究了基于纯语音 Token 训练的 SLM 以及使用交错策略的联合语音-文本 SLM(Nguyen et al., 2025)。与 Hassid et al. (2024) 和 Lakhotia et al. (2021) 类似,我们通过使用 k-means 算法对自监督语音表示模型的连续潜在表示进行量化来获取语音 Token,通常称为语义 Token。具体而言,我们使用了在 25 Hz 下运行的多语言 HuBERT(Hsu et al., 2021)模型,如 Hassid et al. (2024) 中所采用的那样。然后,我们通过最小化输入片段的负对数似然来训练 SLM。
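
As a rough illustration of the tokenisation step above, the sketch below assigns each continuous feature frame to its nearest codebook centroid. The codebook here is a toy stand-in for the actual k-means codebook learned over 25 Hz HuBERT features; all sizes and values are illustrative.

```python
import math

def quantise(features, codebook):
    """Assign each continuous feature vector to the index of its nearest
    codebook centroid (squared Euclidean distance), yielding a discrete
    token sequence. This mirrors k-means tokenisation of HuBERT frames
    at a high level; the real pipeline uses a learned codebook."""
    tokens = []
    for frame in features:
        best, best_d = None, math.inf
        for idx, centroid in enumerate(codebook):
            d = sum((f - c) ** 2 for f, c in zip(frame, centroid))
            if d < best_d:
                best, best_d = idx, d
        tokens.append(best)
    return tokens
```

The resulting integer sequence is what the decoder-only LM is trained on with a standard next-token objective.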

Unless mentioned otherwise, all SLMs are trained using a single A5000 GPU (24GB VRAM) along with 16 CPU cores for 24 hours. We deliberately focus on this constrained compute budget, assuming that most academic labs can access similar resources, thereby ensuring the accessibility of our research. The training data is pre-processed, i.e. extracting HuBERT units and dividing data into chunks, and stored prior to model training. As a result, this pre-processing time is excluded from the compute budget. This approach, aligned with Geiping and Goldstein (2023), is practical since many research experiments utilise the same pre-processed data. We additionally do not count the time for running validation and visualisations, as they are not part of the optimisation pipeline and are used for demonstration purposes only.

除非另有说明,所有 SLM 均使用单个 A5000 GPU(24GB VRAM)和 16 个 CPU 核心训练 24 小时。我们特意专注于这种受限的计算预算,假设大多数学术实验室可以访问类似的资源,从而确保我们研究的可访问性。训练数据经过预处理(即提取 HuBERT 单元并将数据分成块),并在模型训练之前存储。因此,预处理时间不计入计算预算。这种方法与 Geiping 和 Goldstein (2023) 保持一致,是实用的,因为许多研究实验使用相同的预处理数据。此外,我们不计算运行验证和可视化的时间,因为它们不用于优化管道,仅用于演示目的。

Evaluation metrics. We assess all SLMs using four distinct evaluation metrics. The first three are based on likelihood evaluation, while the fourth is a generative metric. For likelihood-based modelling we consider sBLIMP (Dunbar et al., 2021), Spoken StoryCloze (SSC), and Topic StoryCloze (TSC) (Hassid et al., 2024). For these likelihood-based metrics, we evaluate the likelihood assigned by the SLMs to pairs of speech utterances, each consisting of a positive example and a distractor, and calculate the percentage of pairs in which the SLM assigns higher likelihood to the positive sample. sBLIMP focuses on grammatical abilities, thus the distractor is an ungrammatical version of the positive. SSC and TSC focus on semantic modelling abilities. In SSC, the distractor suffix is taken from the original textual StoryCloze dataset (Mostafazadeh et al., 2016), allowing us to assess fine-grained semantic speech understanding. In TSC, however, the distractor suffix is drawn from a different topic, enabling us to evaluate the model's ability to understand the overall semantic concept.

评估指标。我们使用四种不同的评估指标来评估所有 SLM。前三个基于似然评估,而第四个是生成指标。对于基于似然的建模,我们考虑 sBLIMP (Dunbar et al., 2021)、Spoken StoryCloze (SSC) 和 Topic StoryCloze (TSC) (Hassid et al., 2024)。对于这些基于似然的指标,我们评估 SLM 对语音话语对(由正例和干扰项组成)的似然,并计算 SLM 对正样本赋予更高似然的配对百分比。sBLIMP 关注语法能力,因此干扰项是正例的语法错误版本。SSC 和 TSC 关注语义建模能力。在 SSC 中,干扰项后缀来自原始文本 StoryCloze 数据集 (Mostafazadeh et al., 2016),允许评估细粒度的语义语音理解。在 TSC 中,干扰项后缀来自不同主题,使我们能够评估模型理解整体语义概念的能力。
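
These pairwise likelihood metrics reduce to a simple accuracy computation. A minimal sketch follows; in practice each log-likelihood would come from summing the SLM's token log-probabilities over the utterance, which is not shown here.

```python
def pairwise_accuracy(pairs):
    """pairs: list of (loglik_positive, loglik_distractor) tuples, one per
    evaluation pair. Returns the percentage of pairs in which the model
    assigns strictly higher log-likelihood to the positive utterance."""
    wins = sum(1 for pos, neg in pairs if pos > neg)
    return 100.0 * wins / len(pairs)
```

A random model scores near 50% on such paired tests, which is the baseline against which sBLIMP, SSC, and TSC numbers should be read.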

Finally, to assess the generative abilities of SLMs, we compute generative perplexity (GenPPL). Following the approach of Lakhotia et al. (2021); Hassid et al. (2024), we provide the SLM with a short speech prompt and generate a speech-token continuation. We use a unit vocoder with duration prediction to convert the tokens into speech (Polyak et al., 2021; Hassid et al., 2024). The generated speech is then transcribed, and its Perplexity (PPL) is evaluated using a pre-trained text LLM. To minimise the impact of token repetition on PPL measurements, we ground the generated text using diversity metrics derived from the auto-BLEU score (Lakhotia et al., 2021). Similarly to Lin et al. (2024), we use bigram auto-BLEU. In other words, we ensure that all models achieve similar auto-BLEU scores, allowing for a fair comparison of PPL. Specifically, we transcribe speech segments using the Whisper-large-v3-turbo model (Radford et al., 2023) and measure PPL using the Llama-3.2-1B model (LLama, 2024).

最后,为了评估 SLM 的生成能力,我们计算生成困惑度 (GenPPL)。按照 (Lakhotia et al., 2021; Hassid et al., 2024) 的方法,我们为 SLM 提供一个简短的语音提示并生成语音 token 的延续。我们使用带有时长预测的 unit-vocoder 将 token 转换为语音 (Polyak et al., 2021; Hassid et al., 2024)。生成的语音随后被转录,并使用预训练的文本大语言模型评估其困惑度 (PPL)。为了最小化 token 重复对 PPL 测量的影响,我们使用从 auto-BLEU 分数得出的多样性指标来对生成的文本进行基准化 (Lakhotia et al., 2021)。与 Lin et al. (2024) 类似,我们使用 bigram autoBLEU。换句话说,我们确保所有模型都达到相似的 auto-BLEU 分数,从而允许对 PPL 进行公平的比较。具体来说,我们使用 Whisper-large-v3-turbo 模型 (Radford et al., 2023) 转录语音片段,并使用 Llama-3.2-1B 模型 (LLama, 2024) 测量 PPL。
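
One simple way to compute a bigram auto-BLEU for this grounding is to count how many of a sequence's bigrams also occur elsewhere in the same sequence. The sketch below follows the spirit of Lakhotia et al. (2021); the reference implementation may differ in details such as clipping and smoothing.

```python
def auto_bleu2(tokens):
    """Bigram auto-BLEU sketch: the fraction of bigrams in a sequence
    that also occur elsewhere in the same sequence. Higher values mean
    more self-repetition, which artificially lowers text-LM perplexity."""
    bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    if not bigrams:
        return 0.0
    repeated = sum(1 for i, g in enumerate(bigrams)
                   if g in bigrams[:i] or g in bigrams[i + 1:])
    return repeated / len(bigrams)
```

Grounding on this score means comparing GenPPL only between models whose generations show similar repetition levels.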

Software efficiency. To maximise performance within 24 hours of model training, we leverage multiple efficient implementations. Through extensive performance testing, we found that using bfloat16 (Kalamkar et al., 2019) alongside Flash Attention 2 (Dao, 2023) and data packing provided the most efficient compute performance in our setup. We also experimented with model compilation using torch.compile (Ansel et al., 2024), but it lacked native compatibility with Flash Attention 2 at the time of our study, and its performance without Flash Attention 2 was subpar. Future work could investigate this further with more efficient attention implementations (Shah et al., 2024; Li et al., 2024).

软件效率。为了在模型训练的 24 小时内最大化性能,我们利用了多种高效实现。通过广泛的性能测试,我们发现使用 bfloat16 (Kalamkar et al., 2019) 结合 Flash Attention 2 (Dao, 2023) 和数据打包在我们的设置中提供了最高效的计算性能。我们还尝试了使用 torch.compile (Ansel et al., 2024) 进行模型编译,但在我们的研究期间,它缺乏与 Flash Attention 2 的原生兼容性,并且在没有 Flash Attention 2 的情况下性能较差。未来的工作可以结合更高效的注意力实现 (Shah et al., 2024; Li et al., 2024) 进一步研究这一点。
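
For reference, a minimal sketch of how bfloat16 and Flash Attention 2 are typically enabled when loading a Hugging Face model. This is a configuration fragment under assumed library versions, not the authors' exact code; `Qwen/Qwen2.5-0.5B` stands in for whichever checkpoint is used, and data packing (concatenating samples to fill the context window) is handled separately in the data pipeline.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the decoder in bfloat16 and request the Flash Attention 2 kernels.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Note that, as the text observes, `torch.compile` did not compose with Flash Attention 2 at the time of the study, so the two optimisations could not be stacked.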

To enable rapid and scalable experimentation, we developed a specialised library for SLM training that supports various model architectures, training objectives, and evaluation metrics. This library accommodates TWIST-style training, text-speech interleaving, preference optimisation, etc. We will open-source this package along with all model weights and training recipes, aiming to empower the community to further explore SLMs.

为实现快速且可扩展的实验,我们开发了一个用于 SLM 训练的专用库,该库支持多种模型架构、训练目标和评估指标。该库兼容 TWIST 风格的训练、文本-语音交错、偏好优化等方法。我们将开源该库以及所有模型权重和训练方案,旨在赋能社区进一步探索 SLM。

4 Investigations

4 调查

With this setup, we systematically analyse and ablate each component of the training pipeline, ultimately refining an optimised cookbook for training SLMs. We specifically examine the influence of model family, initialisation, size, and architectural choices (e.g., dropout, positional embedding, etc.). We analyse optimisation parameters and data characteristics. Lastly, we explore alternative training objectives beyond standard next-token prediction, including speech-text interleaving and direct preference optimisation using synthetic data.

在此设置下,我们系统地分析和消融了训练流程中的每个组件,最终形成了一份优化的 SLM 训练指南。我们特别研究了模型家族、初始化、大小和架构选择(例如,dropout、位置嵌入等)的影响。我们分析了优化参数和数据特征。最后,我们探索了标准的下一个 Token 预测之外的替代训练目标,包括语音-文本交错和使用合成数据进行直接偏好优化。

4.1 Model & Optimisation

4.1 模型与优化

Hyper-parameters. Unless specified otherwise, we use a context length of 512 tokens and an effective batch size of 256, employing gradient accumulation when necessary, as preliminary results indicated this configuration yields the best overall performance. We set the peak learning rate to 1e-3 to enhance training speed and use a warmup period of 1% of the total training steps, as this proved more effective than the fixed 100-step warmup used in the original TWIST. To improve training stability, particularly with large learning rates, we apply gradient norm clipping with a threshold of 0.5 at no additional cost, following Geiping and Goldstein (2023). Unless modified later in our investigation, we use an inverse-square-root scheduler and the AdamW optimiser (Loshchilov, 2017).

超参数。除非另有说明,我们使用 512 个 Token 的上下文长度和 256 的有效批量大小,在必要时采用梯度累积,因为初步结果表明这种配置能够带来最佳的整体性能。我们将峰值学习率设置为 1e-3 以提高训练速度,并使用总训练步数 1% 的热身期,因为这比原始 TWIST 中使用的固定 100 步热身更有效。为了提高训练稳定性,尤其是在使用大学习率时,我们按照 Geiping 和 Goldstein (2023) 的方法,以 0.5 的阈值进行梯度范数裁剪,且不增加额外成本。除非在后续研究中有所修改,否则我们使用逆平方根调度器和 AdamW 优化器 (Loshchilov, 2017)。
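
The warmup and scheduler choices above can be sketched as a small function. The exact decay constant in the actual trainer may differ; this simply illustrates a 1% linear warmup to a 1e-3 peak followed by inverse-square-root decay, made continuous at the end of warmup.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_frac=0.01):
    """Linear warmup over warmup_frac of total steps up to peak_lr, then
    inverse-square-root decay. Illustrative, not the exact trainer code."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return peak_lr * (step + 1) / warmup  # linear ramp to the peak
    return peak_lr * math.sqrt(warmup / (step + 1))  # ~ 1/sqrt(step) decay
```

For a 10,000-step run this warms up over the first 100 steps and then decays, e.g. reaching half the peak learning rate at step 400.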


Figure 2: Comparing PPL of different models of similar parameter count, with and without TWIST initialisation.

图 2: 比较不同模型在参数数量相似的情况下,使用和不使用 TWIST 初始化的 PPL (Perplexity)。

Initialisation. Hassid et al. (2024) empirically demonstrated that initialising SLMs with pre-trained text LMs can enhance convergence speed and improve model performance. We examine the effect of this initialisation within our setup across different model types. To do so, we train multiple models, both with and without TWIST initialisation, while staying within our compute budget. As shown in Figure 2, TWIST initialisation benefits all evaluated models at the beginning of training, though its overall impact by the end varies. Notice, the X-axis in Figure 2 represents theoretical FLOPs, calculated as $6 \cdot N_{params} \cdot D_{tokens}$ following Hoffmann et al. (2022). However, due to variations in model architecture and implementation, practical efficiency differs, leading to varying amounts of compute processed within 24 hours.

初始化。Hassid 等 (2024) 通过实验证明,使用预训练的文本语言模型初始化 SLM 可以加速收敛并提升模型性能。我们在不同模型类型中检验了这种初始化在我们设置下的效果。为此,我们在计算预算范围内训练了多个模型,包括使用和不使用 TWIST 初始化的版本。如图 2 所示,TWIST 初始化在训练初期对所有评估模型都有益,但其最终影响各不相同。请注意,图 2 中的 X 轴表示理论 FLOPs,按照 Hoffmann 等 (2022) 的方法计算为 $6 \cdot N_{params} \cdot D_{tokens}$。然而,由于模型架构和实现的差异,实际效率有所不同,导致 24 小时内处理的计算量也有所不同。
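
The theoretical-FLOPs estimate used for the X-axis is simple to reproduce:

```python
def training_flops(n_params, n_tokens):
    """Theoretical training compute following Hoffmann et al. (2022):
    roughly 2*N*D FLOPs for the forward pass plus 4*N*D for the backward
    pass, i.e. 6 * N_params * D_tokens in total."""
    return 6 * n_params * n_tokens
```

For example, a 125M-parameter model trained over 1B tokens corresponds to 6 × 1.25e8 × 1e9 = 7.5e17 theoretical FLOPs, regardless of how efficiently a given implementation reaches that token count in wall-clock time.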

Results suggest that the benefits of TWIST initialisation can be substantial, especially for top-performing models like Qwen2.5. As a result, we prioritise investigations based on existing pre-trained text LMs. Interestingly, the results in Figure 2 demonstrate that Qwen2.5 outperforms other models even without TWIST initialisation, perhaps suggesting that its architectural design choices or size might also provide some benefit.

结果表明,TWIST初始化的好处可能非常显著,尤其是对于像Qwen2.5这样的顶级模型。因此,我们优先基于现有的预训练文本大语言模型进行研究。有趣的是,图 2 中的结果表明,即使没有TWIST初始化,Qwen2.5的表现也优于其他模型,这可能表明其架构设计选择或模型规模也带来了一定的优势。

Optimal model size & family. Cuervo and Marxer (2024) conducted a scaling analysis on GSLM-style SLMs, estimating the optimal model size and token count for a compute-efficient model. However, using a text LM initialisation might impact these findings. As we observe, TWIST initialisation greatly impacts model performance, suggesting that prioritising larger models may be more effective than simply increasing the dataset size. Additionally, various model families gain different advantages from TWIST initialisation; for example, Qwen2.5 models show significantly better performance compared to OPT models. In Figure 3, we compare the results under the pre-defined compute budget within model families. We note that the best model sizes for MobileLLM (Liu et al., 2024b), SmolLM2 (Allal et al., 2025) and Pythia (Biderman et al., 2023) are ~300M parameters, while for OPT the best is 125M. According to Cuervo and Marxer (2024), the estimated optimal model size is approximately 66M parameters. However, the best-performing model, Qwen2.5, is significantly larger. Since there are no smaller models in this family, it is difficult to determine whether this deviation is due to the quality of the initialisation or other factors. Moving forward, we proceed with both OPT-125M and Qwen2.5-0.5B.

最优模型规模与家族。Cuervo 和 Marxer (2024) 对 GSLM 风格的 SLM 进行了规模分析,估算了计算高效模型的最优模型规模和 Token 数量。然而,使用文本语言模型初始化可能会影响这些发现。正如我们观察到的,TWIST 初始化对模型性能有显著影响,这表明优先考虑更大的模型可能比单纯增加数据集规模更有效。此外,不同模型家族从 TWIST 初始化中获得不同的优势;例如,Qwen2.5 模型相比 OPT 模型表现出显著更好的性能。在图 3 中,我们比较了在预定义计算预算内不同模型家族的结果。我们注意到,MobileLLM (Liu 等, 2024b)、SmolLM2 (Allal 等, 2025) 和 Pythia (Biderman 等, 2023) 的最佳模型规模约为 3 亿参数,而 OPT 的最佳模型规模为 1.25 亿。根据 Cuervo 和 Marxer (2024) 的估计,最优模型规模约为 6600 万参数。然而,表现最好的模型 Qwen2.5 规模显著更大。由于该家族中没有更小的模型,难以确定这种偏差是由初始化质量还是其他因素引起的。接下来,我们将同时使用 OPT-125M 和 Qwen2.5-0.5B。


Figure 3: Comparing PPL of different models under TWIST initialisation.

图 3: 不同模型在 TWIST 初始化下的 PPL 对比。


Figure 4: Comparing validation PPL of our best model with different optimisers and schedulers.

图 4: 比较我们最佳模型在使用不同优化器和调度器时的验证 PPL (Perplexity)。

Dropout. The original OPT models include dropout to mitigate overfitting. Although dropout is beneficial for regularisation, it effectively decreases the number of gradient updates per parameter without shortening the wall time of each update step, and hence reduces the number of parameter updates per second. Following Geiping and Goldstein (2023), we experiment with removing dropout and observe improved performance in our setup.

Dropout。原始的 OPT 模型包含了 dropout 以减少过拟合。虽然 dropout 对正则化有益,但它实际上减少了每个参数的梯度更新次数,而不会缩短更新步骤的用时。因此,它降低了每秒的参数更新次数。根据 Geiping 和 Goldstein (2023) 的研究,我们尝试移除 dropout,并在我们的设置中观察到了性能的提升。

Positional Encoding. Transformers rely on positional encoding to capture the order of input tokens. Many modern LMs, including the Qwen models, use Rotary Position Embedding (Su et al., 2024). This method uses a hyperparameter, $\theta$, to control the trade-off between granularity and the ability to handle long contexts. $\theta$ is often tuned to accommodate longer context lengths (Yang et al., 2024a; Roziere et al., 2023). Since our context length is significantly shorter than that of the original LLM, we explore reducing $\theta$ for potential performance gains. Our findings show that setting $\theta = 10{,}000$ with a context length of 1024 enhances performance, so we adopt this configuration moving forward. We note that since we increase the context length, we need to reduce the batch size as well, so as not to run into memory problems during training. We halve the batch size and keep the same number of gradient accumulation steps, which gives us an effective batch size of 128.

位置编码 (Positional Encoding)。Transformer 依靠位置编码来捕捉输入 Token 的顺序。许多现代大语言模型,包括 Qwen 模型,使用旋转位置嵌入 (Rotary Position Embedding) (Su 等人, 2024)。该方法使用一个超参数 $\theta$ 来控制粒度与处理长上下文能力之间的权衡。$\theta$ 通常被调整以适应更长的上下文长度 (Yang 等人, 2024a; Roziere 等人, 2023)。由于我们的上下文长度明显短于原始大语言模型,我们探索减少 $\theta$ 以获得潜在的性能提升。我们的研究结果显示,在上下文长度为 1024 的情况下,设置 $\theta = 10{,}000$ 可以提高性能,因此我们采用了这一配置。我们注意到,由于我们增加了上下文长度,我们也需要减少批量大小,以避免在训练时遇到内存问题。我们将批量大小减少一半,并保持相同的梯度累积步数,从而得到 128 的有效批量大小。
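
The role of $\theta$ can be seen directly from the standard RoPE frequency formula; the sketch below computes the per-dimension rotation frequencies (the head dimension used here is illustrative).

```python
def rope_freqs(head_dim, theta):
    """Per-pair rotation frequencies in Rotary Position Embedding:
    freq_i = theta ** (-2*i / head_dim) for i = 0 .. head_dim/2 - 1.
    A smaller theta makes the slow dimensions rotate faster, trading
    long-context capacity for finer positional granularity."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```

With a short 1024-token context, the slowest frequencies of a large-$\theta$ configuration barely rotate at all, which is the intuition behind reducing $\theta$ here.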

Optimiser and Scheduler. Various optimisers and schedulers have been developed to enhance training efficiency, reduce memory usage (Shazeer and Stern, 2018; Dettmers et al., 2022), or accelerate convergence (Pagliardini et al., 2024; Chen et al., 2023). With limited compute, faster convergence and better memory efficiency can be especially important. We first consider efficient optimisers, specifically AdamW with fused kernels and 8-bit AdamW, but observe no notable improvements in batch size or runtime compared to standard AdamW. This could be due to the relatively small model size, resulting in a minimal memory footprint for the optimisers. We then compare AdamW with two state-of-the-art optimisers: AdaLomo (Lv et al., 2023) and AdEMAMix (Pagliardini et al.,

优化器和调度器。为提升训练效率、降低内存占用 (Shazeer and Stern, 2018; Dettmers et al., 2022) 或加速收敛 (Pagliardini et al., 2024; Chen et al., 2023),研究者开发了多种优化器和调度器。在计算资源有限的情况下,更快的收敛和更好的内存效率尤为重要。我们首先考虑高效优化器,具体包括带融合内核的 AdamW 和 8-bit AdamW,但与标准 AdamW 相比,在批量大小或运行时间上未观察到明显改进。这可能是因为模型规模相对较小,优化器的内存占用很小。随后,我们将 AdamW 与两种最先进的优化器进行比较:AdaLomo (Lv et al., 2023) 和 AdEMAMix (Pagliardini et al., 2024)。

Table 1: Analysing the impact of training data diversity and synthetic data on SLM performance. The default Slam recipe does not use diverse data (only Libri-light and LibriSpeech), but uses the synthetic sTinyStories data.

| 模型 | 多样数据 (Div.) | 合成数据 (Syn.) | sBLIMP↑ | sSC↑ | tSC↑ | GenPPL↓ |
|---|---|---|---|---|---|---|
| OPT-125M | ✗ | ✓ | 55.28 | 55.46 | 75.18 | 96.8 |
| OPT-125M | ✓ | ✓ | 55.06 | 55.00 | 74.83 | 116.6 |
| OPT-125M | ✗ | ✗ | 55.88 | 54.52 | 70.82 | 160.3 |
| OPT-125M | ✓ | ✗ | 55.65 | 54.78 | 70.18 | 172.7 |
| Qwen-0.5B | ✗ | ✓ | 56.45 | 55.59 | 78.01 | 88.3 |
| Qwen-0.5B | ✓ | ✓ | 56.17 | 55.37 | 77.13 | 101.3 |
| Qwen-0.5B | ✗ | ✗ | 56.60 | 53.50 | 71.14 | 145.4 |
| Qwen-0.5B | ✓ | ✗ | 56.10 | 53.72 | 70.66 | 161.8 |

表 1: 分析训练数据多样性和合成数据对 SLM 性能的影响。默认的 Slam 配方不使用多样化的数据(仅使用 Libri-light 和 LibriSpeech),但使用了合成的 sTinyStories 数据。

2024). Results, presented in Figure 4, suggest that with the original InverseSqrt scheduler used by Hassid et al. (2024), using AdEMAMix improves validation loss compared to AdamW, with AdaLomo far behind.

2024)。图 4 中的结果表明,在使用 Hassid 等人 (2024) 原始的 InverseSqrt 调度器时,AdEMAMix 相比 AdamW 改善了验证损失,而 AdaLomo 则远远落后。

Next, we analyse a cosine decay learning rate scheduler in place of the original InverseSqrt, as this was shown to improve convergence (Loshchilov and Hutter, 2016). We consider the previous optimisers and show the validation loss throughout training in Figure 4. We see that this notably improved the loss for AdamW, and slightly harmed results for AdEMAMix. Overall, AdamW with a cosine schedule provides the best setup, far outperforming the original one.

接下来,我们分析了一个余弦衰减学习率调度器,以取代原始的 InverseSqrt,因为研究表明这可以改善收敛性(Loshchilov 和 Hutter,2016)。我们考虑了之前的优化器,并在图 4 中提供了整个训练过程中的验证损失。我们发现这显著改善了 AdamW 的损失,但略微损害了 AdEMAMix 的结果。总体而言,使用余弦调度的 AdamW 提供了最佳设置,远远优于原始设置。
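
For completeness, the cosine-decay schedule compared above can be sketched as follows; the warmup fraction and learning-rate constants mirror the earlier hyper-parameter choices and are otherwise illustrative.

```python
import math

def cosine_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.01, min_lr=0.0):
    """Linear warmup, then cosine decay from peak_lr to min_lr; the
    schedule that, paired with AdamW, gave the best validation loss."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return peak_lr * (step + 1) / warmup  # linear ramp to the peak
    progress = (step - warmup) / max(1, total_steps - warmup)  # 0 -> 1
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Unlike the inverse-square-root schedule, this decays all the way to `min_lr` by the end of the fixed 24-hour budget, which is one plausible reason it converges better in this setting.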

4.2 Data

4.2 数据

Next, we examine how the training data-mix influences performance in a compute-constrained setting. Specifically, we explore whether diversity in accents, speaking styles, etc. is beneficial and assess whether synthetic data can enhance semantic modelling abilities.

接下来,我们探讨在计算受限的环境中,训练数据组合如何影响性能。具体来说,我们研究口音、说话风格等的多样性是否有利,并评估合成数据是否能增强语义建模能力。

Diverse Data. We begin by examining how dataset diversity impacts model performance. Many leading speech datasets, such as those based on audiobooks (Panayotov et al., 2015; Kahn et al., 2020), consist of relatively clean, single-speaker recordings within a specific content domain. To introduce greater diversity in speaking styles and content, we curate additional datasets, including VoxPopuli (Wang et al., 2021b), TED-LIUM (Hernandez et al., 2018), People's Speech (Galvez et al., 2021), and SWC (Baumann et al., 2018). For all mentioned datasets, we use the official data cleaning and preprocessing scripts when available. Specifically, for Libri-light, we apply the official Voice Activity

多样数据。我们首先考察数据集多样性对模型性能的影响。许多领先的语音数据集,例如基于有声读物的数据集 (Panayotov et al., 2015; Kahn et al., 2020),由特定内容领域内相对干净的单说话人录音组成。为了在说话风格和内容上引入更大的多样性,我们整理了额外的数据集,包括 VoxPopuli (Wang et al., 2021b)、TED-LIUM (Hernandez et al., 2018)、People's Speech (Galvez et al., 2021) 和 SWC (Baumann et al., 2018)。对于上述所有数据集,我们在可用时使用官方的数据清理和预处理脚本。具体而言,对于 Libri-light,我们应用官方的语音活动


Figure 5: Analysing the optimal part of the 24 hour compute budget that should be used for DPO, with the rest used for pre-training.

图 5: 分析 24 小时计算预算中应用于 DPO 的最佳部分,其余部分用于预训练。

Detection model to remove silences and generate smaller audio segments. To evaluate the impact of dataset diversity, we compare the performance of SLMs trained using our best training recipes on a subset of LibriSpeech and Libri-light against all curated datasets. This comparison is conducted for both OPT-125M, which processes a large number of tokens during training, and Qwen-0.5B, which encounters significantly less data due to its model size. Results are summarised in Table 1. We observe that dataset diversity has an overall negative effect on model performance. We hypothesise this is due to the models struggling to model rich and complex audio under such low compute resources.

检测模型来去除静音并生成更小的音频片段。为了评估数据集多样性的影响,我们将使用最佳训练方案在 LibriSpeech 和 Libri-light 子集上训练的 SLM 与在所有精选数据集上训练的 SLM 进行对比。该对比同时在 OPT-125M(训练期间处理大量 Token)和 Qwen-0.5B(由于模型较大而接触的数据明显更少)上进行。结果总结在表 1 中。我们观察到,数据集多样性对模型性能有总体负面影响。我们推测这是由于模型在如此低的计算资源下难以建模丰富且复杂的音频。

Synthetic Data. Recent studies have highlighted the potential of synthetic data generated through Text-to-Speech (TTS) (Cuervo and Marxer, 2024) or direct text-to-unit conversion (Zeng et al., 2024). Hence, we examine the impact of including synthetically generated speech within our constrained compute setup. To do so, we synthesised the TinyStories dataset (Eldan and Li, 2023) using a single-speaker TTS model (Wang et al., 2021a), as it is computationally efficient. Additionally, prior research has shown that HuBERT units largely remove speaker information (Maimon and Adi, 2023). TinyStories has been demonstrated to enhance text LM performance and improve SLMs (Cuervo and Marxer, 2024). Results are presented in Table 1, and indicate that incorporating such synthetic data into the training data-mix significantly boosts both modelling and generative performance metrics across all evaluated setups. We also consider adding the synthetic data to the original TWIST recipe, and the results at the bottom of Table 2 suggest that while this helps with semantic metrics, it is far from sufficient without the other optimisations we introduced.

合成数据。近期研究强调了通过文本转语音 (TTS) (Cuervo and Marxer, 2024) 或直接文本到单元转换 (Zeng et al., 2024) 生成合成数据的潜力。因此,我们考察了在受限计算设置下加入合成语音的影响。为此,我们使用单说话人 TTS 模型 (Wang et al., 2021a) 合成了 TinyStories 数据集 (Eldan and Li, 2023),因为它计算效率高;此外,先前研究表明 HuBERT 单元在很大程度上去除了说话人信息 (Maimon and Adi, 2023)。TinyStories 已被证明能提升文本语言模型的性能并改进 SLM (Cuervo and Marxer, 2024)。结果见表 1,表明在所有评估设置下,将此类合成数据加入训练数据混合都显著提升了建模和生成性能指标。我们还考虑将合成数据加入原始 TWIST 配方,表 2 底部的结果表明,虽然这有助于语义指标,但若没有我们引入的其他优化,仍远远不够。

Table 2: Comparing Slam to leading SLMs, and to the predicted optimal performance for our compute budget. We also evaluate TWIST-350M using our code and compute budget, but with the original training recipe. $\pm$ indicates the distance to the min/max over 3 seeds.

| 模型 | 计算量 (GPU 天) | 参数量 | sBLIMP↑ | sStoryCloze↑ | tStoryCloze↑ | GenPPL↓ | Auto-BLEU↓ |
|---|---|---|---|---|---|---|---|
| TWIST-350M (Hassid et al., 2024) | 40*V100 | 305M | 56.20 | – | – | 137.3 | 3.46 |
| TWIST-1.3B (Hassid et al., 2024) | 160*V100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B (Hassid et al., 2024) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
| TWIST-13B (Hassid et al., 2024) | ? | 13B | 59.20 | 55.4 | 76.4 | – | – |
| Scaled Optimal (Cuervo and Marxer, 2024) | ? | 823M | 61.3 | 56.7 | 78.0 | – | – |
| Predicted Optimal (Cuervo and Marxer, 2024) | 1*A5000 | 78M | 56.85 | 54.09 | 70.49 | – | – |
| TWIST-350M (原始配方) | 1*A5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
| TWIST-350M + sTinyStories | 1*A5000 | 305M | 51.21 ± .26 | 54.17 ± .54 | 72.40 ± .18 | 159.0 ± 6.0 | 4.18 ± .24 |
| Slam (-DPO) (ours) | 1*A5000 | 358M | 56.45 ± .17 | 55.59 ± .30 | 78.01 ± .27 | 88.3 ± 1.0 | 3.47 ± .17 |
| Slam (ours) | 1*A5000 | 358M | 58.86 ± .20 | 58.04 ± .51 | 82.04 ± .21 | 62.8 ± 4.1 | 3.88 ± .11 |

表 2: 比较 Slam 与其他领先的 SLM,以及在我们计算预算下预测的最优性能。我们还评估了使用我们的代码和计算预算、但采用原始训练配方的 TWIST-350M。$\pm$ 表示到 3 个种子最小/最大值的距离。

We see that across all datasets, and specifically with our best mixture of Libri-Light, LibriSpeech and sTinyStories, Qwen-0.5B outperforms OPT-125M, so we continue with it to the final stages.

我们发现,在所有数据集上,尤其是在我们最佳的 Libri-Light、LibriSpeech 和 sTinyStories 混合数据上,Qwen-0.5B 的表现优于 OPT-125M,因此我们在最终阶段继续使用它。

4.3 Text Interleaving

4.3 文本交错

Several recent SLMs combine both speech and text modalities, either predicting both simultaneously (Défossez et al., 2024; Fang et al., 2024; Xie and Wu, 2024) or training on interleaved data (Nguyen et al., 2025; Zeng et al., 2024). Beyond enhancing cross-modal abilities, this has been shown to improve the semantic capabilities of SLMs, even in speech-only evaluations. Building on these studies, we investigate whether speech-text interleaving can enhance semantic ability in speech-only tasks, even under strict computational constraints.

最近的一些SLM(语音语言模型)结合了语音和文本两种模态,要么同时预测两者(Défossez et al., 2024; Fang et al., 2024; Xie and Wu, 2024),要么在交错数据上进行训练(Nguyen et al., 2025; Zeng et al., 2024)。除了增强跨模态能力,这种方法还被证明可以提高SLM的语义能力,即使在纯语音评估中也是如此。基于这些研究,我们探讨了在严格的计算限制下,语音-文本交错是否能够增强纯语音任务中的语义能力。

For this we use Whisper-large-v3-turbo to obtain aligned transcriptions of our data, except for sTinyStories, for which we get the alignment from the TTS. Following Zeng et al. (2024), we select speech spans whose lengths are drawn from a Poisson distribution with $\lambda=10$, totalling $30\%$ of the interleaved data. Following Nguyen et al. (2025), we train with batches balanced by token count between text data, speech data and interleaved data. We use a subset of RedPajama (Weber et al., 2024), filtered by the Gopher rules (Rae et al., 2021), as our text data.

为此,我们使用 Whisper-large-v3-turbo 获取数据的对齐转录,sTinyStories 除外,其对齐由 TTS 提供。我们遵循 Zeng 等人 (2024) 的方法,选取长度服从 $\lambda=10$ 的泊松分布的语音片段,共占交错数据的 30%。根据 Nguyen 等人 (2025) 的方法,我们在文本数据、语音数据和交错数据之间按 token 数量平衡批次进行训练。我们使用经 Gopher (Rae 等人, 2021) 规则过滤的 RedPajama (Weber 等人, 2024) 子集作为文本数据。
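The interleaving scheme above (Poisson-distributed span lengths with λ=10, covering 30% of the data) can be sketched as follows. The function name and the random span-placement strategy are illustrative assumptions; the exact procedure of Zeng et al. (2024) may differ.

```python
import numpy as np

def sample_speech_spans(n_words, speech_ratio=0.3, lam=10, rng=None):
    """Pick word spans to render as speech in an interleaved sequence.

    Span lengths are drawn from a Poisson(lam) distribution and spans are
    placed at random non-overlapping starts until roughly `speech_ratio`
    of the words are covered (the setting in the text: lam=10, 30% speech).
    """
    rng = rng or np.random.default_rng(0)
    target = int(n_words * speech_ratio)
    covered = np.zeros(n_words, dtype=bool)
    while covered.sum() < target:
        length = max(1, rng.poisson(lam))
        start = rng.integers(0, max(1, n_words - length))
        covered[start : start + length] = True
    # merge the boolean mask back into (start, end) word spans
    spans, start = [], None
    for i, c in enumerate(covered):
        if c and start is None:
            start = i
        elif not c and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, n_words))
    return spans
```

Words inside the returned spans would be tokenised as speech units, the rest as text tokens.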

We find that under our setup performance is notably worse than speech-only Slam in all metrics. We hypothesise that in the slamming setup the larger vocabulary (leading to more parameters) and fewer speech tokens led to under-trained models. We leave for future work finding the minimal compute budget needed to benefit from text-interleaving, or considering other efficient approaches.

我们发现,在我们的设置下,所有指标的表现都明显不如仅使用语音的Slam。我们假设,在Slam设置中,较大的词汇量(导致更多参数)和较少的语音Token导致了模型训练不足。我们留给未来的工作来找到从文本交错中受益的最小计算预算,或考虑其他高效的方法。

4.4 Synthetic Data Preference Optimisation

4.4 合成数据偏好优化

Preference optimisation methods have been shown to enhance the performance of text LLMs (Ouyang et al., 2022) and, more recently, SLMs (Lin et al., 2024). With preference optimisation, we aim to train our model to generate outputs that better align with a specified reward function or preference set.

偏好优化方法已被证明可以提升文本大语言模型的性能 (Ouyang et al., 2022),最近也被证明可以提升语音语言模型 (SLM) 的性能 (Lin et al., 2024)。通过偏好优化,我们的目标是训练模型生成更符合指定奖励函数或偏好集的输出。
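As a reference for the objective used in this section, here is a minimal sketch of the per-pair DPO loss, i.e. the stable softplus form of $-\log\sigma(\beta\cdot\text{margin})$. The function name and its scalar inputs are illustrative; β=0.1 matches the setting used later in the section.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (a sketch of the objective).

    Inputs are summed log-probabilities of the chosen / rejected
    continuations under the trained policy (pi_*) and the frozen
    reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # loss = -log(sigmoid(margin)) = softplus(-margin), written stably
    z = -margin
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)
```

The loss drops below log 2 exactly when the policy prefers the chosen continuation (relative to the reference) more than the rejected one.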

We evaluate how preference optimisation affects SLM performance while considering our constrained computational budget. Using an off-policy approach with pre-generated preference data, we apply DPO to enhance training efficiency. Specifically, we synthetically generate the SWAG (Zellers et al., 2018) text corpus for evaluating semantic knowledge. SWAG consists of text prefixes paired with multiple possible suffixes, where only one is semantically plausible. For preference data, we use the first sentence as the prompt, the correct suffix as the positive continuation, and a randomly chosen incorrect suffix as the rejected continuation. To ensure quality, we filter out samples with repetitive patterns, identified by an auto-BLEU score above 0.3. We generate all recordings using Kokoro TTS (Hexgrad, 2025), incorporating four speakers (two male and two female), evenly split between British and American accents. This process results in a total of $\mathrm{47k}$ SWAG preference pairs.

我们在有限计算预算下评估了偏好优化对语音语言模型 (SLM) 性能的影响。通过使用预生成的偏好数据采用离策略 (off-policy) 方法,我们应用 DPO (Direct Preference Optimization) 来提高训练效率。具体来说,我们合成了 SWAG (Zellers et al., 2018) 文本语料库用于评估语义知识。SWAG 由文本前缀与多个可能的后缀配对组成,其中只有一个在语义上是合理的。对于偏好数据,我们使用第一句话作为提示,正确的后缀作为积极延续,随机选择的不正确后缀作为被拒绝的延续。为了确保质量,我们过滤掉了具有重复模式的样本,这些样本通过 auto-BLEU 得分高于 0.3 来识别。我们使用 Kokoro TTS (Hexgrad, 2025) 生成了所有录音,包含四位说话者(两位男性和两位女性),英式和美式口音各占一半。这一过程总共生成了 $\mathrm{47k}$ 个 SWAG 偏好对。
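The preference-pair construction and repetition filter described above can be sketched as follows. Note the actual auto-BLEU metric scores a sentence's n-grams against itself BLEU-style; `auto_bleu` below is a simplified repetition proxy, and the function names and row format are illustrative assumptions.

```python
from collections import Counter

def auto_bleu(tokens, n=2):
    """Fraction of n-grams that are repeats; high values mean repetitive text.

    A simplified proxy for the auto-BLEU repetition metric.
    """
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c - 1 for c in counts.values()) / len(ngrams)

def build_preference_pairs(swag_rows, max_auto_bleu=0.3):
    """Turn SWAG-style rows into DPO (prompt, chosen, rejected) triples.

    Each row is assumed to look like
    {"prefix": [...], "correct": [...], "incorrect": [...]} with token lists.
    Repetitive samples (auto-BLEU above the threshold) are dropped,
    mirroring the 0.3 filter in the text.
    """
    pairs = []
    for row in swag_rows:
        if auto_bleu(row["prefix"] + row["correct"]) > max_auto_bleu:
            continue
        pairs.append(
            {"prompt": row["prefix"],
             "chosen": row["correct"],
             "rejected": row["incorrect"]}
        )
    return pairs
```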

For DPO we use $\beta=0.1$ (see Appendix A for full hyperparameters). In initial tests, we observe that after DPO training, the model shows increased likelihood at the cost of repeated patterns, a known issue with DPO (Lanchantin et al., 2025). To address this, we apply a repetition penalty with a factor of 1.1, following the approach of Keskar et al. (2019), and find that it helps mitigate the problem. Future work could explore alternative solutions, such as those proposed by Lanchantin et al. (2025).

对于 DPO,我们使用 $\beta=0.1$(完整超参数见附录 A)。在初步测试中,我们观察到 DPO 训练后,模型在增加可能性的同时出现了重复模式的问题,这是 DPO 的一个已知问题 (Lanchantin et al., 2025)。为了解决这个问题,我们采用了 Keskar et al. (2019) 的方法,应用了重复惩罚因子 1.1,发现它有助于缓解该问题。未来的工作可以探索其他解决方案,例如 Lanchantin et al. (2025) 提出的方案。
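The repetition penalty follows the CTRL-style rule of Keskar et al. (2019); a minimal sketch over raw logits (the function name and array-based interface are illustrative):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalise tokens that already appeared in the generated prefix.

    Following Keskar et al. (2019): positive logits of seen tokens are
    divided by the penalty, negative ones multiplied, making repeats
    less likely at sampling time.
    """
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```

With the factor 1.1 used in the text, the adjustment is mild: it discourages loops without forbidding legitimate repeats.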

We begin by examining how the allocation of budget for DPO impacts performance, particularly when it comes at the cost of a shorter pre-training phase. Figure 5 depicts the results. We observe significant improvements across all metrics when applying DPO for at least 30 minutes compared to not using DPO at all. However, allocating a higher proportion of the budget to DPO does not yield further gains and can even degrade model performance. Thus we stick to 30 minutes out of 24 hours for DPO, using the rest for pre-training.

我们首先考察了 DPO 预算分配对性能的影响,尤其是在以缩短预训练阶段为代价的情况下。图 5 展示了结果。我们观察到,相比完全不使用 DPO,至少应用 30 分钟 DPO 在所有指标上都有显著提升。然而,将更高比例的预算分配给 DPO 并未带来进一步的收益,甚至可能降低模型性能。因此,我们坚持在 24 小时中分配 30 分钟用于 DPO,其余时间用于预训练。

Table 3: Analysing the effect of scaling up compute for Slam. Num tokens refers to the total (not necessarily unique) tokens used for training, estimated from the provided information. We separately mark DPO tokens with a $+$.

| Model | GPUs | Params | Num tokens | sBLIMP↑ | sStoryCloze↑ | tStoryCloze↑ | GenPPL↓ | Auto-BLEU↓ |
|---|---|---|---|---|---|---|---|---|
| *Speech-only pre-training* | | | | | | | | |
| GSLM (Lakhotia et al., 2021) | 8*V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | – | – |
| SyllableLM (Baade et al., 2024) | 4*A40 | 300M | 16B | 63.7 | – | 75.4 | – | – |
| TWIST-350M (Hassid et al., 2024) | 8*V100 | 305M | 10.8B | 56.20 | – | – | 137.3 | 3.46 |
| TWIST-1.3B (Hassid et al., 2024) | 32*V100 | 1B | 10.8B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B (Hassid et al., 2024) | 32*V100 | 7B | 36B | 59.00 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B (Hassid et al., 2024) | 32*V100 | 13B | 36B | 59.20 | 55.4 | 76.4 | – | – |
| Scaled Optimal (Cuervo and Marxer, 2024) | ? | 823M | 82B | 61.3 | 56.7 | 78.0 | – | – |
| Moshi (Défossez et al., 2024) | ?*H100 | 7B | ? | 58.9 | 58.7 | 81.8 | – | – |
| SpiritLM (Nguyen et al., 2025) | 64*A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | – | – |
| *Joint speech-text pre-training / preference optimisation* | | | | | | | | |
| Zeng et al. (2024) | ? | 9B | ~1T | – | 62.4 | 82.9 | – | – |
| Moshi (Défossez et al., 2024) | ?*H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | – | – |
| SpiritLM (Nguyen et al., 2025) | 64*A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | – | – |
| AlignSLM-1.3B (Lin et al., 2024) | 64*A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | – | – |
| AlignSLM-7B (Lin et al., 2024) | 64*A100 | 7B | 36B + ~158B | 62.3 | 61.1 | 86.8 | – | – |
| Slam (-DPO) | 2*A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| Slam | 1*A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| Slam (scaled) | 2*A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |

表 3: 分析扩大 Slam 计算规模的效果。Num tokens 指用于训练的总 token 数(不一定唯一,根据已有信息估算)。DPO token 以 + 单独标注。

5 Final Recipe

5 最终配方

Building on these empirical findings, we develop the final Slam recipe. Using it, we train SLMs based on Qwen2.5-0.5B. We then compare Slam to the TWIST model family across various sizes: 350M, 1.3B, 7B, and 13B. We also present results for TWIST-350M trained under our computational constraints but following TWIST's original training recipe, along with our synthetic data. Finally, we report results for the top-performing model from Cuervo and Marxer (2024), including their predicted optimal performance under our compute budget based on SLM scaling laws. Results are reported in Table 2. They indicate that Slam delivers performance that is superior or on par with baseline models while requiring significantly fewer computational resources (e.g., a single A5000 for a day compared to 160 V100 days).

基于这些实证发现,我们开发了最终的 Slam 方案。使用该方案,我们基于 Qwen2.5-0.5B 训练了 SLM。然后,我们将 Slam 与不同规模的 TWIST 模型系列进行了比较:350M、1.3B、7B 和 13B。我们还展示了 TWIST-350M 的结果,该结果使用了我们的计算约束条件,但遵循了 TWIST 的原始训练方案,并使用了我们的合成数据。最后,我们报告了 (Cuervo and Marxer, 2024) 中表现最佳的模型的结果,包括他们基于 SLM 扩展定律在我们的计算预算下预测的最佳性能。结果如表 2 所示。结果表明,Slam 提供的性能要么优于基线模型,要么与之相当,同时所需的计算资源显著减少(例如,使用单张 A5000 训练一天,而使用 V100 需要 160 天)。

6 Increasing Compute

6 增加计算资源

Similarly to Geiping and Goldstein (2023), we analyse whether the proposed approach also holds with an increased compute budget. We opt for 48 hours on 2 A100 GPUs, a reasonable academic budget for larger-scale tests, representing roughly 10 times more compute than the Slamming setting. We use exactly the same Slam recipe for more steps and double the batch size. We provide the full results in Table 3. We note that performance continues to improve across all metrics, also outperforming methods with far larger compute scales. We further note that DPO training on synthetic data for 2 epochs notably boosts performance.

与 Geiping 和 Goldstein (2023) 类似,我们分析了所提出的方法在增加计算预算的情况下是否仍然有效。我们选择在 2 块 A100 GPU 上训练 48 小时,这是大规模测试的合理学术预算,计算量约为 Slamming 设置的 10 倍。我们使用完全相同的 Slam 配方训练更多步数,并将批量大小加倍。完整结果见表 3。我们注意到,所有指标的性能持续提升,甚至超过了计算规模大得多的方法。我们还注意到,在合成数据上进行 2 个 epoch 的 DPO 训练显著提升了性能。

7 Conclusion

7 结论

In this work we show that training high-quality SLMs with a very modest compute budget is feasible. We give these main guidelines: (i) Do not skimp on the model - not all model families are born equal, and the TWIST initialisation exaggerates this, so it is worth selecting a stronger/bigger text LM even if it means fewer tokens; we found Qwen2.5 to be a good choice. (ii) Utilise synthetic training data - pre-training on data generated with TTS helps a lot. (iii) Go beyond next-token prediction - we found that DPO boosts performance notably even when using synthetic data, and as little as 30 minutes of training massively improves results. (iv) Optimise hyper-parameters - as researchers we often disregard this stage, yet we found that tuning learning rate schedulers and optimising code efficiency can improve results notably. We hope that these insights and open-source resources will be of use to the research community in furthering research into the remaining open questions in SLMs.

在本工作中,我们展示了在非常有限的计算预算下训练高质量的语音语言模型 (SLM) 是可行的。我们给出以下主要指导原则:(i) 不要在模型上吝啬——并非所有模型家族生而平等,TWIST 初始化放大了这一点,因此即使这意味着更少的 token,也值得选择一个更强/更大的文本语言模型;我们发现 Qwen2.5 是一个不错的选择。(ii) 利用合成训练数据——在 TTS 生成的数据上进行预训练有很大帮助。(iii) 超越下一个 token 预测——我们发现即使使用合成数据,DPO 也能显著提升性能,仅需 30 分钟的训练就能大幅改善结果。(iv) 优化超参数——作为研究人员,我们经常忽视这一阶段,但我们发现调整学习率调度器和优化代码效率可以显著提升结果。我们希望这些见解和开源资源能够帮助研究界进一步探索 SLM 中尚未解决的开放问题。

Limitations

局限性

While the SLMs trained under the Slamming compute budget performed notably well compared to other SLMs trained with much more compute, they might perform less well in other areas. For instance, evaluating their abilities on acoustic or prosodic elements, as in SALMon (Maimon et al., 2024), could reveal further challenges of low-resource settings.

尽管在 Slamming 计算预算下训练的语音语言模型 (SLM) 相比使用多得多计算资源训练的其他 SLM 表现非常出色,但它们在其他方面可能表现欠佳。例如,像 SALMon (Maimon et al., 2024) 那样评估它们在声学或韵律要素上的能力,可能会进一步揭示低资源设置下的挑战。

Furthermore, we focus in this study on the well used HuBERT (Hsu et al., 2021) model as a tokeniser, and while we do not make any adjustments specifically for it, future work might wish to investigate our cramming approach with new tokenisers, such as Mimi (Défossez et al., 2024) and SylBoost (Baade et al., 2024).

此外,我们在本研究中重点关注了广泛使用的 HuBERT (Hsu et al., 2021) 模型作为分词器,虽然我们没有对其进行任何特定调整,但未来的工作可能会希望探索我们的 cramming 方法与新分词器(如 Mimi (Défossez et al., 2024) 和 SylBoost (Baade et al., 2024))的结合。

Ethical Statement

伦理声明

The broader impact of this study is, as in any generative model, the development of high-quality and natural speech synthesis. We hope that enabling SLM training under low-resource settings, and open-sourcing resources to aid this goal, will have a positive impact on the inclusivity and accessibility of SLM research beyond well-funded labs.

本研究的广泛影响在于,如同任何生成式模型一样,开发高质量且自然的语音合成系统。我们希望通过在低资源环境下训练 SLM,并开源资源以支持这一目标,能够对 SLM 研究在资金充足实验室之外的包容性和可访问性产生积极影响。

References

参考文献

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. 2025. Smollm2: When smol goes big – data-centric training of a small language model. arXiv preprint arXiv:2502.02737.

Loubna Ben Allal, Anton Lozhkov, Elie Bak-ouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav 等. 2025. Smollm2: 当小模型遇到大数据——以数据为中心的小型语言模型训练. arXiv 预印本 arXiv:2502.02737.

Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.

Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, 等. 2023. Pythia: 一套用于分析大语言模型训练与扩展过程的工具套件. 国际机器学习会议, 第 2397–2430 页. PMLR.

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2024. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666.

方庆凯, 郭守涛, 周岩, 马正瑞, 张少雷, 冯洋. 2024. Llama-omni: 大语言模型的无缝语音交互. arXiv 预印本 arXiv:2409.06666.

Daniel Galvez, Greg Diamos, Juan Manuel Ciro Torres, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Max Lam, Mark Mazumder, and Vijay Janapa Reddi. 2021. The people's speech: A large-scale diverse english speech recognition dataset for commercial usage. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).

Daniel Galvez, Greg Diamos, Juan Manuel Ciro Torres, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Max Lam, Mark Mazumder, 和 Vijay Janapa Reddi. 2021. The People's Speech: 一个用于商业用途的大规模多样化英语语音识别数据集. 在第三十五届神经信息处理系统会议数据集与基准赛道(第一轮).

Jonas Geiping and Tom Goldstein. 2023. Cramming: Training a language model on a single gpu in one day. In International Conference on Machine Learning, pages 11117–11143. PMLR.

Jonas Geiping 和 Tom Goldstein. 2023. Cramming: 在一天内用单个 GPU 训练一个语言模型. 在国际机器学习会议上, 页码 11117–11143. PMLR.

Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.

Albert Gu, Karan Goel, and Christopher Ré. 2021. 使用结构化状态空间高效建模长序列. arXiv 预印本 arXiv:2111.00396.

Habib Hajimolahoseini, Omar Mohamed Awad, Walid Ahmed, Austin Wen, Saina Asani, Mohammad Hassanpour, Farnoosh Javadi, Mehdi Ahmadi, Foozhan Ataiefard, Kangling Liu, et al. 2023. Swiftlearn: A data-efficient training method of deep learning models using importance sampling. arXiv preprint arXiv:2311.15134.

Habib Hajimolahoseini, Omar Mohamed Awad, Walid Ahmed, Austin Wen, Saina Asani, Mohammad Hassanpour, Farnoosh Javadi, Mehdi Ahmadi, Foozhan Ataiefard, Kangling Liu, 等. 2023. Swiftlearn: 一种使用重要性采样的深度学习模型高效训练方法. arXiv 预印本 arXiv:2311.15134.

Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. 2024. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36.

Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux 等. 2024. 基于文本预训练的语音语言模型. 神经信息处理系统进展, 36.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. 2018. Tedlium 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20, pages 198–208. Springer.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. 2018. Tedlium 3: 用于说话人适应实验的两倍数据及语料库重新划分。在《语音与计算机:第20届国际会议,SPECOM 2018,德国莱比锡,2018年9月18–22日,论文集20》中,页码198–208。Springer。

Hexgrad. 2025. Kokoro-82m (revision d8b4fc7).

Hexgrad. 2025. Kokoro-82m (修订版 d8b4fc7).

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, 等. 2022. 训练计算最优的大语言模型. arXiv 预印本 arXiv:2203.15556.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: 通过隐藏单元的掩码预测进行自监督语音表示学习。IEEE/ACM 音频、语音和语言处理汇刊, 29:3451–3460.

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train bert with an academic budget. arXiv preprint arXiv:2104.07705.

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. 如何在学术预算下训练 BERT。arXiv 预印本 arXiv:2104.07705。

Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.

Aslanides, Sarah Henderson, Roman Ring, Susannah Young, 等. 2021. 扩展语言模型:训练 Gopher 的方法、分析与见解. arXiv 预印本 arXiv:2112.11446.

A Full Slam Recipe

A 完整的 Slam 配方

We provide below the full training recipe, including hyperparameters, for the best Slam recipe. Table 4 shows the Slam (-DPO) pre-training recipe and Table 5 the Slam DPO training recipe.

我们在下方提供了完整的训练方案,包括最佳 Slam 方案的超参数。在表 4 中,我们展示了 Slam (-DPO) 预训练方案,在表 5 中,我们展示了 Slam DPO 训练方案。

Table 4: Slam (-DPO) Pre Training Recipe

表 4: Slam (-DPO) 预训练配方

| 参数 | 值 |
|---|---|
| Text Base Model | Qwen2.5-0.5B |
| TWIST initialisation | True |
| 数据 | Libri-Light + LibriSpeech + sTinyStories |
| 训练时间 | 23.5 小时(17625 步) |
| RoPE theta | 10000 |
| 上下文长度 | 1024 |
| 每设备批量大小 | 8 |
| 梯度累积 | 16 |
| 基础学习率 | 1e-3 |
| Warmup 比例 | 1% |
| 优化器 | AdamW |
| 学习率调度器 | cosine(最小 5e-5) |
| 最大梯度范数 | 0.5 |
| Dtype | bfloat16 |
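A sketch of the learning-rate schedule implied by Table 4 (1% linear warmup, cosine decay from 1e-3 to a 5e-5 floor); the function name is illustrative and the exact implementation in the released code may differ.

```python
import math

def cosine_with_min_lr(step, total_steps, base_lr=1e-3, min_lr=5e-5, warmup_frac=0.01):
    """Cosine decay from base_lr to min_lr with a short linear warmup,
    matching the Table 4 settings (1% warmup, cosine with a 5e-5 floor)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # linear warmup from ~0 up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```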

Table 5: Slam DPO Training Recipe

表 5: Slam DPO 训练方案

| 参数 | 值 |
|---|---|
| 初始模型 | Slam (-DPO) |
| 数据 | SpokenSwag(auto-BLEU 小于 0.3) |
| 训练时间 | 0.5 小时(813 步) |
| RoPE theta | 10000 |
| 上下文长度 | 1024 |
| 每设备批量大小 | 4 |
| 梯度累积 | 16 |
| 基础学习率 | 5e-5 |
| 优化器 | AdamW |
| 学习率调度器 | 逆平方根 |
| 最大梯度范数 | 0.5 |
| Dtype | bfloat16 |
| DPO β | 0.1 |

B Model Sizes

B 模型规模

Table 6: Model names and parameter counts after changing vocabulary to speech only units (500).

表 6: 将词汇表更改为仅语音单元(500)后的模型名称和参数数量

| 模型名称 | 参数数量 |
|---|---|
| MobileLLM-125M (Liu et al., 2024b) | 106,492,608 |
| MobileLLM-350M (Liu et al., 2024b) | 315,117,120 |
| OPT-125M (Zhang et al., 2022) | 87,015,936 |
| OPT-350M (Zhang et al., 2022) | 303,339,520 |
| QWEN2.5-0.5B (Yang et al., 2024a) | 358,347,904 |
| SmolLM2-135M (Allal et al., 2025) | 106,492,608 |
| SmolLM2-360M (Allal et al., 2025) | 315,117,120 |
| Pythia-160M (Biderman et al., 2023) | 85,827,072 |
| Pythia-410M (Biderman et al., 2023) | 305,714,176 |

As mentioned, we use the original names of the text LMs for clarity and consistency, but note that the actual parameter counts after resizing the vocabulary to speech-only units can be very different. In Table 6 we provide an extensive list of models and sizes.

如前所述,为了清晰和一致,我们使用文本语言模型的原始名称,但请注意,将词汇表调整为仅语音单元后,实际参数数量可能相差很大。在表 6 中,我们提供了详尽的模型及其规模列表。

C Dataset Statistics

C 数据集统计

We use and synthesise several datasets. In this section we give exact details of the number of samples, splits used, domains, etc.

我们使用并合成了多个数据集。本节将给出样本数量、所用数据划分、领域等的具体细节。

For pre-training we use Libri-Light (Kahn et al., 2020) and LibriSpeech (Panayotov et al., 2015). For Libri-Light we randomly select one percent of samples for validation, whereas for LibriSpeech we use the original dev-clean and dev-other splits. Both datasets contain English speech only, focused on the audio-book domain. We also synthesise sTinyStories for pre-training, which consists of synthetically generated English short stories; we use the official train split for training. Full dataset sizes are given in Table 7.

在预训练阶段,我们使用了 Libri-Light (Kahn et al., 2020) 和 LibriSpeech (Panayotov et al., 2015)。对于 Libri-Light,我们随机选择 1% 的样本作为验证集,而对于 LibriSpeech,我们使用原始的 dev-clean 和 dev-other 划分。这两个数据集都只包含英语语音,主要集中在有声书领域。我们还合成了 sTinyStories 用于预训练,该数据集由合成生成的英语短篇故事组成;我们使用官方训练集进行训练。完整的数据集大小如表 7 所示。

We also investigate diverse datasets for pre-training: SWC (Baumann et al., 2018), Tedlium (Hernandez et al., 2018), PeopleSpeech (Galvez et al., 2021) and VoxPopuli (Wang et al., 2021b). We take only the English subsets of all datasets, yet they can still contain diverse accents. These datasets cover the following domains: SWC - read Wikipedia articles; Tedlium - short lectures; PeopleSpeech - diverse data including many local council gatherings; VoxPopuli - European Parliament meetings. For SWC specifically, we use the text alignment to create chunks, remove silence from the audio, and discard mis-aligned chunks. We use the full training splits where provided, otherwise splitting 99% for training. The dataset sizes are described in Table 7.

我们还研究了多种用于预训练的数据集:SWC (Baumann et al., 2018)、Tedlium (Hernandez et al., 2018)、PeopleSpeech (Galvez et al., 2021) 和 VoxPopuli (Wang et al., 2021b)。我们只使用这些数据集的英语子集,但它们仍然可能包含多种口音。这些数据集的领域如下:SWC - 朗读的维基百科文章;Tedlium - 简短讲座;PeopleSpeech - 多样数据,包括许多地方议会会议等;VoxPopuli - 来自欧洲议会的会议。对于 SWC,我们使用文本对齐来创建数据块,从音频中去除静音,并移除未对齐的数据块。在有提供的情况下,我们使用完整的训练划分,否则将 99% 用于训练。数据集的大小在表 7 中描述。

Table 7: Dataset train set sizes that we use.

表 7: 我们使用的数据集训练集大小。

| 数据集 | 小时数 | Token 数 |
|---|---|---|
| Libri-Light (Kahn et al., 2020) | 50K | 3.5B |
| LibriSpeech (Panayotov et al., 2015) | 960 | 67M |
| SWC (Baumann et al., 2018) | 750 | 19M |
| Tedlium (Hernandez et al., 2018) | 1.6K | 110M |
| PeopleSpeech (Galvez et al., 2021) | 7K | 480M |
| VoxPopuli (Wang et al., 2021b) | 24K | 1.64B |
| sTinyStories | 30K | 2.2B |

For DPO we synthesise SpokenSwag based on the SWAG (Zellers et al., 2018) dataset. We use only the official train set and keep only the gold-standard labels. We end up with $47\mathrm{k}$ sample pairs, amounting to ${\sim}4.5\mathrm{M}$ tokens.

对于 DPO,我们基于 SWAG (Zellers et al., 2018) 数据集合成了 SpokenSwag。我们仅使用官方训练集,并仅保留黄金标准标签。最终我们得到 47k 个样本对,约合 4.5M 个 token。

D AI Tool Usage

D AI工具使用

AI-based tools may have been used in writing parts of the code for this study or paraphrasing some of the writing within the paper, yet all content was thoroughly checked by the authors, with these tools used only as assistive aids.

本研究的代码编写或论文中的部分内容可能使用了基于 AI 的工具进行辅助生成或重述,但所有内容均由作者仔细检查,这些工具仅作为辅助手段。
