Slamming: Training a Speech Language Model on One GPU in a Day
Abstract
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaking of all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform the predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models, and samples at https://pages.cs.huji.ac.il/adiyosslab/slamming.
Figure 1: Comparing Topic-StoryCloze performance of different SLMs as a function of training compute. Model size is indicated by the size of the circle.
1 Introduction
Speech Language Models (SLMs) have gained significant interest from researchers (Peng et al., 2024a; Cui et al., 2024; Ji et al., 2024; Latif et al., 2023), demonstrating remarkable performance in traditional speech tasks (Wang et al., 2023; Elmakies et al., 2025), diverse generative applications (Yang et al., 2023, 2024b), and reasoning over speech and audio signals (Tang et al., 2024; Chu et al., 2023).
SLMs can generally be classified into two main categories: (i) generative speech Language Models (LMs) (which can also incorporate text) and (ii) speech-aware LMs. The first category follows a similar pre-training approach to text-based Large Language Models (LLMs), directly maximising the likelihood of speech considering both input and output, typically by representing audio as a sequence of discrete tokens. The second category consists of pre-trained text LMs adapted to process speech inputs. In this work, we focus on the first.
Training high-quality SLMs can be highly resource intensive (Hassid et al., 2024; Cuervo and Marxer, 2024; Zeng et al., 2024; Nguyen et al., 2025; Défossez et al., 2024). For example, Nguyen et al. (2025) trained their SLM on approximately 570k hours of speech data, while Défossez et al. (2024) utilised around 7M hours. Additionally, Cuervo and Marxer (2024) proposed SLM scaling laws, suggesting that training high-quality SLMs requires roughly 3x more data compared to text-based counterparts. These computational demands restrict the fundamental research required to enhance SLMs, such as advancements in speech tokenisation, efficient acoustic modelling, etc.
In the Natural Language Processing (NLP) community, numerous studies have investigated efficient model training techniques, including masked language models such as Cramming (Geiping and Goldstein, 2023) and ModernBERT (Warner et al., 2024), along with next-token prediction LLMs such as MobileLLM (Liu et al., 2024b). These methods include implementation efficiencies, architectural improvements, data selection strategies, and enhancements to the overall training pipeline.
Inspired by Cramming (Geiping and Goldstein, 2023) in text, we investigate compute-limited SLM training, which we term Slamming. We pose the question: Is it possible to train high-quality SLMs using a single GPU within 24 hours? For that, we conduct an extensive empirical analysis exploring how different training components influence performance. From this, we derive a training recipe that maximises model performance within a fixed compute budget. Specifically, we investigate the impact of model initialisation and architecture, various optimisers and learning rate schedulers, and data selection strategies - including the role of synthetic data, text interleaving, and preference optimisation.
We believe that developing these training strategies and proving their feasibility will empower the speech and audio research community to advance SLMs beyond the scope of large, well-funded academic and industrial labs. Figure 1 illustrates the performance of various SLMs relative to their training compute budget, with circle sizes representing model size. Furthermore, we compare our results with the scaling performance predicted by Cuervo and Marxer (2024). Although the authors present a somewhat pessimistic view of the computational resources needed to train high-quality SLMs, we empirically show that reality is more promising, demonstrating that it is possible to significantly exceed the predicted performance per unit of compute. We encourage the community to refine and expand scaling laws specifically tailored for SLM training across various settings.
Our main contributions are:
We open-source all code, models, training recipes, and synthetic datasets.
2 Related Work
Efficient training. Enhancing the efficiency of neural network training has been extensively studied (Shen et al., 2023). Hajimolahoseini et al. (2023) and Wang et al. (2024) examined the impact of data selection on Large Language Model (LLM) training and introduced efficient data selection methods. Muhamed et al. (2024) proposed using structured sparse gradients to enhance compute efficiency in LLM training, while Rawat et al. (2024) explored the potential of leveraging smaller language models to improve the training efficiency of larger LLMs. Lv et al. (2024) investigated the use of low-dimensional projections for attention parameters to enhance training efficiency. Meanwhile, Neiterman and Ben-Artzi (2024) proposed applying LayerDrop as a technique to optimise neural network training.
More closely related to our work, Li et al. (2023) propose a training strategy for developing LLMs within a $100K budget. Warner et al. (2024) introduce ModernBERT, an efficient training pipeline for optimising BERT models, while Izsak et al. (2021) outline a method for training a BERT model in 24 hours using 8 GPUs. The most relevant work to ours is Cramming (Geiping and Goldstein, 2023), where the authors conduct an in-depth analysis of masked LM training on a single GPU in one day.
While these studies offer valuable insights, they primarily focus on training text models, such as LLMs and masked LMs. In the speech domain, similar research has been conducted on self-supervised representation models (Liu et al., 2024a), but not on SLMs. In this work, we address this gap by focusing on efficient SLM training.
Generative speech language models were explored under various setups (Lakhotia et al., 2021; Kharitonov et al., 2021). Lakhotia et al. (2021) were the first to show how raw and uncurated speech data can be leveraged into building a Generative Spoken Language Modeling (GSLM) system. Next, Borsos et al. (2023) proposed a cascade version using both coarse and fine speech tokens. Such a modelling framework opened up a new and promising research direction for processing and modelling spoken data, such as speech resynthesis (Polyak et al., 2021), speaking style conversion (Kreuk et al., 2021; Maimon and Adi, 2023), dialogue modelling (Nguyen et al., 2022), speech-to-speech translation (Popuri et al., 2022; Peng et al., 2024b), etc. Nachmani et al. (2024) proposed augmenting a text Language Model (LM) with continuous speech data to improve spoken question-answering tasks. Recently, Park et al. (2024) proposed an SLM based on state-space models (Gu et al., 2021) to further push long-context efficient modelling, while Lin et al. (2024) proposed to fine-tune SLMs using direct preference optimisation (Rafailov et al., 2024) obtained from text LLM rankings.
Similar to text LLMs, training SLMs often demands large-scale datasets. For instance, Moshi (Défossez et al., 2024) was trained on 7 million hours of speech data, SpiritLM (Nguyen et al., 2025) utilised 560k hours, and TWIST (Hassid et al., 2024) was trained on approximately 150k hours. Recently, Cuervo and Marxer (2024) introduced the first scaling laws for SLMs, suggesting that achieving performance comparable to text LMs requires three times more tokens. In this work, we focus on reducing the computational demands while maintaining performance comparable to leading SLMs.
3 Setup
In this study, we explore decoder-only generative SLMs, which aim at maximising the likelihood of speech samples represented as discrete tokens. We examine both purely speech-based SLMs trained on speech tokens and joint speech-text SLMs using interleaving strategies (Nguyen et al., 2025). Similarly to Hassid et al. (2024); Lakhotia et al. (2021), we obtain speech tokens by quantising continuous latent representations of a self-supervised speech representation model using the k-means algorithm, often known as semantic tokens. Specifically, we utilise a multilingual HuBERT (Hsu et al., 2021) model running at 25 Hz, as employed in Hassid et al. (2024). We then train SLMs by minimising the negative log-likelihood of the input segments.
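To make the tokenisation and training objective above concrete, the following is a minimal sketch, assuming HuBERT features are already extracted offline (random tensors stand in for them here) and using scikit-learn's k-means with a small illustrative codebook as the unit quantiser; it is an illustration, not the exact pipeline used in this work.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

torch.manual_seed(0)

# Stand-in for continuous HuBERT representations: (frames, feature_dim).
# At a 25 Hz frame rate, 40 seconds of audio gives ~1000 frames.
features = torch.randn(1000, 768)

# Quantise the continuous features into discrete "semantic tokens" with k-means
# (a small illustrative codebook; the real recipe uses its own cluster count).
n_units = 100
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features.numpy())
units = torch.tensor(kmeans.predict(features.numpy()), dtype=torch.long)

# Decoder-only SLM objective: next-token prediction over the unit sequence.
# The logits would come from the transformer; they are random here for brevity.
logits = torch.randn(1, units.numel(), n_units)      # (batch, seq, vocab)
targets = units.unsqueeze(0)                         # (batch, seq)
nll = F.cross_entropy(                               # negative log-likelihood
    logits[:, :-1].reshape(-1, n_units), targets[:, 1:].reshape(-1)
)
print(f"NLL of the segment: {nll.item():.3f}")
```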
Unless mentioned otherwise, all SLMs are trained using a single A5000 GPU (24GB VRAM) along with 16 CPU cores for 24 hours. We deliberately focus on this constrained compute budget, assuming that most academic labs can access similar resources, thereby ensuring the accessibility of our research. The training data is pre-processed, i.e. extracting HuBERT units and dividing data into chunks, and stored prior to model training. As a result, this pre-processing time is excluded from the compute budget. This approach, aligned with Geiping and Goldstein (2023), is practical since many research experiments utilise the same pre-processed data. We additionally do not count the time for running validation and visualisations, as they are not part of the optimisation pipeline and are used only for demonstration purposes.
Evaluation metrics. We assess all SLMs using four distinct evaluation metrics. The first three are based on likelihood evaluation, while the fourth is a generative metric. For likelihood-based evaluation we consider sBLIMP (Dunbar et al., 2021), Spoken StoryCloze (SSC), and Topic StoryCloze (TSC) (Hassid et al., 2024). For these likelihood metrics, we evaluate the likelihood assigned by the SLMs to pairs of speech utterances, consisting of a positive example and a distractor, and calculate the percentage of pairs in which the SLM assigns higher likelihood to the positive sample. sBLIMP focuses on grammatical abilities, so the distractor is an ungrammatical version of the positive example. SSC and TSC focus on semantic modelling abilities. In SSC, the distractor suffix is taken from the original textual StoryCloze dataset (Mostafazadeh et al., 2016), allowing us to assess fine-grained semantic speech understanding. In TSC, however, the distractor suffix is drawn from a different topic, enabling us to evaluate the model's ability to understand the overall semantic concept.
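A minimal sketch of how such pairwise likelihood metrics can be computed, assuming access to any causal model over unit sequences; the tiny model and random token IDs below are placeholders, not the evaluation code released with this work.

```python
import torch
import torch.nn.functional as F

VOCAB = 500

class TinyCausalLM(torch.nn.Module):
    """Placeholder for the trained SLM (any model returning next-token logits)."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 64)
        self.head = torch.nn.Linear(64, VOCAB)
    def forward(self, tokens):               # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.emb(tokens))

def sequence_loglik(model, tokens):
    """Sum of log p(token_t | tokens_<t) over the sequence."""
    logits = model(tokens[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum().item()

torch.manual_seed(0)
model = TinyCausalLM().eval()
# (positive, distractor) unit sequences; random stand-ins for real utterances.
pairs = [(torch.randint(0, VOCAB, (1, 120)), torch.randint(0, VOCAB, (1, 120)))
         for _ in range(50)]

with torch.no_grad():
    correct = sum(sequence_loglik(model, pos) > sequence_loglik(model, neg)
                  for pos, neg in pairs)
print(f"pairwise accuracy: {100 * correct / len(pairs):.1f}%")
```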
Finally, to assess the generative abilities of SLMs, we compute generative perplexity (GenPPL). Following the approach of Lakhotia et al. (2021) and Hassid et al. (2024), we provide the SLM with a short speech prompt and generate a continuation of speech tokens. We use a unit vocoder with duration prediction to convert the tokens into speech (Polyak et al., 2021; Hassid et al., 2024). The generated speech is then transcribed, and its Perplexity (PPL) is evaluated using a pre-trained text LLM. To minimise the impact of token repetition on PPL measurements, we ground the generated text using diversity metrics derived from the auto-BLEU score (Lakhotia et al., 2021). Similarly to Lin et al. (2024), we use bigram auto-BLEU. In other words, we ensure that all models achieve similar auto-BLEU scores, allowing for a fair comparison of PPL. Specifically, we transcribe speech segments using the Whisper-large-v3-turbo model (Radford et al., 2023) and measure PPL using the Llama-3.2-1B model (LLama, 2024).
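As an illustration of the diversity grounding, here is one common formulation of bigram auto-BLEU: the fraction of bigram occurrences in an utterance that are repeated within that same utterance. The exact normalisation used in the paper's evaluation code may differ.

```python
from collections import Counter

def bigram_auto_bleu(tokens):
    """Fraction of bigram occurrences that are repeats within the utterance."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeated = sum(1 for bg in bigrams if counts[bg] > 1)
    return repeated / len(bigrams)

# Diverse output scores low; degenerate, repetitive output scores high.
print(bigram_auto_bleu("the cat sat on the mat".split()))          # 0.0
print(bigram_auto_bleu("the cat the cat the cat the cat".split())) # 1.0
```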
Software efficiency. To maximise performance within 24 hours of model training, we leverage multiple efficient implementations. Through extensive performance testing, we found that using bfloat16 (Kalamkar et al., 2019) alongside Flash Attention 2 (Dao, 2023) and data packing provided the most efficient compute performance in our setup. We also experimented with model compilation using torch.compile (Ansel et al., 2024), but it lacked native compatibility with Flash Attention 2 at the time of our study, and its performance without Flash Attention 2 was subpar. Future work could investigate this further with more efficient attention implementations (Shah et al., 2024; Li et al., 2024).
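A hedged sketch of these efficiency settings using Hugging Face Transformers: bfloat16 weights plus FlashAttention-2 kernels (data packing would live in the data collator and is not shown). Whether `flash_attention_2` is available depends on the installed flash-attn package and the GPU, and the checkpoint name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",                 # example text LM used for initialisation
    torch_dtype=torch.bfloat16,          # halves memory traffic vs. float32
    attn_implementation="flash_attention_2",
)
```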
To enable rapid and scalable experimentation, we developed a specialised library for SLM training that supports various model architectures, training objectives, and evaluation metrics. This library accommodates TWIST-style training, text-speech interleaving, preference optimisation, and more. We will open-source this package along with all model weights and training recipes, aiming to empower the community to further explore SLMs.
4 Investigations
With this setup, we systematically analyse and ablate each component of the training pipeline, ultimately refining an optimised cookbook for training SLMs. We specifically examine the influence of model family, initialisation, size, and architectural choices (e.g., dropout, positional embedding, etc.). We analyse optimisation parameters and data characteristics. Lastly, we explore alternative training objectives beyond standard next-token prediction, including speech-text interleaving and direct preference optimisation using synthetic data.
4.1 Model & Optimisation
Hyper-parameters. Unless specified otherwise, we use a context length of 512 tokens and an effective batch size of 256, employing gradient accumulation when necessary, as preliminary results indicated this configuration yields the best overall performance. We set the peak learning rate to 1e-3 to enhance training speed and use a warmup period of 1% of the total training steps, as this proved more effective than the fixed 100-step warmup used in the original TWIST. To improve training stability, particularly with large learning rates, we apply gradient normalisation with a norm of 0.5 at no additional cost, following Geiping and Goldstein (2023). Unless modified later in our investigation, we use an inverse-square-root scheduler and the AdamW optimiser (Loshchilov, 2017).
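A minimal sketch of this default optimisation setup in PyTorch, assuming a toy model and step count: AdamW at a peak learning rate of 1e-3, a 1% linear warmup into inverse-square-root decay, and capping the global gradient norm at 0.5 (our reading of the gradient normalisation above); the real recipe reaches an effective batch size of 256 via gradient accumulation.

```python
import torch

model = torch.nn.Linear(512, 512)                      # stand-in for the SLM
total_steps = 10_000
warmup_steps = max(1, int(0.01 * total_steps))         # 1% warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def inv_sqrt_with_warmup(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps               # linear warmup to peak LR
    return (warmup_steps / (step + 1)) ** 0.5          # inverse-square-root decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inv_sqrt_with_warmup)

for step in range(3):                                  # illustrative steps only
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```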
Figure 2: Comparing PPL of different models of similar parameter count, with and without TWIST initialisation.
Initialisation. Hassid et al. (2024) empirically demonstrated that initialising SLMs with pre-trained text LMs can enhance convergence speed and improve model performance. We examine the effect of this initialisation within our setup across different model types. To do so, we train multiple models, both with and without TWIST initialisation, while staying within our compute budget. As shown in Figure 2, TWIST initialisation benefits all evaluated models at the beginning of training, though its overall impact by the end varies. Notice that the x-axis in Figure 2 represents theoretical FLOPs, calculated as $6 \cdot N_{params} \cdot D_{tokens}$ following Hoffmann et al. (2022). However, due to variations in model architecture and implementation, practical efficiency differs, leading to varying amounts of compute processed within 24 hours.
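As a concrete example of this estimate (with an illustrative token count, not the exact number processed in our runs), a 358M-parameter model trained on 1B tokens corresponds to $6 \cdot N_{params} \cdot D_{tokens} = 6 \times 3.58 \times 10^{8} \times 10^{9} \approx 2.1 \times 10^{18}$ FLOPs.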
Results suggest that the benefits of TWIST initialisation can be substantial, especially for top-performing models like Qwen2.5. As a result, we prioritise investigations based on existing pre-trained text LMs. Interestingly, the results in Figure 2 demonstrate that Qwen2.5 outperforms other models even without TWIST initialisation, perhaps suggesting that their architectural design choices or size might also provide some benefit.
Optimal model size & family. Cuervo and Marxer (2024) conducted a scaling analysis on GSLM-style SLMs, estimating the optimal model size and token count for a compute-efficient model. However, using a text LM initialisation might impact these findings. As we observe, TWIST initialisation greatly impacts model performance, suggesting that prioritising larger models may be more effective than simply increasing the dataset size. Additionally, various model families gain different advantages from TWIST initialisation; for example, Qwen2.5 models show significantly better performance compared to OPT models. In Figure 3, we compare the results under the pre-defined compute budget within model families. We note that the best model sizes for MobileLLM (Liu et al., 2024b), SmolLM2 (Allal et al., 2025), and Pythia (Biderman et al., 2023) are ~300M parameters, while for OPT the best is 125M. According to Cuervo and Marxer (2024), the estimated optimal model size is approximately 66M parameters. However, the best-performing model, Qwen2.5, is significantly larger. Since there are no smaller models in this family, it is difficult to determine whether this deviation is due to the quality of the initialisation or other factors. Moving forward, we proceed with both OPT-125M and Qwen2.5-0.5B.
Figure 3: Comparing PPL of different models under TWIST initialisation.
Figure 4: Comparing validation PPL of our best model with different optimisers and schedulers.
Dropout. The original OPT models include dropout to mitigate overfitting. Although dropout is beneficial for regularisation, it effectively decreases the number of gradient updates per parameter without shortening the wall time of an update step, and hence reduces the number of parameter updates per second. Following Geiping and Goldstein (2023), we experiment with removing dropout and observe improved performance in our setup.
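For illustration, a sketch of how this ablation can be expressed with the Hugging Face OPT configuration, whose `dropout` and `attention_dropout` fields hold the probabilities being zeroed here; the checkpoint name is the public OPT-125M release, and this mirrors the change described above rather than reproducing our exact code.

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-125m")
config.dropout = 0.0              # disable residual/FFN dropout
config.attention_dropout = 0.0    # disable attention dropout
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", config=config)
```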
Positional Encoding. Transformers rely on positional encoding to capture the order of input tokens. Many modern LMs, including the Qwen models, use Rotary Position Embedding (Su et al., 2024). This method uses a hyperparameter, $\theta$, to control the trade-off between granularity and the ability to handle long contexts. $\theta$ is often tuned to accommodate longer context lengths (Yang et al., 2024a; Roziere et al., 2023). Since our context length is significantly shorter than that of the original LLM, we explore reducing $\theta$ for potential performance gains. Our findings show that setting $\theta = 10{,}000$ with a context length of 1024 enhances performance, so we adopt this configuration moving forward. We note that since we increase the context length, we also need to reduce the batch size to avoid running into memory problems during training. We halve the batch size and keep the same number of gradient accumulation steps, which gives an effective batch size of 128.
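A sketch of this positional-encoding tweak via the Hugging Face Qwen2 configuration: `rope_theta` and `max_position_embeddings` are the relevant fields, and the values mirror the text above ($\theta = 10{,}000$, context 1024) rather than the Qwen2.5 defaults. Treat it as an assumption-laden illustration, not our exact training code.

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")
config.rope_theta = 10_000.0             # lower RoPE base for the shorter context
config.max_position_embeddings = 1024    # matches the shortened context length
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", config=config)
```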
Optimiser and Scheduler. Various optimisers and schedulers have been developed to enhance training efficiency, reduce memory usage (Shazeer and Stern, 2018; Dettmers et al., 2022), or accelerate convergence (Pagliardini et al., 2024; Chen et al., 2023). With limited compute, faster convergence and better memory efficiency can be especially important. We first consider efficient optimisers, specifically AdamW with fused kernels and 8-bit AdamW, but observe no notable improvements in batch size or runtime compared to standard AdamW. This is likely due to the relatively small model size, which results in a minimal optimiser memory footprint. We then compare AdamW with two state-of-the-art optimisers: AdaLomo (Lv et al., 2023) and AdEMAMix (Pagliardini et al., 2024). Results, presented in Figure 4, suggest that with the original inverse-square-root scheduler used by Hassid et al. (2024), AdEMAMix improves validation loss compared to AdamW, while AdaLomo lags far behind.

Table 1: Analysing the impact of training data diversity and synthetic data on SLM performance. The default Slam recipe does not use diverse data (only Libri-light and LibriSpeech), but uses the synthetic sTinyStories data.

| Model | Div. | Syn. | sBLIMP↑ | sSC↑ | tSC↑ | GenPPL↓ |
|---|---|---|---|---|---|---|
| OPT-125M | ✗ | ✓ | 55.28 | 55.46 | 75.18 | 96.8 |
| | ✓ | ✓ | 55.06 | 55.00 | 74.83 | 116.6 |
| | ✗ | ✗ | 55.88 | 54.52 | 70.82 | 160.3 |
| | ✓ | ✗ | 55.65 | 54.78 | 70.18 | 172.7 |
| Qwen-0.5B | ✗ | ✓ | 56.45 | 55.59 | 78.01 | 88.3 |
| | ✓ | ✓ | 56.17 | 55.37 | 77.13 | 101.3 |
| | ✗ | ✗ | 56.60 | 53.50 | 71.14 | 145.4 |
| | ✓ | ✗ | 56.10 | 53.72 | 70.66 | 161.8 |
Next, we analyse a cosine decay learning rate scheduler in place of the original inverse-square-root scheduler, as this has been shown to improve convergence (Loshchilov and Hutter, 2016). We consider the previous optimisers and provide the validation loss throughout training in Figure 4. We see that this notably improved the loss for AdamW and slightly harmed results for AdEMAMix. Overall, AdamW with a cosine schedule provides the best setup, far outperforming the original configuration.
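A minimal sketch of the resulting winning combination, AdamW with a 1% warmup into a cosine decay, using the Hugging Face scheduler helper; the model and step count below are placeholders.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(512, 512)                       # stand-in for the SLM
total_steps = 10_000
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * total_steps),
    num_training_steps=total_steps,
)
```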
4.2 Data
Next, we examine how the training data-mix influences performance in a compute-constrained setting. Specifically, we explore whether diversity in accents, speaking styles, etc. is beneficial and assess whether synthetic data can enhance semantic modelling abilities.
Diverse Data. We begin by examining how dataset diversity impacts model performance. Many leading speech datasets, such as those based on audiobooks (Panayotov et al., 2015; Kahn et al., 2020), consist of relatively clean, single-speaker recordings within a specific content domain. To introduce greater diversity in speaking styles and content, we curate additional datasets, including VoxPopuli (Wang et al., 2021b), Tedlium (Hernandez et al., 2018), People's Speech (Galvez et al., 2021), and SWC (Baumann et al., 2018). For all mentioned datasets, we use the official data cleaning and preprocessing scripts when available. Specifically, for Libri-light, we apply the official Voice Activity Detection model to remove silences and generate smaller audio segments. To evaluate the impact of dataset diversity, we compare the performance of SLMs trained using our best training recipes on a subset of LibriSpeech and Libri-light against all curated datasets. This comparison is conducted for both OPT-125M, which processes a large number of tokens during training, and Qwen-0.5B, which encounters significantly less data due to model size. Results are summarised in Table 1. We observe that dataset diversity has an overall negative effect on model performance. We hypothesise this is because the models struggle to model rich and complex audio under such low compute resources.

Figure 5: Analysing the optimal part of the 24-hour compute budget that should be used for DPO, with the rest used for pre-training.
Synthetic Data. Recent studies have highlighted the potential of synthetic data generated through Text-to-Speech (TTS) (Cuervo and Marxer, 2024) or direct text-to-unit conversion (Zeng et al., 2024). Hence, we examine the impact of including synthetically generated speech within our constrained compute setup. To do so, we synthesised the TinyStories dataset (Eldan and Li, 2023) using a single-speaker TTS model (Wang et al., 2021a), as it is computationally efficient. Additionally, prior research has shown that HuBERT units largely remove speaker information (Maimon and Adi, 2023). TinyStories has been demonstrated to enhance text LM performance and improve SLMs (Cuervo and Marxer, 2024). Results, presented in Table 1, indicate that incorporating such synthetic data into the training data-mix significantly boosts both modelling and generative performance metrics across all evaluated setups. We also consider adding the synthetic data to the original TWIST recipe, and the results at the bottom of Table 2 suggest that while this helps with semantic metrics, it is far from enough without the other optimisations we introduced.
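A heavily hedged sketch of this synthetic-data path: text stories are converted to speech with a single-speaker TTS and then to units. `synthesise` and `speech_to_units` are hypothetical placeholders (returning dummy tensors here) for the TTS model and tokeniser actually used; only the TinyStories dataset identifier is a real, commonly used Hugging Face name.

```python
import torch
from datasets import load_dataset

stories = load_dataset("roneneldan/TinyStories", split="train[:100]")

def synthesise(text):
    # Hypothetical stand-in for the single-speaker TTS model:
    # returns a dummy 16 kHz waveform roughly proportional to the text length.
    return torch.randn(16_000 * max(1, len(text) // 20))

def speech_to_units(wav):
    # Hypothetical stand-in for HuBERT feature extraction + k-means quantisation
    # (25 units per second at the 25 Hz frame rate; arbitrary placeholder vocab).
    return torch.randint(0, 500, (wav.numel() // 640,))

synthetic_corpus = [speech_to_units(synthesise(story["text"])) for story in stories]
print(f"synthesised {len(synthetic_corpus)} unit sequences")
```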
Table 2: Comparing slamming to leading SLMs, and predicted optimal performance for the compute. We also consider TWIST-350M using our code and compute budget, but with the original training recipe. ± indicates distance to min/max of 3 seeds.

| Model | Compute (GPU days) | Params | sBLIMP↑ | sStoryCloze↑ | tStoryCloze↑ | GenPPL↓ | Auto-BLEU↓ |
|---|---|---|---|---|---|---|---|
| TWIST-350M (Hassid et al., 2024) | 40*V100 | 305M | 56.20 | | | 137.3 | 3.46 |
| TWIST-1.3B (Hassid et al., 2024) | 160*V100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B (Hassid et al., 2024) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
| TWIST-13B (Hassid et al., 2024) | ? | 13B | 59.20 | 55.4 | 76.4 | | |
| Scaled Optimal (Cuervo and Marxer, 2024) | ? | 823M | 61.3 | 56.7 | 78.0 | | |
| Predicted Optimal (Cuervo and Marxer, 2024) | 1*A5000 | 78M | 56.85 | 54.09 | 70.49 | | |
| TWIST-350M (original recipe) | 1*A5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
| TWIST-350M + sTinyStories | 1*A5000 | 305M | 51.21 ± .26 | 54.17 ± .54 | 72.40 ± .18 | 159.0 ± 6.0 | 4.18 ± .24 |
| Slam (-DPO) (ours) | 1*A5000 | 358M | 56.45 ± .17 | 55.59 ± .30 | 78.01 ± .27 | 88.3 ± 1.0 | 3.47 ± .17 |
| Slam (ours) | 1*A5000 | 358M | 58.86 ± .20 | 58.04 ± .51 | 82.04 ± .21 | 62.8 ± 4.1 | 3.88 ± .11 |