[Paper Translation] Hybrid Transformers for Music Source Separation


Original paper: https://arxiv.org/pdf/2211.08553v1


HYBRID TRANSFORMERS FOR MUSIC SOURCE SEPARATION

ABSTRACT

A natural question arising in Music Source Separation (MSS) is whether long range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention based Transformers [1] have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net based on Hybrid Demucs [2], where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains. While it performs poorly when trained only on MUSDB [3], we show that it outperforms Hybrid Demucs (trained on the same data) by $0.45~\mathrm{dB}$ of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with $9.20~\mathrm{dB}$ of SDR.

Index Terms— Music Source Separation, Transformers

Our second contribution is to extensively evaluate this new architecture in Section 5, under various settings (depth, number of channels, context length, augmentations, etc.). In particular, we show that it improves over the baseline Hybrid Demucs architecture (retrained on the same data) by 0.35 dB.

Finally, we experiment with increasing the context duration using sparse kernels based on Locally Sensitive Hashing to overcome memory issues during training, together with a per source fine-tuning procedure, thus achieving a final SDR of 9.20 dB on the test set of MUSDB.

We release the training code, pre-trained models, and samples on our GitHub at facebookresearch/demucs.


1. INTRODUCTION

Since the 2015 Signal Separation Evaluation Campaign (SiSEC) [4], the MSS community has mostly focused on training supervised models to separate songs into 4 stems: drums, bass, vocals and other (all the other instruments). The reference dataset used to benchmark MSS is MUSDB18 [3, 5], which is made of 150 songs in two versions (HQ and non-HQ). Its training set is composed of 87 songs, a relatively small corpus compared with other deep learning tasks, such as vision [6, 7] or natural language processing [8], where Transformer [1] based architectures have seen widespread success and adoption. Source separation is a task where both a short context and a long context make sense as input. Conv-Tasnet [9] uses about one second of context to perform the separation, relying only on local acoustic features. On the other hand, Demucs [10] can use up to 10 seconds of context, which can help resolve ambiguities in the input. In the present work, we aim to study how Transformer architectures can help leverage this context, and what amount of data is required to train them.

We first present in Section 3 a novel architecture, Hybrid Transformer Demucs (HT Demucs), which replaces the innermost layers of the original Hybrid Demucs architecture [2] with Transformer layers, applied both in the time and spectral representations, using self-attention within one domain, and cross-attention across domains. As Transformers are usually data hungry, we leverage an internal dataset composed of 800 songs on top of the MUSDB dataset, described in Section 4.

2. RELATED WORK

A traditional split for MSS methods is between spectrogram based and waveform based models. The former includes models like Open-Unmix [11], a biLSTM with fully connected layers that predicts a mask on the input spectrogram, or D3Net [12], which uses dilated convolutional blocks with dense connections. More recently, using the complex spectrogram as input and output was favored [13], as it provides a richer representation and removes the topline given by the Ideal Ratio Mask. The latest spectrogram model, Band-Split RNN [14], combines this idea with multiple dual-path RNNs [15], each acting on a carefully crafted frequency band. It currently achieves the state of the art on MUSDB with 8.9 dB. Waveform based models started with Wave-U-Net [16], which served as the basis for Demucs [10], a time domain U-Net with a bi-LSTM between the encoder and decoder. Around the same time, Conv-TasNet showed competitive results [9, 10], using residual dilated convolution blocks to predict a mask over a learnt representation. Finally, a recent trend has been to use both the temporal and spectral domains, either through model blending, like KUIELAB-MDX-Net [17], or using a bi-U-Net structure with a shared backbone, as in Hybrid Demucs [2]. Hybrid Demucs was the first ranked architecture at the latest MDX MSS Competition [18], although it is now surpassed by Band-Split RNN.

Table 1: Comparison with baselines on the test set of MUSDB HQ (methods with a ∗ are reported on the non-HQ version). “Extra?” indicates the number of extra songs used at train time, $\dagger$ indicates that only mixes are used. “fine tuned” indicates per source fine-tuning.

| Architecture | Extra? | All | Drums | Bass | Other | Vocals |
| --- | --- | --- | --- | --- | --- | --- |
| IRM oracle | N/A | 8.22 | 8.45 | 7.12 | 7.85 | 9.43 |
| KUIELAB-MDX-Net [17] | | 7.54 | 7.33 | 7.86 | 5.95 | 9.00 |
| Hybrid Demucs [2] | | 7.64 | 8.12 | 8.43 | 5.65 | 8.35 |
| Band-Split RNN [14] | | 8.24 | 9.01 | 7.22 | 6.70 | 10.01 |
| HT Demucs | | 7.52 | 7.94 | 8.48 | 5.72 | 7.93 |
| Spleeter* [19] | 25k | 5.91 | 6.71 | 5.51 | 4.55 | 6.86 |
| D3Net* [12] | 1.5k | 6.68 | 7.36 | 6.20 | 5.37 | 7.80 |
| Demucs v2* [10] | 150 | 6.79 | 7.58 | 7.60 | 4.69 | 7.29 |
| Hybrid Demucs [2] | 800 | 8.34 | 9.31 | 9.13 | 6.18 | 8.75 |
| Band-Split RNN [14] | 1750† | 8.97 | 10.15 | 8.16 | 7.08 | 10.47 |
| HT Demucs | 150 | 8.49 | 9.51 | 9.76 | 6.13 | 8.56 |
| HT Demucs | 800 | 8.80 | 10.05 | 9.78 | 6.42 | 8.93 |
| HT Demucs (fine tuned) | 800 | 9.00 | 10.08 | 10.39 | 6.32 | 9.20 |
| Sparse HT Demucs (fine tuned) | 800 | 9.20 | 10.83 | 10.47 | 6.41 | 9.37 |

Using large datasets has been shown to be beneficial to the task of MSS. Spleeter [19] is a spectrogram masking U-Net architecture trained on 30-second extracts from 25,000 songs, and was, at the time of its release, the best model available. Both D3Net and Demucs highly benefited from using extra training data, while still offering strong performance when trained on MUSDB only. Band-Split RNN introduced a novel unsupervised augmentation technique requiring only mixes, improving its performance by 0.7 dB of SDR.

Transformers have been used for speech source separation with SepFormer [20], which is similar to Dual-Path RNN: short range attention layers are interleaved with long range ones. However, it requires almost 11GB of memory for the forward pass on 5 seconds of audio at 8 kHz, and is thus not adequate for studying longer inputs at 44.1 kHz.

3. ARCHITECTURE

We introduce the Hybrid Transformer Demucs model, based on Hybrid Demucs [2]. The original Hybrid Demucs model is made of two U-Nets, one in the time domain (with temporal convolutions) and one in the spectrogram domain (with convolutions over the frequency axis). Each U-Net is made of 5 encoder layers and 5 decoder layers. After the 5-th encoder layer, both representations have the same shape, and they are summed before going into a shared 6-th layer. Similarly, the first decoder layer is shared, and its output is sent to both the temporal and spectral branches. The output of the spectral branch is transformed to a waveform using the iSTFT, before being summed with the output of the temporal branch, giving the actual prediction of the model.

Hybrid Transformer Demucs keeps the outermost 4 layers as is from the original architecture, and replaces the 2 innermost layers in the encoder and the decoder, including local attention and bi-LSTM, with a cross-domain Transformer Encoder. It treats in parallel the 2D signal from the spectral branch and the 1D signal from the waveform branch. Unlike the original Hybrid Demucs, which required careful tuning of the model parameters (STFT window and hop length, stride, padding, etc.) to align the time and spectral representations, the cross-domain Transformer Encoder can work with heterogeneous data shapes, making it a more flexible architecture.

The architecture of Hybrid Transformer Demucs is depicted in Fig. 1. On the left, we show a single self-attention Encoder layer of the Transformer [1], with normalizations before the Self-Attention and Feed-Forward operations; it is combined with Layer Scale [6] initialized to $\epsilon{=}10^{-4}$ in order to stabilize the training. The first two normalizations are layer normalizations (each token is independently normalized) and the third one is a time layer normalization (all the tokens are normalized together). The input/output dimension of the Transformer is 384, and linear layers are used to convert to the internal dimension of the Transformer when required. The attention mechanism has 8 heads, and the hidden state size of the feed-forward network is equal to 4 times the dimension of the Transformer. The cross-attention Encoder layer is the same, but uses cross-attention with the other domain representation. In the middle, a cross-domain Transformer Encoder of depth 5 is depicted. It is the interleaving of self-attention Encoder layers and cross-attention Encoder layers in the spectral and waveform domains. 1D [1] and 2D [21] sinusoidal encodings are added to the scaled inputs, and reshaping is applied to the spectral representation in order to treat it as a sequence. In Fig. 1 (c), we give a representation of the entire architecture, along with the double U-Net encoder/decoder structure.

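As an illustration of Fig. 1 (a), the layer just described can be sketched in a few lines of numpy. This is a minimal, untrained sketch under stated assumptions: the weights are random, the class name is ours, and ReLU stands in for the feed-forward activation; only the pre-norm/Layer Scale/residual wiring, the 8 heads, and the 4x feed-forward width follow the text.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

class SelfAttentionEncoderLayer:
    """Pre-norm Transformer Encoder layer with Layer Scale (cf. Fig. 1 (a))."""

    def __init__(self, dim=384, heads=8, ls_init=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        s = dim ** -0.5
        self.heads, self.dk = heads, dim // heads
        self.wq, self.wk, self.wv, self.wo = (
            rng.standard_normal((dim, dim)) * s for _ in range(4))
        self.w1 = rng.standard_normal((dim, 4 * dim)) * s          # FF hidden = 4 x dim
        self.w2 = rng.standard_normal((4 * dim, dim)) * (4 * dim) ** -0.5
        # Layer Scale: per-channel residual gains initialized to 1e-4,
        # so the layer starts close to the identity.
        self.ls1 = np.full(dim, ls_init)
        self.ls2 = np.full(dim, ls_init)

    def _attention(self, x):
        t, d = x.shape
        q, k, v = (x @ w for w in (self.wq, self.wk, self.wv))
        q, k, v = (a.reshape(t, self.heads, self.dk).transpose(1, 0, 2)
                   for a in (q, k, v))
        att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.dk))
        return (att @ v).transpose(1, 0, 2).reshape(t, d) @ self.wo

    def __call__(self, x):
        # norm -> self-attention -> Layer Scale -> residual
        x = x + self.ls1 * self._attention(layer_norm(x))
        # norm -> feed-forward -> Layer Scale -> residual
        x = x + self.ls2 * (np.maximum(layer_norm(x) @ self.w1, 0.0) @ self.w2)
        return x
```

With the $10^{-4}$ initialization, the freshly initialized layer is a small perturbation of the identity, which is precisely what stabilizes the training of a deep stack.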

Memory consumption and the speed of attention quickly deteriorate as the sequence length increases. To further scale, we leverage the sparse attention kernels introduced in the xformers package [22], along with a Locally Sensitive Hashing (LSH) scheme to determine the sparsity pattern dynamically. We use a sparsity level of 90% (defined as the proportion of elements removed in the softmax), which is determined by performing 32 rounds of LSH with 4 buckets each. We select the elements that match at least $k$ times over all 32 rounds of LSH, with $k$ such that the sparsity level is 90%. We refer to this variant as Sparse HT Demucs.

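The mask construction can be sketched as follows. This is a hypothetical numpy illustration of the scheme, not the actual xformers kernel: each round hashes queries and keys into 4 buckets with random sign projections, bucket agreements are counted over 32 rounds, and the threshold $k$ is picked as the quantile matching the target 90% sparsity.

```python
import numpy as np

def lsh_sparsity_mask(queries, keys, rounds=32, n_buckets=4, sparsity=0.9, seed=0):
    """Boolean attention mask (True = kept in the softmax).

    Each round hashes vectors with log2(n_buckets) random hyperplanes; the
    match-count threshold is chosen so ~`sparsity` of the entries are dropped.
    """
    rng = np.random.default_rng(seed)
    d = queries.shape[1]
    n_planes = int(np.log2(n_buckets))          # 2 hyperplanes -> 4 buckets
    bits = 2 ** np.arange(n_planes)
    counts = np.zeros((len(queries), len(keys)), dtype=np.int32)
    for _ in range(rounds):
        planes = rng.standard_normal((d, n_planes))
        qb = ((queries @ planes) > 0) @ bits    # bucket id per query
        kb = ((keys @ planes) > 0) @ bits       # bucket id per key
        counts += qb[:, None] == kb[None, :]
    k = np.quantile(counts, sparsity)           # threshold matching the sparsity
    return counts >= k
```

Identical query/key vectors fall into the same bucket in every round, so in self-attention the diagonal always survives the mask.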

4. DATASET

We curated an internal dataset composed of 3500 songs with stems, from 200 artists covering diverse music genres. Each stem is assigned to one of the 4 sources according to the name given by the music producer (for instance ”vocals2”, ”fx”, ”sub”, etc.). This labeling is noisy, because these names are subjective and sometimes ambiguous. For 150 of those tracks, we manually verified that the automated labeling was correct, and discarded ambiguous stems. We trained a first Hybrid Demucs model on MUSDB and those 150 tracks. We preprocess the dataset according to several rules. First, we keep only the tracks for which all four sources are non silent at least 30% of the time. We define a 1 second segment as silent if its volume is less than -40 dB. Second, for a song $x$ of our dataset, noting $x_{i}$, with $i\in\{\text{drums, bass, other, vocals}\}$, each of its stems, and $f$ the Hybrid Demucs model previously mentioned, we define $y_{i,j}=f(x_{i})_{j}$, i.e. the output $j$ when separating the stem $i$. Theoretically, if all the stems were perfectly labeled and if $f$ were a perfect source separation model, we would have $y_{i,j}=x_{i}\delta_{i,j}$, with $\delta_{i,j}$ being the Kronecker delta. For a waveform $z$, let us define the volume in dB measured over 1 second segments:



Fig. 1: Details of the Hybrid Transformer Demucs architecture. (a): the Transformer Encoder layer with self-attention and Layer Scale [6]. (b): The Cross-domain Transformer Encoder treats spectral and temporal signals with interleaved Transformer Encoder layers and cross-attention Encoder layers. (c): Hybrid Transformer Demucs keeps the outermost 4 encoder and decoder layers of Hybrid Demucs with the addition of a cross-domain Transformer Encoder between them.

$$
V(z)=10\cdot\log_{10}\left(\mathrm{AveragePool}(z^{2},1\,\mathrm{sec})\right).
$$

For each pair of sources $i,j$, we take the 1 second segments where the stem is present (with respect to the first criterion) and define $P_{i,j}$ as the proportion of these segments where $V(y_{i,j})-V(x_{i}) > -10$ dB. We obtain a square matrix $P\in[0,1]^{4\times4}$, and we notice that in perfect conditions, we should have $P=\mathrm{Id}$. We thus keep only the songs for which $P_{i,i}>70\%$ for all sources $i$, and $P_{i,j}<30\%$ for all pairs of sources $i\neq j$. This procedure selects 800 songs.
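The whole filtering procedure can be sketched as follows. This is a simplified mono-waveform sketch under stated assumptions: `separate` is a stand-in for the pretrained Hybrid Demucs model $f$, and the helper names (`volume_db`, `keep_song`) are ours.

```python
import numpy as np

SOURCES = ["drums", "bass", "other", "vocals"]
SR = 44100

def volume_db(z, sr=SR):
    """V(z) = 10*log10(AveragePool(z^2, 1 sec)): one dB value per second."""
    n = len(z) // sr
    segs = z[: n * sr].reshape(n, sr)
    return 10 * np.log10((segs ** 2).mean(axis=1) + 1e-12)

def keep_song(stems, separate, silent_db=-40.0, active_ratio=0.3,
              leak_db=-10.0, diag_min=0.7, offdiag_max=0.3):
    """Apply both filtering rules to one song; `stems` maps source -> waveform."""
    vols = {i: volume_db(stems[i]) for i in SOURCES}
    # rule 1: every source must be non-silent at least 30% of the time
    if any((vols[i] > silent_db).mean() < active_ratio for i in SOURCES):
        return False
    # rule 2: build P[i, j] by separating each stem in isolation
    P = np.zeros((4, 4))
    for a, i in enumerate(SOURCES):
        active = vols[i] > silent_db
        y = separate(stems[i])               # y[j] plays the role of f(x_i)_j
        for b, j in enumerate(SOURCES):
            P[a, b] = (volume_db(y[j])[active] - vols[i][active] > leak_db).mean()
    return bool(np.all(np.diag(P) > diag_min)
                and np.all(P[~np.eye(4, dtype=bool)] < offdiag_max))
```

A perfect separator yields $P=\mathrm{Id}$ and the song is kept; a separator that leaks a stem into every output fails the off-diagonal test and the song is discarded.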

5. EXPERIMENTS AND RESULTS

5.1. Experimental Setup

All our experiments are done on 8 Nvidia V100 GPUs with 32GB of memory, using fp32 precision. We use the L1 loss on the waveforms, optimized with Adam [23] without weight decay, a learning rate of $3\cdot10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a batch size of 32 unless stated otherwise. We train for 1200 epochs of 800 batches each over the MUSDB18-HQ dataset, supplemented with the 800 curated songs presented in Section 4, sampled at 44.1 kHz and stereophonic. We use exponential moving average as described in [2], and select the best model over the validation set, composed of the validation set of MUSDB18-HQ and another 8 songs. We use the same data augmentations as described in [2], including repitching/tempo stretch and remixing of the stems within one batch.

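The weight averaging mentioned above can be sketched generically (a minimal stand-in operating on a dict of arrays; the 0.999 decay in the signature is an illustrative assumption, not a value from the paper):

```python
import numpy as np

def ema_update(ema_state, model_state, decay=0.999):
    """One exponential-moving-average step over a dict of weight arrays.

    The averaged copy, rather than the raw training weights, is the one
    evaluated on the validation set for model selection.
    """
    for name, w in model_state.items():
        ema_state[name] = decay * ema_state[name] + (1.0 - decay) * w
    return ema_state
```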

Using one model per source can be beneficial [14], although it adds overhead both at train time and at evaluation. In order to limit the impact at train time, we propose a procedure where one copy of the multi-target model is fine-tuned on a single target task for 50 epochs, with a learning rate of $10^{-4}$, and no remixing, repitching, nor rescaling applied as data augmentation. Having noticed some instability towards the end of the training of the main model, we use for the fine-tuning a gradient clipping (maximum L2 norm of 5 for the gradient) and a weight decay of 0.05.

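Clipping the gradient to a maximum global L2 norm rescales the whole gradient when its norm exceeds the threshold; a minimal sketch over a list of gradient arrays (our own helper, equivalent in spirit to the usual framework utilities):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale all gradients together so their global L2 norm is <= max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```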

At test time, we split the audio into chunks with the same duration as the one used for training, with an overlap of 25% and a linear transition from one chunk to the next. We report the Signal-to-Distortion Ratio (SDR) as defined by SiSEC18 [24], which is the median across songs of the median SDR over all 1 second chunks in each song. In Tab. 2, we also report the Real Time Factor (RTF) computed on a single core of an Intel Xeon CPU at 2.20GHz. It is defined as the time to process some fixed audio input (we use 40 seconds of gaussian noise) divided by the input duration.

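The chunked inference with a linear transition can be sketched as weighted overlap-add with trapezoidal weights (our own mono formulation, assuming `apply_model` is a placeholder for the separation model applied to one chunk):

```python
import numpy as np

def separate_chunked(mix, apply_model, segment, overlap=0.25):
    """Run `apply_model` on overlapping chunks and cross-fade linearly."""
    stride = int(segment * (1 - overlap))
    ramp = int(segment * overlap)
    # trapezoidal weights: linear fade over the overlapped region
    win = np.ones(segment)
    win[:ramp] = np.arange(1, ramp + 1) / ramp
    win[-ramp:] = win[:ramp][::-1]
    out = np.zeros(len(mix))
    weight = np.zeros(len(mix))
    for start in range(0, len(mix), stride):
        chunk = mix[start:start + segment]
        w = win[:len(chunk)]
        out[start:start + len(chunk)] += apply_model(chunk) * w
        weight[start:start + len(chunk)] += w
    return out / weight      # weight > 0 everywhere a chunk was applied
```

With an identity model, the weighted overlap-add reconstructs the input exactly, which is a quick check that the cross-fade weights are consistent.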

5.2. Comparison with the baselines

In Tab. 1, we compare to several state-of-the-art baselines. For reference, we also provide baselines that are trained without any extra training data. Compared with the original Hybrid Demucs architecture, we notice that the improved sequence modeling capabilities of Transformers increase the SDR by 0.45 dB for the simple version, and by almost 0.9 dB when using the sparse variant of our model along with fine-tuning. The most competitive baseline is Band-Split RNN [14], which achieves a better SDR on both the other and vocals sources, despite using only MUSDB18-HQ as a supervised training set, along with 1750 unsupervised tracks as extra training data.

5.3. Impact of the architecture hyper-parameters

We first study the influence of three architectural hyper-parameters in Tab. 2: the duration in seconds of the excerpts used to train the model, the depth of the Transformer Encoder, and its dimension. For short training excerpts (3.4 seconds), we notice that increasing the depth from 5 to 7 increases the test SDR, while increasing the Transformer dimension from 384 to 512 lowers it slightly, by 0.05 dB. With longer segments (7.8 seconds), we observe an increase of almost 0.6 dB when using a depth of 5 and a dimension of 384. Given this observation, we also tried to increase the duration or the depth, but this led to Out Of Memory (OOM) errors. Finally, increasing the dimension to 512 when training over 7.8 seconds led to an improvement of 0.1 dB.

5.4. Impact of the data augmentation

In Tab. 3, we study the impact of disabling some of the data augmentations, as we were hoping that using more training data would reduce the need for such augmentations. However, we observe a constant deterioration of the final SDR as we disable them. While the repitching augmentation has a limited impact, the remixing augmentation remains highly important to train our model, with a loss of 0.7 dB without it.

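The remixing augmentation shuffles each source independently across the batch and sums the shuffled stems into new training mixtures; a minimal sketch (the `(batch, sources, time)` layout is our assumption):

```python
import numpy as np

def remix_stems(stems, rng):
    """Shuffle every source independently across the batch and rebuild mixes.

    `stems` has shape (batch, sources, time); returns (new_stems, new_mix),
    so each example now trains on a random combination of stems.
    """
    batch, n_sources, _ = stems.shape
    new = np.stack([stems[rng.permutation(batch), s] for s in range(n_sources)],
                   axis=1)
    return new, new.sum(axis=1)
```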

Table 2: Impact of segment duration, transformer depth and transformer dimension. OOM means Out of Memory. Results are commented in Sec. 5.3.

| Duration (sec) | Depth | Dimension | Params | RTF (cpu) | SDR (All) |
| --- | --- | --- | --- | --- | --- |
| 3.4 | 5 | 384 | 26.9M | 1.02 | 8.17 |
| 3.4 | 7 | 384 | 34.0M | 1.23 | 8.26 |
| 3.4 | 5 | 512 | 41.4M | 1.30 | 8.12 |
| 7.8 | 5 | 384 | 26.9M | 1.49 | 8.70 |
| 7.8 | 7 | 384 | 34.0M | 1.68 | OOM |
| 7.8 | 5 | 512 | 41.4M | 1.77 | 8.80 |
| 12.2 | 5 | 384 | 26.9M | 2.04 | OOM |

Table 3: Impact of data augmentation. The model has a depth of 5, and a dimension of 384. See Sec. 5.4 for details.

| Duration (sec) | Depth | Remixing | Repitching | SDR (All) |
| --- | --- | --- | --- | --- |
| 7.8 | 5 | ✓ | ✓ | 8.70 |
| 7.8 | 5 | ✓ | ✗ | 8.65 |
| 7.8 | 5 | ✗ | ✗ | 8.00 |

5.5. Impact of using sparse kernels and fine tuning

We test the sparse kernels described in Section 3 to increase the depth to 7 and the training segment duration to 12.2 seconds, with a dimension of 512. This simple change yields an extra 0.14 dB of SDR (8.94 dB). The per source fine-tuning improves the SDR by 0.25 dB, to 9.20 dB, despite requiring only 50 epochs to train. We tried further extending the receptive field of the Transformer Encoder to 15 seconds during the fine-tuning stage by reducing the batch size; however, this led to the same SDR of 9.20 dB, although training from scratch with such a context might lead to a different result.

6. CONCLUSION

We introduced Hybrid Transformer Demucs, a Transformer based variant of Hybrid Demucs that replaces the innermost convolutional layers with a Cross-domain Transformer Encoder, using self-attention and cross-attention to process spectral and temporal information. This architecture benefits from our large training dataset and outperforms Hybrid Demucs by 0.45 dB. Thanks to sparse attention techniques, we scaled our model to an input length of up to 12.2 seconds during training, which led to a supplementary gain of 0.4 dB. Finally, we could explore splitting the spectrogram into subbands in order to process them differently, as is done in [14].
