HYBRID TRANSFORMERS FOR MUSIC SOURCE SEPARATION
ABSTRACT
A natural question arising in Music Source Separation (MSS) is whether long range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention based Transformers [1] have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net based on Hybrid Demucs [2], where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains. While it performs poorly when trained only on MUSDB [3], we show that it outperforms Hybrid Demucs (trained on the same data) by $0.45~\mathrm{dB}$ of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per-source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with $9.20~\mathrm{dB}$ of SDR.
Index Terms— Music Source Separation, Transformers

1. INTRODUCTION

Since the 2015 Signal Separation Evaluation Campaign (SiSEC) [4], the MSS community has mostly focused on the task of training supervised models to separate songs into 4 stems: drums, bass, vocals and other (all the other instruments). The reference dataset used to benchmark MSS is MUSDB18 [3, 5], which is made of 150 songs in two versions (HQ and non-HQ). Its training set is composed of 87 songs, a relatively small corpus compared with other deep learning based tasks, where Transformer [1] based architectures have seen widespread success and adoption, such as vision [6, 7] or natural language tasks [8]. Source separation is a task where both a short context and a long context as input make sense. Conv-TasNet [9] uses about one second of context to perform the separation, relying only on local acoustic features. On the other hand, Demucs [10] can use up to 10 seconds of context, which can help resolve ambiguities in the input. In the present work, we aim at studying how Transformer architectures can help leverage this context, and what amount of data is required to train them.

We first present in Section 3 a novel architecture, Hybrid Transformer Demucs (HT Demucs), which replaces the innermost layers of the original Hybrid Demucs architecture [2] with Transformer layers, applied both in the time and spectral representations, using self-attention within one domain, and cross-attention across domains. As Transformers are usually data hungry, we leverage an internal dataset composed of 800 songs on top of the MUSDB dataset, described in Section 4.

Our second contribution is to evaluate this new architecture extensively in Section 5, with various settings (depth, number of channels, context length, augmentations, etc.). We show in particular that it improves over the baseline Hybrid Demucs architecture (retrained on the same data) by 0.35 dB.

Finally, we experiment with increasing the context duration using sparse kernels based on Locally Sensitive Hashing (LSH) to overcome memory issues during training, together with a per-source fine-tuning procedure, thus achieving a final SDR of 9.20 dB on the test set of MUSDB.

We release the training code, pre-trained models, and samples on our GitHub repository facebookresearch/demucs.

2. RELATED WORK
A traditional split for MSS methods is between spectrogram based and waveform based models. The former includes models like Open-Unmix [11], a biLSTM with fully connected layers that predicts a mask over the input spectrogram, or D3Net [12], which uses dilated convolutional blocks with dense connections. More recently, using the complex spectrogram as input and output has been favored [13], as it provides a richer representation and removes the topline given by the Ideal Ratio Mask. The latest spectrogram model, Band-Split RNN [14], combines this idea with multiple dual-path RNNs [15], each acting on a carefully crafted frequency band. It currently achieves the state of the art on MUSDB with 8.9 dB. Waveform based models started with Wave-U-Net [16], which served as the basis for Demucs [10], a time domain U-Net with a bi-LSTM between the encoder and decoder. Around the same time, Conv-TasNet showed competitive results [9, 10] using residual dilated convolution blocks to predict a mask over a learnt representation. Finally, a recent trend has been to use both temporal and spectral domains, either through model blending, like KUIELAB-MDX-Net [17], or using a bi-U-Net structure with a shared backbone, as in Hybrid Demucs [2]. Hybrid Demucs was the first ranked architecture at the latest MDX MSS competition [18], although it is now surpassed by Band-Split RNN.
Table 1: Comparison with baselines on the test set of MUSDB HQ (methods with a ∗ are reported on the non-HQ version). “Extra?” indicates the number of extra songs used at train time, $\dagger$ indicates that only mixes are used. “fine tuned” indicates per-source fine-tuning.
Architecture | Extra? | All | Drums | Bass | Other | Vocals |
---|---|---|---|---|---|---|
IRM oracle | N/A | 8.22 | 8.45 | 7.12 | 7.85 | 9.43 |
KUIELAB-MDX-Net [17] | | 7.54 | 7.33 | 7.86 | 5.95 | 9.00 |
Hybrid Demucs [2] | | 7.64 | 8.12 | 8.43 | 5.65 | 8.35 |
Band-Split RNN [14] | | 8.24 | 9.01 | 7.22 | 6.70 | 10.01 |
HT Demucs | | 7.52 | 7.94 | 8.48 | 5.72 | 7.93 |
Spleeter* [19] | 25k | 5.91 | 6.71 | 5.51 | 4.55 | 6.86 |
D3Net* [12] | 1.5k | 6.68 | 7.36 | 6.20 | 5.37 | 7.80 |
Demucs v2* [10] | 150 | 6.79 | 7.58 | 7.60 | 4.69 | 7.29 |
Hybrid Demucs [2] | 800 | 8.34 | 9.31 | 9.13 | 6.18 | 8.75 |
Band-Split RNN [14] | 1750† | 8.97 | 10.15 | 8.16 | 7.08 | 10.47 |
HT Demucs | 150 | 8.49 | 9.51 | 9.76 | 6.13 | 8.56 |
HT Demucs | 800 | 8.80 | 10.05 | 9.78 | 6.42 | 8.93 |
HT Demucs (fine tuned) | 800 | 9.00 | 10.08 | 10.39 | 6.32 | 9.20 |
Sparse HT Demucs (fine tuned) | 800 | 9.20 | 10.83 | 10.47 | 6.41 | 9.37 |
Using large datasets has been shown to be beneficial to the task of MSS. Spleeter [19] is a spectrogram masking U-Net architecture trained on 25,000 song extracts of 30 seconds, and was, at the time of its release, the best model available. Both D3Net and Demucs highly benefited from using extra training data, while still offering strong performance when trained on MUSDB only. Band-Split RNN introduced a novel unsupervised augmentation technique requiring only mixes to improve its performance by $0.7~\mathrm{dB}$ of SDR.
Transformers have been used for speech source separation with SepFormer [20], which is similar to Dual-Path RNN: short range attention layers are interleaved with long range ones. However, it requires almost 11 GB of memory for the forward pass on 5 seconds of audio at $8~\mathrm{kHz}$, and is thus not adequate for studying longer inputs at $44.1~\mathrm{kHz}$.
3. ARCHITECTURE
We introduce the Hybrid Transformer Demucs model, based on Hybrid Demucs [2]. The original Hybrid Demucs model is made of two U-Nets, one in the time domain (with temporal convolutions) and one in the spectrogram domain (with convolutions over the frequency axis). Each U-Net is made of 5 encoder layers and 5 decoder layers. After the 5th encoder layer, both representations have the same shape, and they are summed before going into a shared 6th layer. Similarly, the first decoder layer is shared, and its output is sent to both the temporal and spectral branches. The output of the spectral branch is transformed back to a waveform using the iSTFT, before being summed with the output of the temporal branch, giving the actual prediction of the model.
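To make the fusion of the two branches concrete, here is a minimal PyTorch sketch of the final combination step described above. The function name and tensor shapes are ours, and the STFT parameters are illustrative rather than the exact values used by Hybrid Demucs; it assumes the spectral branch predicts a complex spectrogram.

```python
import torch

def combine_outputs(spec_out: torch.Tensor, time_out: torch.Tensor,
                    n_fft: int = 4096, hop_length: int = 1024) -> torch.Tensor:
    """Fuse the spectral and temporal U-Net outputs (illustrative sketch).

    spec_out: complex spectrogram predicted by the spectral branch,
              shape (batch * sources * channels, freq, frames).
    time_out: waveform predicted by the temporal branch,
              shape (batch * sources * channels, samples).
    """
    # Bring the spectral prediction back to the time domain with the iSTFT.
    wav_from_spec = torch.istft(
        spec_out, n_fft=n_fft, hop_length=hop_length,
        window=torch.hann_window(n_fft, device=time_out.device),
        length=time_out.shape[-1])
    # The model's actual prediction is simply the sum of both branches.
    return wav_from_spec + time_out
```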
Hybrid Transformer Demucs keeps the outermost 4 layers of the original architecture as is, and replaces the 2 innermost layers in the encoder and the decoder, including the local attention and bi-LSTM, with a cross-domain Transformer Encoder. It processes in parallel the 2D signal from the spectral branch and the 1D signal from the waveform branch. Unlike the original Hybrid Demucs, which required careful tuning of the model parameters (STFT window and hop length, stride, padding, etc.) to align the time and spectral representations, the cross-domain Transformer Encoder can work with heterogeneous data shapes, making it a more flexible architecture.
The architecture of Hybrid Transformer Demucs is depicted in Fig. 1. On the left, we show a single self-attention Encoder layer of the Transformer [1], with normalizations before the Self-Attention and Feed-Forward operations; it is combined with Layer Scale [6] initialized to $\epsilon{=}10^{-4}$ in order to stabilize training. The first two normalizations are layer normalizations (each token is independently normalized) and the third one is a time layer normalization (all the tokens are normalized together). The input/output dimension of the Transformer is 384, and linear layers are used to convert to the internal dimension of the Transformer when required. The attention mechanism has 8 heads, and the hidden state size of the feed-forward network is equal to 4 times the dimension of the Transformer. The cross-attention Encoder layer is the same, but uses cross-attention with the representation from the other domain. In the middle, a cross-domain Transformer Encoder of depth 5 is depicted. It interleaves self-attention Encoder layers and cross-attention Encoder layers in the spectral and waveform domains. 1D [1] and 2D [21] sinusoidal encodings are added to the scaled inputs, and the spectral representation is reshaped in order to treat it as a sequence. In Fig. 1 (c), we give a representation of the entire architecture, along with the double U-Net encoder/decoder structure.
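As an illustration, the following is a simplified PyTorch sketch of one self-attention Encoder layer as described above (pre-normalization, 8 heads, feed-forward hidden size equal to 4 times the model dimension of 384, and Layer Scale initialized to $10^{-4}$). It is a minimal re-implementation for clarity, not the released code; in particular it omits the time layer normalization, the positional encodings, and the activation used in the actual model.

```python
import torch
from torch import nn

class SelfAttentionEncoderLayer(nn.Module):
    """Pre-norm Transformer encoder layer with Layer Scale (simplified sketch)."""

    def __init__(self, dim: int = 384, heads: int = 8,
                 layer_scale_init: float = 1e-4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Layer Scale: learnable per-channel scaling of each residual branch,
        # initialized to a small value to stabilize training.
        self.scale1 = nn.Parameter(layer_scale_init * torch.ones(dim))
        self.scale2 = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, dim); self-attention within a single domain.
        y = self.norm1(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        x = x + self.scale1 * y
        y = self.ff(self.norm2(x))
        return x + self.scale2 * y
```

The cross-attention variant would be structured the same way, but pass the other domain's sequence as keys and values in the attention call.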
Memory consumption and the speed of attention quickly deteriorate as the sequence length increases. To scale further, we leverage the sparse attention kernels introduced in the xFormers package [22], along with a Locally Sensitive Hashing (LSH) scheme to determine the sparsity pattern dynamically. We use a sparsity level of 90% (defined as the proportion of elements removed in the softmax), which is obtained by performing 32 rounds of LSH with 4 buckets each. We select the elements that match at least $k$ times over all 32 rounds of LSH, with $k$ chosen such that the sparsity level is 90%. We refer to this variant as Sparse HT Demucs.
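The sketch below illustrates how such an LSH-derived sparsity pattern could be computed. The bucket assignment via random projections is an assumption made for illustration (the paper does not specify the hash family), and the resulting boolean mask would then be handed to the sparse attention kernels rather than used with dense attention.

```python
import torch

def lsh_sparsity_mask(queries: torch.Tensor, keys: torch.Tensor,
                      rounds: int = 32, buckets: int = 4,
                      sparsity: float = 0.9) -> torch.Tensor:
    """Boolean (q_len, k_len) mask of the attention entries to keep.

    Each round projects queries and keys onto `buckets` random directions
    and assigns every vector to the bucket of its largest projection.
    A query/key pair is kept when it collides in at least k of the
    `rounds` rounds, with k chosen so that roughly `sparsity` of the
    entries are removed from the softmax.
    """
    dim = queries.shape[-1]
    collisions = torch.zeros(queries.shape[0], keys.shape[0])
    for _ in range(rounds):
        proj = torch.randn(dim, buckets)            # one random hash per round
        q_bucket = (queries @ proj).argmax(dim=-1)
        k_bucket = (keys @ proj).argmax(dim=-1)
        collisions += (q_bucket[:, None] == k_bucket[None, :]).float()
    # Threshold the collision counts so that about `sparsity` of the
    # entries are dropped (ties can make the ratio only approximate).
    threshold = torch.quantile(collisions.flatten(), sparsity)
    return collisions >= threshold
```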
4. DATASET
We curated an internal dataset composed of 3500 songs with stems, from 200 artists with diverse music genres. Each stem is assigned to one of the 4 sources according to the name given by the music producer (for instance "vocals2", "fx", "sub", etc.). This labeling is noisy because these names are subjective and sometimes ambiguous. For 150 of those tracks, we manually verified that the automated labeling was correct, and discarded ambiguous stems. We trained a first Hybrid Demucs model on MUSDB and those 150 tracks. We preprocess the dataset according to several rules. First, we keep only the tracks for which all four sources are non-silent at least 30% of the time. Each 1 second segment is defined as silent if its volume is less than -40 dB. Second, for a song $x$ of our dataset, noting $x_{i}$, with $i \in \{\text{drums}, \text{bass}, \text{other}, \text{vocals}\}$, each stem, and $f$ the Hybrid Demucs model previously mentioned, we define $y_{i,j} = f(x_{i})_{j}$, i.e. the output $j$ when separating the stem $i$. Theoretically, if all the stems were perfectly labeled and if $f$ were a perfect source separation model, we would have $y_{i,j} = x_{i}\delta_{i,j}$, with $\delta_{i,j}$ being the Kronecker delta. For a waveform $z$, let us define the volume in dB measured over 1 second segments:
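For illustration, a short sketch of the 1-second volume measurement and the -40 dB silence rule described above; the function and variable names are ours, and the exact pooling used for the paper's preprocessing may differ.

```python
import torch

def volume_db(z: torch.Tensor, sample_rate: int = 44100) -> torch.Tensor:
    """Volume in dB of a mono waveform `z`, measured over 1-second segments."""
    n = (z.shape[-1] // sample_rate) * sample_rate
    segments = z[..., :n].reshape(-1, sample_rate)   # (num_segments, sample_rate)
    power = segments.pow(2).mean(dim=-1)             # mean power per 1 s segment
    return 10 * torch.log10(power + 1e-12)           # dB, with a small floor

def is_silent(z: torch.Tensor, threshold_db: float = -40.0) -> torch.Tensor:
    """Per-segment boolean flag: True when the 1 s volume is below -40 dB."""
    return volume_db(z) < threshold_db
```

Under this sketch, the first filtering rule amounts to keeping a track only when, for every stem, the fraction of non-silent segments `(~is_silent(stem)).float().mean()` is at least 0.3.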