[Paper Translation] IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation


Original paper: https://arxiv.org/pdf/2308.08143v3


IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation


Abstract


Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this issue, we propose a novel model called Intra- and Inter-Attention Network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by the way the human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, outperforming previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only $7\%$ of CTCNet's MACs and is $40\%$ faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanism for efficient and effective multimodal fusion.


1. Introduction


In our daily lives, audio and visual signals are the primary means of information transmission, providing rich cues for humans to obtain valuable information in noisy environments (Cherry, 1953; Arons, 1992). This innate ability is known as the “cocktail party effect”, also referred to as “speech separation” in computer science. Speech separation possesses extensive research significance, such as assisting individuals with hearing impairments (Summerfield, 1992), enhancing the auditory experience of wearable devices (Ryumin et al., 2023), etc.


Researchers have previously sought to address the cocktail party effect through audio streaming alone (Luo & Mesgarani, 2019; Luo et al., 2020; Hu et al., 2021). This task is called audio-only speech separation (AOSS). However, the quality of the separation system may decline dramatically when speech is corrupted by noise (Afouras et al., 2018a). To address this issue, many researchers have focused on audio-visual speech separation (AVSS), achieving remarkable progress. Existing AVSS methods can be broadly classified into two categories based on their fusion modules: CNN-based and Transformer-based. CNN-based methods (Gao & Grauman, 2021; Li et al., 2022; Wu et al., 2019) exhibit lower computational complexity and excellent local feature extraction capabilities, enabling the extraction of audio-visual information at multiple scales to capture local contextual relationships. Transformer-based methods (Lee et al., 2021; Rahimi et al., 2022; Montesinos et al., 2022) can leverage cross-attention mechanisms to learn associations across different time steps, effectively handling dynamic relationships between audio-visual information.


By comparing the working mechanisms of these AVSS methods and the brain, we find that there are two major differences. First, auditory and visual information undergoes neural integration at multiple levels of the auditory and visual pathways, including the thalamus (Halverson & Freeman, 2006; Cai et al., 2019), A1 and V1 (Mesik et al., 2015; Eckert et al., 2008), and occipital cortex (Ghazanfar & Schroeder, 2006; Stein & Stanford, 2008). However, most AVSS methods integrate audio-visual information only at either coarse (Gao & Grauman, 2021; Lee et al., 2021) or fine (Li et al., 2022; Montesinos et al., 2022; Wu et al., 2019; Rahimi et al., 2022) temporal scale, thus ignoring semantic associations across different scales.


Second, abundant evidence indicates that the human brain employs selective attention to focus on relevant speech in a noisy environment (Cherry, 1953; Golumbic et al., 2013). This also modulates neural responses in a frequency-specific manner, enabling the tracked speech to be distinguished and separated from competing speech (Golumbic et al., 2013; Mesgarani & Chang, 2012; O'Sullivan et al., 2015; Kaya & Elhilali, 2017; Cooke & Ellis, 2001; Aytar et al., 2016). The Transformer-based AVSS methods adopt attention mechanisms (Rahimi et al., 2022; Montesinos et al., 2022), but they commonly employ the Transformer as the backbone network to extract and fuse audio-visual features, and the attention is not explicitly designed for selecting speakers. In contrast to the above complex fusion strategies, our approach aims to design a simple and efficient fusion module based on the selective attention mechanism.

To achieve this goal, inspired by the structure and function of the brain, we propose a concise and efficient audio-visual fusion scheme, called Intra- and Inter-Attention Network (IIANet), which leverages different forms of attention to extract diverse semantics from audio-visual features. IIANet consists of two hierarchical unimodal networks, mimicking the audio and visual pathways in the brain, and employs two types of attention mechanisms: intra-attention (IntraA) within single modalities and inter-attention (InterA) between modalities.


Following a recent AOSS model (Li et al., 2023), the IntraA has two forms of top-down attention, global and local, which aim to enhance the ability of the model to select relevant information in unimodal networks with the guidance of higher-level features. This mechanism is inspired by the massive top-down neural projections discovered in both auditory (Guinan Jr, 2006; Budinger et al., 2008) and visual (Angelucci et al., 2002; Felleman & Van Essen, 1991) pathways. The InterA mechanism aims to select relevant information in one modality with the help of the features in the other unimodal network. It is used at all levels of the unimodal networks. Roughly speaking, the InterA block at the top level (InterA-T) corresponds to higher associative cortical areas such as the frontal cortex and occipital cortex (Raij et al., 2000; Keil et al., 2012; Stein & Stanford, 2008), and at the bottom level (InterA-B) corresponds to the thalamus (Halverson & Freeman, 2006; Cai et al., 2019). The InterA blocks at the middle levels (InterA-M) reflect the direct neural projections between different auditory areas and visual areas, e.g., from the core and belt regions of the auditory cortex to the V1 area (Falchier et al., 2002).


On three AVSS benchmarks, we found that IIANet surpassed the previous SOTA method CTCNet (Li et al., 2022) by a considerable margin, while the model inference time was nearly on par with CTCNet. We also built a fast version of IIANet, termed IIANet-fast. It achieved an inference speed on CPU that was more than twice as fast as CTCNet and still yielded better separation quality on three benchmark datasets.


2. Related work


2.1. Audio-visual speech separation


Incorporating different modalities aligns better with the brain's processing (Calvert, 2001; Bar, 2007). Some methods (Gao & Grauman, 2021; Li et al., 2022; Wu et al., 2019; Lee et al., 2021; Rahimi et al., 2022; Montesinos et al., 2022) have attempted to add visual cues in the AOSS task to improve the clarity of the separated audio, a task called audio-visual speech separation (AVSS). They have demonstrated impressive separation results in more complex acoustic environments, but these methods only focus on audio-visual feature fusion at a single temporal scale in both modalities, e.g., only at the finest (Wu et al., 2019; Li et al., 2022) or coarsest (Gao & Grauman, 2021; Lee et al., 2021) scales, which limits their efficient utilization of different modalities' information. While CTCNet's fusion module (Li et al., 2022) attempts to improve separation quality by utilizing visual and auditory features at different scales, it still focuses on fusion at the finest temporal scales. This unintentionally hinders the effectiveness of using visual information for guidance during the separation process. Moreover, some AVSS models (Lee et al., 2021; Montesinos et al., 2022) employ attention mechanisms within the process of inter-modal fusion, while the significance of intra-modal attention mechanisms is overlooked. This deficiency constrains their separation capabilities, as the global information within the auditory modality plays a crucial role in enhancing the clarity of the separated audio (Li et al., 2023). Notably, the fusion modules of these methods (Lee et al., 2021; Montesinos et al., 2022) tend to use a similarity matrix to compute the attention, which dramatically increases the computational complexity. In contrast, our proposed IIANet uses the sigmoid function and element-wise product for selective attention, which can efficiently integrate audio-visual features at different temporal scales.

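
To make the cost contrast concrete, here is a toy numpy sketch (not the authors' code; the shapes, the shared embedding size, and the omitted projection layers are all simplifying assumptions) comparing similarity-matrix cross-attention, which builds a $T\times T$ score matrix, with the sigmoid-gate-and-multiply style of selective attention that IIANet relies on:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 64, 1000                         # toy embedding dim and temporal frames
audio = rng.standard_normal((N, T))
visual = rng.standard_normal((N, T))    # assume visual already resampled to T frames

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Similarity-matrix cross-attention: builds an O(T^2) score matrix.
scores = visual.T @ audio / np.sqrt(N)              # (T, T) similarity matrix
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)       # softmax over keys
fused_xattn = visual @ weights                      # (N, T)

# Sigmoid-gate selective attention: element-wise product, O(T) in time.
gate = sigmoid(visual)                              # per-element gate in (0, 1)
fused_gated = audio * gate                          # (N, T)

assert fused_xattn.shape == fused_gated.shape == (N, T)
```

The gated variant never materializes the $T\times T$ matrix, which is the source of the complexity gap the paragraph above points at.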

2.2. Attention mechanism in speech separation


Attention is a critical function of the brain (Rensink, 2000; Corbetta & Shulman, 2002), serving as a pivotal mechanism for humans to cope with complex internal and external environments. Our brain has the ability to focus its superior resources to selectively process task-relevant information (Schneider, 2013). It is widely accepted that numerous brain regions are involved in the transmission of attention, forming an intricate neural network of interconnections (Posner & Petersen, 1990; Fukushima, 1986). These regions play distinct roles, enabling humans to selectively focus on processing certain information at necessary times and locations while ignoring other perceivable information. While this ability is known to work well in the presence of sensory or motor stimuli, it also enables our brain to process information in real time even in their absence (Tallon-Baudry, 2012; Koch & Tsuchiya, 2007).


Recently, Kuo et al. (2022) constructed artificial neural networks to simulate the attention mechanism of auditory features, confirming that top-down attention plays a critical role in addressing the cocktail party problem. Several recent AOSS models (Shi et al., 2018; Li et al., 2023; Chen et al., 2023) have utilized top-down attention in designing speech separation models. These approaches were able to reduce computational costs while maintaining audio separation quality. They have also shown excellent performance dealing with complex mixture audio. However, the efficient integration of top-down and lateral attention between modalities in AVSS models remains unclear.


3. The proposed model


Let $\mathbf{A}\in R^{1\times T_{a}}$ and $\mathbf{V}\in R^{H\times W\times T_{v}}$ represent the audio and video streams of an individual speaker, respectively, where $T_{a}$ denotes the audio length; $H,W$, and $T_{v}$ denote the height, width, and the number of lip frames, respectively. We aim to separate the high-quality, clean single-speaker audio $\mathbf{A}$ from the noisy speech $\mathbf{S}\in R^{1\times T_{a}}$ based on the video cues $\mathbf{V}$ and filter out the remaining speech components (other interfering speakers). Specifically, our pipeline consists of four modules: audio encoder, video encoder, separation network and audio decoder (see Figure 1A).


The overall pipeline of the IIANet is summarized as follows. First, we obtain a video containing two speakers from the streaming media, and the lips are extracted from each image frame. Second, we encode the lip frames and noisy speech into lip embeddings $\mathbf{E}_{V}\in R^{N_{v}\times T_{v}}$ and noisy speech embeddings $\mathbf{E}_{S}\in R^{N_{a}\times T_{a}^{\prime}}$ using a video encoder same as CTCNet (Li et al., 2022) and an audio encoder (a 1D convolutional layer), respectively, where $N_{v}$, $N_{a}$ and $T_{a}^{\prime}$ are the visual and audio embedding dimensions and the number of frames for audio embedding, respectively. Third, the separation network takes $\mathbf{E}_{S}$ and $\mathbf{E}_{V}$ as inputs and outputs a soft mask $\mathbf{M}\in R^{N_{a}\times T_{a}^{\prime}}$ for the target speaker. We multiply $\mathbf{E}_{S}$ with $\mathbf{M}$ to estimate the target speaker's speech embedding $\bar{\mathbf{E}}_{S}=\mathbf{E}_{S}\odot\mathbf{M}$, where “$\odot$” denotes the element-wise product. Finally, we utilize the audio decoder (a 1D transposed convolutional layer) to convert $\bar{\mathbf{E}}_{S}$ back to the waveform, generating the target speaker's speech $\mathbf{A}\in R^{1\times T_{a}}$.

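
The mask-and-multiply step of the pipeline can be sketched in a few lines; this is a minimal numpy illustration of $\bar{\mathbf{E}}_{S}=\mathbf{E}_{S}\odot\mathbf{M}$ with random placeholders standing in for the learned encoder output and the ReLU mask (the real encoder and decoder are learned 1D convolutions, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
Na, Ta_prime = 128, 500   # toy audio embedding dimension / number of audio frames

E_S = rng.standard_normal((Na, Ta_prime))                 # noisy speech embedding E_S
M = np.maximum(0.0, rng.standard_normal((Na, Ta_prime)))  # soft mask M (ReLU output, >= 0)

E_S_bar = E_S * M   # element-wise product keeps the target speaker's components

assert E_S_bar.shape == (Na, Ta_prime)
assert np.all(E_S_bar[M == 0] == 0)   # fully suppressed positions stay silent
```

The decoder (a 1D transposed convolution) would then map `E_S_bar` back to a $1\times T_{a}$ waveform.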

3.1. Audio-visual separation network


The core architecture of IIANet is an audio-visual separation network consisting of two types of components: (A) intra-attention (IntraA) blocks, (B) inter-attention (InterA) blocks (distributed at different locations: top (InterA-T), middle (InterA-M) and bottom (InterA-B)). The architecture of the separation network is illustrated in Figure 1B and described as follows:


  1. Bottom-up pass. The audio network and video network take $\mathbf{E}_{S}$ and $\mathbf{E}_{V}$ as inputs and output multi-scale auditory $\{\mathbf{S}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}\mid i=0,1,\ldots,D\}$ and visual $\{\mathbf{V}_{i}\in R^{N_{v}\times\frac{T_{v}}{2^{i}}}\mid i=0,1,\ldots,D\}$ features, where $D$ denotes the total number of convolutional layers with a kernel size of 5 and a stride of 2, each followed by a global normalization layer (GLN) (Luo & Mesgarani, 2019). In Figure 1B, we show an example model with $D=3$.
  2. AV fusion through the InterA-T block. The multi-scale auditory $\mathbf{S}_{i}$ and visual $\mathbf{V}_{i}$ features are fused using the InterA-T block (Figure 2A) at the top of the separation network to obtain the inter-modal global features $\mathbf{S}_{G}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{D}}}$ and $\mathbf{V}_{G}\in R^{N_{v}\times\frac{T_{v}}{2^{D}}}$.
  3. AV fusion in the top-down pass through IntraA and InterA-M blocks. Firstly, $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ are employed to modulate $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$ within each modality using top-down global IntraA blocks, yielding auditory $\bar{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$ and visual $\bar{\mathbf{V}}_{i}\in R^{N_{v}\times\frac{T_{v}}{2^{i}}}$ features. Secondly, the InterA-M block takes the auditory $\bar{\mathbf{S}}_{i}$ and visual $\bar{\mathbf{V}}_{i}$ features at the same temporal scale as inputs, and outputs the modulated auditory features $\tilde{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$. Finally, the top-down pass generates the auditory $\check{\mathbf{S}}_{0}\in R^{N_{a}\times T_{a}^{\prime}}$ and visual $\check{\mathbf{V}}_{0}\in R^{N_{v}\times T_{v}}$ features at the maximum temporal scale using top-down local IntraA blocks.
  4. AV fusion through the InterA-B block. The InterA-B block (Figure 2D) takes the auditory $\check{\mathbf{S}}_{0}$ and visual $\check{\mathbf{V}}_{0}$ features as input and outputs audio-visual features $\bar{\mathbf{E}}_{S,t}\in R^{N_{a}\times T_{a}^{\prime}}$ and $\bar{\mathbf{E}}_{V,t}\in R^{N_{v}\times T_{v}}$, where $t$ denotes the cycle index of audio-visual fusion.
  5. Cycle of audio and video networks. The above steps complete the first cycle of the separation network. For the $t$-th cycle where $t\geq2$, the audio-visual features $\bar{\mathbf{E}}_{S,t-1}$ and $\bar{\mathbf{E}}_{V,t-1}$ obtained in the $(t-1)$-th cycle, instead of $\mathbf{E}_{S}$ and $\mathbf{E}_{V}$, serve as input to the separation network, and we repeat the above steps. After $N_{F}$ cycles, the auditory features $\bar{\mathbf{E}}_{S,t}$ are refined alone for $N_{S}$ cycles in the audio network to reconstruct high-quality audio. Finally, we pass the output of the final cycle through an activation function (ReLU) to yield the separation network's output $\mathbf{M}$, as illustrated in Figure 1A.
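
The bottom-up pass (step 1) can be sketched as repeated stride-2 convolutions that halve the temporal dimension at each level. The toy numpy version below uses a single shared kernel per level and omits the GLN, both simplifications of the actual learned model:

```python
import numpy as np

def conv1d_stride2(x, w, b):
    """Toy 1D conv over the temporal axis: kernel size 5, stride 2, zero padding."""
    N, T = x.shape
    k = w.shape[-1]                       # kernel size (5)
    pad = (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out_T = T // 2                        # stride 2 halves the temporal dimension
    out = np.empty((N, out_T))
    for t in range(out_T):
        seg = xp[:, 2 * t : 2 * t + k]    # stride-2 window
        out[:, t] = (seg * w).sum(axis=1) + b
    return out

rng = np.random.default_rng(0)
Na, Ta, D = 16, 64, 3                     # toy embedding dim, frames, depth
w = rng.standard_normal((Na, 5)) * 0.1
b = 0.0

S = [rng.standard_normal((Na, Ta))]       # S_0 at the finest scale
for i in range(D):                        # bottom-up pass produces S_1 ... S_D
    S.append(conv1d_stride2(S[-1], w, b))

assert [s.shape[1] for s in S] == [64, 32, 16, 8]   # T'_a / 2^i for i = 0..D
```

With $D=3$ the temporal lengths match the $T_{a}^{\prime}/2^{i}$ schedule in step 1.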

Steps 2-4 and the blocks mentioned in these steps are described in detail in what follows.


Step 2: AV fusion through the InterA-T block. In the InterA-T block, we first utilize an average pooling layer with a pooling ratio $2^{D-i}$ in the temporal dimension to down-sample the temporal dimensions of $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$ to $\frac{T_{a}^{\prime}}{2^{D}}$ and $\frac{T_{v}}{2^{D}}$, respectively, and then merge them to obtain global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ using the InterA-T block (Figure 2A). The detailed process of the InterA-T block is described by the equations below:




Figure 1. The overall pipeline of IIANet. (A) IIANet consists of four main components: audio encoder, video encoder, separation network, and audio decoder. The red and blue tildes indicate that the same module is repeated several times. (B) The separation network contains two types of attention blocks: IntraA and InterA (InterA-T, InterA-M, InterA-B) blocks. The dashed lines indicate the use of global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ as top-down attention modulation for multi-scale features $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$. All blocks use different parameters, but the parameters are shared across different cycles.


$$
\begin{aligned}
\boldsymbol{f}_{S} &= \sum_{i=0}^{D-1}p_{i}(\mathbf{S}_{i})+\mathbf{S}_{D} \quad\text{and}\quad \boldsymbol{f}_{V}=\sum_{i=0}^{D-1}p_{i}(\mathbf{V}_{i})+\mathbf{V}_{D},\\
\mathbf{S}_{G} &= \mathrm{FFN}_{S}\left(\boldsymbol{f}_{S}\odot\sigma(\mathcal{Q}(\boldsymbol{f}_{V}))\right),\\
\mathbf{V}_{G} &= \mathrm{FFN}_{V}\left(\boldsymbol{f}_{V}\odot\sigma(\mathcal{Q}(\boldsymbol{f}_{S}))\right),
\end{aligned}
$$

where $\boldsymbol{f}_{S}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{D}}}$ and $\boldsymbol{f}_{V}\in R^{N_{v}\times\frac{T_{v}}{2^{D}}}$ denote the cumulative multi-scale auditory and visual features, $\mathrm{FFN}(\cdot)$ denotes a feed-forward network with three convolutional layers followed by a GLN, $\mathcal{Q}$ denotes a convolutional layer followed by a GLN, $p_{i}(\cdot)$ denotes the average pooling layers and $\sigma$ denotes the sigmoid function. Here, $p_{i}(\cdot)$ uses different pooling ratios for different $i$ such that the temporal dimensions of all $p_{i}(\mathbf{S}_{i})$ are $\frac{T_{a}^{\prime}}{2^{D}}$ and the temporal dimensions of all $p_{i}(\mathbf{V}_{i})$ are $\frac{T_{v}}{2^{D}}$. The hyperparameters of $\mathcal{Q}$ are specified such that the resulting audio and video features have the same embedding dimension. The $\sigma$ function receiving information from one modality acts as selective attention, modulating the other modality. The global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ are used to guide the separation process in the top-down pass.

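
A minimal numpy sketch of the InterA-T computation above, with $\mathcal{Q}$ and $\mathrm{FFN}$ replaced by identities and equal audio/visual embedding dimensions assumed for brevity (both are placeholder choices, not the paper's parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool(x, ratio):
    """Average-pool the temporal axis by an integer ratio (stand-in for p_i)."""
    N, T = x.shape
    return x[:, : T - T % ratio].reshape(N, -1, ratio).mean(axis=2)

rng = np.random.default_rng(0)
Na, Ta, D = 8, 64, 3
S = [rng.standard_normal((Na, Ta // 2**i)) for i in range(D + 1)]   # multi-scale audio
V = [rng.standard_normal((Na, Ta // 2**i)) for i in range(D + 1)]   # toy: N_v == N_a

# Cumulative multi-scale features: pool every level to the coarsest scale T / 2^D.
f_S = sum(avg_pool(S[i], 2**(D - i)) for i in range(D)) + S[D]
f_V = sum(avg_pool(V[i], 2**(D - i)) for i in range(D)) + V[D]

# Cross-modal sigmoid gating (Q and FFN taken as identity here).
S_G = f_S * sigmoid(f_V)
V_G = f_V * sigmoid(f_S)

assert S_G.shape == (Na, Ta // 2**D) and V_G.shape == (Na, Ta // 2**D)
```

Each modality's global feature is thus a gated version of its own pooled pyramid, where the gate comes from the other modality.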

Step 3: AV fusion in the top-down pass through IntraA and InterA-M blocks. Given the multi-scale audio-visual features $\{\mathbf{S}_{i},\mathbf{V}_{i}\}$ and $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ as inputs, we employ the global IntraA blocks $\phi(\mathbf{x},\mathbf{y})$ (Figure 2B) in the top-down pass (the dotted lines in Figure 1B) to extract audio $\bar{\mathbf{S}}_{i}$ and visual features $\bar{\mathbf{V}}_{i}$ at each temporal scale,


$$
\bar{\mathbf{x}}=\phi(\mathbf{x},\mathbf{y})=\sigma\left(\mathcal{Q}\left(\mu\left(\mathbf{y}\right)\right)\right)\odot\mathbf{x}+\mathcal{Q}\left(\mu\left(\mathbf{y}\right)\right)
$$

where $\mu(\cdot)$ denotes interpolation up-sampling. For audio signals, $\bar{\mathbf{S}}_{i}=\phi(\mathbf{S}_{i},\mathbf{S}_{G})$; for video signals, $\bar{\mathbf{V}}_{i}=\phi(\mathbf{V}_{i},\mathbf{V}_{G})$. The hyperparameters of $\mu$ are specified such that the resulting audio and video signals have the same temporal dimension. Since these attention blocks use the global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ to modulate the intermediate features $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$ when reconstructing these features in the top-down process, they are called global IntraA blocks. This is inspired by the global attention mechanism used in an



Figure 2. Flow diagram of IntraA and InterA blocks: (A) InterA-T block, (B) IntraA block, (C) InterA-M block and (D) InterA-B block in the IIANet, where $\odot$ denotes element-wise product and $\sigma$ denotes the sigmoid function.


AOSS model (Li et al., 2023). But in that model, the global feature is simply upsampled, passed through a sigmoid function, and multiplied with the intermediate features $\mathbf{x}$. Mathematically, the process is written as follows


$$
\bar{\mathbf{x}}=\phi^{\prime}(\mathbf{x},\mathbf{y})=\sigma(\mu(\mathbf{y}))\odot\mathbf{x}.
$$

We empirically found that $\phi$ worked better than $\phi^{\prime}$ (see Appendix A).

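
The two variants can be compared in a short numpy sketch; $\mathcal{Q}$ is taken as identity and $\mu$ as nearest-neighbour up-sampling, both placeholder choices rather than the learned components:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample(y, T):
    """Nearest-neighbour temporal interpolation of y to length T (stand-in for mu)."""
    N, Ty = y.shape
    idx = (np.arange(T) * Ty // T).clip(max=Ty - 1)
    return y[:, idx]

def phi(x, y):
    """IIANet's global IntraA block (Q taken as identity for this toy sketch)."""
    g = upsample(y, x.shape[1])
    return sigmoid(g) * x + g            # gated modulation plus a residual path

def phi_prime(x, y):
    """The simpler variant from the earlier AOSS model: gate only, no residual."""
    return sigmoid(upsample(y, x.shape[1])) * x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))         # intermediate features S_i
y = rng.standard_normal((4, 8))          # global features S_G at the coarsest scale

assert phi(x, y).shape == x.shape
assert phi_prime(x, y).shape == x.shape
```

The only structural difference is the additive $\mathcal{Q}(\mu(\mathbf{y}))$ term, which lets the global feature contribute directly rather than only through the gate.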

In contrast to previous AVSS methods (Gao & Grauman, 2021; Li et al., 2022), we investigate which parts of the audio network should be attended to under the guidance of visual features within the same temporal scale. We utilize the visual features $\bar{\mathbf{V}}_{i}$ from the same temporal scale as $\bar{\mathbf{S}}_{i}$ to modulate the auditory features $\bar{\mathbf{S}}_{i}$ using the InterA-M blocks (Figure 2C),


$$
\tilde{\mathbf{S}}_{i}=\sigma(\mathcal{Q}(\mu(\bar{\mathbf{V}}_{i})))\odot\bar{\mathbf{S}}_{i},\quad\forall i\in[0,D],
$$

where $\tilde{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$ denotes the modulated auditory features. Same as before, the hyperparameters of $\mathcal{Q}$ and $\mu$ are specified such that the two terms on the two sides of $\odot$ have the same dimension. In this way, the InterA-M blocks are expected to extract the target speaker's sound features according to the target speaker's visual features at the same scale.

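
A toy numpy rendering of the InterA-M gating, assuming (unlike the real model, where $\mathcal{Q}$ handles the dimension change) that the visual and auditory embedding dimensions already match:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample(y, T):
    """Nearest-neighbour stand-in for mu (with Q taken as identity)."""
    N, Ty = y.shape
    idx = (np.arange(T) * Ty // T).clip(max=Ty - 1)
    return y[:, idx]

rng = np.random.default_rng(0)
Na, Ta_i, Tv_i = 8, 40, 10
S_bar = rng.standard_normal((Na, Ta_i))   # auditory features at scale i
V_bar = rng.standard_normal((Na, Tv_i))   # visual features at the same scale (toy: N_v == N_a)

# InterA-M: visual features gate the auditory ones element-wise.
S_tilde = sigmoid(upsample(V_bar, Ta_i)) * S_bar

assert S_tilde.shape == S_bar.shape
```

Because the gate lies in $(0,1)$, the block can only attenuate auditory positions that disagree with the visual evidence, never amplify them.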

Then, we reconstruct the auditory and visual features separately to obtain the features $(\check{\mathbf{S}}_{0},\check{\mathbf{V}}_{0})$ in a top-down pass (Figure 1B). We leverage the local top-down attention mechanism (Li et al., 2023), and the corresponding blocks are called local IntraA blocks, which are able to fuse as many multi-scale features as possible while reducing information redundancy. We start from the $(D-1)$-th layer. The attention signals are $\tilde{\mathbf{S}}_{D}$ and $\bar{\mathbf{V}}_{D}$. Therefore,

接着,我们分别重建听觉和视觉特征,通过自上而下的传递获得特征 $(\check{\mathbf{S}}_{0},\check{\mathbf{V}}_{0})$ (图 1B)。我们采用局部自上而下的注意力机制 (Li et al., 2023),对应的模块称为局部 IntraA 块,能够在减少信息冗余的同时融合尽可能多的多尺度特征。我们从第 $(D-1)$ 层开始,注意力信号为 $\tilde{\mathbf{S}}_{D}$ 和 $\bar{\mathbf{V}}_{D}$。因此,

$$
\check{\mathbf{S}}_{D-1}=\phi(\tilde{\mathbf{S}}_{D-1},\tilde{\mathbf{S}}_{D}),\quad\check{\mathbf{V}}_{D-1}=\phi(\bar{\mathbf{V}}_{D-1},\bar{\mathbf{V}}_{D}).
$$

For $i=D-2,\ldots,0$, the attention signals are $\check{\mathbf{S}}_{i}$ and $\check{\mathbf{V}}_{i}$, therefore

对于 $i=D-2,\ldots,0$,注意力信号为 $\check{\mathbf{S}}_{i}$ 和 $\check{\mathbf{V}}_{i}$,因此

$$
\check{\mathbf{S}}_{i}=\phi(\tilde{\mathbf{S}}_{i},\check{\mathbf{S}}_{i+1}),\quad\check{\mathbf{V}}_{i}=\phi(\bar{\mathbf{V}}_{i},\check{\mathbf{V}}_{i+1}).
$$

Please note that different IntraA blocks within one cycle, as depicted in Figure 1B, use different parameters.

请注意,同一周期内的不同 IntraA 块(如图 1B 所示)使用不同的参数。
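The top-down recursion above can be sketched as follows. The fusion $\phi$ is assumed here to be a sigmoid gating of the finer-scale features by the upsampled attention signal; the paper's exact $\phi$ from eqns. (2) and (3) is not reproduced, so treat this as a structural sketch only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def upsample2(seq):
    # Nearest-neighbor 2x upsampling from temporal scale i+1 to scale i.
    return [v for v in seq for _ in range(2)]

def phi(x, y):
    # Assumed IntraA fusion: gate x with the upsampled attention signal y.
    return [xi * sigmoid(gi) for xi, gi in zip(x, upsample2(y))]

def top_down(s_tilde):
    # s_tilde[i] holds the features at scale i (length halves per scale):
    # S_check[D] = S_tilde[D]; S_check[i] = phi(S_tilde[i], S_check[i+1]).
    check = s_tilde[-1]
    for i in range(len(s_tilde) - 2, -1, -1):
        check = phi(s_tilde[i], check)
    return check
```

Each scale is visited once, so the pass recovers the finest temporal resolution while every coarser scale contributes through its attention signal.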

Step 4: AV fusion through the InterA-B block. In CTCNet (Li et al., 2022), the global audio-visual features fused all auditory and visual multi-scale features using the concatenation strategy, leading to high computational complexity. Here, we only perform inter-modal fusion on the finest-grained auditory and visual features ($\check{\mathbf{S}}_{0}$ and $\check{\mathbf{V}}_{0}$) using the attention strategy, greatly reducing computational complexity.

步骤4:通过InterA-B块进行视听融合。在CTCNet (Li et al., 2022) 中,全局视听特征通过拼接策略融合了所有听觉和视觉多尺度特征,导致计算复杂度较高。本文仅对最精细粒度的听觉特征 ($\check{\mathbf{S}}_{0}$) 和视觉特征 ($\check{\mathbf{V}}_{0}$) 采用注意力策略进行跨模态融合,显著降低了计算复杂度。

Specifically, we reduce the impact of redundant features by incorporating the InterA-B block (Figure 2D) between the finest-grained auditory and visual features $\check{\mathbf{S}}_{0}$ and $\check{\mathbf{V}}_{0}$. In particular, the process of global audio-visual fusion is as follows:

具体来说,我们通过在最细粒度的听觉特征 $\check{\mathbf{S}}_{0}$ 和视觉特征 $\check{\mathbf{V}}_{0}$ 之间加入InterA-B模块(图2D)来降低冗余特征的影响。全局视听融合的具体过程如下:

$$
\begin{aligned}
\bar{\mathbf{E}}_{S,t}&=\check{\mathbf{S}}_{0}+\mathcal{Q}\left(\mu(\check{\mathbf{V}}_{0})\odot\sigma\left(\mathcal{Q}\left(\check{\mathbf{S}}_{0}\right)\right)\right),\\
\bar{\mathbf{E}}_{V,t}&=\check{\mathbf{V}}_{0}+\mathcal{Q}\left(\mu(\check{\mathbf{S}}_{0})\odot\sigma\left(\mathcal{Q}\left(\check{\mathbf{V}}_{0}\right)\right)\right).
\end{aligned}
$$

Same as in (1), the hyperparameters of $\mathcal{Q}$ and $\mu$ are specified such that the element-wise product and addition in (5) function correctly. The working process of this block is simple: the features of one modality are modulated by the features of the other modality through an attention function $\sigma$, and the results are then added to the original features. Several convolutions and up-samplings are interleaved in the process to ensure that the dimensions of the results are appropriate. We keep the original features in the final results by addition because we do not want the final features to be altered too much by the other modality's information.

与 (1) 相同,$\mathcal{Q}$ 和 $\mu$ 的超参数经过设定,确保 (5) 中的逐元素乘积和加法运算正确执行。该模块的工作流程较为简单:一种模态特征通过注意力函数 $\sigma$ 受另一模态特征调制,随后结果被叠加至原始特征上。过程中穿插了若干卷积和上采样操作以保证结果维度适配。我们通过加法保留原始特征于最终结果中,以避免其他模态信息对最终特征造成过度干扰。
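A minimal sketch of the residual gate-and-add structure of the InterA-B block, with the convolutions $\mathcal{Q}$ and the alignment $\mu$ collapsed to identities purely for illustration (the real blocks interleave convolutions and up-samplings as described above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def intera_b(s0, v0_aligned):
    """Residual cross-modal fusion at the finest scale (sketch of eq. (5)):
    E_S = S0 + V0 * sigmoid(S0) and symmetrically for E_V, assuming the
    two feature sequences are already aligned and equally sized."""
    e_s = [s + v * sigmoid(s) for s, v in zip(s0, v0_aligned)]
    e_v = [v + s * sigmoid(v) for s, v in zip(s0, v0_aligned)]
    return e_s, e_v
```

When the other modality contributes zero at a position, the residual connection leaves the original feature unchanged, matching the stated design intent of not letting one modality overwrite the other.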

Finally, the output auditory and visual features $\bar{\mathbf{E}}_{S,t}$ and $\bar{\mathbf{E}}_{V,t}$ serve as inputs for the subsequent audio network and video network.

最后,输出的听觉和视觉特征 $\bar{\mathbf{E}}_{S,t}$ 和 $\bar{\mathbf{E}}_{V,t}$ 将作为后续音频网络和视频网络的输入。

4. Experiments

4. 实验

4.1. Datasets

4.1. 数据集

Consistent with previous research, we experimented on three commonly used AVSS datasets: LRS2 (Afouras et al., 2018a), LRS3 (Afouras et al., 2018b), and VoxCeleb2 (Chung et al., 2018). Due to the fact that the LRS2 and VoxCeleb2 datasets were collected from YouTube, they present a more complex acoustic environment compared to the LRS3 dataset. Each audio was 2 seconds in length with a sample rate of $16\mathrm{kHz}$ . The mixture was created by randomly selecting two different speakers from datasets and mixing their speeches with signal-to-noise ratios between -5 dB and 5 dB. Lip frames were synchronized with audio, having a frame rate of 25 FPS and a size of $88\times88$ grayscale images. We used the same dataset split consistent with previous works (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021). See Appendix B for details.

与先前研究一致,我们在三个常用的视听语音分离(AVSS)数据集上进行了实验:LRS2 (Afouras等人, 2018a)、LRS3 (Afouras等人, 2018b)和VoxCeleb2 (Chung等人, 2018)。由于LRS2和VoxCeleb2数据集采集自YouTube,其声学环境比LRS3数据集更为复杂。每段音频时长为2秒,采样率为$16\mathrm{kHz}$。混合音频是通过从数据集中随机选择两名不同说话者,将其语音以-5 dB至5 dB的信噪比混合而成。唇部帧与音频同步,帧率为25 FPS,尺寸为$88\times88$的灰度图像。我们采用了与先前工作(Li等人, 2022; Gao & Grauman, 2021; Lee等人, 2021)相同的数据集划分方式。详见附录B。
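The mixture construction described above amounts to scaling the interfering speech so that a target signal-to-noise ratio drawn from [-5, 5] dB holds before summation. A sketch (the function name is hypothetical, not from the released code):

```python
import math
import random

def mix_at_snr(target, interferer, snr_db):
    """Scale the interfering speech so that the target-to-interferer
    power ratio equals snr_db, then sum the two signals."""
    pt = sum(x * x for x in target)
    pi = sum(x * x for x in interferer)
    scale = math.sqrt(pt / (pi * 10 ** (snr_db / 10.0)))
    return [t + scale * n for t, n in zip(target, interferer)]

# One mixture would draw its SNR uniformly, e.g.:
snr_db = random.uniform(-5.0, 5.0)
```

At 0 dB the scaled interferer ends up with exactly the target's power, which is a convenient sanity check.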

4.2. Model configurations and evaluation metrics

4.2. 模型配置与评估指标

The IIANet model was implemented using PyTorch (Paszke et al., 2019), and optimized using the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001. The learning rate was cut by 50% whenever there was no improvement in the best validation loss for 15 consecutive epochs. The training was halted if the best validation loss failed to improve over 30 successive epochs. Gradient clipping was used during training to avert gradient explosion, setting the maximum $L_2$ norm to 5. We used the SI-SNR objective function to train IIANet. See Appendix C for details. All experiments were conducted with a batch size of 6, utilizing 8 GeForce RTX 4090 GPUs. The hyperparameters are included in Appendix D.

IIANet模型基于PyTorch框架 (Paszke et al., 2019) 实现,并使用Adam优化器 (Kingma & Ba, 2014) 进行优化,初始学习率为0.001。当最佳验证损失连续15个epoch没有改善时,学习率会降低50%。若最佳验证损失连续30个epoch未提升,则终止训练。训练过程中采用梯度裁剪防止梯度爆炸,最大$L_2$范数设为5。我们使用SI-SNR目标函数训练IIANet,详见附录C。所有实验批大小设为6,使用8块GeForce RTX 4090 GPU完成。超参数配置见附录D。
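The schedule above (halve the learning rate after 15 epochs without a better validation loss, stop after 30, clip gradients to an L2 norm of 5) can be replayed with a small helper. The exact reset semantics of the real scheduler may differ; this is a sketch of the stated rules only:

```python
def train_schedule(val_losses, lr0=1e-3, patience_lr=15, patience_stop=30):
    """Replay the schedule: halve the LR after every 15 epochs without a
    new best validation loss; stop once 30 such epochs accumulate."""
    lr, best, since_best, lrs = lr0, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best % patience_lr == 0:
                lr *= 0.5
        if since_best >= patience_stop:
            break  # early stopping
        lrs.append(lr)
    return lr, lrs

def clip_l2(grads, max_norm=5.0):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm:
        return grads
    return [g * max_norm / norm for g in grads]
```

For example, one good epoch followed by 15 flat epochs triggers exactly one halving, and 30 flat epochs trigger the early stop.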

Same as previous research (Gao & Grauman, 2021; Li et al., 2022; Afouras et al., 2018a), SDRi (Vincent et al., 2006b) and SI-SNRi (Le Roux et al., 2019) were used as metrics for speech separation. See Appendix E for details. To simulate human auditory perception and to assess and quantify the quality of speech signals, we also used PESQ (Union, 2007) as an evaluation metric, which measures the clarity and intelligibility of speech signals, ensuring

与先前研究 (Gao & Grauman, 2021; Li et al., 2022; Afouras et al., 2018a) 相同,我们采用 SDRi (Vincent et al., 2006b) 和 SI-SNRi (Le Roux et al., 2019) 作为语音分离的评估指标。详见附录 E。为模拟人类听觉感知并量化评估语音信号质量,我们还使用了 PESQ (Union, 2007) 作为衡量语音信号清晰度与可懂度的评估指标,确保

the reliability of the results.

结果的可靠性。

4.3. Comparisons with state-of-the-art methods

4.3. 与先进方法的对比

To explore the efficiency of the proposed approach, we designed a faster version, namely IIANet-fast, which runs half as many $N_{S}$ cycles as IIANet (see Appendix D for detailed parameters). We compared IIANet and IIANet-fast with existing AVSS methods. To conduct a fair comparison, we obtained metric values from the original papers or reproduced the results using models officially released by the authors.

为探究所提方法的效率,我们设计了更快版本 IIANet-fast,其运行周期数为 IIANet 的一半(详见附录 D 的参数说明)。我们将 IIANet 和 IIANet-fast 与现有 AVSS 方法进行对比。为确保公平性,所有指标均采用原论文报告值或通过作者官方发布模型复现得出。

Separation quality. As demonstrated in Table 1, IIANet consistently outperformed existing methods across all datasets, with at least a 0.9 dB gain in SI-SNRi. In particular, IIANet excelled in complex environments, notably on the LRS2 and VoxCeleb2 datasets, outperforming previous AVSS methods by a minimum of 1.7 dB SI-SNRi in separation quality. This significant improvement highlights the importance of IIANet's hierarchical audio-visual integration. In addition, IIANet-fast also achieved excellent results, surpassing CTCNet in separation quality with an improvement of 1.1 dB in SI-SNRi on the challenging LRS2 dataset.

分离质量。如表 1 所示,IIANet 在所有数据集上始终优于现有方法,SI-SNRi 至少提高了 0.9 dB。特别是在复杂环境中,IIANet 表现尤为突出,在 LRS2 和 VoxCeleb2 数据集上的分离质量比 AVSS 方法至少高出 1.7 dB SI-SNRi。这一显著改进凸显了 IIANet 分层视听集成的重要性。此外,IIANet-fast 也取得了优异的结果,在具有挑战性的 LRS2 数据集上,其分离质量超越了 CTCNet,SI-SNRi 提高了 1.1 dB。

Model size and computational cost. In practical applications, model size and computational cost are often crucial considerations. We employed three hardware-agnostic metrics to circumvent disparities between different hardware: the number of parameters (Params), Multiply-Accumulate operations (MACs), and GPU memory usage during inference. We also selected two hardware-related metrics to measure inference speed on specific GPU and CPU hardware. The results are shown in Table 2. Please note that some of the AVSS methods (Afouras et al., 2018; Afouras et al., 2020; Lee et al., 2021) in Table 1 are not included in Table 2 because they are not open sourced.

模型规模与计算成本。在实际应用中,模型规模和计算成本往往是关键考量因素。我们采用三种与硬件无关的指标来规避不同硬件间的差异,包括参数量(Params)、乘加运算量(MACs)3以及推理时的GPU内存占用。同时选取两项硬件相关指标来衡量特定GPU和CPU硬件上的推理速度,结果如表2所示。需注意表1中部分AVSS方法(Alfouras等人,2018;Afouras等人,2020;Lee等人,2021)未包含在表2中,因其未开源。

Compared to the previous SOTA method CTCNet, IIANet can achieve significantly higher separation quality using 11% of the MACs, 44% of the parameters and 16% of the GPU memory. IIANet's inference on CPU was as fast as CTCNet's, but was slower on GPU. With better separation quality than CTCNet, IIANet-fast had much lower computational cost (only 7% of CTCNet's MACs) and substantially higher inference speed (40% of CTCNet's time on CPU and 84% of CTCNet's time on GPU). This not only showcases the efficiency of IIANet-fast but also highlights its aptness for deployment in environments with limited hardware resources and strict real-time requirements.

与之前的SOTA方法CTCNet相比,IIANet仅需11%的MACs运算量、44%的参数量和16%的GPU显存即可实现显著更高的分离质量。IIANet在CPU上的推理速度与CTCNet相当,但在GPU上较慢。IIANet-fast在保持优于CTCNet分离质量的同时,计算成本大幅降低(仅需CTCNet的7% MACs运算量),推理速度显著提升(CPU耗时仅为CTCNet的40%,GPU耗时仅为CTCNet的84%)。这不仅证明了IIANet-fast的高效性,更凸显了其在硬件资源有限且需要严格实时性的部署环境中的适用优势。

In conclusion, compared with previous AVSS methods, IIANet achieved a good trade-off between separation quality

综上所述,与以往的AVSS方法相比,IIANet在分离质量上取得了良好的平衡

Submission and Formatting Instructions for ICML 2024

ICML 2024 投稿与格式要求

| 方法 | LRS2 SI-SNRi | LRS2 SDRi | LRS2 PESQ | LRS3 SI-SNRi | LRS3 SDRi | LRS3 PESQ | VoxCeleb2 SI-SNRi | VoxCeleb2 SDRi | VoxCeleb2 PESQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AVConvTasNet (Wu 等, 2019) | 12.5 | 12.8 | 2.69 | 11.2 | 11.7 | 2.58 | 9.2 | 9.8 | 2.17 |
| LWTNet (Afouras 等, 2020) | 10.8 | - | - | - | - | - | 4.8 | - | - |
| VisualVoice (Gao & Grauman, 2021) | 11.5 | 11.8 | 2.78 | 9.9 | 10.3 | 2.13 | 9.3 | 10.2 | 2.45 |
| CaffNet-C (Lee 等, 2021) | 10.0 | - | 1.15 | 9.8 | - | - | 7.6 | - | - |
| AVLiT-8 (Martel 等, 2023) | 12.8 | 13.1 | 2.56 | 13.5 | 13.6 | 2.78 | 9.4 | 9.9 | 2.23 |
| CTCNet (Li 等, 2022) | 14.3 | 14.6 | 3.08 | 17.4 | 17.5 | 3.24 | 11.9 | 13.1 | 3.00 |
| IIANet (本文) | 16.0 | 16.2 | 3.23 | 18.3 | 18.5 | 3.28 | 13.6 | 14.3 | 3.12 |
| IIANet-fast (本文) | 15.1 | 15.4 | 3.11 | 17.8 | 17.9 | 3.25 | 12.6 | 13.6 | 3.01 |

Table 1. Separation results of different AVSS methods on the LRS2, LRS3, and VoxCeleb2 datasets. These metrics represent the average values over all speakers in each test set; larger SI-SNRi, SDRi and PESQ values are better. "-" denotes results not reported in the original paper. Optimal and suboptimal performances are highlighted.

表 1. 不同视听语音分离(AVSS)方法在LRS2、LRS3和VoxCeleb2数据集上的分离结果。这些指标代表每个测试集中所有说话者的平均值,其中SI-SNRi、SDRi和PESQ值越大越好。"-"表示原论文未报告的结果。最优和次优性能已高亮显示。

| 方法 | MACs (G) | Params (M) | CPU (s) | GPU (ms) |
| --- | --- | --- | --- | --- |
| AVConvTasNet (Wu等, 2019) | 23.8 | 16.5 | 1.06 | 62.51 |
| VisualVoice (Gao和Grauman, 2021) | 9.7 | 77.8 | 2.98 | 110.31 |
| AVLiT (Martel等, 2023) | 18.2 | 5.8 | 0.46 | 62.51 |
| CTCNet (Li等, 2022) | 167.1 | 7.0 | 1.26 | 84.17 |
| IIANet (本文) | 18.6 | 3.1 | 1.27 | 110.11 |
| IIANet-fast (本文) | 11.9 | 3.1 | 0.52 | 70.94 |

Table 2. Parameter sizes and computational complexities of different AVSS methods. All results are measured with an input audio length of 1 second, a sample rate of $16\mathrm{kHz}$ , and a video frame sample rate of $25\mathrm{FPS}$ . Inference time is measured in the same testing environment without the use of any additional acceleration techniques, such as quantization or pruning.

表 2: 不同AVSS方法的参数量与计算复杂度对比。所有测试结果均基于1秒输入音频(采样率$16\mathrm{kHz}$)和视频帧采样率$25\mathrm{FPS}$的基准条件。推理时间测试环境统一,未使用量化(quantization)或剪枝(pruning)等加速技术。

and computational cost.

以及计算成本。

4.4. Multi-speaker performance

4.4. 多说话人性能

We present results on the LRS2 dataset with 3 and 4 speakers. We created two variants of the dataset for these purposes, named LRS2-3Mix and LRS2-4Mix, both used in our model's training and testing. Correspondingly, the default LRS2 dataset is henceforth called LRS2-2Mix. Throughout the paper, unless otherwise specified, the LRS2 dataset refers to the default LRS2-2Mix dataset. For more details on dataset construction, training processes, and testing procedures, please refer to Appendix G.

我们在LRS2数据集上展示了3人和4人说话者的实验结果。为此我们创建了两个数据集变体:LRS2-3Mix和LRS2-4Mix,均用于模型的训练和测试。相应地,默认的LRS2数据集被重新命名为LRS2-2Mix。在本文中,除非特别说明,LRS2数据集均指默认的LRS2-2Mix数据集。关于数据集构建、训练过程和测试流程的更多细节,请参阅附录G。

We use the following baseline methods for comparison: AVConvTasNet (Wu et al., 2019), AVLIT (Martel et al., 2023), and CTCNet (Li et al., 2022). The results are presented in Table 3. Each table column shows a different dataset, where the number of speakers in the mixed signal $C$ varies. The models used for evaluating each dataset were specifically trained for separating a corresponding number of speakers. It is evident from the results that the proposed model significantly outperformed the previous methods across all three datasets.

我们使用以下基线方法进行比较:AVConvTasNet (Wu等人,2019)、AVLIT (Martel等人,2023) 和 CTCNet (Li等人,2022)。结果如 表 3 所示。每列显示不同数据集,其中混合信号中的说话者数量 $C$ 各不相同。用于评估每个数据集的模型都专门针对相应数量的说话者进行了分离训练。从结果中可以明显看出,所提出的模型在所有三个数据集上都显著优于之前的方法。

4.5. Ablation study

4.5. 消融实验

To better understand IIANet, we compared each key component using the LRS2 dataset in a completely fair setup, i.e., using the same architecture and hyperparameters in the following experiments.

为了更好地理解IIANet,我们在完全公平的设置下使用LRS2数据集对每个关键组件进行了比较,即在以下实验中使用相同的架构和超参数。

Importance of IntraA and InterA. We investigated the role of IntraA and InterA in model performance. To this end, we constructed a control model (called Control 1) obtained by removing the IntraA and InterA blocks from IIANet. Basically, it uses two upside-down UNet (Ronneberger et al., 2015) architectures as the visual and auditory backbones and fuses visual and auditory features at the finest temporal scale (Figure 4A in Appendix). We then added IntraA blocks to Control 1 to obtain a new model, called Control 2 (Figure 4B in Appendix). The detailed descriptions of Controls 1 and 2 can be found in Appendix H, and the training and inference procedures are the same as IIANet's. As shown in Table 4, Control 1 obtained poor results, and adding IntraA blocks improved the results. This improvement validated the effectiveness of the IntraA blocks in the AVSS task, though they were originally proposed for AOSS (Li et al., 2023), because the IntraA module obtains intra-modal contextual features. Adding InterA blocks to Control 2 yielded the proposed IIANet, which obtained a substantial separation quality improvement of 2.4 dB and 2.2 dB in the SI-SNRi and SDRi metrics, respectively.

IntraA与InterA的重要性。我们研究了IntraA和InterA在模型性能中的作用。为此,我们构建了一个控制模型(称为Control 1),该模型通过从IIANet中移除IntraA和InterA块得到。基本上,它使用两个倒置的UNet (Ronneberger et al., 2015) 架构作为视觉和听觉主干,并在最精细的时间尺度上融合视觉和听觉特征(附录中的图4A)。然后,我们将IntraA块添加到Control 1中,得到一个新模型,称为Control 2(附录中的图4B)。Control 1和Control 2的详细描述可在附录H中找到,训练和推理过程与IIANet相同。如表4所示,Control 1的结果较差,而添加IntraA块后结果有所改善。这一改进验证了IntraA块在AVSS任务中的有效性,尽管它最初是为AOSS提出的 (Li et al., 2023),因为IntraA模块获取了模态内上下文特征。将InterA块添加到Control 2中得到了所提出的IIANet,其在SI-SNRi和SDRi指标上分别实现了$2.4:\mathrm{dB}$和$2.2\mathrm{dB}$的显著分离质量提升。这证明了InterA模块的重要性。

Table 3. Performance of different models varies with the number of speakers. These results were based on training conducted using the code from published baseline methods.

| 方法 | LRS2-2Mix SI-SNRi | LRS2-2Mix SDRi | LRS2-3Mix SI-SNRi |
| --- | --- | --- | --- |
| AVConvTasNet (Wu et al., 2019) | 12.5 | 12.8 | 8.2 |
| AVLIT-8 (Martel et al., 2023) | 12.8 | 13.1 | 9.4 |
| CTCNet (Li et al., 2022) | 14.3 | 14.6 | 10.3 |
| IIANet (ours) | 16.0 | 16.2 | 12.6 |

表 3: 不同模型性能随说话者数量变化情况。这些结果基于已发表基线方法的代码训练得出。

This proves that the InterA module enables the visual information to fully exploit and extract the relevant auditory features. Taken together, the results show the effectiveness of the IntraA and InterA blocks.

该机制使视觉信息能够充分利用并提取相关的听觉特征。综合来看,结果证明了IntraA和InterA模块的有效性。

| 模型 | SI-SNRi/SDRi | Params (M)/MACs (G) |
| --- | --- | --- |
| Control 1 | 13.1/13.4 | 2.2/16.7 |
| Control 2 | 13.6/14.0 | 2.2/18.1 |
| IIANet | 16.0/16.2 | 3.1/18.6 |

Table 4. Importance of IntraA blocks and InterA blocks on the LRS2 test set. MACs are measured with 1s input audio sampled at $16\mathrm{kHz}$ .

表 4: IntraA模块和InterA模块在LRS2测试集上的重要性。MACs以16kHz采样的1秒输入音频为测量基准。

Table 5. Impact of each InterA block in IIANet on the LRS2 test set.

表 5. IIANet中每个InterA模块在LRS2测试集上的影响。

| InterA-T | InterA-M | InterA-B | SI-SNRi/SDRi |
| --- | --- | --- | --- |
| ✓ | × | × | 13.8/14.1 |
| × | ✓ | × | 14.5/14.7 |
| × | × | ✓ | 14.1/14.4 |
| ✓ | ✓ | × | 14.8/15.1 |
| ✓ | × | ✓ | 15.0/15.2 |
| × | ✓ | ✓ | 15.5/15.7 |
| ✓ | ✓ | ✓ | 16.0/16.2 |

Importance of different InterA blocks. Table 5 reports the performances of different combinations of the three InterA blocks: InterA-T, InterA-M and InterA-B. Please note that excluding the InterA-T block means directly using $\sum_{i=0}^{D-1}p(\mathbf{S}_{i})+\mathbf{S}_{D}$ and $\sum_{i=0}^{D-1}p(\mathbf{V}_{i})+\mathbf{V}_{D}$ as input to the auditory and visual MLPs, respectively, with the cross-modal attention signal removed. Excluding the InterA-M block means removing the visual fusion branches at each temporal scale. Excluding the InterA-B block means not performing fine-grained audio-visual fusion.

不同InterA模块的重要性。表5报告了三种InterA模块(InterA-T、InterA-M和InterA-B)不同组合的性能表现。需注意的是,排除InterA-T模块意味着直接使用 $\sum_{i=0}^{D-1}p(\mathbf{S}_{i})+\mathbf{S}_{D}$ 和 $\sum_{i=0}^{D-1}p(\mathbf{V}_{i})+\mathbf{V}_{D}$ 分别作为听觉和视觉MLP的输入,同时移除跨模态注意力信号;排除InterA-M模块意味着移除每个时间尺度上的视觉融合分支;排除InterA-B模块意味着不执行细粒度的视听融合。

When only one type of attention blocks is enabled in the inter-modal settings, the InterA-M block results in the highest separation quality. When two types of attention blocks are enabled, the combination of InterA-M and InterA-B yielded the best results. The combination of all three types of blocks performed the best.

当跨模态设置中仅启用一种注意力块时,InterA-M块能带来最高的分离质量。当启用两种注意力块时,InterA-M与InterA-B的组合效果最佳。而同时启用三种注意力块的组合表现最优。

More analyses. We also performed additional analyses of IIANet. First, we found that networks that include visual information significantly improve speech separation performance compared to networks that rely only on audio information (see Appendix F for details). Then, we improved the separation quality by gradually increasing the number of fusions from 1 to 5 (Appendix I). Meanwhile, we investigated the separation quality of three different pre-trained video encoder models on the LRS2-2Mix dataset (Appendix J). In addition, we explored the relationship between the loss of visual cues (especially the speaker's side profile) and the separation quality (Appendix K). The samples from the side view angle showed a significant decrease in separation quality compared to the frontal view; compared to other AVSS methods, IIANet had the smallest decrease, of about 4%. Finally, we applied a dynamic mixing data augmentation scheme and obtained even better results (Appendix L). Unless specifically stated, experiments did not utilize dynamic mixing data augmentation.

更多分析。我们还对IIANet进行了额外分析。首先,我们发现包含视觉信息的网络相比仅依赖音频信息的网络能显著提升语音分离性能(详见附录F)。接着,我们通过逐步增加融合次数(1至5次)来提升分离质量(附录I)。同时,我们在LRS2-2Mix数据集上研究了三种不同预训练视频编码器模型的分离质量(附录J)。此外,我们探究了视觉线索(尤其是说话者侧脸)缺失与分离质量的关系(附录K)。侧视角样本相比正脸视角的分离质量出现显著下降。与其他AVSS方法相比,IIANet的下降幅度最小,约为$4%$。最后,我们采用动态混合数据增强方案并获得了更优结果(附录L)。除非特别说明,实验均未使用动态混合数据增强。

4.6. Qualitative evaluation

4.6. 定性评估

We visually compare the speech separation results of AVConvTasNet, VisualVoice, CTCNet, and our IIANet, all trained on the LRS2 dataset. We use the spectrograms of the separated audios to demonstrate the quality of the separated signals, as shown in Figure 6 in the Appendix. In the left part, we observe that IIANet achieves better reconstruction results. See Appendix M for details.

我们直观比较了在LRS2数据集上训练的AVConvTasNet、Visual voice、CTCNet以及我们的IIANet的语音分离效果。通过分离音频的频谱图展示信号分离质量 (如附录中的图6所示) 。左侧结果显示IIANet实现了更优的重建效果,详见附录M。

Besides, we evaluated different AVSS methods (VisualVoice, CTCNet) in real-world scenarios by collecting multi-speaker videos on YouTube. Some typical results are presented on the website4. By listening to these results, one can confirm that IIANet generated higher-quality separated audio than other separation models.

此外,我们通过在YouTube上收集多人说话视频,在真实场景中评估了不同视听语音分离方法(Visual Voice、CTCNet)。部分典型结果已展示在website4上。通过试听这些结果,可以确认IIANet生成的分离音频质量优于其他分离模型。

5. Conclusion and discussion

5. 结论与讨论

We proposed a brain-inspired AVSS model, which is characterized by the extensive use of intra-attention and inter-attention in the audio and video networks. Compared with the previous SOTA method CTCNet on three benchmark datasets, IIANet achieved significantly higher separation quality with 11% of the MACs, 44% of the parameters and 16% of the GPU memory. By reducing the number of cycles, IIANet-fast ran much faster than CTCNet yet still obtained better separation results.

我们提出了一种受大脑启发的视听语音分离(AVSS)模型,其特点是在音频和视频网络中广泛使用内部注意力(intra-attention)和交互注意力(inter-attention)。与之前的最先进(SOTA)方法CTCNet相比,IIANet在三个基准数据集上实现了显著更高的分离质量,同时仅需11%的乘加运算量(MACs)、44%的参数和16%的GPU显存。通过减少循环次数,IIANet运行速度比CTCNet快得多,同时仍能获得更好的分离效果。

6. Impact statements

6. 影响声明

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

本论文的研究旨在推动机器学习领域的发展。我们的工作可能带来诸多潜在的社会影响,但在此无需特别强调。

References

参考文献

Raij, T., Uutela, K., and Hari, R. Audiovisual integration of letters in the human brain. Neuron, 28(2):617–625, 2000.

Raij, T., Uutela, K., 和 Hari, R. 人脑对字母的视听整合. Neuron, 28(2):617–625, 2000.

Rensink, R. A. The dynamic representation of scenes. Visual cognition, 7(1-3):17–42, 2000.

Rensink, R. A. 场景的动态表征. Visual cognition, 7(1-3):17–42, 2000.

A. Evaluation on different global IntraA implementations

A. 不同全局IntraA实现的评估

In this experiment, we investigated the performance of two implementations of the global IntraA blocks in IIANet: $\phi(\mathbf{x},\mathbf{y})$ and $\phi^{\prime}(\mathbf{x},\mathbf{y})$ (see eqns. (2) and (3)). The results in Table 6 indicate that global IntraA blocks with $\phi(\mathbf{x},\mathbf{y})$ significantly improve separation quality without notably increasing computational complexity. This finding underscores the importance of global IntraA with $\phi(\mathbf{x},\mathbf{y})$ in effectively and accurately handling intra-modal contextual features, thereby substantially improving performance.

在本实验中,我们研究了IIANet中两种全局IntraA块的实现性能:$\phi(\mathbf{x},\mathbf{y})$ 和 $\phi^{\prime}(\mathbf{x},\mathbf{y})$ (参见公式(2)和(3))。表6结果显示,采用 $\phi(\mathbf{x},\mathbf{y})$ 的全局IntraA块在未显著增加计算复杂度的前提下,显著提升了分离质量。这一发现印证了全局IntraA结合 $\phi(\mathbf{x},\mathbf{y})$ 对于高效精准处理模态内上下文特征的重要性,从而大幅提升模型性能。

| 实现方式 | SI-SNRi | SDRi | 参数量 (M) | 计算量 (G) |
| --- | --- | --- | --- | --- |
| $\phi(\mathbf{x},\mathbf{y})$ | 16.0 | 16.2 | 3.1 | 18.64 |
| $\phi^{\prime}(\mathbf{x},\mathbf{y})$ | 14.7 | 15.0 | 3.1 | 17.53 |

Table 6. Comparison of separation quality and efficiency among different global IntraA implementations.

表 6: 不同全局IntraA实现的分离质量与效率对比

B. Dataset details

B. 数据集详情

Lip Reading Sentences 2 (LRS2) (Afouras et al., 2018a). The LRS2 dataset consists of thousands of BBC video clips, divided into Train, Validation, and Test folders. We used the same dataset as previous works (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021), created by randomly selecting two different speakers from LRS2 and mixing their speeches with signal-to-noise ratios between -5 dB and 5 dB. Since the LRS2 data contains reverberation and noise, and the overlap rate is not 100%, the dataset is closer to real-world scenarios. We use the same data split containing 11-hour training, 3-hour validation, and 1.5-hour test sets.

唇读语句2 (LRS2) (Afouras et al., 2018a)。LRS2数据集包含数千个BBC视频片段,分为训练集、验证集和测试集。我们采用与先前研究 (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021) 相同的数据构建方式:从LRS2中随机选取两名不同说话者,将其语音以-5 dB至5 dB的信噪比混合。由于LRS2数据本身含有混响和噪声,且重叠率未达100%,该数据集更贴近真实场景。我们使用相同的数据划分方案:11小时训练集、3小时验证集和1.5小时测试集。

Lip Reading Sentences 3 (LRS3) (Afouras et al., 2018b). The LRS3 dataset includes thousands of spoken sentences from TED and TEDx videos, divided into Trainval and Test folders. We used the same dataset consistent with previous works (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021), which is constructed by randomly selecting voices of two different speakers from the LRS3 data. In contrast to LRS2, the LRS3 data has relatively less noise and is closer to separation tasks in clean environments. It comprises 28-hour training, 3-hour validation, and 1.5-hour test sets.

唇读句子3 (LRS3) (Afouras et al., 2018b)。LRS3数据集包含来自TED和TEDx视频的数千条口语语句,分为Trainval和Test文件夹。我们采用与先前研究 (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021) 一致的数据集构建方式,即从LRS3数据中随机选取两位不同说话者的语音。相较于LRS2,LRS3数据噪声相对较少,更接近纯净环境下的分离任务。该数据集包含28小时训练集、3小时验证集和1.5小时测试集。

VoxCeleb2 (Chung et al., 2018). The VoxCeleb2 dataset contains over one million sentences from 6,112 individuals extracted from YouTube videos, divided into Dev and Test folders. We used the same dataset as previous works (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021), constructed by selecting 5% of the data from the Dev folder of VoxCeleb2 for creating the training and validation sets. Similar to LRS2, VoxCeleb2 also contains a significant amount of noise and reverberation, making it closer to real-world scenarios, but the acoustic environment of VoxCeleb2 is more complex and challenging. It comprises 56-hour training, 3-hour validation, and 1.5-hour test sets.

VoxCeleb2 (Chung et al., 2018)。VoxCeleb2数据集包含从YouTube视频中提取的6,112名个体的超过一百万句话,分为Dev和Test文件夹。我们采用了与先前研究 (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021) 一致的数据集构建方法,即从VoxCeleb2的Dev文件夹中选取5%的数据来创建训练集和验证集。与LRS2类似,VoxCeleb2也包含大量噪声和混响,使其更接近真实场景,但VoxCeleb2的声学环境更为复杂且具有挑战性。该数据集包含56小时的训练集、3小时的验证集和1.5小时的测试集。

C. Objective function

C. 目标函数

We used scale-invariant source-to-noise ratio (SI-SNR) (Le Roux et al., 2019) as the objective function to be maximized for training IIANet. The SI-SNR for each speaker is defined as:

我们使用尺度不变的信噪比 (SI-SNR) (Le Roux et al., 2019) 作为训练 IIANet 的最大化目标函数。每位说话者的 SI-SNR 定义为:

$$
\mathrm{SI\text{-}SNR}(\mathbf{A},\bar{\mathbf{A}})=10\log_{10}\left(\frac{||\boldsymbol{\omega}\times\mathbf{A}||^{2}}{||\bar{\mathbf{A}}-\boldsymbol{\omega}\times\mathbf{A}||^{2}}\right),\ \mathrm{where}\ \boldsymbol{\omega}=\frac{\bar{\mathbf{A}}^{\top}\mathbf{A}}{\mathbf{A}^{\top}\mathbf{A}},
$$

where A and $\bar{\mathbf{A}}$ denote the ground truth speech signal and the estimate speech signal by the model, respectively. “ $\times$ ” denotes the matrix multiplication.

其中A和$\bar{\mathbf{A}}$分别表示真实语音信号和模型估计的语音信号。"$\times$"表示矩阵乘法。
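A direct pure-Python transcription of the SI-SNR definition above (A is the reference, Ā the estimate). Since rescaling the estimate rescales both the projected target and the residual by the same factor, the metric is scale-invariant:

```python
import math

def si_snr(ref, est):
    """Scale-invariant SNR in dB: project the estimate onto the reference,
    then compare the scaled target against the residual."""
    w = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [w * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    num = sum(t * t for t in target)
    den = sum(n * n for n in noise)
    return 10.0 * math.log10(num / den)
```

Note that a perfect (exactly proportional) estimate makes the denominator zero, so the metric is unbounded above.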

D. Model configurations

D. 模型配置

We set the kernel size of the audio encoder and decoder to 16 and the stride size to 8. The video encoder utilized a lip-reading pre-trained model consistent with CTCNet (Li et al., 2022) to extract visual embeddings. The number of down-samplings $D$ for the auditory and visual networks was set to 4, and the number of channels for all convolutional layers was set to 512. The auditory and visual networks used the same set of weights across different cycles, and they were cycled $N_{F}=4$ times; after that, the audio network was additionally cycled $N_{S}=12$ times. The fast version of IIANet, namely IIANet-fast, performed only $N_{S}=6$ cycles on the auditory network. For the FFN in InterA-T blocks, we set the three convolutional layers to have channel sizes (512, 1024, 512), kernel sizes $(1,5,1)$, stride sizes $(1,1,1)$, and biases (False, True, False). To avoid overfitting, we set the dropout probability for all layers to 0.1.

我们将音频编码器和解码器的核大小设为16,步长设为8。视频编码器采用与CTCNet (Li et al., 2022) 一致的唇读预训练模型提取视觉嵌入。听觉和视觉网络的下采样次数 $D$ 设为4,所有卷积层的通道数设为512。听觉和视觉网络在不同循环周期中共享权重,共循环 $N_{F}=4$ 次;此后音频网络额外循环 $N_{S}=12$ 次。快速版IIANet(即IIANet-fast)仅在听觉网络上执行 $N_{S}=6$ 次循环。对于InterA-T块中的FFN,三个卷积层的通道数设为(512, 1024, 512),核大小 $(1,5,1)$,步长 $(1,1,1)$,偏置设为(False, True, False)。为防止过拟合,所有层的dropout概率设为0.1。
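Given the kernel size 16 and stride 8, the temporal length produced by the audio encoder can be sanity-checked with the standard 1-D convolution length formula (assuming no padding, which may differ from the released code):

```python
def conv1d_out_len(t, kernel, stride, padding=0):
    """Output length of a 1-D convolution (floor formula)."""
    return (t + 2 * padding - kernel) // stride + 1

# Audio encoder with kernel 16 and stride 8 on 1 s of 16 kHz audio
# (the input length used for the MACs measurements in Table 2):
frames = conv1d_out_len(16000, 16, 8)
```

With `padding=1` and `kernel=3, stride=1` the formula reproduces the "same length" behavior used by the interleaved convolutions.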


Figure 3. The overall pipeline and architecture of IIANet (audio-only). (A) Overall pipeline.

图 3: IIANet (纯音频版) 的整体流程和架构

E. Evaluation metrics

E. 评估指标

The quality of speech separation is usually assessed using two key metrics: scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal distortion ratio improvement (SDRi). Both are improvement measures, meaning that they take into account both the quality of the separated speech and the difficulty of the original mixture. This is important because, for speech mixtures that are harder to separate, even if the performance of the speech separation system is not optimal, the improvement relative to the original mixture may still be significant. The values of both metrics are derived from the scale-invariant signal-to-noise ratio (SI-SNR) (Le Roux et al., 2019) and the source-to-distortion ratio (SDR) (Vincent et al., 2006a). The higher the values of these metrics, the higher the speech separation quality. The formulas for SI-SNRi, SDR and SDRi are as follows:

语音分离的质量通常通过两个关键指标来评估:尺度不变信噪比提升(SI-SNRi)和信号失真比提升(SDRi)。使用SI-SNR和SDR的提升值意味着它同时考虑了分离语音的质量和原始混合语音的分离难度。这一点很重要,因为对于更难分离的语音混合,即使语音分离系统的表现并非最优,相对于原始混合的提升可能仍然显著。这两个指标的数值源自尺度不变信噪比(SI-SNR) (Le Roux et al., 2019)和源失真比(SDR) (Vincent et al., 2006a)。这些指标的数值越高,表示语音分离质量越好。SI-SNRi、SDR和SDRi的计算公式如下:

$$
\begin{aligned}
\mathrm{SI\text{-}SNRi}(\mathbf{S},\mathbf{A},\bar{\mathbf{A}})&=\mathrm{SI\text{-}SNR}(\mathbf{A},\bar{\mathbf{A}})-\mathrm{SI\text{-}SNR}(\mathbf{A},\mathbf{S}),\\
\mathrm{SDR}(\mathbf{A},\bar{\mathbf{A}})&=10\log_{10}\left(\frac{||\mathbf{A}||^{2}}{||\mathbf{A}-\bar{\mathbf{A}}||^{2}}\right),\\
\mathrm{SDRi}(\mathbf{S},\mathbf{A},\bar{\mathbf{A}})&=\mathrm{SDR}(\mathbf{A},\bar{\mathbf{A}})-\mathrm{SDR}(\mathbf{A},\mathbf{S}),
\end{aligned}
$$

where S denotes the mixture audio.

其中 S 表示混合音频。
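The improvement metrics above subtract the score that the raw mixture S already achieves against the same reference. A sketch of SDR and SDRi in plain Python:

```python
import math

def sdr(ref, est):
    """Source-to-distortion ratio in dB."""
    num = sum(r * r for r in ref)
    den = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10(num / den)

def sdri(mix, ref, est):
    """SDR improvement: the estimate's score minus the score the raw
    mixture already had with respect to the same reference."""
    return sdr(ref, est) - sdr(ref, mix)
```

A hard mixture starts from a low baseline SDR, so even a modest absolute result can yield a large improvement, which is exactly why the improvement form is reported.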

F. Audio network

F. 音频网络

In IIANet (audio-only), we use only audio as input; its pipeline is shown in Figure 3A. The structure of the audio network is illustrated in Figure 3B. Specifically, the audio network takes $\mathbf{E}_{S,t+1}$ as input and obtains multi-scale auditory features $\mathbf{S}_{i}$ through a bottom-up pathway. At the coarsest temporal scale, these multi-scale features are integrated to obtain the global feature $\mathbf{S}_{G}$. Subsequently, we employ a top-down attention mechanism to generate modulated multi-scale auditory features $\bar{\mathbf{S}}_{i}$, enabling auditory focus across different temporal scales and capturing the core semantics at various scales. Finally, through top-down processing, refined auditory features $\bar{\mathbf{E}}_{S,t}=\check{\mathbf{S}}_{0}$ are produced, serving as the input for the next audio network iteration. In this way, we achieve optimization of auditory features within the audio network with shared parameters.

在IIANet(纯音频版)中,我们仅使用音频作为输入,其流程如图3A所示。音频网络结构如图3B所示。具体而言,该网络以$\mathbf{E}_{S,t+1}$为输入,通过自底向上路径获取多尺度听觉特征$\mathbf{S}_{i}$。在最粗时间尺度上,这些多尺度特征被整合为全局特征$\mathbf{S}_{G}$。随后采用自顶向下的注意力机制生成调制后的多尺度听觉特征$\bar{\mathbf{S}}_{i}$,实现跨时间尺度的听觉聚焦并捕获不同尺度的核心语义。最终通过自上而下处理生成精细化听觉特征$\bar{\mathbf{E}}_{S,t}=\check{\mathbf{S}}_{0}$,作为下一次音频网络迭代的输入。由此,我们在参数共享的音频网络内实现了听觉特征的优化。
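As a rough illustration of this bottom-up / top-down structure, the sketch below uses strided convolutions for the bottom-up pathway and a sigmoid gate driven by the global feature for the top-down attention. The layer types, channel sizes, and gating form are our illustrative assumptions, not IIANet's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioNetSketch(nn.Module):
    """Illustrative bottom-up/top-down audio block (not the exact IIANet design)."""
    def __init__(self, channels=64, num_scales=4):
        super().__init__()
        self.num_scales = num_scales
        # Bottom-up pathway: each stage halves the temporal resolution.
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2)
            for _ in range(num_scales)
        )
        # 1x1 gates that modulate each scale with the global feature (top-down attention).
        self.gate = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=1) for _ in range(num_scales)
        )

    def forward(self, x):                       # x: (B, C, T), i.e. E_{S,t+1}
        feats, h = [], x
        for down in self.down:                  # multi-scale features S_i
            h = F.relu(down(h))
            feats.append(h)
        s_global = feats[-1]                    # coarsest scale as global feature S_G
        for i in reversed(range(self.num_scales)):
            g = torch.sigmoid(self.gate[i](s_global))
            g = F.interpolate(g, size=feats[i].shape[-1])  # broadcast gate to scale i
            feats[i] = feats[i] * g             # modulated features  \bar{S}_i
        # Top-down fusion back to the input resolution -> refined E_{S,t}
        h = feats[-1]
        for i in reversed(range(self.num_scales - 1)):
            h = F.interpolate(h, size=feats[i].shape[-1]) + feats[i]
        return F.interpolate(h, size=x.shape[-1]) + x
```

Because the output has the same shape as the input, the block can be applied iteratively with shared parameters, as described above.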

For the training and testing of IIANet (audio-only), we used hyperparameter settings and datasets consistent with IIANet for a fair comparison with other models. We solved the permutation problem (Hershey et al., 2016) using the permutation invariant training (PIT) method (Yu et al., 2017), consistent with blind source separation methods (Li et al., 2023; Luo & Mesgarani, 2019; Luo et al., 2020), to maximize the SI-SNR.

为了对IIANet(仅音频)进行训练和测试,我们采用了与IIANet一致的超参数设置和数据集,以便与其他模型进行公平比较。我们使用与盲源分离方法 (Li et al., 2023; Luo & Mesgarani, 2019; Luo et al., 2020) 一致的排列不变训练 (PIT) 方法 (Yu et al., 2017) 来解决排列问题 (Hershey et al., 2016),从而最大化SI-SNR。
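A minimal sketch of a PIT objective with an SI-SNR criterion, in the spirit of Yu et al. (2017); the actual training code may differ. It evaluates every speaker permutation and keeps the best assignment per example:

```python
import itertools
import torch

def pit_si_snr_loss(estimates, targets, eps=1e-8):
    """Permutation-invariant negative SI-SNR loss.
    estimates, targets: (batch, num_speakers, time)."""
    def si_snr(est, tgt):
        est = est - est.mean(dim=-1, keepdim=True)
        tgt = tgt - tgt.mean(dim=-1, keepdim=True)
        s = (est * tgt).sum(-1, keepdim=True) / ((tgt * tgt).sum(-1, keepdim=True) + eps) * tgt
        e = est - s
        return 10 * torch.log10((s * s).sum(-1) / ((e * e).sum(-1) + eps) + eps)

    num_spk = targets.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_spk)):
        # Negative SI-SNR averaged over sources for this speaker assignment.
        losses.append(-si_snr(estimates[:, list(perm)], targets).mean(dim=1))
    # Keep the best permutation per batch element, then average.
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```

The permutation search is exhaustive (factorial in the number of speakers), which is cheap for the 2–4 speaker settings considered here.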

As shown in Table 7, the speech separation quality with added visual information (IIANet) was improved by about 4 dB compared to IIANet (audio-only).

如表 7 所示,加入视觉信息 (IIANet) 的语音分离质量相比纯音频 IIANet 提升了约 4 dB。

| 模型 | LRS2 SI-SNRi | LRS2 SDRi | LRS2 PESQ | LRS3 SI-SNRi | LRS3 SDRi | LRS3 PESQ | VoxCeleb2 SI-SNRi | VoxCeleb2 SDRi | VoxCeleb2 PESQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IIANet (ours) | 16.0 | 16.2 | 3.23 | 18.3 | 18.5 | 3.28 | 13.6 | 14.3 | 3.12 |
| IIANet (audio-only) | 11.2 | 11.5 | 2.32 | 12.5 | 12.8 | 2.54 | 9.5 | 10.0 | 2.23 |

Table 7. Separation results of different AVSS methods on LRS2, LRS3, and VoxCeleb2 datasets. These metrics represent the average values for all speakers in each test set. The larger SI-SNRi, SDRi, and PESQ values are better.

表 7: 不同音频视觉语音分离 (AVSS) 方法在 LRS2、LRS3 和 VoxCeleb2 数据集上的分离结果。这些指标代表每个测试集中所有说话者的平均值。SI-SNRi、SDRi 和 PESQ 值越大越好。


Figure 4. The architecture of IIANet’s control models. (A) Control 1. It is obtained by removing the IntraA and InterA blocks of IIANet. (B) Control 2. It is obtained by removing InterA blocks of IIANet.

图 4: IIANet控制模型架构。(A) Control 1。通过移除IIANet的IntraA和InterA模块获得。(B) Control 2。通过移除IIANet的InterA模块获得。

G. Experimental configurations for multiple speakers

G. 多说话人实验配置

Dataset: We utilized the LRS2-3Mix and LRS2-4Mix datasets for training purposes. For audio data construction, we selected random audio clips from 3 to 4 random speakers mixed with a random signal-to-noise ratio (SNR) value ranging between -5 and 5 dB. Regarding visual data construction, we employed the same experimental setup used in the LRS2-2Mix dataset. In the end, the constructed LRS2-3Mix and LRS2-4Mix datasets comprised 20,000 mixed speech samples for the training set, 5,000 for the validation set, and 300 for the testing set.

数据集:我们使用LRS2-3Mix和LRS2-4Mix数据集进行训练。在音频数据构建中,我们随机选取3至4个说话者的音频片段,并以-5至5 dB范围内的随机信噪比(SNR)值进行混合。视觉数据构建方面,我们采用了与LRS2-2Mix数据集相同的实验设置。最终构建的LRS2-3Mix和LRS2-4Mix数据集包含20,000个混合语音样本作为训练集,5,000个作为验证集,300个作为测试集。
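The SNR-controlled mixing step can be sketched as follows, where `mix_at_snr` is a hypothetical helper (not from the paper's code) that scales the second clip so that the first sits `snr_db` dB above it; mixing in a third or fourth speaker repeats the same step:

```python
import numpy as np

def mix_at_snr(speech_a, speech_b, snr_db, eps=1e-8):
    """Scale speech_b so that speech_a is snr_db dB louder, then mix."""
    power_a = np.mean(speech_a ** 2)
    power_b = np.mean(speech_b ** 2)
    # Gain that makes power_a / power(gain * speech_b) equal the requested SNR.
    gain = np.sqrt(power_a / (power_b + eps) / (10 ** (snr_db / 10)))
    return speech_a + gain * speech_b

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16000), rng.standard_normal(16000)
snr = rng.uniform(-5, 5)          # random SNR in [-5, 5] dB, as in the dataset construction
mixture = mix_at_snr(a, b, snr)
```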

Training Process: Consistent with the experimental configuration used for training with two speakers, we trained different baseline models and the IIANet on multi-speaker datasets. This approach allowed us to maintain uniformity in our experimental configuration while exploring the model’s performance across varying numbers of speakers.

训练过程:与双说话人训练的实验配置保持一致,我们在多说话人数据集上训练了不同的基线模型和IIANet。这种方法使我们能够在探索模型在不同说话人数量下的性能时,保持实验配置的一致性。

Inference Process: In the case of video with multiple speakers, our inference process is as follows: Firstly, we use a face detection model (Zhang et al., 2016) to extract the facial images of the different speakers from the video. These images are then cropped to isolate the lip area, with each of these cropped images serving as inputs to the lip reading model. At the same time, the audio corresponding to the video is fed as input into the audio encoder. These processed visual and auditory inputs are subsequently fed into our visual and auditory networks in the separation network, generating visual and auditory features. Through this separation network, we obtain the audio mask of the intended speaker. Then, by employing the audio decoder, we obtain the waveform of the intended speaker. This process continues iteratively until all speakers' visual information has been processed.

推理过程:对于包含多位说话者的视频,我们的推理流程如下:首先使用人脸检测模型 (Zhang et al., 2016) 从视频中提取不同说话者的面部图像,随后裁剪出唇部区域,这些裁剪后的图像将作为唇语识别模型的输入。同时,视频对应的音频会被输入至音频编码器。经过处理的视觉和听觉输入随后被送入分离网络中的视觉与听觉网络,生成视觉和听觉特征。通过该分离网络,我们获得目标说话者的音频掩码。接着利用音频解码器得到目标说话者的波形。该过程会迭代执行,直至处理完所有说话者的视觉信息。

H. Control models

H. 控制模型

To validate the effectiveness of our proposed intra-attention and inter-attention blocks in IIANet, we constructed two control models: Control 1 and Control 2.

为了验证IIANet中提出的内部注意力 (intra-attention) 和交互注意力 (inter-attention) 模块的有效性,我们构建了两个对照模型:Control 1和Control 2。

Control 1 was obtained by removing all IntraA and InterA blocks from IIANet, and its architecture is depicted in Figure 4A. We retained the InterA-B blocks in Control 1 to preserve essential connectivity between the audio and visual networks for audio-visual feature fusion. Control 2 was obtained by adding IntraA blocks to Control 1, and its architecture is depicted in Figure 4B.

对照1是通过从IIANet中移除所有IntraA和InterA模块获得的,其架构如图4A所示。我们在对照1中保留了InterA-B模块,以维持音频和视觉网络之间的基本连接,实现视听特征融合。对照2通过在对照1基础上添加IntraA模块获得,其架构如图4B所示。

I. Impact of different AV fusion cycles

I. 不同 AV 融合周期 的影响

In IIANet, we gradually increased the number of AV fusion cycles. With the number of audio-only cycles fixed at 14, we set the number of AV fusion cycles to 1, 2, 3, 4, and 5. This allowed us to probe the effect of visual information through the strength of the fused information. Table 8 shows that separation quality improved as the number of fusion cycles increased, but plateaued at 4. To keep the model lightweight without compromising separation quality, we consider $N_{F}=4$ a good choice.

在IIANet中,我们逐步增加了视听融合(AV fusion)的循环次数。在固定纯音频循环次数为14的情况下,视听融合循环次数分别为1、2、3、4和5次。这使得我们能够通过融合信息的强度逐步检测视觉信息的效果。表8显示,随着融合次数的增加,分离质量有所提升,但在4次时达到稳定状态。为了使模型保持轻量化同时不损害分离质量,我们认为$N_{F}=4$是一个理想选择。

| $N_F$ | SI-SNRi | SDRi | PESQ | Params (M) | MACs (G) |
| --- | --- | --- | --- | --- | --- |
| 1 | 14.0 | 14.2 | 3.06 | 3.1 | 17.99 |
| 2 | 14.5 | 14.7 | 3.10 | 3.1 | 18.57 |
| 3 | 15.2 | 15.3 | 3.15 | 3.1 | 18.61 |
| 4 | 16.0 | 16.2 | 3.23 | 3.1 | 18.64 |
| 5 | 16.0 | 16.3 | 3.23 | 3.1 | 18.68 |

Table 8. Results on the LRS2 dataset for different numbers of audio-visual fusion cycles used by IIANet.

表 8: IIANet 在 LRS2 数据集上使用不同视听融合周期数的结果

J. Comparison of different video encoders

J. 不同视频编码器的比较

We examined the separation quality of three different pre-trained video encoder models on the LRS2-2Mix dataset: DC-TCN (Ma et al., 2021), MS-TCN (Martinez et al., 2020), and CTCNet-Lip (Li et al., 2022), where DC-TCN, MS-TCN, and CTCNet-Lip are lip-reading models used to compute embeddings in linguistic contexts. We trained IIANet while replacing only the video encoder. The results are shown in Table 9. Interestingly, the separation quality was similar across the different video encoders (SI-SNRi differences of less than 0.3 dB). This indicates that the choice of video encoder does not strongly affect overall performance, implying the pivotal role of the audio-visual fusion strategy itself.

我们在LRS2-2Mix数据集上测试了三种预训练视频编码器模型的分离质量:DC-TCN (Ma等人, 2021)、MS-TCN (Martinez等人, 2020)和CTCNet-Lip (Li等人, 2022)。其中DC-TCN、MS-TCN和CTCNet-Lip是用于计算语言学上下文嵌入的唇读模型。我们仅通过替换视频编码器来训练IIANet。结果如表9所示。有趣的是,我们发现使用不同视频编码器的分离质量相当(SI-SNRi差异小于0.3 dB)。这表明不同视频编码器对整体性能影响不大,暗示了音视频融合策略本身的关键作用。

Submission and Formatting Instructions for ICML 2024

ICML 2024 投稿与格式要求

| 视频编码器 | SI-SNRi | SDRi | PESQ | Acc (%) |
| --- | --- | --- | --- | --- |
| DC-TCN | 15.7 | 16.0 | 3.19 | 88.0 |
| MS-TCN | 15.8 | 16.0 | 3.22 | 85.3 |
| CTCNet-Lip | 16.0 | 16.2 | 3.23 | 84.1 |

K. Impaired visual cues

K. 视觉线索受损

We explored the relationship between impaired visual cues (the speaker's facial orientation, specifically the side profile) and separation quality. We utilized the MediaPipe tool on the LRS2 dataset to assess the speaker's facial orientation and categorized 6,000 audio-video pairs into two groups: frontal orientation (5,331 samples) and side orientation (669 samples). A few sample visualizations are shown in Figure 5. Our findings revealed a marked decline in separation quality for samples with a side orientation compared to those facing front. Specifically, we observe that the SI-SNRi value decreases by more than 8% on AVLiT-8 and CTCNet for side-orientation samples compared to frontal-orientation samples, whereas this decrease is only about 4% on IIANet. This result underscores the effectiveness of our proposed cross-modal fusion approach in integrating visual and auditory features, mitigating the impairment of visual information.

我们探究了视觉线索缺失(说话者面部朝向,特别是侧脸)与分离质量之间的关系。在LRS2数据集上使用MediaPipe工具评估说话者面部朝向,并将6000个音视频对分为两组:正面朝向(5331个样本)和侧面朝向(669个样本)。图5展示了几组可视化样本。研究发现,与正面朝向样本相比,侧面朝向样本的分离质量显著下降。具体而言,AVLiT-8和CTCNet模型在侧面朝向样本上的SI-SNRi值比正面朝向样本下降超过8%,而IIANet的下降幅度仅为4%左右。这一结果印证了我们提出的跨模态融合方法在整合视听特征方面的有效性,能够缓解视觉信息受损带来的影响。


Figure 5. Sample visualizations of faces with different orientations in the LRS2 dataset.

Table 9. Results of different video encoders used in IIANet on the LRS2 dataset. "Acc" denotes the accuracy of lip-reading recognition, which is an effective metric for evaluating lip-reading capabilities.

图 5: LRS2数据集中不同朝向人脸的示例可视化。
表 9: IIANet在LRS2数据集上采用不同视频编码器的结果。"Acc"表示唇语识别的准确率,这是评估唇语能力的有效指标。

L. Data augmentation

L. 数据增强

To enhance the model’s generalization capability, we apply a dynamic mixture data augmentation scheme to the audio-visual separation task. Specifically, we construct new training data during the training process by randomly sampling two speech signals and mixing them after random gain adjustments. Similar approaches have been used in speech and music separation methods (Subakan et al., 2021; Zeghidour & Grangier, 2021). As shown in Table 11, adopting the dynamic mixing training scheme can further improve the model’s separation performance. We recommend using this training scheme in practice, as the cost of generating mixtures is not high (compared to fixed training data, it only adds 0.1 minutes per epoch).

为增强模型的泛化能力,我们在视听分离任务中采用了动态混合数据增强方案。具体而言,在训练过程中通过随机采样两段语音信号并进行随机增益调整后混合,构建新的训练数据。类似方法已在语音和音乐分离领域得到应用 (Subakan et al., 2021; Zeghidour & Grangier, 2021)。如表 11 所示,采用动态混合训练方案能进一步提升模型的分离性能。由于生成混合数据的成本较低(相比固定训练数据,每轮仅增加0.1分钟),我们建议在实践中采用该训练方案。
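The dynamic-mixing scheme can be sketched as below. The pool format (a list of waveform arrays), segment length, and gain range are illustrative assumptions, not the paper's exact implementation:

```python
import random
import numpy as np

def dynamic_mixture(speech_pool, segment_len=32000, gain_db_range=(-5, 5)):
    """Build one on-the-fly training pair: sample two speakers' signals,
    apply random gains, and mix. Returns (mixture, separation targets)."""
    s1, s2 = random.sample(speech_pool, 2)
    s1, s2 = s1[:segment_len].copy(), s2[:segment_len].copy()
    for s in (s1, s2):
        gain_db = random.uniform(*gain_db_range)
        s *= 10 ** (gain_db / 20)           # random gain adjustment, in dB
    mixture = s1 + s2
    return mixture, np.stack([s1, s2])
```

Calling this once per training step yields fresh mixtures every epoch, which is what makes the per-epoch overhead small relative to a fixed precomputed training set.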

Table 10. Results of different models with different orientations on the LRS2 dataset.

表 10: 不同模型在 LRS2 数据集上针对不同人脸朝向的结果。

| 模型 | SI-SNRi | SDRi | PESQ |
| --- | --- | --- | --- |
| **正面朝向** | | | |
| AVLiT-8 | 13.5 | 13.8 | 2.75 |
| CTCNet | 14.9 | 15.1 | 3.12 |
| IIANet | 16.3 | 16.5 | 3.24 |
| **侧面朝向** | | | |
| AVLiT-8 | 12.1 | 12.4 | 2.37 |
| CTCNet | 13.7 | 14.1 | 3.04 |
| IIANet | 15.7 | 15.9 | 3.22 |
| 方法 | LRS2 SI-SNRi | LRS2 SDRi | LRS2 PESQ | LRS3 SI-SNRi | LRS3 SDRi | LRS3 PESQ | VoxCeleb2 SI-SNRi | VoxCeleb2 SDRi | VoxCeleb2 PESQ | 训练时间 (分钟) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Without DM | 16.0 | 16.2 | 3.23 | 18.3 | 18.5 | 3.28 | 13.6 | 14.3 | 3.12 | 36.1 |
| With DM | 16.8 | 17.0 | 3.35 | 18.6 | 18.8 | 3.30 | 14.0 | 14.7 | 3.17 | 36.2 |

Table 11. Comparison of separation performance for different training schemes. ”DM” stands for dynamic mixture scheme. Training time refers to the time spent training one epoch on the LRS2 training set.

表 11: 不同训练方案的分离性能对比。"DM"代表动态混合方案。训练时间指在LRS2训练集上训练一个epoch所花费的时间。

M. Visual results

M. 可视化结果

We performed visual analysis of the separation performance of four AVSS methods: IIANet, CTCNet, VisualVoice, and AVConvTasNet. The results, based on models trained on the LRS2 dataset, are presented in Figure 6. We tested the performance of the four algorithms by randomly selecting three audio mixture samples from the LRS2 dataset.

我们对四种视听语音分离 (AVSS) 方法 (IIANet、CTCNet、VisualVoice 和 AVConvTasNet) 的分离性能进行了可视化分析。基于在 LRS2 数据集上训练的模型结果如图 6 所示。我们通过从 LRS2 数据集中随机选取三个音频混合样本来测试四种算法的性能。

In Sample I, poor separation of high-frequency components was observed for CTCNet, VisualVoice, and AVConvTasNet. In contrast, the IIANet model exhibited higher separation quality, evidenced by its clarity and strong similarity to the ground truth.

在样本I中,观察到CTCNet、VisualVoice和AVConvTasNet对高频成分的分离效果较差。相比之下,IIANet模型展现出更高的分离质量,其清晰度和与真实情况的高度相似性证明了这一点。

In Sample II, IIANet presented more detailed harmonic features by accurately removing low and mid-frequency components to the right of the center. In sharp contrast, the other models demonstrated significant deficiencies in recovering harmonic features, resulting in large deviations from the ground truth.

在样本II中,IIANet通过精确去除中心右侧的低频和中频成分,呈现出更细致的谐波特征。与之形成鲜明对比的是,其他模型在恢复谐波特征方面表现出明显不足,导致与真实情况存在较大偏差。

In Sample III, IIANet demonstrated excellent ability to maintain the complex harmonic structure of the original audio. On the other hand, models such as CTCNet and AVConvTasNet showed blurring effects in this regard, failing to capture the fine details in the ground truth.

在样本III中,IIANet展现了保持原始音频复杂谐波结构的出色能力。相比之下,CTCNet和AVConvTasNet等模型在这方面出现了模糊效应,未能捕捉到真实数据中的精细细节。


Figure 6. Spectrograms of separated audio from different models. Each row represents the results for the same audio mixture.

图 6: 不同模型分离音频的频谱图。每行表示同一混合音频的处理结果。
