[论文翻译]IIANet: 一种用于视听语音分离的模态内与模态间注意力网络


原文地址:https://arxiv.org/pdf/2308.08143v3


IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

IIANet: 一种用于视听语音分离的模态内与模态间注意力网络

Abstract

摘要

Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this issue, we propose a novel model called Intra- and Inter-Attention Network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by the way the human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, outperforming previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only 7% of CTCNet’s MACs and is 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanisms for efficient and effective multimodal fusion.

近期研究在视听语音分离的融合模块设计上取得了显著进展。然而,这些方法主要集中于听觉和视觉特征在单一时间尺度上的多模态融合,且未采用选择性注意力机制,这与大脑处理机制形成鲜明对比。为解决这一问题,我们提出了一种名为内外注意力网络 (Intra- and Inter-Attention Network, IIANet) 的新模型,该模型利用注意力机制实现高效的视听特征融合。IIANet包含两类注意力模块:内部注意力 (IntraA) 和交互注意力 (InterA) 模块,其中InterA模块分布在网络顶部、中部和底部。这些模块深度借鉴了人脑在不同时间尺度上选择性关注相关内容的机制,既能保持学习模态特异性特征的能力,又可从视听特征中提取不同语义信息。在三个标准视听分离基准数据集 (LRS2、LRS3和VoxCeleb2) 上的全面实验表明,IIANet在保持相当推理时间的同时,性能优于现有最优方法。特别地,快速版IIANet (IIANet-fast) 仅需CTCNet 7%的乘加运算量,在CPU上运行速度比CTCNet快40%,同时实现更优的分离质量,这充分展现了注意力机制在高效多模态融合中的巨大潜力。

1. Introduction

1. 引言

In our daily lives, audio and visual signals are the primary means of information transmission, providing rich cues for humans to obtain valuable information in noisy environments (Cherry, 1953; Arons, 1992). This innate ability is known as the “cocktail party effect”, also referred to as “speech separation” in computer science. Speech separation has extensive research significance, such as assisting individuals with hearing impairments (Summerfield, 1992), enhancing the auditory experience of wearable devices (Ryumin et al., 2023), etc.

在我们的日常生活中,音频和视觉信号是信息传递的主要方式,为人类在嘈杂环境中获取有价值信息提供了丰富线索 (Cherry, 1953; Arons, 1992)。这种先天能力被称为"鸡尾酒会效应",在计算机科学领域也被称为"语音分离"。语音分离具有广泛的研究意义,例如帮助听力障碍人士 (Summerfield, 1992)、提升可穿戴设备的听觉体验 (Ryumin et al., 2023) 等。

Researchers have previously sought to address the cocktail party effect through audio streaming alone (Luo & Mesgarani, 2019; Luo et al., 2020; Hu et al., 2021). This task is called audio-only speech separation (AOSS). However, the quality of the separation system may decline dramatically when speech is corrupted by noise (Afouras et al., 2018a). To address this issue, many researchers have focused on audio-visual speech separation (AVSS), achieving remarkable progress. Existing AVSS methods can be broadly classified into two categories based on their fusion modules: CNN-based and Transformer-based. CNN-based methods (Gao & Grauman, 2021; Li et al., 2022; Wu et al., 2019) exhibit lower computational complexity and excellent local feature extraction capabilities, enabling the extraction of audio-visual information at multiple scales to capture local contextual relationships. Transformer-based methods (Lee et al., 2021; Rahimi et al., 2022; Montesinos et al., 2022) can leverage cross-attention mechanisms to learn associations across different time steps, effectively handling dynamic relationships between audio-visual information.

研究人员此前曾尝试仅通过音频流来解决鸡尾酒会效应 (Luo & Mesgarani, 2019; Luo et al., 2020; Hu et al., 2021) ,这项任务被称为纯音频语音分离 (AOSS) 。然而,当语音被噪声干扰时,分离系统的质量可能会急剧下降 (Afouras et al., 2018a) 。为了解决这个问题,许多研究者将重点转向视听语音分离 (AVSS) ,并取得了显著进展。现有AVSS方法根据其融合模块可大致分为两类:基于CNN的方法和基于Transformer的方法。基于CNN的方法 (Gao & Grauman, 2021; Li et al., 2022; Wu et al., 2019) 具有较低的计算复杂度和出色的局部特征提取能力,能够提取多尺度的视听信息以捕捉局部上下文关系。基于Transformer的方法 (Lee et al., 2021; Rahimi et al., 2022; Montesinos et al., 2022) 则可以利用交叉注意力机制学习不同时间步之间的关联,有效处理视听信息间的动态关系。

By comparing the working mechanisms of these AVSS methods and the brain, we find that there are two major differences. First, auditory and visual information undergoes neural integration at multiple levels of the auditory and visual pathways, including the thalamus (Halverson & Freeman, 2006; Cai et al., 2019), A1 and V1 (Mesik et al., 2015; Eckert et al., 2008), and occipital cortex (Ghazanfar & Schroeder, 2006; Stein & Stanford, 2008). However, most AVSS methods integrate audio-visual information only at either coarse (Gao & Grauman, 2021; Lee et al., 2021) or fine (Li et al., 2022; Montesinos et al., 2022; Wu et al., 2019; Rahimi et al., 2022) temporal scale, thus ignoring semantic associations across different scales.

通过比较这些视听语音分离 (AVSS) 方法与大脑的工作机制,我们发现存在两个主要差异。首先,听觉和视觉信息在听觉和视觉通路的多个层级进行神经整合,包括丘脑 (Halverson & Freeman, 2006; Cai et al., 2019)、初级听觉皮层 (A1) 和初级视觉皮层 (V1) (Mesik et al., 2015; Eckert et al., 2008) 以及枕叶皮层 (Ghazanfar & Schroeder, 2006; Stein & Stanford, 2008)。然而,大多数 AVSS 方法仅在粗粒度 (Gao & Grauman, 2021; Lee et al., 2021) 或细粒度 (Li et al., 2022; Montesinos et al., 2022; Wu et al., 2019; Rahimi et al., 2022) 时间尺度上整合视听信息,从而忽略了跨尺度的语义关联。

Second, numerous lines of evidence have indicated that the human brain employs selective attention to focus on relevant speech in a noisy environment (Cherry, 1953; Golumbic et al., 2013). This also modulates neural responses in a frequency-specific manner, enabling the tracked speech to be distinguished and separated from competing speech (Golumbic et al., 2013; Mesgarani & Chang, 2012; O’Sullivan et al.,

其次,大量证据表明,人类大脑会利用选择性注意在嘈杂环境中聚焦相关语音 (Cherry, 1953; Golumbic et al., 2013)。这种机制还会以频率特异性方式调节神经反应,使被追踪的语音能够从竞争语音中区分和分离出来 (Golumbic et al., 2013; Mesgarani & Chang, 2012; O’Sullivan et al.,

2015; Kaya & Elhilali, 2017; Cooke & Ellis, 2001; Aytar et al., 2016). The Transformer-based AVSS methods adopt attention mechanisms (Rahimi et al., 2022; Montesinos et al., 2022), but they commonly employ the Transformer as the backbone network to extract and fuse audio-visual features, and the attention is not explicitly designed for selecting speakers. In contrast to the above complex fusion strategies, our approach aims to design a simple and efficient fusion module based on the selective attention mechanism.

2015; Kaya & Elhilali, 2017; Cooke & Ellis, 2001; Aytar et al., 2016)。基于Transformer的视听语音分离(AVSS)方法采用注意力机制(Rahimi et al., 2022; Montesinos et al., 2022),但它们通常将Transformer作为主干网络来提取和融合视听特征,且注意力机制并非专门为说话人选择而设计。与上述复杂融合策略不同,我们的方法旨在基于选择性注意力机制设计一个简单高效的融合模块。

To achieve this goal, inspired by the structure and function of the brain, we propose a concise and efficient audio-visual fusion scheme, called Intra- and Inter-Attention Network (IIANet), which leverages different forms of attention to extract diverse semantics from audio-visual features. IIANet consists of two hierarchical unimodal networks, mimicking the audio and visual pathways in the brain, and employs two types of attention mechanisms: intra-attention (IntraA) within single modalities and inter-attention (InterA) between modalities.

为实现这一目标,受大脑结构和功能的启发,我们提出了一种简洁高效的视听融合方案,称为内外注意力网络 (Intra- and Inter-Attention Network, IIANet),该方案利用不同形式的注意力机制从视听特征中提取多样化语义。IIANet由两个分层单模态网络组成,模拟大脑中的听觉和视觉通路,并采用两种注意力机制:单模态内的内部注意力 (IntraA) 和跨模态间的交互注意力 (InterA)。

Following a recent AOSS model (Li et al., 2023), the IntraA block has two forms of top-down attention, global and local, which aim to enhance the model's ability to select relevant information in unimodal networks under the guidance of higher-level features. This mechanism is inspired by the massive top-down neural projections discovered in both auditory (Guinan Jr, 2006; Budinger et al., 2008) and visual (Angelucci et al., 2002; Felleman & Van Essen, 1991) pathways. The InterA mechanism aims to select relevant information in one modality with the help of the features in the other unimodal network. It is used at all levels of the unimodal networks. Roughly speaking, the InterA block at the top level (InterA-T) corresponds to higher associative cortical areas such as the frontal cortex and occipital cortex (Raij et al., 2000; Keil et al., 2012; Stein & Stanford, 2008), and at the bottom level (InterA-B) corresponds to the thalamus (Halverson & Freeman, 2006; Cai et al., 2019). The InterA blocks at the middle levels (InterA-M) reflect the direct neural projections between different auditory areas and visual areas, e.g., from the core and belt regions of the auditory cortex to the V1 area (Falchier et al., 2002).

遵循近期提出的AOSS模型 (Li et al., 2023),IntraA机制包含全局与局部两种自上而下的注意力形式,旨在通过高层特征引导增强单模态网络中的相关信息选择能力。该机制灵感源自听觉 (Guinan Jr, 2006; Budinger et al., 2008) 和视觉 (Angelucci et al., 2002; Felleman & Van Essen, 1991) 通路中发现的巨量自上而下神经投射。InterA机制则借助另一单模态网络的特征来选择当前模态的相关信息,其作用于单模态网络的所有层级。简言之,顶层InterA模块 (InterA-T) 对应前额叶皮层、枕叶皮层等高级联合皮层区 (Raij et al., 2000; Keil et al., 2012; Stein & Stanford, 2008),底层模块 (InterA-B) 对应丘脑结构 (Halverson & Freeman, 2006; Cai et al., 2019),而中层InterA模块 (InterA-M) 反映了听觉皮层核心区/带状区至V1区等跨模态直接神经投射 (Falchier et al., 2002)。

On three AVSS benchmarks, we found that IIANet surpassed the previous SOTA method CTCNet (Li et al., 2022) by a considerable margin, while the model inference time was nearly on par with CTCNet. We also built a fast version of IIANet, termed IIANet-fast. It achieved an inference speed on CPU that was more than twice as fast as CTCNet and still yielded better separation quality on three benchmark datasets.

在三个AVSS基准测试中,我们发现IIANet以显著优势超越了之前的SOTA方法CTCNet (Li et al., 2022),同时模型推理时间与CTCNet基本持平。我们还构建了IIANet的快速版本IIANet-fast,其在CPU上的推理速度比CTCNet快两倍以上,且在三个基准数据集上仍能提供更好的分离质量。

2. Related work

2. 相关工作

2.1. Audio-visual speech separation

2.1. 视听语音分离

Incorporating different modalities aligns better with the brain’s processing (Calvert, 2001; Bar, 2007). Some methods (Gao & Grauman, 2021; Li et al., 2022; Wu et al., 2019; Lee et al., 2021; Rahimi et al., 2022; Montesinos et al., 2022) have attempted to add visual cues to the AOSS task to improve the clarity of separated audio, a task called audio-visual speech separation (AVSS). They have demonstrated impressive separation results in more complex acoustic environments, but these methods only focus on audio-visual feature fusion at a single temporal scale in both modalities, e.g., only at the finest (Wu et al., 2019; Li et al., 2022) or coarsest (Gao & Grauman, 2021; Lee et al., 2021) scales, which restricts their efficient utilization of different modalities’ information. While CTCNet’s fusion module (Li et al., 2022) attempts to improve separation quality by utilizing visual and auditory features at different scales, it still focuses on fusion at the finest temporal scales. This unintentionally hinders the effectiveness of using visual information for guidance during the separation process. Moreover, some of the AVSS models (Lee et al., 2021; Montesinos et al., 2022) employ attention mechanisms within the process of inter-modal fusion, while the significance of intra-modal attention mechanisms is overlooked. This deficiency constrains their separation capabilities, as the global information within the auditory modality plays a crucial role in enhancing the clarity of the separated audio (Li et al., 2023). Notably, the fusion modules of these methods (Lee et al., 2021; Montesinos et al., 2022) tend to use a similarity matrix to compute the attention, which dramatically increases the computational complexity. In contrast, our proposed IIANet uses the sigmoid function and element-wise product for selective attention, which can efficiently integrate audio-visual features at different temporal scales.

融入不同模态能更好地匹配大脑处理机制 (Calvert, 2001; Bar, 2007)。部分研究 (Gao & Grauman, 2021; Li et al., 2022; Wu et al., 2019; Lee et al., 2021; Rahimi et al., 2022; Montesinos et al., 2022) 尝试在AOSS任务中添加视觉线索以提升分离音频的清晰度,称为视听语音分离 (AVSS)。这些方法在复杂声学环境中展现了出色的分离效果,但仅关注单时间尺度的跨模态特征融合,例如仅在最精细 (Wu et al., 2019; Li et al., 2022) 或最粗糙 (Gao & Grauman, 2021; Lee et al., 2021) 尺度进行融合,限制了多模态信息的有效利用。虽然CTCNet的融合模块 (Li et al., 2022) 尝试通过多尺度视听特征提升分离质量,但仍聚焦于最精细时间尺度的融合,这无意中削弱了视觉信息在分离过程中的引导效果。此外,部分AVSS模型 (Lee et al., 2021; Montesinos et al., 2022) 在跨模态融合过程中采用注意力机制,却忽视了模态内注意力机制的重要性——听觉模态的全局信息对提升分离音频清晰度具有关键作用 (Li et al., 2023)。值得注意的是,这些方法 (Lee et al., 2021; Montesinos et al., 2022) 的融合模块通常使用相似度矩阵计算注意力,显著增加了计算复杂度。相比之下,我们提出的IIANet采用sigmoid函数和逐元素乘积实现选择性注意力,能高效整合多时间尺度的视听特征。

2.2. Attention mechanism in speech separation

2.2. 语音分离中的注意力机制

Attention is a critical function of the brain (Rensink, 2000; Corbetta & Shulman, 2002), serving as a pivotal mechanism for humans to cope with complex internal and external environments. Our brain has the ability to focus its superior resources to selectively process task-relevant information (Schneider, 2013). It is widely accepted that numerous brain regions are involved in the transmission of attention, forming an intricate neural network of interconnections (Posner & Petersen, 1990; Fukushima, 1986). These regions play distinct roles, enabling humans to selectively focus on processing certain information at necessary times and locations while ignoring other perceivable information. While this ability is known to work well along with sensory or motor stimuli, it also enables our brain to maintain the ability to process information in real-time even in the absence of these stimuli (Tallon-Baudry, 2012; Koch & Tsuchiya, 2007).

注意力是大脑的一项关键功能 (Rensink, 2000; Corbetta & Shulman, 2002),作为人类应对复杂内外环境的核心机制。我们的大脑能够集中优势资源,选择性处理任务相关信息 (Schneider, 2013)。学界普遍认为,众多脑区参与注意力传递,形成错综复杂的互联神经网络 (Posner & Petersen, 1990; Fukushima, 1986)。这些区域各司其职,使人类能在必要时空选择性地专注处理特定信息,同时忽略其他可感知信息。虽然这种能力已知能与感觉或运动刺激协同运作,但它也使大脑在缺乏这些刺激时仍能保持实时处理信息的能力 (Tallon-Baudry, 2012; Koch & Tsuchiya, 2007)。

Recently, Kuo et al. (2022) constructed artificial neural networks to simulate the attention mechanism of auditory features, confirming that top-down attention plays a critical role in addressing the cocktail party problem. Several recent AOSS models (Shi et al., 2018; Li et al., 2023; Chen et al., 2023) have utilized top-down attention in designing speech separation models. These approaches were able to reduce computational costs while maintaining audio separation quality. They have also shown excellent performance dealing with complex mixture audio. However, the efficient integration of top-down and lateral attention between modalities in AVSS models remains an unclear aspect.

最近,Kuo等人 (2022) 构建了人工神经网络来模拟听觉特征的注意力机制,证实了自上而下注意力 (top-down attention) 在解决鸡尾酒会问题中的关键作用。近期多个AOSS模型 (Shi等人, 2018; Li等人, 2023; Chen等人, 2023) 在设计语音分离模型时都采用了自上而下注意力。这些方法在保持音频分离质量的同时降低了计算成本,在处理复杂混合音频时也展现出优异性能。然而,AVSS模型中跨模态的自上而下注意力与横向注意力 (lateral attention) 的高效整合仍是一个未解难题。

3. The proposed model

3. 提出的模型

Let $\mathbf{A}\in R^{1\times T_{a}}$ and $\mathbf{V}\in R^{H\times W\times T_{v}}$ represent the audio and video streams of an individual speaker, respectively, where $T_{a}$ denotes the audio length; $H,W$ , and $T_{v}$ denote the height, width, and the number of lip frames, respectively. We aim to separate a high-quality, clean single-speaker audio A from the noisy speech $\mathbf{S}\in R^{1\times T_{a}}$ based on the video cues $\mathbf{V}$ and filter out the remaining speech components (other interfering speakers). Specifically, our pipeline consists of four modules: audio encoder, video encoder, separation network and audio decoder (see Figure 1A).

设 $\mathbf{A}\in R^{1\times T_{a}}$ 和 $\mathbf{V}\in R^{H\times W\times T_{v}}$ 分别表示单个说话者的音频和视频流,其中 $T_{a}$ 表示音频长度; $H,W$ 和 $T_{v}$ 分别表示高度、宽度和唇部帧数。我们的目标是根据视频线索 $\mathbf{V}$ 从带噪语音 $\mathbf{S}\in R^{1\times T_{a}}$ 中分离出高质量、清晰的单说话者音频 $\mathbf{A}$,并滤除其余语音成分(其他干扰说话者)。具体而言,我们的流程包含四个模块:音频编码器、视频编码器、分离网络和音频解码器(见图 1A)。

The overall pipeline of the IIANet is summarized as follows. First, we obtain a video containing two speakers from the streaming media, and the lips are extracted from each image frame. Second, we encode the lip frames and noisy speech into lip embeddings $\mathbf{E}_{V}\in R^{N_{v}\times T_{v}}$ and noisy speech embeddings $\mathbf{E}_{S}\in R^{N_{a}\times T_{a}^{\prime}}$ using a video encoder same as CTCNet (Li et al., 2022) and an audio encoder (a 1D convolutional layer), respectively, where $N_{v}$, $N_{a}$ and $T_{a}^{\prime}$ are the visual and audio embedding dimensions and the number of frames for audio embedding, respectively. Third, the separation network takes $\mathbf{E}_{S}$ and $\mathbf{E}_{V}$ as inputs and outputs a soft mask $\mathbf{M}\in R^{N_{a}\times T_{a}^{\prime}}$ for the target speaker. We multiply $\mathbf{E}_{S}$ with $\mathbf{M}$ to estimate the target speaker’s speech embedding $\bar{\mathbf{E}}_{S}=\mathbf{E}_{S}\odot\mathbf{M}$, where “$\odot$” denotes the element-wise product. Finally, we utilize the audio decoder (a 1D transposed convolutional layer) to convert $\bar{\mathbf{E}}_{S}$ back to the waveform, generating the target speaker's speech $\mathbf{A}\in R^{1\times T_{a}}$.

IIANet的整体流程概述如下。首先,我们从流媒体中获取包含两位说话者的视频,并从每帧图像中提取唇部区域。其次,使用与CTCNet (Li et al., 2022)相同的视频编码器和音频编码器(一维卷积层),分别将唇部帧和含噪语音编码为唇部嵌入 $\mathbf{E}_{V}\in R^{N_{v}\times T_{v}}$ 和含噪语音嵌入 $\mathbf{E}_{S}\in R^{N_{a}\times T_{a}^{\prime}}$,其中 $N_{v}$、$N_{a}$ 和 $T_{a}^{\prime}$ 分别表示视觉与音频嵌入维度以及音频嵌入的帧数。接着,分离网络以 $\mathbf{E}_{S}$ 和 $\mathbf{E}_{V}$ 作为输入,输出目标说话者的软掩码 $\mathbf{M}\in R^{N_{a}\times T_{a}^{\prime}}$。我们将 $\mathbf{E}_{S}$ 与 $\mathbf{M}$ 相乘,得到目标说话者的语音嵌入估计 $\bar{\mathbf{E}}_{S}=\mathbf{E}_{S}\odot\mathbf{M}$,其中 “$\odot$” 表示逐元素乘积。最后,利用音频解码器(一维转置卷积层)将 $\bar{\mathbf{E}}_{S}$ 转换回波形,生成目标说话者的语音 $\mathbf{A}\in R^{1\times T_{a}}$。
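
下面给出一个极简的 PyTorch 流程草图,用来说明上述四个模块的串联方式(音频编码器 → 分离网络产生掩码 → 解码器重建波形)。其中卷积核大小、通道数以及 `video_encoder`、`separator` 等子模块均为示意性假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn

class AVSSPipeline(nn.Module):
    """示意性的四模块流水线:音频编码器、视频编码器、分离网络、音频解码器。"""
    def __init__(self, video_encoder, separator, n_audio=512, kernel=21, stride=10):
        super().__init__()
        self.audio_encoder = nn.Conv1d(1, n_audio, kernel, stride=stride)         # 得到 E_S
        self.video_encoder = video_encoder                                         # 得到 E_V(假设的唇读前端)
        self.separator = separator                                                 # 输出软掩码 M
        self.audio_decoder = nn.ConvTranspose1d(n_audio, 1, kernel, stride=stride)

    def forward(self, mixture, lips):
        # mixture: (B, 1, T_a);lips: (B, 1, T_v, 88, 88)
        e_s = self.audio_encoder(mixture)      # (B, N_a, T_a')
        e_v = self.video_encoder(lips)         # (B, N_v, T_v)
        mask = self.separator(e_s, e_v)        # (B, N_a, T_a')
        return self.audio_decoder(e_s * mask)  # 估计的目标说话人波形
```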

3.1. Audio-visual separation network

3.1. 视听分离网络

The core architecture of IIANet is an audio-visual separation network consisting of two types of components: (A) intraattention (IntraA) blocks, (B) inter-attention (InterA) blocks (distributed at different locations: top (InterA-T), middle (InterA-M) and bottom (InterA-B)). The architecture of the separation network is illustrated in Figure 1B, which is described as follows:

IIANet的核心架构是一个由两种组件组成的视听分离网络:(A) 内部注意力 (IntraA) 模块,(B) 交互注意力 (InterA) 模块(分布于不同位置:顶部 (InterA-T)、中部 (InterA-M) 和底部 (InterA-B))。分离网络的架构如图 1B 所示,具体描述如下:

  1. Bottom-up pass. The audio network and video network take $\mathbf{E}_{S}$ and $\mathbf{E}_{V}$ as inputs and output multi-scale auditory $\{\mathbf{S}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}\mid i=0,1,\dots,D\}$ and visual $\{\mathbf{V}_{i}\in R^{N_{v}\times\frac{T_{v}}{2^{i}}}\mid i=0,1,\dots,D\}$ features, where $D$ denotes the total number of convolutional layers, each with a kernel size of 5 and a stride of 2, followed by a global normalization layer (GLN) (Luo & Mesgarani, 2019). In Figure 1B, we show an example model with $D=3$.
  1. 自底向上传递。音频网络和视频网络以 $\mathbf{E}_{S}$ 和 $\mathbf{E}_{V}$ 作为输入,输出多尺度听觉特征 $\{\mathbf{S}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}\mid i=0,1,\dots,D\}$ 和视觉特征 $\{\mathbf{V}_{i}\in R^{N_{v}\times\frac{T_{v}}{2^{i}}}\mid i=0,1,\dots,D\}$,其中 $D$ 表示卷积层总数(每层采用核大小为5、步长为2的卷积操作,后接全局归一化层 (GLN) (Luo & Mesgarani, 2019))。在图1B中,我们展示了 $D=3$ 的示例模型。
  2. AV fusion through the InterA-T block. The multi-scale auditory $\mathbf{S}_{i}$ and visual $\mathbf{V}_{i}$ features are fused using the InterA-T block (Figure 2A) at the top of the separation network to obtain the inter-modal global features $\mathbf{S}_{G}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{D}}}$ and $\mathbf{V}_{G}\in R^{N_{v}\times\frac{T_{v}}{2^{D}}}$.
  2. 通过InterA-T块实现视听融合。多尺度听觉特征 $\mathbf{S}_{i}$ 和视觉特征 $\mathbf{V}_{i}$ 在分离网络顶部的InterA-T块(图2A)中进行融合,以获得跨模态全局特征 $\mathbf{S}_{G}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{D}}}$ 和 $\mathbf{V}_{G}\in R^{N_{v}\times\frac{T_{v}}{2^{D}}}$。
  3. AV fusion in the top-down pass through IntraA and InterA-M blocks. Firstly, $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ are employed to modulate $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$ within each modality using top-down global IntraA blocks, yielding auditory $\bar{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$ and visual $\bar{\mathbf{V}}_{i}\in R^{N_{v}\times\frac{T_{v}}{2^{i}}}$ features. Secondly, the InterA-M block takes the auditory $\bar{\mathbf{S}}_{i}$ and visual $\bar{\mathbf{V}}_{i}$ features at the same temporal scale as inputs, and outputs the modulated auditory features $\tilde{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$. Finally, the top-down pass generates the auditory $\check{\mathbf{S}}_{0}\in R^{N_{a}\times T_{a}^{\prime}}$ and visual $\check{\mathbf{V}}_{0}\in R^{N_{v}\times T_{v}}$ features at the maximum temporal scale using top-down local IntraA blocks.
  3. 通过IntraA和InterA-M模块在自上而下传递中进行视听融合。首先,利用 $\mathbf{S}_{G}$ 和 $\mathbf{V}_{G}$ 通过自上而下的全局IntraA模块调制各模态内的 $\mathbf{S}_{i}$ 和 $\mathbf{V}_{i}$,生成听觉特征 $\bar{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$ 与视觉特征 $\bar{\mathbf{V}}_{i}\in R^{N_{v}\times\frac{T_{v}}{2^{i}}}$。接着,InterA-M模块将处于相同时间尺度的听觉特征 $\bar{\mathbf{S}}_{i}$ 和视觉特征 $\bar{\mathbf{V}}_{i}$ 作为输入,输出调制后的听觉特征 $\tilde{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$。最后,自上而下传递过程通过局部IntraA模块生成最大时间尺度的听觉特征 $\check{\mathbf{S}}_{0}\in R^{N_{a}\times T_{a}^{\prime}}$ 和视觉特征 $\check{\mathbf{V}}_{0}\in R^{N_{v}\times T_{v}}$。
  4. AV fusion through the InterA-B block. The InterA-B block (Figure 2D) takes the auditory $\check{\mathbf{S}}_{0}$ and visual $\check{\mathbf{V}}_{0}$ features as input and outputs audio-visual features $\bar{\mathbf{E}}_{S,t}\in R^{N_{a}\times T_{a}^{\prime}}$ and $\bar{\mathbf{E}}_{V,t}\in R^{N_{v}\times T_{v}}$, where $t$ denotes the cycle index of audio-visual fusion.
  4. 通过InterA-B块实现视听融合。InterA-B块(图 2D)以听觉特征 $\check{\mathbf{S}}_{0}$ 和视觉特征 $\check{\mathbf{V}}_{0}$ 作为输入,输出视听特征 $\bar{\mathbf{E}}_{S,t}\in R^{N_{a}\times T_{a}^{\prime}}$ 和 $\bar{\mathbf{E}}_{V,t}\in R^{N_{v}\times T_{v}}$,其中 $t$ 表示视听融合的循环索引。
  5. Cycle of audio and video networks. The above steps complete the first cycle of the separation network. For the $t$-th cycle where $t\geq2$, the audio-visual features $\bar{\mathbf{E}}_{S,t-1}$ and $\bar{\mathbf{E}}_{V,t-1}$ obtained in the $(t-1)$-th cycle, instead of $\mathbf{E}_{S}$ and $\mathbf{E}_{V}$, serve as inputs to the separation network, and we repeat the above steps. After $N_{F}$ cycles, the auditory features $\bar{\mathbf{E}}_{S,t}$ are refined alone for $N_{S}$ cycles in the audio network to reconstruct high-quality audio. Finally, we pass the output of the final cycle through an activation function (ReLU) to yield the separation network’s output $\mathbf{M}$, as illustrated in Figure 1A.
  5. 音视频网络循环。上述步骤完成了分离网络的第一个循环。对于第 $t$ 次循环($t\geq2$),将第 $(t-1)$ 次循环中获得的视听特征 $\bar{\mathbf{E}}_{S,t-1}$ 和 $\bar{\mathbf{E}}_{V,t-1}$(而非初始的 $\mathbf{E}_{S}$ 和 $\mathbf{E}_{V}$)作为分离网络的输入,并重复上述步骤。经过 $N_{F}$ 次循环后,听觉特征 $\bar{\mathbf{E}}_{S,t}$ 将在音频网络中单独进行 $N_{S}$ 次循环以重构高质量音频。最终,我们将最后一个循环的输出通过激活函数(ReLU)生成分离网络的输出 $\mathbf{M}$,如图 1A 所示。

Steps 2-4 and the blocks mentioned in these steps are described in detail in what follows.

步骤2-4及其中提到的模块将在下文中详细描述。
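
在进入各步骤细节之前,这里先用一段示意性代码勾勒分离网络的循环结构(第 1-5 步):先进行 $N_F$ 次视听融合循环,再进行 $N_S$ 次纯音频精炼循环,最后经 ReLU 得到掩码 $\mathbf{M}$。其中 `av_cycle`、`audio_cycle`、`mask_head` 以及默认循环次数均为假设的占位实现。

```python
import torch

def separation_network(e_s, e_v, av_cycle, audio_cycle, mask_head, n_f=2, n_s=3):
    # av_cycle(s, v)  -> 一次自底向上+自上而下的视听融合循环(步骤1-4),返回 (E_S,t, E_V,t)
    # audio_cycle(s)  -> 一次纯音频精炼循环
    # mask_head(s)    -> 将特征映射回 N_a 个通道(例如 1x1 卷积)
    s, v = e_s, e_v
    for _ in range(n_f):        # N_F 次视听融合循环
        s, v = av_cycle(s, v)
    for _ in range(n_s):        # N_S 次纯音频循环
        s = audio_cycle(s)
    return torch.relu(mask_head(s))   # 软掩码 M
```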

Step 2: AV fusion through the InterA-T block. In the InterA-T block, we first utilize an average pooling layer with a pooling ratio $2^{D-i}$ in the temporal dimension to down-sample the temporal dimensions of $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$ to $\frac{T_{a}^{\prime}}{2^{D}}$ and $\frac{T_{v}}{2^{D}}$, respectively, and then merge them to obtain the global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ using the InterA-T block (Figure 2A). The detailed process of the InterA-T block is described by the equations below:

步骤2:通过InterA-T块进行视听融合。在InterA-T块中,我们首先在时间维度上使用池化比例为 $2^{D-i}$ 的平均池化层,将 $\mathbf{S}_{i}$ 和 $\mathbf{V}_{i}$ 的时间维度分别下采样至 $\frac{T_{a}^{\prime}}{2^{D}}$ 和 $\frac{T_{v}}{2^{D}}$,然后通过InterA-T块(图2A)合并它们以获得全局特征 $\mathbf{S}_{G}$ 和 $\mathbf{V}_{G}$。InterA-T块的详细过程由以下公式描述:


Figure 1. The overall pipeline of IIANet. (A) IIANet consists of four main components: audio encoder, video encoder, separation network, and audio decoder. The red and blue tildes indicate that the same module is repeated several times. (B) The separation network contains two types of attention blocks: IntraA and InterA (InterA-T, InterA-M, InterA-B) blocks. The dashed lines indicate the use of global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ as top-down attention modulation for the multi-scale features $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$. All blocks use different parameters, but the parameters are kept the same across different cycles.

图 1: IIANet的整体流程。(A) IIANet包含四个主要组件:音频编码器、视频编码器、分离网络和音频解码器。红色和蓝色波浪线表示相同模块被重复多次。(B) 分离网络包含两种注意力块:IntraA和InterA (InterA-T, InterA-M, InterA-B)块。虚线表示使用全局特征 $\mathbf{S}_{G}$ 和 $\mathbf{V}_{G}$ 作为多尺度特征 $\mathbf{S}_{i}$ 和 $\mathbf{V}_{i}$ 的自顶向下注意力调制。所有块使用不同参数,但这些参数在不同循环周期间保持一致。

$$
\begin{aligned}
\pmb{f}_{S} &= \sum_{i=0}^{D-1}p_{i}(\mathbf{S}_{i})+\mathbf{S}_{D}, \qquad \pmb{f}_{V}=\sum_{i=0}^{D-1}p_{i}(\mathbf{V}_{i})+\mathbf{V}_{D},\\
\mathbf{S}_{G} &= \mathrm{FFN}_{S}\left(\pmb{f}_{S}\odot\sigma(\mathcal{Q}(\pmb{f}_{V}))\right),\\
\mathbf{V}_{G} &= \mathrm{FFN}_{V}\left(\pmb{f}_{V}\odot\sigma(\mathcal{Q}(\pmb{f}_{S}))\right),
\end{aligned}
$$

where $\pmb{f}_{S}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{D}}}$ and $\pmb{f}_{V}\in R^{N_{v}\times\frac{T_{v}}{2^{D}}}$ denote the cumulative multi-scale auditory and visual features, $\mathrm{FFN}(\cdot)$ denotes a feed-forward network with three convolutional layers followed by a GLN, $\mathcal{Q}$ denotes a convolutional layer followed by a GLN, $p_{i}(\cdot)$ denotes an average pooling layer and $\sigma$ denotes the sigmoid function. Here, $p_{i}(\cdot)$ uses different pooling ratios for different $i$ such that the temporal dimensions of all $p_{i}(\mathbf{S}_{i})$ are $\frac{T_{a}^{\prime}}{2^{D}}$ and the temporal dimensions of all $p_{i}(\mathbf{V}_{i})$ are $\frac{T_{v}}{2^{D}}$. The hyperparameters of $\mathcal{Q}$ are specified such that the resulting audio and video features have the same embedding dimension. The $\sigma$ function, receiving information from one modality, acts as selective attention, modulating the other modality. The global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ are used to guide the separation process in the top-down pass.

其中 $\pmb{f}_{S}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{D}}}$ 和 $\pmb{f}_{V}\in R^{N_{v}\times\frac{T_{v}}{2^{D}}}$ 表示累积的多尺度听觉和视觉特征,$\mathrm{FFN}(\cdot)$ 表示由三个卷积层后接GLN构成的前馈网络,$\mathcal{Q}$ 表示一个卷积层后接GLN,$p_{i}(\cdot)$ 表示平均池化层,$\sigma$ 表示sigmoid函数。此处 $p_{i}(\cdot)$ 对不同 $i$ 使用不同池化比例,使得所有 $p_{i}(\mathbf{S}_{i})$ 的时间维度为 $\frac{T_{a}^{\prime}}{2^{D}}$,所有 $p_{i}(\mathbf{V}_{i})$ 的时间维度为 $\frac{T_{v}}{2^{D}}$。$\mathcal{Q}$ 的超参数设置使得生成的音频和视频特征具有相同的嵌入维度。接收单模态信息的 $\sigma$ 函数作为选择性注意力机制来调制另一模态。全局特征 $\mathbf{S}_{G}$ 和 $\mathbf{V}_{G}$ 用于在自上而下传递过程中指导分离过程。
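
下面是 InterA-T 块(公式 (1))的一个参考实现草图:用自适应平均池化近似 $p_i(\cdot)$、用 GroupNorm 近似 GLN,并额外用插值对齐音视频不同的时间长度;这些处理细节与通道数均为本草图的假设,未必与官方实现一致。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(c):
    # 近似论文中的 FFN:三层一维卷积 + 全局归一化(此处以 GroupNorm(1, c) 近似 GLN)
    return nn.Sequential(nn.Conv1d(c, c, 1), nn.PReLU(),
                         nn.Conv1d(c, c, 5, padding=2), nn.PReLU(),
                         nn.Conv1d(c, c, 1), nn.GroupNorm(1, c))

class InterAT(nn.Module):
    def __init__(self, n_a, n_v):
        super().__init__()
        self.q_v2a = nn.Sequential(nn.Conv1d(n_v, n_a, 1), nn.GroupNorm(1, n_a))  # Q: 视觉 -> 听觉维度
        self.q_a2v = nn.Sequential(nn.Conv1d(n_a, n_v, 1), nn.GroupNorm(1, n_v))  # Q: 听觉 -> 视觉维度
        self.ffn_s, self.ffn_v = ffn(n_a), ffn(n_v)

    def forward(self, S_list, V_list):
        # S_list / V_list: 多尺度特征 S_0..S_D 与 V_0..V_D
        t_a, t_v = S_list[-1].shape[-1], V_list[-1].shape[-1]
        f_s = sum(F.adaptive_avg_pool1d(s, t_a) for s in S_list[:-1]) + S_list[-1]  # sum p_i(S_i) + S_D
        f_v = sum(F.adaptive_avg_pool1d(v, t_v) for v in V_list[:-1]) + V_list[-1]
        attn_for_s = torch.sigmoid(F.interpolate(self.q_v2a(f_v), size=t_a))        # 时间长度对齐为草图假设
        attn_for_v = torch.sigmoid(F.interpolate(self.q_a2v(f_s), size=t_v))
        return self.ffn_s(f_s * attn_for_s), self.ffn_v(f_v * attn_for_v)           # S_G, V_G
```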

Step 3: AV fusion in the top-down pass through IntraA and InterA-M blocks. Given the multi-scale audio-visual features $\{\mathbf{S}_{i},\mathbf{V}_{i}\}$ and the global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ as inputs, we employ the global IntraA blocks $\phi(\mathbf{x},\mathbf{y})$ (Figure 2B) in the top-down pass (the dotted lines in Figure 1B) to extract auditory $\bar{\mathbf{S}}_{i}$ and visual $\bar{\mathbf{V}}_{i}$ features at each temporal scale,

步骤3:通过IntraA和InterA-M块在自上而下传递中进行视听融合。给定多尺度视听特征 $\{\mathbf{S}_{i},\mathbf{V}_{i}\}$ 以及全局特征 $\mathbf{S}_{G}$ 和 $\mathbf{V}_{G}$ 作为输入,我们在自上而下传递过程中(图1B中的虚线)使用全局IntraA块 $\phi(\mathbf{x},\mathbf{y})$(图2B)来提取每个时间尺度的听觉特征 $\bar{\mathbf{S}}_{i}$ 和视觉特征 $\bar{\mathbf{V}}_{i}$,

$$
\bar{\mathbf{x}}=\phi(\mathbf{x},\mathbf{y})=\sigma\left(\boldsymbol{\mathcal{Q}}\left(\mu\left(\mathbf{y}\right)\right)\right)\odot\mathbf{x}+\boldsymbol{\mathcal{Q}}\left(\mu\left(\mathbf{y}\right)\right)
$$

where $\mu(\cdot)$ denotes interpolation up-sampling. For audio signals, $\bar{\mathbf{S}}_{i}=\phi(\mathbf{S}_{i},\mathbf{S}_{G})$; for video signals, $\bar{\mathbf{V}}_{i}=\phi(\mathbf{V}_{i},\mathbf{V}_{G})$. The hyperparameters of $\mu$ are specified such that the resulting audio and video signals have the same temporal dimension. Since these attention blocks use the global features $\mathbf{S}_{G}$ and $\mathbf{V}_{G}$ to modulate the intermediate features $\mathbf{S}_{i}$ and $\mathbf{V}_{i}$ when reconstructing these features in the top-down process, they are called global IntraA blocks. This is inspired by the global attention mechanism used in an

其中 $\mu(\cdot)$ 表示插值上采样。对于音频信号,$\bar{\mathbf{S}}_{i}=\phi(\mathbf{S}_{i},\mathbf{S}_{G})$;对于视频信号,$\bar{\mathbf{V}}_{i}=\phi(\mathbf{V}_{i},\mathbf{V}_{G})$。$\mu$ 的超参数设置需确保生成的音频和视频信号具有相同的时间维度。由于这些注意力块在自上而下过程中重建特征时使用全局特征 $\mathbf{S}_{G}$ 和 $\mathbf{V}_{G}$ 来调节中间特征 $\mathbf{S}_{i}$ 和 $\mathbf{V}_{i}$,因此被称为全局IntraA块。这一设计灵感来源于一个


Figure 2. Flow diagram of IntraA and InterA blocks: (A) InterA-T block, (B) IntraA block, (C) InterA-M block and (D) InterA-B block in the IIANet, where $\odot$ denotes element-wise product and $\sigma$ denotes the sigmoid function.

图 2: IntraA和InterA模块流程图:(A) InterA-T模块,(B) IntraA模块,(C) InterA-M模块和(D) InterA-B模块在IIANet中的结构,其中$\odot$表示逐元素乘积,$\sigma$表示sigmoid函数。

AOSS model (Li et al., 2023). But in that model, the global feature is simply upsampled, then goes through a sigmoid function and is multiplied with the intermediate features $\mathbf{x}$. Mathematically, the process is written as follows

AOSS模型 (Li et al., 2023) 中使用的全局注意力机制。但在该模型中,全局特征仅经过简单上采样,随后通过sigmoid函数并与中间特征 $\mathbf{x}$ 相乘。该过程的数学表达式如下

$$
\bar{\mathbf{x}}=\phi^{\prime}(\mathbf{x},\mathbf{y})=\sigma(\mu(\mathbf{y}))\odot\mathbf{x}.
$$

We empirically found that $\phi$ worked better than $\phi^{\prime}$ (see Appendix A).

我们通过实证发现 $\phi$ 的效果优于 $\phi^{\prime}$ (详见附录 A)。
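
下面给出全局 IntraA 块 $\phi$(公式 (2))及简化变体 $\phi^{\prime}$(公式 (3))的示意实现,其中 $\mathcal{Q}$ 以 1×1 卷积加 GroupNorm 近似、$\mu$ 以最近邻插值近似,均为本草图的假设:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalIntraA(nn.Module):
    """公式 (2) 的示意实现:phi(x, y) = sigmoid(Q(mu(y))) * x + Q(mu(y))。"""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Sequential(nn.Conv1d(channels, channels, 1),
                               nn.GroupNorm(1, channels))        # Q(以 GroupNorm 近似 GLN)

    def forward(self, x, y):
        y_up = self.q(F.interpolate(y, size=x.shape[-1]))        # Q(mu(y)),mu 取最近邻插值
        return torch.sigmoid(y_up) * x + y_up

def phi_prime(x, y):
    """公式 (3):早先 AOSS 模型中更简单的变体 sigmoid(mu(y)) * x。"""
    return torch.sigmoid(F.interpolate(y, size=x.shape[-1])) * x
```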

In contrast to previous AVSS methods (Gao & Grauman, 2021; Li et al., 2022), we investigate which parts of the audio network should be attended to under the guidance of visual features at the same temporal scale. We utilize the visual features $\bar{\mathbf{V}}_{i}$ at the same temporal scale as $\bar{\mathbf{S}}_{i}$ to modulate the auditory features $\bar{\mathbf{S}}_{i}$ using the InterA-M blocks (Figure 2C),

与以往的AVSS方法 (Gao & Grauman, 2021; Li et al., 2022) 不同,我们研究了音频网络的哪些部分应在相同时间尺度下接受视觉特征的引导调制。我们使用与 $\bar{\mathbf{S}}_{i}$ 相同时间尺度的视觉特征 $\bar{\mathbf{V}}_{i}$,通过 InterA-M 模块 (图 2C) 对听觉特征 $\bar{\mathbf{S}}_{i}$ 进行调制,

$$
\tilde{\mathbf{S}}_{i}=\sigma(\mathcal{Q}(\mu(\bar{\mathbf{V}}_{i})))\odot\bar{\mathbf{S}}_{i},\quad\forall i\in[0,D],
$$

where $\tilde{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$ denotes the modulated auditory features. As before, the hyperparameters of $\mathcal{Q}$ and $\mu$ are specified such that the two terms on the two sides of $\odot$ have the same dimension. In this way, the InterA-M blocks are expected to extract the target speaker’s sound features according to the target speaker’s visual features at the same scale.

其中 $\tilde{\mathbf{S}}_{i}\in R^{N_{a}\times\frac{T_{a}^{\prime}}{2^{i}}}$ 表示调制后的听觉特征。与之前相同,通过指定 $\mathcal{Q}$ 和 $\mu$ 的超参数,使得 $\odot$ 两侧的两项具有相同维度。这样,InterA-M 模块就能根据相同尺度的目标说话者视觉特征来提取其声音特征。
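
公式 (4) 的 InterA-M 块可以按如下草图实现:同尺度的视觉特征经 $\mathcal{Q}$、上采样与 sigmoid 后,作为注意力图逐元素调制听觉特征;通道映射方向(视觉维度映射到听觉维度)为草图假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAM(nn.Module):
    """公式 (4) 的示意实现:同尺度视觉特征作为 sigmoid 注意力图调制听觉特征。"""
    def __init__(self, n_v, n_a):
        super().__init__()
        self.q = nn.Sequential(nn.Conv1d(n_v, n_a, 1), nn.GroupNorm(1, n_a))  # Q:视觉维度 -> 听觉维度

    def forward(self, s_bar, v_bar):
        # s_bar: (B, N_a, T_a'/2^i);v_bar: (B, N_v, T_v/2^i)
        attn = torch.sigmoid(F.interpolate(self.q(v_bar), size=s_bar.shape[-1]))  # sigma(Q(mu(V_i)))
        return attn * s_bar                                                        # 调制后的 S~_i
```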

Then, we reconstruct the auditory and visual features separately to obtain the features $(\check{\mathbf{S}}_{0},\check{\mathbf{V}}_{0})$ in a top-down pass (Figure 1B). We leverage the local top-down attention mechanism (Li et al., 2023), and the corresponding blocks are called local IntraA blocks, which are able to fuse as many multi-scale features as possible while reducing information redundancy. We start from the $(D-1)$-th layer. The attention signals are $\tilde{\mathbf{S}}_{D}$ and $\bar{\mathbf{V}}_{D}$. Therefore,

接着,我们分别重建听觉和视觉特征,通过自上而下的传递获得特征 $(\check{\mathbf{S}}_{0},\check{\mathbf{V}}_{0})$ (图 1B)。我们采用局部自上而下的注意力机制 (Li et al., 2023),对应的模块称为局部 IntraA 块,能够在减少信息冗余的同时融合尽可能多的多尺度特征。我们从第 $(D-1)$ 层开始,注意力信号为 $\tilde{\mathbf{S}}_{D}$ 和 $\bar{\mathbf{V}}_{D}$。因此,

$$
\check{\mathbf{S}}_{D-1}=\phi(\tilde{\mathbf{S}}_{D-1},\tilde{\mathbf{S}}_{D}),\quad\check{\mathbf{V}}_{D-1}=\phi(\bar{\mathbf{V}}_{D-1},\bar{\mathbf{V}}_{D}).
$$

For $i=D-2,\dots,0$, the attention signals are $\check{\mathbf{S}}_{i+1}$ and $\check{\mathbf{V}}_{i+1}$, therefore

对于 $i=D-2,\dots,0$,注意力信号为 $\check{\mathbf{S}}_{i+1}$ 和 $\check{\mathbf{V}}_{i+1}$,因此

$$
\check{\mathbf{S}}_{i}=\phi(\tilde{\mathbf{S}}_{i},\check{\mathbf{S}}_{i+1}),\quad\check{\mathbf{V}}_{i}=\phi(\bar{\mathbf{V}}_{i},\check{\mathbf{V}}_{i+1}).
$$

Please note that different IntraA blocks within one cycle, as depicted in Figure 1B, use different parameters.

请注意,同一周期内的不同 IntraA 块(如图 1B 所示)使用不同的参数。
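
上面两组递推式对应的自上而下重建过程可以写成如下示意代码,其中 `intra_s[i]`、`intra_v[i]` 为各尺度独立参数的 IntraA 块(假设的模块接口,例如前文草图中的 `GlobalIntraA`):

```python
def top_down_pass(S_tilde, V_bar, intra_s, intra_v):
    # S_tilde / V_bar:尺度 0..D 的特征列表;intra_s[i] / intra_v[i]:各尺度独立参数的 IntraA 块
    D = len(S_tilde) - 1
    s_check, v_check = [None] * D, [None] * D
    s_check[D - 1] = intra_s[D - 1](S_tilde[D - 1], S_tilde[D])   # 注意力信号为 S~_D
    v_check[D - 1] = intra_v[D - 1](V_bar[D - 1], V_bar[D])       # 注意力信号为 V_bar_D
    for i in range(D - 2, -1, -1):
        s_check[i] = intra_s[i](S_tilde[i], s_check[i + 1])
        v_check[i] = intra_v[i](V_bar[i], v_check[i + 1])
    return s_check[0], v_check[0]                                  # 最大时间尺度的特征
```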

Step 4: AV fusion through the InterA-B block. In CTCNet (Li et al., 2022), the global audio-visual features fuse all auditory and visual multi-scale features using a concatenation strategy, leading to high computational complexity. Here, we only perform inter-modal fusion on the finest-grained auditory and visual features ($\check{\mathbf{S}}_{0}$ and $\check{\mathbf{V}}_{0}$) using an attention strategy, greatly reducing computational complexity.

步骤4:通过InterA-B块进行视听融合。在CTCNet (Li et al., 2022) 中,全局视听特征通过拼接策略融合了所有听觉和视觉多尺度特征,导致计算复杂度较高。本文仅对最精细粒度的听觉特征 ($\check{\mathbf{S}}_{0}$) 和视觉特征 ($\check{\mathbf{V}}_{0}$) 采用注意力策略进行跨模态融合,显著降低了计算复杂度。

Specifically, we reduce the impact of redundant features by incorporating the InterA-B block (Figure 2D) between the finest-grained auditory and visual features $\check{\mathbf{S}}_{0}$ and $\check{\mathbf{V}}_{0}$. In particular, the process of global audio-visual fusion is as follows:

具体来说,我们通过在最细粒度的听觉特征 $\check{\mathbf{S}}_{0}$ 和视觉特征 $\check{\mathbf{V}}_{0}$ 之间加入InterA-B模块(图2D)来降低冗余特征的影响。全局视听融合的具体过程如下:

$$
\begin{aligned}
\bar{\mathbf{E}}_{S,t} &= \check{\mathbf{S}}_{0}+\mathcal{Q}\left(\mu(\check{\mathbf{V}}_{0})\odot\sigma\left(\mathcal{Q}\left(\check{\mathbf{S}}_{0}\right)\right)\right),\\
\bar{\mathbf{E}}_{V,t} &= \check{\mathbf{V}}_{0}+\mathcal{Q}\left(\mu(\check{\mathbf{S}}_{0})\odot\sigma\left(\mathcal{Q}\left(\check{\mathbf{V}}_{0}\right)\right)\right).
\end{aligned}
$$

As in (1), the hyperparameters of $\mathcal{Q}$ and $\mu$ are specified such that the element-wise products and additions in (5) function correctly. The working process of this block is simple. One modality's features are modulated by the other modality's features through an attention function $\sigma$, and then the results are added to the original features. Several convolutions and up-samplings are interleaved in the process to ensure that the dimensions of the results are appropriate. We keep the original features in the final results by addition because we do not want the final features to be altered too much by the other modality's information.

与 (1) 相同,$\mathcal{Q}$ 和 $\mu$ 的超参数经过设定,确保 (5) 中的逐元素乘积和加法运算正确执行。该模块的工作流程较为简单:一种模态特征通过注意力函数 $\sigma$ 受另一模态特征调制,随后结果被叠加至原始特征上。过程中穿插了若干卷积和上采样操作以保证结果维度适配。我们通过加法保留原始特征于最终结果中,以避免其他模态信息对最终特征造成过度干扰。

Finally, the output auditory and visual features $\bar{\mathbf{E}}_{S,t}$ and $\bar{\mathbf{E}}_{V,t}$ serve as inputs for the subsequent audio network and video network.

最后,输出的听觉和视觉特征 $\bar{\mathbf{E}}_{S,t}$ 和 $\bar{\mathbf{E}}_{V,t}$ 将作为后续音频网络和视频网络的输入。
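
公式 (5) 的 InterA-B 块可按如下草图实现;其中内外两层 $\mathcal{Q}$ 的通道映射方向($N_a \leftrightarrow N_v$)以及插值对齐方式,是为保证逐元素乘法与残差相加维度一致而做的假设:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAB(nn.Module):
    """公式 (5) 的示意实现:两个模态互相以 sigmoid 注意力调制并做残差相加。"""
    def __init__(self, n_a, n_v):
        super().__init__()
        self.q_s_in  = nn.Sequential(nn.Conv1d(n_a, n_v, 1), nn.GroupNorm(1, n_v))  # sigma(Q(S_0)) 的 Q
        self.q_s_out = nn.Sequential(nn.Conv1d(n_v, n_a, 1), nn.GroupNorm(1, n_a))  # 音频分支外层 Q
        self.q_v_in  = nn.Sequential(nn.Conv1d(n_v, n_a, 1), nn.GroupNorm(1, n_a))  # sigma(Q(V_0)) 的 Q
        self.q_v_out = nn.Sequential(nn.Conv1d(n_a, n_v, 1), nn.GroupNorm(1, n_v))  # 视频分支外层 Q

    def forward(self, s0, v0):
        # s0: (B, N_a, T_a');v0: (B, N_v, T_v)
        v_up = F.interpolate(v0, size=s0.shape[-1])                        # mu(V_0) -> T_a'
        s_up = F.interpolate(s0, size=v0.shape[-1])                        # mu(S_0) -> T_v
        e_s = s0 + self.q_s_out(v_up * torch.sigmoid(self.q_s_in(s0)))     # E_S,t
        e_v = v0 + self.q_v_out(s_up * torch.sigmoid(self.q_v_in(v0)))     # E_V,t
        return e_s, e_v
```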

4. Experiments

4. 实验

4.1. Datasets

4.1. 数据集

Consistent with previous research, we experimented on three commonly used AVSS datasets: LRS2 (Afouras et al., 2018a), LRS3 (Afouras et al., 2018b), and VoxCeleb2 (Chung et al., 2018). Due to the fact that the LRS2 and VoxCeleb2 datasets were collected from YouTube, they present a more complex acoustic environment compared to the LRS3 dataset. Each audio was 2 seconds in length with a sample rate of $16\mathrm{kHz}$ . The mixture was created by randomly selecting two different speakers from datasets and mixing their speeches with signal-to-noise ratios between -5 dB and 5 dB. Lip frames were synchronized with audio, having a frame rate of 25 FPS and a size of $88\times88$ grayscale images. We used the same dataset split consistent with previous works (Li et al., 2022; Gao & Grauman, 2021; Lee et al., 2021). See Appendix B for details.

与先前研究一致,我们在三个常用的视听语音分离(AVSS)数据集上进行了实验:LRS2 (Afouras等人, 2018a)、LRS3 (Afouras等人, 2018b)和VoxCeleb2 (Chung等人, 2018)。由于LRS2和VoxCeleb2数据集采集自YouTube,其声学环境比LRS3数据集更为复杂。每段音频时长为2秒,采样率为$16\mathrm{kHz}$。混合音频是通过从数据集中随机选择两名不同说话者,将其语音以-5 dB至5 dB的信噪比混合而成。唇部帧与音频同步,帧率为25 FPS,尺寸为$88\times88$的灰度图像。我们采用了与先前工作(Li等人, 2022; Gao & Grauman, 2021; Lee等人, 2021)相同的数据集划分方式。详见附录B。
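
混合语音的构造方式大致如下面的示意代码所示:按目标信噪比缩放干扰说话者的语音后与目标语音相加(具体缩放与混合细节以各基准的官方脚本为准,此处仅为假设性草图):

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    # 将干扰语音缩放到指定信噪比后与目标语音相加;二者均为 2 秒、16 kHz 的波形
    p_t = np.mean(target ** 2)
    p_i = np.mean(interference ** 2) + 1e-8
    scale = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10.0)))
    return target + scale * interference

# 每对说话者的 snr_db 可在 [-5, 5] dB 之间均匀采样
```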

4.2. Model configurations and evaluation metrics

4.2. 模型配置与评估指标

The IIANet model was implemented using PyTorch (Paszke et al., 2019), and optimized using the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001. The learning rate was cut by 50% whenever there was no improvement in the best validation loss for 15 consecutive epochs. The training was halted if the best validation loss failed to improve over 30 successive epochs. Gradient clipping was used during training to avert gradient explosion, setting the maximum $L_{2}$ norm to 5. We used the SI-SNR objective function to train IIANet. See Appendix C for details. All experiments were conducted with a batch size of 6, utilizing 8 GeForce RTX 4090 GPUs. The hyperparameters are included in Appendix D.

IIANet模型基于PyTorch (Paszke et al., 2019) 实现,并使用Adam优化器 (Kingma & Ba, 2014) 进行优化,初始学习率为0.001。当最佳验证损失连续15个epoch没有改善时,学习率会降低50%。若最佳验证损失连续30个epoch未提升,则终止训练。训练过程中采用梯度裁剪防止梯度爆炸,最大 $L_{2}$ 范数设为5。我们使用SI-SNR目标函数训练IIANet,详见附录C。所有实验批大小设为6,使用8块GeForce RTX 4090 GPU完成。超参数配置见附录D。
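
上述优化设置大致对应如下 PyTorch 代码草图;其中 `model`、`train_loader`、`validate`、`si_snr` 等均为占位参数,学习率衰减与早停的具体衔接方式也是本草图的假设:

```python
import torch

def train(model, train_loader, validate, si_snr, max_epochs=200):
    # 数值设置取自正文:Adam、初始学习率 0.001、15 个 epoch 无改善则学习率减半、梯度 L2 范数裁剪到 5
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=15)
    for _ in range(max_epochs):                                  # 早停(连续 30 个 epoch 无改善)未在此展示
        for mixture, lips, target in train_loader:
            loss = -si_snr(model(mixture, lips), target).mean()  # 以负 SI-SNR 作为损失
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        scheduler.step(validate(model))
```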

Same as previous research (Gao & Grauman, 2021; Li et al., 2022; Afouras et al., 2018a), the SDRi (Vincent et al., 2006b) and SI-SNRi (Le Roux et al., 2019) were used as metrics for speech separation. See Appendix E for details. To simulate human auditory perception and to assess and quantify the quality of speech signals, we also used PESQ (Union, 2007) as an evaluation metric, which measures the clarity and intelligibility of speech signals, ensuring the reliability of the results.

与先前研究 (Gao & Grauman, 2021; Li et al., 2022; Afouras et al., 2018a) 相同,我们采用 SDRi (Vincent et al., 2006b) 和 SI-SNRi (Le Roux et al., 2019) 作为语音分离的评估指标。详见附录 E。为模拟人类听觉感知并量化评估语音信号质量,我们还使用了 PESQ (Union, 2007) 作为衡量语音信号清晰度与可懂度的评估指标,以确保结果的可靠性。
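
SI-SNR 及其改进量 SI-SNRi 的常见计算方式如下(示意实现,细节以附录 E 与原始文献为准):

```python
import torch

def si_snr(est, ref, eps=1e-8):
    # 尺度不变信噪比(dB),逐条语音去均值后计算
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    return 10 * torch.log10((proj.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))

def si_snri(est, mixture, ref):
    # SI-SNR 改进量:分离结果相对未处理混合语音的提升
    return si_snr(est, ref) - si_snr(mixture, ref)
```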

4.3. Comparisons with state-of-the-art methods

4.3. 与先进方法的对比

To explore the efficiency of the proposed approach, we have designed a faster version, namely IIANet-fast, which runs half of $N_{S}$ cycles compared to IIANet (see Appendix D for detailed parameters). We compared IIANet and IIANet-fast with existing AVSS methods. To conduct a fair comparison, we obtained metric values in original papers or reproduced the results using officially released models by the authors.

为探究所提方法的效率,我们设计了更快版本 IIANet-fast,其运行周期数为 IIANet 的一半(详见附录 D 的参数说明)。我们将 IIANet 和 IIANet-fast 与现有 AVSS 方法进行对比。为确保公平性,所有指标均采用原论文报告值或通过作者官方发布模型复现得出。

Separation quality. As demonstrated in Table 1, IIANet consistently outperformed existing methods across all datasets, with at least a 0.9 dB gain in SI-SNRi. In particular, IIANet excelled in complex environments, notably on the LRS2 and VoxCeleb2 datasets, outperforming AVSS methods by a minimum of 1.7 dB SI-SNRi in separation quality. This significant improvement highlights the importance of IIANet’s hierarchical audio-visual integration. In addition, IIANet-fast also achieved excellent results, which surpassed CTCNet in separation quality with an improvement of 1.1 dB in SI-SNRi on the challenging LRS2 dataset.

分离质量。如表 1 所示,IIANet 在所有数据集上始终优于现有方法,SI-SNRi 至少提高了 0.9 dB。特别是在复杂环境中,IIANet 表现尤为突出,在 LRS2 和 VoxCeleb2 数据集上的分离质量比 AVSS 方法至少高出 1.7 dB SI-SNRi。这一显著改进凸显了 IIANet 分层视听集成的重要性。此外,IIANet-fast 也取得了优异的结果,在具有挑战性的 LRS2 数据集上,其分离质量超越了 CTCNet,SI-SNRi 提高了 1.1 dB。

Model size and computational cost. In practical applications, model size and computational cost are often crucial considerations. We employed three hardware-agnostic metrics to circumvent disparities between different hardware: the number of parameters (Params), Multiply-Accumulate operations (MACs), and GPU memory usage during inference. We also selected two hardware-related metrics to measure inference speed on specific GPU and CPU hardware. The results are shown in Table 2. Please note that some of the AVSS methods (Afouras et al., 2018; Afouras et al., 2020; Lee et al., 2021) in Table 1 are not included in Table 2 because they are not open sourced.

模型规模与计算成本。在实际应用中,模型规模和计算成本往往是关键考量因素。我们采用三种与硬件无关的指标来规避不同硬件间的差异,包括参数量 (Params)、乘加运算量 (MACs) 以及推理时的GPU内存占用。同时选取两项硬件相关指标来衡量特定GPU和CPU硬件上的推理速度,结果如表2所示。需注意表1中部分AVSS方法 (Afouras等人, 2018; Afouras等人, 2020; Lee等人, 2021) 未包含在表2中,因其未开源。
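
参数量与 CPU 推理时间等指标可以用类似下面的方式测量(示意代码,`model` 及输入尺寸为占位假设;MACs 通常需借助额外的 FLOPs 统计工具):

```python
import time
import torch

def profile_model(model):
    # 参数量(百万)与 1 秒输入的 CPU 推理耗时;输入尺寸为示意假设
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    audio = torch.randn(1, 1, 16000)            # 1 秒、16 kHz 音频
    lips = torch.randn(1, 1, 25, 88, 88)        # 25 FPS 的 88x88 灰度唇部帧
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        model(audio, lips)
        cpu_seconds = time.perf_counter() - start
    return params_m, cpu_seconds
```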

Compared to the previous SOTA method CTCNet, IIANet achieves significantly higher separation quality using 11% of the MACs, 44% of the parameters and 16% of the GPU memory. IIANet’s inference on CPU was as fast as CTCNet’s, but was slower on GPU. With better separation quality than CTCNet, IIANet-fast had much lower computational cost (only 7% of CTCNet’s MACs) and substantially higher inference speed (40% of CTCNet’s time on CPU and 84% of CTCNet’s time on GPU). This not only showcases the efficiency of IIANet-fast but also highlights its suitability for deployment in environments with limited hardware resources and strict real-time requirements.

与之前的SOTA方法CTCNet相比,IIANet仅需11%的MACs运算量、44%的参数量和16%的GPU显存即可实现显著更高的分离质量。IIANet在CPU上的推理速度与CTCNet相当,但在GPU上较慢。IIANet-fast在保持优于CTCNet分离质量的同时,计算成本大幅降低(仅需CTCNet的7% MACs运算量),推理速度显著提升(CPU耗时仅为CTCNet的40%,GPU耗时仅为CTCNet的84%)。这不仅证明了IIANet-fast的高效性,更凸显了其在硬件资源有限且需要严格实时性的部署环境中的适用优势。

In conclusion, compared with previous AVSS methods, IIANet achieved a good trade-off between separation quality and computational cost.

综上所述,与以往的AVSS方法相比,IIANet在分离质量与计算成本之间取得了良好的权衡。


| 方法 | LRS2 SI-SNRi | LRS2 SDRi | LRS2 PESQ | LRS3 SI-SNRi | LRS3 SDRi | LRS3 PESQ | VoxCeleb2 SI-SNRi | VoxCeleb2 SDRi | VoxCeleb2 PESQ |
|---|---|---|---|---|---|---|---|---|---|
| AVConvTasNet (Wu 等, 2019) | 12.5 | 12.8 | 2.69 | 11.2 | 11.7 | 2.58 | 9.2 | 9.8 | 2.17 |
| LWTNet (Afouras 等, 2020) | - | 10.8 | - | - | - | - | - | 4.8 | - |
| Visualvoice (Gao & Grauman, 2021) | 11.5 | 11.8 | 2.78 | 9.9 | 10.3 | 2.13 | 9.3 | 10.2 | 2.45 |
| CaffNet-C (Lee 等, 2021) | - | 10.0 | 1.15 | - | 9.8 | - | - | 7.6 | - |
| AVLiT-8 (Martel 等, 2023) | 12.8 | 13.1 | 2.56 | 13.5 | 13.6 | 2.78 | 9.4 | 9.9 | 2.23 |
| CTCNet (Li 等, 2022) | 14.3 | 14.6 | 3.08 | 17.4 | 17.5 | 3.24 | 11.9 | 13.1 | 3.00 |
| IIANet (本文) | 16.0 | 16.2 | 3.23 | 18.3 | 18.5 | 3.28 | 13.6 | 14.3 | 3.12 |
| IIANet-fast (本文) | 15.1 | 15.4 | 3.11 | 17.8 | 17.9 | 3.25 | 12.6 | 13.6 | 3.01 |

Table 1. Separation results of different AVSS methods on the LRS2, LRS3, and VoxCeleb2 datasets. These metrics represent the average values for all speakers in each test set; larger SI-SNRi, SDRi and PESQ values are better. “-” denotes results not reported in the original paper. Optimal and suboptimal performances are highlighted.

表 1. 不同视听语音分离(AVSS)方法在LRS2、LRS3和VoxCeleb2数据集上的分离结果。这些指标代表每个测试集中所有说话者的平均值,其中SI-SNRi、SDRi和PESQ值越大越好。"-"表示原论文未报告的结果。最优和次优性能已高亮显示。

| 方法 | MACs (G) | 参数量 Params (M) | CPU 推理时间 (s) | GPU 推理时间 (ms) |
|---|---|---|---|---|
| AVConvTasNet (Wu等, 2019) | 23.8 | 16.5 | 1.06 | 62.51 |
| Visualvoice (Gao和Grauman, 2021) | 9.7 | 77.8 | 2.98 | 110.31 |
| AVLiT (Martel等, 2023) | 18.2 | 5.8 | 0.46 | 62.51 |
| CTCNet (Li等, 2022) | 167.1 | 7.0 | 1.26 | 84.17 |
| IIANet (本文) | 18.6 | 3.1 | 1.27 | 110.11 |
| IIANet-fast (本文) | 11.9 | 3.1 | 0.52 | 70.94 |

Table 2. Parameter sizes and computational complexities of different AVSS methods. All results are measured with an input audio length of 1 second, a sample rate of $16\mathrm{kHz}$ , and a video frame sample rate of $25\mathrm{FPS}$ . Inference time is measured in the same testing environment without the use of any additional acceleration techniques, such as quantization or pruning.

表 2: 不同AVSS方法的参数量与计算复杂度对比。所有测试结果均基于1秒输入音频(采样率$16\mathrm{kHz}$)和视频帧采样率$25\mathrm{FPS}$的基准条件。推理时间测试环境统一,未使用量化(quantization)或剪枝(pruning)等加速技术。

4.4. Multi-speaker performance

4.4. 多说话人性能

We present results on the LRS2 dataset with 3 and 4 speakers. We created two variants of the dataset for these purposes, named LRS2-3Mix and LRS2-4Mix, both used in our model’s training and testing. Correspondingly, the default LRS2 dataset has a new name LRS2-2Mix. Throughout the paper, unless otherwise specified, LRS2 dataset refers to the default LRS2-2Mix dataset. For more details on dataset construction, training processes, and testing procedures, please refer to Appendix G.

我们在LRS2数据集上展示了3人和4人说话者的实验结果。为此我们创建了两个数据集变体:LRS2-3Mix和LRS2-4Mix,均用于模型的训练和测试。相应地,默认的LRS2数据集被重新命名为LRS2-2Mix。在本文中,除非特别说明,LRS2数据集均指默认的LRS2-2Mix数据集。关于数据集构建、训练过程和测试流程的更多细节,请参阅附录G。

We use the following baseline methods for comparison: AVConvTasNet (Wu et al., 2019), AVLIT (Martel et al., 2023), and CTCNet (Li et al., 2022). The results are presented in Table 3. Each table column shows a different dataset, where the number of speakers in the mixed signal $C$ varies. The models used for evaluating each dataset were specifically trained for separating a corresponding number of speakers. It is evident from the results that the proposed model significantly outperformed the previous methods across all three datasets.

我们使用以下基线方法进行比较:AVConvTasNet (Wu等人,2019)、AVLIT (Martel等人,2023) 和 CTCNet (Li等人,2022)。结果如 表 3 所示。每列显示不同数据集,其中混合信号中的说话者数量 $C$ 各不相同。用于评估每个数据集的模型都专门针对相应数量的说话者进行了分离训练。从结果中可以明显看出,所提出的模型在所有三个数据集上都显著优于之前的方法。

4.5. Ablation study

4.5. 消融实验

To better understand IIANet, we compared each key component using the LRS2 dataset in a completely fair setup, i.e., using the same architecture and hyper parameters in the following experiments.

为了更好地理解IIANet,我们在完全公平的设置下使用LRS2数据集对每个关键组件进行了比较,即在以下实验中使用相同的架构和超参数。

Importance of IntraA and InterA. We investigated the role of IntraA and InterA in model performance. To this end, we constructed a control model (called Control 1) obtained by removing the IntraA and InterA blocks from IIANet. Basically, it uses two upside-down UNet (Ronneberger et al., 2015) architectures