emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition


Original paper: https://arxiv.org/pdf/2403.14083v1

Abstract—Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports coupling a CNN with an LSTM to improve performance.

While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.

Index Terms—speech emotion recognition, neural architecture search, deep learning, DARTS

1 INTRODUCTION

Recognising the emotional nuances embedded in speech is a fundamental, yet complex challenge. Over the last decade, the field of Speech Emotion Recognition (SER) has experienced significant strides, predominantly driven by the exponential growth of deep learning [1], [2], [3], [4]. A key breakthrough facilitated by deep learning is its capability to automatically learn features, departing from the traditional reliance on manually crafted features shaped by human perceptions of speech signals. Nevertheless, determining the optimal deep-learning architecture for SER remains a challenging task that warrants attention. Conventional approaches involve iterative modifications and recursive training of models until an optimal configuration is found. However, this approach becomes prohibitively time-consuming due to the extensive training and testing required for numerous configurations.

An alternative to the conventional approach is "Neural Architecture Search" (NAS), which can help discover optimal neural networks for a given task. The idea is to search for the model architecture that minimises the loss. In NAS, the search is performed over a discrete set of candidate operations, which requires the model to be trained on a specific configuration before moving on to the next. Nevertheless, this approach demands considerable time and resources.

The Differentiable Architecture Search (DARTS) is a method developed to make the search for a neural network architecture more efficient. It relaxes the discrete set of candidate operations into a continuous space, reducing the computation time to just 2-3 GPU days, a major improvement over the earlier reinforcement learning and evolutionary methods, which required 2,000 and 3,150 GPU days, respectively. Additionally, through network optimisation, DARTS has the potential to deliver significantly higher SER accuracy, which is currently quite low and needs improvement. These two points motivate the use of DARTS for SER.

Additionally, previous studies have shown that a multitemporal Convolutional Neural Network (CNN) stacked on a Long Short-Term Memory Network (LSTM) can capture contextual information at multiple temporal resolutions, complementing LSTM for modelling long-term contextual information, thus offering improved performance [5], [6], [7], [8]. Sequential Neural Networks (SeqNN) like Recurrent Neural Networks (RNN) or LSTM can easily identify the patterns of a sequential stream of data. This paper takes a pioneering step by leveraging DARTS for a novel joint CNN–SeqNN configuration, named “emoDARTS”, as depicted in Figure 1, with an attention network seamlessly integrated into the SeqNN component to further elevate its performance.

Fig. 1. The proposed architecture of emoDARTS passes the input features to the CNN component, through the SeqNN component, and finally to a dense layer. The optimum CNN and SeqNN operations are selected by DARTS jointly.

The investigation of DARTS within the SER domain is minimal and invites further inquiry to uncover the potential for improving SER performance. DARTS has only recently been employed in SER tasks to improve models, as recently as 2022 [9], [10], wherein the researchers have mostly applied DARTS separately on CNNs and RNNs [11], [12], [13]. Hence, the viability of utilising DARTS jointly for CNN and SeqNN requires exploration. While there is a lone study that explores the joint optimisation of CNN and LSTM [10], it imposes constraints on the layer order for the CNN within the DARTS component, thereby limiting the full potential of DARTS. In response to this limitation, our paper takes on the challenge of optimising this joint configuration without such constraints. The contributions of this paper are summarised as follows.

2 RELATED WORK

This section delves into the existing literature on using DARTS and NAS for SER. Notably, our exploration reveals a limited number of papers in this space. We therefore extend our review to encompass relevant papers in related fields to provide a comprehensive perspective. For completeness, we also include studies employing CNNs, LSTM networks, and their joint utilisation for SER.

2.1 Speech Emotion Recognition using CNN and LSTM

One of the earliest uses of CNN networks in SER was reported by Zheng et al. in 2015 [14]. The authors fed a spectrum generated from an audio signal to a CNN network, which output the recognised emotion. The authors report that they surpass SVM-based classification performance and reach 40% classification accuracy for a five-class classification using the IEMOCAP dataset.

The earliest work combining CNN and LSTM for SER is by Trigeorgis et al. in 2016 [15]. The authors show an impressive improvement by a fully self-learnt representation over traditional expert-crafted features on dimensional emotion recognition.

Zhao et al. [1] show that combining CNN and LSTM networks in the same SER model produces better results than using only a CNN. Using the IEMOCAP dataset, they obtained a speaker-independent accuracy of 52% with a log-Mel spectrogram as the input feature. Their SER approach utilises an LSTM layer to learn contextual dependencies in local features, while a CNN-based layer learns the local features.

2.2 Application of NAS and DARTS in SER and Related Fields

The first paper suggesting NAS for SER was by Zhang et al. in 2018 [16]. The authors employ a controller network that, by reinforcement learning, shapes the architecture of a child network: the number of layers, the number of nodes per layer, and the activation-function hyperparameter. They show an improvement over human-designed architectures and random searches of these.

Zoph and Le [17] use reinforcement learning to train an RNN that generates model architectures so as to maximise the accuracy of the resulting model. As a result, they develop outstanding models for the CIFAR-10 and Penn Treebank datasets: a convolutional network architecture for CIFAR-10 with a 3.65 error rate, and a recurrent network architecture for Penn Treebank with 62.4 perplexity.

Even though NAS is primarily used to find optimised architectures for complex and large models, researchers have also studied the possibility of using NAS to design smaller deep neural network models. Liberis et al. [18] develop a NAS system called $\mu\mathrm{NAS}$ to design small neural architectures that can run on microcontroller units. They improve top-1 accuracy by 4.8% on image classification problems while reducing the memory footprint by up to 13 times. Similarly, Gong et al. [19] study the feasibility of using NAS to shrink deep learning models for deployment on resource-constrained edge devices.

Traditional NAS consumes much computational power and time to reach the optimal model for a given problem. In 2018, Liu et al. [20] introduced a differentiable approach that solves the optimisation through continuous relaxation of the architecture representation. This approach is more compute-efficient and higher-performing because the search space is no longer discrete and non-differentiable. They produce high-performing CNN and RNN architectures for tasks such as image recognition and language modelling at a fraction of the search cost of traditional NAS algorithms. DARTS has been popular in the past three years, with many studies extending and improving the algorithm [21], [22], [11], [23], [24], [25].

TABLE 1: Summary and focus of the literature on NAS, DARTS, and speech emotion recognition.

| Paper | SER | NAS | DARTS | Joint CNN & SeqNN optimisation | IEMOCAP | MSP-IMPROV | MSP-Podcast |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zoph and Le 2016 [17] |  | ✓ |  |  |  |  |  |
| Zhang et al. 2018 [16] | ✓ | ✓ |  |  |  |  |  |
| Liu et al. 2018 [20] |  | ✓ | ✓ |  |  |  |  |
| Gong et al. 2019 [19] |  | ✓ |  |  |  |  |  |
| Liberis et al. 2020 [18] |  | ✓ |  |  |  |  |  |
| Wu et al. 2022 [10] | ✓ | ✓ | ✓ | ✓ | ✓ |  |  |
| Sun et al. 2023 [9] | ✓ | ✓ | ✓ |  | ✓ |  |  |
| emoDARTS | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Wu et al. [10] proposed a uniform path dropout strategy to optimise candidate architectures. They use SER as their DARTS application and, using the IEMOCAP dataset, develop an SER model with an accuracy of 56.28% on a four-class classification problem, with discrete Fourier transform spectrograms extracted from audio as input. In their work, the authors specify the layer order as two convolution layers first, followed by a max-pooling layer and a convolution layer, and use DARTS to select the optimum parameters for each layer. We, on the other hand, do not specify the layer sequence and instead enable DARTS to select the ideal design with minimal interference.

EmotionNAS is a two-branch NAS strategy introduced by Sun et al. [9] in 2023. The authors use DARTS to optimise two models in two branches, a CNN model and an RNN model, which take a spectrogram and a waveform as inputs, respectively. They obtained an unweighted accuracy of 72.1% from the combined model on the IEMOCAP dataset. They also report a performance of 63.2% for the spectrogram branch, which only uses a CNN component. The main difference between our approach and the study by Sun et al. [9] is that we use a SeqNN component coupled in series with the CNN component, as in Figure 2, while Sun et al. [9] use an RNN layer in parallel to the CNN layer in a different branch.

We conducted preliminary research to determine the feasibility of utilising DARTS for SER in a CNN-LSTM architecture, where we only optimised the CNN network using DARTS [26]. This paper extends the idea of using DARTS in SER but with more relaxation in the SeqNN component by jointly optimising the whole architecture.

In recent years, the literature has highlighted the use of attention networks in SER, which has provided superior outcomes [27], [28]. We added an attention network component to the DARTS search scope to discover whether it improves performance. Zou et al. [29] introduced a concept called 'co-attention', where many separate multimodal inputs are fused by co-attention. They used three sets of features from the IEMOCAP dataset, MFCC, spectrogram, and Wav2Vec2 features, and obtained 72.70% accuracy. Liu et al. [30] utilised an attention-based bi-directional LSTM followed by a CNN layer for an SER problem. They achieved a significant performance of 66.27% on the IEMOCAP dataset. Their idea of 'CNN-LSTM attention' paved the foundation for our model architecture.

Fig. 2. The emoDARTS architecture comprises input features processed through CNN, SeqNN, and Dense layers, and it utilises DARTS for jointly optimising the CNN and SeqNN components. (Figure blocks: Input → DARTS-generated CNN component with $C$ cells → SeqNN component with $N$ cells → Dense layer → Output layer.)

In Table 1, we briefly compare the existing studies with emoDARTS. The comparison clearly shows that emoDARTS is the only approach that uses DARTS to jointly optimise the CNN and SeqNN components for SER and evaluates on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.

3 EMODARTS FRAMEWORK

The proposed ‘emoDARTS’ uses DARTS to improve SER using a CNN-SeqNN network, which was motivated by studies that showed increased SER performance when CNN and LSTM layers were combined [5], [7], [8], [30]. We represent our network as a multi-component DARTS network, with the input fed into a CNN component and the output from the CNN component fed into a SeqNN component, but all components are optimised jointly during the architecture search phase, delivering an optimal architecture (Figure 2).
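
To make this data flow concrete, the following is a minimal PyTorch sketch of the Figure 2 pipeline. The plain `Conv2d`/`LSTM` modules merely stand in for the DARTS-searched CNN and SeqNN components, and all layer sizes are illustrative assumptions, not the searched architecture.

```python
import torch
import torch.nn as nn

class EmoDARTSSkeleton(nn.Module):
    """Illustrative skeleton: input -> CNN component -> SeqNN component -> dense."""
    def __init__(self, n_mfcc=128, hidden=128, n_classes=4):
        super().__init__()
        # Placeholder for the DARTS-searched CNN component (C stacked cells).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Placeholder for the DARTS-searched SeqNN component (N stacked cells).
        self.seqnn = nn.LSTM(input_size=16 * (n_mfcc // 2), hidden_size=hidden,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, 128, 128) MFCCs
        h = self.cnn(x)                        # -> (batch, 16, 64, 64)
        h = h.permute(0, 3, 1, 2).flatten(2)   # -> (batch, time=64, 16*64)
        out, _ = self.seqnn(h)                 # -> (batch, 64, 2*hidden)
        return self.classifier(out[:, -1])     # logits for the 4 emotion classes
```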

Fig. 3. DARTS employs steps (a) to (d) to search cell architectures: (a) initialises the graph, (b) forms a search space, (c) updates edge weights, and (d) determines the final cell structure. Nodes signify representations, edges represent operations, with light-coloured edges indicating weaker and dark-coloured edges representing stronger operations.

DARTS uses a differentiable approach to network optimisation. A computation cell is the DARTS algorithm's fundamental unit, and DARTS aims to optimise the cell so that the architecture can function at its maximum performance. A DARTS cell is described as a directed graph, with each node representing a feature (representation) and each edge representing an operation that can be performed on a representation. One unique feature of this graph is that each node is connected to all of its previous nodes by an edge, as seen in Figure 3 (a). If the output of node $j$ is $x^{(j)}$ and the operation $o$ on the edge connecting nodes $i$ and $j$ is $o^{(i,j)}$, then $x^{(j)}$ can be obtained by Equation 1:

$$
x^{(j)}=\sum_{i<j}o^{(i,j)}(x^{(i)})
$$

In the beginning, the candidate search space is generated by combining each node of the DARTS cell with all the candidate operations (with multiple links between nodes), as illustrated in Figure 3 (b). A weight parameter $\alpha$ is incorporated into Equation 1 to identify the optimal edge (operation) connecting two nodes $i$ and $j$ from the candidate search space of all operations. Equation 2 describes how the node's output is then represented.

$$
x^{(j)}=\sum_{i<j}\alpha^{(i,j)}o^{(i,j)}(x^{(i)})
$$

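To illustrate Equations 1 and 2, a toy NumPy sketch follows. The three operations are stand-ins for the real candidate set, and, as in the original DARTS formulation, the $\alpha$ weights are passed through a softmax to keep the relaxation normalised (Equation 2 shows the raw weighted form).

```python
import numpy as np

ops = [
    lambda x: x,                 # identity / skip-connect stand-in
    lambda x: np.maximum(x, 0),  # a nonlinear op stand-in
    lambda x: 0.5 * x,           # a scaling op stand-in
]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def node_output(prev_nodes, alphas):
    """Equations 1-2: sum over incoming edges of the alpha-weighted mix of ops."""
    return sum(softmax(alphas[i]) @ np.stack([op(x) for op in ops])
               for i, x in enumerate(prev_nodes))

x0, x1 = np.ones(4), np.arange(4.0)
alphas = {0: np.array([0.2, 1.5, -0.3]), 1: np.array([1.0, 0.1, 0.4])}
x2 = node_output([x0, x1], alphas)   # representation at node 2
```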

Then, the continuous relaxation of the search space updates the weights $\alpha^{(i,j)}$ of the edges. The final architecture can be obtained by selecting, between two nodes, the operation with the highest weight, $o^{(i,j)*}$, using Equation 3.

$$
o^{(i,j)*}=\operatorname{argmax}_{o}\ \alpha_{o}^{(i,j)}
$$

The searched discrete cell architecture is shown in Figure 3 (d).
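
A small sketch of this discretisation step, with made-up operation names and $\alpha$ values:

```python
import numpy as np

# Learned alpha weights per candidate operation, for two edges of a cell.
alpha = {(0, 1): np.array([0.1, 2.3, 0.4]),
         (0, 2): np.array([1.7, 0.2, 0.9])}
op_names = ["skip_connect", "sep_conv_3x3", "max_pool_3x3"]

# Equation 3: keep only the strongest operation on each edge.
final_cell = {edge: op_names[int(np.argmax(a))] for edge, a in alpha.items()}
print(final_cell)   # {(0, 1): 'sep_conv_3x3', (0, 2): 'skip_connect'}
```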

The number of cells ($C$ or $N$) in a component is a parameter of the DARTS algorithm that specifies how many DARTS cells are stacked to form a component in the model. Each cell takes the outputs of the last two cells as input. If the output of cell $t$ is $y_{t}$ and the function within the cell is $f$, then $y_{t}$ can be represented as:

$$
y_{t}=f(y_{t-1},y_{t-2})
$$

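A toy sketch of this stacking rule, with a simple blending function standing in for a searched cell:

```python
def stack_cells(x, num_cells, cell_fn):
    """Equation 4: each cell consumes the outputs of the two preceding cells."""
    y_prev2, y_prev1 = x, x      # both "previous" outputs start as the input
    for _ in range(num_cells):
        y_prev2, y_prev1 = y_prev1, cell_fn(y_prev1, y_prev2)
    return y_prev1

# Toy cell: a weighted blend standing in for a searched DARTS cell.
out = stack_cells(1.0, num_cells=3, cell_fn=lambda y1, y2: 0.6 * y1 + 0.4 * y2)
```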

DARTS' CNN component has two types of CNN cells: 'normal' and 'reduction' cells. It sets the stride to one in normal cells and two in reduction cells, resulting in a downsampled output in the reduction cells. This downsampling allows the model to eliminate the duplication of intermediate characteristics, reducing complexity.
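
The effect of the stride choice can be seen in a short PyTorch check (the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)
normal_op = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1)     # normal cell
reduction_op = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)  # reduction cell
print(normal_op(x).shape)     # torch.Size([1, 16, 64, 64]) - size preserved
print(reduction_op(x).shape)  # torch.Size([1, 16, 32, 32]) - downsampled
```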

We decided to jointly optimise the CNN and SeqNN components rather than individually, since it is important for downstream components (in this case, the CNN component) to understand the behaviour of upstream components (SeqNN). Joint optimisation in a multiple-component network improves architecture search in various ways, including: 1. the back-propagation of the loss minimisation flows through all the components in a single compute graph; and 2. it reduces the time required in the search phase by searching the architecture of the whole network at once.
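
As a rough sketch, a first-order DARTS-style search step could alternate the two updates below; `model`, the two optimisers, and the batches are assumed inputs, and the paper's actual search code may differ:

```python
import torch

def darts_search_step(model, w_optimizer, alpha_optimizer,
                      train_batch, valid_batch, loss_fn):
    # Update the architecture weights (alphas) on one batch...
    x_v, y_v = valid_batch
    alpha_optimizer.zero_grad()
    loss_fn(model(x_v), y_v).backward()
    alpha_optimizer.step()

    # ...then update the model weights on another batch. Either way, the
    # gradients flow through the CNN and SeqNN components in one compute graph.
    x_t, y_t = train_batch
    w_optimizer.zero_grad()
    loss_fn(model(x_t), y_t).backward()
    w_optimizer.step()
```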

4 EXPERIMENTAL SETUP

4.1 Dataset and Feature Selection

We use the widely used IEMOCAP [31], MSP-IMPROV [32], and MSP-Podcast [33] datasets for our experiments. Our study takes the improvised subset of IEMOCAP and the four categorical labels, happiness, sadness, anger, and neutral, as classes from the datasets. We employ five-fold cross-validation with at least one speaker left out in our training and evaluations. At each fold, the training dataset is divided into two subsets, 'search' and 'training', in a 70/30 proportion. The 'search' set is used in the architecture search; the 'training' set is used in optimising the searched architecture; and the remaining testing dataset is used to infer and obtain the testing performance of the searched and optimised model. This way, we split the dataset into three sets in each cross-validation session. The IEMOCAP dataset has five sessions with ten actors and two unique speakers in each. We use one session for the testing dataset and four sessions for the search and training datasets. Similarly, MSP-IMPROV comprises six sessions including twelve actors. We take one session as the testing dataset and the remaining five sessions as the search and training dataset. MSP-Podcast includes a speaker ID with each audio utterance, and we group the entire dataset by speaker and divide it by the 70/30 rule.
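
One way to realise such speaker-independent folds is scikit-learn's GroupKFold; the snippet below is an illustrative sketch with random stand-in data, not the paper's exact session-based split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, train_test_split

X = np.random.randn(100, 128, 128)      # MFCC features (stand-in data)
y = np.random.randint(0, 4, 100)        # four emotion classes
speakers = np.random.randint(0, 10, 100)

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=speakers):
    # No speaker in test_idx appears in train_idx. Split the fold's training
    # portion 70/30 into 'training' and 'search' subsets.
    tr_idx, se_idx = train_test_split(train_idx, test_size=0.3, random_state=0)
```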

In this research, we use Mel Frequency Cepstral Coefficients (MFCC) as input features to the model. MFCC has been used as the input feature in many SER studies in the literature [34], [35] and has proven to obtain promising results. Some machine learning research uses the Mel filter bank as an input feature when the algorithm is not vulnerable to strongly correlated data. We picked MFCC for this study since the deep learning model is produced automatically and we do not want to infer the model's sensitivity to correlated input. We extract 128 MFCCs from each 8-second audio utterance in the dataset. If the audio utterance is shorter than 8 seconds, we pad it with zeros, while lengthier utterances are truncated. The MFCC extraction with the Librosa Python library [36] outputs a shape of $128\times512$, which is downsampled with max pooling to create a spectrogram of shape $128\times128$.
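
A sketch of this feature pipeline using Librosa is shown below; the 16 kHz sample rate and the hop length (chosen so that 8 s yields roughly 512 frames) are assumptions, as the text only fixes 128 MFCCs, 8-second clips, and the final $128\times128$ shape:

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=16000, seconds=8, n_mfcc=128):
    y, _ = librosa.load(path, sr=sr)
    target = sr * seconds
    y = np.pad(y, (0, max(0, target - len(y))))[:target]   # zero-pad / truncate to 8 s
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=250)
    m = m[:, :512]                                          # (128, 512) time frames
    # Max-pool along time by a factor of 4 to obtain the 128 x 128 input.
    return m.reshape(n_mfcc, 128, 4).max(axis=2)
```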

4.2 Baseline Models

We compare the performance of emoDARTS for SER with three baseline models developed without DARTS (w/o DARTS): 1) CNN, 2) CNN+LSTM, and 3) CNN+LSTM with attention. The CNN baseline model consists of a CNN layer (kernel size = 2, stride = 2, and padding = 2) followed by a max-pooling layer (kernel size = 2 and stride = 2). Two dense layers then process the output from the max-pooling layer after applying a dropout of 0.3. Finally, the last dense layer has four output units corresponding to the four emotion classes, and the model outputs the probability estimate of each emotion for a given input through a Softmax function.

The CNN+LSTM baseline model is built by adding a bi-directional LSTM layer of 128 units after the max-pooling layer. An attention layer is added on top of the LSTM layer in the 'CNN+LSTM attention' baseline model. Figure 4 shows the architecture of the CNN+LSTM attention baseline model.
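
A PyTorch sketch of this baseline follows; the convolution channel count, dense-layer width, and the exact attention formulation are not specified in the text and are assumptions here (the softmax over classes is left to the loss function):

```python
import torch
import torch.nn as nn

class CNNLSTMAttention(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=2, stride=2, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.lstm = nn.LSTM(input_size=8 * 33, hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(256, 1)          # scores each LSTM time step
        self.drop = nn.Dropout(0.3)
        self.fc1, self.fc2 = nn.Linear(256, 64), nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (batch, 1, 128, 128)
        h = self.pool(self.conv(x))            # -> (batch, 8, 33, 33)
        h = h.permute(0, 3, 1, 2).flatten(2)   # -> (batch, time=33, 8*33)
        out, _ = self.lstm(h)                  # -> (batch, 33, 256)
        w = torch.softmax(self.attn(out), dim=1)
        ctx = (w * out).sum(dim=1)             # attention-weighted context vector
        return self.fc2(torch.relu(self.fc1(self.drop(ctx))))
```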

4.3 DARTS Configuration

We divide the cell search space operations into two separate parts, CNN and SeqNN, based on the components to which they apply. Table 2 lists the type of operations used in each component. The cell search space of the CNN component consists of pooling operations such as $3\times3$ max pooling $(i=3)$ and $3\times3$ average pooling $(i=3)$, and convolutional operations such as $3\times3$ and $5\times5$ separable convolutions $(i=3,5)$ and $3\times3$ and $5\times5$ dilated convolutions $(i=3,5)$,

Fig. 4. Visualisation of the CNN+LSTM attention baseline model. The parameters of the CNN layer are: kernel size $(k)=2$, stride $(s)=2$, and padding $(p)=2$; the parameters of the max-pooling layer are: kernel size $(k)=2$ and stride $(s)=2$; and the LSTM layer has 128 units.

TABLE 2: Type of DARTS operations used in each component. "i" represents the kernel size in CNN while "j" represents the number of layers in the SeqNN component.

| Component | Operation | Description |
| --- | --- | --- |