LLM-Supervised Pre-training for Multimodal Emotion Recognition in Conversations
Abstract—Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pre-train a text-based recognition model on unsupervised speech transcripts with LLM guidance. The transcripts are obtained from a raw speech dataset with a pre-trained ASR system, and a text LLM is queried to provide pseudo-labels for them; these pseudo-labeled transcripts are subsequently used to learn an utterance-level text-based emotion recognition model. We use the utterance-level text embeddings for emotion recognition in conversations, along with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the data. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, and show that the proposed model improves over other benchmarks, achieving state-of-the-art results on two of these three datasets.
Index Terms—Multimodal emotion recognition, LLM distillation, hierarchical training, conversational analytics
I. INTRODUCTION
Emotion Recognition in Conversation (ERC) focuses on detecting emotions conveyed through multiple modalities during social conversational interactions, which is essential for natural human communication. Developing artificial systems that have improved emotional understanding and intelligence is a vital design step in conversational agents [1], social media analytics tools [2], customer service centers [3], mental health monitoring platforms [4], and wearable systems [5]. ERC enables these technologies to better adapt to human emotions, enhancing user experiences.
Emotion recognition in conversational data is challenging due to overlapping speakers, short- and long-term dependencies [6], short speaker turns, reverberation, and background noise. Emotions are often multimodal, conveyed through various modes such as facial expressions [7], vocal cues [8], gestures [9], and physiological signals [10]. To address these complexities, multimodal approaches are often preferred [11]. This paper focuses on joint emotion recognition from audio and text, using the strengths of both modalities to enhance the accuracy of emotion detection in conversations.
The initial methods for Speech Emotion Recognition (SER) relied on handcrafted acoustic features like pitch [12], energy, and speaking rate [13]. The introduction of deep learning techniques, including CNNs [14], LSTMs [15], and transformers [16], significantly improved SER performance. Recently, self-supervised learning models like wav2vec2.0 [17], HuBERT [18], and WavLM [19] have shown promise in recognizing emotions across multiple datasets and tasks. Large language models (LLMs) have also been explored for SER [20], [21], though they demand significant computational resources.
In parallel, text-based emotion recognition (commonly known as sentiment analysis) initially relied on rule-based methods that linked specific words to emotions [22], [23]. With the rise of deep learning, sentiment analysis progressed to CNNs [24], RNNs [25], and transformer architectures such as BERT [26], [27] and RoBERTa [28], [29]. More recently, large language models (LLMs) are seen as excellent tools for sentiment analysis [30].

This work was carried out with research grants from British Telecom and the Prime Minister's Research Fellowship.
In order to enhance emotion recognition in conversations, several works have also designed multi-modal fusion techniques combining audio and text data, using models like transformers [31], graph neural networks [32], and capsule networks [33].
In this paper, we propose a pre-training methodology through a multi-modal approach that leverages both text and speech representations. Specifically, we introduce a strategy to improve emotion classification from text by leveraging unsupervised speech data with large-scale language models (LLMs). To this end, we first utilize a pre-trained ASR system based on the Whisper-large model [34] to transcribe speech. The "noisy" speech transcripts are labeled with an LLM to automatically generate pseudo-labels of speech sentiment. These labeled transcripts are then used to fine-tune a RoBERTa text encoder model [28] for sentiment classification, allowing it to capture nuanced emotional patterns in textual data. We show substantial benefits from this unsupervised pre-training of the text-based model.
On the speech side, we extract features using the recently proposed CARE model [35]. The CARE model is designed to generate high-quality embeddings that encapsulate both content and acoustic information from speech utterances. Using the speech and text embeddings derived at the utterance level, we train a bi-directional gated recurrent unit (GRU) based model to assimilate information across the entire conversation. The conversation-level embeddings from the uni-modal speech and text models are integrated into our proposed multi-modal architecture for conversational emotion recognition. We propose a hierarchical fusion mechanism with a cross-attention-based network [36] to enable interaction between the modalities, ensuring an effective fusion of the emotional information present in both speech and text. We call our method MERITS-L (Multimodal Emotion Recognition In Speech and Text with LLM guidance).
The experiments are performed on three established datasets, namely IEMOCAP [37], MELD [38] and CMU-MOSI [39]. The key contributions of this work are the LLM-guided pre-training of a text-based emotion recognition model from unsupervised speech transcripts, the hierarchical multi-stage training of the speech-text conversational model, and state-of-the-art results on two of the three benchmark datasets.
II. RELATED WORK
LLMs for Text Sentiment Analysis: The capabilities of large language models (LLMs) have been the focus of recent research efforts [30], [40], [41]. Zhong et al. [41] demonstrated that ChatGPT achieves performance comparable to fine-tuned BERT models. Zhang et al. [30], however, provided a more comprehensive evaluation across various sentiment analysis tasks. Their findings reveal that while LLMs under-perform in fine-grained sentiment analysis, they exhibit promising zero-shot capabilities in simpler tasks, such as binary sentiment classification. In this work, we leverage the ability of LLMs to coarsely annotate large corpora of emotional speech transcripts.

Fig. 1. Block diagram of the proposed model. The pre-training stage is shown in the grey box at the top. An ASR system is used to generate the transcripts for the pre-training data, which are annotated by a large language model (LLM) as positive, negative or neutral sentiment. These "silver" labels, together with the text transcripts, form the supervised training dataset for the RoBERTa-large model. A frozen CARE model [35] is used for extracting audio embeddings; both the text and speech embeddings thus use only unsupervised data. The MERITS-L model is trained in three stages (denoted as Stage I, II and III in the diagram), and the models trained in a particular stage are kept frozen for subsequent stages.
Emotion Recognition in Conversations: Recent methods that achieve strong performance on benchmark datasets often incorporate speaker identity [42]–[45]. For instance, Hu et al. [43] introduced a supervised contrastive loss, where utterances with the same emotion and speaker are treated as positive samples in a contrastive learning framework. Yu et al. [45] appended speaker embeddings to the utterance representations and utilized a language model, such as RoBERTa [28], to predict emotions via masked language modeling. In contrast, our work does not access speaker labels for any utterance in the conversation.
III. METHOD
A. Background
- CARE: In our recently proposed CARE model [35], speech is processed by two encoders: one focuses on the semantic aspect of speech by aligning with the mean-pooled RoBERTa representation of the corresponding ASR transcripts, while the other is trained to predict low-level descriptors of speech provided by the $\mathrm{PASE+}$ model [46]. The CARE embeddings are seen to perform better than most other base-sized models in the SUPERB-style evaluation [47]. In this paper, we utilize utterance-level CARE embeddings and train conversational models along with speech-text fusion.
- RoBERTa: One of the significant contributions to large-scale language modeling was proposed by Devlin et al. [26]. This architecture, called bidirectional encoder representations from transformers (BERT), was trained with two objectives on a corpus of textual data: predicting words masked out in a sentence, and predicting whether two sentences semantically follow each other (referred to as next sentence prediction). Liu et al. [28] trained this architecture on a larger corpus of textual data without the next sentence prediction task. The resulting pre-trained model is known as the robustly optimized BERT approach (RoBERTa). We use the pre-trained large version of this model, with 24 transformer layers, as the text encoder.
B. Proposed MERITS-L model
The block diagram of the proposed model is shown in Fig. 1.
- Problem Description: Given a set of utterances $U$ and a set of emotion labels $Y$, a conversation consisting of $K$ utterances is denoted by $\left[(u_{1},y_{1}),(u_{2},y_{2}),\dots,(u_{K},y_{K})\right]$, where $y_{j}\in Y$ is the emotion of utterance $u_{j}$ in the conversation. In our case, the ERC task uses the speech and text modalities, which means $u_{k}=\{S_{k},T_{k}\}$, where $S_{k}$ and $T_{k}$ refer to the speech signal and the text transcript associated with the utterance $u_{k}$. The objective of ERC is to predict the emotion label $y_{k}$ of each utterance $u_{k}$. This data layout is sketched below.
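For concreteness, the data layout can be written as follows; the `Utterance` and `Conversation` names are purely illustrative and not taken from the original implementation:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Utterance:
    """One conversational turn u_k = {S_k, T_k} with its emotion label y_k."""
    speech: np.ndarray  # speech signal S_k (e.g. a 16 kHz waveform)
    text: str           # transcript T_k
    label: int          # emotion index y_k, an element of Y


# A conversation with K utterances: [(u_1, y_1), ..., (u_K, y_K)];
# each label is stored inside its utterance here for convenience.
Conversation = List[Utterance]
```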
- LLM guided text pre-training: The text transcripts of the emotional speech corpus are first generated by an ASR system (Whisper-large-v3 [34]). Generally, the word error rates on emotional speech are higher than on neutral speech [48]. Given the ASR-generated transcripts, a large language model (LLM) is prompted to annotate each transcript with one of three classes, "positive", "negative" or "neutral". The pre-trained RoBERTa-large model is fine-tuned to predict these pseudo-classes (from the LLM predictions), and the resultant model is subsequently used as an utterance-level text feature extractor. This model is referred to as RoBERTa-FT in the subsequent sections of the paper; a sketch of this step is given below.
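A minimal sketch of this step with the HuggingFace transformers library is shown below. The fine-tuning loop itself is standard cross-entropy training and is omitted; the mean-pooling of the final-layer hidden states is our assumption, as the text above does not specify the pooling used for RoBERTa-FT:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Fine-tune RoBERTa-large on the three LLM pseudo-classes
# (positive / negative / neutral).
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=3)
model.eval()


def utterance_embedding(text: str) -> torch.Tensor:
    """Utterance-level text feature from the fine-tuned encoder
    (mean-pooled final-layer hidden states; the pooling is an assumption)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model.roberta(**inputs).last_hidden_state  # (1, L, 1024)
    return hidden.mean(dim=1).squeeze(0)                    # (1024,)
```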
C. Training
The training methodology for MERITS-L is performed in stages as mentioned below:
Stage I: All utterances, $U$, are collated and used to train text sentiment analysis and speech emotion recognition models for each dataset. While the RoBERTa-FT model is fine-tuned for each downstream dataset, small lightweight networks with frozen CARE embeddings as input are trained for the speech modality of every dataset. This training stage aims to classify the text transcript $(T_{k})$ and the speech signal $(S_{k})$ into the correct emotion category $(y_{k})$. The final-layer embeddings for each utterance ($T_{k}^{1}$ and $S_{k}^{1}$ for the text transcript and speech signal, respectively) are used for the next stage of training.
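A plausible form of the lightweight speech network for Stage I, assuming pre-computed, frozen utterance-level CARE embeddings as input (the layer sizes are illustrative, not from the paper):

```python
import torch.nn as nn


class SpeechEmotionHead(nn.Module):
    """Small classifier over frozen CARE utterance embeddings (Stage I).
    The penultimate activation serves as the feature S_k^1 for Stage II."""

    def __init__(self, in_dim=768, hidden=256, num_classes=4):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, care_emb):
        feat = self.proj(care_emb)      # S_k^1, passed on to Stage II
        return self.cls(feat), feat     # emotion logits and feature
```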
Stage II: This stage introduces the conversational nature of the data into the modeling framework. The text features of the utterances in a conversation from the previous stage, denoted by $\left(T_{1}^{1},T_{2}^{1},\ldots,T_{K}^{1}\right)$ for a conversation with $K$ utterances, are processed by a bidirectional gated recurrent unit (Bi-GRU) network with self-attention over the conversational context. This stage encourages the model to predict the emotion class of the utterance $u_{k}$ while keeping the entire conversation as part of the context. A similar modeling exercise is carried out for the speech modality. As in the previous stage, the features from the Bi-GRU with self-attention blocks are used for the final stage of training; they are denoted by $T_{k}^{2}$ and $S_{k}^{2}$ for the text and speech modality of utterance $u_{k}$, respectively. A sketch of this context model is given below.
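This is a minimal PyTorch sketch of the Stage II block; the hidden sizes and number of attention heads are illustrative, as the paper does not specify them here:

```python
import torch.nn as nn


class ConversationEncoder(nn.Module):
    """Bi-GRU over the sequence of Stage I utterance features, followed by
    self-attention across the conversational context (Stage II)."""

    def __init__(self, in_dim=256, hidden=128, num_classes=4, heads=4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)
        self.cls = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):                # feats: (1, K, in_dim)
        ctx, _ = self.gru(feats)             # (1, K, 2*hidden)
        ctx, _ = self.attn(ctx, ctx, ctx)    # self-attention over turns
        return self.cls(ctx), ctx            # per-utterance logits; ctx
                                             # yields T_k^2 / S_k^2
```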
Stage III: Notably, the previous stages trained the two modalities separately. In order to align the two modalities more effectively for emotion recognition, they are combined in this stage. We implement a co-attention fusion strategy for the two modalities as outlined in Fig. 2. The query-key-value sequences for the two modalities are the features obtained after Stage II of training; for example, in Fig. 2, $Q_{A}$ refers to the query sequence of the speech modality and is denoted as $\left(S_{1}^{2},S_{2}^{2},\ldots,S_{K}^{2}\right)$ for the conversation $C$. A sketch of the co-attention block follows.

Fig. 2. The co-attention network used in the proposed model. It consists of two sub-blocks - the cross-attention and the self-attention blocks.
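Below is a sketch of one direction of the co-attention block of Fig. 2, with the cross-attention and self-attention sub-blocks; the residual connections and normalization layout are our assumption, as the figure only names the two sub-blocks:

```python
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """One direction of the co-attention fusion (Fig. 2): queries from one
    modality attend over keys/values of the other, then a self-attention
    sub-block refines the fused conversation-level sequence."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q_mod, kv_mod):
        # e.g. q_mod = (S_1^2, ..., S_K^2), kv_mod = (T_1^2, ..., T_K^2)
        fused, _ = self.cross(q_mod, kv_mod, kv_mod)
        fused = self.norm1(q_mod + fused)
        out, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + out)
```

Consistent with the symmetric nature of the fusion noted in Sec. IV-C, the block would presumably be applied in both directions (speech queries over text, and text queries over speech) before the final emotion classifier.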
IV. EXPERIMENTS AND RESULTS
A. Datasets
Pre-training: The MSP-PODCAST corpus [49] is used for pre-training. A total of 149,307 speech turns, amounting to 230 hours of emotional speech data, is used. Of these samples, 80% are randomly chosen as the training set while the remaining 20% serve as the validation set. The Whisper-large-v3 model is used for generating the transcripts, which are annotated using the GPT-3.5 Turbo model.
TABLE I
RESULTS ON THE DATASETS IN TERMS OF WEIGHTED F1-SCORE.

| Modality (training stage) | IEMOCAP | MELD | CMU-MOSI |
|---|---|---|---|
| Audio (Stage I) | 66.93 | 47.61 | 68.77 |
| Audio (Stage II) | 77.95 | 49.32 | 69.43 |
| Text (Stage I) | 69.84 | 63.81 | 85.41 |
| Text (Stage II) | 83.85 | 65.24 | 86.02 |
| Audio + Text (Stage III) | 86.48 | 66.02 | 86.81 |
ERC datasets: Three datasets are used for evaluating MERITS-L on the ERC task - IEMOCAP [37], MELD [38] and CMU-MOSI [39].
IEMOCAP dataset: The IEMOCAP dataset consists of 151 video recordings split into 5 sessions. In line with previous works, we perform a four-way classification task over the "angry", "happy", "sad", "neutral" and "excited" categories, with the excited and happy categories merged. This gives a total of 5531 utterances across the four emotion labels. We use session 5 for testing, session 1 for validating our models, and sessions 2-4 for training.
MELD dataset: The MELD dataset is a multi-party dataset created from video clips of the popular TV show "Friends". The training data consists of 9988 utterances, the validation data of 1108 utterances, and the test data of 2610 utterances. A seven-way classification task is performed on this dataset, with each utterance labeled as one of 7 emotions: "angry", "sad", "joy", "neutral", "fear", "surprise" or "disgust".
CMU-MOSI dataset: The CMU-MOSI dataset has a total of 93 monologues divided into 2199 utterances. Each utterance is labeled in the range $[-3,3]$. Following previous works, we treat this as a binary classification problem, with utterances having sentiment values in the range $[-3,0)$ classified as negative sentiment and those with values in the range $[0,3]$ as positive sentiment. For dataset partitioning, we follow the prior work of Poria et al. [50], where the first 62 monologues are used for training and validation while the last 31 monologues are used for testing. Of the 62 monologues, we use 49 for training our model and the remaining 13 for validation.
B. Implementation details
Once the transcripts are generated by the Whisper model, the GPT-3.5 Turbo model is prompted as follows:
You are a sentiment classification bot. Given the [sentence], classify as positive, negative or neutral sentiment. Please give the sentiment and no extra text as output.
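A minimal sketch of this annotation call with the OpenAI Python client (client setup, batching and retry logic are omitted; the fallback to neutral for malformed outputs is our assumption):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("You are a sentiment classification bot. Given the [sentence], "
          "classify as positive, negative or neutral sentiment. Please give "
          "the sentiment and no extra text as output.")


def pseudo_label(sentence: str) -> str:
    """Query GPT-3.5 Turbo for a three-class sentiment pseudo-label."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": sentence}],
        temperature=0)
    label = resp.choices[0].message.content.strip().lower()
    # Guard against unexpected outputs (assumption, not from the paper).
    return label if label in {"positive", "negative", "neutral"} else "neutral"
```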
The RoBERTa-large model is pre-trained with the LLM-generated labels (3 classes) for a total of 10 epochs with a learning rate of 1e-4 and a batch size of 32. The different stages of MERITS-L are trained for a total of 50 epochs with a learning rate of 1e-4 and a batch size of 32. For all training, the cross-entropy loss with the AdamW optimizer [51] is used. The weighted F1-score is used as the metric for performance evaluation on the ERC datasets. Note that we have not used any additional labeled datasets in pre-training, as the pre-training frameworks for speech and text are purely based on self-supervised learning principles applied to raw data. Further, the downstream datasets are used without any knowledge of speaker meta-data.
C. Results
The results for the three stages are shown in Table I. We note that the performance of both modalities improves with every modeling stage. The introduction of contextual information significantly improves performance on the IEMOCAP dataset (relative improvements of 16.46% and 20.06% for the audio and text modalities, respectively). The two modalities perform comparably on the IEMOCAP dataset, unlike the other two datasets, where text emotion recognition performance is considerably higher than that of the audio modality. Finally, the multi-modal fusion aids all datasets, achieving relative improvements of 16.28%, 2.24% and 5.65% over the best performing modality (after Stage II) for IEMOCAP, MELD and CMU-MOSI, respectively. This also shows that the multi-modal fusion strategy is more effective when combining modalities of comparable performance, perhaps due to the symmetrical nature of the co-attention based fusion mechanism.

Fig. 3. The performance of the RoBERTa-large models on the different datasets. Different LLMs are used for generating pseudo emotion labels from speech transcripts. The performance of pre-trained RoBERTa without any supervised fine-tuning is also reported.

D. Evaluation with different LLMs
We design an oracle experiment in this regard. Using a prompt template similar to the one described in Sec. IV-B, we annotate the transcripts into the same three classes using Mixtral-8x7B-Instruct-v0.1 and Llama-3-8b-chat-hf. Since the MSP-PODCAST dataset provides valence-arousal-dominance values ranging from 1 to 7, we assign a positive label to samples with valence in the range (5, 7], a negative label to samples with valence in the range [1, 3), and the neutral label to the rest. With these labels serving as the ground truth, we find that GPT-3.5 Turbo achieves a label overlap of 52.98%, while Llama-3-8b-chat-hf is comparable with an overlap of 50.91%. The performance of Mixtral-8x7B-Instruct-v0.1 is the lowest, with an overlap of only 44.98%. The valence-to-label mapping is sketched below.
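The valence-to-label mapping and the overlap metric of this oracle experiment can be written compactly (a sketch; the function names are ours):

```python
def valence_to_label(v: float) -> str:
    """Map an MSP-PODCAST valence score in [1, 7] to an oracle class."""
    if v > 5:       # valence in (5, 7] -> positive
        return "positive"
    if v < 3:       # valence in [1, 3) -> negative
        return "negative"
    return "neutral"


def label_overlap(llm_labels, valences) -> float:
    """Fraction of LLM pseudo-labels that agree with the oracle labels."""
    oracle = [valence_to_label(v) for v in valences]
    agree = sum(a == b for a, b in zip(llm_labels, oracle))
    return agree / len(oracle)
```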
While the above benchmarking used oracle emotion labels from the speech dataset, we also perform a downstream evaluation using the three choices of LLM. The impact of the different LLM annotation abilities is shown in Fig. 3, which reports the performance of the text-modality model after Stage I training. The performance after Stage I with pre-trained RoBERTa (without any LLM guidance) is also shown for reference. We notice that the RoBERTa model fine-tuned with labels provided by GPT-3.5 Turbo achieves relative improvements of 8.22%, 5.01% and 21.9% for IEMOCAP, MELD and CMU-MOSI, respectively, over the pre-trained RoBERTa model. The relative improvements achieved by GPT-3.5 Turbo over Mixtral-8x7B-Instruct-v0.1 are 5.4%, 3.39% and 12.51% for IEMOCAP, MELD and CMU-MOSI, respectively. The largest performance improvement, on the CMU-MOSI dataset, may be attributed to the binary classification task in this dataset.
E. Importance of hierarchical training
In order to understand the impact of hierarchical training, we combine Stages II and III of MERITS-L. The impact of such a training philosophy is shown in Fig. 4. The change in training methodology has the largest effect on IEMOCAP, where the performance of MERITS-L drops from 86.48% to 82.91%. A performance drop of around 2% in absolute terms is also noticed for MELD and CMU-MOSI. The benefit of hierarchical training arises from the fact that end-to-end training of the different model components often leads to over-fitting, as these datasets are relatively small in size.

Fig. 4. The importance of hierarchical training in MERITS-L
TABLE II
COMPARISON WITH DIFFERENT METHODS. * STANDS FOR RESULTS WITH A TRI-MODAL SYSTEM (TEXT, AUDIO AND VISUAL MODALITIES). ALL NUMBERS ARE WEIGHTED F1-SCORES. THE BEST PERFORMING MODEL FOR EACH DATASET IS MARKED IN BOLD, WHILE UNDERLINE INDICATES THE NEXT BEST PERFORMING MODEL.

| Method | IEMOCAP | MELD | CMU-MOSI |
|---|---|---|---|
| LMFN [52] | 82.54 | | 80.92 |
| M3ER [53] | 82.40 | | |
| DialogueTRM [54] | | 63.55 | |
| SMIN [55] | 87.47 | 64.50 | 81.45 |
| UniMSE [56] | | 65.51* | 85.78 |
| EmoCaps [33] | | 63.73 | |
| MERITS-L | 86.48 | 66.02 | 86.81 |
F. Comparison with other work
We compare the performance of MERITS-L with recent works in Table II. The proposed MERITS-L achieves the best performance among state-of-the-art models for MELD and CMU-MOSI. For IEMOCAP, however, the method by Lian et al. [55] outperforms MERITS-L by a margin of 1% absolute. Note that several methods, such as TelME [57] and EACL [45], achieve higher performance than MERITS-L on the MELD dataset; however, these methods use information about the speaker identity of each spoken utterance and hence are excluded from the comparison in this work.
V. SUMMARY
In this paper, we first propose a novel way of supervised pre-training for text-based emotion recognition using LLM guidance. Text is extracted from an emotional speech corpus, following which a text emotion recognition model is trained to classify each transcript using LLM-provided pseudo-labels. With this text-based recognition model as the utterance-level text embedding extractor, we propose MERITS-L, a model for emotion recognition in conversations using the speech and text modalities. A hierarchical way of training the model is proposed, starting with utterances from a single modality, followed by contextual modeling at the conversational level and, subsequently, the alignment of the two modalities. Comparisons with other state-of-the-art works indicate the superiority of our method on two of the three datasets considered in this work.
