LLM-Supervised Pre-training for Multimodal Emotion Recognition in Conversations
Abstract—Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pre-train a text-based recognition model on unsupervised speech transcripts with LLM guidance. The transcripts are obtained from a raw speech dataset with a pre-trained ASR system, and a text LLM is queried to provide pseudo-labels for them; these pseudo-labeled transcripts are subsequently used to learn an utterance-level text-based emotion recognition model. We use the utterance-level text embeddings for emotion recognition in conversations, along with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the data. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, and show that the proposed model improves over other benchmarks, achieving state-of-the-art results on two of these three datasets.
Index Terms—Multimodal emotion recognition, LLM distillation, hierarchical training, conversational analytics
I. INTRODUCTION
Emotion Recognition in Conversation (ERC) focuses on detecting emotions conveyed through multiple modalities during social conversational interactions, which is essential for natural human communication. Developing artificial systems that have improved emotional understanding and intelligence is a vital design step in conversational agents [1], social media analytics tools [2], customer service centers [3], mental health monitoring platforms [4], and wearable systems [5]. ERC enables these technologies to better adapt to human emotions, enhancing user experiences.
Emotion recognition in conversational data is challenging due to overlapping speakers, short- and long-term dependencies [6], short speaker turns, reverberation, and background noise. Emotions are often multimodal, conveyed through various modes such as facial expressions [7], vocal cues [8], gestures [9], and physiological signals [10]. To address these complexities, multimodal approaches are often preferred [11]. This paper focuses on joint emotion recognition from audio and text, using the strengths of both modalities to enhance the accuracy of emotion detection in conversations.
The initial methods for Speech Emotion Recognition (SER) relied on handcrafted acoustic features like pitch [12], energy, and speaking rate [13]. The introduction of deep learning techniques, including CNNs [14], LSTMs [15], and transformers [16], significantly improved SER performance. Recently, self-supervised learning models like wav2vec2.0 [17], HuBERT [18], and WavLM [19] have shown promise in recognizing emotions across multiple datasets and tasks. Large language models (LLMs) have also been explored for SER [20], [21], though they demand significant computational resources.
In parallel, text-based emotion recognition (commonly known as sentiment analysis) initially relied on rule-based methods that linked specific words to emotions [22], [23]. With the rise of deep learning, sentiment analysis progressed to CNNs [24], RNNs [25], and transformer architectures such as BERT [26], [27] and RoBERTa [28], [29]. More recently, large language models (LLMs) are seen as excellent tools for sentiment analysis [30].

This work was carried out with research grants from British Telecom and the Prime Minister's Research Fellowship.
In order to enhance emotion recognition in conversations, several works have also designed multi-modal fusion techniques combining audio and text data, using models like transformers [31], graph neural networks [32], and capsule networks [33].
In this paper, we propose a pre-training methodology through a multi-modal approach that leverages both text and speech representations. Specifically, we introduce a strategy to improve emotion classification from text by leveraging unsupervised speech data with large-scale language models (LLMs). To this end, we first utilize a pre-trained ASR system based on the Whisper-large model [34] to transcribe speech. The "noisy" speech transcripts are labeled with an LLM to automatically generate pseudo-labels of speech sentiment. These labeled transcripts are then used to fine-tune a RoBERTa text encoder model [28] for sentiment classification, allowing it to capture nuanced emotional patterns in textual data. We show substantial benefits from this unsupervised pre-training of the text-based model.
On the speech side, we extract features using the recently proposed CARE model [35]. The CARE model is designed to generate high-quality embeddings that encapsulate both content and acoustic information from speech utterances. Using the speech and text embeddings derived at the utterance level, we train a bi-directional gated recurrent unit (GRU) based model to assimilate information across the entire conversation. The conversation-level embeddings from the uni-modal speech and text models are integrated into our proposed multi-modal architecture for conversational emotion recognition. We propose a hierarchical fusion mechanism with a cross-attention-based network [36] to enable interaction between the modalities, ensuring an effective fusion of the emotional information present in both speech and text. We call our method MERITS-L (Multimodal Emotion Recognition In Speech and Text with LLM guidance).
The experiments are performed on three established datasets, namely IEMOCAP [37], MELD [38] and CMU-MOSI [39]. The key contributions of this work are the LLM-guided pre-training of a text-based emotion recognition model from unsupervised speech transcripts, the hierarchical multi-stage training of the speech-text conversational model, and state-of-the-art results on two of the three benchmark datasets.
II. RELATED WORK
LLMs for Text Sentiment Analysis: The capabilities of large language models (LLMs) have been the focus of recent research efforts [30], [40], [41]. Zhong et al. [41] demonstrated that ChatGPT achieves performance comparable to fine-tuned BERT models. Zhang et al. [30], however, provided a more comprehensive evaluation across various sentiment analysis tasks. Their findings reveal that while LLMs under-perform in fine-grained sentiment analysis, they exhibit promising zero-shot capabilities in simpler tasks, such as binary sentiment classification. In this work, we leverage the ability of LLMs to coarsely annotate large corpora of emotional speech transcripts.

Fig. 1. Block diagram of the proposed model. The pre-training stage is shown in the grey box at the top. An ASR system is used to generate the transcripts for the pre-training data, which are annotated by a large language model (LLM) as positive, negative or neutral sentiment. These "silver" labels, together with the text transcripts, form the supervised training dataset for the RoBERTa-large model. A frozen CARE model [35] is used for extracting audio embeddings; both the text and speech embeddings thus use only unsupervised data. The MERITS-L model is trained in three stages (denoted as Stage I, II and III in the diagram), and the models trained in a particular stage are kept frozen for subsequent stages.
Emotion Recognition in Conversations: Recent methods that achieve strong performance on benchmark datasets often incorporate speaker identity [42]–[45]. For instance, Hu et al. [43] introduced a supervised contrastive loss, where utterances with the same emotion and speaker are treated as positive samples in a contrastive learning framework. Yu et al. [45] appended speaker embeddings to the utterance representations and utilized a language model, such as RoBERTa [28], to predict emotions via masked language modeling. In contrast, our work does not access speaker labels for any utterance in the conversation.
III. METHOD
A. Background
- CARE: In our recently proposed CARE model [35], speech is processed by two encoders: one focuses on the semantic aspect of speech by aligning with the mean-pooled RoBERTa representation of the corresponding ASR transcripts, while the other is trained to predict low-level descriptors of speech provided by the $\mathrm{PASE+}$ model [46]. The CARE embeddings are seen to perform better than most other base-sized models in the SUPERB-style evaluation [47]. In this paper, we utilize utterance-level CARE embeddings and train conversational models along with speech-text fusion.
- RoBERTa: One of the significant contributions to large-scale language modeling was proposed by Devlin et al. [26]. This architecture, called bidirectional encoder representations from transformers (BERT), was trained with two objectives on a corpus of textual data: predicting words masked out in a sentence, and predicting whether two sentences semantically follow each other (referred to as next sentence prediction). Liu et al. [28] trained this architecture on a larger corpus of textual data without the next sentence prediction task. The resulting pre-trained model is known as the robustly optimized BERT approach (RoBERTa). We use the pre-trained large version of this model, with 24 transformer layers, as the text encoder.
B. Proposed MERITS-L model
The block diagram of the proposed model is shown in Fig. 1.
- Problem Description: Given a set of utterances $U$ and a set of emotion labels $Y$, a conversation consisting of $K$ utterances is denoted by $\left[(u_{1},y_{1}),(u_{2},y_{2}),\dots,(u_{K},y_{K})\right]$, where $y_{j}\in Y$ is the emotion of utterance $u_{j}$ in the conversation. In our case, the ERC task uses the speech and text modalities, which means $u_{k}=\{S_{k},T_{k}\}$, where $S_{k}$ and $T_{k}$ refer to the speech signal and the text transcript associated with the utterance $u_{k}$. The objective of ERC is to predict the emotion label $y_{k}$ of each utterance $u_{k}$. This data layout is sketched below.
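For concreteness, the data layout can be written as follows; the `Utterance` and `Conversation` names are purely illustrative and not taken from the original implementation:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Utterance:
    """One conversational turn u_k = {S_k, T_k} with its emotion label y_k."""
    speech: np.ndarray  # speech signal S_k (e.g. a 16 kHz waveform)
    text: str           # transcript T_k
    label: int          # emotion index y_k, an element of Y


# A conversation with K utterances: [(u_1, y_1), ..., (u_K, y_K)];
# each label is stored inside its utterance here for convenience.
Conversation = List[Utterance]
```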
- LLM guided text pre-training: The text transcripts of the emotional speech corpus are first generated by an ASR system (Whisper-large-v3 [34]). Generally, the word error rates on emotional speech are higher than on neutral speech [48]. Given the ASR-generated transcripts, a large language model (LLM) is prompted to annotate each transcript with one of three classes, "positive", "negative" or "neutral". The pre-trained RoBERTa-large model is fine-tuned to predict these pseudo-classes (from the LLM predictions), and the resultant model is subsequently used as an utterance-level text feature extractor. This model is referred to as RoBERTa-FT in the subsequent sections of the paper; a sketch of this step is given below.
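A minimal sketch of this step with the HuggingFace transformers library is shown below. The fine-tuning loop itself is standard cross-entropy training and is omitted; the mean-pooling of the final-layer hidden states is our assumption, as the text above does not specify the pooling used for RoBERTa-FT:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Fine-tune RoBERTa-large on the three LLM pseudo-classes
# (positive / negative / neutral).
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=3)
model.eval()


def utterance_embedding(text: str) -> torch.Tensor:
    """Utterance-level text feature from the fine-tuned encoder
    (mean-pooled final-layer hidden states; the pooling is an assumption)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model.roberta(**inputs).last_hidden_state  # (1, L, 1024)
    return hidden.mean(dim=1).squeeze(0)                    # (1024,)
```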
C. Training
The training methodology for MERITS-L is performed in stages as mentioned below:
Stage I: All utterances, $U$, are collated and used to train text sentiment analysis and speech emotion recognition models for each dataset. While the RoBERTa-FT model is fine-tuned for each downstream dataset, small lightweight networks with frozen CARE embeddings as input are trained for the speech modality of every dataset. This training stage aims to classify the text transcript $(T_{k})$ and the speech signal $(S_{k})$ into the correct emotion category $(y_{k})$. The final-layer embeddings for each utterance ($T_{k}^{1}$ and $S_{k}^{1}$ for the text transcript and speech signal, respectively) are used for the next stage of training.
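A plausible form of the lightweight speech network for Stage I, assuming pre-computed, frozen utterance-level CARE embeddings as input (the layer sizes are illustrative, not from the paper):

```python
import torch.nn as nn


class SpeechEmotionHead(nn.Module):
    """Small classifier over frozen CARE utterance embeddings (Stage I).
    The penultimate activation serves as the feature S_k^1 for Stage II."""

    def __init__(self, in_dim=768, hidden=256, num_classes=4):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, care_emb):
        feat = self.proj(care_emb)      # S_k^1, passed on to Stage II
        return self.cls(feat), feat     # emotion logits and feature
```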
Stage II: This stage introduces the conversational nature of the data into the modeling framework. The text features of the utterances in a conversation from the previous stage, denoted by $\left(T_{1}^{1},T_{2}^{1},\ldots,T_{K}^{1}\right)$ for a conversation with $K$ utterances, are processed by a bidirectional gated recurrent unit (Bi-GRU) network with self-attention over the conversational context. This stage encourages the model to predict the emotion class of the utterance $u_{k}$ while keeping the entire conversation as part of the context. A similar modeling exercise is carried out for the speech modality. As in the previous stage, the features from the Bi-GRU with self-attention blocks are used for the final stage of training; they are denoted by $T_{k}^{2}$ and $S_{k}^{2}$ for the text and speech modality of utterance $u_{k}$, respectively. A sketch of this context model is given below.
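This is a minimal PyTorch sketch of the Stage II block; the hidden sizes and number of attention heads are illustrative, as the paper does not specify them here:

```python
import torch.nn as nn


class ConversationEncoder(nn.Module):
    """Bi-GRU over the sequence of Stage I utterance features, followed by
    self-attention across the conversational context (Stage II)."""

    def __init__(self, in_dim=256, hidden=128, num_classes=4, heads=4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)
        self.cls = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):                # feats: (1, K, in_dim)
        ctx, _ = self.gru(feats)             # (1, K, 2*hidden)
        ctx, _ = self.attn(ctx, ctx, ctx)    # self-attention over turns
        return self.cls(ctx), ctx            # per-utterance logits; ctx
                                             # yields T_k^2 / S_k^2
```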
Stage III: Notably, the previous stages trained the two modalities separately. In order to align the two modalities more effectively for emotion recognition, they are combined in this stage. We implement a co-attention fusion strategy for the two modalities as outlined in Fig. 2. The query-key-value sequences for the two modalities are the features obtained after Stage II of training; for example, in Fig. 2, $Q_{A}$ refers to the query sequence of the speech modality and is denoted as $\left(S_{1}^{2},S_{2}^{2},\ldots,S_{K}^{2}\right)$ for the conversation $C$. A sketch of the co-attention block follows.

Fig. 2. The co-attention network used in the proposed model. It consists of two sub-blocks - the cross-attention and the self-attention blocks.
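Below is a sketch of one direction of the co-attention block of Fig. 2, with the cross-attention and self-attention sub-blocks; the residual connections and normalization layout are our assumption, as the figure only names the two sub-blocks:

```python
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """One direction of the co-attention fusion (Fig. 2): queries from one
    modality attend over keys/values of the other, then a self-attention
    sub-block refines the fused conversation-level sequence."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q_mod, kv_mod):
        # e.g. q_mod = (S_1^2, ..., S_K^2), kv_mod = (T_1^2, ..., T_K^2)
        fused, _ = self.cross(q_mod, kv_mod, kv_mod)
        fused = self.norm1(q_mod + fused)
        out, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + out)
```

Consistent with the symmetric nature of the fusion noted in Sec. IV-C, the block would presumably be applied in both directions (speech queries over text, and text queries over speech) before the final emotion classifier.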
IV. EXPERIMENTS AND RESULTS
A. Datasets
Pre-training: The MSP-PODCAST corpus [49] is used for pre-training. A total of 149,307 speech turns, amounting to 230 hours of emotional speech data, is used. Of these samples, 80% are randomly chosen as the training set while the remaining 20% serve as the validation set. The Whisper-large-v3 model is used for generating the transcripts, which are annotated using the GPT-3.5 Turbo model.
TABLE I
RESULTS ON THE DATASETS IN TERMS OF WEIGHTED F1-SCORE.

| Modality (training stage) | IEMOCAP | MELD | CMU-MOSI |
|---|---|---|---|
| Audio (Stage I) | 66.93 | 47.61 | 68.77 |
| Audio (Stage II) | 77.95 | 49.32 | 69.43 |
| Text (Stage I) | 69.84 | 63.81 | 85.41 |
| Text (Stage II) | 83.85 | 65.24 | 86.02 |
| Audio + Text (Stage III) | 86.48 | 66.02 | 86.81 |
ERC datasets: Three datasets are used for evaluating MERITS-L on the ERC task - IEMOCAP [37], MELD [38] and CMU-MOSI [39].
IEMOCAP dataset: The IEMOCAP dataset consists of 151 video recordings split into 5 sessions. In line with previous works, we perform a four-way classification task over the "angry", "happy", "sad", "neutral" and "excited" categories, with the excited and happy categories merged. This gives a total of 5531 utterances across the four emotion labels. We use session 5 for testing, session 1 for validating our models, and sessions 2-4 for training.
MELD dataset: The MELD dataset is a multi-party dataset created from video clips of the popular TV show "Friends". The training data consists of 9988 utterances, the validation data of 1108 utterances, and the test data of 2610 utterances. A seven-way classification task is performed on this dataset, with each utterance labeled as one of 7 emotions: "angry", "sad", "joy", "neutral", "fear", "surprise" or "disgust".
CMU-MOSI dataset: The CMU-MOSI dataset has a total of 93 monologues divided into 2199 utterances. Each utterance is labeled in the range $[-3,3]$. Following previous works, we treat this as a binary classification problem, with utterances having sentiment values in the range $[-3,0)$ classified as negative sentiment and those with values in the range $[0,3]$ as positive sentiment. For dataset partitioning, we follow the prior work of Poria et al. [50], where the first 62 monologues are used for training and validation while the last 31 monologues are used for testing. Of the 62 monologues, we use 49 for training our model and the remaining 13 for validation.
B. Implementation details
Once the transcripts are generated by the Whisper model, the GPT-3.5 Turbo model is prompted as follows:
You are a sentiment classification bot. Given the [sentence], classify as positive, negative or neutral sentiment. Please give the sentiment and no extra text as output.
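A minimal sketch of this annotation call with the OpenAI Python client (client setup, batching and retry logic are omitted; the fallback to neutral for malformed outputs is our assumption):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("You are a sentiment classification bot. Given the [sentence], "
          "classify as positive, negative or neutral sentiment. Please give "
          "the sentiment and no extra text as output.")


def pseudo_label(sentence: str) -> str:
    """Query GPT-3.5 Turbo for a three-class sentiment pseudo-label."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": sentence}],
        temperature=0)
    label = resp.choices[0].message.content.strip().lower()
    # Guard against unexpected outputs (assumption, not from the paper).
    return label if label in {"positive", "negative", "neutral"} else "neutral"
```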
The RoBERTa-large model is pre-trained with the LLM-generated labels (3 classes) for a total of 10 epochs with a learning rate of 1e-4 and a batch size of 32. The different stages of MERITS-L are trained for a total of 50 epochs with a learning rate of 1e-4 and a batch size of 32. For all training, the cross-entropy loss with the AdamW optimizer [51] is used. The weighted F1-score is used as the metric for performance evaluation on the ERC datasets. Note that we have not used any additional labeled datasets in pre-training, as the pre-training frameworks for speech and text are purely based on self-supervised learning principles applied to raw data. Further, the downstream datasets are used without any knowledge of speaker meta-data.
C. Results
The results for the three stages are shown in Table I. We note that the performance of both modalities improves with every modeling stage. The introduction of contextual information significantly improves performance on the IEMOCAP dataset (relative improvements of 16.46% and 20.06% for the audio and text modalities, respectively). The two modalities perform comparably on the IEMOCAP dataset, unlike the other two datasets, where text emotion recognition performance is considerably higher than that of the audio modality. Finally, the multi-modal fusion aids all datasets, achieving relative improvements of 16.28%, 2.24% and 5.65% over the best performing modality (after Stage II) for IEMOCAP, MELD and CMU-MOSI, respectively. This also shows that the multi-modal fusion strategy is more effective when combining modalities of comparable performance, perhaps due to the symmetrical nature of the co-attention based fusion mechanism.

Fig. 3. The performance of the RoBERTa-large models on the different datasets. Different LLMs are used for generating pseudo emotion labels from speech transcripts. The performance of pre-trained RoBERTa without any supervised fine-tuning is also reported.

D. Evaluation with different LLMs
We design an oracle experiment in this regard. Using a prompt template similar to the one described in Sec. IV-B, we annotate the transcripts into the same three classes using Mixtral-8x7B-Instruct-v0.1 and Llama-3-8b-chat-hf. Since the MSP-PODCAST dataset provides valence-arousal-dominance values ranging from 1 to 7, we assign a positive label to samples with valence in the range (5, 7], a negative label to samples with valence in the range [1, 3), and the neutral label to the rest. With these labels serving as the ground truth, we find that GPT-3.5 Turbo achieves a label overlap of 52.98%, while Llama-3-8b-chat-hf is comparable with an overlap of 50.91%. The performance of Mixtral-8x7B-Instruct-v0.1 is the lowest, with an overlap of only 44.98%. The valence-to-label mapping is sketched below.
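The valence-to-label mapping and the overlap metric of this oracle experiment can be written compactly (a sketch; the function names are ours):

```python
def valence_to_label(v: float) -> str:
    """Map an MSP-PODCAST valence score in [1, 7] to an oracle class."""
    if v > 5:       # valence in (5, 7] -> positive
        return "positive"
    if v < 3:       # valence in [1, 3) -> negative
        return "negative"
    return "neutral"


def label_overlap(llm_labels, valences) -> float:
    """Fraction of LLM pseudo-labels that agree with the oracle labels."""
    oracle = [valence_to_label(v) for v in valences]
    agree = sum(a == b for a, b in zip(llm_labels, oracle))
    return agree / len(oracle)
```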
While the above benchmarking used oracle emotion labels from the speech dataset, we also perform a downstream evaluation using the three choices of LLM. The impact of the different LLM annotation abilities is shown in Fig. 3, which reports the performance of the text-modality model after Stage I training. The performance after Stage I with pre-trained RoBERTa (without any LLM guidance) is also shown for reference. We notice that the RoBERTa model fine-tuned with labels provided by GPT-3.5 Turbo achieves relative improvements of 8.22%, 5.01% and 21.9% for IEMOCAP, MELD and CMU-MOSI, respectively, over the pre-trained RoBERTa model. The relative improvements achieved by GPT-3.5 Turbo over Mixtral-8x7B-Instruct-v0.1 are 5.4%, 3.39% and 12.51% for IEMOCAP, MELD and CMU-MOSI, respectively. The largest performance improvement, on the CMU-MOSI dataset, may be attributed to the binary classification task in this dataset.
E. Importance of hierarchical training
In order to understand the impact of hierarchical training, we combine Stages II and III of MERITS-L. The impact of such a training philosophy is shown in Fig. 4. The change in training methodology has the largest effect on IEMOCAP, where the performance of MERITS-L drops from 86.48% to 82.91%. A performance drop of around 2% in absolute terms is also noticed for MELD and CMU-MOSI. The benefit of hierarchical training arises from the fact that end-to-end training of the different model components often leads to over-fitting, as these datasets are relatively small in size.

Fig. 4. The importance of hierarchical training in MERITS-L
TABLE II
COMPARISON WITH DIFFERENT METHODS. * STANDS FOR RESULTS WITH A TRI-MODAL SYSTEM (TEXT, AUDIO AND VISUAL MODALITIES). ALL NUMBERS ARE WEIGHTED F1-SCORES. THE BEST PERFORMING MODEL FOR EACH DATASET IS MARKED IN BOLD, WHILE UNDERLINE INDICATES THE NEXT BEST PERFORMING MODEL.

| Method | IEMOCAP | MELD | CMU-MOSI |
|---|---|---|---|
| LMFN [52] | 82.54 | | 80.92 |
| M3ER [53] | 82.40 | | |
| DialogueTRM [54] | | 63.55 | |
| SMIN [55] | 87.47 | 64.50 | 81.45 |
| UniMSE [56] | | 65.51* | 85.78 |
| EmoCaps [33] | | 63.73 | |
| MERITS-L | 86.48 | 66.02 | 86.81 |
F. Comparison with other work
We compare the performance of MERITS-L with recent works in Table II. The proposed MERITS-L achieves the best performance among state-of-the-art models for MELD and CMU-MOSI. For IEMOCAP, however, the method by Lian et al. [55] outperforms MERITS-L by a margin of 1% absolute. Note that several methods, such as TelME [57] and EACL [45], achieve higher performance than MERITS-L on the MELD dataset; however, these methods use information about the speaker identity of each spoken utterance and hence are excluded from the comparison in this work.
V. SUMMARY
In this paper, we first propose a novel way of supervised pre-training for text-based emotion recognition using LLM guidance. Text is extracted from an emotional speech corpus, following which a text emotion recognition model is trained to classify each transcript using LLM-provided pseudo-labels. With this text-based recognition model as the utterance-level text embedding extractor, we propose MERITS-L, a model for emotion recognition in conversations using the speech and text modalities. A hierarchical way of training the model is proposed, starting with utterances from a single modality, followed by contextual modeling at the conversational level and, subsequently, the alignment of the two modalities. Comparisons with other state-of-the-art works indicate the superiority of our method on two of the three datasets considered in this work.
