[Paper Translation] Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering


Original paper: https://arxiv.org/pdf/2503.11197v1


Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering


{ligang5,liujizhong1}@xiaomi.com


Abstract


Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of $64.5\%$. The main findings in this technical report are as follows: 1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi/r1-aqa and https://huggingface.co/mispeech/r1-aqa.


1 Introduction


The latest breakthroughs in large language models (LLMs) have greatly enhanced their reasoning abilities, particularly in mathematics and coding. DeepSeek-R1 [1], a pioneering innovator, has demonstrated how reinforcement learning (RL) can effectively improve LLMs' complex reasoning capabilities. Although it relies only on a simple rule-based reward model, chain-of-thought (CoT) reasoning effectively aids reasoning tasks by simulating the human thought process. Therefore, reinforcement learning is likely to achieve performance far surpassing supervised fine-tuning (SFT) through simple methods. Recently, many researchers [2; 3] have attempted to incorporate simple yet ingenious RL methods into visual modality understanding and reasoning tasks.


However, the audio modality has largely been overlooked in recent developments. Although large audio language models (LALMs) such as Qwen2-Audio [5] and Audio Flamingo 2 [6] have been increasingly proposed, LALMs still rely on pre-trained modules along with SFT to construct systems. In fact, it is not that RL-based approaches are unsuitable for LALMs; rather, tasks such as automatic speech recognition (ASR) and automated audio captioning (AAC) are simple descriptive tasks [7]. More complex logical reasoning tasks are needed to fully explore the potential of RL in the audio modality.



Figure 1: An example of a question and choices based on audio.


Audio Question Answering (AQA) is a multimodal task that involves understanding and reasoning based on audio content to generate accurate responses to questions. It integrates both the auditory modality and the linguistic modality, making it particularly suitable for evaluating complex logical reasoning capabilities. It demands the ability to extract meaningful insights from raw audio signals, infer implicit relationships, and provide contextually relevant answers. Due to its inherent complexity, AQA serves as an ideal benchmark for testing the effectiveness of reinforcement learning approaches. In addition, AQA can be considered an advanced technology built upon automated audio captioning.


Based on the above reasons, we take AQA as a topic to explore the effectiveness of RL and deep thinking in the audio modality. Meanwhile, it is encouraging to see that some researchers have already started such attempts [8; 9]. In this report, we present a successful technical attempt, in which the group relative policy optimization (GRPO) algorithm [10], with a small-scale dataset, improves the reasoning performance of Qwen2-Audio-7B-Instruct [5] on the AQA task. Our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of $64.5\%$. In summary, the main findings are as follows:

1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters;
2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets;
3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research;
4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.


2 Related Works


Audio Question Answering. AQA is a multimodal task that involves understanding and reasoning over audio content to generate accurate responses to questions. In LLM frameworks, AQA builds upon AAC. While AAC focuses on generating descriptive textual captions for audio, AQA requires a deeper comprehension of complex acoustic patterns, temporal relationships, and contextual information embedded in the audio. Although researchers have achieved good performance on the AAC task [11; 12; 13; 14; 15; 16], AQA remains a multimodal challenge that combines auditory and linguistic modalities, making it ideal for evaluating complex reasoning. AQA can be categorized based on audio type into single-audio and multi-audio tasks, and based on response format into selection-based and open-ended questions. As illustrated in Figure 1, we focus on the most common single-audio task with selection-based answers. Additionally, a multiple-choice setting with a single correct answer presents a significant generation-verification gap [17], making it a suitable setting for evaluating the effectiveness of RL in the audio modality: RL tends to perform well when verification is easy but generation is complex.


Figure 2: Prompt templates.


Multimodal Reasoning. Recent studies have indicated that deep thinking can enhance the reasoning performance of LLMs [1; 18]. DeepSeek-R1 and OpenAI-o1 have significantly improved reasoning capabilities, particularly in multi-step tasks such as coding and mathematics, sparking renewed interest in RL and CoT. Furthermore, RL and CoT are playing an increasing role in multimodal large language models. For instance, Visual Thinker R1 Zero [3] implements RL for visual reasoning on a 2B non-SFT model. LLaVA-CoT [4] outperforms its base model by $7.4\%$ using only 100k post-training samples and a simple yet effective inference-time scaling method. Recent studies have also explored CoT-based methods in the audio modality. Audio-CoT [9] shows some improvements via zero-shot CoT, but the improvement is very limited. Audio-Reasoner [8] has achieved significant improvements through extensive fine-tuning data and a complex reasoning process, but it lacks thorough ablation studies, and some of these processes may be redundant. Overall, studies on deep thinking in the audio modality are still limited.


Large Audio Language Models. LALMs can generally be divided into two categories: audio understanding and audio generation. This study focuses on the field of audio understanding. There are already many representative models, such as Qwen2-Audio [5], Audio Flamingo 2 [6], and SALMONN [19]. However, they are all SFT-based models, and whether RL can unlock their potential remains to be studied.


3 Method


In this section, we present the training method for our exploration, which leads to state-of-the-art performance on the MMAU Test-mini benchmark. The goal is to apply the GRPO algorithm directly to LALMs.


Our exploration is based on the Qwen2-Audio-7B-Instruct model [5]. We leverage the GRPO algorithm along with a customized chat template and prompting strategy to enhance its reasoning capabilities. Compared to SFT, RL may be a more efficient and effective way to adapt to downstream tasks [3; 4]. In SFT, instruction tuning requires a large amount of training data; whether this condition is necessary in RL is one of the key questions. We train our models with the instruction templates in Figure 2. For each question in the dataset, the model generates a response in the <answer> </answer> template, which is then optimized using an RL objective. The difference is that Prompt <2> does not require the model to explicitly output its reasoning process, whereas Prompt <3> does (i.e., CoT). Since the <think> content cannot be supervised, the placement of the answer within <answer> tags in SFT is merely a formatting choice; SFT adopts the simplest Prompt <1>.

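Since the exact templates from Figure 2 are not reproduced in this text, the sketch below shows one plausible way the three prompt variants could be assembled; the wording and function signature are assumptions for illustration, not the paper's exact templates:

```python
def build_prompt(question: str, choices: list[str], variant: int) -> str:
    """Assemble an AQA instruction in one of three hypothetical variants.

    variant 1: plain question + choices (the simplest form, used for SFT).
    variant 2: the final answer must be wrapped in <answer> tags (GRPO, no CoT).
    variant 3: reasoning in <think> tags, then the answer in <answer> tags (CoT).
    """
    # Label the choices (A), (B), ... and append them to the question.
    options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    base = f"{question} Choices: {options}"
    if variant == 1:
        return base
    if variant == 2:
        return base + " Output the final answer in <answer> </answer>."
    return (base + " First think step by step in <think> </think>, "
            "then output the final answer in <answer> </answer>.")

p = build_prompt("What sound is heard?", ["rain", "engine"], variant=2)
```

The tag-based formats make the responses machine-checkable, which is what the rule-based reward function in Section 3 relies on.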

A review of the GRPO algorithm used for training is provided, mainly following [3; 10]. To ease the burden of training an additional value-function approximation model as in proximal policy optimization (PPO) [20], GRPO employs the average reward of responses sampled from the policy model as the baseline in computing the advantage. Specifically, given an input question $q$, a group of responses $\{o_{1},o_{2},\cdots,o_{G}\}$ is first sampled, and their corresponding rewards $\{r_{1},r_{2},\cdots,r_{G}\}$ are computed using the reward model. The advantage is subsequently computed as:


$$\hat{A}_{i}=\frac{r_{i}-\mathrm{mean}(\{r_{1},r_{2},\cdots,r_{G}\})}{\mathrm{std}(\{r_{1},r_{2},\cdots,r_{G}\})}$$
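The group-relative advantage can be computed in a few lines; the following is a minimal sketch (the `eps` guard against a zero standard deviation is an assumption of this sketch, not taken from the paper):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward by the group's mean and standard deviation.

    This mirrors the GRPO baseline: no learned value function is needed,
    only the G responses sampled for the same question.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 sampled responses, 3 of them correct (+1 reward each).
advantages = group_relative_advantages([1, 0, 0, 1, 0, 0, 1, 0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, and the advantages of a group sum to zero, which is exactly what replaces the value-function baseline of PPO.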

Table 1: Hyper-parameters of reinforcement learning with the GRPO algorithm.


Setting                               Value
Batch Size per Device                 1
Gradient Accumulation Steps           2
Training Steps                        500
Learning Rate                         $1\times10^{-6}$
Temperature                           1.0
Maximum Response Length               512
Number of Responses per GRPO Step     8
Kullback-Leibler Coefficient          0.04

The policy model is subsequently optimized by maximizing the following clipped objective, which includes a Kullback-Leibler (KL) divergence penalty:


$$\mathcal{J}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{old}(o_{i}|q)}\hat{A}_{i},\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{old}(o_{i}|q)},1-\epsilon,1+\epsilon\right)\hat{A}_{i}\right)-\beta\,\mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right)\right]$$

where $\pi_{\theta}$ and $\pi_{old}$ are the current and former policies, $\pi_{ref}$ is the frozen reference policy, and $\epsilon$ and $\beta$ are hyper-parameters introduced in PPO. Responses are evaluated by a rule-based reward function in terms of their format and correctness:


• If the response provides a correct final answer, the model obtains an accuracy reward of $+1$.
• If the response encloses its thinking in <think> </think> (when the prompt requires that format) and the final answer in <answer> </answer>, the model obtains a format reward of $+1$.
• Otherwise, the model receives a reward of 0.

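These rules can be sketched as a small rule-based reward function; the regular expressions and exact-match answer check are illustrative assumptions, not the authors' exact code:

```python
import re

def compute_reward(response: str, correct_answer: str, require_think: bool) -> int:
    """Rule-based reward: +1 for a correct final answer, +1 for correct format."""
    reward = 0
    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    # Accuracy reward: the text inside <answer> matches the ground truth.
    if answer_match and answer_match.group(1).strip() == correct_answer:
        reward += 1
    # Format reward: <answer> tags are present, plus <think> tags when the
    # prompt (e.g. Prompt <3>) asks for an explicit reasoning trace.
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    if answer_match and (has_think or not require_think):
        reward += 1
    return reward

r = compute_reward("<think>dog barking</think><answer>B</answer>", "B", require_think=True)
```

Because the reward is computed purely from string matching, no learned reward model is needed, which keeps the RL pipeline simple and cheap.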

4 Experiments


In this study, we train our models via full fine-tuning, low-rank adaptation (LoRA) [21], and reinforcement learning. To effectively evaluate generalization, the evaluation follows an out-of-distribution testing approach, where the training and test sets come from different data sources.


4.1 Experimental Setup


Datasets. The training data is from the AVQA dataset [22], which is designed for audio-visual question answering and provides a comprehensive resource for understanding multimodal information in real-life video scenarios. It comprises 57015 videos depicting daily audio-visual activities, accompanied by 57335 specially designed question-answer pairs. The questions are crafted to involve various relationships between objects and activities. We only use the audio-text pairs of the training subset and change the word "video" in each question to "audio". The AVQA training set has approximately 38k samples. The test data is from the MMAU dataset [7]. MMAU is a comprehensive dataset designed to evaluate advanced LALMs on complex tasks requiring expert-level knowledge and reasoning. It consists of 10000 audio clips, each paired with human-annotated natural language questions and answers, encompassing domains such as speech, environmental sounds, and music. MMAU released a full suite comprising 1000 Test-mini samples, while the questions of the remaining 9000 Test samples are available without their answers. Thus we only use the Test-mini subset as the test set.

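The text substitution described above ("video" → "audio" in AVQA questions) can be sketched as follows; the field names (`question`, `audio_path`, and so on) are hypothetical assumptions about the dataset layout, for illustration only:

```python
def avqa_to_aqa(example: dict) -> dict:
    """Turn an AVQA training example into an audio-only QA example.

    Only the audio track and the text fields are kept; the word "video"
    in the question is rewritten to "audio" so the prompt matches the
    audio-only input.
    """
    question = example["question"].replace("video", "audio").replace("Video", "Audio")
    return {
        "audio_path": example["audio_path"],
        "question": question,
        "choices": example["choices"],
        "answer": example["answer"],
    }

sample = {
    "audio_path": "clips/000001.wav",
    "question": "What is making the sound in the video?",
    "choices": ["dog", "car", "violin", "rain"],
    "answer": "dog",
}
converted = avqa_to_aqa(sample)
```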

Implementation Details. The RL models are trained using eight NVIDIA H800 GPUs, with each device running a batch size of 1. The model is trained for 500 steps with a learning rate of $1\times10^{-6}$ and a temperature of 1.0. All hyper-parameters are given in Table 1. We conduct comparative experiments on full fine-tuning and LoRA, with each device running a batch size of 4 and a gradient accumulation step of 1. The SFT models are trained using the AdamW optimizer with a learning rate of $5\times10^{-6}$ for a total of 4 epochs, with a checkpoint saved every 200 steps. The best checkpoint is selected for the final analysis.


Table 2: Accuracies $(\%)$ on the MMAU Test-mini benchmark. Human denotes the result of human testing, and Strong Cap. denotes that the model inputs are strong audio captions generated by Qwen2-Audio-Instruct. Blue represents methods based on reinforcement learning, and an underline indicates the method with the highest average accuracy.

* The data are sourced from the MMAU official website: https://sakshi113.github.io/mmau_homepage/


Model                               Method               Sound   Music   Speech  Average
Baselines:
Human                               -                    86.31   78.22   82.17   82.23
Gemini Pro 2.0 Flash                Direct Inference     56.46   58.68   51.65   55.60
Audio Flamingo 2                    Direct Inference     61.56   73.95   30.93   55.48
GPT4o + Strong Cap.                 Direct Inference     57.35   49.70   64.86   57.30
Llama-3-8B-Instruct + Strong Cap.   Direct Inference*    50.75   48.93   55.25   52.10
Gemini Pro v1.5                     Direct Inference     56.75   49.40   58.55   54.90
Qwen2-Audio-7B-Instruct             Direct Inference     54.95   50.98   42.04   49.20
Qwen2-Audio-7B-Instruct             CoTA [8]             60.06   64.30   60.70   61.71
Qwen2-Audio-7B-Instruct             Zero-Shot-CoT [9]    61.86   56.29   55.26   57.80
Ours:
Qwen2-Audio-7B-Instruct             Full + Prompt <1>    60.96   49.19   45.35   51.80
Qwen2-Audio-7B-Instruct             LoRA + Prompt <1>    58.26   58.08   52.58   56.40
Qwen2-Audio-7B-Instruct             GRPO + Prompt <2>    69.37   66.77   57.36   64.50
Qwen2-Audio-7B-Instruct             GRPO + Prompt <3>    66.67   62.87   53.75   61.10


4.2 Main Results


To evaluate the effectiveness of reinforcement learning, we compare both SFT methods and RL methods on the MMAU Test-mini benchmark. Specifically, there are three types of strategies: direct inference by LALMs, fine-tuning LALMs with SFT, and fine-tuning LALMs with RL. The main results are given in Table 2. The baselines are derived from the top six entries on the official MMAU leaderboard and two recent studies based on reinforcement learning. Using the GRPO algorithm and Prompt <2>, we achieve state-of-the-art average accuracy on the MMAU Test-mini benchmark. Nevertheless, all LALMs still lag far behind humans, which suggests that there is still a pressing need to improve the understanding and reasoning capabilities of LALMs.


Furthermore, deep thinking methods demonstrate overall superior performance to classic SFT. The top four methods in Table 2 are all RL or CoT approaches: GRPO + Prompt <2>, GRPO + Prompt <3>, Audio-Reasoner (CoTA) [8], and Audio-CoT (Zero-Shot-CoT) [9]. AQA is a task that is challenging to generate but easy to verify: extracting the correct answer from the audio content is highly challenging, while verifying the options is straightforward. The AQA task empirically verifies the conclusion in [17] that tasks with a generation-verification gap are suitable for RL. Additionally, RL has demonstrated strong generalization with a limited amount of training data. Figure 3 illustrates the convergence process of the GRPO algorithm. Using only 38k AVQA training samples, we achieved state-of-the-art performance on out-of-distribution tests. This aligns with the finding of LLaVA-CoT [4] that structured thinking is well-suited for small-sample training. Whether the extensive and complex fine-tuning data utilized in Audio-Reasoner is truly necessary remains an open question. Another intriguing question is whether insights from the vision modality [3] can be utilized to facilitate auditory reasoning in a non-SFT model.


However, we find some differences in the implementation of RL compared to previous studies [1; 3]. Explicit CoT may not outperform letting the model think on its own: neither the CoT templates in [8; 9] nor the simple <think> </think> format in Prompt <3> outperforms directly prompting the model to generate the <answer> </answer>. At least for AQA tasks, how deep thinking and step-by-step reasoning contribute to the task requires further exploration.



Figure 3: Convergence processes of the GRPO algorithm.



Figure 4: Convergence processes of the full and LoRA fine-tuning.


As shown in Figure 4, SFT struggles to converge with a small-scale dataset. Full fine-tuning appears to fit the training set quickly, but its accuracy on the out-of-distribution MMAU Test-mini set decreases as the model fits the training set more closely. This further confirms that RL fine-tuning is more suitable than SFT for small-scale datasets. Although LoRA partially addresses the challenges of fine-tuning on small-scale datasets, SFT still performs worse than RL: the accuracy of LoRA + Prompt <1> is $8.1\%$ lower than that of GRPO + Prompt <2>, emphasizing the need to explore RL-based fine-tuning approaches.


5 Conclusion


This report presents a case study on reinforcement learning and supervised fine-tuning for audio question answering. We directly apply the GRPO algorithm to Qwen2-Audio-7B-Instruct, and achieve state-of-the-art performance on the MMAU Test-mini benchmark, with an accuracy rate of $64.5\%$. This result indicates that reinforcement learning can be effectively applied to LALMs and audio multimodal tasks, especially AQA, which involves a significant generation-verification gap.


Even if the model has a small number of parameters and the fine-tuning dataset is limited, reinforcement learning remains applicable. In fact, on small-scale datasets, the strong generalization capability of reinforcement learning can be demonstrated even more clearly. However, the explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently leverage deep thinking and step-by-step reasoning remains an open question for further research. LALMs still lag far behind humans in auditory-language reasoning, suggesting that reinforcement learning warrants further exploration. Our future research will focus on the effective integration of the chain-of-thought paradigm into the audio modality.


References

