[Paper Translation] Self-rewarding Correction for Mathematical Reasoning


Original paper: https://arxiv.org/pdf/2502.19613v1


Self-rewarding correction for mathematical reasoning


Wei Xiong * 1 Hanning Zhang * 1 Chenlu Ye * 1 Lichang Chen 2 Nan Jiang 1 Tong Zhang


Abstract


We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during inference time without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment.


We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models’ ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.


1. Introduction


Large language models (LLMs) have demonstrated remarkable capabilities in reasoning-related tasks such as mathematics and coding. Notable examples include ChatGPT (OpenAI, 2023), Claude (Anthropic, 2023), and Gemini (Team et al., 2023). Following the release of GPT4-o1, LLMs with strong reasoning abilities have attracted even more attention, along with inference methods that enhance reasoning. A particularly desirable property of such models is their ability to detect inconsistencies and errors in self-generated responses—based on feedback to their prior outputs—and correct these errors to produce improved responses. This process is often referred to as self-correction in the literature (Welleck et al., 2022; Madaan et al., 2024; Kim et al., 2024).


When an external ground-truth reward model is available, studies (Kim et al., 2024; Qu et al., 2024; Shinn et al., 2024) have shown that LLMs can refine their initial responses based on external gold reward feedback and determine when to terminate the self-correction loop. These approaches have proven effective for both mathematical reasoning and general agent tasks. Moreover, even when relying on imperfect proxy rewards, models can still achieve higher accuracy in revised responses by leveraging feedback from an outcome-based reward model (see Section 5 for empirical results). However, since these reward models are often themselves LLMs, deploying them requires running multiple models during inference, which increases computational costs and deployment complexity. In contrast, without external reward feedback, current LLMs struggle to refine their initial responses solely based on their intrinsic capabilities—a limitation known as intrinsic self-correction (Huang et al., 2023).


While reward models are traditionally trained with an additional scalar head for general-purpose chat (Ouyang et al., 2022; Bai et al., 2022; Touvron et al., 2023) and reasoning tasks (Cobbe et al., 2021a; Lightman et al., 2023), recent work suggests that LLMs themselves can generate reward signals in a generative way. For example, the LLM-as-a-judge approach (Zheng et al., 2023; Dubois et al., 2023) prompts the LLM to evaluate text outputs, effectively serving as a surrogate for human feedback. Another emerging direction explores generative reward models (Zhao et al., 2023; Dong et al., 2024; Zhang et al., 2024b; Mahan et al., 2024; Zhang et al., 2024a), which formulate evaluation tasks as instruction-following problems, using the probability of generating specific tokens as the reward value. These methods leverage LLMs’ next-token prediction capabilities, integrating generation and evaluation into a unified framework.


Building on these insights, this work investigates self-rewarding reasoning models that can incorporate three abilities within a single LLM: (i) generating step-by-step reasoning paths for given prompts, (ii) evaluating the correctness of generated responses, and (iii) revising and enhancing previous responses based on self-rewarding signals. Our key contributions are as follows:


  1. Self-rewarding reasoning framework. We introduce a self-rewarding reasoning framework for LLMs, which integrates the generator and reward model into a single LLM, enabling autonomous reasoning, evaluation, and correction. This unification simplifies the model’s decision-making process and reduces computational overhead compared to external reward-based approaches.
  2. Algorithmic framework for self-correction. We focus on self-correction in mathematical reasoning and propose a two-stage framework that relies only on self-generated data. In the first stage, we use sequential rejection sampling to construct long chain-of-thought (CoT) trajectories that encode both self-rewarding and self-correction behaviors. Fine-tuning models on these trajectories enables them to detect errors in their self-generated responses and revise previous attempts. In the second stage, we further enhance these patterns through reinforcement learning with rule-based signals.
  3. Empirical validation and analysis. Through extensive experiments, we show that self-rewarding correction significantly outperforms intrinsic self-correction. Additionally, we conduct ablation studies to investigate the learning dynamics of the proposed framework, providing deeper insights into its behavior and effectiveness. The training code and datasets are publicly available on GitHub.

2. Related Work


In this section, we review the works most related to our project.


Self-rewarding alignment. Our work aligns with research on self-rewarding alignment (Yuan et al., 2024b; Prasad et al., 2024); both our project and these methods share the spirit of unifying the generation ability and the evaluation ability within a single LLM. These methods leverage iterative DPO-type algorithms, where the model labels its own generated responses to provide training signals for subsequent iterations, enabling self-improvement. In contrast, our approach does not focus on self-improvement during training. Instead, we rely on an external ground-truth reward model to provide learning signals in training. Our study emphasizes inference-time alignment for reasoning-focused LLMs, where self-rewarding signals are employed solely to guide inference rather than training.


Self-correction. Our work is closely related to self-correction in LLMs. We refer interested readers to the survey (Pan et al., 2023) for a more comprehensive review and only cover some representative approaches most related to our project. Li et al. (2024) demonstrated that incorporating teacher model reflections into SFT data enhances students’ self-reflection abilities in general-purpose conversation tasks. However, for reasoning tasks, Huang et al. (2023) found that current LLMs—without additional training—fail to self-correct purely through intrinsic reasoning (i.e., prompting). This observation is also validated in Qu et al. (2024); Tyen et al. (2023); Zheng et al. (2024). A more in-depth analysis shows that most prior successful studies in this domain depend on external (ground-truth) reward models to determine when to initiate and terminate self-correction (Kim et al., 2024; Qu et al., 2024; Shinn et al., 2024; Madaan et al., 2024). Currently, there is no major work demonstrating that intrinsic self-correction (via prompting or fine-tuning) is reliably effective. Furthermore, because external reward models are typically LLM-based, these methods introduce additional computational overhead by requiring a multi-agent system for inference.


Recognizing this challenge, our study explores how LLMs can autonomously evaluate response quality and correct errors without external reward models. Specifically, we introduce a self-rewarding reasoning framework that enables a single LLM to perform error detection and self-correction effectively. Among the works in self-correction, the most relevant work is the recent Kumar et al. (2024), which employed a multi-turn deep RL approach to train self-correcting models. In comparison, this work introduces a new and general self-rewarding formulation for reasoning-focused LLMs, with self-correction as a representative application. Compared to the intrinsic correction and the framework in Kumar et al. (2024), one major difference is that our framework equips models with self-rewarding ability, enabling our models to intelligently scale inference compute by selectively revising the first attempts, which helps to reduce computational overhead by avoiding unnecessary iterations. We will also design experiments to illustrate this idea.


Algorithmically, our approach also differs from Kumar et al. (2024). We first use sequential rejection sampling to construct long CoT trajectories with both self-rewarding and self-correction patterns, which serve as warm-up fine-tuning data. We then enhance these behaviors through reinforcement learning (using either DPO-type algorithms or PPO) with rule-based signals. In contrast, Kumar et al. (2024) employed RLOO (Ahmadian et al., 2024) with a specialized reward function for a two-turn self-correction task. While their non-public models (Gemini) and implementation details (parameters, code) do not enable direct comparison, we believe that the multi-turn RL methods proposed by Kumar et al. (2024) could also complement the proposed self-rewarding framework and achieve better reasoning performance compared to standard reasoning models.


Rule-based RL for LLM mathematical reasoning. Rule-based reinforcement learning has received significant attention following the success of DeepSeek-R1 (DeepSeek-AI et al., 2025). Open-source efforts have since attempted to replicate its performance using Qwen models (Yang et al., 2024), including works such as Zeng et al. (2025); Cui et al. (2025); Zhang et al. (2025). These methods train LLMs using only a correctness score (whether the final answer is correct) and a format score (whether the final answer is output in a pre-determined format), in contrast to previous works that use a neural network-based reward model (Cobbe et al., 2021a; Lightman et al., 2023; Zhang et al., 2024a). In particular, DeepSeek-AI et al. (2025) observed that self-correction naturally emerges during RL training (referred to as an AHA moment in their report). However, our preliminary experiments, along with open-source replications using Qwen-2.5-Math (Liu et al., 2025; Zhang et al., 2025; Cheng et al., 2025), suggest that (i) the base models already exhibit some self-correction ability, though it is quite sparse, and (ii) vanilla rule-based RL cannot consistently enhance self-correction without additional design.

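As a rough illustration of such rule-based signals, the sketch below scores a response with a correctness check on the final boxed answer plus a format check. The `\boxed{}` extraction pattern and the 0.1 format weight are illustrative assumptions, not the exact rules used in the cited works.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: correctness score plus a small format bonus.

    The regex and the weights are illustrative; real systems normalize
    answers (fractions, units, LaTeX forms) before comparison.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    format_score = 1.0 if match else 0.0       # answer in the required format?
    correct = bool(match) and match.group(1).strip() == ground_truth.strip()
    return (1.0 if correct else 0.0) + 0.1 * format_score
```

Because the signal depends only on string matching against the known answer, it avoids training (and potentially hacking) a learned reward model.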

Interestingly, even when using the same algorithms and data, similar improvements in mathematical reasoning are not observed in models such as Llama (Meta, 2024; Touvron et al., 2023). We hypothesize that Qwen-2.5-Math and DeepSeek-R1 benefit from extensive pre-training on high-quality mathematical corpora (e.g., 1T tokens for Qwen-2.5-Math (Yang et al., 2024)), and that the AHA moment may stem from carefully curated data containing self-correction patterns in pre-training or a cool-down stage. Since these datasets are non-public, the exact details remain unknown.


In contrast, our study shows that a warm-up stage using a carefully curated SFT dataset (collected via sequential rejection sampling) enables models to learn self-correction patterns more reliably. This foundation allows rule-based RL to further enhance these behaviors in a stable manner. We also remark that our two-stage framework and most of the associated experiments are performed prior to the release of DeepSeek-R1.


3. Self-rewarding Reasoning Language Models


We formulate the self-rewarding reasoning process as a multi-turn Markov Decision Process (MDP). After observing an initial prompt $s^{1}=x\in\mathcal{X}$ from some distribution $d_{0}$ , an LLM, denoted as $\pi$ , will generate an initial reasoning attempt $a^{1}\sim\pi^{1}(\cdot|s^{1})$ from the action space $\mathcal{A}$ . The LLM then self-rewards its response by generating an evaluation:


$$y^{1}\sim\pi\big(\cdot\mid s^{1},a^{1}\big).$$

If the model assesses its answer as correct ($y^{1}=$ [VERIFY] correct; details provided later), the generation stops. Otherwise, the LLM proceeds to the next step, generating a refined response and evaluation:


$$a^{2}\sim\pi\big(\cdot\mid s^{2}\big),\qquad y^{2}\sim\pi\big(\cdot\mid s^{2},a^{2}\big),$$

where the generation is conditioned on the updated state $s^{2}=(s^{1},a^{1},y^{1})$ . The self-refinement process continues until the model produces a self-evaluation $y^{h}$ that assesses the answer as correct.

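The multi-turn rollout described above can be sketched as a simple loop, where `llm_generate` is a hypothetical sampling call standing in for $\pi$ and the `[VERIFY]` markers follow the self-evaluation format introduced later:

```python
def self_rewarding_inference(llm_generate, prompt, max_turns=2):
    """Roll out the self-rewarding MDP: alternate attempts a^h and
    self-evaluations y^h until the model accepts its own answer.

    `llm_generate(state) -> str` is a hypothetical sampler for pi.
    """
    state = [prompt]                      # s^1 = x
    attempt = None
    for _ in range(max_turns):
        attempt = llm_generate(state)     # a^h ~ pi(. | s^h)
        state.append(attempt)
        evaluation = llm_generate(state)  # y^h ~ pi(. | s^h, a^h)
        state.append(evaluation)
        if "[VERIFY] correct" in evaluation:
            break                         # model judges its answer correct
    return attempt, state
```

Because the loop terminates as soon as the model self-verifies an answer as correct, inference compute is spent on revision only when the model judges its first attempt to be wrong.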

We assume that we have access to the ground-truth verifier $r^{\star}:\mathcal{X}\times\mathcal{A}\rightarrow\{0,1\}$, which determines whether a response is correct. Throughout this study, we use the ToRA verification script (Gou et al., 2023), built on the Python library SymPy for symbolic mathematics. We also present a representative Example 1 to illustrate the process.

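As a simplified stand-in for such a verifier (not the actual ToRA script), one can check symbolic equivalence of final answers with SymPy:

```python
from sympy import simplify, sympify

def oracle_reward(prediction: str, ground_truth: str) -> int:
    """r*: X x A -> {0, 1}: return 1 iff the predicted final answer is
    mathematically equivalent to the ground truth. A simplified sketch;
    the real script also handles LaTeX answers, timeouts, etc.
    """
    try:
        diff = simplify(sympify(prediction) - sympify(ground_truth))
        return int(diff == 0)
    except Exception:
        # Fall back to exact string match for non-symbolic answers.
        return int(prediction.strip() == ground_truth.strip())
```

Symbolic simplification makes the check robust to equivalent forms (e.g., `2/4` versus `0.5`) that a plain string comparison would reject.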

Two-stage training framework. Following standard post-training practices for LLMs, we adopt a two-stage approach:


  1. Self-rewarding instruction-following fine-tuning (IFT). Starting with an initial LLM $\pi_{0}$ (e.g., a general-purpose chatbot), we collect demonstration data by a sequential rejection sampling process and fine-tune $\pi_{0}$ to obtain an improved model $\pi_{\mathrm{ref}}$, which integrates self-rewarding reasoning abilities.
  2. Reinforcement learning (RL) optimization. We further refine $\pi_{\mathrm{ref}}$ using RL, leveraging it as the reference model. This stage further enhances the model’s ability to assess correctness and refine previous responses.

3.1. Self-rewarding Instruction-following Fine-tuning


Self-rewarding by token prediction. To train the LLMs to evaluate the reasoning steps, we formulate this task as an instruction-following task, following prior works (Zhao et al., 2023; Dong et al., 2024; Liu et al., 2023; Ye et al., 2024; Wang et al., 2024; Zhang et al., 2024b). Specifically, we allow models to include reasoning in their evaluations while requiring them to output specific tokens to indicate their evaluation results. We experimented with different token choices, such as: (i) the prompt “Is the most recent final answer correct (Yes or No)?” with “Yes” and “No” as the response tokens, as used in (Xie et al., 2023; Zhang et al., 2024b); and (ii) explicit markers such as “[VERIFY] correct” and “[VERIFY] wrong”. Our experiments show no significant performance differences between these choices. During inference, rather than using the likelihood of “Yes” as a reward (as in (Zhao et al., 2023; Dong et al., 2024; Zhang et al., 2024b)), we sample the evaluation token from the distribution. This allows us to use a standard inference pipeline without any specific adjustment. See Table 1 for an example.

Table 1. An example of the self-rewarding reasoning path. We omit the detailed reasoning path for a clear presentation. The full trajectory is available at Table 13 in the Appendix.

Remark 3.1. We choose these specific tokens primarily for research simplicity. However, we expect that similar results can be achieved even if these special tokens are replaced with more natural language expressions, such as “wait”, “aha”, or “let me re-check the answer”, where one can also leverage the LLMs to complete this paraphrasing process.


Data collection by sequential rejection sampling. We employ a rejection sampling approach, similar to STaR (Zelikman et al., 2022) and RAFT (Dong et al., 2023), where we generate a large amount of self-correction trajectories and only preserve the desired trajectories. The major difference is that since the self-correction behavior is sparse in base models and self-rewarding pattern is missing, it is unlikely to collect the desired trajectory directly. In view of this, we sequentially prompt the base model and generate different steps separately. Then, we combine them into long CoT trajectories that incorporate both self-rewarding and self-correction patterns.


Our data collection process consists of the following steps:


  1. Generating initial reasoning responses: we collect training prompts from datasets such as MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021a) and sample $N_{1}=50$ initial responses $a^{1}$ per prompt as our base trajectories (see Section 5 for details of the experiment setup).
  2. Self-rewarding signal sampling: for each prompt and initial response, we further sample $N_{2}=8$ self-evaluations and keep at most one evaluation whose result agrees with the ground truth. Then, we split the samples into $G^{\mathrm{correct}}$ and $G^{\mathrm{wrong}}$ using the ground-truth verifier $r^{\star}$.
  3. Correction sampling: for each prompt and initial response in $G^{\mathrm{wrong}}$, we sample $M_{1}=8$ completions by providing the feedback that the initial response was wrong, so as to collect trajectories that successfully revise incorrect responses. For each prompt and initial response in $G^{\mathrm{correct}}$, however, we also tell the model that the response was incorrect and collect $M_{2}=4$ completions. By doing so, we additionally collect “correct-to-correct” trajectories in the face of a wrong judgment.

Eventually, we collect $8\times|G^{\mathrm{wrong}}|+4\times|G^{\mathrm{correct}}|$ full trajectories. Then, we filter the dataset and only keep the following types of data:


• $\mathcal{D}_{1}^{\mathrm{IFT}}$: wrong $a^{1}$, $y^{1}=$ [VERIFY] wrong, correct $a^{2}$;
• $\mathcal{D}_{2}^{\mathrm{IFT}}$: correct $a^{1}$, $y^{1}=$ [VERIFY] wrong, correct $a^{2}$;
• $\mathcal{D}_{3}^{\mathrm{IFT}}$: correct $a^{1}$, $y^{1}=$ [VERIFY] correct.

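The filtering rule above can be sketched as follows; the dictionary fields and the `oracle` callable (standing in for the ground-truth verifier $r^{\star}$) are illustrative names.

```python
def build_ift_dataset(trajectories, oracle):
    """Split sampled (prompt, a1, y1, a2) trajectories into the three
    kept subsets D1, D2, D3; everything else is discarded."""
    d1, d2, d3 = [], [], []
    for t in trajectories:
        a1_correct = oracle(t["prompt"], t["a1"])
        if t["y1"] == "[VERIFY] correct":
            if a1_correct:
                d3.append(t)   # correct a1 accepted by the model: D3
        elif t["y1"] == "[VERIFY] wrong" and oracle(t["prompt"], t["a2"]):
            if a1_correct:
                d2.append(t)   # correct-to-correct under a wrong judgment: D2
            else:
                d1.append(t)   # wrong a1 revised into a correct a2: D1
    return d1, d2, d3
```

Trajectories with a wrong judgment of a correct attempt but a failed revision, or an unrevised wrong attempt, are dropped.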

We provide an example of the data collection process in Table 2. Due to resource constraints, we limit the horizon to two iterations, and we preserve at most one trajectory per base sample to control the dataset size.

System: You are a mathematical reasoning assistant. For each problem, follow these steps strictly: 1. Solve the problem using step-by-step reasoning and output the final answer within \boxed{}. Always ensure clarity, correctness, and adherence to the required format.

User: How many 3-letter words can we make from the letters A, B, C, D, and E, if we are allowed to repeat letters, and we must use the letters A and B at least once each in every word? (Here, a word is an arbitrary sequence of letters.) Let’s think step by step and output the final answer within \boxed{}.

Assistant: Here’s how we can break it down: 1. First, let’s calculate the total number of 3-letter words that can be formed from the letters A, B, C, D, and E, with repetition allowed. ... 7. Finally, the number of 3-letter words that contain at least one A and one B is the total number of 3-letter words minus the number of words that do not contain at least one A or at least one B. The number of 3-letter words that can be made from the letters A, B, C, D, and E, where each word must contain at least one A and one B, is $\boxed{38}$.

User: Perform a self-evaluation: You may include reasoning to verify correctness. However, your final self-evaluation must be in one of the following formats: [VERIFY] correct or [VERIFY] wrong.

Assistant: Let’s perform a self-evaluation of the solution. We first calculate ... This is $125-101=24$. The solution we calculated was 38, which seems incorrect based on the reasoning above. The correct answer should be 24. Let’s output the self-evaluation: [VERIFY] wrong.

User: Please identify the mistake in your previous reasoning, revise your reasoning path, and output a corrected final answer within \boxed{}.

Assistant: Let’s correct the mistake in the previous reasoning. The correct steps should be as follows: 1. Calculate ... 6. Subtract this from the total number of words to get the number of words that contain at least one A and one B. This is $125-101=24$. The correct answer is indeed $\boxed{24}$.

Table 2. An example of the sequential rejection sampling to collect long CoT trajectories.

Then we fine-tune the LLMs using the standard SFT pipeline to maximize:

$$\mathbb{E}_{\tau\sim\mathcal{D}^{\mathrm{IFT}}}\Big[\sum_{h=1}^{H}\log\pi\big(a^{h},y^{h}\mid s^{h}\big)\Big],\quad\text{where }\mathcal{D}^{\mathrm{IFT}}=\mathcal{D}_{1}^{\mathrm{IFT}}\cup\mathcal{D}_{2}^{\mathrm{IFT}}\cup\mathcal{D}_{3}^{\mathrm{IFT}}.$$

In practice, however, we observe that the multi-task training can lead to stability issues and can slightly hurt the first-round performance. To mitigate this issue, we also train on the correct attempt $a^{1}$ for the samples in $\mathcal{D}_{3}^{\mathrm{IFT}}$.


3.2. KL-regularized Reinforcement Learning


In this stage, we aim to further enhance the self-rewarding IFT models using reinforcement learning. We consider both deep RL methods (Schulman et al., 2017) and direct alignment algorithms (Zhao et al., 2023; Rafailov et al., 2023; Azar et al., 2023; Liu et al., 2023).


Learning signal. To facilitate the reinforcement learning stage, we assume there exists a trajectory-wise reward function $u^{\star}(\tau)$ for a trajectory


$$\tau=\big(x,a^{1},y^{1},a^{2},y^{2},\ldots,a^{H},y^{H}\big).$$

However, instead of learning a proxy reward from data like the BT model in RLHF (Ouyang et al., 2022) or an outcome-supervised reward model (ORM) in the previous mathematical reasoning literature (Lightman et al., 2023), we primarily use the oracle reward


$$u^{\star}(\tau)=r^{\star}\big(x,a^{H}\big),$$

i.e., whether the final result is correct or not. The main advantage is that the oracle reward can largely mitigate the risk of reward hacking. This is also referred to as rule-based RL in the very recent literature (DeepSeek-AI et al., 2025). We will also study additional rule designs for either reward value assignment (PPO training) or data ranking (DPO training), where an implicit $u^{\star}$ is determined by the set of rules we use.


Following standard RLHF methodologies (Ouyang et al., 2022; Bai et al., 2022), we optimize the following KL-regularized objective:


$$\max_{\pi}\;\mathbb{E}_{x\sim d_{0}}\,\mathbb{E}_{\tau\sim\pi}\Big[u^{\star}(\tau)-\eta\sum_{h=1}^{H}\log\frac{\pi(a^{h}\mid s^{h})}{\pi_{\mathrm{ref}}(a^{h}\mid s^{h})}\Big].\qquad(2)$$

The optimal policy, as well as its associated optimal value, satisfies the following optimality condition (Xiong et al., 2024a; Xie et al., 2024a; Zhong et al., 2024).

Proposition 3.2. We can recursively define the following optimal value functions and optimal policies for a KL-regularized MDP with horizon $H$ and deterministic external observation. For the $Q$ value, we have

$$Q_{h}^{\star}(s^{h},a^{h})=\begin{cases}u^{\star}(\tau)&\text{if }h=H,\\ V_{h+1}^{\star}(s^{h+1})&\text{if }h<H,\end{cases}$$

while the optimal value function and optimal policy satisfy

$$V_{h}^{\star}(s^{h})=\eta\log\mathbb{E}_{a\sim\pi_{\mathrm{ref}}(\cdot|s^{h})}\exp\big(Q_{h}^{\star}(s^{h},a)/\eta\big),\qquad \pi_{h}^{\star}(a\mid s^{h})=\pi_{\mathrm{ref}}(a\mid s^{h})\exp\Big(\big(Q_{h}^{\star}(s^{h},a)-V_{h}^{\star}(s^{h})\big)/\eta\Big).$$

We remark that one advantage of the proposition is that it allows deterministic external messages (e.g., instruction prompts) in the state update, which will be useful when we consider a simplified research framework in Section 5.

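For intuition, the closed-form policy in Proposition 3.2 can be computed numerically for a single state with a discrete action set; here `eta` denotes the KL-regularization coefficient, and the action values and reference probabilities are toy inputs.

```python
import math

def kl_regularized_policy(q_values, ref_probs, eta=1.0):
    """Optimal policy pi*(a|s) proportional to pi_ref(a|s) * exp(Q(s,a)/eta),
    with the soft value V*(s) = eta * log E_{a ~ pi_ref} exp(Q(s,a)/eta)."""
    weights = [p * math.exp(q / eta) for p, q in zip(ref_probs, q_values)]
    z = sum(weights)               # normalizer, equal to exp(V*(s)/eta)
    return [w / z for w in weights], eta * math.log(z)
```

As `eta` grows, the KL penalty dominates and the policy collapses back to the reference distribution; as `eta` shrinks, it concentrates on the highest-Q action.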

We also adopt Direct Preference Optimization (DPO) (Rafailov et al., 2023; Azar et al., 2023; Zhao et al., 2023; Ethayarajh et al., 2024) to solve Equation 2, primarily due to computational constraints. In particular, we use the multi-turn DPO (M-DPO) framework from Xiong et al. (2024a), since it allows deterministic external observation in the state transition. To facilitate direct preference learning and bypass explicit reward training, we impose the following trajectory-level Bradley-Terry (BT) preference structure (Bradley & Terry, 1952). Specifically, given two trajectories $\tau^{1},\tau^{2}$, the probability of $\tau^{1}$ being preferred over $\tau^{2}$, denoted as $\tau^{1}\succ\tau^{2}$, is

我们还采用了直接偏好优化 (Direct Preference Optimization, DPO) (Rafailov et al., 2023; Azar et al., 2023; Zhao et al., 2023; Ethayarajh et al., 2024) 来解决公式 2,主要是由于计算限制。特别是,我们使用了 Xiong et al. (2024a) 提出的多轮 DPO (M-DPO) 框架,因为它允许在状态转换中进行确定性的外部观察。为了促进直接偏好学习并绕过显式的奖励训练,我们采用了以下轨迹级别的 Bradley-Terry (BT) 偏好结构 (Bradley & Terry, 1952)。具体来说,给定两条轨迹 $\tau^{1},\tau^{2}$,$\tau^{1}$ 比 $\tau^{2}$ 更受偏好的概率,记为 $\tau^{1}\succ\tau^{2}$,是
$$\mathbb{P}(\tau^{1}\succ\tau^{2})=\sigma\big(u^{\star}(\tau^{1})-u^{\star}(\tau^{2})\big),\qquad(4)$$

where $\sigma(z)=1/(1+\exp(-z))$ is the sigmoid function. Following Xiong et al. (2024a), we take log on both sides of (4), and connect a utility function $u_{\theta}$ with associated policy $\pi_{\theta}$ and value $V_{\theta}$ :

其中 $\sigma(z)=1/(1+\exp(-z))$ 是 sigmoid 函数。根据 Xiong 等人 (2024a) 的研究,我们对 (4) 两边取对数,并将效用函数 $u_{\theta}$ 与相关策略 $\pi_{\theta}$ 和值 $V_{\theta}$ 联系起来:

图片.png

For a pair of trajectories $\tau^{w},\tau^{l}$ where $\tau^{w}\succ\tau^{l}$ , we have

对于一对轨迹 $\tau^{w},\tau^{l}$,其中 $\tau^{w}\succ\tau^{l}$,我们有

图片.png

Taking this reward difference parameterization into the log-likelihood of the BT model $\sum_{(\tau^{w},\tau^{l})\in\mathcal{D}}\log\sigma\big(u_{\theta}(\tau^{w})-u_{\theta}(\tau^{l})\big)$, we obtain the loss function $\mathcal{L}_{\mathrm{M-DPO}}(\theta)$:

将奖励差异参数化考虑到 BT 模型的对数似然 $\sum_{(\tau^{w},\tau^{l})\in\mathcal{D}}\log\sigma\big(u_{\theta}\big(\tau^{w}\big)-$ $u_{\theta}(\tau^{l}))$ 中,我们得到损失函数 $\mathcal{L}_{\mathrm{M-DPO}}(\theta)$:

图片.png
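As a concrete illustration, the trajectory-level BT loss above can be sketched in a few lines of Python, assuming the utilities $u_{\theta}(\tau^{w}), u_{\theta}(\tau^{l})$ have already been computed (in M-DPO they reduce to sums of per-step log-probability ratios between $\pi_{\theta}$ and $\pi_{\mathrm{ref}}$, scaled by $\eta$); the function names here are illustrative, not from the paper's codebase.

```python
import math

def bt_loss(u_w, u_l):
    """Negative log-likelihood of the BT model for one preference pair:
    -log sigma(u_theta(tau_w) - u_theta(tau_l))."""
    margin = u_w - u_l
    # log sigma(z) = -log(1 + exp(-z)), computed in a numerically stable way
    if margin >= 0:
        log_sigmoid = -math.log1p(math.exp(-margin))
    else:
        log_sigmoid = margin - math.log1p(math.exp(margin))
    return -log_sigmoid

def m_dpo_loss(pairs):
    """Average BT loss over a dataset of (u_w, u_l) utility pairs."""
    return sum(bt_loss(u_w, u_l) for u_w, u_l in pairs) / len(pairs)
```

A larger utility margin for the preferred trajectory drives the loss toward zero; a zero margin gives $\log 2$.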

4. Experiment Results

4. 实验结果

Task, datasets, and data format. We evaluate models’ mathematical reasoning abilities using standard benchmarks, including MATH500 (Hendrycks et al., 2020), OlympiadBench (He et al., 2024), and Minerva Math (Lewkowycz et al., 2022). These datasets provide a moderate size for reliable and efficient model evaluation, covering topics such as algebra, geometry, probability, number theory, and calculus. For training, we mainly use the prompts in the NuminaMath-CoT dataset (Beeching et al., 2024). Specifically, we use a 50K subset for the self-rewarding IFT stage, a 10K subset for validation and model selection, and the remaining data for RL training. During inference, the model generates up to 4096 tokens, with vLLM 0.5.4 (Kwon et al., 2023) accelerating the process.

任务、数据集和数据格式。我们使用标准基准评估模型的数学推理能力,包括 MATH500 (Hendrycks et al., 2020)、OlympiadBench (He et al., 2024) 和 Minerva Math (Lewkowycz et al., 2022)。这些数据集为可靠且高效的模型评估提供了适中的规模,涵盖了代数、几何、概率、数论和微积分等主题。在训练过程中,我们主要使用 NuminaMath-CoT 数据集 (Beeching et al., 2024) 中的提示。具体来说,我们使用 50K 的子集进行自奖励 IFT 阶段,10K 的子集用于验证和模型选择,其余数据用于 RL 训练。在推理过程中,模型最多生成 4096 个 token,并使用 vLLM 0.5.4 (Kwon et al., 2023) 加速该过程。

Evaluation metrics. We employ two categories of metrics to evaluate our models: (1) mathematical reasoning and self-correction and (2) reward model accuracy. First, we follow Kumar et al. (2024) to consider the following metrics to evaluate the models’ ability of mathematical reasoning and self-correction.

评估指标。我们采用两类指标来评估我们的模型:(1) 数学推理与自我修正能力,以及 (2) 奖励模型的准确性。首先,我们遵循 Kumar 等人 (2024) 的研究,考虑以下指标来评估模型的数学推理与自我修正能力。

Due to the nature of the self-rewarding reasoning framework, we additionally include metrics that measure the model's accuracy as a reward model. We defer a more comprehensive study of the proposed framework, using a slightly simplified template, to the next section, where we additionally compute the ratio of modifying a correct answer to an incorrect one when facing a misleading reward.

由于自我奖励推理框架的性质,我们额外加入了衡量模型作为奖励模型的准确性的指标。我们将在下一节使用稍作简化的模板对所提框架进行更全面的分析,届时我们还将计算在面对误导性奖励时将正确答案修改为错误答案的比例。

  1. RM Accuracy $(a,b)$ : class-dependent accuracy for correct and incorrect trajectories. In other words, $a$ is the true positive rate and $b$ is the true negative rate;
  2. RM 准确度 $(a,b)$:正确和错误轨迹的类别相关准确度。换句话说,$a$ 是真阳性率,$b$ 是真阴性率;

Table 3. Main results of experiments with Qwen2.5-Math-7B-base. The single-turn baselines are used to train a regular CoT reasoning model. The baselines with † perform self-correction under the external prompt, where training may apply to enhance this ability. We use greedy decoding following the convention of the recent open-source projects on mathematical reasoning.

表 3: Qwen2.5-Math-7B-base 实验的主要结果。单轮基线用于训练常规的 CoT 推理模型。带有 † 的基线在外部提示下进行自我校正,其中训练可能用于增强这种能力。我们遵循最近开源数学推理项目的惯例,使用贪婪解码。

| Benchmark | Method | Turn 1 | Final Accuracy | △(t1,t2) | △i→c(t1,t2) | △c→i(t1,t2) |
|---|---|---|---|---|---|---|
| MATH-500 | Single-turn STaR/RAFT | 77.0 | 77.0 | | | |
| MATH-500 | Single-turn DPO | 76.8 | 76.8 | | | |
| MATH-500 | Single-turn PPO | 79.4 | 79.4 | | | |
| MATH-500 | Prompt with Gold RM† | 65.4 | 66.8 | 1.4 | 1.4 | 0.0 |
| MATH-500 | Intrinsic self-correction† | 65.4 | 51.4 | -14.0 | 1.4 | 15.4 |
| MATH-500 | STaR/RAFT for self-correction† | 71.6 | 70.4 | -1.2 | 5.0 | 6.2 |
| MATH-500 | STaR/RAFT+ for self-correction† | 72.0 | 71.2 | -0.8 | 3.0 | 3.8 |
| MATH-500 | Self-rewarding IFT | 72.6 | 77.2 | 4.6 | 5.0 | 0.4 |
| MATH-500 | Self-rewarding IFT + DPO w correctness | 72.8 | 78.6 | 5.8 | 6.0 | 0.2 |
| MATH-500 | Self-rewarding IFT + PPO w correctness | 75.8 | 80.2 | 4.4 | 4.8 | 0.4 |
| OlympiadBench | Single-turn STaR/RAFT | 40.1 | 40.1 | | | |
| OlympiadBench | Single-turn DPO | 39.0 | 39.0 | | | |
| OlympiadBench | Single-turn PPO | 39.5 | 39.5 | | | |
| OlympiadBench | Prompt with Gold RM† | 23.4 | 25.6 | 2.2 | 2.2 | 0.0 |
| OlympiadBench | Intrinsic self-correction† | 23.4 | 18.1 | -5.3 | 2.2 | 7.5 |
| OlympiadBench | STaR/RAFT for self-correction† | 36.5 | 32.5 | -4.0 | 7.2 | 11.2 |
| OlympiadBench | STaR/RAFT+ for self-correction† | 35.7 | 35.5 | -0.2 | 3.2 | 3.4 |
| OlympiadBench | Self-rewarding IFT | 35.4 | 39.4 | 4.0 | 4.7 | 0.7 |
| OlympiadBench | Self-rewarding IFT + DPO w correctness | 37.6 | 40.1 | 2.5 | 3.5 | 1.0 |
| OlympiadBench | Self-rewarding IFT + PPO w correctness | 41.0 | 43.4 | 2.4 | 2.8 | 0.4 |
| Minerva Math | Single-turn STaR/RAFT | 32.0 | 32.0 | | | |
| Minerva Math | Single-turn DPO | 31.6 | 31.6 | | | |
| Minerva Math | Single-turn PPO | 33.1 | 33.1 | | | |
| Minerva Math | Prompt with Gold RM† | 9.9 | 11.7 | 1.8 | 1.8 | 0.0 |
| Minerva Math | Intrinsic self-correction† | 9.9 | 8.4 | -1.5 | 1.8 | 3.3 |
| Minerva Math | STaR/RAFT for self-correction† | 28.7 | 29.4 | 0.7 | 1.8 | 1.1 |
| Minerva Math | STaR/RAFT+ for self-correction† | 25.7 | 25.3 | -0.4 | 0.8 | 1.2 |
| Minerva Math | Self-rewarding IFT | 23.2 | 28.7 | 5.5 | 7.3 | 1.8 |
| Minerva Math | Self-rewarding IFT + DPO w correctness | 26.8 | 34.6 | 7.8 | 9.6 | 1.8 |
| Minerva Math | Self-rewarding IFT + PPO w correctness | 34.0 | 38.4 | 4.4 | 5.1 | 0.7 |
  1. Ratio $p^{c\rightarrow i}(t_{1},t_{2})$ : probability of modifying a correct answer to incorrect when facing a misleading reward.
  2. 比率 $p^{c\rightarrow i}(t_{1},t_{2})$ :在面对误导性奖励时,将正确答案修改为错误答案的概率。
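Under the metric definitions above, all of these quantities can be computed from paired first- and second-turn correctness labels. A minimal sketch (function and variable names are illustrative, not from the paper's codebase):

```python
def correction_metrics(turn1, turn2):
    """turn1, turn2: lists of booleans giving the correctness of the same
    problem at turn 1 and at turn 2 (after self-correction)."""
    n = len(turn1)
    acc1 = sum(turn1) / n
    acc2 = sum(turn2) / n
    # Note the identity: delta = i_to_c - c_to_i
    i_to_c = sum((not a) and b for a, b in zip(turn1, turn2)) / n  # Δ^{i→c}(t1,t2)
    c_to_i = sum(a and (not b) for a, b in zip(turn1, turn2)) / n  # Δ^{c→i}(t1,t2)
    return {"turn1": acc1, "final": acc2, "delta": acc2 - acc1,
            "i_to_c": i_to_c, "c_to_i": c_to_i}

def rm_accuracy(is_correct, predicted_correct):
    """Class-dependent reward-model accuracy (a, b): true-positive rate on
    correct trajectories and true-negative rate on incorrect ones."""
    tp = sum(c and p for c, p in zip(is_correct, predicted_correct))
    tn = sum((not c) and (not p) for c, p in zip(is_correct, predicted_correct))
    n_pos = sum(is_correct)
    n_neg = len(is_correct) - n_pos
    return tp / n_pos, tn / n_neg
```

The identity $\Delta(t_{1},t_{2})=\Delta^{i\rightarrow c}(t_{1},t_{2})-\Delta^{c\rightarrow i}(t_{1},t_{2})$ holds by construction and is a useful sanity check on reported tables.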

For all evaluations, we use zero-shot CoT prompting and greedy decoding following the convention of recent projects with Qwen-2.5-Math models.

在所有评估中,我们遵循 Qwen-2.5-Math 模型的近期项目惯例,使用零样本 CoT 提示和贪婪解码。

Experiment setup of self-rewarding IFT. We use Qwen2.5-Math-7B-base as the base model, which is continuously pre-trained on extensive mathematical and instruction-following data. Sequential rejection sampling (introduced in Section 3.1) is used for data collection, resulting in a dataset of 32K trajectories, where we roughly balance correct and incorrect first attempts. In fine-tuning, samples are packed into 8192-token blocks, and we use a learning rate of 1e-5, a cosine scheduler, and a 0.05 warm-up ratio. The global batch size is set to 32. We train the models for three epochs and eventually select the checkpoint at the end of the first epoch.

自奖励 IFT 的实验设置。我们使用 Qwen2.5-Math-7B-base 作为基础模型,该模型在广泛的数学和指令跟随数据上进行了持续预训练。数据收集采用顺序拒绝采样(在第 3.1 节中介绍),生成了一个包含 32K 条轨迹的数据集,其中我们大致平衡了正确和错误的首次尝试。在微调过程中,样本被打包成 8192-token 的块,我们使用 1e-5 的学习率、余弦调度器和 0.05 的预热比例。全局批量大小设置为 32。我们训练模型三个 epoch,最终选择第一个 epoch 结束时的模型。
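The packing step mentioned above (concatenating tokenized samples into 8192-token blocks for fine-tuning) can be sketched as a simple greedy loop; tokenization itself is abstracted away, and the function name is illustrative:

```python
def pack_samples(token_lists, block_size=8192):
    """Greedily concatenate tokenized samples into fixed-size blocks.
    A sample longer than block_size is truncated; otherwise samples are
    appended to the current block until the next one would overflow it."""
    blocks, current = [], []
    for tokens in token_lists:
        tokens = tokens[:block_size]
        if len(current) + len(tokens) > block_size:
            blocks.append(current)
            current = []
        current.extend(tokens)
    if current:
        blocks.append(current)
    return blocks
```

Real packing recipes also insert separator tokens and build attention masks so that packed samples do not attend to each other; those details are omitted here.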

Experiment setup of reinforcement learning. For iterative DPO training, we adopt setups from Xiong et al. (2024a) with a learning rate of $2\times10^{-7}$, a cosine scheduler, and a batch size of 32. We tune $\eta\in\{0.1,0.5\}$ and train both with and without an NLL loss in the DPO objective (Pang et al., 2024; Xie et al., 2024a; Liu et al., 2024). For each iteration, we use 20K prompts and collect 8 responses per prompt. Then, we extract the comparison pairs using the correctness score. If all responses admit the same score, we skip the prompt. A 10K validation set from NuminaMath-CoT is used for model selection. The primary metric for model selection is accuracy at turn 2. When models achieve comparable turn-2 accuracy, we choose the one with the higher $\Delta(t_{1},t_{2})$ improvement. The best model across these training setups is used as the representative model. For PPO training, we mainly follow a publicly available example script of veRL (Sheng et al., 2024).

强化学习的实验设置。对于迭代 DPO 训练,我们采用 Xiong 等人 (2024a) 的设置,学习率为 $2\times10^{-7}$,使用余弦调度器,批量大小为 32。我们调整 $\eta\in\{0.1,0.5\}$,并在 DPO 目标中分别尝试包含和不包含 NLL 损失 (Pang 等人, 2024; Xie 等人, 2024a; Liu 等人, 2024)。每次迭代中,我们使用 20K 个提示,每个提示收集 8 个响应。然后,我们使用正确性分数提取比较对。如果所有响应的分数相同,则跳过该提示。我们使用 NuminaMath-CoT 的 10K 验证集进行模型选择。模型选择的主要指标是第二轮准确率。当模型的第二轮准确率相近时,我们选择 $\Delta(t_{1},t_{2})$ 改进更大的模型。这些训练设置中的最佳模型被用作代表模型。对于 PPO 训练,我们主要遵循 veRL (Sheng 等人, 2024) 的公开示例脚本。
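The comparison-pair construction described above (8 responses per prompt, ranked by the rule-based correctness score, skipping prompts where all scores tie) can be sketched as follows; the names and the one-pair-per-prompt default are illustrative assumptions:

```python
import random

def extract_pairs(responses_with_scores, pairs_per_prompt=1, seed=0):
    """responses_with_scores: list of (response, score) for one prompt,
    where score is 1 for a correct final answer and 0 otherwise.
    Returns (chosen, rejected) pairs, or [] when all scores tie."""
    rng = random.Random(seed)
    correct = [r for r, s in responses_with_scores if s == 1]
    incorrect = [r for r, s in responses_with_scores if s == 0]
    if not correct or not incorrect:
        return []  # skip the prompt: no usable comparison
    return [(rng.choice(correct), rng.choice(incorrect))
            for _ in range(pairs_per_prompt)]
```

This also makes the data-utilization gap between DPO and PPO concrete: prompts whose 8 responses are all correct or all incorrect contribute no pairs.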

Baseline: improving the self-correction ability. We consider several baseline methods in the self-correction literature, including training-free approaches and fine-tuning. For training-free methods, we evaluate intrinsic self-correction (Huang et al., 2023), where models rely solely on prompting to perform correction, and self-correction with external ground-truth rewards (Qu et al., 2024). The prompts used for these methods are provided in Appendix B. We also include STaR and RAFT approaches (Zelikman et al., 2022; Dong et al., 2023), which are inspired by expert iteration in reinforcement learning (Anthony et al., 2017). These methods generate numerous trajectories with the base model, filter out failed attempts, and fine-tune on successfully revised responses. Following Kumar et al. (2024), we study a variant, $\mathrm{STaR/RAFT+}$ , which augments the training set with a set of correct-to-correct trajectories. To ensure a fair comparison, the total number of training samples for STaR/RAFT $(+)$ is kept the same as in our self-rewarding IFT stage.

基线:提升自我纠正能力。我们考虑了自我纠正文献中的几种基线方法,包括无需训练的方法和微调方法。对于无需训练的方法,我们评估了内在自我纠正(Huang et al., 2023),即模型仅依赖提示进行纠正,以及使用外部真实奖励的自我纠正(Qu et al., 2024)。这些方法使用的提示见附录 B。我们还纳入了 STaR 和 RAFT 方法(Zelikman et al., 2022; Dong et al., 2023),这些方法受到强化学习中专家迭代的启发(Anthony et al., 2017)。这些方法使用基础模型生成大量轨迹,过滤掉失败的尝试,并在成功修订的响应上进行微调。根据 Kumar et al. (2024),我们研究了一个变体 $\mathrm{STaR/RAFT+}$,它通过一组正确到正确的轨迹来增强训练集。为了确保公平比较,STaR/RAFT $(+)$ 的训练样本总数与我们的自我奖励 IFT 阶段保持一致。

Baseline: improving the single-turn reasoning ability. In addition, we also consider several baselines that improve the models’ single-turn reasoning ability without self-correction. These methods include STaR/RAFT (Zelikman et al., 2022; Dong et al., 2023), iterative DPO (Xiong et al., 2023) with the correctness score to rank data, and PPO with the correctness score. In particular, we adopt iterative algorithms in the implementations of STaR/RAFT and DPO because we observe that they achieve much better performance and thus serve as more competitive baselines. We start from Qwen-2.5-Math-7B and train with only self-generated data for a fair comparison. We remark that Qwen-2.5-Math-7B has been trained on much instruction-following data in the pre-training stage, and recent open-source projects also show that it can be used as the starting checkpoint without distillation from larger LLMs or human instructions (Zeng et al., 2025; Zhang et al., 2025).

基线:提升单轮推理能力。此外,我们还考虑了几种无需自我校正即可提升模型单轮推理能力的基线方法。这些方法包括 STaR/RAFT (Zelikman et al., 2022; Dong et al., 2023)、使用正确性评分对数据进行排序的迭代 DPO (Xiong et al., 2023),以及使用正确性评分的 PPO。特别地,我们在 STaR/RAFT 和 DPO 的实现中采用了迭代算法,因为我们观察到它们作为竞争基线表现更好。我们从 Qwen-2.5-Math-7B 开始,仅使用自生成数据进行训练以确保公平比较。我们注意到,Qwen-2.5-Math-7B 在预训练阶段已经训练了大量指令跟随数据,并且最近的开源项目也表明它可以作为起始检查点,而无需从更大的大语言模型或人类指令中进行蒸馏 (Zeng et al., 2025; Zhang et al., 2025)。

4.1. Main Results

4.1. 主要结果

We report the main results in Table 3. Note that there can be an error of 0.1 due to rounding.

我们在表 3 中报告了主要结果。请注意,由于四舍五入,可能存在 0.1 的误差。

Intrinsic self-correction with prompting fails in general.

仅靠提示的内在自我纠正通常会失败。

We first observe that intrinsic self-correction without explicit reward signals typically reduces final test accuracy. Upon analyzing the outputs, we find that models tend to modify their initial responses regardless of their correctness, as they lack a mechanism to determine when to refine their answers versus when to terminate the correction process. Moreover, even when given ground-truth rewards, base models with prompting alone achieve only marginal improvement in incorrect-to-correct transitions $\Delta^{i\rightarrow c}(t_{1},t_{2})$. For example, on the MATH-500 benchmark, prompting with the gold reward only leads to $\Delta^{i\rightarrow c}(t_{1},t_{2})=1.4%$.

我们首先观察到,在没有明确奖励信号的情况下,内在的自我修正通常会降低最终的测试准确性。通过分析输出结果,我们发现模型倾向于修改其初始响应,无论其正确与否,因为它们缺乏一种机制来确定何时应该改进答案,何时应该终止修正过程。此外,即使提供了真实奖励,仅通过提示的基础模型在错误到正确的转换中也仅实现了微小的改进 $\Delta^{i\rightarrow c}(t_{1},t_{2})$。例如,在 MATH-500 基准测试中,使用黄金奖励进行提示仅导致 $\Delta^{i\rightarrow c}(t_{1},t_{2})=1.4%$。

We also notice that the STaR/RAFT method, which fine-tunes models on revised incorrect attempts, fails to significantly improve performance. It increases $\Delta^{i\rightarrow c}(t_{1},t_{2})$ (incorrect-to-correct transitions) on MATH500 from $1.4%$ to $5.0%$, but still suffers from a $\Delta^{c\rightarrow i}(t_{1},t_{2})$ (correct-to-incorrect transitions) of $6.2%$. Additionally, the STaR/RAFT+ variant, which includes correct-to-correct trajectories, becomes more conservative in modifying the initial attempt. While this reduces incorrect corrections $(\Delta^{c\rightarrow i}(t_{1},t_{2}))$, it also lowers $\Delta^{i\rightarrow c}(t_{1},t_{2})$, ultimately degrading test accuracy. These findings align with prior studies and highlight the limitations of intrinsic self-correction, even with training (Huang et al., 2023; Kumar et al., 2024).

我们还注意到,STaR/RAFT 方法通过对修正的错误尝试进行微调,未能显著提升性能。它将 MATH500 上的 $\Delta^{i\rightarrow c}(t_{1},t_{2})$(从错误到正确的转换)从 $1.4%$ 提高到 $5.0%$,但仍然存在 $6.2%$ 的 $\Delta^{c\rightarrow i}(t_{1},t_{2})$(从正确到错误的转换)。此外,包含正确到正确轨迹的 STaR/RAFT$^+$ 变体在修改初始尝试时变得更加保守。虽然这减少了错误的修正 $(\Delta^{c\rightarrow i}(t_{1},t_{2}))$,但也降低了 $\Delta^{i\rightarrow c}(t_{1},t_{2})$,最终导致测试准确率下降。这些发现与先前的研究一致,并强调了即使通过训练,内在自我修正的局限性(Huang et al., 2023; Kumar et al., 2024)。

Self-rewarding reasoning models significantly outperform existing self-correction baselines. Across all tasks, self-rewarding reasoning models consistently improve final accuracy, with higher $\Delta(t_{1},t_{2})$ than the baseline methods. We notice that fine-tuning on the synthetic trajectories with self-correction behavior yields models with much higher $\Delta^{i\rightarrow c}(t_{1},t_{2})$, suggesting that the models are better at correcting errors in their self-generated responses. Distinct from STaR/RAFT, models trained with self-rewarding IFT also exhibit significantly lower $\Delta^{c\rightarrow i}(t_{1},t_{2})$, indicating that they are better at recognizing when to stop, thanks to the additional self-rewarding signals. For instance, on MATH500, self-rewarding IFT reduces $\Delta^{c\rightarrow i}(t_{1},t_{2})$ to $0.4%$, compared to $6.2%$ for STaR/RAFT.

自我奖励推理模型显著优于现有的自我校正基线。在所有任务中,自我奖励推理模型相较于基线方法均以更高的 $\Delta(t_{1},t_{2})$ 持续提高最终准确率。我们注意到,在具有自我校正行为的合成轨迹上进行微调,得到的模型具有更高的 $\Delta^{i\rightarrow c}(t_{1},t_{2})$,这表明模型更擅长纠正自我生成响应中的错误。与 STaR/RAFT 不同,通过自我奖励 IFT 训练的模型也表现出显著更低的 $\Delta^{c\rightarrow i}(t_{1},t_{2})$,表明由于额外的自我奖励信号,它们更擅长识别何时停止。例如,在 MATH500 上,自我奖励 IFT 将 $\Delta^{c\rightarrow i}(t_{1},t_{2})$ 降至 $0.4%$,而 STaR/RAFT 为 $6.2%$。

Since STaR/RAFT $(+)$ and self-rewarding IFT use the same data synthesis approach (rejection sampling) but under different self-correction frameworks, these results highlight the advantage of our self-rewarding reasoning framework.

由于 STaR/RAFT $(+)$ 和自我奖励的 IFT 使用了相同的数据合成方法(拒绝采样),但在不同的自我校正框架下,这些结果凸显了我们自我奖励推理框架的优势。

Self-rewarding reasoning models improve the final accuracy compared to the single-turn baselines. We also compare the self-rewarding reasoning models with RL training against their single-turn counterparts. For both the PPO and DPO, the self-rewarding reasoning models achieve higher final test accuracy due to the additional correction step. For instance, the self-rewarding $\mathrm{IFT}+\mathrm{PPO}$ yields a model with $43.4%$ final accuracy on Olympiad Bench, and $38.4%$ on Minerva Math, compared to the $39.5%$ and $33.1%$ of the single-turn counterpart. Similarly, with the DPO, the self-rewarding reasoning models achieve a $78.6%$ on MATH500, a $40.1%$ on Olympiad Bench, and $34.6%$ on Minerva Math, while the single-turn DPO model admits $76.8%$ , $39.0%$ , $31.6%$ , respectively.

与单轮基线相比,自奖励推理模型提高了最终准确率。我们还将自奖励推理模型与单轮模型进行了对比。无论是 PPO 还是 DPO,自奖励推理模型由于额外的修正步骤,都实现了更高的最终测试准确率。例如,自奖励的 $\mathrm{IFT}+\mathrm{PPO}$ 在 Olympiad Bench 上的最终准确率为 $43.4%$,在 Minerva Math 上为 $38.4%$,而单轮模型分别为 $39.5%$ 和 $33.1%$。同样,使用 DPO 时,自奖励推理模型在 MATH500 上达到 $78.6%$,在 Olympiad Bench 上为 $40.1%$,在 Minerva Math 上为 $34.6%$,而单轮 DPO 模型分别为 $76.8%$、$39.0%$ 和 $31.6%$。

Table 4. The results of reward modeling accuracy (%). We report the accuracy of self-rewarding signals for the three benchmarks in two separate classes. For instance, MATH-500 C is the accuracy of recognizing a correct trajectory, while MATH-500 W is the accuracy of recognizing a wrong trajectory. The model highlighted by (*) is selected as the final model.

表 4. 奖励建模准确率结果 $(%)$ 。我们报告了三个基准在两个独立类别中的自奖励信号准确率。例如,MATH-500 C 是识别正确轨迹的准确率,而 MATH-500 W 是识别错误轨迹的准确率。标有 $(*)$ 的模型被选为最终模型。

| 方法 | MATH-500 C | MATH-500 W | OlympiadBench C | OlympiadBench W | Minerva Math C | Minerva Math W |
|---|---|---|---|---|---|---|
| Self-rewarding IFT | 93.0 | 47.7 | 89.6 | 45.9 | 91.7 | 36.1 |
| PPO Step 100 | 97.5 | 56.4 | 98.1 | 33.5 | 87.4 | 29.7 |
| PPO Step 220 (*) | 98.6 | 47.6 | 97.8 | 39.3 | 94.2 | 32.4 |
| DPO Iter 2 | 91.3 | 56.2 | 81.9 | 51.8 | 86.7 | 36.2 |
| DPO Iter 5 (+) | 92.0 | 50.6 | 88.2 | 44.5 | 92.4 | 37.4 |

However, self-rewarding models use more tokens at inference due to the additional correction step. For a fair comparison, we will also study the behavior of self-rewarding correction under scaled test-time compute budgets in Section 5.

然而,由于额外的校正步骤,自我奖励模型在推理时使用了更多的Token。为了进行公平比较,我们将在第5节中研究在扩展测试时间计算预算下自我奖励校正的行为。

Deep RL algorithm outperforms the direct alignment algorithms. We observe that PPO outperforms iterative DPO by a large margin. For example, the PPO-trained model achieves a $43.4%$ final accuracy on Olympiad Bench, compared to the $40.1%$ of the DPO method. This suggests that when absolute reward signals are available, enforcing a preference structure (Bradley-Terry model) is unnecessary and may degrade performance. Another possible reason is the limited data utilization in DPO. We notice that, with our setup, we can collect comparison pairs for only $40%$ to $60%$ prompts. For the remaining prompts, models either generate no correct trajectories or all trajectories are correct. As a result, DPO utilizes less training data than PPO, which may contribute to its lower accuracy.

深度强化学习算法优于直接对齐算法。我们观察到,PPO(Proximal Policy Optimization)大幅优于迭代DPO(Direct Preference Optimization)。例如,PPO训练的模型在Olympiad Bench上的最终准确率为43.4%,而DPO方法的准确率为40.1%。这表明,当绝对奖励信号可用时,强制偏好结构(Bradley-Terry模型)是不必要的,并且可能会降低性能。另一个可能的原因是DPO的数据利用率有限。我们注意到,在我们的设置中,我们只能为40%到60%的提示收集比较对。对于剩余的提示,模型要么没有生成正确的轨迹,要么所有轨迹都是正确的。因此,DPO使用的训练数据比PPO少,这可能是其准确率较低的原因。

Reward model (RM) accuracy. Since our self-rewarding framework unifies the generator and the reward model, we evaluate the accuracy of our models as reward models. We observe that Qwen2.5-Math-7B-base can fail to strictly follow the format, omitting the self-evaluation step or not generating the evaluation result in the pre-determined format, possibly because the model is not instruction-following fine-tuned. However, this happens in less than $10%$ of the cases, so we focus on the samples with the evaluation step and further involve human supervision to summarize the statistics. We report the results in Table 4. We observe that the self-rewarding IFT model is much better at recognizing correct trajectories, as the accuracy is generally higher than $90%$, even though we balanced the two types of trajectories in the training set. This directly leads to the small $\Delta^{c\rightarrow i}(t_{1},t_{2})$ we observe in the main table.

奖励模型 (RM) 准确率。由于我们的自奖励框架将生成器和奖励模型统一起来,我们评估了模型作为奖励模型的准确率。我们观察到,Qwen2.5-Math-7B-base 可能无法严格遵循格式,例如省略自评估步骤或未以预定格式生成评估结果,这可能是因为模型未经过指令跟随的微调。然而,这种情况发生的概率不到 $10%$,因此我们专注于包含评估步骤的样本,并进一步引入人工监督来总结统计数据。我们在表 4 中报告了结果。我们观察到,自奖励的 IFT 模型在识别正确轨迹方面表现更好,准确率通常高于 $90%$,尽管我们在训练集中平衡了两种类型的轨迹。这直接导致了我们在主表中观察到的较小的 $\Delta^{c\rightarrow i}(t_{1},t_{2})$。

We also notice that the RL training (both PPO and DPO) does not consistently improve the reward modeling accuracy. Analysis of PPO checkpoints (initial model, Step 100 and Step 220) clearly shows a trade-off between correct and incorrect classification accuracy. The PPO training explores different trade-off between them, with the goal of maximizing the final accuracy. Similar observation also applies to the DPO training. Moreover, the best model of PPO training tends to prioritize recognizing correct trajectories, at the cost of lower accuracy in identifying incorrect responses, which aligns with the lower $\Delta^{c\rightarrow i}(t_{1},t_{2})$ and also lower $\Delta^{i\rightarrow c}(t_{1},t_{2})$ . This may be because correcting an incorrect answer is generally more challenging than maintaining a correct initial response. We defer a more detailed study of the impact of data composition on reward modeling accuracy to the next section.

我们还注意到,RL 训练(包括 PPO 和 DPO)并不总是能提高奖励建模的准确性。对 PPO 检查点(初始模型、第 100 步和第 220 步)的分析清楚地显示了正确分类准确率和错误分类准确率之间的权衡。PPO 训练探索了它们之间的不同权衡,目标是最大化最终准确率。类似的观察也适用于 DPO 训练。此外,PPO 训练的最佳模型往往优先识别正确的轨迹,而代价是识别错误响应的准确率较低,这与较低的 $\Delta^{c\rightarrow i}(t_{1},t_{2})$ 和 $\Delta^{i\rightarrow c}(t_{1},t_{2})$ 一致。这可能是因为纠正一个错误的答案通常比保持正确的初始响应更具挑战性。我们将数据组成对奖励建模准确率影响的更详细研究推迟到下一节。

Learning dynamics of the RL stage. While RL training improves final accuracy, the final test accuracy is determined by both the turn-1 accuracy and $\Delta(t_{1},t_{2})$. In particular, we notice that the final accuracy gains come primarily from higher turn-1 accuracy, as the models after RL training usually admit a much higher turn-1 accuracy but a lower $\Delta^{i\rightarrow c}(t_{1},t_{2})$. To understand the learning dynamics of RL training, we plot test accuracy on three benchmarks against RL training steps in Figure 1. We observe that in the early stage of RL training, both the turn-1 accuracy and the final accuracy increase, and their gap $\Delta(t_{1},t_{2})$ also increases or remains stable. This indicates that the models learn to use their knowledge better in the first round while improving or maintaining a comparable level of correction ability. Around training step 100, however, the increase in final accuracy comes mainly from higher turn-1 accuracy, and the gap narrows, indicating less reliance on self-correction.

RL阶段的学习动态。虽然RL训练提高了最终准确率,但最终测试准确率由第一轮准确率和$\Delta(t_{1},t_{2})$共同决定。特别是,我们注意到最终准确率的提升主要来自更高的第一轮准确率,因为经过RL训练的模型通常具有更高的第一轮准确率,但$\Delta^{i\rightarrow c}(t_{1},t_{2})$较低。为了理解RL训练的学习动态,我们在图1中绘制了三个基准测试的准确率随RL训练步骤的变化。我们观察到,在RL训练的早期阶段,第一轮准确率和最终准确率都增加,它们之间的差距$\Delta(t_{1},t_{2})$也增加或保持稳定水平。这表明模型在第一轮中学会了更好地利用其知识,并提高或保持了相当的修正能力。然而,在训练步骤100左右,最终准确率的提升主要来自更高的第一轮准确率,它们之间的差距缩小,表明对自我修正的依赖减少。

We also plot the average generation length in the first figure. Initially, the length decreases because the Qwen2.5-Math-7B-base model tends to generate many Python code snippets, resulting in lengthy responses. We observe that the code usually takes many tokens and can lead to incomplete reasoning paths, so it is discouraged by the reward signal. This observation is consistent with Zeng et al. (2025). The length then increases in the next stage, indicating that reflection and self-correction abilities are encouraged by the RL training. Finally, the length decreases again, along with a higher turn-1 accuracy and a smaller $\Delta(t_{1},t_{2})$, indicating that the models learn to provide a correct answer in their first attempt and that the self-correction pattern is discouraged. This is also supported by the reward model accuracy, where the RL-trained models tend to be more conservative and evaluate the attempt as correct.

我们还在第一张图中绘制了平均生成长度。最初,长度减少是因为 Qwen2.5-Math-7B-base 模型倾向于生成许多 Python 代码,导致响应较长。我们观察到,代码通常占用大量 Token,可能导致推理路径不完整,并且这种模式受到奖励信号的抑制。这一观察结果与 Zeng 等人 (2025) 的研究一致。随后,长度在下一阶段增加,表明反思和自我纠正能力也受到强化学习 (RL) 训练的鼓励。最后,长度再次减少,同时伴随着更高的第一轮准确率和更小的 $\Delta(t_{1},t_{2})$,表明模型学会了在第一次尝试中提供正确答案,并且自我纠正模式受到抑制。这一点也得到了奖励模型准确率的支持,经过 RL 训练的模型往往更加保守,并将尝试评估为正确。


Figure 1. The learning dynamics of the PPO training, initialized from the self-rewarding IFT model. We also plot the average generation length during training in the first panel.

图 1: 从自奖励 IFT 模型初始化的 PPO 训练学习动态。我们还在第一张图中绘制了训练期间的平均生成长度。

5. More Experiment Results with a Two-turn Conversation Framework and Llama Models

5. 使用两轮对话框架和 Llama 模型的更多实验结果

In this section, we continue to investigate the self-rewarding reasoning framework.

在本节中,我们继续研究自我奖励推理框架。

5.1. Data Format: Simplified Two-turn Framework

5.1. 数据格式:简化的双轮对话框架

Previously, we combined multiple reasoning steps into a single long CoT trajectory, which aligns with common practice. However, this approach poses significant challenges for our study, as models—particularly Qwen2.5-Math-7B-base—often fail to strictly follow instructions for evaluating or revising responses based on their history. For instance, models sometimes fail to revise their responses even though the self-evaluation result is “[VERIFY] wrong”. Additionally, models can perform multiple rounds of self-evaluation and correction, but these steps are tightly coupled and cannot be easily decoupled into separate stages.

此前,我们将多个推理步骤合并为一个长的 CoT 轨迹,这与常见做法一致。然而,这种方法对我们的研究提出了重大挑战,因为模型(尤其是 Qwen2.5-Math-7B-base)往往无法严格遵循基于其历史记录评估或修订响应的指令。例如,即使自我评估结果为“[VERIFY] wrong”,模型有时也不会纠正其响应。此外,模型可以进行多轮自我评估和纠正,但这些步骤紧密耦合,无法轻易拆分为独立的阶段。

To address these issues, we adopt a simplified two-turn conversation framework, where the user provides explicit instructions between different steps. Specifically, after receiving the mathematical problem, the model will first generate the CoT reasoning $a^{1}$ and self-evaluation $y$ . Then, the user provide a deterministic instruction $o$ based on the self-evaluation $y$ :

为了解决这些问题,我们采用了一个简化的两轮对话框架,用户在不同步骤之间提供明确的指令。具体来说,在接收到数学问题后,模型会首先生成 CoT 推理 $a^{1}$ 和自我评估 $y$。然后,用户根据自我评估 $y$ 提供一个确定性指令 $o$:

  1. Since your initial response is self-evaluated as incorrect, there might be an error in the solution above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution. Put your final answer within $\boxed{}$;
  2. 由于您的初始响应自评为不正确,上述解决方案可能存在因对问题理解不足而产生的错误。如有错误,请纠正并重写解决方案。将您的最终答案放在 $\boxed{}$ 中;
  3. Since your initial response is self-evaluated as correct, confirm it and provide no further modifications. Put your final answer within $\boxed{}$.
  4. 由于您的初始响应自评为正确,请确认它且无需进一步修改。将您的最终答案放在 $\boxed{}$ 中。
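The two-turn protocol above can be sketched as follows, with a placeholder `generate` function standing in for the actual model call; both `generate` and the shortened instruction strings here are illustrative assumptions, not the paper's exact implementation:

```python
WRONG_INSTRUCTION = ("Since your initial response is self-evaluated as incorrect, "
                     "please correct the error, if any, and rewrite the solution.")
CORRECT_INSTRUCTION = ("Since your initial response is self-evaluated as correct, "
                       "confirm it and provide no further modifications.")

def two_turn_inference(problem, generate):
    """generate(messages) -> str is a stand-in for the model; the first turn
    returns CoT reasoning plus a self-evaluation containing
    '[VERIFY] correct' or '[VERIFY] wrong'."""
    messages = [{"role": "user", "content": problem}]
    first = generate(messages)
    messages.append({"role": "assistant", "content": first})
    if "[VERIFY]" not in first:
        return first  # fallback: no self-evaluation produced, keep turn-1 answer
    # The user turn is deterministic given the self-evaluation y
    wrong = "[VERIFY] wrong" in first
    instruction = WRONG_INSTRUCTION if wrong else CORRECT_INSTRUCTION
    messages.append({"role": "user", "content": instruction})
    return generate(messages)
```

Note that the second user turn is a deterministic function of the self-evaluation, which is exactly the deterministic external observation $o$ in the notation above.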

Meanwhile, when collecting the data, the self-rewarding signal is determined directly by the ground-truth oracle reward with the template designed in Zhang et al. (2024b), without additional reasoning. While this simplification may reduce reward modeling accuracy (Zhang et al., 2024b), it facilitates controlled experimentation by allowing modifications to the self-rewarding signal. Similar frameworks—without the self-rewarding component—have been explored in previous works (Huang et al., 2023; Kumar et al., 2024). See Table 6 for an illustrative example.

与此同时,在收集数据时,自我奖励信号直接由真实奖励通过 Zhang 等人 (2024b) 设计的模板确定,无需额外的推理。虽然这种简化可能会降低奖励建模的准确性 (Zhang 等人, 2024b),但它通过允许修改自我奖励信号,便于进行受控实验。类似的框架——不包含自我奖励组件——在之前的工作中已被探索 (Huang 等人, 2023; Kumar 等人, 2024)。参见表 6 中的示例。
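The oracle-based labeling described here can be sketched as follows: during data collection, the self-evaluation token is not sampled from the model but set directly by ground-truth answer checking. The `check_answer` helper is hypothetical; real pipelines use a math-expression verifier rather than plain string matching.

```python
def label_self_evaluation(response_answer, gold_answer, check_answer=None):
    """Assign the self-rewarding signal from the ground-truth oracle:
    the attempt is labeled '[VERIFY] correct' iff its final answer
    matches the gold answer under the supplied checker."""
    if check_answer is None:
        check_answer = lambda a, b: a.strip() == b.strip()  # naive default
    return ("[VERIFY] correct" if check_answer(response_answer, gold_answer)
            else "[VERIFY] wrong")
```

Because the signal is a fixed function of the oracle, this setup lets one freely swap in perturbed (e.g., deliberately misleading) reward signals for controlled experiments.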

5.2. Experiment Setup

5.2. 实验设置

Base model, task, and datasets. Qwen2.5-Math-7B-base serves as a strong and specialized base model, which is pre-trained on a large mathematical corpus. To ensure generality and a more comprehensive evaluation, we experiment with the Llama model series. Specifically, our base models include Llama-3-8B-it and Llama-3-SFT, the latter being fine-tuned on OpenMathInstruct-2-1M (Toshniwal et al., 2024a). While both models are generally weaker than Qwen2.5-Math-7B-base, Llama-3-SFT is stronger than Llama-3-8B-it.

基础模型、任务和数据集。Qwen2.5-Math-7B-base 是一个强大且专业的基础模型,它在大量数学语料库上进行了预训练。为了确保通用性和更全面的评估,我们对 Llama 模型系列进行了实验。具体来说,我们的基础模型包括 Llama-3-8B-it 和 Llama-3-SFT,后者在 OpenMathInstruct-2-1M (Toshniwal et al., 2024a) 上进行了微调。虽然这两个模型通常比 Qwen2.5-Math-7B-base 弱,但 Llama-3-SFT 比 Llama-3-8B-it 更强。

In this section, we evaluate the models’ mathematical reasoning abilities using the MATH and GSM8K benchmarks, which are well-suited to their capacities. For MATH, we use 7.5K training problems during the self-rewarding IFT stage, supplemented by 7.5K prompts from OpenMathInstruct-2 for M-DPO training, with a similar setup for GSM8K. Model selection is performed using a 1K validation set from OpenMathInstruct-2. Since we formulate the task as a multi-turn chat problem, we can directly use Axolotl’s

在本节中,我们使用 MATH 和 GSM8K 基准来评估模型的数学推理能力,这些基准非常适合它们的能力。对于 MATH,我们在自我奖励的 IFT 阶段使用了 7.5K 训练问题,并在 M-DPO 训练中补充了来自 OpenMathInstruct-2 的 7.5K 提示,GSM8K 的设置类似。模型选择使用来自 OpenMathInstruct-2 的 1K 验证集进行。由于我们将任务制定为多轮对话问题,因此可以直接使用 Axolotl 的

Table 5. Main results of different methods on the test sets of MATH (first two groups of results) and GSM8K (last two groups of results). Models are evaluated with temperature 1.0, and results are averaged over three random seeds. Additional results using a temperature of 0.7 are included in the appendix due to space constraints.

表 5: 不同方法在 MATH (前两组结果) 和 GSM8K (后两组结果) 测试集上的主要结果。模型在温度为 1.0 的情况下进行评估,结果取三次随机种子的平均值。由于篇幅限制,使用温度为 0.7 的额外结果包含在附录中。

| 基础模型 | 方法 | 第一轮 | 最终准确率 | △(t1,t2) | △i→c(t1,t2) | △c→i(t1,t2) |
|---|---|---|---|---|---|---|
| Llama-3-8B-it | 使用黄金奖励模型的提示 | 20.7 | 30.3 | 9.6 | 9.6 | 0 |
| Llama-3-8B-it | 使用外部ORM的提示 | 20.7 | 26.2 | 5.5 | 8.8 | 3.3 |
| Llama-3-8B-it | 内在自我校正 | 20.7 | 22.0 | 1.3 | 8.8 | 7.5 |
| Llama-3-8B-it | STaR/RAFT用于自我校正 | 22.3 | 26.1 | 3.7 | 11.4 | 7.7 |
| Llama-3-8B-it | STaR/RAFT+用于自我校正 | 22.7 | 27.1 | 4.4 | 11.7 | 7.3 |
| Llama-3-8B-it | 自我奖励IFT | 22.6 | 27.9 | 5.3 | 8.8 | 3.5 |
| Llama-3-8B-it | 自我奖励IFT + 黄金奖励模型 | 22.6 | 33.9 | 11.3 | 11.3 | 0 |
| Llama-3-SFT | 使用黄金奖励模型的提示 | 36.2 | 45.0 | 8.8 | 8.8 | 0 |
| Llama-3-SFT | 使用外部ORM的提示 | 36.2 | 39.2 | 3.0 | 7.5 | 4.5 |
| Llama-3-SFT | 内在自我校正 | 36.2 | 35.3 | -0.9 | 8.5 | 9.4 |
| Llama-3-SFT | STaR/RAFT用于自我校正 | 38.5 | 36.7 | -1.8 | 10.5 | 12.3 |
| Llama-3-SFT | STaR/RAFT+用于自我校正 | 37.9 | 38.8 | 0.9 | 9.4 | 8.5 |
| Llama-3-SFT | 自我奖励IFT | 37.1 | 40.3 | 3.2 | 7.2 | 4.0 |
| Llama-3-SFT | 自我奖励IFT + 黄金奖励模型 | 37.1 | 46.8 | 9.7 | 9.7 | 0 |
| Llama-3-8B-it | 使用黄金奖励模型的提示 | 64.0 | 72.1 | 8.1 | 8.1 | 0 |
| Llama-3-8B-it | 使用外部ORM的提示 | 64.0 | 68.0 | 4.0 | 5.9 | 1.9 |
| Llama-3-8B-it | 内在自我校正 | 64.0 | 48.1 | -15.9 | 7.1 | 23.0 |
| Llama-3-8B-it | STaR/RAFT用于自我校正 | 76.0 | 63.1 | -12.9 | 7.9 | 20.8 |
| Llama-3-8B-it | STaR/RAFT+用于自我校正 | 75.7 | 67.0 | -8.7 | 8.6 | 17.3 |
| Llama-3-8B-it | 自我奖励IFT | 73.2 | 78.2 | 5.0 | 9.1 | 4.1 |
| Llama-3-SFT | 使用黄金奖励模型的提示 | 74.6 | 83.1 | 8.5 | 8.5 | 0 |
| Llama-3-SFT | 使用外部ORM的提示 | 74.6 | 76.7 | 2.1 | 5.5 | 3.4 |
| Llama-3-SFT | 内在自我校正 | 74.6 | 67.4 | -7.2 | 7.6 | 14.8 |
| Llama-3-SFT | STaR/RAFT用于自我校正 | 73.8 | 67.4 | -6.4 | 9.0 | 15.4 |
| Llama-3-SFT | STaR/RAFT+用于自我校正 | 73.9 | 73.5 | -0.4 | 8.6 | 9.0 |
| Llama-3-SFT | 自我奖励IFT | 76.1 | 79.2 | 3.1 | 4.7 | 1.6 |

training code3. During inference, the model generates up to 2048 tokens per round, with VLLM 0.5.4 (Kwon et al., 2023) accelerating the process.

训练代码3。在推理过程中,模型每轮生成最多2048个Token,并使用VLLM 0.5.4 (Kwon et al., 2023) 加速这一过程。

Training Setup for Llama SFT. For the self-rewarding IFT stage, we use a learning rate of 2e-6, with a batch size of 32 for Llama models and 64 for Llama-3-SFT training. Outcome-supervised reward models (ORMs) are trained using standard SFT recipes and datasets, as described in (Xiong et al., 2024b). Full hyperparameter configurations will be available in our GitHub repository.

Llama SFT 的训练设置。在自奖励 IFT 阶段,我们使用 2e-6 的学习率,Llama 模型的批量大小为 32,Llama-3-SFT 训练的批量大小为 64。结果监督的奖励模型 (ORMs) 使用标准的 SFT 配方和数据集进行训练,如 (Xiong et al., 2024b) 中所述。完整的超参数配置将在我们的 GitHub 仓库中提供。

We observe that models occasionally fail to follow the instruction to perform self-rewarding corrections, though this occurs in less than $5%$ of cases. In such scenarios, we terminate after the first round and use its output as the final answer.

我们观察到,模型偶尔无法遵循指令进行自我奖励修正,尽管这种情况发生的概率不到 $5%$。在这种情况下,我们会在第一轮结束后终止,并将其输出作为最终答案。

5.3. Main Results with Llama Models

5.3. Llama 模型的主要结果

Experiments with Llama models align well with the Qwen results. Our experiments with Llama models show trends similar to those observed with Qwen models. Specifically, intrinsic self-correction—whether with or without STaR/RAFT-like training—fails to reliably correct errors in self-generated responses. Models tend to modify their initial responses regardless of correctness, making these methods beneficial primarily for weaker models where most first attempts are incorrect (e.g., the MATH task with Llama-3-8B-it). However, for stronger models that solve most problems correctly on the first attempt (e.g., the GSM8K task with Llama-3-SFT), intrinsic self-correction and STaR/RAFT methods significantly reduce turn-2 accuracy. In contrast, self-rewarding IFT models consistently improve final accuracy by effectively correcting errors while preserving already correct responses. This demonstrates the generality of the proposed framework.

Llama 模型的实验与 Qwen 模型的结果高度一致。我们对 Llama 模型的实验显示出与 Qwen 模型相似的趋势。具体而言,内在自我修正——无论是否使用类似 STaR/RAFT 的训练——都无法可靠地纠正自我生成响应中的错误。模型倾向于修改其初始响应,无论其正确与否,这使得这些方法主要对较弱模型有益,因为大多数首次尝试都是错误的(例如,Llama-3-8B-it 在 MATH 任务中的表现)。然而,对于在首次尝试中就能正确解决大多数问题的较强模型(例如,Llama-3-SFT 在 GSM8K 任务中的表现),内在自我修正和 STaR/RAFT 方法显著降低了第二轮准确率。相比之下,自我奖励的 IFT 模型通过有效纠正错误并保留已经正确的响应,持续提高了第一轮准确率。这证明了所提出框架的通用性。

To further evaluate the self-rewarding IFT model, we modify the self-rewarding signal to be the same as the oracle reward, eliminating the influence of reward signal quality and directly assessing the model's ability to correct incorrect responses. For example, the baseline Llama-3-SFT achieves $\Delta^{i\rightarrow c}(t_{1},t_{2})=8.8\%$, while models fine-tuned with self-rewarding IFT exhibit a higher $\Delta^{i\rightarrow c}(t_{1},t_{2})=9.7\%$, indicating improved correction capabilities.

为了进一步评估自奖励的IFT模型,我们将自奖励信号修改为与oracle奖励相同,消除了奖励信号质量的影响,直接评估模型纠正错误响应的能力。例如,基线Llama-3-SFT实现了$\Delta^{i\rightarrow c}(t_{1},t_{2})=8.8\%$,而使用自奖励IFT微调的模型表现出更高的$\Delta^{i\rightarrow c}(t_{1},t_{2})=9.7\%$,表明其纠正能力有所提升。

An example of the self-rewarding reasoning path under the two-turn conversation framework.

User: Three positive integers $a,b$, and $x$ form an O’Hara triple $(a,b,x)$ if ${\sqrt{a}}+{\sqrt{b}}=x$. For example, $(1,4,3)$ is an O’Hara triple because ${\sqrt{1}}+{\sqrt{4}}=3$. If $(36,25,x)$ is an O’Hara triple, determine the value of $x$. Please reason step by step and put your final answer within $\boxed{}$.

Assistant: Here is my solution: Begin by substituting the numbers into the definition of an O’Hara triple: $x={\sqrt{36}}+{\sqrt{25}}$. Factor out the perfect squares from each term to simplify: $x={\sqrt{9\times4}}+{\sqrt{25}}$. Use the perfect square roots of each term and combine them: $x=3{\sqrt{4}}+5$. Multiply 4 with the root: $x=12+5$. Add 12 and 5: Final Answer: The final answer is $x=17$. I hope it is correct. Is my most recent final answer correct (Yes or No)? No.

User: Since your initial response is self-evaluated as incorrect, there might be an error in the solution above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution.

Assistant: Let’s re-evaluate the solution: We are given that $(36,25,x)$ is an O’Hara triple, so we can write: ${\sqrt{a}}+{\sqrt{b}}=x$. Substituting $a=36$ and $b=25$, we get: ${\sqrt{36}}+{\sqrt{25}}=x$. Simplifying the square roots, we get: $6+5=x$. Adding 6 and 5, we get: $x=11$

一个在双轮对话框架下的自我奖励推理路径示例。

用户:三个正整数 $a,b$ 和 $x$ 如果满足 ${\sqrt{a}}+{\sqrt{b}}=x$,则构成一个 O’Hara 三元组 $(a,b,x)$。例如,$(1,4,3)$ 是一个 O’Hara 三元组,因为 ${\sqrt{1}}+{\sqrt{4}}=3$。如果 $(36,25,x)$ 是一个 O’Hara 三元组,请确定 $x$ 的值。请逐步推理,并将最终答案放在 $\boxed{}$ 中。

助手:这是我的解答:首先将数字代入 O’Hara 三元组的定义:$x={\sqrt{36}}+{\sqrt{25}}$。将每一项中的完全平方数分解出来以简化:$x={\sqrt{9\times4}}+{\sqrt{25}}$。使用每一项的完全平方根并将它们组合起来:$x=3{\sqrt{4}}+5$。将 4 与根号相乘:$x=12+5$。将 12 和 5 相加:最终答案:$x=17$。我希望这是正确的。我最近的最终答案是否正确(是或否)?否。

用户:由于你的初始回答自我评估为不正确,可能因为对问题的理解不足,解决方案中存在错误。请纠正错误(如果有),并重写解决方案。

助手:让我们重新评估解决方案:已知 $(36,25,x)$ 是一个 O’Hara 三元组,因此我们可以写成:${\sqrt{a}}+{\sqrt{b}}=x$。代入 $a=36$ 和 $b=25$,我们得到:${\sqrt{36}}+{\sqrt{25}}=x$。简化平方根,我们得到:$6+5=x$。将 6 和 5 相加,我们得到:$x=11$。

5.4. Self-rewarding Reasoning Framework Improves Efficiency in Test-time Inference Compute Scaling

5.4. 自奖励推理框架提高测试时推理计算扩展的效率

Self-correction requires generating multiple LLM responses, making it crucial to compare models under equivalent inference budgets. As noted by Huang et al. (2023), prior self-correction approaches often perform no better than self-consistency (Wang et al., 2022) when constrained to an equal number of responses. To address this, we analyze self-rewarding correction under scaled test-time compute budgets by sampling $N$ reasoning paths and using majority voting for the final output. We report the results in Figure 2, where the DPO-aligned model is described in Section 5.6. For both MATH and GSM8K tasks, with a fixed inference budget, the self-rewarding correction model consistently outperforms independent sampling methods. For example, independent sampling achieves an accuracy of $40.4\%$ on MATH with 64 samples, whereas the self-rewarding correction method (using IFT and M-DPO training) achieves an accuracy of $42.8\%$ with only 26.4 samples.

自我校正需要生成多个大语言模型(LLM)响应,因此在相同的推理预算下比较模型至关重要。正如 Huang 等人 (2023) 所指出的,在响应数量相同的情况下,先前的自我校正方法通常并不比自洽性(Wang 等人,2022)表现更好。为了解决这个问题,我们在扩展的测试时计算预算下分析了自我奖励校正,通过采样 $N$ 条推理路径并使用多数投票来确定最终输出。我们在图 2 中报告了结果,其中 DPO 对齐模型在第 5.6 节中进行了描述。对于 MATH 和 GSM8K 任务,在固定的推理预算下,自我奖励校正模型始终优于独立采样方法。例如,独立采样在 64 个样本的情况下在 MATH 上达到了 $40.4\%$ 的准确率,而自我奖励校正方法(使用 IFT 和 M-DPO 训练)仅用 26.4 个样本就达到了 $42.8\%$ 的准确率。

One key factor contributing to this improved efficiency is that, unlike intrinsic self-correction or STaR/RAFT methods, our models do not necessarily generate two samples per trajectory. Instead, they terminate early when the model is confident in the correctness of its first-round response. For instance, using Llama-3-8B-it as the base model, our approach generates an average of 1.65 samples per trajectory for MATH and 1.25 samples per trajectory for GSM8K, leading to significant computational savings.

一个关键因素在于,与内在自我校正或STaR/RAFT方法不同,我们的模型不一定为每条轨迹生成两个样本。相反,当模型对其第一轮响应的正确性有信心时,它们会提前终止。例如,以Llama-3-8B-it为基础模型,我们的方法在MATH上平均每条轨迹生成1.65个样本,在GSM8K上平均每条轨迹生成1.25个样本,从而显著节省了计算资源。
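The early-termination behavior can be sketched as a short loop that stops as soon as the model judges its own answer correct. The mock "model" below is an assumption for illustration: a generator whose self-verdict is "Yes" 75% of the time, chosen so the average sample count lands near the ~1.25 samples/trajectory reported for GSM8K (with at most two rounds, the expected count is $1 + 0.25 = 1.25$).

```python
import random


def self_rewarding_inference(generate, self_verdict, problem, max_rounds=2):
    """Run up to `max_rounds` generation rounds, terminating early when the
    model's self-rewarding verdict on its answer is positive.

    Returns (final_answer, samples_used)."""
    history = [problem]
    answer, used = None, 0
    for _ in range(max_rounds):
        answer = generate(history)
        used += 1
        if self_verdict(answer):     # model says "Yes": stop early
            break
        history.append(answer)       # otherwise request a revision
    return answer, used


# Toy run: a mock model confident 75% of the time.
random.seed(0)
total = 0
for _ in range(10_000):
    _, used = self_rewarding_inference(
        generate=lambda h: "answer",
        self_verdict=lambda a: random.random() < 0.75,
        problem="q",
    )
    total += used
avg = total / 10_000  # ≈ 1.25 samples per trajectory
```

A model that is always confident would use exactly one sample per trajectory, which is where the computational savings over fixed two-round self-correction come from.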

5.5. Ablation Study on Data Distribution

5.5. 数据分布的消融研究

Self-rewarding IFT models outperform self-correction with external ORMs. To better understand the dynamics of the self-rewarding signal, we compare self-rewarding IFT models to an external ORM trained on the same dataset, with results reported in Table 7. We observe that self-rewarding IFT models achieve superior performance in both turn-2 accuracy and $\Delta(t_{1},t_{2})$ compared to self-correction with external ORMs. This highlights the potential of unifying the generator and reward model within a single LLM.

自我奖励的IFT模型优于使用外部ORM的自我校正。为了更好地理解自我奖励信号的动态,我们将自我奖励的IFT模型与在同一数据集上训练的外部ORM进行了比较,结果如表7所示。我们观察到,与使用外部ORM的自我校正相比,自我奖励的IFT模型在turn-2准确性和$\Delta(t_{1},t_{2})$方面均表现出更优的性能。这突显了将生成器和奖励模型统一在一个大语言模型中的潜力。

However, we also observe that there is a considerable gap in reward model accuracy between the external ORM (used to evaluate the Llama-3-SFT policy) and the self-rewarding RM (used to evaluate the self-rewarding IFT policy). Specifically, the self-rewarding IFT method (self-rewarding IFT policy + self-rewarding RM) achieves an accuracy of $70.0\%$ in recognizing a correct trajectory, which is slightly higher than the $66.9\%$ of the Llama-3-SFT policy + external ORM. In contrast, for trajectories with wrong answers, the accuracy of the self-rewarding IFT model is $76.4\%$, which is much lower than the $88.4\%$ of the Llama-3-SFT policy + external ORM.

然而,我们也观察到,外部ORM(用于评估Llama-3-SFT策略)和自奖励RM(用于评估自奖励IFT策略)在奖励模型准确性方面存在显著差距。具体而言,自奖励IFT方法(自奖励IFT策略 + 自奖励RM)在识别正确轨迹时的准确率为 $70.0\%$,略高于Llama-3-SFT策略 + 外部ORM的 $66.9\%$。相比之下,对于错误答案的轨迹,自奖励IFT模型的准确率为 $76.4\%$,远低于Llama-3-SFT策略 + 外部ORM的 $88.4\%$。

To better understand this discrepancy, we use the self-rewarding RM to guide the self-correction of the Llama-3-SFT policy. Interestingly, under this setting, the reward model accuracy for Llama-3-SFT aligns more closely with that of the external ORM, suggesting the presence of an out-of-distribution (OOD) issue. Specifically, the policy shifts from Llama-3-SFT to the self-rewarding IFT policy during the self-rewarding IFT stage, while the reward model is trained on data generated by the original Llama-3-SFT policy. Furthermore, even when evaluating the same Llama-3-SFT policy with both the self-rewarding RM and the external ORM, we observe that self-rewarding training slightly degrades the reward model's capability, primarily due to the capacity limitations of the model. We believe that addressing the OOD issue and using a larger base model could further enhance the performance of self-rewarding models, which we leave for future work.

为了更好地理解这种差异,我们使用自奖励的奖励模型(RM)来指导 Llama3-SFT 策略的自我校正。有趣的是,在这种设置下,Llama-3-SFT 的奖励模型准确性与外部 ORM 更加接近,这表明存在分布外(OOD)问题。具体来说,在自奖励 IFT 阶段,策略从 Llama-3-SFT 转向自奖励 IFT 策略,而奖励模型是在原始 Llama-3-SFT 策略生成的数据上进行训练的。此外,即使在使用自奖励 RM 和外部 ORM 评估相同的 Llama-3-SFT 策略时,我们也观察到自奖励训练略微降低了奖励模型的能力,这主要是由于模型的容量限制。我们认为,解决 OOD 问题并使用更大的基础模型可以进一步提升自奖励模型的性能,这将是未来的研究方向。

Data composition in self-rewarding IFT influences the ORM accuracy. In our experiments with Qwen and Llama models, even though we use a balanced training set with equal numbers of trajectories with incorrect first answers $(\mathcal{D}_{1}^{\mathrm{IFT}})$ and correct first answers $(\mathcal{D}_{3}^{\mathrm{IFT}})$, the reward modeling accuracy across the two classes is unbalanced. Moreover, while the Qwen model is better at recognizing correct trajectories (see Table 4), the Llama model is better at recognizing wrong trajectories (see Table 7). To analyze this further, we conduct an ablation study on dataset composition, testing three variations using the Llama-3-8B-it model:

数据组成在自我奖励的IFT中影响ORM的准确性。在我们的Qwen和Llama模型实验中,尽管我们使用了平衡的训练集,其中包含相同数量的首次回答错误的轨迹 $(\mathcal{D}_{1}^{\mathrm{IFT}})$ 和首次回答正确的轨迹 $(\mathcal{D}_{3}^{\mathrm{IFT}})$,但两类奖励建模的准确性并不平衡。此外,虽然Qwen模型在识别正确轨迹方面表现更好(见表4),但Llama模型在识别错误轨迹方面表现更佳(见表7)。为了进一步分析这一点,我们对数据集组成进行了消融研究,使用Llama-3-8B-it模型测试了三种变体:


Figure 2. The majority voting results of independent samples and self-rewarding correction with Llama-3-8B-it. For MATH, we collect 1.61 samples per trajectory on average with our IFT model and 1.65 samples per trajectory on average with our M-DPO aligned model; for GSM8K, we collect 1.27 samples per trajectory for the IFT model and 1.25 samples for the M-DPO aligned model.

图 2: 独立样本的多数投票结果以及使用 Llama-3-8B-it 进行自我奖励校正的结果。对于 MATH 数据集,我们的 IFT 模型平均每条轨迹收集 1.61 个样本,M-DPO 对齐模型平均每条轨迹收集 1.65 个样本;对于 GSM8K 数据集,IFT 模型平均每条轨迹收集 1.27 个样本,M-DPO 对齐模型平均每条轨迹收集 1.25 个样本。

Table 7. Comparison between self-rewarding IFT models and Llama-3-SFT model with external ORM on MATH benchmark. We report the accuracy of self-rewarding signals for the three benchmarks in two separate classes. For instance, MATH C is the accuracy of recognizing a correct trajectory, while MATH W is the accuracy of recognizing a wrong trajectory.

表 7. 自奖励 IFT 模型与 Llama-3-SFT 模型在 MATH 基准上的外部 ORM 对比。我们报告了三个基准在两个独立类别中的自奖励信号准确率。例如,MATH C 是识别正确轨迹的准确率,而 MATH W 是识别错误轨迹的准确率。

| 方法 | Turn 1 | Final Accuracy | $\Delta(t_1,t_2)$ | $\Delta^{i\rightarrow c}(t_1,t_2)$ | $\Delta^{c\rightarrow i}(t_1,t_2)$ | MATH C | MATH W |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-SFT + Gold RM | 36.2 | 45.0 | 8.8 | 8.8 | 0 | 100 | 100 |
| Llama-3-SFT + External ORM | 36.2 | 39.2 | 3.0 | 7.5 | 4.5 | 66.9 | 88.4 |
| Llama-3-SFT + Self-rewarding RM | 36.2 | 38.9 | 2.7 | 7.4 | 4.7 | 67.0 | 86.7 |
| Self-rewarding IFT + Self-rewarding RM | 37.1 | 40.3 | 3.2 | 7.2 | 4.0 | 70.0 | 76.4 |
| Self-rewarding IFT + Gold RM | 37.1 | 46.8 | 9.7 | 9.7 | 0 | 100 | 100 |
1. Balanced training set: equal numbers of trajectories with incorrect first answers $(\mathcal{D}_{1}^{\mathrm{IFT}})$ and correct first answers $(\mathcal{D}_{3}^{\mathrm{IFT}})$;
2. More incorrect trajectories: $|\mathcal{D}_{1}^{\mathrm{IFT}}|>|\mathcal{D}_{3}^{\mathrm{IFT}}|$;
3. More correct trajectories: $|\mathcal{D}_{3}^{\mathrm{IFT}}|>|\mathcal{D}_{1}^{\mathrm{IFT}}|$.

1. 平衡训练集:包含相等数量的首次回答错误的轨迹 $(\mathcal{D}_{1}^{\mathrm{IFT}})$ 和首次回答正确的轨迹 $(\mathcal{D}_{3}^{\mathrm{IFT}})$;
2. 更多错误轨迹:$|\mathcal{D}_{1}^{\mathrm{IFT}}|>|\mathcal{D}_{3}^{\mathrm{IFT}}|$;
3. 更多正确轨迹:$|\mathcal{D}_{3}^{\mathrm{IFT}}|>|\mathcal{D}_{1}^{\mathrm{IFT}}|$。

We also investigate the impact of the additional correct-to-correct trajectories. Our findings are reported in Table 8.

我们还研究了额外正确到正确轨迹的影响。我们的研究结果如表 8 所示。

The results indicate that, for a fixed number of incorrect trajectories, increasing the proportion of correct trajectories (e.g., transitioning from a dataset with more incorrect trajectories to a balanced dataset) biases the ORM toward predicting answers as correct. This results in higher accuracy for recognizing correct trajectories but lower accuracy for identifying incorrect ones. Specifically, moving from the training set with more incorrect trajectories to the balanced training set changes the accuracy from $(63.6\%, 82.4\%)$ to $(72.1\%, 75.3\%)$. This highlights a trade-off between these class-dependent accuracies, as changes in the reward model's accuracy directly influence the transitions between correct and incorrect answers.

结果表明,在固定错误轨迹数量的情况下,增加正确轨迹的比例(例如,从包含更多错误轨迹的数据集过渡到平衡数据集)会使ORM偏向于预测答案为正确。这导致识别正确轨迹的准确率提高,但识别错误轨迹的准确率降低。具体而言,从包含更多错误轨迹的训练集过渡到平衡训练集,准确率从 $(63.6\%, 82.4\%)$ 变为 $(72.1\%, 75.3\%)$。这突出了这些类别相关准确率之间的权衡,因为奖励模型准确率的变化直接影响正确与错误答案之间的转换。

Comparing the results with and without the additional correct-to-correct trajectories, we observe that the additional correct-to-correct trajectories mainly contribute to a lower $p^{c\rightarrow i}(t_{1},t_{2})$ , which is the probability of modifying a correct answer to incorrect when facing a misleading reward. This indicates that the models become more conservative when modifying initial responses. This behavior is reasonable, as correcting an incorrect answer is generally more challenging than maintaining a correct initial response.

通过比较有无额外正确到正确轨迹的结果,我们观察到额外的正确到正确轨迹主要有助于降低 $p^{c\rightarrow i}(t_{1},t_{2})$ ,即在面对误导性奖励时将正确答案修改为错误的概率。这表明模型在修改初始响应时变得更加保守。这种行为是合理的,因为纠正错误答案通常比保持正确的初始响应更具挑战性。
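The turn-transition quantities used throughout these tables can be computed from per-problem correctness labels. The sketch below follows my reading of the notation: $\Delta(t_1,t_2)$ is final minus turn-1 accuracy, and $\Delta^{i\rightarrow c}$ / $\Delta^{c\rightarrow i}$ are the fractions of all problems flipped wrong-to-right and right-to-wrong, so that $\Delta = \Delta^{i\rightarrow c} - \Delta^{c\rightarrow i}$ (consistent with, e.g., the $8.8 = 8.8 - 0$ and $3.2 = 7.2 - 4.0$ rows in Table 7).

```python
def correction_metrics(turn1_correct, turn2_correct):
    """Compute turn-transition metrics from two equal-length lists of
    per-problem correctness booleans (turn-1 and final answers).

    Returned keys:
      acc1, acc2     : turn-1 / final accuracy
      delta          : Delta(t1, t2) = acc2 - acc1
      delta_i_to_c   : fraction of ALL problems flipped wrong -> right
      delta_c_to_i   : fraction of ALL problems flipped right -> wrong
    """
    n = len(turn1_correct)
    i2c = sum(1 for a, b in zip(turn1_correct, turn2_correct) if not a and b)
    c2i = sum(1 for a, b in zip(turn1_correct, turn2_correct) if a and not b)
    acc1 = sum(turn1_correct) / n
    acc2 = sum(turn2_correct) / n
    return {
        "acc1": acc1,
        "acc2": acc2,
        "delta": acc2 - acc1,
        "delta_i_to_c": i2c / n,
        "delta_c_to_i": c2i / n,
    }


# Tiny example: one wrong->right flip and one right->wrong flip cancel out.
m = correction_metrics([True, False, False, True], [True, True, False, False])
```

By construction, `delta` always equals `delta_i_to_c - delta_c_to_i`, which is why a lower $\Delta^{c\rightarrow i}$ at a fixed $\Delta^{i\rightarrow c}$ directly improves the final accuracy.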

The impact of distillation. Although we focus on on-policy training, meaning that we train the models on self-generated data only, we also try using Llama-3.1-70B-it to generate $a^{2}$ in the self-rewarding IFT of Llama-3-8B-it, with results shown in Table 9. We observe that teacher model data significantly boosts turn-1 accuracy, leading to higher turn-2 accuracy. However, a stronger $a^{2}$ does not lead to a higher $\Delta^{i\rightarrow c}(t_{1},t_{2})$, meaning that the models' abilities to self-correct are similar. Off-policy training also

蒸馏的影响。尽管我们专注于在线策略训练,即仅在自生成数据上训练模型,我们也尝试使用 Llama-3.1-70B-it 在 Llama-3-8B-it 的自奖励 IFT 中生成 $a^{2}$,结果如表 9 所示。我们观察到教师模型数据可以显著提高第一轮准确性,从而带来更高的第二轮准确性。然而,更强的 $a^{2}$ 并不会导致更高的 $\Delta^{i\rightarrow c}(t_{1},t_{2})$,这意味着模型的自我纠正能力相似。离线策略训练也

Table 8. Ablation study on the training sets of self-rewarding IFT with the base model Llama-3-8B-SFT. For the balanced training set, we use 125K trajectories with incorrect first answers $(\mathcal{D}_{1}^{\mathrm{IFT}})$ and 125K with correct first answers $(\mathcal{D}_{3}^{\mathrm{IFT}})$. For sets with more incorrect trajectories, $|\mathcal{D}_{1}^{\mathrm{IFT}}|=125K$ and $|\mathcal{D}_{3}^{\mathrm{IFT}}|=60K$. Finally, for the training set with more correct trajectories, we have $|\mathcal{D}_{1}^{\mathrm{IFT}}|=125K$ and $|\mathcal{D}_{3}^{\mathrm{IFT}}|=180K$. Models trained with more incorrect trajectories (marked by $\star$) are used as the final model, and the dataset is also used to train the external ORM. “+c2c 60K” indicates an additional 60K correct-to-correct trajectories, and “+Gold RM” replaces self-rewarding signals with ground-truth labels during inference.

表 8. 基于基础模型 Llama-3-8B-SFT 的自奖励 IFT 训练集的消融研究。对于平衡的训练集,我们使用了 125K 条首次回答错误的轨迹 $(\mathcal{D}_{1}^{\mathrm{IFT}})$ 和 125K 条首次回答正确的轨迹 $(\mathcal{D}_{3}^{\mathrm{IFT}})$。对于包含更多错误轨迹的集合,$|\mathcal{D}_{1}^{\mathrm{IFT}}|=125K$ 且 $|\mathcal{D}_{3}^{\mathrm{IFT}}|=60K$。最后,对于包含更多正确轨迹的训练集,我们有 $|\mathcal{D}_{1}^{\mathrm{IFT}}|=125K$ 和 $|\mathcal{D}_{3}^{\mathrm{IFT}}|=180K$。使用更多错误轨迹训练的模型(标记为 $\star$)被用作最终模型,并且该数据集也用于训练外部 ORM。“+c2c 60K”表示额外的 60K 条正确到正确的轨迹,“+Gold RM”在推理过程中用真实标签替换自奖励信号。

| 方法 | 第1轮 | 最终准确率 | $\Delta(t_1,t_2)$ | $\Delta^{i\rightarrow c}(t_1,t_2)$ | $\Delta^{c\rightarrow i}(t_1,t_2)$ | $p^{c\rightarrow i}(t_1,t_2)$ | RM准确率 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-SFT + Gold RM | 36.2 | 45.0 | 8.8 | 8.8 | 0 | - | (100, 100) |
| Llama-3-SFT + External ORM (*) | 36.2 | 39.2 | 3.0 | 7.5 | 4.5 | 37.6 | (66.9, 88.4) |
| Llama-3-SFT + Self-rewarding RM (+) | 36.2 | 38.9 | 2.7 | 7.4 | 4.7 | 39.4 | (67.0, 86.7) |
| Self-rewarding IFT + Balanced (*) | 37.4 | 40.1 | 2.7 | 7.4 | 4.7 | 45.0 | (72.1, 75.3) |
| + c2c 60K | 37.1 | 40.3 | 3.2 | 7.2 | 4.0 | 36.1 | (70.0, 76.4) |
| + Gold RM | 37.1 | 46.8 | 9.7 | 9.7 | 0 | - | (100, 100) |
| Self-rewarding IFT + More Incorrect | 38.1 | 40.3 | 2.2 | 8.0 | 5.8 | 41.7 | (63.6, 82.4) |
| + c2c 60K | 37.7 | 40.8 | 3.1 | 8.0 | 4.7 | 33.0 | (61.5, 84.3) |
| + Gold RM | 37.7 | 46.9 | 9.2 | 9.2 | 0 | - | (100, 100) |
| Self-rewarding IFT + More Correct | 37.8 | 40.5 | 2.7 | 7.4 | 4.7 | 45.2 | (72.6, 75.1) |
| + c2c 60K | 37.9 | 40.8 | 2.9 | 6.6 | 3.7 | 35.2 | (72.1, 76.2) |
| + Gold RM | 37.9 | 47.5 | 9.6 | 9.6 | 0 | - | (100, 100) |

causes a substantial distribution shift in $a^{1}$, reducing reward model accuracy ($36.7\%$ vs. $63.6\%$). Thus, distillation is better suited for improving turn-1 accuracy when a teacher model is available, while self-generated data is more effective for building self-rewarding reasoning models.

导致 $a^{1}$ 发生显著的分布偏移,降低了奖励模型的准确度($36.7\%$ 对比 $63.6\%$)。因此,在有教师模型可用时,蒸馏更适合提高第一轮的准确度,而自生成数据对于构建自奖励推理模型更为有效。

5.6. Additional Rule Designs in RL Training

5.6. RL 训练中的其他规则设计

We also conduct preliminary experiments on reward assignment strategies for PPO training and data ranking strategies for DPO training to analyze their impact on model performance.

我们还对PPO训练的奖励分配策略和DPO训练的数据排序策略进行了初步实验,以分析它们对模型性能的影响。

The impact of ranking strategies in multi-turn DPO training4. For a fixed $(x,a^{1})$ , we experiment with the following ranking strategies:

多轮 DPO 训练中排序策略的影响4。对于固定的 $(x,a^{1})$,我们实验了以下排序策略:

We group the last two types of data into $\mathcal{D}_{3}^{\mathrm{M-DPO}}$ because they involve the reward signal comparison. We exclude comparisons like (wrong $a^{1}$, $y=\mathrm{No}$, wrong $a^{2}$) and (wrong $a^{1}$, $y=\mathrm{Yes}$), as their comparison can be ambiguous. For simplicity, we only train the model for one iteration. We report the results in Table 9.

我们将最后两种类型的数据归为 $\mathcal{D}_{3}^{\mathrm{M-DPO}}$,因为它们涉及奖励信号的比较。我们排除了像 (错误的 $a^{1}$,$y=\mathrm{No}$,错误的 $a^{2}$) 和 (错误的 $a^{1}$,$y=\mathrm{Yes}$) 这样的比较,因为它们的比较可能具有歧义。为了简化,我们只对模型进行一次迭代训练。结果如表 9 所示。
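The pair construction can be sketched in code. The exact definitions of $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ through $\mathcal{D}_{3}^{\mathrm{M-DPO}}$ are not restated in this excerpt, so the ranking rule below is an illustrative assumption: for trajectories sharing the same $(x, a^{1})$ prefix, prefer the one ending in a correct final answer, and drop the ambiguous combinations the text excludes (a wrong $a^{1}$ judged $y=\mathrm{Yes}$, or two continuations that are both wrong). `build_mdpo_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def build_mdpo_pairs(trajectories):
    """Illustrative M-DPO (chosen, rejected) pair construction.

    Each trajectory is a dict with:
      a1_correct    : bool          - correctness of the first answer
      verdict       : 'Yes' / 'No'  - the self-rewarding signal on a1
      final_correct : bool          - correctness of the final answer
    All trajectories are assumed to share the same (x, a1) prefix.
    """
    pairs = []
    for t_a, t_b in combinations(trajectories, 2):
        # Skip the ambiguous cases named in the paper: wrong a1 with y=Yes.
        if any(not t["a1_correct"] and t["verdict"] == "Yes" for t in (t_a, t_b)):
            continue
        # Two wrong continuations give no usable preference signal.
        if not t_a["final_correct"] and not t_b["final_correct"]:
            continue
        if t_a["final_correct"] and not t_b["final_correct"]:
            pairs.append((t_a, t_b))   # (chosen, rejected)
        elif t_b["final_correct"] and not t_a["final_correct"]:
            pairs.append((t_b, t_a))
    return pairs


example = [
    {"a1_correct": False, "verdict": "No", "final_correct": True},
    {"a1_correct": False, "verdict": "No", "final_correct": False},
    {"a1_correct": False, "verdict": "Yes", "final_correct": True},
]
pairs = build_mdpo_pairs(example)  # only the first two trajectories pair up
```

Two trajectories that are both correct produce no pair here either, since nothing in this excerpt specifies how to rank them.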

Across various base models and tasks, we observe that M-DPO training with $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ effectively reduces $p^{c\rightarrow i}(t_{1},t_{2})$, making models more conservative when incorrectly classifying a correct initial answer as incorrect. Consequently, models fine-tuned with M-DPO exhibit significantly lower $\Delta^{c\rightarrow i}(t_{1},t_{2})$; e.g., on MATH, it drops from $3.5\%$ to $2.8\%$, and on GSM8K, from $4.1\%$ to $2.5\%$. Accordingly, the M-DPO method further enhances self-rewarding reasoning language models, improving the turn-2 accuracy and $\Delta(t_{1},t_{2})$. Interestingly, even though the generation of $a^{1}$ is not explicitly involved during training, the correction ability in turn 2 naturally transfers, leading to higher turn-1 accuracy.

在不同的基础模型和任务中,我们观察到使用 $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ 进行 M-DPO 训练有效地降低了 $p^{c\rightarrow i}(t_{1},t_{2})$,使得模型在错误地将正确的初始答案分类为错误时更加保守。因此,使用 M-DPO 微调的模型表现出显著更低的 $\Delta^{c\rightarrow i}(t_{1},t_{2})$,例如在 MATH 上,它从 $3.5\%$ 下降到 $2.8\%$,在 GSM8K 上,从 $4.1\%$ 下降到 $2.5\%$。因此,M-DPO 方法进一步增强了自奖励推理大语言模型,提高了第二轮准确率和 $\Delta(t_{1},t_{2})$。有趣的是,尽管在训练过程中没有明确涉及 $a^{1}$ 的生成,但第二轮中的纠正能力自然转移,导致第一轮准确率更高。

However, we also notice that beyond a certain range, an excessively low $p^{c\rightarrow i}(t_{1},t_{2})$ can make models overly conservative, ultimately reducing the correction rate $\Delta^{i\rightarrow c}(t_{1},t_{2})$. This is validated in experiments using only $\mathcal{D}_{2}^{\mathrm{M-DPO}}$, where $\Delta^{i\rightarrow c}(t_{1},t_{2})$ decreases from $8.8\%$ to $5.6\%$ on MATH. Conversely, training with $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ encourages models to modify their initial responses, reflected in a higher $p^{c\rightarrow i}(t_{1},t_{2})$, and slightly enhances the ability of correction. We notice that while on GSM8K the model trained with $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ exhibits a lower $\Delta^{i\rightarrow c}(t_{1},t_{2})$, this mainly results from the lower RM accuracy and the higher turn-1 accuracy. If we consider the ratio of corrected trajectories, self-rewarding IFT achieves $45.9\%$, while the M-DPO-aligned model slightly outperforms it at $46.4\%$. Moreover, the combination of $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ and $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ often yields near-optimal results, striking a balance by making models more aware of when to change their initial responses.

然而,我们也注意到,当超过某个区域时,过低的 $p^{c\rightarrow i}(t_{1},t_{2})$ 会使模型过于保守,最终降低修正率 $\Delta^{i\rightarrow c}(t_{1},t_{2})$。这一点在仅使用 $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ 的实验中得到验证,其中 $\Delta^{i\rightarrow c}(t_{1},t_{2})$ 在 MATH 上从 $8.8\%$ 下降到 $5.6\%$。相反,使用 $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ 进行训练鼓励模型修改其初始响应,反映在较高的 $p^{c\rightarrow i}(t_{1},t_{2})$ 上,并略微增强了修正能力。我们注意到,虽然在 GSM8K 上,使用 $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ 训练的模型表现出较低的 $\Delta^{i\rightarrow c}(t_{1},t_{2})$,但这主要是由于较低的 RM 准确率和较高的第一轮准确率所致。如果我们考虑修正轨迹的比例,自奖励 IFT 达到 $45.9\%$,而 M-DPO 对齐的模型略优于它,达到 $46.4\%$。此外,$\mathcal{D}_{1}^{\mathrm{M-DPO}}$ 和 $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ 的组合通常会产生接近最优的结果,通过使模型更清楚何时改变其初始响应来达到平衡。

Table 9. Ablation study on the impact of training sets of M-DPO and distillation, with Llama-3-8B-it as the base model.

表 9. 基于 Llama-3-8B-it 模型的 M-DPO 和蒸馏训练集影响的消融研究。

| 方法 | Turn 1 | 最终准确率 | $\Delta(t_1,t_2)$ | $\Delta^{i\rightarrow c}(t_1,t_2)$ | $\Delta^{c\rightarrow i}(t_1,t_2)$ | $p^{c\rightarrow i}(t_1,t_2)$ | 准确率 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Self-rewarding IFT (MATH) | 22.6 | 27.9 | 5.3 | 8.8 | 3.5 | 43.9 | (63.6, 76.1) |
| + M-DPO with D1 | 24.9 | 29.1 | 4.2 | 9.3 | 5.1 | 50.3 | (59.2, 77.1) |
| + M-DPO with D2 | 24.2 | 27.8 | 3.6 | 5.5 | 1.9 | 31.3 | (74.7, 65.8) |
| + M-DPO with D1,2 | 23.9 | 28.6 | 4.7 | 6.5 | 1.8 | 27.5 | (73.4, 68.6) |
| + M-DPO with D1,2,3 (well-tuned) | 23.3 | 29.9 | 6.6 | 9.4 | 2.8 | 34.2 | (61.6, 81.4) |
| Self-rewarding IFT + Distillation (MATH) | 28.3 | 30.5 | 2.2 | 8.0 | 5.8 | 37.5 | (36.7, 76.7) |
| Self-rewarding IFT (GSM8K) | 73.2 | 78.2 | 5.0 | 9.1 | 4.1 | 26.3 | (79.3, 74.0) |
| + M-DPO with D1 | 75.3 | 79.1 | 3.8 | 8.1 | 4.3 | 31.1 | (82.1, 70.1) |
| + M-DPO with D2 | 74.6 | 79.9 | 5.3 | 7.1 | 1.8 | 12.5 | (80.3, 70.4) |
| + M-DPO with D1,2 | 74.6 | 81.0 | 6.4 | 8.9 | 2.5 | 18.8 | (82.3, 69.6) |
| + M-DPO with D1,2,3 | 74.9 | 80.7 | 5.8 | 8.6 | 2.8 | 15.8 | (76.7, 67.1) |

DPO training cannot consistently improve the reward model accuracy. During the experiments, we observe that M-DPO training also shifts the generation distribution of $a^{1}$, impacting reward model accuracy unpredictably. For example, on MATH, using only $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ reduces recognition of correct answers, while combining $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ with $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ improves recognition of correct answers but decreases accuracy for the other class by $10\%$.

DPO 训练无法持续提高奖励模型的准确性。在实验过程中,我们观察到 M-DPO 训练也会改变 $a^{1}$ 的生成分布,从而不可预测地影响奖励模型的准确性。例如,在 MATH 数据集上,仅使用 $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ 会降低对正确答案的识别率,而将 $\mathcal{D}_{1}^{\mathrm{M-DPO}}$ 与 $\mathcal{D}_{2}^{\mathrm{M-DPO}}$ 结合使用虽然提高了对正确答案的识别率,但另一类别的准确性却降低了 $10\%$。

Even though we include the comparison pairs in $\mathcal{D}_{3}^{\mathrm{M-DPO}}$ and tune the data combination in this dataset with our best efforts, we still suffer from a performance drop in correct-answer recognition. Moreover, with a simply balanced $\mathcal{D}_{3}^{\mathrm{M-DPO}}$, such as on GSM8K, the reward model accuracy in both classes gets worse. In either case, the reward model accuracy is not consistently improved. We suspect that this is because of the mismatch between the DPO implicit reward $(\log{\frac{\pi}{\pi_{\mathrm{ref}}}})$ and the sampling probability $\log\pi$. Exploring algorithms like SimPO (Meng et al., 2024), which directly optimize $\log\pi$, is a promising direction for future work. Similarly, for PPO training, one may also need to adopt a multi-turn design, where KL regularization is imposed only on partial responses so that the model can adjust the self-rewarding stage more easily.

尽管我们尽力调整数据集中的组合,将比较对包含在 $\mathcal{D}_{3}^{\mathrm{M-DPO}}$ 中,但在正确答案识别方面仍然存在性能下降的问题。此外,使用简单的平衡 $\mathcal{D}_{3}^{\mathrm{M-DPO}}$,例如在 GSM8K 中,奖励模型在两类别中的准确性变得更差。无论是哪种情况,奖励模型的准确性都没有得到一致的提升。我们怀疑这是由于 DPO 隐式奖励 $(\log{\frac{\pi}{\pi_{\mathrm{ref}}}})$ 与采样概率 $\log\pi$ 不匹配所致。探索像 SimPO (Meng et al., 2024) 这样直接优化 $\log\pi$ 的算法,是未来工作的一个很有前景的方向。同样,对于 PPO 训练,可能也需要采用多轮设计,而我们仅在部分响应上施加 KL 正则化,并允许模型更容易地调整自我奖励阶段。
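The suspected mismatch is visible on toy numbers: the DPO implicit reward $\log(\pi/\pi_{\mathrm{ref}})$ and the sampling log-probability $\log\pi$ can rank two responses differently, so increasing the implicit reward of a response need not make it the most likely response under $\pi$. A minimal numeric sketch (the probabilities are made up for illustration):

```python
import math


def implicit_reward(logp_pi, logp_ref):
    """DPO implicit reward (up to the beta scale): log pi - log pi_ref."""
    return logp_pi - logp_ref


# Response A is more likely under the policy, but B has the larger implicit
# reward because the reference model already strongly favours A.
logp_pi = {"A": math.log(0.6), "B": math.log(0.3)}
logp_ref = {"A": math.log(0.7), "B": math.log(0.1)}

rank_by_pi = max(logp_pi, key=logp_pi.get)
rank_by_implicit = max(
    logp_pi, key=lambda r: implicit_reward(logp_pi[r], logp_ref[r])
)
```

Here `rank_by_pi` is "A" while `rank_by_implicit` is "B": optimizing the implicit reward pushes probability mass by a different ordering than the one the sampler actually uses, which is the motivation the text gives for SimPO-style objectives that optimize $\log\pi$ directly.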

Additional rule designs in PPO training. We also investigate different reward signal designs in PPO training, aiming to enhance self-correction, particularly in the later training stages. Specifically, we experiment with the following two approaches:

PPO训练中的额外规则设计。我们还研究了PPO训练中不同的奖励信号设计,旨在增强自我纠正能力,特别是在训练的后期阶段。具体来说,我们尝试了以下两种方法:

We observe that the models easily hack the first reward design, deliberately predicting a wrong answer in the first attempt and then correcting it in the second round. For instance, after 150 steps of PPO training, test accuracy on MATH500 is $18.6\%$ on first attempts but $77.6\%$ on final answers, demonstrating clear exploitation of the reward shortcut. Meanwhile, the models also struggle with the two-staged learning in plan 2, and we do not observe test accuracy improvement.

我们观察到模型很容易破解第一种奖励设计,即它们故意在第一次尝试中预测错误答案,然后在第二轮中纠正。例如,经过150步的PPO训练后,MATH500的测试准确率在第一次尝试中为$18.6\%$,但在最终答案中为$77.6\%$,这表明模型明显利用了奖励捷径。与此同时,模型在方案2中的两阶段学习中也表现不佳,我们没有观察到测试准确率的提升。
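The shortcut is easy to reproduce on paper: if the rule grants extra credit for fixing a wrong first attempt, deliberately answering wrong first dominates answering right immediately. A toy sketch of such a hackable rule (the reward values are assumptions for illustration, not the paper's exact rule):

```python
def naive_correction_reward(a1_correct, a2_correct, correction_bonus=1.0):
    """Hackable rule: final correctness plus a bonus for fixing a wrong a1."""
    reward = 1.0 if a2_correct else 0.0
    if not a1_correct and a2_correct:
        reward += correction_bonus  # this term creates the shortcut
    return reward


# Honest policy: right on the first try, keeps its answer.
honest = naive_correction_reward(a1_correct=True, a2_correct=True)
# Hacking policy: deliberately wrong first, then "corrects" itself.
hacker = naive_correction_reward(a1_correct=False, a2_correct=True)
```

Since `hacker > honest` whenever the bonus is positive, a policy that can solve the problem at all is paid more for sandbagging its first attempt, matching the low turn-1 / high final accuracy pattern observed after 150 PPO steps.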

While naive reward modifications fail, we expect that more sophisticated multi-turn RL strategies such as SCoRe (Kumar et al., 2024) could further improve RL training. However, implementing multi-turn deep RL training and decoupling self-rewarding reasoning steps remains challenging in open-source frameworks, which we leave for future exploration.

虽然简单的奖励修改方法失败了,但我们预计更复杂的多轮强化学习策略(如SCoRe (Kumar et al., 2024))可以进一步改进强化学习训练。然而,在开源框架中实现多轮深度强化学习训练并解耦自我奖励的推理步骤仍然具有挑战性,我们将此留待未来探索。

6. Conclusion and Future Research Direction

6. 结论与未来研究方向

In this work, we introduce the self-rewarding reasoning framework for LLMs, demonstrating its effectiveness in enhancing self-correction capabilities and computational efficiency. By integrating self-rewarding IFT and reinforcement learning, our approach enables LLMs to detect errors in their reasoning paths and refine their responses based on historical attempts and self-rewarding signals. Experimental results show that this framework significantly outperforms intrinsic self-correction, highlighting its potential as a robust and efficient solution for enhancing LLM reasoning.

在本工作中,我们为大语言模型引入了自奖励推理框架,展示了其在增强自我纠正能力和计算效率方面的有效性。通过整合自奖励的IFT(Instruction Fine-Tuning)和强化学习,我们的方法使大语言模型能够在推理路径中检测错误,并根据历史尝试和自奖励信号优化其响应。实验结果表明,该框架显著优于内在的自我纠正,突显了其作为增强大语言模型推理能力的强大且高效解决方案的潜力。

There are still many interesting directions to improve the performance of the self-rewarding reasoning framework. First, current models show lower reward model accuracy compared to external ORMs, likely due to distribution shifts and model capacity limitations. Techniques like model merging (Ramé et al., 2024; Lin et al., 2023) may address these issues. While we can boost the self-rewarding IFT stage by both the PPO and iterative DPO algorithms with the correctness score, our study indicates that in the late stage of RL training, the self-correction ability is not well enhanced. While our preliminary attempts on modifying the rule-based reward signals fail, we expect that incorporating multi-turn RL methods (Kumar et al., 2024; Shani et al.,

提升自奖励推理框架性能仍有许多有趣的方向。首先,当前模型在奖励模型准确性上相比外部 ORM 较低,可能是由于分布偏移和模型容量限制。模型合并技术(Ramé 等,2024;Lin 等,2023)可能有助于解决这些问题。虽然我们可以通过 PPO 和迭代 DPO 算法结合正确性评分来增强自奖励的 IFT 阶段,但研究表明,在强化学习训练的后期阶段,自校正能力并未得到很好的提升。尽管我们在修改基于规则的奖励信号方面的初步尝试失败了,但我们期望通过引入多轮强化学习方法(Kumar 等,2024;Shani 等)来改进。

with the adjusted rule designs could further enhance model performance. Finally, extending beyond turn-based self-rewarding correction to step-wise correction (similar to outcome-supervised and process-supervised rewards) may offer more advantages and lead to a more scalable and dynamic approach to reasoning.

通过调整规则设计可以进一步提升模型性能。最后,将基于回合的自我奖励校正扩展到逐步校正(类似于结果监督和过程监督的奖励)可能会带来更多优势,并形成更具扩展性和动态性的推理方法。

Impact Statement

影响声明

This paper presents work whose goal is to advance the field of mathematical reasoning for large language models. We proposed a self-rewarding framework to integrate the reward model and generator into a single LLM. The proposed framework can help us build stronger LLM models in the face of complex decision-making problems, thus making LLMs more helpful and contributing to the welfare of society.

本文介绍了旨在推动大语言模型在数学推理领域发展的研究工作。我们提出了一种自我奖励框架,将奖励模型和生成器集成到单一的大语言模型中。该框架有助于我们在面对复杂决策问题时构建更强大的大语言模型,从而使大语言模型更加有用,并为社会福祉做出贡献。

References

参考文献

Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.

Chen, W., Ma, X., Wang, X., and Cohen, W. W. 思维提示程序:将计算与推理分离以应对数值推理任务。arXiv 预印本 arXiv:2211.12588, 2022.

Yuan, Z., Yuan, H., Li, C., Dong, G., Tan, C., and Zhou, C. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023a.

Yuan, Z., Yuan, H., Li, C., Dong, G., Tan, C., and Zhou, C. 大语言模型学习数学推理的缩放关系。arXiv 预印本 arXiv:2308.01825, 2023a.

Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023b.

Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. RRHF:无需费力地对齐语言模型与人类反馈的响应排序。arXiv预印本 arXiv:2304.05302, 2023b.

Zelikman, E., Wu, Y., Mu, J., and Goodman, N. Star: Boot- strapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.

Zelikman, E., Wu, Y., Mu, J., and Goodman, N. Star: 通过推理引导推理。神经信息处理系统进展,35:15476–15488, 2022.

Zeng, W., Huang, Y., Liu, W., He, K., Liu, Q., Ma, Z., and He, J. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion Blog.

Zeng, W., Huang, Y., Liu, W., He, K., Liu, Q., Ma, Z., and He, J. 7B 模型与 8K 示例:通过强化学习实现的新兴推理既有效又高效。https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion 博客。

Zhang, H., Wang, P., Diao, S., Lin, Y., Pan, R., Dong, H., Zhang, D., Molchanov, P., and Zhang, T. Entropyregularized process reward model, 2024a. URL https: //arxiv.org/abs/2412.11006.

Zhang, H., Wang, P., Diao, S., Lin, Y., Pan, R., Dong, H., Zhang, D., Molchanov, P., and Zhang, T. 熵正则化过程奖励模型, 2024a. URL https://arxiv.org/abs/2412.11006.

Zhang, H., Yao, J., Ye, C., Xiong, W., and Zhang, T. Online-dpo-r1: Unlocking effective reasoning without the PPO overhead. https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4?pvs=4, 2025. Notion Blog.

Zhang, H., Yao, J., Ye, C., Xiong, W., and Zhang, T. Online-dpo-r1: 无需PPO开销即可解锁有效推理。https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4?pvs=4, 2025. Notion Blog.

Zhou, D., Scharli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

Zhou, D., Scharli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., 等. 最少到最多提示使大语言模型能够进行复杂推理. arXiv 预印本 arXiv:2205.10625, 2022.

Zhu, X., Wang, J., Zhang, L., Zhang, Y., Huang, Y., Gan, R., Zhang, J., and Yang, Y. Solving math word problems via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257, 2022.

Zhu, X., Wang, J., Zhang, L., Zhang, Y., Huang, Y., Gan, R., Zhang, J., and Yang, Y. 通过协作推理引导的大语言模型解决数学应用题。arXiv 预印本 arXiv:2210.16257, 2022.

A. Extended Related Works

A. 扩展相关工作

LLMs for Mathematical Problem Solving. LLMs have shown great capacity in reasoning-related mathematical problem solving tasks (Cobbe et al., 2021a; Hendrycks et al., 2021; OpenAI, 2023; Team et al., 2023). To elicit the reasoning ability of LLMs, Chain-of-Thought (CoT) prompting (Wei et al., 2022; Zhou et al., 2022; Zhu et al., 2022) has been used as a standard approach, enabling step-by-step reasoning. Another line of work equips LLMs with external tools like calculators (Cobbe et al., 2021b; Shao et al., 2022), symbolic solvers (Zhang, 2023), and Python interpreters (Mishra et al., 2022; OpenAI, 2023). These LLM agents with external tools achieve even further impressive reasoning ability across a wide range of reasoning tasks (Chen et al., 2022; Gao et al., 2023; Gou et al., 2023). While we focus on the CoT scenario, the proposed framework and algorithms also naturally apply to tool-integrated reasoning, which we leave for future work.

大语言模型在数学问题解决中的应用。大语言模型在推理相关的数学问题解决任务中展现了巨大的能力 (Cobbe et al., 2021a; Hendrycks et al., 2021; OpenAI, 2023; Team et al., 2023)。为了激发大语言模型的推理能力,思维链 (Chain-of-Thought, CoT) 提示 (Wei et al., 2022; Zhou et al., 2022; Zhu et al., 2022) 已被用作标准方法,支持逐步推理。另一类工作则为大语言模型配备了外部工具,如计算器 (Cobbe et al., 2021b; Shao et al., 2022)、符号求解器 (Zhang, 2023) 和 Python 解释器 (Mishra et al., 2022; OpenAI, 2023)。这些配备外部工具的大语言模型智能体在广泛的推理任务中展现了更为出色的推理能力 (Chen et al., 2022; Gao et al., 2023; Gou et al., 2023)。虽然我们专注于思维链场景,但所提出的框架和算法也自然适用于工具集成的推理,这部分工作我们将留待未来研究。

RLHF for Mathematical Problem Solving. In recognition of the tremendous successes of RL in making the generalpurpose chatbot, researchers have worked on adapting these methods to building strong mathematical reasoning models. These algorithms can be largely grouped into three different categories. Among them, reward-ranked fine-tuning (or rejection sampling fine-tuning) (Dong et al., 2023; Yuan et al., 2023b; Touvron et al., 2023; Zelikman et al., 2022) is extensively employed for synthetic data generation, whether through on-policy (self-improving) (Yuan et al., 2023a) or off-policy (knowledge distillation) methods (Gou et al., 2023; Yu et al., 2023; Toshniwal et al., 2024b; Singh et al., 2023; Tong et al., 2024). These methods typically generate a large amount of trajectories and use a reward model (either through final result checking or an outcome supervised reward model) to select samples for further fine-tuning. Another line of works uses the deep-RL methods such as PPO (Schulman et al., 2017) or Reinforce variants (Williams, 1992). For instance, Shao et al. (2024) proposes the GRPO algorithms to improve the multi-turn math problem solving in the CoT format and achieves the state-of-the-art performance in its class. Kumar et al. (2024) adopts a variant of (Ahmadian et al., 2024) to improve the self-correction ability of models. Finally, a line of works apply the direct preference learning algorithms to mathematical problem solving mainly because of its simplicity and computational efficiency (Jiao et al., 2024; Yuan et al., 2024a; Xie et al., 2024b; Pang et al., 2024; Lai et al., 2024; Chen et al., 2024; Lu et al., 2024). Most of these works focus on the single-turn scenario and apply the original DPO (Rafailov et al., 2023) or KTO (Ethayarajh et al., 2024) algorithms. After these, Xie et al. (2024a); Zhong et al. (2024); Xiong et al. (2024a); Rafailov et al. (2024) extend the single-turn DPO to multi-turn scenario with trajectory preference. 
Our algorithm combines reward-ranked fine-tuning (the self-rewarding IFT stage) and direct preference learning (the M-DPO stage), and the main focus of the algorithmic design in this project is to adapt them to the self-rewarding reasoning agent framework, with self-correction as the representative task.

RLHF 在数学问题求解中的应用。鉴于强化学习 (RL) 在通用聊天机器人领域的巨大成功,研究人员致力于将这些方法应用于构建强大的数学推理模型。这些算法大致可以分为三类。其中,奖励排序微调(或拒绝采样微调)(Dong et al., 2023; Yuan et al., 2023b; Touvron et al., 2023; Zelikman et al., 2022) 被广泛用于合成数据生成,无论是通过在线策略(自我改进)(Yuan et al., 2023a) 还是离线策略(知识蒸馏)方法 (Gou et al., 2023; Yu et al., 2023; Toshniwal et al., 2024b; Singh et al., 2023; Tong et al., 2024)。这些方法通常生成大量轨迹,并使用奖励模型(通过最终结果检查或结果监督奖励模型)选择样本进行进一步微调。另一类工作使用深度强化学习方法,如 PPO (Schulman et al., 2017) 或 Reinforce 变体 (Williams, 1992)。例如,Shao et al. (2024) 提出了 GRPO 算法,以改进 CoT 格式的多轮数学问题求解,并在其类别中实现了最先进的性能。Kumar et al. (2024) 采用了 (Ahmadian et al., 2024) 的变体来提高模型的自我纠正能力。最后,一类工作将直接偏好学习算法应用于数学问题求解,主要是因为其简单性和计算效率 (Jiao et al., 2024; Yuan et al., 2024a; Xie et al., 2024b; Pang et al., 2024; Lai et al., 2024; Chen et al., 2024; Lu et al., 2024)。这些工作大多集中在单轮场景中,并应用了原始的 DPO (Rafailov et al., 2023) 或 KTO (Ethayarajh et al., 2024) 算法。之后,Xie et al. (2024a); Zhong et al. (2024); Xiong et al. (2024a); Rafailov et al. (2024) 将单轮 DPO 扩展到具有轨迹偏好的多轮场景。我们的算法结合了奖励排序微调(自我奖励的 IFT 阶段)和直接偏好学习(M-DPO 阶段),本项目算法设计的主要重点是将它们适应到自我奖励推理智能体框架中,以代表性的自我纠正任务为核心。

B. Missing Experimental Details

B. 缺失的实验细节

Prompt Template. We present the prompt template used in our experiments here, where we mainly follow the prompt design in Kumar et al. (2024) with slight modifications.

提示模板。我们在此展示实验中使用的提示模板,主要遵循 Kumar 等人 (2024) 的提示设计,并稍作修改。

Self-rewarding prompt used in the two-turn conversation framework: Since your initial response is self-evaluated as incorrect, there might be an error in the solution above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution.

在两轮对话框架中使用的自我奖励提示:由于你的初始回应被自我评估为不正确,上述解决方案可能存在因对问题理解不足而产生的错误。如果有错误,请纠正并重写解决方案。

Gold Test: Your initial response is evaluated as incorrect. There might be an error in the solution above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution.

黄金测试:您的初始回答被评估为不正确。上述解决方案可能存在错误,原因可能是对问题的理解不足。如果有错误,请纠正并重写解决方案。
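The two-turn conversation these templates implement can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` is a placeholder for the actual model call, and the stopping rule uses the `[VERIFY]` self-evaluation tag that appears in the examples of Appendix D.

```python
# Sketch of the two-turn self-rewarding correction loop (assumptions:
# `generate` is a stand-in for the model call; prompts follow the
# templates shown above).

CORRECTION_PROMPT = (
    "Since your initial response is self-evaluated as incorrect, there might "
    "be an error in the solution above because of lack of understanding of "
    "the question. Please correct the error, if any, and rewrite the solution."
)

def self_rewarding_inference(question, generate):
    """Generate an answer, let the model self-verify it, and revise once if
    the self-evaluation says the first attempt is wrong."""
    history = [{"role": "user", "content": question}]
    first = generate(history)           # reasoning ending with "[VERIFY] ..."
    history.append({"role": "assistant", "content": first})
    if "[VERIFY] correct" in first:     # model judges its own answer correct
        return first                    # terminate: accept the first attempt
    history.append({"role": "user", "content": CORRECTION_PROMPT})
    second = generate(history)          # revised solution in the second turn
    return second
```

In the gold-test variant, the `if` condition would instead use the ground-truth answer check rather than the model's own `[VERIFY]` tag.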

Python Experiment Environment. The Python package versions and virtual machines we use can influence the evaluation results. While this does not affect the overall trend, we specify the key package versions here. For IFT and M-DPO training of the Llama models, we use transformers 4.44.1 and torch 2.1.2. For IFT, we use the open-source axolotl project (version 0.4.1), and for M-DPO, we use the code base from the original M-DPO paper (Xiong et al., 2024a). The setup for the Qwen models is similar, except for an updated axolotl 0.6.0 (to support the new models). For PPO training, we use veRL v0.1. We use sympy 1.2 and antlr4-python3-runtime 4.11.0, following Gou et al. (2023), for result checking. We use vLLM 0.5.4 to generate completions. For Llama-3-8B-it model evaluation, we also use transformers 4.44.1, while for Llama-3-SFT-based experiments, we fix transformers to 4.46.1, because one of our machines became unavailable while preparing this draft and we upgraded transformers on a new machine to fix some bugs.

Python 实验环境。我们使用的 Python 包版本和虚拟机会影响评估结果。虽然这不会影响整体趋势,但我们在此指定了使用的关键包版本。对于 Llama 模型的 IFT 和 M-DPO 训练,我们使用 transformers 4.44.1 和 torch 2.1.2。对于 IFT,我们使用版本为 0.4.1 的开源项目 axolotl,而对于 M-DPO,我们使用原始 M-DPO 论文 (Xiong et al., 2024a) 中的代码库。Qwen 模型的设置类似,只是更新为 axolotl 0.6.0(以支持新模型)。对于 PPO 训练,我们使用 veRL v0.1。我们使用 sympy 1.2 和 antlr4-python3-runtime 4.11.0,遵循 Gou 等人 (2023) 的结果检查方法。我们使用 vLLM 0.5.4 生成补全。对于 Llama-3-8B-it 模型评估,我们也使用 transformers 4.44.1,而对于基于 Llama-3-SFT 的实验,我们将 transformers 固定为 4.46.1,因为在准备本工作草案期间,我们的一台机器不可用,我们在新机器上升级了 transformers 以修复一些错误。
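As a rough illustration of the kind of sympy-based answer checking these packages are used for (a sketch only; the exact comparison logic in Gou et al. (2023) may differ, and `answers_match` is our own name):

```python
def answers_match(pred: str, gold: str) -> bool:
    """Check whether two LaTeX answer strings are symbolically equivalent,
    falling back to exact string comparison if parsing fails."""
    try:
        from sympy import simplify
        from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime
        # Two answers are equivalent if their symbolic difference simplifies to 0.
        return simplify(parse_latex(pred) - parse_latex(gold)) == 0
    except Exception:
        return pred.strip() == gold.strip()
```

With this check, `answers_match(r"\frac{1}{2}", "0.5")` is accepted even though the strings differ, which is why symbolic checking is preferred over exact string matching for math answers.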

C. Additional Experimental Results

C. 附加实验结果

In this section, we include additional ablation studies and evaluation results for a more comprehensive understanding of the self-rewarding reasoning framework and the proposed algorithms.

在本节中,我们提供了额外的消融研究和评估结果,以便更全面地理解自奖励推理框架和所提出的算法。

Table 10. Main results of different methods on the test set of MATH. The test temperature is 0.7.

表 10. 不同方法在 MATH 测试集上的主要结果。测试温度为 0.7。

| 基础模型 | 方法 | 第一轮准确率 | 最终准确率 | Δ(t1, t2) | Δi→c(t1, t2) | Δc→i(t1, t2) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B-it | 使用黄金奖励模型的提示 | 24.1 | 33.1 | 9.0 | 9.0 | 0 |
| Llama-3-8B-it | 内在自我校正 | 24.1 | 25.6 | 1.5 | 10.0 | 8.5 |
| Llama-3-8B-it | STaR/RAFT 自我校正 | 25.7 | 28.0 | 2.3 | 10.9 | 8.6 |
| Llama-3-8B-it | STaR/RAFT+ 自我校正 | 25.5 | 28.6 | 3.1 | 10.6 | 7.5 |
| Llama-3-8B-it | 使用外部 ORM 的自我校正 | 24.1 | 29.3 | 5.2 | 8.7 | 3.5 |
| Llama-3-8B-it | 自我奖励 IFT | 25.0 | 29.4 | 4.4 | 7.5 | 3.1 |
| Llama-3-SFT | 使用黄金奖励模型的提示 | 43.1 | 51.0 | 7.9 | 7.9 | 0 |
| Llama-3-SFT | 内在自我校正 | 43.0 | 41.7 | -1.3 | 6.8 | 8.1 |
| Llama-3-SFT | STaR/RAFT 自我校正 | 42.5 | 40.4 | -2.1 | 9.3 | 11.4 |
| Llama-3-SFT | STaR/RAFT+ 自我校正 | 42.9 | 43.1 | 0.2 | 8.1 | 7.9 |
| Llama-3-SFT | 使用外部 ORM 的自我校正 | 43.1 | 44.6 | 1.5 | 6.1 | 4.6 |
| Llama-3-SFT | 自我奖励 IFT | 43.1 | 45.7 | 2.6 | 6.7 | 4.1 |
| Llama-3-8B-it | 使用黄金奖励模型的提示 | 67.5 | 74.0 | 6.5 | 6.5 | 0 |
| Llama-3-8B-it | 内在自我校正 | 67.5 | 51.6 | -15.9 | 6.1 | 22.0 |
| Llama-3-8B-it | STaR/RAFT 自我校正 | 77.9 | 62.5 | -15.4 | 7.9 | 23.3 |
| Llama-3-8B-it | STaR/RAFT+ 自我校正 | 78.4 | 66.9 | -11.5 | 7.4 | 18.9 |
| Llama-3-8B-it | 使用外部 ORM 的自我校正 | 67.5 | 69.9 | 2.4 | 4.5 | 2.1 |
| Llama-3-8B-it | 自我奖励 IFT | 76.4 | 80.5 | 4.1 | 7.7 | 3.6 |
| Llama-3-SFT | 使用黄金奖励模型的提示 | 81.5 | 86.6 | 5.1 | 5.1 | 0 |
| Llama-3-SFT | 内在自我校正 | 81.5 | 74.8 | -6.7 | 5.3 | 12.0 |
| Llama-3-SFT | STaR/RAFT 自我校正 | 78.5 | 72.7 | -5.8 | 8.6 | 14.4 |
| Llama-3-SFT | STaR/RAFT+ 自我校正 | 79.0 | 78.4 | -0.6 | 6.3 | 6.9 |
| Llama-3-SFT | 使用外部 ORM 的自我校正 | 81.5 | 82.3 | 0.9 | 2.3 | 1.4 |
| Llama-3-SFT | 自我奖励 IFT | 80.8 | 82.6 | 1.8 | 2.7 | 0.9 |
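The transition metrics in this table can be recomputed from per-problem correctness labels at the two turns; a minimal sketch (function and variable names are ours, not from the paper's code):

```python
def correction_metrics(first_correct, final_correct):
    """Compute turn-1/turn-2 accuracies and the transition rates reported
    in the table, from per-problem boolean correctness labels."""
    n = len(first_correct)
    acc_t1 = sum(first_correct) / n * 100
    acc_t2 = sum(final_correct) / n * 100
    # Fraction of problems that changed from incorrect to correct, and vice versa.
    i_to_c = sum((not a) and b for a, b in zip(first_correct, final_correct)) / n * 100
    c_to_i = sum(a and (not b) for a, b in zip(first_correct, final_correct)) / n * 100
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,   # Δ(t1, t2)
        "delta_i2c": i_to_c,        # Δi→c(t1, t2)
        "delta_c2i": c_to_i,        # Δc→i(t1, t2)
    }
```

Note that by construction Δ(t1, t2) = Δi→c(t1, t2) − Δc→i(t1, t2), up to rounding, which is a useful sanity check on the table entries.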

D. Examples

D. 示例

Math500 Example: Correct and Verify Correctly

Math500 示例:正确且验证正确

Prompt: System Please reason step by step, and put your final answer within \boxed{}. User How many positive whole-number divisors does 196 have?

提示:系统请逐步推理,并将最终答案放在 \boxed{} 中。用户 196 有多少个正整数因数?

Responses:

响应:

To determine the number of positive whole-number divisors of 196, we can follow these steps:

  1. **Prime Factorization**: First, we need to find the prime factorization of 196.
  2. **Divisor Function Formula**: Once we have the prime factorization, we can use the formula for finding the number of divisors. If the prime factorization of a number $n$ is $p_{1}^{e_{1}}\cdot p_{2}^{e_{2}}\cdot\ldots\cdot p_{k}^{e_{k}}$, then the number of divisors of $n$ is given by $(e_{1}+1)(e_{2}+1)\cdot\ldots\cdot(e_{k}+1)$.

Let's start by finding the prime factorization of 196. We can do this by dividing 196 by the smallest prime number (2) until we can't divide evenly anymore, then move on to the next smallest prime number, and so on. $196\div2=98$, $98\div2=49$, $49\div7=7$, $7\div7=1$. So, the prime factorization of 196 is $2^{2}\cdot7^{2}$. Using the divisor function formula, the number of divisors is $(2+1)(2+1)=3\cdot3=9$. Therefore, the number of positive whole-number divisors of 196 is $\boxed{9}$. Let's verify the correctness of the solution. The prime factorization of 196 is $2^{2}\cdot7^{2}$. Using the formula for the number of divisors, we get $(2+1)(2+1)=3\cdot3=9$. Therefore, the number of positive whole-number divisors of 196 is indeed 9. The final answer is $\boxed{9}$. The self-evaluation is: [VERIFY] correct.

要确定196的正整数因数的数量,我们可以按照以下步骤进行:

  1. 质因数分解:首先,我们需要找到196的质因数分解。
  2. 因数函数公式:一旦我们有了质因数分解,就可以使用求因数数量的公式。如果一个数 $n$ 的质因数分解是 $p_{1}^{e_{1}}\cdot p_{2}^{e_{2}}\cdot\ldots\cdot p_{k}^{e_{k}}$,那么 $n$ 的因数数量由 $(e_{1}+1)(e_{2}+1)\cdot\ldots\cdot(e_{k}+1)$ 给出。

让我们从196的质因数分解开始。我们可以通过将196除以最小的质数(2)直到不能再整除为止,然后继续下一个最小的质数,依此类推。

$196\div2=98$,$98\div2=49$,$49\div7=7$,$7\div7=1$。

因此,196的质因数分解是$2^{2}\cdot7^{2}$。

使用因数函数公式,因数的数量是$(2+1)(2+1)=3\cdot3=9$。

因此,196的正整数因数的数量是${\boxed{9}}$。

让我们验证一下解法的正确性。196的质因数分解是$2^{2}\cdot7^{2}$。使用因数数量的公式,我们得到$(2+1)(2+1)=3\cdot3=9$。因此,196的正整数因数的数量确实是9。最终答案是${\boxed{9}}$。

自我评估是:[VERIFY] 正确。

Table 11. An example of MATH500: A prompt-response pair where the initial answer is correct and is verified correctly.

表 11: MATH500 的一个示例:初始答案正确且验证正确的提示-响应对。
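The divisor-function argument in this example is easy to check programmatically; a small sketch of the same two steps (trial-division factorization, then the product formula), written by us for illustration:

```python
from collections import Counter

def prime_factorization(n: int) -> Counter:
    """Trial division, as in the response: divide out each prime in turn."""
    factors, p = Counter(), 2
    while p * p <= n:
        while n % p == 0:
            factors[p] += 1
            n //= p
        p += 1
    if n > 1:               # whatever remains is itself prime
        factors[n] += 1
    return factors

def num_divisors(n: int) -> int:
    """Number of positive divisors via (e1+1)(e2+1)...(ek+1)."""
    count = 1
    for exp in prime_factorization(n).values():
        count *= exp + 1
    return count
```

For 196 this recovers the factorization $2^2 \cdot 7^2$ and the divisor count $(2+1)(2+1) = 9$, matching the response.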

Math500 Example: Wrong to Correct

Math500 示例:从错误到正确

Prompt:

提示:

System Please reason step by step, and put your final answer within . User The set of points $(x,y,z)$ that satisfy

系统请逐步推理,并将最终答案放在 \boxed{} 中。用户 满足 $(x,y,z)$ 的点集

$$
2x=3y=-z
$$

$$
2x=3y=-z
$$

is a line. The set of points $(x,y,z)$ that satisfy

是一条线。满足 $(x,y,z)$ 的点集

$$
6x=-y=-4z
$$

$$
6x=-y=-4z
$$

is another line. Find the angle between these lines, in degrees.

是另一条线。求这两条线之间的角度,单位为度。

Assistant:

助手:

Responses:

响应:

To find the angle between the two lines, we first need to determine the direction vectors of the lines. 1. The first line is defined by the equations $2x=3y=-z$. We can parameterize this line by setting $2x=3y=-z=t$. This gives us $x=\frac{t}{2}$, $y=\frac{t}{3}$, and $z=-t$. So, the direction vector of the first line is $\mathbf{d_{1}}=\left(\frac{1}{2},\frac{1}{3},-1\right)$.

要找到两条直线之间的夹角,我们首先需要确定直线的方向向量。1. 第一条直线由方程 $2x=3y=-z$ 定义。我们可以通过设置 $2x=3y=-z=t$ 来参数化这条直线。这给出了 $x=\frac{t}{2}$,$y=\frac{t}{3}$,以及 $z=-t$。因此,第一条直线的方向向量为 $\mathbf{d_{1}}=\left(\frac{1}{2},\frac{1}{3},-1\right)$。

  2. The second line is defined by the equations $6x=-y=-4z$. We can parameterize this line by setting $6x=-y=-4z=s$. This gives us $x=\frac{s}{6}$, $y=-s$, and $z=-\frac{s}{4}$. So, the direction vector of the second line is $\mathbf{d_{2}}=\left(\frac{1}{6},-1,-\frac{1}{4}\right)$.
  2. 第二条线由方程 $6x=-y=-4z$ 定义。我们可以通过设置 $6x=-y=-4z=s$ 来参数化这条线。这给出了 $x=\frac{s}{6}$,$y=-s$,以及 $z=-\frac{s}{4}$。因此,第二条线的方向向量为 $\mathbf{d_{2}}=\left(\frac{1}{6},-1,-\frac{1}{4}\right)$。

Next, we need to find the angle between these two direction vectors. The angle $\theta$ between two vectors $\mathbf{d_{1}}$ and $\mathbf{d}_{2}$ can be found using the dot product formula:

接下来,我们需要找到这两个方向向量之间的角度。两个向量 $\mathbf{d_{1}}$ 和 $\mathbf{d}_{2}$ 之间的角度 $\theta$ 可以通过点积公式计算得出:

$$
\cos\theta=\frac{\mathbf{d_{1}}\cdot\mathbf{d_{2}}}{\|\mathbf{d_{1}}\|\,\|\mathbf{d_{2}}\|}
$$

where $\mathbf{d_{1}}\cdot\mathbf{d_{2}}$ is the dot product of the vectors, and $\|\mathbf{d_{1}}\|$ and $\|\mathbf{d_{2}}\|$ are the magnitudes of the vectors. Let's calculate this step by step using Python.

其中 $\mathbf{d_{1}}\cdot\mathbf{d_{2}}$ 是向量的点积,$\|\mathbf{d_{1}}\|$ 和 $\|\mathbf{d_{2}}\|$ 是向量的模。让我们使用 Python 逐步计算这个值。

import numpy as np
# 定义方向向量
d1 = np.array([1/2, 1/3, -1])
d2 = np.array([1/6, -1, -1/4])
# 计算 d1 和 d2 的点积
dot_product = np.dot(d1, d2)
# 计算 d1 和 d2 的模
magnitude_d1 = np.linalg.norm(d1)
magnitude_d2 = np.linalg.norm(d2)
# 计算角度的余弦值
cos_theta = dot_product / (magnitude_d1 * magnitude_d2)
# 计算角度的弧度值
theta_radians = np.arccos(cos_theta)
# 将角度转换为度数
theta_degrees = np.degrees(theta_radians)
print(theta_degrees)

$$
\mathbf{d}_{1}\cdot\mathbf{d}_{2}=\left(\frac{1}{2}\right)\left(\frac{1}{6}\right)+\left(\frac{1}{3}\right)(-1)+(-1)\left(-\frac{1}{4}\right)=\frac{1}{12}-\frac{1}{3}+\frac{1}{4}=\frac{1}{12}-\frac{4}{12}+\frac{3}{12}=0
$$

$$
\mathbf{d}_{1}\cdot\mathbf{d}_{2}=\left(\frac{1}{2}\right)\left(\frac{1}{6}\right)+\left(\frac{1}{3}\right)(-1)+(-1)\left(-\frac{1}{4}\right)=\frac{1}{12}-\frac{1}{3}+\frac{1}{4}=\frac{1}{12}-\frac{4}{12}+\frac{3}{12}=0
$$

Table 12. An example of MATH500: A prompt-response pair where the initial answer is incorrect and is then corrected.

表 12. MATH500 的一个示例:初始答案错误并随后被纠正的提示-响应对。

Minerva Math Example: Wrong to Correct

Minerva 数学示例:从错误到正确

Prompt:

提示:

Assistant:

助手:

Responses:

响应:

To determine the apparent magnitude of the star, we can use the distance modulus formula:

要确定恒星的视星等,我们可以使用距离模数公式:

$$
m-M=5\log_{10}(d)-5
$$

$$
m-M=5\log_{10}(d)-5
$$

$$
5\log_{10}(3\times{10}^{6})-5
$$

$$
5\log_{10}(3\times{10}^{6})-5
$$

We can break this down into two parts: 1. Calculate $\log_{10}(3\times10^{6})$ . 2. Multiply the result by 5 and subtract 5. Let’s do this calculation in Python.

我们可以将其分为两部分:1. 计算 $\log_{10}(3\times10^{6})$。2. 将结果乘以 5 并减去 5。让我们用 Python 进行这个计算。

import math

# 给定值
M = -7  # 绝对星等

d = 3 * 10**6 # 距离,单位为秒差距

# 计算距离模数
distance_modulus = 5 * math.log10(d) - 5

# 计算视星等
m = M + distance_modulus

print(m)
`输出 7.579181246047625`

The apparent magnitude of the star, when observed from a distance of 3 Mpc, is approximately 7.58. So the final answer is: 7.58. To verify the correctness of the solution, let's recheck the steps: 1. The absolute magnitude $M$ of the star is given as $-7$. 2. The distance $d$ to the star is $3~\mathrm{Mpc}=3\times10^{6}$ pc. 3. The distance modulus formula is $m-M=5\log_{10}(d)-5$.

从 3 Mpc 的距离观测时,这颗恒星的视星等约为 7.58。因此,最终答案是:7.58。为了验证解决方案的正确性,让我们重新检查步骤:1. 恒星的绝对星等 $M$ 给定为 $-7$。2. 恒星的距离 $d$ 为 $3~\mathrm{Mpc}=3\times10^{6}$ pc。3. 距离模数公式为 $m-M=5\log_{10}(d)-5$。

$$
5\log_{10}(3\times10^{6})-5=5(\log_{10}(3)+\log_{10}(10^{6}))-5=5(\log_{10}(3)+6)-5=5\log_{10}(3)+30-5=5\log_{10}(3)+25
$$

$$
5\log_{10}(3\times10^{6})-5=5(\log_{10}(3)+\log_{10}(10^{6}))-5=5(\log_{10}(3)+6)-5=5\log_{10}(3)+30-5=5\log_{10}(3)+25
$$

Using the approximate value $\log_{10}(3)\approx0.4771$ :

使用近似值 $\log_{10}(3)\approx0.4771$:

$$
5\times0.4771+25=2.3855+25=27.3855
$$

$$
5\times0.4771+25=2.3855+25=27.3855
$$

So the apparent magnitude $m$ is:

因此,视星等 $m$ 为:

$$
m=M+27.3855=-7+27.3855=20.3855
$$

$$
m=M+27.3855=-7+27.3855=20.3855
$$

$$
m-M=5\log_{10}(d)-5
$$

$$
m-M=5\log_{10}(d)-5
$$

Let’s recompute the distance modulus correctly:

让我们正确地重新计算距离模数:

$$
5\log_{10}(3\times10^{6})-5=5(\log_{10}(3)+\log_{10}(10^{6}))-5=5(\log_{10}(3)+6)-5=5\log_{10}(3)+30-5=5\log_{10}(3)+25
$$

$$
5\log_{10}(3\times10^{6})-5=5(\log_{10}(3)+\log_{10}(10^{6}))-5=5(\log_{10}(3)+6)-5=5\log_{10}(3)+30-5=5\log_{10}(3)+25
$$

Using the approximate value $\log_{10}(3)\approx0.4771$ :

使用近似值 $\log_{10}(3)\approx0.4771$:

$$
5\times0.4771+25=2.3855+25=27.3855
$$

$$
5\times0.4771+25=2.3855+25=27.3855
$$

So the apparent magnitude $m$ is:

因此,视星等 $m$ 为:

$$
m=M+27.3855=-7+27.3855=20.3855
$$

$$
m=M+27.3855=-7+27.3855=20.3855
$$

The correct apparent magnitude is approximately 20.39. So the final answer is: 20.39. Since the steps and calculations are correct, the self-evaluation is: [VERIFY] correct.

正确的视星等大约为20.39。因此最终答案是:20.39。由于步骤和计算正确,自我评估为:[VERIFY] 正确。

Table 13. An example of Minerva Math: A prompt-response pair where the initial answer is incorrect and is then corrected.

表 13. Minerva Math 示例:一个提示-响应对,其中初始答案错误并随后被纠正。
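The corrected arithmetic in this example can be reproduced directly; a quick check written by us (not part of the original response), using the values given in the problem:

```python
import math

M = -7                    # absolute magnitude, as given in the problem
d = 3 * 10**6             # 3 Mpc expressed in parsecs

# Distance modulus: 5*log10(d) - 5 = 5*log10(3) + 25, roughly 27.3856
distance_modulus = 5 * math.log10(d) - 5

# Apparent magnitude from m - M = 5*log10(d) - 5
m = M + distance_modulus

print(round(m, 2))        # prints 20.39, matching the corrected answer
```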

Olympiad Bench Example: Correct and Verify Correctly

Olympiad Bench 示例:正确与验证正确

Prompt:

提示:

Assistant:

助手:

Responses:

响应:

To solve this problem, we need to understand the concept of “weight” of a number. The weight of a number is the number of 1s in its binary representation. We need to find the difference between the number of positive integers at most $2^{2017}$ with even weight and the number of positive integers at most $2^{2017}$ with odd weight.

为了解决这个问题,我们需要理解“权重”这一概念。一个数的权重是指其二进制表示中1的个数。我们需要找出不超过 $2^{2017}$ 的正整数中,具有偶数权重和奇数权重的数的数量之差。

Here’s the reasoning:

推理如下:

This is because the number of even-weight numbers and odd-weight numbers is exactly the same, so their difference is zero. Let’s confirm this with a small Python code snippet to illustrate the symmetry of the binomial coefficients.

这是因为偶数权重和奇数权重的数字数量完全相同,所以它们的差为零。我们可以通过一个小的 Python 代码片段来确认这一点,以展示二项式系数的对称性。

Table 14. An example of Olympiad Bench: A prompt-response pair where the initial answer is correct and is verified correctly.

表 14. Olympiad Bench 示例:一个初始答案正确且验证正确的提示-响应对。
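The binomial symmetry the response appeals to, $\sum_{k\ \mathrm{even}}\binom{n}{k}=\sum_{k\ \mathrm{odd}}\binom{n}{k}=2^{n-1}$ for $n\ge1$, can be verified numerically. A small sketch written by us (it checks this identity itself, not the problem's full counting argument):

```python
from math import comb

def weight_parity_counts(n: int):
    """Count n-bit patterns (k ones among n positions) with even vs. odd weight."""
    even = sum(comb(n, k) for k in range(0, n + 1, 2))
    odd = sum(comb(n, k) for k in range(1, n + 1, 2))
    return even, odd

# For every n >= 1 the two counts agree, since sum_k (-1)^k C(n,k) = (1-1)^n = 0.
for n in range(1, 12):
    even, odd = weight_parity_counts(n)
    assert even == odd == 2 ** (n - 1)
```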
