[论文翻译]rStar-Math:小型大语言模型可以通过自我进化的深度思考掌握数学推理


原文地址:https://arxiv.org/pdf/2501.04519


rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math:小型大语言模型可以通过自我进化的深度思考掌握数学推理

Xinyu Guan∗ Li Lyna Zhang∗⋄ Yifei Liu Ning Shang Youran Sun Yi Zhu Fan Yang Mao Yang

Xinyu Guan∗ Li Lyna Zhang∗⋄ Yifei Liu Ning Shang Youran Sun Yi Zhu Fan Yang Mao Yang

Microsoft Research Asia

微软亚洲研究院

Abstract

摘要

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising “deep thinking” through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data sythesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iterative ly evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for $747\mathbf{k}$ math problems, rStar-Math boosts SLMs’ math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from $58.8%$ to $90.0%$ and Phi3-mini-3.8B from $41.4%$ to $86.4%$ , surpassing o1-preview by $+4.5%$ and $+0.9%$ On the USA Math Olympiad (AIME), rStar-Math solves an average of $53.3%$ (8/15) of problems, ranking among the top $20%$ the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

我们提出了 rStar-Math,以证明小语言模型 (SLMs) 可以在不依赖从更优模型蒸馏的情况下,匹敌甚至超越 OpenAI o1 的数学推理能力。rStar-Math 通过蒙特卡洛树搜索 (MCTS) 实现“深度思考”,其中数学策略 SLM 在基于 SLM 的过程奖励模型的指导下进行测试时搜索。rStar-Math 引入了三项创新来解决训练这两个 SLM 的挑战:(1) 一种新颖的代码增强 CoT 数据合成方法,通过广泛的 MCTS 展开生成逐步验证的推理轨迹,用于训练策略 SLM;(2) 一种新颖的过程奖励模型训练方法,避免了简单的步骤级评分标注,从而产生更有效的过程偏好模型 (PPM);(3) 一种自我进化方案,其中策略 SLM 和 PPM 从零开始构建,并通过迭代进化来提高推理能力。通过 4 轮自我进化,使用数百万个合成的解决方案解决 $747\mathbf{k}$ 个数学问题,rStar-Math 将 SLM 的数学推理能力提升到了最先进的水平。在 MATH 基准测试中,它将 Qwen2.5-Math-7B 从 $58.8%$ 提升到 $90.0%$,将 Phi3-mini-3.8B 从 $41.4%$ 提升到 $86.4%$,分别超越了 o1-preview 的 $+4.5%$ 和 $+0.9%$。在美国数学奥林匹克 (AIME) 中,rStar-Math 平均解决了 $53.3%$ (8/15) 的问题,跻身于最优秀的 $20%$ 高中数学学生之列。代码和数据将在 https://github.com/microsoft/rStar 上提供。

Table 1: rStar-Math enables frontier math reasoning in SLMs via deep thinking over 64 trajectories.

表 1: rStar-Math 通过 64 条轨迹的深度思考,在 SLM 中实现了前沿的数学推理。

任务 (pass@1Acc) rStar-Math (Qwen-7B) rStar-Math (Qwen-1.5B) rStar-Math (Phi3-mini) OpenAI o1-preview OpenAI ol-mini QWQ 32B-preview GPT-4o1 DeepSeek-V3
MATH 90.0 88.6 86.4 85.5 90.0 90.6 76.6 90.2
AIME2024 53.3 46.7 43.3 44.6 56.7 50.0 9.3 39.2
OlympiadBench 65.6 64.6 60.3 65.3 61.2 43.3 55.4
CollegeMath 60.5 59.3 59.1 57.8 55.8 48.5 58.9
Omni-Math 50.5 48.5 46.0 52.5 60.5 49.6 30.5 35.9

1 Introduction

1 引言

Recent studies have demonstrated that large language models (LLMs) are capable of tackling mathematical problems [Team, 2024a, Yang et al., 2024, OpenAI, 2024, Liu et al., 2024]. However, the conventional approach of having LLMs generate complete solutions in a single inference – akin to System 1 thinking [Daniel, 2011] – often yields fast but error-prone results [Valmeekam et al., 2023, OpenAI, 2023]. In response, test-time compute scaling [Snell et al., 2024, Qi et al., 2024] suggests a paradigm shift toward a System 2-style thinking, which emulates human reasoning through a slower and deeper thought process. In this paradigm, an LLM serves as a policy model to generate multiple math reasoning steps, which are then evaluated by another LLM acting as a reward model [OpenAI, 2024]. The steps and solutions deemed more likely to be correct are selected. The process repeats iterative ly and ultimately derives the final answer.

最近的研究表明,大语言模型(LLMs)能够解决数学问题 [Team, 2024a, Yang et al., 2024, OpenAI, 2024, Liu et al., 2024]。然而,传统的方法让 LLMs 在一次推理中生成完整的解决方案——类似于系统 1 思维 [Daniel, 2011]——通常会产生快速但容易出错的结果 [Valmeekam et al., 2023, OpenAI, 2023]。对此,测试时计算扩展 [Snell et al., 2024, Qi et al., 2024] 提出了一种向系统 2 思维范式转变的方法,通过更慢、更深层次的思考过程来模拟人类推理。在这种范式中,LLM 作为策略模型生成多个数学推理步骤,然后由另一个 LLM 作为奖励模型进行评估 [OpenAI, 2024]。选择被认为更有可能正确的步骤和解决方案。该过程反复迭代,最终得出最终答案。


Figure 1: The overview of rStar-Math.

图 1: rStar-Math 的概览。

In the test-time compute paradigm, the key is to train a powerful policy model that generates promising solution steps and a reliable reward model that accurately evaluates them, both of which depend on high-quality training data. Unfortunately, it is well-known that off-the-shelf high-quality math reasoning data is scarce, and synthesizing high-quality math data faces fundamental challenges. For the policy model, it is challenging to distinguish erroneous reasoning steps from the correct ones, complicating the elimination of low-quality data. It is worth noting that in math reasoning, a correct final answer does not ensure the correctness of the entire reasoning trace [Lanham et al., 2023]. Incorrect intermediate steps significantly decrease data quality. As for the reward model, process reward modeling (PRM) shows a great potential by providing fine-grained feedback on intermediate steps [Lightman et al., 2023]. However, the training data is even scarcer in this regard: accurate step-by-step feedback requires intense human labeling efforts and is impractical to scale, while those automatic annotation attempts show limited gains due to noisy reward scores [Luo et al., 2024, Wang et al., 2024c, Chen et al., 2024]. Due to the above challenges, existing distill-based data synthesis approaches to training policy models, e.g., scaling up GPT4-distilled CoT data [Tang et al., 2024, Huang et al., 2024], have shown diminishing returns and cannot exceed the capability of their teacher model; meanwhile, as of today, training reliable PRMs for math reasoning remains an open question.

在测试时计算范式中,关键在于训练一个强大的策略模型来生成有潜力的解题步骤,以及一个可靠的奖励模型来准确评估这些步骤,这两者都依赖于高质量的训练数据。不幸的是,众所周知,现成的高质量数学推理数据稀缺,而合成高质量的数学数据面临着根本性挑战。对于策略模型来说,区分错误推理步骤与正确步骤具有挑战性,这使得消除低质量数据变得复杂。值得注意的是,在数学推理中,正确的最终答案并不能确保整个推理过程的正确性 [Lanham et al., 2023]。错误的中间步骤会显著降低数据质量。至于奖励模型,过程奖励建模 (PRM) 通过提供对中间步骤的细粒度反馈显示出巨大潜力 [Lightman et al., 2023]。然而,这方面的训练数据更加稀缺:准确的逐步反馈需要大量的人工标注工作,且难以扩展,而那些自动标注尝试由于噪声奖励分数而显示出有限的收益 [Luo et al., 2024, Wang et al., 2024c, Chen et al., 2024]。由于上述挑战,现有的基于蒸馏的数据合成方法在训练策略模型时,例如扩展 GPT4 蒸馏的 CoT 数据 [Tang et al., 2024, Huang et al., 2024],已经显示出收益递减的趋势,并且无法超越其教师模型的能力;同时,截至目前,训练可靠的数学推理 PRM 仍然是一个悬而未决的问题。

In this work, we introduce rStar-Math, a self-evolvable System 2-style reasoning approach that achieves the state-of-the-art math reasoning, rivaling and sometimes even surpassing OpenAI o1 on challenging math competition benchmarks with a model size as small as 7 billion. Unlike solutions relying on superior LLMs for data synthesis, rStar-Math leverages smaller language models (SLMs) with Monte Carlo Tree Search (MCTS) to establish a self-evolutionary process, iterative ly generating higher-quality training data. To achieve self-evolution, rStar-Math introduces three key innovations.

在本工作中,我们介绍了 rStar-Math,这是一种可自我进化的 System 2 风格推理方法,它在具有挑战性的数学竞赛基准测试中实现了最先进的数学推理,有时甚至超越了 OpenAI o1,而模型大小仅为 70 亿。与依赖更强大语言模型进行数据合成的解决方案不同,rStar-Math 利用较小语言模型 (SLM) 和蒙特卡洛树搜索 (MCTS) 建立了一个自我进化过程,迭代生成更高质量的训练数据。为了实现自我进化,rStar-Math 引入了三项关键创新。

First, a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories with self-annotated MCTS $Q$ -values. Specifically, math problem-solving is decomposed into multi-step generation within MCTS. At each step, the SLM serving as the policy model samples candidate nodes, each generating a one-step CoT and the corresponding Python code. To verify the generation quality, only nodes with successful Python code execution are retained, thus mitigating errors in intermediate steps. Moreover, extensive MCTS rollouts automatically assign a Q-value to each intermediate step based on its contribution: steps contributing to more trajectories that lead to the correct answer are given higher Q-values and considered higher quality. This ensures that the reasoning trajectories generated by SLMs consist of correct, high-quality intermediate steps.

首先,一种新颖的代码增强的CoT数据合成方法,通过广泛的MCTS(蒙特卡洛树搜索)展开来生成带有自注释MCTS $Q$ 值的逐步验证的推理轨迹。具体来说,数学问题解决被分解为MCTS内的多步生成。在每一步中,作为策略模型的SLM(小语言模型)对候选节点进行采样,每个节点生成一步CoT(思维链)和相应的Python代码。为了验证生成质量,仅保留成功执行Python代码的节点,从而减少中间步骤中的错误。此外,广泛的MCTS展开会根据每个中间步骤的贡献自动分配一个Q值:那些有助于更多轨迹通向正确答案的步骤会被赋予更高的Q值,并被认为是更高质量的。这确保了由SLM生成的推理轨迹由正确且高质量的中间步骤组成。

Second, a novel method that trains an SLM acting as a process preference model, i.e., a PPM to implement the desired PRM, that reliably predicts a reward label for each math reasoning step. The PPM leverages the fact that, although $\mathrm{^Q}$ -values are still not precise enough to score each reasoning step despite using extensive MCTS rollouts, the Q-values can reliably distinguish positive (correct) steps from negative (irrelevant/incorrect) ones. Thus the training method constructs preference pairs for each step based on Q-values and uses a pairwise ranking loss [Ouyang et al., 2022] to optimize PPM’s score prediction for each reasoning step, achieving reliable labeling. This approach avoids conventional methods that directly use Q-values as reward labels [Luo et al., 2024, Chen et al., 2024], which are inherently noisy and imprecise in stepwise reward assignment.

其次,一种新颖的方法训练了一个作为过程偏好模型(PPM)的SLM,以实现所需的PRM,该方法能够可靠地预测每个数学推理步骤的奖励标签。PPM利用了这样一个事实:尽管使用广泛的MCTS(蒙特卡洛树搜索)模拟后,$\mathrm{^Q}$值仍然不足以精确评分每个推理步骤,但Q值可以可靠地区分正向(正确)步骤与负向(无关/错误)步骤。因此,训练方法基于Q值为每个步骤构建偏好对,并使用成对排序损失 [Ouyang et al., 2022] 来优化PPM对每个推理步骤的评分预测,从而实现可靠的标签。这种方法避免了传统方法中直接使用Q值作为奖励标签 [Luo et al., 2024, Chen et al., 2024],这些方法在逐步奖励分配中本质上是嘈杂且不精确的。

Finally, a four-round self-evolution recipe that progressively builds both a frontier policy model and PPM from scratch. We begin by curating a dataset of 747k math word problems from publicly available sources. In each round, we use the latest policy model and PPM to perform MCTS, generating increasingly high-quality training data using the above two methods to train a stronger policy model and PPM for next round. Each round achieves progressive refinement: (1) a stronger policy SLM, (2) a more reliable PPM, (3) generating better reasoning trajectories via PPM-augmented MCTS, and (4) improving training data coverage to tackle more challenging and even competitionlevel math problems.

最后,我们采用了一个四轮自我进化策略,逐步从头构建前沿的策略模型和 PPM(策略-价值模型)。我们首先从公开来源中整理了一个包含 747k 个数学应用题的数据集。在每一轮中,我们使用最新的策略模型和 PPM 进行蒙特卡洛树搜索(MCTS),通过上述两种方法生成越来越高质量的训练数据,以训练出更强的策略模型和 PPM 用于下一轮。每一轮都实现了逐步优化:(1) 更强的策略 SLM,(2) 更可靠的 PPM,(3) 通过 PPM 增强的 MCTS 生成更好的推理轨迹,以及 (4) 提高训练数据的覆盖范围,以应对更具挑战性甚至竞赛级别的数学问题。

Extensive experiments across four SLMs (1.5B-7B) and seven math reasoning tasks demonstrate the effectiveness of rStar-Math. Remarkably, rStar-Math improves all four SLMs, matching or even surpassing OpenAI o1 on challenging math benchmarks. On MATH benchmark, with 8 search trajectories, rStar-Math boosts Qwen2.5-Math-7B from $58.8%$ to $89.4%$ and Qwen2.5-Math-1.5B from $51.2%$ to $87.8%$ . With 64 trajectories, the scores rise to $90%$ and $88.4%$ , outperforming o1-preview by $4.5%$ and $2.6%$ and matching o1-mini’s $90%$ . On the Olympiad-level AIME 2024, rStar-Math solves on average $53.3%$ (8/15) of the problems, exceeding o1-preview by $8.7%$ and all other open-sourced LLMs. We further conduct comprehensive experiments to verify the superiority of step-by-step verified reasoning trajectories over state-of-the-art data synthesis baselines, as well as the PPM’s effectiveness compared to outcome reward models and Q value-based PRMs. Finally, we present key findings from rStar-Math deep thinking, including the intrinsic self-reflection capability and PPM’s preference for theorem-applications intermediate steps.

在四个小型语言模型(1.5B-7B)和七个数学推理任务上的广泛实验证明了rStar-Math的有效性。值得注意的是,rStar-Math提升了所有四个小型语言模型,在具有挑战性的数学基准测试中匹配甚至超越了OpenAI的o1。在MATH基准测试中,通过8条搜索轨迹,rStar-Math将Qwen2.5-Math-7B的准确率从$58.8%$提升至$89.4%$,并将Qwen2.5-Math-1.5B的准确率从$51.2%$提升至$87.8%$。通过64条轨迹,准确率分别提升至$90%$和$88.4%$,分别超过o1-preview的$4.5%$和$2.6%$,并与o1-mini的$90%$持平。在奥林匹克级别的AIME 2024测试中,rStar-Math平均解决了$53.3%$(8/15)的问题,超过o1-preview的$8.7%$,并超越了所有其他开源的大语言模型。我们进一步进行了全面的实验,验证了逐步验证的推理轨迹相对于最先进的数据合成基线的优越性,以及PPM相对于结果奖励模型和基于Q值的PRM的有效性。最后,我们展示了rStar-Math深度思考的关键发现,包括其内在的自我反思能力以及PPM对定理应用中间步骤的偏好。

2 Related Works

2 相关工作

Math Data Synthesis. Advancements in LLM math reasoning have largely relied on curating high-quality CoT data, with most leading approaches being GPT-distilled, using frontier models like GPT-4 for synthesis [Wang et al., 2024b, Gou et al., 2023, Luo et al., 2023]. Notable works include NuminaMath [Jia LI and Polu, 2024a] and MetaMath [Yu et al., 2023b]. While effective, this limits reasoning to the capabilities of the teacher LLM. Hard problems that the teacher LLM cannot solve are excluded in the training set. Even solvable problems may contain error-prone intermediate steps, which are hard to detect. Although rejection sampling methods [Yuan et al., 2023, Brown et al., 2024] can improve data quality, they do not guarantee correct intermediate steps. As a result, scaling up CoT data has diminishing returns, with gains nearing saturation—e.g., Open Math Instruct-2 [Toshniwal et al., 2024] only sees a $3.9%$ boost on MATH despite an $8\times$ increase in dataset size.

数学数据合成。大语言模型在数学推理方面的进展主要依赖于高质量思维链(CoT)数据的整理,大多数领先的方法都是通过GPT蒸馏,使用像GPT-4这样的前沿模型进行合成 [Wang et al., 2024b, Gou et al., 2023, Luo et al., 2023]。值得注意的工作包括NuminaMath [Jia LI and Polu, 2024a] 和 MetaMath [Yu et al., 2023b]。虽然这种方法有效,但它将推理能力限制在教师大语言模型的能力范围内。教师大语言模型无法解决的难题被排除在训练集之外。即使是可解决的问题,也可能包含容易出错的中间步骤,这些步骤很难被检测到。尽管拒绝采样方法 [Yuan et al., 2023, Brown et al., 2024] 可以提高数据质量,但它们并不能保证中间步骤的正确性。因此,扩大思维链数据带来的收益逐渐减少,收益接近饱和——例如,Open Math Instruct-2 [Toshniwal et al., 2024] 尽管数据集规模增加了 $8\times$,但在MATH上的提升仅为 $3.9%$。

Scaling Test-time Compute has introduced new scaling laws, allowing LLMs to improve performance across by generating multiple samples and using reward models for best-solution selection [Snell et al., 2024, Wu et al., 2024, Brown et al., 2024]. Various test-time search methods have been proposed [Kang et al., 2024, Wang et al., 2024a], including random sampling [Wang et al., 2023] and tree-search methods [Yao et al., 2024, Hao et al., 2023, Zhang et al., 2024b, Qi et al., 2024] like MCTS. However, open-source methods for scaling test-time computation have shown limited gains in math reasoning, often due to policy LLM or reward model limitations. rStar-Math addresses this by iterative ly evolving the policy LLM and reward model, achieving System 2 mathematical reasoning performance comparable to OpenAI o1 [OpenAI, 2024].

扩展测试时计算引入了新的扩展定律,允许大语言模型通过生成多个样本并使用奖励模型进行最佳解决方案选择来提高性能 [Snell et al., 2024, Wu et al., 2024, Brown et al., 2024]。已经提出了各种测试时搜索方法 [Kang et al., 2024, Wang et al., 2024a],包括随机采样 [Wang et al., 2023] 和树搜索方法 [Yao et al., 2024, Hao et al., 2023, Zhang et al., 2024b, Qi et al., 2024],如 MCTS。然而,开源方法在扩展测试时计算方面在数学推理上显示出有限的收益,通常是由于策略大语言模型或奖励模型的限制。rStar-Math 通过迭代进化策略大语言模型和奖励模型来解决这一问题,实现了与 OpenAI o1 相媲美的系统 2 数学推理性能 [OpenAI, 2024]。

Reward Models are crucial for effective System 2 reasoning but are challenging to obtain. Recent works include LLM-as-a-Judge for verification [Zheng et al., 2023, Qi et al., 2024] and specialized reward models like Outcome Reward Model [Yang et al., 2024, Yu et al., 2023a] and Process Reward Model (PRM) [Lightman et al., 2024]. While PRMs offer promising dense, step-level reward signals for complex reasoning [Luo et al., 2024, Wang et al., 2024c], collecting step-level annotations remains an obstacle. While Kang et al. [2024], Wang et al. [2024a] rely on costly human-annotated datasets like PRM800k [Lightman et al., 2024], recent approaches [Wang et al., 2024c, Luo et al., 2024] explore automated annotation via Monte Carlo Sampling or MCTS. However, they struggle to generate precise reward scores, which limits performance gains. rStar-Math introduces a novel process preference reward (PPM) that eliminates the need for accurate step-level reward score annotation.

奖励模型对于有效的系统2推理至关重要,但获取它们具有挑战性。最近的工作包括使用大语言模型作为验证的评判者 [Zheng et al., 2023, Qi et al., 2024],以及专门的奖励模型,如结果奖励模型 [Yang et al., 2024, Yu et al., 2023a] 和过程奖励模型 (PRM) [Lightman et al., 2024]。虽然PRM为复杂推理提供了有前景的密集、步骤级奖励信号 [Luo et al., 2024, Wang et al., 2024c],但收集步骤级注释仍然是一个障碍。尽管Kang et al. [2024], Wang et al. [2024a] 依赖于昂贵的人工注释数据集,如PRM800k [Lightman et al., 2024],但最近的方法 [Wang et al., 2024c, Luo et al., 2024] 探索了通过蒙特卡洛采样或MCTS进行自动注释。然而,它们在生成精确的奖励分数方面存在困难,这限制了性能的提升。rStar-Math引入了一种新颖的过程偏好奖励 (PPM),消除了对精确步骤级奖励分数注释的需求。

3 Methodology

3 方法论

3.1 Design Choices

3.1 设计选择

MCTS for Effective System 2 Reasoning. We aim to train a math policy SLM and a process reward model (PRM), and integrating both within Monte Carlo Tree Search (MCTS) for System 2 deep thinking. MCTS is chosen for two key reasons. First, it breaks down complex math problems into simpler single-step generation tasks, reducing the difficulty for the policy SLM compared to other

MCTS 用于有效的系统 2 推理。我们的目标是训练一个数学策略 SLM 和一个过程奖励模型 (PRM),并将两者集成到蒙特卡洛树搜索 (MCTS) 中,以实现系统 2 的深度思考。选择 MCTS 有两个关键原因。首先,它将复杂的数学问题分解为更简单的单步生成任务,与其他方法相比,降低了策略 SLM 的难度。

System 2 methods like Best-of-N [Brown et al., 2024] or self-consistency [Wang et al., 2023], which require generating full solutions in one inference. Second, the step-by-step generation in MCTS naturally yields step-level training data for both models. Standard MCTS rollout automatically assign Q-value to each step based on its contribution to the final correct answer, obviating the need for human-generated step-level annotations for process reward model training.

像 Best-of-N [Brown et al., 2024] 或自一致性 [Wang et al., 2023] 这样的 System 2 方法,需要在一个推理过程中生成完整的解决方案。其次,MCTS 中的逐步生成自然地为两个模型提供了步骤级别的训练数据。标准的 MCTS rollout 会根据每个步骤对最终正确答案的贡献自动分配 Q 值,从而无需人工生成的步骤级别注释来进行过程奖励模型的训练。

Ideally, advanced LLMs such as GPT-4 could be integrated within MCTS to generate training data. However, this approach faces two key challenges. First, even these powerful models struggle to consistently solve difficult problems, such as Olympiad-level mathematics. Consequently, the resulting training data would primarily consist of simpler solvable problems, limiting its diversity and quality. Second, annotating per-step Q-values demands extensive MCTS rollouts; insufficient tree exploration can lead to spurious Q-value assignments, such as overestimating suboptimal steps. Given that each rollout involves multiple single-step generations and these models are computationally expensive, increasing rollouts significantly raises inference costs.

理想情况下,像 GPT-4 这样的先进大语言模型可以集成到 MCTS(蒙特卡洛树搜索)中生成训练数据。然而,这种方法面临两个关键挑战。首先,即使是这些强大的模型也难以一致地解决难题,例如奥林匹克级别的数学问题。因此,生成的训练数据将主要由较简单的可解决问题组成,限制了其多样性和质量。其次,标注每一步的 Q 值需要大量的 MCTS 模拟;树探索不足可能导致虚假的 Q 值分配,例如高估次优步骤。由于每次模拟涉及多个单步生成,且这些模型的计算成本高昂,增加模拟次数会显著提高推理成本。

Overview. To this end, we explore using two 7B SLMs (a policy SLM and a PRM) to generate higherquality training data, with their smaller size allowing for extensive MCTS rollouts on accessible hardware (e.g., $4!\times!40\mathrm{GB}$ A100 GPUs). However, self-generating data presents greater challenges for SLMs, due to their weaker capabilities. SLMs frequently fail to generate correct solutions, and even when the final answer is correct, the intermediate steps are often flawed or of poor quality. Moreover, SLMs solve fewer challenging problems compared to advanced models like GPT-4.

概述。为此,我们探索使用两个 7B 的 SLM(策略 SLM 和 PRM)来生成更高质量的训练数据,其较小的规模使得在可访问的硬件(例如,4×40GB A100 GPU)上进行广泛的 MCTS 展开成为可能。然而,由于 SLM 的能力较弱,自生成数据对它们来说带来了更大的挑战。SLM 经常无法生成正确的解决方案,即使最终答案正确,中间步骤也常常存在缺陷或质量较差。此外,与 GPT-4 等先进模型相比,SLM 解决的具有挑战性的问题更少。

This section introduces our methodology, as illustrated in Fig. 1. To mitigate errors and low-quality intermediate steps, we introduce a code-augmented CoT synthetic method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories, annotated with Q-values. To further improve SLM performance on challenging problems, we introduce a four-round self-evolution recipe. In each round, both the policy SLM and the reward model are updated to stronger versions, progressively tackling more difficult problems and generating higher-quality training data. Finally, we present a novel process reward model training approach that eliminates the need for precise per-step reward annotations, yielding the more effective process preference model (PPM).

本节介绍我们的方法,如图 1 所示。为了减少错误和低质量的中间步骤,我们引入了一种代码增强的 CoT 合成方法,该方法通过广泛的 MCTS 展开来生成逐步验证的推理轨迹,并标注 Q 值。为了进一步提高 SLM 在挑战性问题上的表现,我们引入了一种四轮自我进化方案。在每一轮中,策略 SLM 和奖励模型都会更新为更强的版本,逐步解决更困难的问题并生成更高质量的训练数据。最后,我们提出了一种新颖的过程奖励模型训练方法,消除了对精确的每步奖励标注的需求,从而产生了更有效的过程偏好模型 (PPM)。

3.2 Step-by-Step Verified Reasoning Trajectory

3.2 逐步验证的推理轨迹

We start by introducing our method for generating step-by-step verified reasoning trajectories with per-step Q-value annotations. Given a problem $x$ and a policy model $M$ , we run the standard MCTS to increment ally construct a search tree for step-by-step solution exploration. As shown in Fig. 1(a), the root node represents question $x$ , while child nodes correspond to intermediate steps $s$ generated by $M$ . A root-to-leaf path ending at terminal node $s_{d}$ forms a trajectory $\mathbf{t}=x\oplus s_{1}\oplus s_{2}\oplus\ldots\oplus s_{d}$ , with each step $s_{i}$ assigned a $\mathrm{^Q}$ -value $Q(s_{i})$ . From the search tree $\tau$ , we extract solution trajectories $\mathbb{T}={\mathbf{t}^{1},\mathbf{t}^{2},...,\mathbf{t}^{n}}(\bar{n}\geq1)$ . Our goal is to select high-quality trajectories from $\tau$ to construct the training set. For this purpose, we introduce code-augmented CoT synthesis method to filter out low-quality generations and perform extensive rollouts to improve the reliability of Q-value accuracy.

我们首先介绍一种生成带有每步Q值注释的逐步验证推理轨迹的方法。给定一个问题$x$和一个策略模型$M$,我们运行标准的MCTS(蒙特卡洛树搜索)来逐步构建一个搜索树,用于逐步探索解决方案。如图1(a)所示,根节点代表问题$x$,而子节点对应于由$M$生成的中间步骤$s$。从根节点到终端节点$s_{d}$的路径形成一条轨迹$\mathbf{t}=x\oplus s_{1}\oplus s_{2}\oplus\ldots\oplus s_{d}$,其中每个步骤$s_{i}$都被赋予一个$\mathrm{^Q}$值$Q(s_{i})$。从搜索树$\tau$中,我们提取解决方案轨迹$\mathbb{T}={\mathbf{t}^{1},\mathbf{t}^{2},...,\mathbf{t}^{n}}(\bar{n}\geq1)$。我们的目标是从$\tau$中选择高质量的轨迹来构建训练集。为此,我们引入了代码增强的CoT(链式思维)合成方法,以过滤掉低质量的生成,并进行广泛的rollout以提高Q值准确性的可靠性。

Code-augmented CoT Generation. Prior MCTS approaches primarily generate natural language (NL) CoTs [Qi et al., 2024, Zhang et al., 2024a]. However, LLMs often suffer from hallucination, producing incorrect or irrelevant steps yet still arrive at the correct answer by chance [Lanham et al., 2023]. These flawed steps are challenging to detect and eliminate. To address this, we propose a novel code execution augmented CoT. As shown in Fig. 2, the policy model generates a one-step NL CoT alongside its corresponding Python code, where the NL CoT is embedded as a Python comment. Only generations with successfully executed Python code are retained as valid candidates.

代码增强的 CoT 生成。之前的 MCTS 方法主要生成自然语言 (NL) CoT [Qi et al., 2024, Zhang et al., 2024a]。然而,大语言模型经常出现幻觉,生成不正确或不相关的步骤,但仍然偶然得出正确答案 [Lanham et al., 2023]。这些有缺陷的步骤难以检测和消除。为了解决这个问题,我们提出了一种新颖的代码执行增强的 CoT。如图 2 所示,策略模型生成一步 NL CoT 及其对应的 Python 代码,其中 NL CoT 作为 Python 注释嵌入。只有成功执行 Python 代码的生成才会被保留为有效候选。


Figure 2: An example of Code-augmented CoT.

图 2: 代码增强的 CoT 示例。

Specifically, starting from the initial root node `$x$ , we perform multiple MCTS iterations through selection, expansion, rollout, and back-propagation. At step $i$ , we collect the latest reasoning trajectory $x\oplus s_{1}\oplus s_{2}\oplus...\oplus s_{i-1}$ as the current state. Based on this state, we prompt (see Appendix A.3) the policy model to generate $n$ candidates $s_{i,0},...,s_{i,n-1}$ for step $i$ . Python code execution is then employed to filter valid nodes. As shown in Fig. 2, each generation $s_{i,j}$ is concatenated with the code from all previous steps, forming $s_{1}\oplus s_{2}\oplus...\oplus s_{i-1}\oplus s_{i,j}$ . Candidates that execute successfully are retained as valid nodes and scored by the PPM, which assigns a Q-value $q(s_{i})$ . Then, we use the well-known Upper Confidence bounds for Trees (UCT) [Kocsis and Szepesvári, 2006] to select the best node among the $n$ candidates. This selection process is mathematically represented as:

具体来说,从初始根节点$x$ 开始,我们通过选择、扩展、模拟和反向传播执行多次 MCTS 迭代。在第 $i$ 步,我们收集最新的推理轨迹 $x\oplus s_{1}\oplus s_{2}\oplus...\oplus s_{i-1}$ 作为当前状态。基于此状态,我们提示(见附录 A.3)策略模型生成第 $i$ 步的 $n$ 个候选 $s_{i,0},...,s_{i,n-1}$。然后使用 Python 代码执行来筛选有效节点。如图 2 所示,每个生成 $s_{i,j}$ 都与之前所有步骤的代码连接,形成 $s_{1}\oplus s_{2}\oplus...\oplus s_{i-1}\oplus s_{i,j}$。成功执行的候选被保留为有效节点,并由 PPM 评分,分配一个 Q 值 $q(s_{i})$。然后,我们使用著名的树的上置信界 (UCT) [Kocsis and Szepesvári, 2006] 在 $n$ 个候选中选择最佳节点。此选择过程在数学上表示为:

\mathrm{UCT}(s)=Q(s)+c{\sqrt{\frac{\ln N_{p a r e n t}(s)}{N(s)}}};\quad{\mathrm{where}}\quad Q(s)={\frac{q(s)}{N(s)}}
\mathrm{UCT}(s)=Q(s)+c{\sqrt{\frac{\ln N_{p a r e n t}(s)}{N(s)}}};\quad{\mathrm{where}}\quad Q(s)={\frac{q(s)}{N(s)}}
\$\$

Extensive Rollouts for Q-value Annotation. Accurate Q-value $Q(s)$ annotation in Eq. 1 is crucial for guiding MCTS node selection towards correct problem-solving paths and identifying high-quality steps within trajectories. To improve Q-value reliability, we draw inspiration from Go players, who retrospectively evaluate the reward of each move based on game outcomes. Although initial estimates may be imprecise, repeated gameplay refines these evaluations over time. Similarly, in each rollout, we update the Q-value of each step based on its contribution to achieving the correct final answer. After extensive MCTS rollouts, steps consistently leading to correct answers achieve higher Q-values, occasional successes yield moderate Q-values, and consistently incorrect steps receive low Q-values. Specifically, we introduce two self-annotation methods to obtain these step-level Q-values. Fig. 1(c) shows the detailed setting in the four rounds of self-evolution.

广泛的Q值标注展开。在公式1中,准确的Q值 $Q(s)$ 标注对于引导MCTS节点选择正确的解题路径和识别轨迹中的高质量步骤至关重要。为了提高Q值的可靠性,我们从围棋玩家那里获得灵感,他们根据比赛结果回顾性地评估每一步的奖励。虽然初始估计可能不准确,但通过反复对弈,这些评估会随着时间的推移而得到改进。同样,在每次展开中,我们根据每一步对实现最终正确答案的贡献来更新其Q值。经过大量的MCTS展开后,始终导致正确答案的步骤会获得较高的Q值,偶尔成功的步骤会获得中等Q值,而始终错误的步骤则会获得较低的Q值。具体来说,我们引入了两种自标注方法来获得这些步骤级别的Q值。图1(c)展示了四轮自进化中的详细设置。

Terminal-guided annotation. During the first two rounds, when the PPM is unavailable or insufficiently accurate, we use terminal-guided annotation. Formally, let $q(s_{i})^{k}$ denote the $\mathbf{q}$ value for step $s_{i}$ after back-propagation in the $k^{t h}$ rollout. Following AlphaGo [Silver et al., 2017] and rStar [Qi et al., 2024], we score each intermediate node based on its contribution to the final correct answer:

终端引导的标注。在前两轮中,当 PPM 不可用或不够准确时,我们使用终端引导的标注。形式上,令 $q(s_{i})^{k}$ 表示在第 $k^{t h}$ 次回滚后步骤 $s_{i}$ 的 $\mathbf{q}$ 值。遵循 AlphaGo [Silver et al., 2017] 和 rStar [Qi et al., 2024],我们根据每个中间节点对最终正确答案的贡献对其进行评分:

q(s_{i})^{k}=q(s_{i})^{k-1}+q(s_{d})^{k};
q(s_{i})^{k}=q(s_{i})^{k-1}+q(s_{d})^{k};

where the initial q value $q(s_{i})^{0}=0$ in the first rollout. If this step frequently leads to a correct answer, its $q$ value will increase; otherwise, it decreases. Terminal nodes are scored as $q(s_{d})=1$ for correct answers and $q(s_{d})=-1$ otherwise, as shown in Fig. 1.

其中,在第一次 rollout 中,初始 q 值 $q(s_{i})^{0}=0$。如果此步骤经常导致正确答案,其 $q$ 值将增加;否则,将减少。终端节点的评分在正确答案时为 $q(s_{d})=1$,否则为 $q(s_{d})=-1$,如图 1 所示。

PRM-augmented annotation. Starting from the third round, we use PPM to score each step for more effective generation. Compared to terminal-guided annotation, which requires multiple rollouts for a meaningful $q$ value, PPM directly predicts a non-zero initial $q$ value. PPM-augmented MCTS also helps the policy model to generate higher-quality steps, guiding solutions towards correct paths. Formally, for step $s_{i}$ , PPM predicts an initial $\bar{q(s_{i})}^{0}$ value based on the partial trajectory:

PRM增强的注释。从第三轮开始,我们使用PPM对每个步骤进行评分,以实现更有效的生成。与终端引导的注释相比,后者需要多次展开才能获得有意义的$q$值,而PPM直接预测一个非零的初始$q$值。PPM增强的MCTS也有助于策略模型生成更高质量的步骤,引导解决方案走向正确的路径。形式上,对于步骤$s_{i}$,PPM基于部分轨迹预测一个初始$\bar{q(s_{i})}^{0}$值:

q(s_{i})^{0}=P P M(x\oplus s_{1}\oplus s_{2}\oplus\ldots\oplus s_{i-1}\oplus s_{i})
q(s_{i})^{0}=P P M(x\oplus s_{1}\oplus s_{2}\oplus\ldots\oplus s_{i-1}\oplus s_{i})

This $q$ value will be updated based on terminal node’s $q(s_{d})$ value through MCTS back-propagation in Eq. 2. For terminal node $s_{d}$ , we do not use PRM for scoring during training data generation. Instead, we assign a more accurate score based on ground truth labels as terminal-guided rewarding.

这个 $q$ 值将根据终端节点的 $q(s_{d})$ 值通过 MCTS 反向传播在公式 2 中更新。对于终端节点 $s_{d}$,我们在训练数据生成期间不使用 PRM 进行评分。相反,我们根据真实标签分配一个更准确的分数,作为终端引导的奖励。

3.3 Process Preference Model

3.3 过程偏好模型

Process reward models, which provide granular step-level reward signals, is highly desirable for solving challenging math problems. However, obtaining high-quality step-level training data remains an open challenge. Existing methods rely on human annotations [Lightman et al., 2023] or MCTSgenerated scores [Zhang et al., 2024a, Chen et al., 2024] to assign a score for each step. These scores then serve as training targets, with methods such as MSE loss [Chen et al., 2024] or pointwise loss [Wang et al., 2024c, Luo et al., 2024, Zhang et al., 2024a] used to minimize the difference between predicted and labeled scores. As a result, the precision of these annotated step-level reward scores directly determines the effectiveness of the resulting process reward model.

过程奖励模型(Process Reward Models)能够提供细粒度的步骤级奖励信号,这对于解决具有挑战性的数学问题非常理想。然而,获取高质量的步骤级训练数据仍然是一个开放的挑战。现有方法依赖于人工标注 [Lightman et al., 2023] 或 MCTS 生成的分数 [Zhang et al., 2024a, Chen et al., 2024] 来为每个步骤分配分数。这些分数随后作为训练目标,使用诸如 MSE 损失 [Chen et al., 2024] 或逐点损失 [Wang et al., 2024c, Luo et al., 2024, Zhang et al., 2024a] 等方法来最小化预测分数与标注分数之间的差异。因此,这些标注的步骤级奖励分数的精度直接决定了最终过程奖励模型的有效性。

Unfortunately, precise per-step scoring remains a unsolved challenge. Although our extensive MCTS rollouts improve the reliability of Q-values, precisely evaluating fine-grained step quality presents a major obstacle. For instance, among a set of correct steps, it is difficult to rank them as best, secondbest, or average and then assign precise scores. Similarly, among incorrect steps, differentiating the worst from moderately poor steps poses analogous challenges. Even expert human annotation struggles with consistency, particularly at scale, leading to inherent noise in training labels.

遗憾的是,精确的每一步评分仍然是一个未解决的挑战。尽管我们广泛的 MCTS 模拟提高了 Q 值的可靠性,但精确评估细粒度的步骤质量仍然是一个主要障碍。例如,在一组正确的步骤中,很难将它们排名为最佳、次佳或一般,然后分配精确的分数。同样,在错误的步骤中,区分最差和中等差的步骤也面临类似的挑战。即使是专家的人工标注也难以保持一致性,特别是在大规模情况下,这导致训练标签中存在固有的噪声。

We introduce a novel training method that trains a process preference model (PPM) by constructing step-level positive-negative preference pairs. As shown in Fig. 1(b), instead of using Q-values as direct reward labels, we use them to select steps from MCTS tree for preference pair construction. For each step, we select two candidates with the highest Q-values as positive steps and two with the lowest as negative steps. Critically, the selected positive steps must lead to a correct final answer, while negative steps must lead to incorrect answers. For intermediate steps (except the final answer step), the positive and negative pairs share the same preceding steps. For the final answer step, where identical reasoning trajectories rarely yield different final answers, we relax this restriction. We select two correct trajectories with the highest average $\mathrm{{Q}}.$ -values as positive examples and two incorrect trajectories with the lowest average Q-values as negative examples. Following [Ouyang et al., 2022], we define our loss function using the standard Bradley-Terry model with a pairwise ranking loss:

我们引入了一种新颖的训练方法,通过构建步骤级别的正负偏好对来训练过程偏好模型 (PPM)。如图 1(b) 所示,我们不使用 Q 值作为直接奖励标签,而是用它们从 MCTS 树中选择步骤来构建偏好对。对于每个步骤,我们选择两个 Q 值最高的候选作为正步骤,两个 Q 值最低的作为负步骤。关键的是,所选的正步骤必须导致正确的最终答案,而负步骤必须导致错误的答案。对于中间步骤(除了最终答案步骤),正负对共享相同的前置步骤。对于最终答案步骤,由于相同的推理轨迹很少会产生不同的最终答案,我们放宽了这一限制。我们选择两条平均 Q 值最高的正确轨迹作为正例,两条平均 Q 值最低的错误轨迹作为负例。根据 [Ouyang et al., 2022],我们使用标准的 Bradley-Terry 模型和成对排序损失来定义我们的损失函数:

\mathcal{L}_{p p m}(\theta)=-\frac{1}{2\times2}E_{(x,y_{i}^{p o s},y_{i}^{n e g}\in\mathbb{D})}[l o g(\sigma(r_{\theta}(x,y_{i}^{p o s})-r_{\theta}(x,y_{i}^{n e g})))]
\mathcal{L}_{p p m}(\theta)=-\frac{1}{2\times2}E_{(x,y_{i}^{p o s},y_{i}^{n e g}\in\mathbb{D})}[l o g(\sigma(r_{\theta}(x,y_{i}^{p o s})-r_{\theta}(x,y_{i}^{n e g})))]
y_{i}^{p o s}=s_{1}\oplus\ldots\oplus s_{i-1}\oplus s_{i}^{p o s};y_{i}^{n e g}=s_{1}\oplus\ldots\oplus s_{i-1}\oplus s_{i}^{n e g}
y_{i}^{p o s}=s_{1}\oplus\ldots\oplus s_{i-1}\oplus s_{i}^{p o s};y_{i}^{n e g}=s_{1}\oplus\ldots\oplus s_{i-1}\oplus s_{i}^{n e g}

Here, $r_{\theta}(x,y_{i})$ denotes the output of the PPM, where $x$ is the problem and $y$ is the trajectory from the first step to the $i^{t h}$ step.

这里,$r_{\theta}(x,y_{i})$ 表示 PPM 的输出,其中 $x$ 是问题,$y$ 是从第一步到第 $i^{t h}$ 步的轨迹。

3.4 Self-Evolved Deep Thinking

3.4 自我进化的深度思考

3.4.1 Training with Step-by-Step Verified Reasoning Trajectory

3.4.1 使用逐步验证的推理轨迹进行训练

Math Problems Collection. We collect a large dataset of $747\mathbf{k}$ math word problems with final answer ground-truth labels, primarily from NuminaMath [Jia LI and Polu, 2024a] and MetaMath [Yu et al., 2023b]. Notably, only competition-level problems (e.g., Olympiads and AIME/AMC) from NuminaMath are included, as we observe that grade-school-level problems do not significantly improve LLM complex math reasoning. To augment the limited competition-level problems, we follow [Li et al., 2024] and use GPT-4 to synthesize new problems based on the seed problems in $7.5\mathrm{k}$ MATH train set and 3.6k AMC-AIME training split. However, GPT-4 often generated unsolvable problems or incorrect solutions for challenging seed problems. To filter these, we prompt GPT-4 to generate 10 solutions per problem, retaining only those with at least 3 consistent solutions.

数学问题收集。我们收集了一个包含 $747\mathbf{k}$ 个数学应用题的大型数据集,这些数据集带有最终答案的真实标签,主要来自 NuminaMath [Jia LI 和 Polu, 2024a] 和 MetaMath [Yu 等人, 2023b]。值得注意的是,我们只包含了 NuminaMath 中的竞赛级别问题(例如奥林匹克竞赛和 AIME/AMC),因为我们观察到小学级别的问题对大语言模型的复杂数学推理能力提升不大。为了扩充有限的竞赛级别问题,我们遵循 [Li 等人, 2024] 的方法,使用 GPT-4 基于 $7.5\mathrm{k}$ 个 MATH 训练集和 3.6k 个 AMC-AIME 训练集中的种子问题生成新的问题。然而,GPT-4 经常为具有挑战性的种子问题生成无法解决的问题或不正确的解决方案。为了过滤这些问题,我们提示 GPT-4 为每个问题生成 10 个解决方案,仅保留那些至少有 3 个一致解决方案的问题。

Reasoning Trajectories Collection. Instead of using the original solutions in the $747\mathbf{k}$ math dataset, we conduct extensive MCTS rollouts (Sec. 3.2) to generate higher-quality step-by-step verified reasoning trajectories. In each self-evolution round, we perform 16 rollouts per math problem, which leads to 16 reasoning trajectories. Problems are then categories by difficulty based on the correct ratio of the generated trajectories: easy (all solutions are correct), medium (a mix of correct and incorrect solutions) and hard (all solutions are incorrect). For hard problems with no correct trajectories, an additional MCTS with 16 rollouts is performed. After that, all step-by-step trajectories and their annotated Q-values are collected and filtered to train the policy SLM and process preference model.

推理轨迹收集。我们没有使用 $747\mathbf{k}$ 数学数据集中的原始解决方案,而是进行了广泛的 MCTS 展开(第 3.2 节),以生成更高质量的逐步验证推理轨迹。在每一轮自我进化中,我们对每个数学问题执行 16 次展开,从而产生 16 条推理轨迹。然后根据生成轨迹的正确率将问题按难度分类:简单(所有解决方案都正确)、中等(正确和不正确解决方案混合)和困难(所有解决方案都不正确)。对于没有正确轨迹的困难问题,我们会额外执行 16 次展开的 MCTS。之后,收集并过滤所有逐步轨迹及其标注的 Q 值,以训练策略 SLM 和处理偏好模型。

Supervised Fine-tuning the Policy SLM. Through extensive experiments, we find that selecting high-quality reasoning trajectories is the key for fine-tuning a frontier math LLM. While methods such as GPT-distillation and Best-of-N can include low-quality or erroneous intermediate steps, a more effective approach ensures that every step in the trajectory is of high quality. To achieve this, we use per-step Q-values to select optimal trajectories from MCTS rollouts. Specifically, for each math problem, we select the top-2 trajectories with the highest average Q-values among those leading to correct answers as SFT training data.

监督微调策略 SLM。通过大量实验,我们发现选择高质量的推理轨迹是微调前沿数学大语言模型的关键。虽然 GPT 蒸馏和 Best-of-N 等方法可能包含低质量或错误的中间步骤,但更有效的方法是确保轨迹中的每一步都是高质量的。为此,我们使用每步 Q 值从 MCTS 模拟中选择最优轨迹。具体来说,对于每个数学问题,我们从导致正确答案的轨迹中选择平均 Q 值最高的前 2 条轨迹作为 SFT 训练数据。

Training PPM. The PPM is initialized from the fine-tuned policy model, with its next-token prediction head replaced by a scalar-value head consisting of a linear layer and a tanh function to constrain outputs to the range [-1, 1]. We filter out math problems where all solution trajectories are fully correct or incorrect. For problems with mixed outcomes, we select two positive and two negative examples for each step based on Q-values, which are used as preference pairs for training data.

训练 PPM。PPM 从微调后的策略模型初始化,其下一个 Token 的预测头被替换为由线性层和 tanh 函数组成的标量值头,以将输出限制在 [-1, 1] 范围内。我们过滤掉所有解轨迹完全正确或错误的数学问题。对于结果混合的问题,我们根据 Q 值为每个步骤选择两个正例和两个负例,这些正负例将作为训练数据的偏好对。

3.4.2 Recipe for Self-Evolution

3.4.2 自我进化方案

Due to the weaker capabilities of SLMs, we perform four rounds of MCTS deep thinking to progressively generate higher-quality data and expand the training set with more challenging math problems.

由于 SLM 的能力较弱,我们进行了四轮 MCTS 深度思考,逐步生成更高质量的数据,并通过更具挑战性的数学问题扩展训练集。

Table 2: Percentage of the $747\mathbf{k}$ math problems correctly solved in each round. Only problems have correct solutions are included in the training set. The first round uses DeepSeek-Coder-Instruct as the policy LLM, while later rounds use our fine-tuned 7B policy SLM.

表 2: 每轮正确解决的 $747\mathbf{k}$ 数学问题的百分比。只有包含正确解决方案的问题才会被纳入训练集。第一轮使用 DeepSeek-Coder-Instruct 作为策略大语言模型,而后续轮次使用我们微调的 7B 策略小语言模型。

# MCTS 中的模型 GSM 级别 MATH 级别 奥林匹克级别 总计
第1轮 DeepSeek-Coder-V2-Instruct 96.61% 67.36% 20.99% 60.17%
第2轮 策略小语言模型-r1 97.88% 67.40% 56.04% 66.60%
第3轮 策略小语言模型-r2, PPM-r2 98.15% 88.69% 62.16% 77.86%
第4轮 策略小语言模型-r3, PPM-r3 98.15% 94.53% 80.58% 90.25%

Table 3: Pass@1 accuracy of the resulting policy SLM in each round, showing continuous improvement until surpassing the bootstrap model.

表 3: 每轮策略 SLM 的 Pass@1 准确率,显示持续改进直至超越引导模型。

Round# MATH AIME2024 AMC2023 OlympiadBench College Math GSM8K GaokaoEn2023
DeepSeek-Coder-V2-Instruct (引导模型) 75.3 13.3 57.5 37.6 46.2 94.9 64.7
Base (Qwen2.5-Math-7B) 58.8 0.0 22.5 21.8 41.6 91.6 51.7
policy SLM-r1 69.6---3.3---- 30.0—- 34.7- --44.5 5---88.4----57.4
policySLM-r2 73.6 10.0 35.0 39.0 45.7 89.1 59.7
policySLM-r3 75.8 16.7 45.0 44.1 49.6 89.3 62.8
policySLM-r4 78.4 26.7 47.5 47.1 52.5 89.7 65.7

Each round uses MCTS to generate step-by-step verified reasoning trajectories, which are then used to train the new policy SLM and PPM. The new models are then applied in next round to generate higher-quality training data. Fig. 1(c) and Table 2 detail the models used for data generation in each round, along with the identifiers of the trained policy model and PPM. Next, we outline the details and specific improvements targeted in each round.

每一轮使用 MCTS 生成逐步验证的推理轨迹,然后用于训练新的策略 SLM 和 PPM。新模型随后在下一轮中应用,以生成更高质量的训练数据。图 1(c) 和表 2 详细列出了每轮用于数据生成的模型,以及训练的策略模型和 PPM 的标识符。接下来,我们概述每轮的细节和具体改进目标。

Round 1: Boots trapping an initial strong policy SLM-r1. To enable SLMs to self-generate reasonably good training data, we perform a bootstrap round to fine-tune an initial strong policy model, denoted as SLM-r1. As shown in Table 2, we run MCTS with DeepSeek-Coder-V2-Instruct (236B) to collect the SFT data. With no available reward model in this round, we use terminal-guided annotation for Q-values and limit MCTS to 8 rollouts for efficiency. For correct solutions, the top-2 trajectories with the highest average Q-values are selected as SFT data. We also train PPM-r1, but the limited rollouts yields unreliable Q-values, affecting the effectiveness of PPM-r1 ( Table 4).

第一轮:引导初始强策略模型 SLM-r1。为了使 SLM 能够自生成合理的高质量训练数据,我们进行了一轮引导,以微调初始的强策略模型,记为 SLM-r1。如表 2 所示,我们使用 DeepSeek-Coder-V2-Instruct (236B) 运行 MCTS 来收集 SFT 数据。由于本轮没有可用的奖励模型,我们使用终端引导的注释来生成 Q 值,并将 MCTS 的 rollout 限制为 8 次以提高效率。对于正确的解决方案,选择平均 Q 值最高的前两条轨迹作为 SFT 数据。我们还训练了 PPM-r1,但由于 rollout 次数有限,Q 值不可靠,影响了 PPM-r1 的有效性(表 4)。

Round 2: Training a reliable PPM-r2. In this round, with the policy model updated to the 7B SLM-r1, we conduct extensive MCTS rollouts for more reliable Q-value annotation and train the first reliable reward model, PPM-r2. Specifically, we perform 16 MCTS rollouts per problem. The resulting step-by-step verified reasoning trajectories show significant improvements in both quality and Q-value precision. As shown in Table 4, PPM $\cdot\mathbf{r}2$ is notably more effective than in the bootstrap round. Moreover, the policy SLM-r2 also continues to improve as expected (Table 3).

第二轮:训练可靠的 PPM-r2。在这一轮中,随着策略模型更新为 7B SLM-r1,我们进行了广泛的 MCTS 模拟,以获得更可靠的 Q 值标注,并训练了第一个可靠的奖励模型 PPM-r2。具体来说,我们对每个问题进行了 16 次 MCTS 模拟。生成的逐步验证的推理轨迹在质量和 Q 值精度上都有显著提升。如表 4 所示,PPM $\cdot\mathbf{r}2$ 比引导轮中的效果显著更好。此外,策略 SLM-r2 也如预期继续改进(表 3)。

Round 3: PPM-augmented MCTS to significantly improve data quality. With the reliable PPM-r2, we perform PPM-augmented MCTS in this round to generate data, leading to significantly higher-quality trajectories that cover more math and Olympiad-level problems in the training set (Table 2). The generated reasoning trajectories and self-annotated Q-values are then used to train the new policy SLM-r3 and PPM-r3, both of which show significant improvements.

第三轮:PPM增强的MCTS显著提高数据质量。借助可靠的PPM-r2,我们在本轮中执行PPM增强的MCTS来生成数据,从而产生显著更高质量的轨迹,这些轨迹覆盖了训练集中更多的数学和奥林匹克级别的问题(表2)。生成的推理轨迹和自我注释的Q值随后用于训练新的策略SLM-r3和PPM-r3,两者都显示出显著的改进。

Round 4: Solving challenging math problems. After the third round, while grade school and MATH problems achieve high success rates, only $62.16%$ of Olympiad-level problems are included in the training set. This is NOT solely due to weak reasoning abilities in our SLMs, as many Olympiad problems remain unsolved by GPT-4 or o1. To improve coverage, we adopt a straightforward strategy. For unsolved problems after 16 MCTS rollouts, we perform an additional 64 rollouts, and if needed, increase to 128. We also conduct multiple MCTS tree expansions with different random seeds. This boosts the success rate of Olympiad-level problems to $80.58%$ .

第四轮:解决具有挑战性的数学问题。在第三轮之后,虽然小学和MATH问题的成功率很高,但训练集中仅包含 $62.16%$ 的奥赛级别问题。这并不仅仅是因为我们的SLM推理能力较弱,因为许多奥赛问题仍然未被GPT-4或o1解决。为了提高覆盖率,我们采用了一种简单的策略。对于在16次MCTS rollout后仍未解决的问题,我们额外进行64次rollout,如果需要,再增加到128次。我们还使用不同的随机种子进行多次MCTS树扩展。这将奥赛级别问题的成功率提升至 $80.58%$。

After four rounds of self-evolution, $90.25%$ of the $747\mathbf{k}$ math problems are successfully covered into the training set, as shown in Table 2. Among the remaining unsolved problems, a significant portion consists of synthetic questions. We manually review a random sample of 20 problems and find that 19 are incorrectly labeled with wrong answers. Based on this, we conclude that the remaining unsolved problems are of low quality and thus terminate the self-evolution at round 4.

经过四轮自我进化后,$90.25%$ 的 $747\mathbf{k}$ 数学问题成功纳入训练集,如表 2 所示。在剩余的未解决问题中,很大一部分是合成问题。我们手动审查了随机抽取的 20 个问题,发现其中 19 个被错误标注了答案。基于此,我们得出结论,剩余的未解决问题质量较低,因此在第四轮终止了自我进化。

Table 4: The quality of PPM consistently improves across rounds. The policy model has been fixed with policy SLM-r1 for a fair comparison.

表 4: PPM 的质量在每一轮中持续提升。为了公平比较,策略模型已固定为策略 SLM-r1。

Round# MATH AIME 2024 AMC2023 Olympiad Bench College Math GSM8K GaokaoEn2023
PPM-r1 75.2 10.0 57.5 35.7 45.4 90.9 60.3
PPM-r2 84.1 26.7 75.0 52.7 54.2 93.3 73.0
PPM-r3 85.2 33.3 77.5 59.5 55.6 93.9 76.6
PPM-r4 87.0 43.3 77.5 61.5 56.8 94.2 77.8

4 Evaluation

4 评估

4.1 Setup

4.1 设置

Evaluation Datasets. We evaluate rStar-Math on diverse mathematical benchmarks. In addition to the widely-used GSM8K [Cobbe et al., 2021], we include challenging benchmarks from multiple domains: (i) competition and Olympiad-level benchmarks, such as MATH-500 [Lightman et al., 2023], AIME 2024 [AI-MO, 2024a], AMC 2023 [AI-MO, 2024b] and Olympiad Bench [He et al., 2024]. Specifically, AIME is the exams designed to challenge the brightest high school math students in American, with the 2024 dataset comprising 30 problems from AIME I and II exams; (ii) collegelevel math problems from College Math [Tang et al., 2024] and (iii) out-of-domain math benchmark: GaoKao (Chinese College Entrance Exam) En 2023 [Liao et al., 2024].

评估数据集。我们在多样化的数学基准上评估了 rStar-Math。除了广泛使用的 GSM8K [Cobbe et al., 2021],我们还包含了来自多个领域的具有挑战性的基准:(i) 竞赛和奥林匹克级别的基准,例如 MATH-500 [Lightman et al., 2023]、AIME 2024 [AI-MO, 2024a]、AMC 2023 [AI-MO, 2024b] 和 Olympiad Bench [He et al., 2024]。具体来说,AIME 是为挑战美国最优秀的高中数学学生而设计的考试,2024 年的数据集包含来自 AIME I 和 II 考试的 30 个问题;(ii) 来自 College Math [Tang et al., 2024] 的大学级别数学问题,以及 (iii) 领域外的数学基准:GaoKao(中国高考)En 2023 [Liao et al., 2024]。

Base Models and Setup. rStar-Math is a general approach applicable to various LLMs. To show its effectiveness and general iz ability, we use SLMs of different sizes as the base policy models: Qwen2.5-Math-1.5B [Qwen, 2024b], Phi3-mini-Instruct (3B) [Microsoft, 2024, Abdin et al., 2024], Qwen2-Math-7B [Qwen, 2024a] and Qwen2.5-Math-7B [Qwen, 2024c]. Among these, Phi3-miniInstruct is a general-purpose SLM without specialization in math reasoning.

基础模型与设置。rStar-Math 是一种适用于各种大语言模型的通用方法。为了展示其有效性和泛化能力,我们使用不同规模的小语言模型作为基础策略模型:Qwen2.5-Math-1.5B [Qwen, 2024b]、Phi3-mini-Instruct (3B) [Microsoft, 2024, Abdin et al., 2024]、Qwen2-Math-7B [Qwen, 2024a] 和 Qwen2.5-Math-7B [Qwen, 2024c]。其中,Phi3-miniInstruct 是一个通用的小语言模型,未专门针对数学推理进行优化。

Due to limited GPU resources, we performed 4 rounds of self-evolution exclusively on Qwen2.5- Math-7B, yielding 4 evolved policy SLMs (Table 3) and 4 PPMs (Table 4). For the other 3 policy LLMs, we fine-tune them using step-by-step verified trajectories generated from Qwen2.5-Math-7B’s 4th round. The final PPM from this round is then used as the reward model for the 3 policy SLMs.

由于 GPU 资源有限,我们仅对 Qwen2.5-Math-7B 进行了 4 轮自我进化,生成了 4 个进化的策略小模型 (表 3) 和 4 个 PPM (表 4)。对于其他 3 个策略大模型,我们使用 Qwen2.5-Math-7B 第 4 轮生成的逐步验证轨迹对其进行微调。本轮最终的 PPM 随后被用作这 3 个策略小模型的奖励模型。

Baselines. rStar-Math is a System 2 method. We compare it against three strong baselines representing both System 1 and System 2 approaches: (i) Frontier LLMs, including GPT-4o, the latest Claude, OpenAI o1-preview and o1-mini. We measure their accuracy on AMC 2023, Olympiad Bench, College Math, Gaokao and GSM8K, with accuracy numbers for other benchmarks are taken from public technical reports [Team, 2024a]. (ii) Open-sourced superior reasoning models, including DeepSeek-Coder-v2-Instruct, Mathstral [Team, 2024b], NuminaMath-72B [Jia LI and Polu, 2024a], and LLaMA3.1 [Dubey et al., 2024], which represent the current mainstream System 1 approaches for improving LLM math reasoning. (iii) Both System 1 and System 2 performance of the base models trained from the original models teams, including Instruct versions (e.g., Qwen2.5-Math-7B-Instruct) and Best-of-N (e.g., Qwen2.5-Math-72B-Instruct+Qwen2.5-Math-RM-72B). Notably, the reward model used for the three Qwen base models is a 72B ORM, significantly larger than our 7B PPM.

基线。rStar-Math 是一种系统 2 方法。我们将其与代表系统 1 和系统 2 方法的三个强基线进行比较:(i) 前沿大语言模型,包括 GPT-4o、最新的 Claude、OpenAI o1-preview 和 o1-mini。我们测量了它们在 AMC 2023、Olympiad Bench、College Math、Gaokao 和 GSM8K 上的准确率,其他基准的准确率数据来自公开技术报告 [Team, 2024a]。(ii) 开源的高级推理模型,包括 DeepSeek-Coder-v2-Instruct、Mathstral [Team, 2024b]、NuminaMath-72B [Jia LI and Polu, 2024a] 和 LLaMA3.1 [Dubey et al., 2024],这些模型代表了当前改进大语言模型数学推理的主流系统 1 方法。(iii) 从原始模型团队训练的基础模型的系统 1 和系统 2 性能,包括 Instruct 版本(例如 Qwen2.5-Math-7B-Instruct)和 Best-of-N(例如 Qwen2.5-Math-72B-Instruct+Qwen2.5-Math-RM-72B)。值得注意的是,用于三个 Qwen 基础模型的奖励模型是一个 72B ORM,明显大于我们的 7B PPM。

Evaluation Metric. We report Pass $@1$ accuracy for all baselines. For System 2 baselines, we use default evaluation settings, such as default thinking time for o1-mini and o1-preview. For Qwen models with Best-of-N, we re-evaluate MATH-500, AIME/AMC accuracy; other benchmarks results are from their technical reports. For a fair comparison, rStar-Math run MCTS to generate the same number of solutions as Qwen. Specifically, for AIME/AMC, we generate 16 trajectories for AIME/AMC and 8 for other benchmarks, using PPM to select the best solution. We also report performance with increased test-time computation using 64 trajectories, denoted as rStar-Math64.

评估指标。我们报告所有基线的 Pass $@1$ 准确率。对于 System 2 基线,我们使用默认的评估设置,例如 o1-mini 和 o1-preview 的默认思考时间。对于使用 Best-of-N 的 Qwen 模型,我们重新评估了 MATH-500、AIME/AMC 的准确率;其他基准测试结果来自其技术报告。为了公平比较,rStar-Math 运行 MCTS 以生成与 Qwen 相同数量的解决方案。具体来说,对于 AIME/AMC,我们生成 16 条轨迹,其他基准测试生成 8 条轨迹,并使用 PPM 选择最佳解决方案。我们还报告了使用 64 条轨迹增加测试时间计算的性能,记为 rStar-Math64。

4.2 Main Results

4.2 主要结果

Results on diverse challenging math benchmarks. Table 5 shows the results of rStar-Math with comparing to state-of-the-art reasoning models. We highlight three key observations: (1) rStar-Math significantly improves SLMs math reasoning capabilities, achieving performance comparable to or surpassing OpenAI o1 with substantially smaller model size (1.5B-7B). For example, Qwen2.5- Math-7B, originally at $58.8%$ accuracy on MATH, improved dramatically to $90.0%$ with rStar-Math, outperforming o1-preview and Claude 3.5 Sonnet while matching o1-mini. On the College Math benchmark, rStar-Math exceeds o1-mini by $2.7%$ . On AIME 2024, rStar-Math scored $53.3%$ , ranking just below o1-mini, with the 7B model solving 8/15 problems in both AIME I and II, placing in the top $20%$ of the brightest high school math students. Notably, 8 of the unsolved problems were geometry-based, requiring visual understanding, a capability rStar-Math currently does not support. (2) Despite using smaller policy models (1.5B-7B) and reward models (7B), rStar-Math significantly outperforms state-of-the-art System 2 baselines. Compared to Qwen Best-of-N baselines, which use the same base models (Qwen2-Math-7B, Qwen2.5-Math-1.5B/7B) but a $10\times$ larger reward model (Qwen2.5-Math-RM-72B), rStar-Math consistently improves the reasoning accuracy of all base models to state-of-the-art levels. Even against Best-of-N with a $10\times$ larger Qwen2.5-Math-72BInstruct policy model, rStar-Math surpasses it on all benchmarks except GSM8K, using the same number of sampled solutions. (3) Beyond well-known benchmarks like MATH, GSM8K, and AIME, which may risk over-optimization, rStar-Math shows strong general iz ability on other challenging math benchmarks, including Olympiad Bench, College Math, and the Chinese College Entrance Math Exam (Gaokao), setting new state-of-the-art scores. As discussed in Sec. 3.4, our training set is primarily sourced from public datasets, with no specific optimization s for these benchmarks.

在多样化的数学基准测试中的结果。表 5 展示了 rStar-Math 与最先进的推理模型的对比结果。我们强调了三个关键观察点:(1) rStar-Math 显著提升了 SLM 的数学推理能力,在模型规模显著较小的情况下(1.5B-7B),达到了与 OpenAI o1 相当或超越的性能。例如,Qwen2.5-Math-7B 在 MATH 上的准确率从原来的 58.8% 大幅提升至 90.0%,超越了 o1-preview 和 Claude 3.5 Sonnet,同时与 o1-mini 持平。在 College Math 基准测试中,rStar-Math 比 o1-mini 高出 2.7%。在 AIME 2024 上,rStar-Math 得分为 53.3%,仅次于 o1-mini,7B 模型在 AIME I 和 II 中解决了 8/15 的问题,位列最优秀的高中数学学生的前 20%。值得注意的是,8 个未解决的问题是基于几何的,需要视觉理解能力,这是 rStar-Math 目前不支持的。(2) 尽管使用了较小的策略模型(1.5B-7B)和奖励模型(7B),rStar-Math 显著超越了最先进的 System 2 基线。与使用相同基础模型(Qwen2-Math-7B, Qwen2.5-Math-1.5B/7B)但奖励模型大 10 倍的 Qwen Best-of-N 基线相比,rStar-Math 始终将所有基础模型的推理准确率提升至最先进水平。即使与使用 10 倍大的 Qwen2.5-Math-72BInstruct 策略模型的 Best-of-N 相比,rStar-Math 在除 GSM8K 外的所有基准测试上都超越了它,使用了相同数量的采样解决方案。(3) 除了可能面临过度优化风险的知名基准测试如 MATH、GSM8K 和 AIME 外,rStar-Math 在其他具有挑战性的数学基准测试上也表现出强大的泛化能力,包括 Olympiad Bench、College Math 和中国高考数学(Gaokao),创造了新的最先进成绩。正如第 3.4 节所讨论的,我们的训练集主要来自公共数据集,没有针对这些基准测试进行特定的优化。

Table 5: The results of rStar-Math and other frontier LLMs on the most challenging math benchmarks. rStar-Math64 shows the Pass $@1$ accuracy achieved when sampling 64 trajectories.

表 5: rStar-Math 和其他前沿大语言模型在最具挑战性的数学基准测试中的结果。rStar-Math64 显示了在采样 64 条轨迹时达到的 Pass @1 准确率。

方法 竞赛和大学水平 OOD 高考
AIME MATH AMC 2023 奥林匹克
2024
模型 前沿大语言模型 GPT-40 系统 1 76.6 9.3 47.5
Claude3.5-Sonnet 78.3 16.0 48.5
系统 1 85.5 44.6 90.0
GPT-o1-preview GPT-o1-mini 90.0 56.7 95.0 65.3 57.8
开源推理大语言模型 DeepSeek-Coder-V2-Instruct 75.3 13.3 57.5 37.6
Mathstral-7B-v0.1 系统 1 57.8 0.0 37.5 21.5 33.7
NuminaMath-72B-CoT 系统 1 系统 1 64.0 3.3 70.0 32.6 39.7
LLaMA3.1-8B-Instruct 系统 1 51.4 6.7 25.0 15.4
LLaMA3.1-70B-Instruct 65.4 23.3 50.0 27.7
Qwen2.5-Math-72B-Instruct 系统 1 系统 1 85.6 30.0 70.0 49.0
Qwen2.5-Math-72B-Instruct+72B ORM 系统 2 85.8 36.7 72.5 54.5
通用基础模型: Phi3-mini-Instruct(3.8B)
Phi3-mini-Instruct (基础模型) 系统 1 41.4 3.33 7.5 12.3
rStar-Math (3.8B SLM+7B PPM) 系统 2 85.4 40.0 77.5 59.3 58.0
rStar-Math64 (3.8B SLM+7B PPM) 系统 2 86.4 43.3 80.0 60.3 59.1
数学专用基础模型: Qwen2.5-Math-1.5B
Qwen2.5-Math-1.5B (基础模型) 系统 1 51.2 0.0 22.5 16.7
Qwen2.5-Math-1.5B-Instruct 系统 1 60.0 10.0 60.0 38.1 47.7
Qwen2.5-Math-1.5B-Instruct+72B ORM 系统 2 83.4 20.0 72.5 47.3 50.2
rStar-Math (1.5B SLM+7B PPM) 系统 2 87.8 46.7 80.0 63.5 59.0
rStar-Math64 (1.5B SLM+7B PPM) 系统 2 88.6 46.7 85.0 64.6
数学专用基础模型: Qwen2-Math-7B
Qwen2-Math-7B (基础模型) 系统 1 53.4 3.3 25.0 17.3
Qwen2-Math-7B-Instruct 系统 1 73.2 13.3 62.5 38.2 45.9
Qwen2-Math-7B-Instruct+72BORM 系统 2 83.4 23.3 62.5 47.6 47.9
rStar-Math (7B SLM+7B PPM) 系统 2 88.2 43.3 80.0 63.1 58.4
rStar-Math64 (7B SLM+7B PPM) 系统 2 88.6 46.7 85.0 63.4 59.3
数学专用基础模型: Qwen2.5-Math-7B
Qwen2.5-Math-7B (基础模型) 系统 1 58.8 0.0 22.5 21.8
Qwen2.5-Math-7B-Instruct 系统 1 82.6 6.0 62.5 41.6
Qwen2.5-Math-7B-Instruct+72BORM 系统 2 88.4 26.7 75.0 49.9
rStar-Math (7B SLM+7B PPM) 系统 2 89.4 50.0 87.5 65.3
rStar-Math64 (7B SLM+7B PPM) 系统 2 90.0 53.3 87.5 65.6

Scaling up test-time computation. rStar-Math uses MCTS to augment the policy model, searching solutions guided by the PPM. By increasing test-time computation, it explores more trajectories, potentially improving performance. In Fig. 3, we show the impact of test-time compute scaling by comparing the accuracy of the official Qwen Best-of-N across different numbers of sampled trajectories on four challenging math benchmarks. Sampling only one trajectory corresponds to the policy LLM’s Pass $@1$ accuracy, indicating a fallback to System 1 reasoning. We highlight two key observations: (1) With only 4 trajectories, rStar-Math significantly outperforms Best-of-N baselines, exceeding o1-preview and approaching o1-mini, demonstrating its effectiveness. (2) Scaling test-time compute improves reasoning accuracy across all benchmarks, though with varying trends. On Math, AIME, and Olympiad Bench, rStar-Math shows saturation or slow improvement at 64 trajectories, while on College Math, performance continues to improve steadily.

扩展测试时的计算量。rStar-Math 使用 MCTS 来增强策略模型,通过 PPM 引导搜索解决方案。通过增加测试时的计算量,它探索了更多的轨迹,从而可能提高性能。在图 3 中,我们通过在四个具有挑战性的数学基准上比较官方 Qwen Best-of-N 在不同采样轨迹数量下的准确性,展示了测试时计算量扩展的影响。仅采样一个轨迹对应于策略大语言模型的 Pass $@1$ 准确性,表明回退到系统 1 推理。我们强调两个关键观察结果:(1) 仅使用 4 个轨迹,rStar-Math 显著优于 Best-of-N 基线,超过了 o1-preview 并接近 o1-mini,展示了其有效性。(2) 扩展测试时的计算量提高了所有基准的推理准确性,尽管趋势有所不同。在 Math、AIME 和 Olympiad Bench 上,rStar-Math 在 64 个轨迹时显示出饱和或缓慢改进,而在 College Math 上,性能继续稳步提高。


Figure 3: Reasoning performance under scaling up the test-time compute.

图 3: 测试时计算资源扩展下的推理性能。

4.3 Ablation Study and Analysis

4.3 消融研究与分析

We ablate the effectiveness of our three innovations. For System 2-style inference, Pass@1 accuracy is measured with 16 trajectories for AIME and AMC, and 8 for other benchmarks.

我们消融了三种创新的有效性。对于 System 2 风格的推理,Pass@1 准确率通过 16 条轨迹测量 AIME 和 AMC,其他基准则为 8 条轨迹。

Table 6: The continuously improved math reasoning capabilities through rStar-Math self-evolved deep thinking. Starting from round 2, the 7B base model powered by rStar-Math surpasses GPT-4o.

表 6: 通过 rStar-Math 自我进化的深度思考持续提升的数学推理能力。从第 2 轮开始,由 rStar-Math 驱动的 7B 基础模型超越了 GPT-4o。

Round# MATH AIME2024 AMC2023 OlympiadBench CollegeMath GSM8K GaokaoEn2023
GPT-40 76.6 9.3 47.5 43.3 48.5 92.9 67.5
Base7Bmodel 58.8 0.0 22.5 21.8 41.6 91.6 51.7
rStar-MathRound1 75.2 10.0 57.5 35.7 45.4 90.9 60.3
rStar-MathRound2 86.6 43.3 75.0 59.4 55.6 94.0 76.4
rStar-MathRound3 87.0 46.7 80.0 61.6 56.5 94.2 77.1
rStar-MathRound4 89.4 50.0 87.5 65.3 59.0 95.0 80.5

The effectiveness of self-evolution. The impressive results in Table 5 are achieved after 4 rounds of rStar-Math self-evolved deep thinking. Table 6 shows the math reasoning performance in each round, demonstrating a continuous improvement in accuracy. In round 1, the main improvement comes from applying SFT to the base model. Round 2 brings a significant boost with the application of a stronger PPM in MCTS, which unlocks the full potential of System 2 deep reasoning. Notably, starting from round 2, rStar-Math outperforms GPT-4o. Rounds 3 and 4 show further improvements, driven by stronger System 2 reasoning through better policy SLMs and PPMs.

自我进化的有效性。表5中令人印象深刻的结果是在rStar-Math经过4轮自我进化的深度思考后实现的。表6展示了每一轮的数学推理性能,显示出准确率的持续提升。在第一轮中,主要的改进来自于对基础模型应用SFT(监督微调)。第二轮通过应用更强的PPM(策略预测模型)在MCTS(蒙特卡洛树搜索)中带来了显著提升,这释放了系统2深度推理的全部潜力。值得注意的是,从第二轮开始,rStar-Math的表现超过了GPT-4o。第三轮和第四轮通过更好的策略SLM(小语言模型)和PPM进一步提升了系统2的推理能力,带来了更多的改进。

The effectiveness of step-by-step verified reasoning trajectory. rStar-Math generates step-by-step verified reasoning trajectories, which eliminate error intermediate steps and further expand training set with more challenging problems. To evaluate its effectiveness, we use the data generated from round 4 as SFT training data and compare it against three strong baselines: (i) GPT-distillation, which includes open-sourced CoT solutions synthesized using GPT-4, such as MetaMath [Yu et al., 2023b], NuminaMath-CoT [Jia LI and Polu, 2024b]; (ii) Random sampling from self-generation, which use the same policy model (i.e., policy SLM-r3) to randomly generate trajectories; (iii) Rejection sampling, where 32 trajectories are randomly sampled from the policy model, with high-quality solutions ranked by our trained ORM (appendix A.1). For fairness, we select two correct trajectories for each math problem in baseline (ii) and (iii). All SFT experiments use the same training recipe.

逐步验证推理轨迹的有效性。rStar-Math 生成逐步验证的推理轨迹,消除了错误的中间步骤,并进一步扩展了训练集,增加了更具挑战性的问题。为了评估其有效性,我们使用第四轮生成的数据作为 SFT 训练数据,并将其与三个强基线进行比较:(i) GPT 蒸馏,包括使用 GPT-4 合成的开源 CoT 解决方案,如 MetaMath [Yu et al., 2023b]、NuminaMath-CoT [Jia LI and Polu, 2024b];(ii) 从自我生成中随机采样,使用相同的策略模型(即策略 SLM-r3)随机生成轨迹;(iii) 拒绝采样,从策略模型中随机采样 32 条轨迹,并由我们训练的 ORM 对高质量解决方案进行排序(附录 A.1)。为了公平起见,我们在基线 (ii) 和 (iii) 中为每个数学问题选择两条正确的轨迹。所有 SFT 实验使用相同的训练方案。

Table 7 shows the math reasoning accuracy of Qwen2.5-Math-7B fine-tuned on different datasets. We highlight two observations: (i) Fine-tuning with our step-by-step verified trajectories significantly outperforms all other baselines. This is primarily due to our PPM-augmented MCTS for code-augmented CoT synthesis, which provides denser verification during math solution generation. It proves more effective than both random sampling, which lacks verification, and rejection sampling, where ORM provides only sparse verification. (ii) Even randomly sampled code-augmented CoT solutions from our SLM yields comparable or better performance than GPT-4 synthesized NuminaMath and MetaMath datasets. This indicates that our policy SLMs, after rounds of self-evolution, can generate high-quality math solutions. These results demonstrates the huge potential of our method to self-generate higher-quality reasoning data without relying on advanced LLM distillation.

表 7 展示了 Qwen2.5-Math-7B 在不同数据集上微调后的数学推理准确率。我们强调两点观察:(i) 使用我们逐