[论文翻译]1B大语言模型能否超越405B大语言模型?重新思考计算最优的测试时扩展


原文地址:https://arxiv.org/pdf/2502.06703


Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

1B大语言模型能否超越405B大语言模型?重新思考计算最优的测试时扩展

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs. Our website is available at https://ryanliu112.github.io/compute-optimal-tts.

测试时扩展 (TTS) 是一种通过在推理阶段使用额外计算来提高大语言模型 (LLMs) 性能的重要方法。然而,当前的研究并未系统分析策略模型、过程奖励模型 (PRMs) 和问题难度如何影响 TTS。这种分析的缺乏限制了对 TTS 方法的理解与实际应用。在本文中,我们聚焦于两个核心问题:(1) 在不同的策略模型、PRMs 和问题难度下,扩展测试时计算的最优方法是什么?(2) 扩展计算能在多大程度上提高 LLMs 在复杂任务上的表现,以及较小的语言模型是否可以通过这种方法超越较大的模型?通过在 MATH-500 和具有挑战性的 AIME24 任务上的全面实验,我们得出了以下观察结果:(1) 计算最优的 TTS 策略高度依赖于策略模型、PRM 和问题难度的选择。(2) 在我们的计算最优 TTS 策略下,极小的策略模型可以超越较大的模型。例如,1B 的 LLM 在 MATH-500 上可以超越 405B 的 LLM。此外,在 MATH-500 和 AIME24 上,0.5B 的 LLM 超越了 GPT-4o,3B 的 LLM 超越了 405B 的 LLM,而 7B 的 LLM 击败了 o1 和 DeepSeek-R1,同时具有更高的推理效率。这些发现表明,针对每个任务和模型的特定特征调整 TTS 策略的重要性,并表明 TTS 是增强 LLMs 推理能力的一种有前途的方法。我们的网站位于 https://ryanliu112.github.io/compute-optimal-tts

Figure 1: Comparison between the performance of smaller LLMs compute-optimal TTS and that of larger LLMs CoT on MATH-500 and AIME24. (a) & (d) Llama-3.2-3B-Instruct surpasses Llama- 3.1-405B-Instruct and GPT-4o on MATH-500 and AIME24; (b) & (e) DeepSeek-R1-Distill-1.5B outperforms o1-preview on MATH-500 and AIME24, and surpasses o1-mini on MATH-500; (c) & (f) DeepSeek-R1-Distill-7B beats o1 on MATH-500 and AIME24, and exceeds DeepSeek-R1 on AIME24.

图 1: 较小的大语言模型计算最优 TTS 与较大的大语言模型 CoT 在 MATH-500 和 AIME24 上的性能对比。(a) & (d) Llama-3.2-3B-Instruct 在 MATH-500 和 AIME24 上超越了 Llama-3.1-405B-Instruct 和 GPT-4o;(b) & (e) DeepSeek-R1-Distill-1.5B 在 MATH-500 和 AIME24 上优于 o1-preview,并在 MATH-500 上超越了 o1-mini;(c) & (f) DeepSeek-R1-Distill-7B 在 MATH-500 和 AIME24 上击败了 o1,并在 AIME24 上超过了 DeepSeek-R1。

1. Introduction

1. 引言

Large Language Models (LLMs) have shown significant improvements across a variety of domains (OpenAI, 2023; Hurst et al., 2024; Anthropic, 2023; OpenAI, 2024; DeepSeek-AI et al., 2025). Recently, OpenAI o1 (OpenAI, 2024) has demonstrated that Test-Time Scaling (TTS) can enhance the reasoning capabilities of LLMs by allocating additional computation at inference time, making it an effective approach for improving LLM performance (Qwen Team, 2024; Kimi Team et al., 2025; DeepSeek-AI et al., 2025).

大语言模型 (LLMs) 在各种领域都显示出显著的改进 (OpenAI, 2023; Hurst et al., 2024; Anthropic, 2023; OpenAI, 2024; DeepSeek-AI et al., 2025)。最近,OpenAI o1 (OpenAI, 2024) 展示了测试时扩展 (Test-Time Scaling, TTS) 可以通过在推理时分配额外的计算来增强大语言模型的推理能力,使其成为提升大语言模型性能的有效方法 (Qwen Team, 2024; Kimi Team et al., 2025; DeepSeek-AI et al., 2025)。

TTS approaches can be divided into two main categories: (1) Internal TTS, which trains the LLMs to “think” slowly with long Chain-of-Thought (CoT) (OpenAI, 2024; DeepSeek-AI et al., 2025), and (2) External TTS, which improves the reasoning performance via sampling or search-based methods with fixed LLMs (Wu et al., 2024; Snell et al., 2024). The key challenge of external TTS is how to scale compute optimally, that is, allocating the optimal computation for each problem (Snell et al., 2024). Current TTS methods guide the generation process and select the final answer using Process Reward Models (PRMs), which effectively scale test-time compute (Wu et al., 2024; Snell et al., 2024; Beeching et al., 2024). These TTS methods involve several important factors, such as policy models, PRMs, and problem difficulty levels. However, there is limited systematic analysis of how policy models, PRMs, and problem difficulty influence these TTS strategies. This limitation prevents the community from fully understanding the effectiveness of this method and developing insights for compute-optimal TTS strategies.

TTS 方法可分为两大类:(1) 内部 TTS (Internal TTS),通过延长思维链 (Chain-of-Thought, CoT) 使大语言模型“缓慢思考”(OpenAI, 2024; DeepSeek-AI et al., 2025);(2) 外部 TTS (External TTS),在保持大语言模型固定的前提下,通过采样或基于搜索的方法提升推理性能 (Wu et al., 2024; Snell et al., 2024)。外部 TTS 的关键挑战在于如何以最优方式扩展计算,即为每个问题分配最佳计算资源 (Snell et al., 2024)。当前的 TTS 方法通过过程奖励模型 (Process Reward Models, PRMs) 指导生成过程并选择最终答案,从而有效扩展测试时计算 (Wu et al., 2024; Snell et al., 2024; Beeching et al., 2024)。这些 TTS 方法涉及多个重要因素,例如策略模型 (policy models)、PRM 以及问题难度级别。然而,关于策略模型、PRM 和问题难度如何影响这些 TTS 策略的系统分析仍然有限。这一限制使得社区难以全面理解该方法的有效性,也难以形成关于计算最优 TTS 策略的洞见。

To address these issues, this paper aims to investigate the influence of policy models, PRMs, and problem difficulty on TTS through comprehensive experimental analysis. Furthermore, we explore the concrete characteristics and performance boundaries of TTS methods. Specifically, we conduct extensive experiments on MATH-500 (Lightman et al., 2024) and the challenging AIME24 (AI-MO, 2024) tasks using a range of PRMs (spanning from 1.5B to 72B across different model series) across multiple policy models (ranging from 0.5B to 72B across two model families). Our results show that the compute-optimal TTS strategy heavily depends on the specific policy model, PRM, and problem difficulty level. Even smaller models (e.g., a 1B model) can outperform larger models (e.g., a 405B model) and even state-of-the-art reasoning models, such as o1 or DeepSeek-R1, in challenging reasoning tasks by applying compute-optimal TTS.

为了解决这些问题,本文旨在通过全面的实验分析,研究策略模型、PRM 和问题难度对 TTS 的影响。此外,我们还探讨了 TTS 方法的具体特征和性能边界。具体来说,我们在 MATH-500 (Lightman et al., 2024) 和具有挑战性的 AIME24 (AI-MO, 2024) 任务上进行了广泛的实验,使用了一系列 PRM(跨不同模型系列,参数规模从 1.5B 到 72B)和多个策略模型(跨两个模型家族,参数规模从 0.5B 到 72B)。我们的结果表明,计算最优的 TTS 策略高度依赖于特定的策略模型、PRM 和问题难度。在具有挑战性的推理任务中,即使是较小的模型(例如 1B 模型)通过应用计算最优的 TTS,也可以超越较大的模型(例如 405B 模型),甚至超越 o1 或 DeepSeek-R1 等最先进的推理模型。

The contributions of this work can be summarized as follows:

本工作的贡献可总结如下:


Figure 2: Comparison of different external TTS methods.

图 2: 不同外部 TTS 方法的比较。

2. Setup & Preliminaries

2. 设置与预备知识

2.1. Problem Formulation

2.1. 问题定义

We formulate the reasoning problem as a Markov Decision Process (MDP) (Sutton and Barto, 2018), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the transition function, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor. Given a prompt $x \sim \mathcal{X}$, the policy with parameters $\theta$ generates the initial action $a_1 \sim \pi_\theta(\cdot \mid s_1)$, where $s_1 = x$ is the initial state. The policy receives a reward $\mathcal{R}(s_1, a_1)$, and the state transitions to $s_2 = [s_1, a_1]$, where $[\cdot, \cdot]$ denotes the concatenation of two strings. This process continues until the episode terminates, either by reaching the maximum number of steps or by generating an `<EOS>` token. A trajectory of length $H$ is represented as $\tau = \{a_1, a_2, \cdots, a_H\}$. The process can be summarized as follows:

我们将推理问题形式化为一个马尔可夫决策过程 (Markov Decision Process, MDP) (Sutton and Barto, 2018),定义为元组 $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$,其中 $\mathcal{S}$ 是状态空间,$\mathcal{A}$ 是动作空间,$\mathcal{P}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ 是转移函数,$\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ 是奖励函数,$\gamma \in [0, 1]$ 是折扣因子。给定一个提示 $x \sim \mathcal{X}$,参数为 $\theta$ 的策略生成初始动作 $a_1 \sim \pi_\theta(\cdot \mid s_1)$,其中 $s_1 = x$ 是初始状态。策略接收到奖励 $\mathcal{R}(s_1, a_1)$,状态转移到 $s_2 = [s_1, a_1]$,其中 $[\cdot, \cdot]$ 表示两个字符串的拼接。这个过程持续进行,直到达到最大步数或生成 `<EOS>` Token 为止。长度为 $H$ 的轨迹表示为 $\tau = \{a_1, a_2, \cdots, a_H\}$。整个过程可以总结如下:

$$
s_1 = x, \quad a_t \sim \pi_\theta(\cdot \mid s_t), \quad s_{t+1} = [s_t, a_t], \quad t = 1, \dots, H.
$$
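下面给出一个仅作示意的 Python 草图,按上述 MDP 形式逐步生成一条轨迹并收集逐步奖励;其中 `policy_generate_step` 与 `reward_model` 均为假设的接口,并非论文或某个具体框架的真实 API。

```python
# 按上文 MDP 形式进行逐步生成的示意代码(非官方实现)。
# policy_generate_step: 给定当前状态(字符串)生成下一个步骤 a_t
# reward_model: 对 (s_t, a_t) 给出标量奖励,对应文中的奖励函数 R
from typing import Callable, List, Tuple

EOS = "<EOS>"

def rollout(
    prompt: str,
    policy_generate_step: Callable[[str], str],
    reward_model: Callable[[str, str], float],
    max_steps: int = 40,
) -> Tuple[List[str], List[float]]:
    state = prompt                                   # s_1 = x
    trajectory: List[str] = []
    rewards: List[float] = []
    for _ in range(max_steps):
        action = policy_generate_step(state)         # a_t ~ pi_theta(. | s_t)
        rewards.append(reward_model(state, action))  # R(s_t, a_t)
        trajectory.append(action)
        state = state + action                       # s_{t+1} = [s_t, a_t]
        if EOS in action:                            # 生成 <EOS> 则终止
            break
    return trajectory, rewards
```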

2.2. Test-Time Scaling Method

2.2. 测试时缩放方法 (Test-Time Scaling Method)

We consider three TTS methods: Best-of-N (BoN) (Brown et al., 2024), beam search (Snell et al., 2024), and Diverse Verifier Tree Search (DVTS) (Beeching et al., 2024). As pointed out by Snell et al. (2024), lookahead search is inefficient due to multi-step sampling, so we do not evaluate it or other methods involving lookahead operations, such as Monte Carlo Tree Search (MCTS). The TTS methods are shown in Figure 2.

我们考虑三种TTS方法:Best-of-N (BoN) (Brown et al., 2024)、beam search (Snell et al., 2024) 和 Diverse Verifier Tree Search (DVTS) (Beeching et al., 2024)。正如 Snell et al. (2024) 所指出的,由于多步采样的效率低下,我们不对lookahead search或其他涉及lookahead操作的方法进行评估,例如蒙特卡洛树搜索 (MCTS)。这些TTS方法如图2所示。

Best-of-N. In the BoN approach, the policy model generates N responses, after which scoring and voting methods are applied to select the final answer.

最佳N选1 (Best-of-N, BoN)。在BoN方法中,策略模型生成N个响应,然后通过评分和投票方法选择最终答案。
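作为补充,下面给出 BoN 的最小 Python 示意实现;`generate_response` 与 `score_response` 为假设接口(分别对应策略模型采样一条完整回答、PRM 给回答打分),这里用取最高分的方式选择最终答案,对应后文的 PRM-Max。

```python
# Best-of-N 示意:采样 N 条完整回答,再按 PRM 分数选出最终答案。
# generate_response / score_response 为假设接口,并非任何真实库的 API。
from typing import Callable, List

def best_of_n(
    prompt: str,
    n: int,
    generate_response: Callable[[str], str],
    score_response: Callable[[str, str], float],
) -> str:
    candidates: List[str] = [generate_response(prompt) for _ in range(n)]
    scores = [score_response(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])  # PRM-Max 式选择
    return candidates[best_idx]
```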

Beam Search. Given beam width $N$ and beam size $M$, the policy model first generates $N$ steps. The verifier selects the top $N/M$ steps for subsequent search. In the next step, the policy model samples $M$ steps for each selected previous step. This process repeats until the maximum depth is reached or an <EOS> token is generated.

束搜索 (Beam Search)。给定束宽 $N$ 和 beam size $M$,策略模型首先生成 $N$ 个步骤,验证器从中选出得分最高的 $N/M$ 个步骤用于后续搜索。在下一步中,策略模型为每个被选中的前序步骤再采样 $M$ 个步骤。此过程不断重复,直到达到最大深度或生成 <EOS> Token。
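按照上述描述,PRM 引导的束搜索可以写成如下示意代码;`generate_step` 与 `score_prefix` 为假设接口(分别表示在给定前缀下采样一个步骤、PRM 对部分解打分),终止条件等细节做了简化。

```python
# PRM 引导的束搜索示意:每层保留得分最高的 N/M 个前缀,
# 再对每个前缀扩展 M 个新步骤,直到达到最大深度或全部生成 <EOS>。
# generate_step / score_prefix 为假设接口。
from typing import Callable, List

EOS = "<EOS>"

def beam_search(
    prompt: str,
    n: int,   # 束宽 N:每层候选总数
    m: int,   # beam size M:每个前缀的扩展数
    generate_step: Callable[[str], str],
    score_prefix: Callable[[str], float],
    max_depth: int = 40,
) -> str:
    # 第一层:先从提示采样 N 个步骤
    beams: List[str] = [prompt + generate_step(prompt) for _ in range(n)]
    for _ in range(max_depth - 1):
        beams.sort(key=score_prefix, reverse=True)
        kept = beams[: max(1, n // m)]        # 验证器选出 N/M 个前缀
        if all(EOS in b for b in kept):
            break
        beams = [                             # 每个前缀再扩展 M 个步骤
            b if EOS in b else b + generate_step(b)
            for b in kept
            for _ in range(m)
        ]
    return max(beams, key=score_prefix)
```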

Diverse Verifier Tree Search. To increase diversity, DVTS extends beam search by dividing the search process into $N/M$ subtrees, each of which is explored independently using beam search. As shown in Beeching et al. (2024), DVTS outperforms beam search on easy and medium problems with a large computational budget $N$. A similar trend is observed in Chen et al. (2024), where increasing the number of parallel subtrees proves to be more effective than increasing the beam width under the same budget.

多样化验证器树搜索 (Diverse Verifier Tree Search, DVTS)。为了增加多样性,DVTS 将搜索过程划分为 $N/M$ 棵子树来扩展束搜索,每棵子树使用束搜索独立探索。如 Beeching 等人 (2024) 所示,在计算预算 $N$ 较大的情况下,DVTS 在简单和中等难度问题上优于束搜索。Chen 等人 (2024) 也观察到了类似的趋势:在相同预算下,增加并行子树的数量比增加束宽更为有效。
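DVTS 的划分逻辑可以用下面的示意代码表达:把总预算 N 划分为 N/M 棵独立子树,每棵子树内逐层采样 M 个候选步骤并只保留 PRM 得分最高者(即贪心扩展),最后在所有子树的完整解中取分数最高的一个。`generate_step` 与 `score_prefix` 仍为假设接口,子树内部的展开方式是对原方法的简化假设。

```python
# DVTS 示意:预算 N 划分为 N/M 棵独立子树,子树内逐层保留最优步骤。
# generate_step / score_prefix 为假设接口。
from typing import Callable

EOS = "<EOS>"

def dvts(
    prompt: str,
    n: int,
    m: int,
    generate_step: Callable[[str], str],
    score_prefix: Callable[[str], float],
    max_depth: int = 40,
) -> str:
    def explore_subtree(root: str) -> str:
        prefix = root
        for _ in range(max_depth):
            if EOS in prefix:
                break
            # 每层扩展 M 个候选步骤,保留 PRM 得分最高的一个
            candidates = [prefix + generate_step(prefix) for _ in range(m)]
            prefix = max(candidates, key=score_prefix)
        return prefix

    num_subtrees = max(1, n // m)
    results = [explore_subtree(prompt) for _ in range(num_subtrees)]
    return max(results, key=score_prefix)
```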

2.3. Compute-Optimal Test-Time Scaling

2.3. 计算最优测试时间扩展

To maximize the performance of TTS, Snell et al. (2024) propose a test-time compute-optimal scaling strategy, which selects hyperparameters corresponding to a given test-time strategy to maximize performance benefits on a specific prompt. Given a prompt $x$, let $\text{Target}(\theta, N, x)$ represent the output distribution over $x$ produced by the policy model with parameters $\theta$ and a compute budget of $N$.

为了最大化 TTS 的性能,Snell 等人 (2024) 提出了一种测试时计算最优的扩展策略,即针对特定提示选择与给定测试时策略相对应的超参数,以最大化性能收益。给定提示 $x$,令 $\text{Target}(\theta, N, x)$ 表示参数为 $\theta$ 的策略模型在计算预算 $N$ 下针对 $x$ 产生的输出分布。

$$
\theta^{*}_{x, y^{*}(x)}(N) = \arg\max_{\theta} \Big( \mathbb{E}_{y \sim \text{Target}(\theta, N, x)} \big[ \mathbb{1}_{y = y^{*}(x)} \big] \Big),
$$

where $y^{*}(x)$ denotes the ground-truth correct response for $x$, and $\theta^{*}_{x, y^{*}(x)}(N)$ represents the test-time compute-optimal scaling strategy for the problem $x$ with compute budget $N$.

其中 $y^{*}(x)$ 表示 $x$ 的真实正确答案,$\theta^{*}_{x, y^{*}(x)}(N)$ 表示在计算预算 $N$ 下针对问题 $x$ 的测试时计算最优扩展策略。

3. Rethinking Compute-Optimal Test-Time Scaling

3. 重新思考计算最优的测试时扩展

3.1. Compute-Optimal Scaling Strategy Should be Reward-Aware

3.1 计算最优扩展策略应具备奖励意识

Compute-optimal TTS aims to allocate the optimal compute for each problem (Snell et al., 2024). Previous works on TTS use a single PRM as verifier (Snell et al., 2024; Wu et al., 2024; Beeching et al., 2024). Snell et al. (2024) trains a PRM on the responses of a policy model and uses it as the verifier to do TTS with the same policy model, while Wu et al. (2024); Beeching et al. (2024) use a PRM trained on a different policy model to do TTS. From the perspective of Reinforcement Learning (RL), we obtain an on-policy PRM in the former case and an offline PRM in the latter case. The on-policy PRM produces more accurate rewards for the responses of the policy model, while the offline PRM often generates inaccurate rewards due to out-of-distribution (OOD) issues (Snell et al., 2024; Zheng et al., 2024).

计算最优 TTS 旨在为每个问题分配最优计算资源 (Snell et al., 2024)。以往的 TTS 工作使用单个 PRM 作为验证器 (Snell et al., 2024; Wu et al., 2024; Beeching et al., 2024)。Snell et al. (2024) 在一个策略模型的响应上训练 PRM 并将其作为验证器,用同一个策略模型执行 TTS,而 Wu et al. (2024); Beeching et al. (2024) 使用在不同策略模型上训练的 PRM 来执行 TTS。从强化学习 (Reinforcement Learning, RL) 的角度来看,前一种情况下我们获得了一个在线策略 PRM,后一种情况下获得了一个离线策略 PRM。在线策略 PRM 为策略模型的响应生成更准确的奖励,而离线策略 PRM 由于分布外 (Out-of-Distribution, OOD) 问题,通常会产生不准确的奖励 (Snell et al., 2024; Zheng et al., 2024)。

For practical applications of compute-optimal TTS, training a PRM for each policy model to prevent OOD issues is computationally expensive. Therefore, we investigate the compute-optimal TTS strategy in a more general setting, where the PRM might be trained on a different policy model than the one used for TTS. For search-based methods, PRMs guide the selection at each response step, while for sampling-based methods, PRMs evaluate the responses after generation. This indicates that (1) the reward influences response selection across all methods; (2) for search-based methods, the reward also influences the search process.

对于计算最优的TTS实际应用,为每个策略模型训练一个PRM以防止OOD问题在计算上是非常昂贵的。因此,我们研究了计算最优的TTS策略,在更一般的设置中,PRM可能在与用于TTS的策略模型不同的策略模型上进行训练。对于基于搜索的方法,PRM在每一步响应选择中指导选择,而对于基于采样的方法,PRM在生成后评估响应。这表明 (1) 奖励影响所有方法的响应选择; (2) 对于基于搜索的方法,奖励也影响搜索过程。

To analyze these points, we perform a preliminary case study using beam search with Llama-3.1-8B-Instruct as the policy model and RLHFlow-PRM-Mistral-8B and RLHFlow-PRM-Deepseek-8B as PRMs. The results in Figure 12 demonstrate that the reward significantly affects the generation process and outcomes. RLHFlow-PRM-Mistral-8B assigns high rewards to short responses, leading to incorrect answers, while searching with RLHFlow-PRM-Deepseek-8B produces correct answers but uses more tokens. In Section 4, we also empirically show that rewards have great influence on TTS performance and output tokens.

为了分析这些问题,我们以 Llama-3.1-8B-Instruct 作为策略模型,以 RLHFlow-PRM-Mistral-8B 和 RLHFlow-PRM-Deepseek-8B 作为 PRM,使用束搜索进行了初步的案例研究。图 12 中的结果表明,奖励显著影响生成过程和最终结果。RLHFlow-PRM-Mistral-8B 对简短的回答给予高奖励,导致错误答案;而使用 RLHFlow-PRM-Deepseek-8B 进行搜索则能得到正确答案,但消耗了更多 Token。在第 4 节中,我们还通过实验证明奖励对 TTS 性能和输出 Token 数量有重大影响。


Figure 3: Distribution of Pass@1 accuracy of Qwen2.5-72B-Instruct on MATH-500, divided into five bins.

图 3: Qwen2.5-72B-Instruct 在 MATH-500 上的 Pass@1 准确率分布,分为五个区间。

Based on these findings, we propose that rewards should be integrated into the compute-optimal TTS strategy. Let us denote the reward function as R . Our reward-aware compute-optimal TTS strategy is formulated as:

基于这些发现,我们提出应将奖励整合到计算最优的 TTS 策略中。我们将奖励函数表示为 R 。我们的奖励感知计算最优 TTS 策略公式如下:

$$
\theta^{*}_{x, y^{*}(x)}(N, \mathcal{R}) = \arg\max_{\theta} \Big( \mathbb{E}_{y \sim \text{Target}(\theta, N, x, \mathcal{R})} \big[ \mathbb{1}_{y = y^{*}(x)} \big] \Big),
$$

where $\text{Target}(\theta, N, x, \mathcal{R})$ represents the output distribution of the policy model $\theta$, adjusted by the reward function $\mathcal{R}$, under a compute budget $N$ and prompt $x$. For sampling-based scaling methods, $\text{Target}(\theta, N, x, \mathcal{R}) = \text{Target}(\theta, N, x)$. This reward-aware strategy ensures that compute-optimal scaling adapts to the policy model, prompt, and reward function, leading to a more general framework for practical TTS.

其中 $\text{Target}(\theta, N, x, \mathcal{R})$ 表示策略模型 $\theta$ 在计算预算 $N$ 和提示 $x$ 下、经奖励函数 $\mathcal{R}$ 调整后的输出分布。对于基于采样的扩展方法,$\text{Target}(\theta, N, x, \mathcal{R}) = \text{Target}(\theta, N, x)$。这种奖励感知策略确保计算最优扩展能够适应策略模型、提示和奖励函数,从而为实际的 TTS 提供一个更通用的框架。
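在实践中,奖励感知的计算最优 TTS 可以理解为:在给定预算 N 下,针对每个(策略模型、PRM、TTS 方法、难度)组合,用验证集准确率近似上式中的期望指示函数,再取收益最高的配置。下面是一个仅作示意的选择器草图,候选配置的组织方式与 `estimate_accuracy` 接口均为假设。

```python
# 奖励感知计算最优 TTS 的示意选择器:对候选 (方法, PRM, 预算) 组合取验证集准确率的 argmax。
# estimate_accuracy 为假设接口:返回某配置在给定难度桶下的验证集准确率估计。
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class TTSConfig:
    method: str   # "BoN" / "beam_search" / "DVTS"
    prm: str      # 所用 PRM 的名称
    budget: int   # 计算预算 N

def select_compute_optimal(
    configs: Iterable[TTSConfig],
    difficulty: str,  # "easy" / "medium" / "hard"
    estimate_accuracy: Callable[[TTSConfig, str], float],
) -> TTSConfig:
    return max(configs, key=lambda c: estimate_accuracy(c, difficulty))
```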

3.2. Absolute Problem Difficulty Criterion is More Effective Than Quantiles

3.2. 绝对问题难度标准比分位数更有效

To consider the influence of problem difficulty on TTS, Snell et al. (2024) group problems into five difficulty levels based on Pass@1 accuracy quantiles. However, we find that using difficulty levels from MATH (Hendrycks et al., 2021) or oracle labels based on Pass@1 accuracy quantiles (Snell et al., 2024) is not effective, since different policy models have different reasoning capabilities. As shown in Figure 3, Qwen2.5-72B-Instruct achieves Pass@1 accuracy above 80% on 76.2% of MATH-500 problems. Therefore, we use absolute thresholds instead of quantiles to measure problem difficulty. Specifically, we define three difficulty levels based on Pass@1 accuracy: easy (50%-100%), medium (10%-50%), and hard (0%-10%).

为了考虑问题难度对 TTS 的影响,Snell 等人 (2024) 根据 Pass@1 准确率的分位数将问题分为五个难度等级。然而,我们发现使用 MATH (Hendrycks 等人, 2021) 中定义的难度等级或基于 Pass@1 准确率分位数的 oracle 标签 (Snell 等人, 2024) 并不有效,因为不同的策略模型具有不同的推理能力。如图 3 所示,Qwen2.5-72B-Instruct 在 76.2% 的 MATH-500 问题上达到了超过 80% 的 Pass@1 准确率。因此,我们使用绝对阈值而非分位数来衡量问题难度。具体来说,我们根据 Pass@1 准确率定义了三个难度等级:简单 (50%-100%)、中等 (10%-50%) 和困难 (0%-10%)。
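上述按绝对阈值划分难度的规则可以写成一个小函数,以下为示意(阈值 50% 与 10% 取自正文,边界处的取法是本文未明确说明的假设)。

```python
# 按 Pass@1 准确率的绝对值划分难度等级(阈值来自正文:50% 与 10%)。
def difficulty_bin(pass_at_1: float) -> str:
    """pass_at_1 的取值范围为 [0, 1]。"""
    if pass_at_1 >= 0.5:
        return "easy"    # 简单:50%-100%
    if pass_at_1 >= 0.1:
        return "medium"  # 中等:10%-50%
    return "hard"        # 困难:0%-10%
```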

4. How to Scale Test-Time Compute Optimally?

4. 如何优化测试时计算资源的扩展?

In this section, we aim to answer the following questions:

在本节中,我们旨在回答以下问题:

• Q1: How does TTS improve with different policy models and PRMs?
• Q2: How does TTS improve for problems with different difficulty levels?
• Q3: Do PRMs have bias towards specific response lengths or sensitivity to voting methods?

• Q1: 不同策略模型和PRMs如何改进TTS?
• Q2: 针对不同难度的问题,TTS如何改进?
• Q3: PRMs是否对特定响应长度有偏见或对投票方法敏感?

4.1. Setup

4.1. 设置

Datasets. We conduct experiments on competition-level mathematical datasets, including MATH-500 (Lightman et al., 2024) and AIME24 (AI-MO, 2024). MATH-500 contains 500 representative problems from the test set of MATH (Hendrycks et al., 2021), and this subset is used following Snell et al. (2024); Beeching et al. (2024). As recent LLMs show significant progress in mathematical reasoning (OpenAI, 2024; DeepSeek-AI et al., 2025), we include the more challenging AIME24 for experiments.

数据集。我们在竞赛级别的数学数据集上进行了实验,包括 MATH-500 (Lightman et al., 2024) 和 AIME24 (AI-MO, 2024)。MATH-500 包含了 MATH (Hendrycks et al., 2021) 测试集中的 500 个代表性题目,该子集的使用遵循了 Snell et al. (2024) 和 Beeching et al. (2024)。由于最近的大语言模型在数学推理方面取得了显著进展 (OpenAI, 2024; DeepSeek-AI et al., 2025),我们引入了更具挑战性的 AIME24 进行实验。

Policy Models. For test-time methods, we use policy models from Llama 3 (Dubey et al., 2024) and Qwen2.5 (Yang et al., 2024b) families with different sizes. We use the Instruct version for all policy models.

策略模型。对于测试时方法,我们使用来自 Llama 3 (Dubey et al., 2024) 和 Qwen2.5 (Yang et al., 2024b) 系列的不同规模的策略模型。所有策略模型均使用 Instruct 版本。

Process Reward Models. We consider the following open-source PRMs for evaluation:

过程奖励模型 (Process Reward Models)。我们考虑以下开源 PRM 进行评估:

Scoring and Voting Methods. Following Wang et al. (2024a), we consider three scoring methods: PRM-Min, PRM-Last, and PRM-Avg, and three voting methods: Majority Vote, PRM-Max, and PRM-Vote. To obtain the final answer, we first use the scoring methods to evaluate the answers. For a trajectory of length $H$: (1) PRM-Min scores each trajectory by the minimum reward among all steps, i.e., $\text{score} = \min_{t} \{\mathcal{R}_t\}_{t=1}^{H}$. (2) PRM-Last scores each trajectory by the reward of the last step, i.e., $\text{score} = \mathcal{R}_H$. (3) PRM-Avg scores each trajectory by the average reward among all steps, i.e., $\text{score} = \frac{1}{H} \sum_{t=1}^{H} \mathcal{R}_t$. The voting methods then aggregate the scores to determine the final answer. Majority Vote selects the answer with the majority of votes (Wang et al., 2023), while PRM-Max selects the answer with the highest score, and PRM-Vote first accumulates the scores of all identical answers and then selects the answer with the highest score.

评分和投票方法。根据 Wang 等人 (2024a),我们考虑三种评分方法:PRM-Min、PRM-Last 和 PRM-Avg,以及三种投票方法:多数投票 (Majority Vote)、PRM-Max 和 PRM-Vote。为了获得最终答案,我们首先使用评分方法对候选答案进行评估。对于长度为 $H$ 的轨迹:(1) PRM-Min 以所有步骤中的最小奖励为每条轨迹打分,即 $\text{score} = \min_{t} \{\mathcal{R}_t\}_{t=1}^{H}$;(2) PRM-Last 以最后一步的奖励打分,即 $\text{score} = \mathcal{R}_H$;(3) PRM-Avg 以所有步骤的平均奖励打分,即 $\text{score} = \frac{1}{H} \sum_{t=1}^{H} \mathcal{R}_t$。投票方法随后汇总这些分数以确定最终答案。多数投票选择获得最多票数的答案 (Wang 等人, 2023),PRM-Max 选择得分最高的答案,而 PRM-Vote 先累加所有相同答案的分数,再选择总分最高的答案。
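上述评分与投票方法可以直接写成如下示意代码;各候选的逐步奖励 `step_rewards` 与最终答案 `answers` 假定已由策略模型和 PRM 给出。

```python
# PRM 打分与投票方法的示意实现:先把逐步奖励聚合为单个分数,再在候选间投票。
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

def score_trajectory(step_rewards: Sequence[float], how: str = "last") -> float:
    if how == "min":
        return min(step_rewards)                      # PRM-Min
    if how == "avg":
        return sum(step_rewards) / len(step_rewards)  # PRM-Avg
    return step_rewards[-1]                           # PRM-Last

def vote(answers: List[str], scores: List[float], how: str = "prm_vote") -> str:
    if how == "majority":                             # Majority Vote
        return Counter(answers).most_common(1)[0][0]
    if how == "prm_max":                              # PRM-Max:取单条最高分
        return answers[max(range(len(answers)), key=lambda i: scores[i])]
    totals: Dict[str, float] = defaultdict(float)     # PRM-Vote:累加相同答案的分数
    for a, s in zip(answers, scores):
        totals[a] += s
    return max(totals, key=totals.get)
```

例如 `vote(["42", "42", "7"], [0.9, 0.2, 0.8], "prm_vote")` 会返回 "42"(累加分 1.1 高于 0.8),而 `"prm_max"` 与 `"majority"` 在这个例子中也都返回 "42"。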


Figure 4: Performance of Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct on MATH-500 with different PRMs and TTS strategies.

图 4: Llama-3.1-8B-Instruct 和 Qwen2.5-7B-Instruct 在不同 PRM 和 TTS 策略下在 MATH-500 上的表现。


Figure 5: Performance of Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct on AIME24 with different PRMs and TTS strategies.

图 5: Llama-3.1-8B-Instruct 和 Qwen2.5-7B-Instruct 在 AIME24 上使用不同 PRM 和 TTS 策略的性能。

We use OpenR, an open-source LLM reasoning framework, as our codebase. For compute budgets, we use $N \in \{4, 16, 64, 256\}$ in most experiments. The division of steps follows the "\n\n" format, as in prior works (Xiong et al., 2024; Zhang et al., 2025). For beam search and DVTS, the beam width is set to 4. The temperature of CoT is 0.0, while it is 0.7 for other methods. For CoT and BoN, we restrict the maximum number of new tokens to 8192. For search-based methods, the token limit is 2048 for each step and 8192 for the total response.

我们使用 OpenR 作为代码库,这是一个开源的大语言模型推理框架。在大多数实验中,我们使用计算预算 $N \in \{4, 16, 64, 256\}$。步骤的划分沿用先前工作 (Xiong 等人, 2024; Zhang 等人, 2025) 中的 "\n\n" 格式。对于 beam search 和 DVTS,束宽 (beam width) 设置为 4。CoT 的温度为 0.0,其他方法的温度为 0.7。对于 CoT 和 BoN,我们将新生成 Token 的最大数量限制为 8192。对于基于搜索的方法,每一步的 Token 上限为 2048,整个响应的 Token 上限为 8192。
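为了便于对照,上述实验设置可以归纳为如下配置示意;数值取自正文,但这种组织方式只是假设,并非 OpenR 框架的真实配置文件格式。

```python
# 正文实验设置的示意配置(数值来自上文;字段组织为假设)。
EXPERIMENT_CONFIG = {
    "compute_budgets": [4, 16, 64, 256],    # 大多数实验使用的预算 N
    "step_delimiter": "\n\n",               # 步骤划分方式
    "beam_width": 4,                         # beam search 与 DVTS 的束宽
    "temperature": {"cot": 0.0, "others": 0.7},
    "max_new_tokens": {
        "cot_and_bon_total": 8192,           # CoT 与 BoN 的总生成上限
        "search_per_step": 2048,             # 搜索类方法每步上限
        "search_total": 8192,                # 搜索类方法总响应上限
    },
}
```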

4.2. How does TTS improve with different policy models and PRMs? (Q1)

4.2. TTS 如何通过不同策略模型和 PRM 提升性能? (Q1)

PRMs are hard to generalize across policy models and tasks. As shown in Figure 4, for Llama-3.1-8B-Instruct, the performance of search-based methods with Skywork and Qwen2.5-Math PRMs improves significantly with larger compute budgets, while the results of searching with Math-Shepherd and RLHFlow PRMs remain relatively poor, even worse than majority voting. For Qwen2.5-7B-Instruct, the performance of searching with Skywork-PRM-7B and Qwen2.5-Math PRMs scales well with more budgets, while the performance of other PRMs remains poor. In Figure 5, although the Pass@k accuracy of both policy models improves a lot with larger compute budgets, the performance improvement of TTS remains moderate. These results demonstrate that the generalization of PRMs is particularly challenging across different policy models and tasks, especially for more complex tasks.

PRMs 难以在不同策略模型和任务之间泛化。如图 4 所示,对于 Llama-3.1-8B-Instruct,使用 Skywork 和 Qwen2.5-Math PRMs 的搜索方法在计算预算增加时性能显著提升,而使用 Math-Shepherd 和 RLHFlow PRMs 的搜索结果仍然相对较差,甚至不如多数投票。对于 Qwen2.5-7B-Instruct,使用 Skywork-PRM-7B 和 Qwen2.5-Math PRMs 的搜索性能在预算增加时表现良好,而其他 PRMs 的表现仍然不佳。在图 5 中,尽管两个策略模型的 Pass@k 准确率在计算预算增加时都有显著提升,但 TTS 的性能提升仍然有限。这些结果表明,PRMs 在不同策略模型和任务之间的泛化尤其具有挑战性,特别是在处理更复杂的任务时。

The optimal TTS method depends on the PRM used. As shown in Figure 4, BoN outperforms other strategies most of the time when using Math-Shepherd and RLHFlow PRMs, while search-based methods perform better with Skywork and Qwen2.5-Math PRMs. This difference occurs because using a PRM for OOD policy responses leads to sub-optimal answers, as PRMs show limited generalization across policy models. Moreover, if we select each step with OOD PRMs, we are likely to obtain answers trapped in local optima, which worsens the performance. This may also be related to the base model of the PRM, since the PRM trained with PRM800K (Lightman et al., 2024) on Qwen2.5-Math-7B-Instruct generalizes better than PRMs with Mistral and Llama as base models (Zhang et al., 2025). Further analysis is provided in Section 4.4 and Appendix C. These results suggest that the choice of the optimal TTS strategy depends on the specific PRMs used, emphasizing the importance of considering reward information in compute-optimal TTS. We also explore the relationship between TTS performance and the process supervision abilities of different PRMs. As shown in Figure 6, TTS performance is positively correlated with the process supervision abilities of PRMs, and the fitted function is $Y = 7.66 \log(X) + 44.31$, where $Y$ represents TTS performance and $X$ represents the process supervision abilities of the PRM (Zhang et al., 2025).

最优的 TTS 方法取决于所使用的 PRM。如图 4 所示,当使用 Math-Shepherd 和 RLHFlow PRM 时,BoN 大多数情况下优于其他策略,而基于搜索的方法在 Skywork 和 Qwen2.5-Math PRM 上表现更好。这种差异的出现是因为对 OOD (Out-of-Distribution) 的策略响应使用 PRM 会导致次优答案,因为 PRM 在跨策略模型上的泛化能力有限。此外,如果我们在每个步骤中选用 OOD PRM,很可能会陷入局部最优从而导致性能下降。这也可能与 PRM 的基础模型有关,因为在 Qwen2.5-Math-7B-Instruct 上使用 PRM800K (Lightman et al., 2024) 训练的 PRM 相较于以 Mistral 和 Llama 为基础模型的 PRM (Zhang et al., 2025) 表现出更好的泛化能力。进一步的分析详见第 4.4 节和附录 C。这些结果表明,最优 TTS 策略的选择取决于所使用的具体 PRM,强调了在计算最优 TTS 中考虑奖励信息的重要性。我们还探讨了 TTS 性能与不同 PRM 的过程监督能力之间的关系。如图 6 所示,TTS 性能与 PRM 的过程监督能力呈正相关,拟合函数为 $Y = 7.66 \log(X) + 44.31$,其中 $Y$ 表示 TTS 性能,$X$ 表示 PRM 的过程监督能力 (Zhang et al., 2025)。


Figure 6: The relationship between TTS performance and process supervision abilities of different PRMs on MATH, where the size of each circle represents the number of parameters of the PRM and the curve represents the fitted function.

图 6: 不同 PRM 在 MATH 上的 TTS 性能与过程监督能力之间的关系,其中每个圆圈的大小表示 PRM 的参数数量,曲线表示拟合函数。


Figure 7: TTS performance of policy models with parameters from 0.5B to 72B on MATH-500 with different scaling methods.

图7: 参数范围从0.5B到72B的策略模型在MATH-500数据集上采用不同扩展方法的TTS性能。

The optimal TTS method varies with policy models. To study the relationship between the parameters of the policy models and the optimal TTS methods, we conduct experiments with Qwen2.5 family LLMs (Yang et al., 2024b), including models with 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. The results in Figure 7 show that the optimal TTS methods depend on the specific policy models. For small policy models, search-based methods outperform BoN, while for large policy models, BoN is more effective than search-based methods. This difference occurs because larger models have stronger reasoning capabilities and do not need a verifier to perform step-by-step selection. In contrast, smaller models rely on a verifier to select each step, ensuring the correctness of each intermediate step.

最优的TTS方法因策略模型而异。为了研究策略模型的参数与最优TTS方法之间的关系,我们在Qwen2.5系列大语言模型(Yang等,2024b)上进行了实验,包括参数为0.5B、1.5B、3B、7B、14B、32B和72B的模型。图7中的结果表明,最优的TTS方法取决于具体的策略模型。对于小型策略模型,基于搜索的方法优于BoN,而对于大型策略模型,BoN比基于搜索的方法更有效。这种差异的出现是因为较大的模型具有更强的推理能力,不需要验证器逐步选择。相反,较小的模型依赖验证器来选择每一步,以确保每个中间步骤的正确性。

4.3. How does TTS improve for problems with different difficulty levels? (Q2)

4.3. TTS 如何针对不同难度的问题进行优化? (Q2)

Following Snell et al. (2024), we conduct a comprehensive evaluation of tasks with varying difficulty levels. However, as explained in Section 3.2, we observe that using the difficulty levels defined in MATH (Hendrycks et al., 2021) or the oracle labels based on Pass@1 accuracy quantiles (Snell et al., 2024) is not appropriate because different policy models exhibit different reasoning abilities. To address this, we categorize the difficulty levels into three groups based on the absolute value of Pass@1 accuracy: easy (50%-100%), medium (10%-50%), and hard (0%-10%).

遵循 Snell 等人 (2024) 的做法,我们对不同难度级别的任务进行了全面评估。然而,如第 3.2 节所述,我们发现使用 MATH (Hendrycks 等人, 2021) 中定义的难度级别或基于 Pass@1 准确率分位数的 oracle 标签 (Snell 等人, 2024) 并不合适,因为不同的策略模型表现出不同的推理能力。为了解决这个问题,我们根据 Pass@1 准确率的绝对值将难度级别分为三组:简单 (50%-100%)、中等 (10%-50%) 和困难 (0%-10%)。

The optimal TTS methods vary with different difficulty levels. The results in Figure 8 and Figure 9 show that for small policy models (i.e., with fewer than 7B parameters), BoN is better for easy problems, while beam search works better for harder problems. For policy models with parameters between 7B and 32B, DVTS performs well for easy and medium problems, and beam search is preferable for hard problems. For policy models with 72B parameters, BoN is the best method for all difficulty levels.

不同难度级别下,最佳的 TTS 方法各不相同。图 8 和图 9 的结果显示,对于小型策略模型(即参数少于 7B),BoN 在简单问题上表现更好,而 beam search 在更困难的问题上效果更佳。对于参数在 7B 到 32B 之间的策略模型,DVTS 在简单和中等难度问题上表现良好,而 beam search 在困难问题上更为优选。对于参数为 72B 的策略模型,BoN 在所有难度级别上都是最佳方法。


Figure 8: TTS performance of three Llama policy models on MATH-500 with three difficulty levels.

图 8: 三种 Llama 策略模型在 MATH-500 三个难度级别上的 TTS 性能。

4.4. Do PRMs have bias towards specific response lengths or sensitivity to voting methods? (Q3)

4.4. PRMs 是否对特定响应长度有偏见或对投票方法敏感?(Q3)

Table 1: Statistics of training data of RLHFlow PRMs.

表 1: RLHFlow PRMs 训练数据统计

| | Mistral-PRM-Data | Deepseek-PRM-Data |
| --- | --- | --- |
| 平均每条响应的 Token 数 | 236.9 | 333.1 |
| 平均每步的 Token 数 | 46.6 | 58.4 |

PRMs are biased towards the length of steps. Although we perform TTS under the same budget in previous experiments, we find that the number of inference tokens with different PRMs varies significantly. For example, given the same budget and the same policy model, the number of inference tokens of scaling with RLHFlow-PRM-Deepseek-8B is consistently larger than that of RLHFlow-PRM-Mistral-8B, nearly 2×. The training data of RLHFlow series PRMs are sampled from different LLMs, which may lead to the bias towards the length of the output. To verify this point, we analyze several properties of the training data of RLHFlow-PRM-Mistral-8B and RLHFlow-PRM-Deepseek-8B. As shown in Table 1, both the average token per response and the average token per step of Deepseek-PRM-Data are larger than those of Mistral-PRM-Data, indicating that the training data of RLHFlow-PRM-Deepseek-8B is longer than that of RLHFlow-PRM-Mistral-8B. This may lead to the bias towards the length of the output. We also find that the number of inference tokens of scaling with Qwen2.5-Math-7B is larger than that of Skywork-PRM-7B, but the performance is very close, which indicates that searching with Skywork-PRM-7B is more efficient than searching with Qwen2.5-Math-7B.

PRM 在步骤长度上存在偏差。虽然我们在之前的实验中是在相同预算下进行 TTS 的,但我们发现不同 PRM 的推理 Token 数量差异显著。例如,在相同预算和相同策略模型下,使用 RLHFlow-PRM-Deepseek-8B 进行扩展的推理 Token 数量始终大于 RLHFlow-PRM-Mistral-8B,几乎是后者的 2 倍。RLHFlow 系列 PRM 的训练数据是从不同的大语言模型中采样的,这可能导致对输出长度的偏差。为了验证这一点,我们分析了 RLHFlow-PRM-Mistral-8B 和 RLHFlow-PRM-Deepseek-8B 训练数据的几个特性。如表 1 所示,Deepseek-PRM-Data 的每条响应平均 Token 数和每步平均 Token 数都大于 Mistral-PRM-Data,表明 RLHFlow-PRM-Deepseek-8B 的训练数据比 RLHFlow-PRM-Mistral-8B 更长,这可能导致对输出长度的偏差。我们还发现,使用 Qwen2.5-Math-7B 进行扩展的推理 Token 数量大于使用 Skywork-PRM-7B,但两者的性能非常接近,这表明使用 Skywork-PRM-7B 进行搜索比使用 Qwen2.5-Math-7B 更高效。

Table 2: Performance of TTS with different voting methods on MATH-500.

表 2: 不同投票方法在 MATH-500 上的 TTS 性能

| | Skywork-PRM-7B | Qwen2.5-Math-PRM-7B |
| --- | --- | --- |
| Majority Vote | 86.8 | 87.6 |
| PRM-Min-Max | 83.0 | 87.4 |
| PRM-Min-Vote | 86.6 | 87.6 |
| PRM-Last-Max | 84.4 | 87.6 |
| PRM-Last-Vote | 87.0 | 87.6 |
| PRM-Avg-Max | 85.8 | 87.8 |
| PRM-Avg-Vote | 86.8 | 87.6 |

PRMs are sensitive to voting methods. From the results in Table 2, Skywork-PRM-7B works better with PRM-Vote than with PRM-Max, while Qwen2.5-Math-PRM-7B is not very sensitive to voting methods. The main reason is that the training data of Qwen2.5-Math PRMs are processed with LLM-as-a-judge (Zheng et al., 2023), which removes the wrong intermediate steps labeled as positive steps in the training data and makes the outputted large reward values more likely to be correct. This shows that the training data of PRMs is important for improving the ability to find errors in the search process.

PRM 对投票方法敏感。表 2 中的结果显示,Skywork-PRM-7B 使用 PRM-Vote 的效果优于 PRM-Max,而 Qwen2.5-Math-PRM-7B 对投票方法不太敏感。其主要原因是 Qwen2.5-Math PRM 的训练数据经过了大语言模型作为评判者 (LLM-as-a-judge) (Zheng et al., 2023) 的处理,去除了训练数据中被错误标记为正向步骤的中间步骤,使得输出的大奖励值更可能是正确的。这表明 PRM 的训练数据对于提高在搜索过程中发现错误的能力至关重要。

5. Results for Compute-Optimal Test-Time Scaling

5. 计算最优测试时缩放的结果

With the compute-optimal TTS strategy explored in Section 4, we conduct further experiments to explore the following questions:

在探索了第 4 节中计算优化的 TTS 策略后,我们进行了进一步的实验来探索以下问题:

5.1. Can smaller policy models outperform larger models with the compute-optimal TTS strategy? (Q4)

5.1. 较小的策略模型能否通过计算最优 TTS 策略超越较大的模型?(Q4)

Scaling test-time compute of small policy models is crucially important for improving the reasoning performance of LLMs. We are interested in whether smaller policy models can outperform larger ones, including GPT-4o and even o1 and DeepSeek-R1, with the compute-optimal TTS strategy. First, we compare the performance of Llama-3.2-3B-Instruct (compute-optimal TTS) with that of Llama-3.1-405B-Instruct (CoT) on MATH-500 and AIME24. Also, we compare the performance of Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct with GPT-4o on the above two tasks. As AIME24 is challenging for current LLMs, we also compare the performance of DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B with o1 on AIME24.

扩展小型策略模型的测试时计算对于提升大语言模型的推理性能至关重要。我们关注的是,采用计算最优的 TTS 策略后,较小的策略模型是否能够超越较大的模型,如 GPT-4o,甚至 o1 和 DeepSeek-R1。首先,我们将 Llama-3.2-3B-Instruct(计算最优 TTS)与 Llama-3.1-405B-Instruct(CoT)在 MATH-500 和 AIME24 上的表现进行比较。此外,我们还将 Qwen2.5-0.5B-Instruct、Qwen2.5-1.5B-Instruct、Llama-3.2-1B-Instruct 和 Llama-3.2-3B-Instruct 与 GPT-4o 在上述两个任务中的表现进行比较。由于 AIME24 对当前的大语言模型具有挑战性,我们还比较了 DeepSeek-R1-Distill-Qwen-1.5B 和 DeepSeek-R1-Distill-Qwen-7B 与 o1 在 AIME24 上的表现。

Table 3: Comparison of small policy models (compute-optimal TTS) with frontier reasoning LLMs (CoT) on MATH-500 and AIME24.

表 3: 小型策略模型 (计算最优 TTS) 与前沿推理大语言模型 (CoT) 在 MATH-500 和 AIME24 上的对比。

| 策略模型 | MATH-500 | AIME24 | 平均 |
| --- | --- | --- | --- |
| 专有大语言模型 (CoT) | | | |
| GPT-4o | 74.6 | 9.3 | 42.0 |
| o1-preview | 85.5 | 44.6 | 65.1 |
| o1-mini | 90.0 | 63.6 | 76.8 |
| o1 | 94.8 | 79.2 | 87.0 |
| 开源大语言模型 (CoT) | | | |
| Llama-3.1-70B-Inst. | 65.2 | 16.7 | 41.0 |
| Llama-3.1-405B-Inst. | 71.4 | 23.3 | 47.4 |
| QwQ-32B-Preview | 90.6 | 50.0 | 70.3 |
| DeepSeek-R1 | 97.3 | 79.8 | 88.6 |
| 开源大语言模型 (TTS) | | | |
| Llama-3.2-1B-Inst. | 66.2 | 16.7 | 41.5 |
| Llama-3.2-1B-Inst. (N = 512) | 72.2 | 10.0 | 41.1 |
| Llama-3.2-3B-Inst. | 75.6 | 30.0 | 52.8 |
| Qwen2.5-0.5B-Inst. | 76.4 | 10.0 | 43.2 |
| Qwen2.5-1.5B-Inst. | 81.8 | 20.0 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 91.6 | 63.3 | 77.5 |
| DeepSeek-R1-Distill-Qwen-7B | 95.2 | 83.3 | 89.3 |

From the results in Table 3, we have the following observations: (1) Llama-3.2-3B-Instruct with the compute-optimal TTS strategy outperforms Llama-3.1-405B-Instruct on MATH-500 and AIME24, meaning that smaller models can outperform 135× larger models using the compute-optimal TTS strategy. Compared with previous works on TTS (Snell et al., 2024; Beeching et al., 2024), we improve the result by 487.0% (from 23× to 135×). (2) If we further increase the compute budget to N = 512, Llama-3.2-1B-Instruct with the compute-optimal TTS strategy beats Llama-3.1-405B-Instruct on MATH-500, but underperforms Llama-3.1-405B-Instruct on AIME24. (3) Qwen2.5-0.5B-Instruct and Llama-3.2-3B-Instruct with the compute-optimal TTS strategy outperform GPT-4o, indicating that small models can exceed GPT-level performance with the compute-optimal TTS strategy. (4) DeepSeek-R1-Distill-Qwen-1.5B with the compute-optimal TTS strategy outperforms o1-preview and o1-mini on MATH-500 and AIME24. We also show that DeepSeek-R1-Distill-Qwen-7B with the compute-optimal TTS strategy outperforms o1 and DeepSeek-R1 on MATH-500 and AIME24. These results demonstrate that small reasoning-enhanced models can outperform frontier reasoning LLMs with the compute-optimal TTS strategy.

从表 3 的结果中,我们得出以下观察:(1) 采用计算最优 TTS 策略的 Llama-3.2-3B-Instruct 在 MATH-500 和 AIME24 上优于 Llama-3.1-405B-Instruct,这意味着较小的模型可以通过计算最优 TTS 策略超越 135× 更大的模型。与之前关于 TTS 的研究 (Snell et al., 2024; Beeching et al., 2024) 相比,我们将结果提高了 487.0%(从 23× 提升到 135×)。(2) 如果我们将计算预算进一步增加到 N = 512,采用计算最优 TTS 策略的 Llama-3.2-1B-Instruct 在 MATH-500 上击败了 Llama-3.1-405B-Instruct,但在 AIME24 上表现不及 Llama-3.1-405B-Instruct。(3) 采用计算最优 TTS 策略的 Qwen2.5-0.5B-Instruct 和 Llama-3.2-3B-Instruct 优于 GPT-4o,这表明小模型可以通过计算最优 TTS 策略超越 GPT 级别的性能。(4) 采用计算最优 TTS 策略的 DeepSeek-R1-Distill-Qwen-1.5B 在 MATH-500 和 AIME24 上优于 o1-preview 和 o1-mini。我们还展示了采用计算最优 TTS 策略的 DeepSeek-R1-Distill-Qwen-7B 在 MATH-500 和 AIME24 上优于 o1 和 DeepSeek-R1。这些结果表明,具有推理增强能力的小模型可以通过计算最优 TTS 策略超越前沿推理大语言模型。

FLOPS Comparison. To answer the question of whether compute-optimal TTS is more effective than increasing the model size, we compare the FLOPS of the evaluated models in Table 4, following Snell et al. (2024); the computed FLOPS correspond to the results in Table 3. From the results, we can see that small policy models can even surpass large ones with less inference FLOPS, reducing the total FLOPS by 100×-1000×.

FLOPS 对比。为了回答计算最优 TTS 是否比增大模型规模更有效的问题,我们按照 Snell 等人 (2024) 的方法,在表 4 中对比了所评估模型的 FLOPS,其中计算出的 FLOPS 与表 3 的结果相对应。从结果可以看出,小型策略模型甚至能以更少的推理 FLOPS 超越大型模型,并将总 FLOPS 减少 100×-1000×。
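表 4 中的量级可以用常见的粗略近似来复核:预训练 FLOPS 约为 6 × 参数量 × 预训练 Token 数,推理 FLOPS 约为 2 × 参数量 × 生成 Token 数(Snell et al. (2024) 采用的即是这类近似)。下面是一个示意计算器,其中具体的 Token 数均为假设输入。

```python
# 粗略的 FLOPS 估算(常用近似:训练约 6ND,推理约 2N * 生成 token 数)。
def pretraining_flops(num_params: float, pretrain_tokens: float) -> float:
    return 6.0 * num_params * pretrain_tokens

def inference_flops(num_params: float, generated_tokens: float) -> float:
    return 2.0 * num_params * generated_tokens

# 示例(token 数为假设值):3B 参数、约 9e12 预训练 token
# pretraining_flops(3e9, 9e12) ≈ 1.6e23,与表 4 中 Llama-3.2-3B 的量级一致
```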

Table 4: FLOPS comparison between smaller policy models (compute-optimal TTS) and larger ones (CoT).

表 4: 较小策略模型 (compute-optimal TTS) 与较大模型 (CoT) 的 FLOPS 对比

| Policy Model | Pre-training FLOPS | Inference FLOPS | Total FLOPS |
| --- | --- | --- | --- |
| Llama-3.2-3B-Inst. | 1.62 × 10^23 | 3.07 × 10^17 | 1.62 × 10^23 |
| Llama-3.1-405B-Inst. | 3.65 × 10^25 | 4.25 × 10^17 | 3.65 × 10^25 |
| DeepSeek-R1-Distill-7B | 7.56 × 10^23 | 8.15 × 10^17 | 7.56 × 10^23 |
| DeepSeek-R1 | 5.96 × 10^25 | 4.03 × 10^18 | 5.96 × 10^25 |

Table 5: Comparison of compute-optimal TTS, CoT, and majority voting with different policy models on MATH-500.

表 5: 不同策略模型在 MATH-500 上的计算最优 TTS、CoT 和多数投票的对比

| 策略模型 | CoT | 多数投票 | 计算最优 TTS | 性能提升 | 效率提升 |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Inst. | 26.0 | 39.0 | 66.2 | 154.6% | >256.0× |
| Llama-3.2-3B-Inst. | 41.4 | 58.4 | 78.2 | 88.9% | 14.1× |
| Llama-3.1-8B-Inst. | 49.8 | 66.4 | 80.6 | 61.8% | 43.9× |
| Qwen2.5-0.5B-Inst. | 31.6 | 47.2 | 76.4 | 141.8% | >64.0× |
| Qwen2.5-1.5B-Inst. | 54.4 | 68.4 | 85.6 | 57.4% | >256.0× |
| Qwen2.5-3B-Inst. | 64.0 | 77.0 | 87.6 | 36.9% | 58.4× |
| Qwen2.5-7B-Inst. | 76.8 | 83.6 | 91.0 | 18.5% | 35.9× |
| Qwen2.5-14B-Inst. | 80.2 | 85.6 | 91.0 | 13.5% | 51.4× |
| Qwen2.5-32B-Inst. | 82.4 | 87.0 | 90.6 | 10.0% | 0.8× |
| Qwen2.5-72B-Inst. | 83.8 | 87.2 | 91.8 | 9.5% | 12.9× |

5.2. How does compute-optimal TTS improve compared with CoT and majority voting? (Q5)

5.2. 计算最优 TTS 相比 CoT 和多数投票如何改进? (Q5)

Based on the findings of compute-optimal TTS with different policy models, PRMs, and difficulty levels, we summarize the results of compute-optimal TTS for each policy model on MATH-500 in Table 5. We find that compute-optimal TTS can be 256× more efficient than majority voting and improve reasoning performance by 154.6% over CoT. These results demonstrate that compute-optimal TTS significantly enhances the reasoning capabilities of LLMs. However, as the number of parameters in the policy model increases, the improvement of TTS gradually decreases. This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model. Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited.

基于使用不同策略模型、PRM 和难度级别的计算最优 TTS 的研究结果,我们在表 5 中总结了 MATH-500 上每个策略模型的计算最优 TTS 结果。我们发现,计算最优 TTS 的效率可以比多数投票高 256×,并且比 CoT 提高 154.6% 的推理性能。这些结果表明,计算最优 TTS 显著增强了大语言模型的推理能力。然而,随着策略模型参数数量的增加,TTS 带来的提升逐渐减小。这表明 TTS 的有效性与策略模型的推理能力直接相关。具体来说,对于推理能力较弱的模型,扩展测试时计算会带来显著提升,而对于推理能力较强的模型,增益则有限。

5.3. Is TTS more effective than long-CoT-based methods? (Q6)

5.3. TTS 是否比基于长链式思维 (long-CoT) 的方法更有效? (Q6)

Recently, long-CoT-based methods have shown substantial progress in mathematical reasoning (Guan et al., 2025; Cui et al., 2025; Zeng et al., 2025; DeepSeek-AI et al., 2025). We compare the performance of TTS with these approaches.

最近,基于长链思维 (long-CoT) 的方法在数学推理方面取得了显著进展 (Guan et al., 2025; Cui et al., 2025; Zeng et al., 2025; DeepSeek-AI et al., 2025)。我们比较了 TTS 与这些方法的性能。

Setup. We evaluate the following methods: (1) rStar-Math (Guan et al., 2025): This method first generates reasoning data via MCTS, followed by online policy and preference model learning. (2) Eurus-2 (Cui et al., 2025): This method enhances the reasoning abilities of LLMs through implicit process rewards and online RL. (3) SimpleRL (Zeng et al., 2025): This method replicates self-reflection with only 8K training data. (4) Satori (Shen et al., 2025): This method first learns the format and then improves the reasoning abilities via RL. (5) DeepSeek-R1-Distill-Qwen-7B (DeepSeek-AI et al., 2025): This method distills 800K high-quality reasoning samples from DeepSeek-R1 with 671B parameters into a 7B LLM.

设置。我们评估了以下方法:(1) rStar-Math (Guan et al., 2025):该方法首先通过 MCTS 生成推理数据,随后进行在线策略和偏好模型学习。(2) Eurus-2 (Cui et al., 2025):该方法通过隐式过程奖励和在线强化学习提升大语言模型的推理能力。(3) SimpleRL (Zeng et al., 2025):该方法仅用 8K 训练数据复现自我反思。(4) Satori (Shen et al., 2025):该方法首先学习格式,再通过强化学习提升推理能力。(5) DeepSeek-R1-Distill-Qwen-7B (DeepSeek-AI et al., 2025):该方法将拥有 671B 参数的 DeepSeek-R1 的 800K 高质量推理样本蒸馏到一个 7B 大语言模型中。

Table 6: Comparison of compute-optimal TTS with long-CoT methods on MATH-500 and AIME24.

表 6: MATH-500 和 AIME24 上计算最优 TTS 与长 CoT 方法的比较

| 策略模型 | MATH-500 | AIME24 | 平均 |
| --- | --- | --- | --- |
| 开源大语言模型 (CoT) | | | |
| Qwen2.5-7B-Inst. | 76.8 | 13.3 | 45.1 |
| Qwen2.5-Math-7B-Inst. | 79.8 | 13.3 | 46.6 |
| 长 CoT 方法 (CoT) | | | |
| rStar-Math-7B | 78.4 | 26.7 | 52.6 |
| Eurus-2-7B-PRIME | 79.2 | 26.7 | 53.0 |
| Qwen2.5-7B-SimpleRL-Zero | 77.2 | 33.3 | 55.3 |
| Qwen2.5-7B-SimpleRL | 82.4 | 26.7 | 54.6 |
| Satori-Qwen-7B | 83.6 | 23.3 | 53.5 |
| DeepSeek-R1-Distill-Qwen-7B | 92.4 | 63.3 | 77.9 |
| 开源大语言模型 (TTS) | | | |
| Qwen2.5-7B-Inst. w/ 7B PRM (Ours) | 88.0 | 33.3 | 60.7 |
| Qwen2.5-7B-Inst. w/ 72B PRM (Ours) | 91.1 | 36.7 | 63.9 |

Results. As shown in Table 6, we find that TTS with Qwen2.5-7B-Instruct outperforms rStar-Math, Eurus-2, SimpleRL, and Satori on both MATH-500 and AIME24. However, while the performance of TTS on MATH-500 is close to that of DeepSeek-R1-Distill-Qwen-7B, it shows a significant drop on AIME24. These results indicate that TTS is more effective than methods applying direct RL or SFT on the data generated via MCTS but is less effective than distilling from strong reasoning models. Also, TTS is more effective on simpler tasks than on more complex tasks.

结果。如表 6 所示,我们发现采用 Qwen2.5-7B-Instruct 的 TTS 在 MATH-500 和 AIME24 上均优于 rStar-Math、Eurus-2、SimpleRL 和 Satori。然而,虽然 TTS 在 MATH-500 上的表现接近 DeepSeek-R1-Distill-Qwen-7B,但在 AIME24 上却显著下降。这些结果表明,TTS 比在通过 MCTS 生成的数据上直接应用 RL 或 SFT 的方法更有效,但比从强推理模型蒸馏的效果稍逊。此外,TTS 在较简单任务上比在更复杂任务上更有效。

6. Related Work

6. 相关工作

LLM Test-Time Scaling. Scaling LLM test-time compute is an effective way to improve performance (OpenAI, 2024). Previous works explore majority voting (Wang et al., 2023), search-based methods (Yao et al., 2023; Xie et al., 2023; Khanov et al., 2024; Wan et al., 2024), and refinement (Qu et al., 2024) to improve the performance. For verification-guided test-time compute, Brown et al. (2024) explores inference compute with repeated sampling and domain verifiers, while Kang et al. (2024); Wu et al. (2024); Snell et al. (2024) further explore search-based methods with process reward guidance, and Wang et al. (2024c) extends this setting to VLMs. To eliminate the need for external reward models and the generation of extensive samples, Manvi et al. (2024) proposes a self-evaluation method for adaptive and efficient test-time compute. A recent work (Beeching et al., 2024) explores TTS via search methods with diversity. However, these works lack an evaluation with either strong verifiers or policies with different sizes and capabilities. In this paper, we aim to provide a more systematic evaluation with up-to-date policies and verifiers, on more challenging tasks, and to provide some principles for practical TTS.

大语言模型测试时间扩展

Improving Mathematical Reasoning Abilities of LLMs. Prior methods for improving mathematical reasoning abilities can be divided into training-time methods and test-time methods. For training-time methods, previous works explore large-scale mathematical corpus pre-training (OpenAI, 2023; Azerbayev et al., 2024; Shao et al., 2024) and supervised fine-tuning (Luo et al., 2023; Yu et al., 2024; Gou et al., 2024; Tang et al., 2024; Tong et al., 2024; Zeng et al., 2024) to improve mathematical capabilities. Another line of works explores self-training and self-improvement strategies (Zelikman et al., 2022; Gulcehre et al., 2023; Trung et al., 2024; Hosseini et al., 2024; Zelikman et al., 2024; Zhang et al., 2024a; Setlur et al., 2024a; Kumar et al., 2024; Cui et al., 2025), which improve the reasoning abilities by fine-tuning on self-generated solutions. Recently, many works improve the mathematical reasoning abilities with long CoT (Qin et al., 2024; Huang et al., 2024; Kimi, 2024; DeepSeek-AI et al., 2025; Qwen Team, 2024; Skywork, 2024; Zhao et al., 2024), as OpenAI o1 (OpenAI, 2024) shows significantly powerful reasoning capabilities with long thinking.

提升大语言模型的数学推理能力

For test-time methods, prompt-based approaches have been extensively studied to enhance reasoning without altering the model parameters. Techniques such as Chain-of-Thought (CoT) (Wei et al., 2022) and its variants (Yao et al., 2023; Leang et al., 2024) guide the model to decompose problems into manageable sub-steps, thereby improving accuracy and coherence in mathematical reasoning. Beyond prompting strategies, self-refinement techniques (Madaan et al., 2023) allow models to review and correct their outputs, while external tool integration (Gao et al., 2023; Chen et al., 2023) leverages program interpreters or symbolic manipulators to perform precise calculations and validations. Self-verification approaches (Weng et al., 2023) enable models to assess the correctness of their own reasoning processes, further increasing robustness. These test-time strategies complement training-time enhancements, collectively contributing to significant improvements in LLMs' mathematical reasoning capabilities. Our work mainly enhances reasoning performance by scaling test-time compute with PRM-guided search methods.

Process Reward Models. Previous works show that PRMs are more effective than ORMs (Uesato et al., 2022; Lightman et al., 2024). However, collecting high-quality PRM data, such as PRM800K (Lightman et al., 2024), is often costly. Researchers have explored automatic PRM data collection via direct Monte Carlo estimation (Wang et al., 2024b), detecting relative scores of ORMs (Lu et al., 2024), and efficient MCTS with binary search (Luo et al., 2024). Recently, more advanced PRMs have been explored from the perspectives of advantage modeling (Setlur et al., 2024b), Q-value rankings (Li and Li, 2024), implicit rewards (Yuan et al., 2024), and entropy regularization (Zhang et al., 2024b). Additionally, more open-source PRMs have been released (Xiong et al., 2024; Skywork, 2024; Zhang et al., 2024b; Li and Li, 2024; Yuan et al., 2024; Zhang et al., 2025), showing strong performance on mathematical tasks. With the rapid development of PRMs, ProcessBench (Zheng et al., 2024) and PRMBench (Song et al., 2025) are proposed to provide comprehensive evaluation of PRMs. Zhang et al. (2025) provides guidelines for practical development of PRMs and releases the most capable PRMs for mathematical tasks to date.

7. Conclusion & Discussion

7. 结论与讨论

In this paper, we present a thorough empirical analysis of compute-optimal test-time scaling from the perspectives of different policy models, PRMs, and more challenging evaluation tasks. Our findings demonstrate the dependency of compute-optimal TTS strategies on policy models, PRMs, and problem difficulty, validating that smaller language models can perform better than larger models when applying compute-optimal TTS. Our results show that a 1B model can achieve better performance than a 405B model through TTS. Additionally, we demonstrate that a 7B PRM can achieve strong TTS results by supervising a more capable 72B policy model, which suggests the importance of investigating a true “weak-to-strong” approach instead of the current “strong-to-weak” supervision for policy optimization. To achieve this goal, we need to develop more efficient supervision methods, as both PRM-based and RL-based approaches have limitations due to their dependence on high-quality supervision. Future work should focus on developing more adaptable and universal supervision mechanisms to boost the performance of small language models on complex tasks and provide new approaches for developing efficient reasoning strategies.

本文从不同策略模型、PRM(过程奖励模型)以及更具挑战性的评估任务的角度,对计算最优测试时扩展 (compute-optimal test-time scaling, TTS) 进行了全面的实证分析。我们的研究结果表明,计算最优 TTS 策略依赖于策略模型、PRM 和问题难度,验证了在应用计算最优 TTS 时,较小的语言模型可以优于较大的模型。结果显示,通过 TTS,一个 1B(10亿参数)模型可以比一个 405B(4050亿参数)模型表现更好。此外,我们证明了一个 7B 的 PRM 可以通过监督一个能力更强的 72B 策略模型来实现强大的 TTS 结果,这表明研究真正的“弱到强”方法而非当前的“强到弱”监督对策略优化的重要性。为了实现这一目标,我们需要开发更高效的监督方法,因为无论是基于 PRM 还是基于强化学习 (RL) 的方法,都因其对高质量监督的依赖而存在局限性。未来的工作应专注于开发更具适应性和普遍性的监督机制,以提升小语言模型在复杂任务上的表现,并为开发高效推理策略提供新途径。

Limitations. Although we provide a comprehensive evaluation of TTS on mathematical tasks, there are still some limitations and future directions to explore: (1) Extending TTS to more tasks such as coding and chemistry tasks. (2) Exploring more effective methods for compute-optimal TTS.

局限性。尽管我们对数学任务上的 TTS 进行了全面评估,但仍有一些局限性和未来方向需要探索: (1) 将 TTS 扩展到编码和化学任务等更多任务。 (2) 探索更有效的计算优化 TTS 方法。

References

参考文献

AI-MO. AIME 2024, 2024. URL https://huggingface.co/datasets/AI-MO/aimo-validation-aime.

Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/ introducing-claude/.

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=4WnqRR915j.

Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling test-time compute with open models, 2024. URL https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute.

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: Process supervision without process. In Advances in Neural Information Processing Systems (NeurIPS), 2024. URL https://openreview.net/forum?id=VaXnxQ3UKo.

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research (TMLR), 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Sha
