s1: Simple test-time scaling
Niklas Muennighoff * 1 3 4 Zitong Yang * 1 Weijia Shi Xiang Lisa Li Li Fei-Fei 1 Hannaneh Hajishirzi 2 3 Luke Zettlemoyer Percy Liang Emmanuel Candès Tatsunori Hashimoto
Abstract
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.

Figure 1. Test-time scaling with s1-32B. We benchmark s1-32B on reasoning-intensive tasks and vary test-time compute.
of this approach is to increase the compute at test time to get better results. There has been much work exploring this idea (Snell et al., 2024; Welleck et al., 2024), and the viability of this paradigm was recently validated by OpenAI o1 (OpenAI, 2024). o1 has demonstrated strong reasoning performance with consistent gains from scaling test-time compute. OpenAI describes their approach as using large-scale reinforcement learning (RL), implying the use of sizable amounts of data (OpenAI, 2024). This has led to various attempts to replicate their models relying on techniques like Monte Carlo Tree Search (Gao et al., 2024b; Zhang et al., 2024a), multi-agent approaches (Qin et al., 2024), and others (Wang et al., 2024a; Huang et al., 2024b; 2025). Among these approaches, DeepSeek R1 (DeepSeek-AI et al., 2025) has successfully replicated o1-level performance, also employing reinforcement learning via millions of samples and multiple training stages. However, despite the large number of o1 replication attempts, none have openly replicated a clear test-time scaling behavior. Thus, we ask: what is the simplest approach to achieve both test-time scaling and strong reasoning performance?
1. Introduction
Performance improvements of language models (LMs) over the past years have largely relied on scaling up train-time compute using large-scale self-supervised pre training (Kaplan et al., 2020; Hoffmann et al., 2022). The creation of these powerful models has set the stage for a new scaling paradigm built on top of them: test-time scaling. The aim
We show that training on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing leads to a strong reasoning model that scales in performance with more test-time compute. Specifically, we construct s1K, which consists of 1,000 carefully curated questions paired with reasoning traces and answers distilled from Gemini Thinking Experimental (Google, 2024). We perform supervised fine-tuning (SFT) of an off-the-shelf pretrained model on our small dataset, requiring just 26 minutes of training on 16 H100 GPUs. After training, we control the amount of test-time compute our model spends using budget forcing: (I) If the model generates more thinking tokens than a desired limit, we forcefully end the thinking process by appending an end-of-thinking token delimiter. Ending the thinking this way makes the model transition to generating its answer. (II) If we want the model to spend more test-time compute on a problem, we suppress the generation of the end-of-thinking token delimiter and instead append “Wait” to the model’s current reasoning trace to encourage more exploration. Equipped with this simple recipe – SFT on 1,000 samples and test-time budget forcing – our model s1-32B exhibits test-time scaling (Figure 1). Further, s1-32B is the most sample-efficient reasoning model and outperforms closed-source models like OpenAI’s o1-preview (Figure 2).
We conduct extensive ablation experiments targeting (a) our selection of 1,000 (1K) reasoning samples and (b) our test-time scaling. For (a), we find that jointly incorporating difficulty, diversity, and quality measures into our selection algorithm is important. Random selection, selecting samples with the longest reasoning traces, or only selecting maximally diverse samples all lead to significantly worse performance (around −30% on AIME24 on average). Training on our full data pool of 59K examples, a superset of s1K, does not offer substantial gains over our 1K selection. This highlights the importance of careful data selection and echoes prior findings for instruction tuning (Zhou et al., 2023). For (b), we define desiderata for test-time scaling methods to compare different approaches. Budget forcing leads to the best scaling, as it has perfect controllability with a clear positive slope leading to strong performance.
In summary, our contributions are: We develop simple methods for creating a sample-efficient reasoning dataset (§2) and test-time scaling (§3); Based on these we build s1-32B which is competitive with o1-preview (§4); We ablate subtleties of data (§5.1) and test-time scaling (§5.2). We end with a discussion to motivate future work on simple reasoning (§6). Our code, model, and data are open-source at https://github.com/simplescaling/s1.
2. Reasoning data curation to create s1K
In this section, we describe our process for creating a large dataset first in $\S2.1$ and then filtering it down to s1K in $\S2.2$ .
2.1. Initial collection of 59K samples
We collect an initial 59,029 questions from 16 diverse sources following three guiding principles. Quality: Datasets should be of high quality; we always inspect samples and ignore datasets with, e.g., poor formatting; Difficulty: Datasets should be challenging and require significant reasoning effort; Diversity: Datasets should stem from different fields to cover different reasoning tasks. We collect datasets of two categories:
Curation of existing datasets Our largest source is NuminaMATH (LI et al., 2024) with 30,660 mathematical problems from online websites. We also include historical AIME problems (1983-2021). To enhance diversity, we add OlympicArena (Huang et al., 2024a) with 4,250 questions spanning Astronomy, Biology, Chemistry, Computer Science, Geography, Mathematics, and Physics from various Olympiads. OmniMath (Gao et al., 2024a) adds 4,238 competition-level mathematics problems. We also include 2,385 problems from AGIEval (Zhong et al., 2023), which features questions from standardized tests like SAT and LSAT, covering English, Law, and Logic. We refer to Table 6 in $\S B$ for our other sources.
New datasets in quantitative reasoning To complement these existing datasets, we create two original datasets. s1-prob consists of 182 questions from the probability section of Stanford University’s Statistics Department’s PhD Qualifying Exams (https://statistics.stanford.edu), accompanied by handwritten solutions that cover difficult proofs. The probability qualifying exam is held yearly and requires professional-level mathematical problem-solving. s1-teasers comprises 23 challenging brain-teasers commonly used in interview questions for quantitative trading positions. Each sample consists of a problem and solution taken from PuzzledQuant (https://www.puzzledquant.com/). We only take examples with the highest difficulty level ("Hard").
For each question, we generate a reasoning trace and solution using the Google Gemini Flash Thinking API (Google, 2024), extracting its reasoning trace and response. This yields 59K triplets of a question, generated reasoning trace, and generated solution. Examples from our dataset are in $\S C.2$. We decontaminate all samples against our evaluation questions (MATH500, GPQA Diamond, AIME24; $\S B.5$) using 8-grams and deduplicate the data.
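The 8-gram decontamination and deduplication step can be sketched as follows. This is a minimal sketch; the exact tokenization and matching rules used by the authors are assumptions here, and the function names are mine:

```python
def ngrams(text, n=8):
    """Split on whitespace and return the set of n-gram tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample, eval_questions, n=8):
    """Flag a training sample that shares any n-gram with an eval question."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(q, n) for q in eval_questions)

def decontaminate_and_dedup(samples, eval_questions, n=8):
    """Drop contaminated samples, then drop exact duplicates."""
    seen = set()
    kept = []
    for s in samples:
        if is_contaminated(s, eval_questions, n):
            continue
        key = s.lower().strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append(s)
    return kept
```

Any sample sharing even a single 8-gram with an evaluation question is dropped, which is a deliberately conservative overlap criterion.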
2.2. Final selection of 1K samples
We could directly train on our pool of 59K questions, however, our goal is to find the simplest approach with minimal resources. Thus, we go through three stages of filtering to arrive at a minimal set of 1,000 samples relying on our three guiding data principles: Quality, Difficulty, and Diversity.
Quality We first remove any questions where we ran into API errors, reducing our dataset to 54,116 samples. Next, we filter out low-quality examples by checking whether they contain string patterns with formatting issues, such as ASCII art diagrams, non-existent image references, or inconsistent question numbering, reducing our dataset to 51,581 examples. From this pool, we identify 384 samples for our final 1,000 samples from datasets that we perceive as high-quality and not in need of further filtering (see $\S B.4$ for details).

Figure 2. s1K and s1-32B. (left) s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces. (right) s1-32B, a 32B parameter model finetuned on s1K is on the sample-efficiency frontier. See Table 1 for details on other models.
Difficulty For difficulty, we use two indicators: model performance and reasoning trace length. We evaluate two models on each question: Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct (Qwen et al., 2024), with correctness assessed by Claude 3.5 Sonnet comparing each attempt against the reference solution (see $\S B.3$ for the grading protocol). We measure the token length of each reasoning trace to indicate problem difficulty using the Qwen2.5 tokenizer. This relies on the assumption that more difficult problems require more thinking tokens. Based on the grading, we remove questions that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct can solve correctly and thus may be too easy. By using two models we reduce the likelihood of an easy sample slipping through our filtering due to a rare mistake by one of the models on an easy question. This brings our total samples down to 24,496, setting the stage for the next round of subsampling based on diversity. While filtering with these two models may be optimized for our setup, as we will also use Qwen2.5-32B-Instruct as our model to finetune, the idea of model-based filtering generalizes to other setups.
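The model-based difficulty filter reduces to a few lines once grading is abstracted away. In this sketch, `solves(model, question)` is an assumed callback standing in for the paper's pipeline of sampling a model attempt and having Claude 3.5 Sonnet judge it against the reference solution:

```python
def difficulty_filter(questions, grader_models, solves):
    """Keep only questions that *no* grader model answers correctly.

    `solves(model, question)` is an assumed callback returning True when
    the model's attempt is judged correct against the reference solution
    (in the paper, that judgment is made by Claude 3.5 Sonnet).
    """
    hard = []
    for q in questions:
        # At least one model solved it -> likely too easy, drop it.
        if any(solves(m, q) for m in grader_models):
            continue
        hard.append(q)
    return hard
```

Using two grader models means a question survives only if both fail on it, which makes the filter robust to a rare lucky or unlucky attempt by a single model.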
Diversity To quantify diversity we classify each question into specific domains using Claude 3.5 Sonnet based on the Mathematics Subject Classification (MSC) system (e.g., geometry, dynamic systems, real analysis, etc.) from the American Mathematical Society. The taxonomy focuses on topics in mathematics but also includes other sciences such as biology, physics, and economics. To select our final examples from the pool of 24,496 questions, we first choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces (see $\S B.4$ for details) as motivated in Difficulty. We repeat this process until we have 1,000 total samples.
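The domain-then-length sampling loop can be sketched as below. The paper's exact length-favoring distribution is specified in its appendix ($\S B.4$); here it is replaced by a simple assumption of weights proportional to trace length, and all names are illustrative:

```python
import random

def select_diverse(pool, k=1000, seed=0):
    """Pick k samples: choose a domain uniformly at random, then draw one
    problem from it with probability favoring longer reasoning traces
    (simple proportional weighting here, as a stand-in for the paper's
    distribution). `pool` maps domain -> list of (question, trace_length).
    """
    rng = random.Random(seed)
    selected = []
    while len(selected) < k:
        domains = [d for d, items in pool.items() if items]
        if not domains:
            break                                    # pool exhausted
        d = rng.choice(domains)                      # uniform over domains
        items = pool[d]
        weights = [length for _, length in items]    # favor long traces
        idx = rng.choices(range(len(items)), weights=weights)[0]
        selected.append(items.pop(idx)[0])           # sample w/o replacement
    return selected
```

Sampling the domain first keeps small domains represented, while the length weighting biases each draw toward harder problems within the chosen domain.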
This three-stage process yields a dataset spanning 50 different domains (see Table 5). In $\S5.1$, we will show that using our three criteria in combination is important, as only relying on quality, diversity, or difficulty in isolation leads to worse datasets. Examples from our dataset are in $\S C.2$.
3. Test-time scaling
3.1. Method
We classify test-time scaling methods into 1) Sequential, where later computations depend on earlier ones (e.g., a long reasoning trace), and 2) Parallel, where computations run independently (e.g., majority voting) (Snell et al., 2024; Brown et al., 2024). We focus on sequential scaling as intuitively we believe it should scale better, since later computations can build on intermediate results, allowing for deeper reasoning and iterative refinement. We propose new sequential scaling methods and ways to benchmark them.
Budget forcing We propose a simple decoding-time intervention by forcing a maximum and/or minimum number of thinking tokens at test time. Specifically, we enforce a maximum token count by simply appending the end-of-thinking token delimiter and “Final Answer:” to early exit the thinking stage and make the model provide its current best answer. To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation. Figure 3 contains an example of how this simple approach can lead the model to arrive at a better answer.
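The two interventions can be sketched as a decoding loop. This is a minimal sketch, not the authors' implementation: token-by-token generation is abstracted behind an assumed `next_token` callback, and the literal strings `"</think>"`, `"Wait"`, and `"Final Answer:"` stand in for the actual delimiters; a real implementation would mask the end-of-thinking token's logit when suppressing it:

```python
END_THINK = "</think>"   # assumed end-of-thinking delimiter
WAIT = "Wait"

def budget_forced_decode(next_token, max_think=512, min_think=0):
    """Generate a thinking trace under budget forcing.

    `next_token(trace)` is an assumed callback returning the model's next
    token given the trace so far.
    """
    trace = []
    think_tokens = 0
    while True:
        if think_tokens >= max_think:
            # (I) Over budget: force an early exit to the answer.
            trace += [END_THINK, "Final Answer:"]
            break
        tok = next_token(trace)
        if tok == END_THINK:
            if think_tokens < min_think:
                # (II) Under budget: suppress the delimiter, append "Wait"
                # to encourage further exploration.
                trace.append(WAIT)
                think_tokens += 1
                continue
            trace += [END_THINK, "Final Answer:"]
            break
        trace.append(tok)
        think_tokens += 1
    return trace
```

The maximum budget guarantees an answer is always produced, while the minimum budget repeatedly converts the model's attempt to stop into another round of reflection.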
"raspberry"中有多少个'r'?问题
让我们分解一下计算单词"raspberry"中字母'r'的过程...
- 第一个字母:'r' - 这是一个'r',计数 = 1。
- 第二个字母:'a' - 不是'r',计数保持为1。
- 第六个字母:'e' - 不是'r',计数保持为1。
- 第七个字母:'r' - 这是一个'r',计数 = 2。
- 第八个字母:'y' - 不是'r',计数保持为2...
"raspberry"中的'r'的数量是2。等等,让我们仔细重新阅读问题。它问的是"raspberry中有多少个'r'?"。
*r - a - s - p - b - e - r - r - y...
- 第一个'r'...
- 第二个'r'...
- 第三个'r'...计数 = 3...
推理轨迹:由于快速阅读单词,我最初回答的2是错误的。最终答案:最终答案是3
Baselines We benchmark budget forcing with: (I) Conditional length-control methods, which rely on telling the model in the prompt how long it should generate for. We group them by granularity into (a) Token-conditional control: We specify an upper bound of thinking tokens in the prompt; (b) Step-conditional control: We specify an upper bound of thinking steps, where each step is around 100 tokens; (c) Class-conditional control: We write two generic prompts that tell the model to either think for a short or long amount of time (see $\S D.1$ for details). (II) Rejection sampling, which samples until a generation fits a predetermined compute budget. This oracle captures the posterior over responses conditioned on its length.
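The rejection sampling baseline reduces to a resampling loop over full generations. A minimal sketch, with `sample()` an assumed callback returning a generation and its thinking-token count:

```python
def rejection_sample(sample, budget, max_tries=100):
    """Resample until a generation's thinking-token count fits `budget`.

    `sample()` is an assumed callback returning (generation, n_think_tokens).
    Returns None if no sample fits within `max_tries` attempts.
    """
    for _ in range(max_tries):
        gen, n_tokens = sample()
        if n_tokens <= budget:
            return gen
    return None
```

Because accepted generations are drawn from the model's own distribution restricted to the budget, this baseline captures the posterior over responses conditioned on length, at the cost of discarding all rejected samples.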
3.2. Metrics
We establish a set of desiderata as evaluation metrics to measure test-time scaling across methods. Importantly, we do not only care about the accuracy a method can achieve but also its controllability and test-time scaling slope. For each method we consider, we run a set of evaluations $a \in \mathcal{A}$ varying test-time compute on a fixed benchmark, e.g. AIME24. This produces a piece-wise linear function $f$ with compute as the x-axis measured in thinking tokens and accuracy as the y-axis (see Figure 1, where the rightmost dot for AIME24 corresponds to $f(7320) = 57\%$). We measure
three metrics:
$$\mathrm{Control} = \frac{1}{|\mathcal{A}|}\sum_{a\in\mathcal{A}}\mathbb{1}\left[a_{\mathrm{min}}\le a\le a_{\mathrm{max}}\right]$$
where $a_{\mathrm{min}},a_{\mathrm{max}}$ refer to a pre-specified minimum and maximum amount of test-time compute; in our case thinking tokens. We usually only constrain $a_{\mathrm{max}}$. As tokens generated correspond to the amount of test-time compute spent, this metric measures the extent to which a method allows controllability over the use of that test-time compute. We report it as a percentage with 100% being perfect control.
Scaling is the average slope of the piece-wise linear function:

$$\mathrm{Scaling} = \binom{|\mathcal{A}|}{2}^{-1}\sum_{\substack{a,b\in\mathcal{A}\\ b>a}}\frac{f(b)-f(a)}{b-a}$$
Performance is simply the maximum performance the method achieves on the benchmark. A method with monotonically increasing scaling achieves 100% performance on any benchmark in the limit. However, the methods we investigate eventually flatten out, or further scaling fails due to control or context window limitations.
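Given the evaluation runs as (thinking_tokens, accuracy) pairs, the three metrics can be computed as follows. This is a sketch under the assumption that Control is the fraction of runs inside the token budget and Scaling is the average pairwise slope of $f$; function names are mine:

```python
from itertools import combinations

def control(tokens, a_min=None, a_max=None):
    """Percentage of runs whose token count lies in [a_min, a_max]."""
    ok = sum(1 for a in tokens
             if (a_min is None or a >= a_min) and (a_max is None or a <= a_max))
    return 100.0 * ok / len(tokens)

def scaling(runs):
    """Average slope of accuracy vs. thinking tokens over all run pairs.

    `runs` is a list of (thinking_tokens, accuracy) pairs.
    """
    pairs = [(p, q) for p, q in combinations(sorted(runs), 2) if q[0] > p[0]]
    return sum((fb - fa) / (b - a) for (a, fa), (b, fb) in pairs) / len(pairs)

def performance(runs):
    """Best accuracy reached by any run."""
    return max(acc for _, acc in runs)
```

A positive `scaling` value means more thinking tokens buy more accuracy on average; the paper additionally rescales the reported slope, so only the sign and relative magnitudes carry over from this sketch.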
4. Results
4.1. Setup
Training We perform supervised finetuning on Qwen2.5-32B-Instruct using s1K to obtain our model s1-32B, using basic hyperparameters outlined in $\S C$. Finetuning took 26 minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP.
Evaluation We select three representative reasoning benchmarks widely used in the field: AIME24 (of America, 2024) consists of 30 problems that were used in the 2024 American Invitational Mathematics Examination (AIME) held from Wednesday, January 31 – Thursday, February 1, 2024. AIME tests mathematical problem-solving with arithmetic, algebra, counting, geometry, number theory, probability, and other secondary school math topics. High-scoring high school students in the test are invited to participate in the United States of America Mathematics Olympiad (USAMO). All AIME answers are integers ranging from 000 to 999, inclusive. Some AIME problems rely on figures that we provide to our model using the vector graphics language Asymptote as it cannot take image inputs. MATH500 (Hendrycks et al., 2021) is a benchmark of competition math problems of varying difficulty. We evaluate on the same 500 samples selected by OpenAI in prior work (Lightman et al., 2023). GPQA Diamond (Rein et al., 2023) consists of 198

(a) Sequential scaling via budget forcing

(b) Parallel scaling via majority voting

Figure 4. Sequential and parallel test-time scaling. (a): Budget forcing shows clear scaling trends and extrapolates to some extent. For the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending “Wait” to its current reasoning trace. (b): For Qwen2.5-32B-Instruct we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance when majority voting across 2, 4, 8, 16, 32, and 64 of these.
PhD-level science questions from Biology, Chemistry, and Physics. Experts with PhDs in the corresponding domains only achieved 69.7% on GPQA Diamond (OpenAI, 2024). When we write “GPQA” in the context of evaluation in this work, we always refer to the Diamond subset. We build on the “lm-evaluation-harness” framework (Gao et al., 2021; Biderman et al., 2024).
Other models We benchmark s1-32B against: OpenAI o1 series (OpenAI, 2024), which are closed-source models that popularized the idea of test-time scaling; DeepSeek r1 series (Team, 2024a), which are open-weight reasoning models with up to o1-level performance, concurrently released to ours; Qwen’s QwQ-32B-preview (Team, 2024b), a 32B open-weight reasoning model without disclosed methodology; Sky-T1-32B-Preview (Team, 2025) and Bespoke-32B (Labs, 2025), which are open models with open reasoning data distilled from QwQ-32B-preview and r1; Google Gemini 2.0 Flash Thinking Experimental (Google, 2024), the API that we distill from. As it has no official evaluation scores, we use the Gemini API to benchmark it ourselves. However, the “recitation error” of the Gemini API makes evaluation challenging. We circumvent this by manually inserting all 30 AIME24 questions in its web interface, where the error does not appear. However, we leave out MATH500 (500 questions) and GPQA Diamond (198 questions), thus they are N.A. in Table 1. Our model, s1-32B, is fully open including weights, reasoning data, and code.
Table 1. s1-32B is an open and sample-efficient reasoning model. We evaluate s1-32B, Qwen, and Gemini (some entries are unknown (N.A.), see $\S4$). Other results are from the respective reports (Qwen et al., 2024; Team, 2024b; OpenAI, 2024; DeepSeek-AI et al., 2025; Labs, 2025; Team, 2025). # ex. = number of examples used for reasoning finetuning; BF = budget forcing.
| Model | # ex. | AIME 2024 | MATH 500 | GPQA Diamond |
|---|---|---|---|---|
| API only | | | | |
| o1-preview | N.A. | 44.6 | 85.5 | 73.3 |
| o1-mini | N.A. | 70.0 | 90.0 | 60.0 |
| o1 | N.A. | 74.4 | 94.8 | 77.3 |
| Gemini 2.0 Flash Think. | N.A. | 60.0 | N.A. | N.A. |
| Open Weights | | | | |
| Qwen2.5-32B-Instruct | N.A. | 26.7 | 84.0 | 49.0 |
| QwQ-32B | N.A. | 50.0 | 90.6 | 65.2 |
| r1 | >800K | 79.8 | 97.3 | 71.5 |
| r1-distill | 800K | 72.6 | 94.3 | 62.1 |
| Open Weights and Open Data | | | | |
| Sky-T1 | 17K | 43.3 | 82.4 | 56.8 |
| Bespoke-32B | 17K | 63.3 | 93.0 | 58.1 |
| s1 w/o BF | 1K | 50.0 | 92.6 | 56.6 |
| s1-32B | 1K | 56.7 | 93.0 | 59.6 |
4.2. Performance
Test-time scaling Figure 1 shows that the performance of s1-32B with budget forcing scales with more test-time compute. In Figure 4 (left), we expand the plot from Figure 1 (middle), showing that while we can improve AIME24 performance using our budget forcing technique (§3) and more test-time compute, it does eventually flatten out at six suppressions. Suppressing the end-of-thinking token delimiter too often can lead the model into repetitive loops instead of continued reasoning. In Figure 4 (right), we show that after training Qwen2.5-32B-Instruct on our 1,000 samples to produce s1-32B and equipping it with the simple budget forcing technique, it operates in a different scaling paradigm. Scaling test-time compute on the base model via majority voting cannot catch up with the performance of s1-32B, which validates our intuition from §3 that sequential scaling is more effective than parallel. We provide example generations of s1-32B in Figure 5.
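The parallel-scaling baseline of Figure 4 (b), majority voting over k independent samples, can be sketched in a few lines (function names are mine):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among k independent samples."""
    return Counter(answers).most_common(1)[0][0]

def vote_accuracy(samples_per_question, references):
    """Benchmark accuracy when majority voting over each question's samples."""
    correct = sum(majority_vote(s) == r
                  for s, r in zip(samples_per_question, references))
    return correct / len(references)
```

Because each of the k samples is drawn independently, the vote cannot build on intermediate results, which is the intuition for why this parallel baseline saturates below sequential budget forcing in Figure 4.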
Sample-efficiency In Figure 2 (right) and Table 1 we compare s1-32B with other models. We find that s1-32B is the most sample-efficient open data reasoning model. It performs significantly better than our base model (Qwen2.5-32B-Instruct) despite just training it on an additional 1,000 samples. The concurrently released r1-32B shows stronger performance than s1-32B while also only using SFT (DeepSeek-AI et al., 2025). However, it is trained on 800× more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples. Finally, our model nearly matches Gemini 2.0 Thinking on AIME24. As s1-32B is distilled from Gemini 2.0, this shows our distillation procedure was likely effective.
Table 2. s1K data ablations. We budget force (BF) a maximum of around 30,000 thinking tokens for all scores in this table. This performs slightly better than the scores without BF (Table 1), as it allows the model to finish with a best guess when stuck in an infinite loop. We report 95% paired bootstrap confidence intervals for differences relative to the s1K model using 10,000 bootstrap samples. E.g., the interval [-13%, 20%] means that, with 95% confidence, the true difference between 59K-full and s1K is between -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%], we can confidently say that the performance is worse than s1K.
| Model | AIME 2024 | MATH 500 | GPQA Diamond |
|---|---|---|---|
| 1K-random | 36.7 [-26.7%, -3.3%] | 90.6 [-4.8%, 0.0%] | 52.0 [-12.6%, 2.5%] |
| 1K-diverse | 26.7 [-40.0%, -10.0%] | 91.2 [-4.0%, 0.2%] | 54.6 [-10.1%, 5.1%] |
| 1K-longest | 33.3 [-36.7%, 0.0%] | 90.4 [-5.0%, -0.2%] | 59.6 [-5.1%, 10.1%] |
| 59K-full | 53.3 [-13.3%, 20.0%] | 92.8 [-2.6%, 2.2%] | 58.1 [-6.6%, 8.6%] |
| s1K | 50.0 | 93.0 | 57.6 |
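The paired bootstrap procedure behind these intervals can be sketched as follows. A minimal sketch assuming per-question 0/1 correctness vectors for the two models being compared; function names are mine:

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, seed=0):
    """95% percentile CI for mean(scores_a) - mean(scores_b).

    Resamples question indices with replacement, keeping the two models'
    scores paired on the same resampled questions.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        diffs.append(da - db)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

Pairing matters: both models are scored on the same resampled questions, so question-level difficulty variation cancels out of the difference.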
samples, a superset of all the 1K-sample versions. This leads to a strong model but uses much more resources. To finetune on 59K samples, we used 394 H100 GPU hours, while s1-32B only required 7 H100 GPU hours. Moreover, relying only on s1K is extremely competitive as shown in $\S2$. Overall, combining all three criteria – Quality, Difficulty, Diversity – via our methodology in $\S2$ is key for sample-efficient reasoning training.
5. Ablations
5.1. Data Quantity, Diversity, and Difficulty
In $\S2$ we outlined our three guiding principles in curating s1K: Quality, Difficulty, and Diversity. Here we test the importance of combining them and the overall efficacy of our selection. Only Quality (1K-random): After obtaining our high-quality reasoning chains from Gemini, we select 1,000 samples at random; not relying on our difficulty and diversity filtering at all. Table 2 shows this approach performs much worse than s1K across all benchmarks. Only Diversity (1K-diverse): For this dataset, we sample uniformly across domains to maximize diversity disregarding any notion of difficulty. This approach also leads to poor performance similar to 1K-random. Only Difficulty (1K-longest): Here we rely on one of our difficulty indicators introduced in $\S2$ by selecting the 1,000 samples with the longest reasoning traces. This approach significantly boosts GPQA performance but overall still falls short of using s1K. Maximize Quantity: Finally, we compare with just training on all of our 59K
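The "1K-longest" selection above can be sketched in a few lines. This is only an illustration: the field name `thinking` and whitespace token counting are assumptions, not the paper's exact pipeline (which works over tokenized Gemini traces).

```python
def select_longest(samples, k=1000):
    """Keep the k samples with the longest reasoning traces, a crude
    difficulty proxy; token counts are approximated by whitespace splitting."""
    return sorted(samples,
                  key=lambda s: len(s["thinking"].split()),
                  reverse=True)[:k]
```

The same pattern applies to any scalar difficulty indicator: replace the sort key with, e.g., a grader-model failure rate.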
5.2. Test-time scaling methods
Table 3. Ablations on methods to scale test-time compute on AIME24. $|{\mathcal{A}}|$ refers to the number of evaluation runs used to estimate the properties; thus a higher value indicates more robustness. Bold indicates our chosen method and the best values. BF $=$ budget forcing, TCC/SCC/CCC $=$ token/step/class-conditional control, $\mathrm{RS}=$ rejection sampling.
| Method | Control | Scaling | Performance | $|\mathcal{A}|$ |
|---|---|---|---|---|
| BF | 100% | 15 | 56.7 | 5 |
| TCC | 40% | -24 | 40.0 | 5 |
| TCC + BF | 100% | 13 | 40.0 | 5 |
| SCC | 9% | 3 | 36.7 | 5 |
| SCC + BF | 100% | 6 | 36.7 | 5 |
| CCC | 50% | 25 | 36.7 | 2 |
| RS | 100% | -35 | 40.0 | 5 |
[Figure panels: AIME24, MATH500, GPQA]
Alice and Bob play the following game. A stack of $n$ tokens lies before them. The players take turns with Alice going first. On each turn, the player removes either 1 token or 4 tokens from the stack. Whoever removes the last token wins. Find the number of positive integers $n$ less than or equal to 2024 for which there exists a strategy for Bob that guarantees that Bob will win the game regardless of Alice’s play.
Find the projection of $\mathbf{a}$ onto $\mathbf{b}=\binom{2}{6}$ if $\mathbf{a}\cdot\mathbf{b}=8$.
The projection of vector $\mathbf{a}$ onto vector $\mathbf{b}$ is given by the formula: $\mathrm{proj}_{\mathbf{b}}\mathbf{a} = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{b}\|^2}\,\mathbf{b}$
A quantum mechanical particle of mass $m$ moves in two dimensions in the following potential, as a function of the polar coordinates $(r,\theta)$ :

Find the energy spectrum. Hint: Write the potential in Cartesian coordinates.
The total number of positive integers $n\leq2024$ that are losing positions is $404+405={\boxed{809}}.$
Budget forcing In Table 3 we compare the test-time scaling methods we have introduced in $\S3$ . Overall, we find that budget forcing provides perfect control, good scaling, and leads to our best AIME24 score. Thus, this is the method we use for s1-32B in Figure 1 and in $\S4$ . In Table 4, we compare different strings for extrapolating performance. We find that “Wait” generally gives the best performance.
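Budget forcing as described above can be sketched in a few lines, assuming a `generate` callable that decodes until the model emits an end-of-thinking delimiter. The delimiter string and the function signature here are illustrative assumptions, not the exact implementation.

```python
def budget_force(generate, prompt, num_ignores=2, end_think="</think>", cue="Wait"):
    """Each time the model tries to end its thinking, strip the
    end-of-thinking delimiter, append the cue string, and keep decoding."""
    text = generate(prompt)
    for _ in range(num_ignores):
        if end_think not in text:
            break  # the model never tried to stop, nothing to suppress
        text = text.split(end_think, 1)[0] + cue
        text += generate(prompt + text)  # continue thinking after the cue
    return text
```

Suppressing the delimiter twice with "Wait" corresponds to the "2x Wait" setting in Table 4; terminating thinking early instead enforces a maximum budget.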
Class-conditional control We provide benchmark scores for this method in $\S D.1$ and summarize three findings here: (1) Token-conditional control fails without budget forcing, as our model cannot reliably count tokens, even when trained to do so. (2) Under step-conditional control, the model generates a similar total number of tokens when given different step targets, as the model goes from few steps with many tokens per step to many steps with few tokens per step. Thus, the model learns to hack its way around the compute constraint, making the controllability of this method mediocre. (3) Class-conditional control can work: telling a model to simply think longer can increase its test-time compute and performance, which leads to good scaling in Table 3.
Table 4. Budget forcing extrapolation ablations. We compare ignoring the end-of-thinking delimiter twice and appending none or various strings.
| Model | AIME 2024 | MATH 500 | GPQA Diamond |
|---|---|---|---|
| No extrapolation | 50.0 | 93.0 | 57.6 |
| 2x without string | 50.0 | 90.2 | 55.1 |
| 2x "Alternatively" | 50.0 | 92.2 | 59.6 |
| 2x "Hmm" | 50.0 | 93.0 | 59.6 |
| 2x "Wait" | 53.3 | 93.0 | 59.6 |
Rejection sampling Surprisingly, we find that simply sampling until the generation fits a specific length leads to an inverse scaling trend, as depicted in Figure 6. In $\S\mathrm{D}.2$ we inspect a question which was answered correctly by the model when rejection sampling with a $\leq4000$ token budget, but not with a $\leq8000$ token budget. In the $\leq4000$ setting the model directly jumps to the correct approach, while in the $\leq8000$ setting it backtracks a lot. We hypothesize that there is a correlation such that shorter generations tend to be the ones where the model was on the right track from the start, whereas longer ones tend to be ones where the model made mistakes and thus backtracks or questions itself. This leads to longer samples often being wrong when rejection sampling, and thus the inverse scaling trend.

Figure 6. Rejection sampling on AIME24 with s1-32B. We sample with a temperature of 1 until all generations have less than (from left to right) 3500, 4000, 5000, 8000, and 16000 thinking tokens requiring an average of 655, 97, 8, 3, 2, and 1 tries per sample.
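The rejection-sampling baseline amounts to resampling at temperature 1 until the thinking trace fits under the token budget. In this hedged sketch, `sample_once` stands in for the model and whitespace splitting stands in for the tokenizer; both are assumptions for illustration.

```python
def rejection_sample(sample_once, budget, max_tries=10000):
    """Draw traces until one has at most `budget` tokens; also report how
    many tries were needed (cf. the average tries per sample in Figure 6)."""
    for tries in range(1, max_tries + 1):
        trace = sample_once()
        if len(trace.split()) <= budget:
            return trace, tries
    raise RuntimeError("no generation fit within the budget")
```

Note that, unlike budget forcing, this procedure only filters by length after the fact, which is exactly why it selects for the failure modes discussed above.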
6. Discussion and related work
6.1. Sample-efficient reasoning
Models There are a number of concurrent efforts to build models that replicate the performance of o1 (OpenAI, 2024). For example, DeepSeek-r1 and k1.5 (DeepSeek-AI et al., 2025; Team et al., 2025) are built with reinforcement learning methods, while others rely on SFT using tens of thousands of distilled examples (Team, 2025; Xu et al., 2025; Labs, 2025). We show that SFT on only 1,000 examples suffices to build a competitive reasoning model matching o1-preview and produces a model that lies on the Pareto frontier (Figure 2). Further, we introduce budget forcing, which combined with our reasoning model leads to the first reproduction of OpenAI's test-time scaling curves (OpenAI, 2024). Why does supervised finetuning on just 1,000 samples lead to such performance gains? We hypothesize that the model is already exposed to large amounts of reasoning data during pretraining, which spans trillions of tokens. Thus, the ability to perform reasoning is already present in our model. Our sample-efficient finetuning stage merely activates it, and we scale it further at test time with budget forcing. This is similar to the "Superficial Alignment Hypothesis" presented in LIMA (Zhou et al., 2023), where the authors find that 1,000 examples can be sufficient to align a model to adhere to user preferences.
Benchmarks and methods To evaluate and push the limits of these models, increasingly challenging benchmarks have been introduced, such as Olympiad-level science competitions (He et al., 2024; Jain et al., 2024; Zhong et al., 2023) and others (Srivastava et al., 2023; Glazer et al., 2024; Su et al., 2024; Kim et al., 2024; Phan et al., 2025). To enhance models’ performance on reasoning-related tasks, researchers have pursued several strategies: Prior works have explored continuing training language models on specialized corpora related to mathematics and science (Azerbayev et al., 2023; Yang et al., 2024), sometimes even synthetically generated data (Yu et al., 2024). Others have developed training methodologies specifically aimed at reasoning performance (Zelikman et al., 2022; 2024; Luo et al., 2025;
Yuan et al., 2025; Wu et al., 2024a). Another significant line of work focuses on prompting-based methods to elicit and improve reasoning abilities, including methods like Chain-of-Thought prompting (Wei et al., 2023; Yao et al., 2023a;b; Bi et al., 2023; Fu et al., 2023; Zhang et al., 2024b; Xiang et al., 2025; Hu et al., 2024). These combined efforts aim to advance the reasoning ability of language models, enabling them to handle more complex and abstract tasks effectively.
6.2. Test-time scaling
Methods As we introduce in $\S3$ , we differentiate two methods to scale test-time compute: parallel and sequential. The former relies on multiple solution attempts generated in parallel and selecting the best outcome via specific criteria. These criteria include choosing the most frequent response for majority voting or the best response based on an external reward for Best-of-N (Brown et al., 2024; Irvine et al., 2023; Snell et al., 2024). Unlike repeated sampling, previous sequential scaling methods let the model generate solution attempts sequentially based on previous attempts, allowing it to refine each attempt based on previous outcomes (Snell et al., 2024; Hou et al., 2025; Lee et al., 2025). Tree-based search methods (Gandhi et al., 2024; Wu et al., 2024b) offer a hybrid approach between sequential and parallel scaling, such as Monte-Carlo Tree Search (MCTS) (Liu et al., 2024; Zhang et al., 2023; Zhou et al., 2024; Choi et al., 2023) and guided beam search (Xie et al., 2023). REBASE (Wu et al., 2024b) employs a process reward model to balance exploitation and pruning during tree search. Empirically, REBASE has been shown to outperform sampling-based methods and MCTS (Wu et al., 2024b). Reward models (Lightman et al., 2023; Wang et al., 2024b;c) play a key role in these methods. They come in two variants: outcome reward models and process reward models. Outcome reward models (Xin et al., 2024; Ankner et al., 2024) assign a score to complete solutions and are particularly useful in Best-of-N selection, while process reward models (Lightman et al., 2023; Wang et al., 2024b; Wu et al., 2024b) assess individual reasoning steps and are effective in guiding tree-based search methods.
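Best-of-N with an outcome reward model reduces to scoring complete candidates and keeping the argmax. In this sketch, `generate` and `reward` are assumed callables standing in for the sampler and the outcome reward model, not any specific library's API.

```python
def best_of_n(generate, reward, prompt, n=16):
    """Parallel scaling: draw n independent solution attempts and return
    the one the outcome reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)
```

A process reward model would instead score partial traces, which is what lets it guide tree search step by step rather than only rank finished solutions.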
Limits to further test-time scaling We have shown that budget forcing allows extrapolating test-time compute in $\S4$, e.g., improving AIME24 performance from $50%$ to $57%$. However, it has two key limitations when scaling further: it eventually flattens out (Figure 4), and the context window of the underlying language model constrains it. Despite these constraints, our work shows test-time scaling across a wide range of accuracies (Figure 1), partly because scaling down test-time compute behaves predictably and does not suffer from these constraints.
Continuing test-time scaling will require approaches that can further extrapolate test-time compute. How can we get such extrapolation? There may be improvements to budget forcing such as rotating through different strings, not only “Wait”, or combining it with frequency penalties or higher temperature to avoid repetitive loops. An exciting direction for future work is also researching whether applying budget forcing to a reasoning model trained with reinforcement learning yields better extrapolation; or if RL allows for new ways of test-time scaling beyond budget forcing. Our work defines the right metrics (§3.2) – Control, Scaling, and Performance – to enable future research and progress on extrapolating test-time compute.
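One simple reading of two of these metrics, given (thinking tokens, accuracy) pairs measured at increasing budgets: Scaling as the average slope between successive budgets, and Performance as the best accuracy reached. This is an illustrative sketch of the flavor of §3.2, not the paper's exact definitions; Control (the fraction of runs respecting the compute constraint) would be computed separately from per-run budget adherence.

```python
def scaling_and_performance(points):
    """points: (thinking_tokens, accuracy) pairs at increasing budgets.
    Scaling = average slope between successive budgets;
    Performance = best accuracy achieved across budgets."""
    pts = sorted(points)
    slopes = [(a2 - a1) / (c2 - c1)
              for (c1, a1), (c2, a2) in zip(pts, pts[1:])]
    return sum(slopes) / len(slopes), max(a for _, a in pts)
```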

Figure 7. Scaling further with parallel scaling methods. All metrics averaged over the 30 questions in AIME24. Average thinking tokens for REBASE do not account for the additional compute from the reward model. For sequential scaling, we prompt the model to use up to (from left to right) 32, 64, 256, and 512 steps. For REBASE and majority voting we generate 16 parallel trajectories to aggregate across.
Parallel scaling as a solution Parallel scaling offers one solution to the limits of sequential scaling, thus we augment our sequentially scaled model with two methods: (I) Majority voting: after generating $k$ solutions, the final solution is the most frequent one across generations; (II) Tree search via REBASE: we use the REBASE process reward model, which is initialized from LLaMA-34B and further finetuned on a synthetic process reward modeling dataset (Wu et al., 2024b). We then aggregate the solutions generated by REBASE via majority voting. As shown in Figure 7, augmenting our model with REBASE scales better than majority voting, and even better than sequential scaling in this scenario. However, REBASE requires an additional forward pass of the reward model at each step, adding some computational overhead. For sequential scaling, when prompted to use up to 512 steps, the model generates a response that exceeds the context window for 12 out of the 30 evaluation questions, leading to a large performance drop. Overall, we find that these parallel scaling methods complement sequential scaling, offering an avenue for scaling test-time compute even further, beyond fixed context windows.
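Majority voting itself is a one-liner; the sketch below assumes the final answers have already been extracted from the $k$ generations.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer across parallel generations."""
    return Counter(answers).most_common(1)[0][0]
```

This is the same aggregation used over the 16 REBASE trajectories in Figure 7.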
Impact Statement
Language models with strong reasoning capabilities have the potential to greatly enhance human productivity, from assisting in complex decision-making to driving scientific breakthroughs. However, recent advances in reasoning, such as OpenAI’s o1 and DeepSeek’s r1, lack transparency, limiting broader research progress. Our work aims to push the frontier of reasoning in a fully open manner, fostering innovation and collaboration to accelerate advancements that ultimately benefit society.
Acknowledgements
This work was partly conducted using the Stanford Marlowe GPU cluster (Kapfer et al., 2025) made possible by financial support from Stanford University. We thank Alexander M. Rush, Andrew Ilyas, Banghua Zhu, Chenglei Si, Chunting Zhou, John Yang, Ludwig Schmidt, Samy Jelassi, Tengyu Ma, Xuechen Li, Yu Sun and Yue Zhang for very constructive discussions.
References
Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Ammanabrolu, P. Critique-out-loud reward models, 2024. URL https://arxiv.org/abs/2408.11791.
Arora, D., Singh, H. G., and Mausam. Have llms advanced enough? a challenging problem solving benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.15074.
Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for mathematics, 2023.
Bi, Z., Zhang, N., Jiang, Y., Deng, S., Zheng, G., and Chen, H. When do program-of-thoughts work for reasoning?, 2023. URL https://arxiv.org/abs/2308.15452.
Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., DiPofi, A., Etxaniz, J., Fattori, B., Forde, J. Z., Foster, C., Hsu, J., Jaiswal, M., Lee, W. Y., Li, H., Lovering, C., Muennighoff, N., Pavlick, E., Phang, J., Skowron, A., Tan, S., Tang, X., Wang, K. A., Winata, G. I., Yvon, F., and Zou, A. Lessons from the trenches on reproducible evaluation of language models, 2024.
Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787.
Cesista, F. L. Multimodal structured generation: Cvpr’s 2nd mmfm challenge technical report, 2024. URL https://arxiv.org/abs/2406.11403.
Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. Theoremqa: A theorem-driven question answering dataset, 2023. URL https://arxiv.org/abs/2305.12524.
Choi, S., Fang, T., Wang, Z., and Song, Y. Kcts: Knowledge-constrained tree search decoding with token-level hallucination detection, 2023. URL https://arxiv.org/abs/2310.09044.
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reasoning, 2023. URL https://arxiv.org/abs/2210.00720.
Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and Goodman, N. D. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683.
Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., Tang, Z., Wang, B., Zan, D., Quan, S., Zhang, G., Sha, L., Zhang, Y., Ren, X., Liu, T., and Chang, B. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024a. URL https://arxiv.org/abs/2410.07985.
Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.
Gao, Z., Niu, B., He, X., Xu, H., Liu, H., Liu, A., Hu, X., and Wen, L. Interpretable contrastive monte carlo tree search reasoning, 2024b. URL https://arxiv.org/abs/2410.01707.
Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V., and Wildon, M. Frontier math: A benchmark for evaluating advanced mathematical reasoning in ai, 2024. URL https://arxiv.org/abs/2411.04872.
Google. Gemini 2.0 flash thinking mode (gemini-2.0-flash-thinking-exp-1219), December 2024. URL https://cloud.google.com/vertex-ai/generative-ai/docs/thinking-mode.
Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K. R., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M. E., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N. A., and Hajishirzi, H. Olmo: Accelerating the science of language models, 2024.
He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. Advancing language model reasoning through reinforcement learning and inference scaling, 2025. URL https://arxiv.org/abs/2501.11651.
Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N. A., and Krishna, R. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URL https://arxiv.org/ abs/2406.09403.
Huang, Z., Wang, Z., Xia, S., Li, X., Zou, H., Xu, R., Fan, R.-Z., Ye, L., Chern, E., Ye, Y., Zhang, Y., Yang, Y., Wu, T., Wang, B., Sun, S., Xiao, Y., Li, Y., Zhou, F., Chern, S., Qin, Y., Ma, Y., Su, J., Liu, Y., Zheng, Y., Zhang, S., Lin, D., Qiao, Y., and Liu, P. Olympic arena: Benchmarking multi-discipline cognitive reasoning for superintelligent AI, 2024a. URL https://arxiv.org/abs/2406.12753.
Huang, Z., Zou, H., Li, X., Liu, Y., Zheng, Y., Chern, E., Xia, S., Qin, Y., Yuan, W., and Liu, P. O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024b. URL https://arxiv.org/abs/2411.16489.
Huang, Z., Geng, G., Hua, S., Huang, Z., Zou, H., Zhang, S., Liu, P., and Zhang, X. O1 replication journey – part 3: Inference-time scaling for medical reasoning, 2025. URL https://arxiv.org/abs/2501.06458.
Irvine, R., Boubert, D., Raina, V., Liusie, A., Zhu, Z., Mudupalli, V., Korshuk, A., Liu, Z., Cremer, F., Assassi, V., Beauchamp, C.-C., Lu, X., Rialan, T., and Beauchamp, W. Rewarding chatbots for real-world engagement with millions of users, 2023. URL https://arxiv.org/abs/2303.06135.
Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, 2020. URL https://arxiv.org/abs/2007.08124.
Liu, J., Cohen, A., Pasunuru, R., Choi, Y., Hajishirzi, H., and Celikyilmaz, A. Don’t throw away your value model! generating more preferable text with value-guided monte-carlo tree search decoding, 2024. URL https://arxiv.org/abs/2309.15028.
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019.
Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., Tang, Y., and Zhang, D. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2025. URL https://arxiv.org/abs/2308.09583.
Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., Gu, Y., Arora, S., Bhagia, A., Schwenk, D., Wadden, D., Wettig, A., Hui, B., Dettmers, T., Kiela, D., Farhadi, A., Smith, N. A., Koh, P. W., Singh, A., and Hajishirzi, H. Olmoe: Open mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2409.02060.
Mathematical Association of America (MAA). AIME, February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions/.
OpenAI. Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., Shi, S., Choi, M., Agrawal, A., Chopra, A., et al. Humanity’s last exam, 2025. URL https://arxiv.org/abs/2501.14249.
Qin, Y., Li, X., Zou, H., Liu, Y., Xia, S., Huang, Z., Ye, Y., Yuan, W., Liu, H., Li, Y., and Liu, P. O1 replication journey: A strategic progress report – part 1, 2024. URL https://arxiv.org/abs/2410.18982.
Qwen: Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
Shi, Q., Tang, M., Narasimhan, K., and Yao, S. Can language models solve olympiad programming?, 2024. URL https://arxiv.org/abs/2404.10952.
Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.
Su, H., Yen, H., Xia, M., Shi, W., Muennighoff, N., yu Wang, H., Liu, H., Shi, Q., Siegel, Z. S., Tang, M., Sun, R., Yoon, J., Arik, S. O., Chen, D., and Yu, T. Bright: A realistic and challenging benchmark for reasoning-intensive retrieval, 2024. URL https://arxiv.org/abs/2407.12883.
Sun, L., Han, Y., Zhao, Z., Ma, D., Shen, Z., Chen, B., Chen, L., and Yu, K. Scieval: A multi-level large language model evaluation benchmark for scientific research, 2024. URL https://arxiv.org/abs/2308.13149.
Team, D. Deepseek r1, November 2024a. URL https://x.com/deepseek_ai/status/1859200141355536422.
Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., Tang, C., Wang, C., Zhang, D., Yuan, E., Lu, E., Tang, F., Sung, F., Wei, G., Lai, G., Guo, H., Zhu, H., Ding, H., Hu, H., Yang, H., Zhang, H., Yao, H., Zhao, H., Lu, H., Li, H., Yu, H., Gao, H., Zheng, H., Yuan, H., Chen, J., Guo, J., Su, J., Wang, J., Zhao, J., Zhang, J., Liu, J., Yan, J., Wu, J., Shi, L., Ye, L., Yu, L., Dong, M., Zhang, N., Ma, N., Pan, Q., Gong, Q., Liu, S., Ma, S., Wei, S., Cao, S., Huang, S., Jiang, T., Gao, W., Xiong, W., He, W., Huang, W., Wu, W., He, W., Wei, X., Jia, X., Wu, X., Xu, X., Zu, X., Zhou, X., Pan, X., Charles, Y., Li, Y., Hu, Y., Liu, Y., Chen, Y., Wang, Y., Liu, Y., Qin, Y., Liu, Y., Yang, Y., Bao, Y., Du, Y., Wu, Y., Wang, Y., Zhou, Z., Wang, Z., Li, Z., Zhu, Z., Zhang, Z., Wang, Z., Yang, Z., Huang, Z., Huang, Z., Xu, Z., and Yang, Z. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599.
Team, N. Sky-t1: Fully open-source reasoning model with o1-preview performance in \$450 budget, 2025. URL https://novasky-ai.github.io/posts/sky-t1. Accessed: 2025-01-09.
Team, Q. Qwq: Reflect deeply on the boundaries of the unknown, November 2024b. URL https://qwenlm.github.io/blog/qwq-32b-preview/.
Wang, J., Meng, F., Liang, Y., and Zhou, J. Drt-o1: Optimized deep reasoning translation via long chain-of-thought, 2024a. URL https://arxiv.org/abs/2412.17498.
Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b. URL https://arxiv.org/abs/2312.08935.
Wang, S., Liu, Z., Zhong, W., Zhou, M., Wei, Z., Chen, Z., and Duan, N. From LSAT: The progress and challenges of complex reasoning, 2021. URL https://arxiv.org/abs/2108.00648.
Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. Helpsteer2: Open-source dataset for training top-performing reward models, 2024c. URL https://arxiv.org/abs/2406.08673.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
Welleck, S., Bertsch, A., Finlayson, M., Schoelkopf, H., Xie, A., Neubig, G., Kulikov, I., and Harchaoui, Z. From decoding to meta-generation: Inference-time algorithms for large language models, 2024. URL https://arxiv.org/abs/2406.16838.
Wu, T., Lan, J., Yuan, W., Jiao, J., Weston, J., and Sukhbaatar, S. Thinking llms: General instruction following with thought generation, 2024a. URL https://arxiv.org/abs/2410.10630.
Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, 2024b. URL https://arxiv.org/abs/2408.00724.
Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., Castricato, L., Franken, J.-P., Haber, N., and Finn, C. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought, 2025. URL https: //arxiv.org/abs/2501.04682.
Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. Self-evaluation guided beam search for reasoning, 2023. URL https://arxiv.org/abs/2305.00633.
Contents
1 Introduction
2 Reasoning data curation to create s1K
3 Test-time scaling
4 Results
5 Ablations
6 Discussion and related work
A Evaluation determinism
B s1K details
C Training details
A. Evaluation determinism
We run our evaluations using vLLM (Kwon et al., 2023) as it is faster than the alternatives we tried. However, we find that even when using the same random seeds and greedy sampling, evaluation scores can change significantly across runs:
As our model generates long reasoning traces prior to its answer, small numeric changes can snowball into large differences. We encounter many generations that are exactly the same for thousands of tokens and then suddenly differ in a single token, eventually ending up with an entirely different answer. To partly counter this issue, we generally run our final evaluations in full precision unless otherwise indicated.
B. s1K details
B.1. s1K summary
Table 5. Summary of our dataset s1K. Token count measured by the Qwen-2.5 tokenizer. We prompt Claude to produce keywords given several questions from the domain.
| Domain | Questions | Total tokens | Keywords |
|---|---|---|---|
| Geometry | 109 | 560.2K | area, triangle, distance |
| Number theory | 98 | 522.5K | sequences, divisibility |
| Combinatorics | 75 | 384.7K | permutations, counting |
| Real functions | 43 | 234.8K | trigonometric, calculus |
| Biology | 41 | 120.9K | organic reactions |
| Complex functions | 32 | 170.2K | complex roots |
| Quantum theory | 32 | 127.9K | particles, wave functions |
| Field theory | 28 | 150.1K | polynomials, roots |
| Calculus of variations | 28 | 155.5K | optimization, control |
| Difference equations | 24 | 132.5K | recurrence, recursion |
| Electromagnetic theory | 23 | 95.8K | optics, waves, diffraction |
| Group theory | 22 | 100.0K | groups, automorphisms |
| Linear algebra | 22 | 128.3K | matrices, determinants |
| Probability theory | 20 | 114.6K | random walks, expectation |
| Algebraic systems | 19 | 109.9K | functional equations |
| Mechanics | 19 | 103.6K | forces, motion, energy |
| Thermodynamics | 19 | 74.2K | heat engines, entropy |
| Differential equations | 18 | 89.6K | substitution, existence |
| Computer science | 18 | 34.2K | complexity theory, algorithms |
| Numerical analysis | 18 | 76.5K | error analysis, stability |
| Calculus | 17 | 96.3K | convergence, summation |
| Algebraic structures | 17 | 90.4K | inequalities, sets |
| Astronomy | 16 | 37.7K | star clusters, orbits |
| Remaining 27 domains | 242 | 982.2K | domains with ≤ 16 questions each |
| All 51 domains | 1000 | 4.7M | s1K |
B.2. Dataset composition for full 59K questions
Table 6. Composition of full 59K questions. Thinking and response lengths are measured in tokens using the Qwen2.5-32B-Instruct tokenizer (Qwen et al., 2024). In addition to excluding our evaluation benchmark, AIME24, we also exclude AIME questions from 2022-2023 as we use these 90 questions during our development stage of s1-32B.
| Source | Description | Samples | Avg. thinking tokens |
|---|---|---|---|
| NuminaMATH (Li et al., 2024) | Math problems from online websites | 30660 | 4.1K |
| MATH (Hendrycks et al., 2021) | Math problems from competitions | 11999 | 2.9K |
| OlympicArena (Huang et al., 2024a) | Astronomy, biology, chemistry, computer science, geography, math, and physics olympiad problems | 4250 | 3.2K |
| OmniMath (Gao et al., 2024a); AGIEval (Zhong et al., 2023; Ling et al., 2017; Hendrycks et al., 2021; Liu et al., 2020; Zhong et al., 2019; Wang et al., 2021) | Math competition problems | 4238 | 4.4K |
| xword | Crossword puzzles | 999 | 0.7K |
| OlympiadBench (He et al., 2024) | Math and physics olympiad problems | 896 | 3.9K |
| AIME (1983-2021) | American Invitational Mathematics Examination | 890 | 4.7K |
| TheoremQA (Chen et al., 2023) | University-level problems in computer science, finance, math, and physics | 747 | 2.1K |
| USACO (Shi et al., 2024) | Code problems from the USA Computing Olympiad | 519 | - |
| JEEBench (Arora et al., 2023) | Chemistry, math, and physics problems from the IIT entrance exam | 515 | 3.6K |
| GPQA (Rein et al., 2023); SciEval (Sun et al., 2024) | PhD-level science questions | 348 | 2.9K |
| s1-prob; LiveCodeBench (Jain et al., 2024) | Stanford statistics qualifying exam questions; code problems from coding websites (LeetCode, AtCoder, and CodeForces) | 182 | 4.0K |
| s1-teasers | Math brain teasers scraped from the internet | 23 | 4.1K |
| All 59K questions | Combination of the above datasets with reasoning traces and solutions | 59029 | 3.6K |
B.3. s1K grading prompt
To grade whether an example is correct for our dataset selection in $\S2$ , we use the prompt in Figure 8.
为了评估 $\S2$ 数据集选择中示例的正确性,我们使用了图 8 中的提示。
You are an AI assistant for grading a science problem. The user will provide you with the question itself, an attempt made by a student and the correct answer to the problem. Your job is to judge whether the attempt is correct by comparing it with the correct answer. If the expected solution concludes with a number or choice, there should be no ambiguity. If the expected solution involves going through the entire reasoning process, you should judge the attempt based on whether the reasoning process is correct with correct answer if helpful.
The user will provide the attempt and the correct answer in the following format:
Explain your reasoning, and end your response on a new line with only "Yes" or "No" (without quotes).
Figure 8. Grading prompt.
B.4. s1K diversity selection
Algorithm 1 provides our algorithm for selecting data in our diversity selection stage. As mentioned in $\S2$ , we also include samples from some specific benchmarks we perceive as high-quality. None of the samples overlap with our final evaluation.
B.5. Decontamination
We filter all samples by checking for an 8-gram overlap between the selected examples and the evaluation benchmarks: MATH500, GPQA Diamond, and AIME24. We exclude questions with more than an 8-gram overlap.
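The 8-gram check above can be sketched as follows. This is an illustrative approximation that tokenizes on whitespace (the paper does not specify the tokenization used for n-gram matching), not the exact decontamination script:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams of `text`, lowercased and split on whitespace."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_questions: list, n: int = 8) -> bool:
    """True if `sample` shares at least one n-gram with any benchmark question."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(q, n) for q in benchmark_questions)
```

Samples for which `is_contaminated` returns True would be dropped from the training pool.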
Algorithm 1 Two-stage sampling for s1K
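The algorithm body itself is not reproduced in this extraction. As a hedged illustration only, and not the paper's exact procedure, a generic two-stage sampler first groups the pool by domain and then repeatedly draws a random domain followed by one question from it, which spreads a fixed budget across many domains:

```python
import random
from collections import defaultdict

def two_stage_sample(pool, budget=1000, seed=0):
    """Stage 1: group (domain, question) pairs by domain.
    Stage 2: pick a uniformly random domain, then one question from it,
    until `budget` questions are selected or the pool is exhausted.
    Hypothetical sketch; the paper's Algorithm 1 may weight questions differently."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for domain, question in pool:
        by_domain[domain].append(question)
    selected = []
    while len(selected) < budget and by_domain:
        domain = rng.choice(sorted(by_domain))  # sorted for determinism
        questions = by_domain[domain]
        selected.append((domain, questions.pop(rng.randrange(len(questions)))))
        if not questions:
            del by_domain[domain]
    return selected
```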
C. Training details
We take a model that has already been pretrained and instruction tuned and further finetune it for reasoning. Specifically, we use Qwen2.5-32B-Instruct (Qwen et al., 2024), which on math tasks generally matches or outperforms the larger Qwen2.5-72B-Instruct (Qwen et al., 2024) and other open models (Dubey et al., 2024; Groeneveld et al., 2024; Muennighoff et al., 2024). We use token delimiters to separate the thinking stage from the answering stage: the thinking stage is enclosed between <|im_start|>think and <|im_start|>answer, each preceded and followed by a newline. Samples from our dataset are in §C.2. We use basic finetuning hyperparameters: we train for 5 epochs with a batch size of 16, for a total of 315 gradient steps. We train in bfloat16 precision with a learning rate of $1e-5$, warmed up linearly for $5%$ of training (16 steps) and then decayed to 0 over the rest of training (299 steps) following a cosine schedule. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and weight decay of $1e-4$. We do not compute loss on questions, only on reasoning traces and solutions. We ensure the sequence length is large enough to avoid cutting off any samples, a setting we ablate in §C.1. Training takes just 26 minutes on 16 NVIDIA H100 GPUs.
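The schedule above (peak $1e-5$, 16 linear warmup steps, cosine decay to 0 over the remaining 299 of 315 steps) can be written out as follows; the exact step-indexing convention of the training framework is an assumption:

```python
import math

def lr_at_step(step: int, total_steps: int = 315, warmup_steps: int = 16,
               peak_lr: float = 1e-5) -> float:
    """Learning rate at a 0-indexed optimizer step: linear warmup to peak_lr
    over warmup_steps, then cosine decay to 0 over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```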

Figure 9. Training dynamics of s1-32B on s1K.
C.1. Training Ablations: Sequence length
Table 7. Training sequence length ablation. We report “accuracy / average thinking tokens per sample”; the higher the accuracy and the fewer the thinking tokens (inference cost) the better.
| | Model A | Model B |
|---|---|---|
| Training sequence length | 4096 | 32768 |
| Training samples cut off | 74% | 0% |
| AIME24 | 30.0% / 20721 | 50.0% / 6984 |
| MATH500 | 90.0% / 5324 | 91.0% / 3268 |
| GPQA | 52.5% / 6841 | 53.0% / 3568 |
Besides our scaling ablations in $\S5.2$, the main training hyperparameter we ablate is the sequence length used during training. We find that a shorter training sequence length leads to longer reasoning traces at test time. This is because with a shorter sequence length, the answer section of a training sample is more often cut off. Conversely, with a longer training sequence length, more samples appear in their entirety, including the section where the model answers. The model thus receives more gradient updates in which it learns to generate an answer following its reasoning chain. This in turn leads to a higher log probability of the answer section at any point during generation and thus shorter reasoning traces at test time. Performance-wise, we also find that the model trained with the longer sequence length performs better. We therefore opt for the longest training sequence length, as it leads to better performance and makes inference more efficient via shorter reasoning traces.
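To make the truncation effect concrete, the following toy calculation (illustrative only; the lengths are hypothetical) counts how many answer tokens still receive gradient when a sample with given question/thinking/answer lengths is cut off at the training sequence length; questions are loss-masked, so only thinking and answer tokens matter:

```python
def supervised_answer_tokens(question_len: int, thinking_len: int,
                             answer_len: int, max_seq_len: int) -> int:
    """Answer tokens that survive truncation at max_seq_len and thus
    receive gradient (the question is loss-masked either way)."""
    total = question_len + thinking_len + answer_len
    return max(0, min(max_seq_len, total) - question_len - thinking_len)

# With a long thinking trace, a 4096 cutoff can remove the whole answer,
# while 32768 keeps it intact.
```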
C.2. Training Samples
Table 8, Table 9, and Table 10 contain training samples from s1K.
Table 8. s1K sample stemming from s1-prob. The black text is the question and the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
You are given integrable random variables $X,Y_{0}$ and $Z_{0}$ on the same probability space $(\Omega,\mathcal{F},\mathbb{P})$, and two $\sigma$-algebras $\mathcal{A}\subset\mathcal{F}$, $\mathcal{B}\subset\mathcal{F}$. For $k=1,2,\ldots,$ let
$$Y_{k}=\mathbb{E}\left[X\mid\sigma(\mathcal{A},Z_{0},\ldots,Z_{k-1})\right],\qquad Z_{k}=\mathbb{E}\left[X\mid\sigma(\mathcal{B},Y_{0},\ldots,Y_{k-1})\right].$$
(a) Show that there exist integrable random variables $Y_{\infty}$ and $Z_{\infty}$ such that as $n\to\infty$ both $\mathbb{E}[|Y_{n}-Y_{\infty}|]\to0$ and $\mathbb{E}[|Z_{n}-Z_{\infty}|]\to0$. (b) Prove that almost surely $Y_{\infty}=Z_{\infty}$.
- Introduce the notations $\mathcal{F}_{n}:=\sigma(\mathcal{A},Z_{0},\ldots,Z_{n-1})$ and $\mathcal{G}_{n}:=\sigma(\mathcal{B},Y_{0},\ldots,Y_{n-1})$ for all $n\geq1$, and $\mathcal{F}_{0}:=\mathcal{A}$, $\mathcal{G}_{0}:=\mathcal{B}$. Clearly, both are filtrations and by definition, $Y_{n}=\mathbb{E}\left[X|\mathcal{F}_{n}\right]$ and $Z_{n}=\mathbb{E}\left[X|\mathcal{G}_{n}\right]$ for all $n\geq0$. Thus $\{Y_{n}\}_{n\geq0}$ (or $\{Z_{n}\}_{n\geq0}$) is a Doob martingale with respect to the filtration $\{\mathcal{F}_{n}\}_{n\geq0}$ (or $\{\mathcal{G}_{n}\}_{n\geq0}$) and hence is uniformly integrable, $X$ being integrable. Now apply the fact that a martingale is uniformly integrable if and only if it converges in $L^{1}$ and conclude.
- We provide two different proofs.
- Observe the following identity, which follows from the Tower Property.

Note that $\mathbb{E}\left|\mathbb{E}\left[Y_{n}|\mathcal{G}_{n}\right]-\mathbb{E}\left[Y_{\infty}|\mathcal{G}_{n}\right]\right|\leq\mathbb{E}|Y_{n}-Y_{\infty}|=o(1)$ and by Levy’s Upward Theorem, $\mathbb{E}\left[Y_{\infty}|\mathcal{G}_{n}\right]\overset{L^{1}}{\longrightarrow}\mathbb{E}\left[Y_{\infty}|\mathcal{G}_{\infty}\right]$, where $\mathcal{G}_{\infty}=\sigma\left(\cup_{n\geq0}\mathcal{G}_{n}\right)=\sigma(\mathcal{B},Y_{0},Y_{1},\ldots)$. It is evident that $Y_{\infty}\in m\mathcal{G}_{\infty}$ (since it is the almost sure limit of $Y_{n}\in m\mathcal{G}_{n}$) and hence $\mathbb{E}\left[Y_{\infty}|\mathcal{G}_{\infty}\right]=Y_{\infty}$ almost surely. Combining the above observations we conclude that $\mathbb{E}\left[Y_{n}|\mathcal{G}_{n}\right]\overset{L^{1}}{\longrightarrow}Y_{\infty}$. Similarly, $\mathbb{E}\left[Z_{n}|\mathcal{F}_{n}\right]\overset{L^{1}}{\longrightarrow}Z_{\infty}$. Now invoke Equation (4) to conclude that $Y_{\infty}=Z_{\infty}$ almost surely.
- Since $X$ is integrable, we have by Levy’s Upward Theorem, $Y_{\infty}=\operatorname*{lim}_{n\to\infty}Y_{n}=\operatorname*{lim}_{n\to\infty}\mathbb{E}\left[X|\mathcal{F}_{n}\right]=\mathbb{E}\left[X|\mathcal{F}_{\infty}\right]$. Similarly, $Z_{\infty}=\mathbb{E}\left[X|\mathcal{G}_{\infty}\right]$. On the other hand, in the first proof we observed that $Y_{\infty}\in m\mathcal{G}_{\infty}$ and hence $Y_{\infty}\in m\mathcal{G}_{\infty}\cap m\mathcal{F}_{\infty}=m\left(\mathcal{G}_{\infty}\cap\mathcal{F}_{\infty}\right)$. This yields the following.

Similarly, $Z_{\infty}=\mathbb{E}\left[X|\mathcal{G}_{\infty}\cap\mathcal{F}_{\infty}\right]$. This proves that $Y_{\infty}=Z_{\infty}$ almost surely.
The sequence of sigma-algebras $\sigma(\mathcal{A},Z_{0},\ldots,Z_{k-1})$ is increasing. By the Martingale Convergence Theorem (for conditional expectations), since $X$ is integrable, the sequence $Y_{k}$ converges almost surely and in $L^{1}$ to $\mathbb{E}[X|\sigma(\mathcal{A},Z_{0},Z_{1},\ldots)]$. Let $Y_{\infty}=\mathbb{E}[X|\sigma(\mathcal{A},Z_{0},Z_{1},\ldots)]$.
Similarly, for the $Z_{k}$ sequence: $Z_{1}=\mathbb{E}[X|\sigma(\mathcal{B})]$, $Z_{2}=\mathbb{E}[X|\sigma(\mathcal{B},Y_{0})]$, $Z_{3}=\mathbb{E}[X|\sigma(\mathcal{B},Y_{0},Y_{1})]$, ...
Table 9. s1K sample stemming from s1-teasers. The black text is the question and the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
Alice and Bob each have a coin and flip it until they get a heads. If Bob flipped his coin more times than Alice, what is the expected number of times Alice flipped her coin?
4/3
4/3
$P(X=x)=(1-p)^{x-1}p.$ With $p=\frac{1}{2}$, so that $P(A=a)=P(B>a)=\left(\frac{1}{2}\right)^{a}$,
$$
P(B>A)=\sum_{a=1}^{\infty}P(A=a)\,P(B>a)=\sum_{a=1}^{\infty}\left(\tfrac{1}{2}\right)^{a}\left(\tfrac{1}{2}\right)^{a}=\sum_{a=1}^{\infty}\left(\tfrac{1}{4}\right)^{a}=\frac{1}{3},
\qquad
\mathbb{E}[A\mid B>A]=\frac{\sum_{a=1}^{\infty}a\left(\tfrac{1}{4}\right)^{a}}{P(B>A)}=\frac{4/9}{1/3}=\frac{4}{3}.
$$
Solution:
解决方案:
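As a sketch (ours, not part of the sample), the conditional expectation above can be checked numerically by truncating the geometric series with $p=\frac{1}{2}$:

```python
# Numeric check of E[A | B > A] for A, B ~ Geometric(p) with p = 1/2.
# P(A=a) = (1-p)^(a-1) p and P(B>a) = (1-p)^a; the series is truncated at 200
# terms, which is far beyond floating-point precision for ratio 1/4.
p = 0.5
num = sum(a * (1 - p) ** (a - 1) * p * (1 - p) ** a for a in range(1, 200))
den = sum((1 - p) ** (a - 1) * p * (1 - p) ** a for a in range(1, 200))
expected_flips = num / den  # should be 4/3
```

The ratio reproduces the stated answer of 4/3.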
Table 10. s1K sample stemming from the astronomy subset of Olympic Arena. The black text is the question and the blue text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.
表 10: 源自 Olympic Arena 天文学子集的 s1K 样本。黑色文本是问题,蓝色文本是解决方案。仅根据问题,我们使用 Gemini 生成浅蓝色推理轨迹和蓝色响应,用于训练。
An often-repeated fun fact is that humans produce more power per unit volume than stars. If the sun were the same size, but it produced the same amount of power per unit volume as a human, what would its surface temperature be? Assume the "average human" produces 100 watts of power and has a volume of 66400 cubic centimeters.
一个常常被提及的有趣事实是,人类每单位体积产生的能量比恒星还要多。如果太阳的大小不变,但它每单位体积产生的能量与人类相同,那么它的表面温度会是多少?假设“普通人”产生100瓦特的功率,体积为66400立方厘米。
A: 3500 K B: 10000 K C: 25000 K D: 40000 K E: 50000 K
A: 3500 K B: 10000 K C: 25000 K D: 40000 K E: 50000 K
Using the numbers from the problem, the average human produces a power per unit volume of
根据题目中的数字,普通人每单位体积产生的功率为
$$
u=\frac{100\,\mathrm{W}}{66400\,\mathrm{cm}^{3}}=\frac{100\,\mathrm{W}}{6.64\times10^{-2}\,\mathrm{m}^{3}}\approx1506\,\mathrm{W\,m^{-3}}.
$$
The volume of the sun is $\frac{4}{3}\pi R_{\odot}^{3}$, so its new power output would be $P=\frac{4}{3}\pi R_{\odot}^{3}u$. To find the equilibrium temperature, we use
太阳的体积为 $\frac{4}{3}\pi R_{\odot}^{3}$,因此它的新功率输出为 $P=\frac{4}{3}\pi R_{\odot}^{3}u$。为了找到平衡温度,我们使用
$$
\frac{4}{3}\pi R_{\odot}^{3}u=4\pi R_{\odot}^{2}\sigma T^{4}.
$$
Solving for $T$ , we get
求解 $T$,我们得到
$$
T=\sqrt[4]{\frac{R_{\odot}u}{3\sigma}}=\sqrt[4]{\frac{\left(6.96\times10^{8}\right)\left(1506\right)}{3\left(5.67\times10^{-8}\right)}}=49823\approx50000\,\mathrm{K}.
$$
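The arithmetic can be sanity-checked with a short script (ours, not part of the training sample); the constants are the problem's values together with the standard $R_{\odot}$ and $\sigma$ used above:

```python
# Reproduce the power density and equilibrium temperature from the worked solution.
P_human = 100.0        # W, power output of the "average human"
V_human = 66400e-6     # m^3 (66400 cm^3)
u = P_human / V_human  # power per unit volume, ~1506 W/m^3

R_sun = 6.96e8         # m, solar radius
sigma = 5.67e-8        # W m^-2 K^-4, Stefan-Boltzmann constant

# (4/3) pi R^3 u = 4 pi R^2 sigma T^4  =>  T = (R u / (3 sigma))^(1/4)
T = (R_sun * u / (3 * sigma)) ** 0.25  # ~49800 K, i.e. answer E
```

The result lands on roughly 50000 K, matching choice E.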
The problem asks for the surface temperature of the sun if it had the same size but produced the same power per unit volume as a human.
问题要求计算:如果太阳大小不变,但每单位体积产生的功率与人类相同,太阳的表面温度是多少。
Solution:
解决方案:
D. Test-time scaling details
D. 测试时缩放细节
D.1. Sequential scaling ablations
D.1. 顺序缩放消融实验
| user | user |
|---|---|
| What is the answer to life, the universe, and everything? | What is the answer to life, the universe, and everything? |
| Think for up to 2048 tokens. | Think for up to 64 steps. |
| assistant | assistant |
| think | 64 steps left |
| Let me break this down into the three parts it asks about: | Let me break this down into the three parts it asks about: |
| 1) Life 2) The universe 3) Everything | 1) Life 2) The universe 3) Everything |
| Let me start with life... | 63 steps left |
| | Let me start with life... |
| answer | answer |
| The answer is... | The answer is... |
Table 11. Scaling thinking time via tokens-conditional control. All metrics are averaged over the 30 questions in AIME24.
表 11. 通过 Token 条件控制扩展思考时间。所有指标均在 AIME24 的 30 个问题上取平均值。
| Tokens instructed (→) | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Test-time without intervention | | | | | |
| Thinking tokens | 7939 | 7158 | 8263 | 7108 | |
| Answer tokens | 689 | 669 | 659 | 722 | |
| AIME24 | 26.7 | 30.0 | 33.3 | 33.3 | |
| Thinking forcefully ended once the token budget is reached | | | | | |
| Thinking tokens | 1024 | 2048 | 4031 | 5664 | |
| Answer tokens | 15 | 15 | 142 | 722 | |
| AIME24 | 3.3 | 30.0 | 33.3 | 33.3 | |
Token-conditional control One general approach is to simply tell the model in the prompt precisely how many tokens it should generate. Ideally, the model keeps track of its token count and adjusts its generation to finish within the desired limit. We experiment with this approach by training a model with token instructions using the format in Figure 10 (left). We bucket the lengths of the reasoning traces from our 1,000 training examples into powers of two (rounded upwards) and add a corresponding instruction to the user prompt. For example, if the instruction says “Think for up to 2048 tokens”, then the reasoning trace has anywhere between 1024 and 2048 tokens. In Table 11, we show that after training the model hardly follows the token instruction. It does sometimes generate more tokens when given a higher limit, but it often overshoots the limit. This may not be unique to our model, as prior work suggests that OpenAI o1-mini cannot follow token instructions either (Zhang & Chen, 2024). To prevent exceeding the limit, we test budget forcing: ending the thinking once the limit is reached. This leads to perfect control (Table 11, lower). With budget forcing, the scaling trend is also clearer, as the model can no longer overshoot the limit when given a small thinking budget. This leads to better test-time scaling values for Token prompting + budget forcing in Table 3. To compute the Control metric reported in Table 3 for token-conditional control variants, we divide the number of times the thinking tokens in Table 11 are less than the upper limit by the total number of evaluations (2/5 without intervention; 5/5 with intervention).
Token-conditional control
一种通用方法是在提示中明确告诉模型应该生成多少个Token。理想情况下,模型可以跟踪其Token计数并调整生成过程,以在所需限制内完成。我们通过使用图 10(左)中的格式对模型进行Token指令训练来实验这种方法。我们将 1,000 个训练示例中的推理轨迹长度按 2 的幂次方(向上取整)进行分桶,并在用户提示中添加相应的指令。例如,如果指令说“最多思考 2048 个Token”,则推理轨迹的Token数介于 1024 到 2048 之间。在表 11 中,我们展示了训练后模型几乎不遵循Token指令。虽然有时在给定较高限制时会生成更多Token,但通常会超出限制。这可能不仅仅是我们的模型独有的问题,因为之前的研究表明 OpenAI o1-mini 也无法遵循Token指令 (Zhang & Chen, 2024)。为了防止超过限制,我们测试了在达到限制时强制结束思考的预算强制方法。这实现了完美的控制(表 11(下))。通过预算强制,扩展趋势也更清晰,因为当给定较小的思考预算时,模型不能再超出限制。这导致在表 3 中Token Prompting $^+$ budget forcing 的测试时扩展值更好。为了计算表 3 中报告的Token条件控制变体的控制率,我们将表 11 中思考Token数小于上限的次数除以总评估次数(无干预时为 2/5;有干预时为 5/5)。
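As a concrete illustration of the intervention, here is a minimal sketch (not the authors' implementation; `next_token` and the `</think>` delimiter are stand-ins) of cutting thinking off at a token budget and forcing the transition to the answer:

```python
# Sketch of budget forcing at decode time: stop the thinking phase once the
# token budget is exhausted (or the model ends it itself), then emit the
# end-of-thinking delimiter so the model moves into answering mode.
def budget_forced_generate(next_token, prompt, budget, end_think="</think>"):
    """next_token(tokens) -> str is a stand-in for one decoding step."""
    tokens = list(prompt)
    thinking = []
    while len(thinking) < budget:
        tok = next_token(tokens + thinking)
        if tok == end_think:  # model ended its thinking on its own
            break
        thinking.append(tok)
    # Force the transition to the answer once the budget is reached.
    return thinking + [end_think]

# Dummy "model" that never stops thinking: the budget caps it at 3 tokens.
capped = budget_forced_generate(lambda toks: "step", [], budget=3)
```

The complementary intervention described in the paper, appending "Wait" to lengthen thinking, would hook into the same loop by replacing the model's end-of-thinking token instead of honoring it.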
Step-conditional control Token instructions fail because current models cannot count tokens. To accommodate this lack of capability, we experiment with making the counting more coarse-grained. We partition the reasoning traces into steps and ask the model to think for a specific number of steps rather than tokens. We split our reasoning traces on double newlines into steps, which we find act as intuitive separators based on manual inspection of samples. We bucket our training samples into powers of 2 depending on their number of steps and add a corresponding step instruction following the format in Figure 10 (right). This format is based on early experiments, where we found the model to be more likely to adhere to the step limit when counting down (“3 steps left...2 steps left”) rather than counting up (“Step 2...Step 3...”). This is likely because when counting down, the final step is always 1, which acts as a strong prior for the model to finish its generation. When counting up, the final step before the answer varies, so if the model does not remember the original step instruction, it may fail to stop. We conclude the following from our results in Table 12: (1) The model still struggles to adhere to the step limit. It sometimes simply continues counting into negative steps, e.g. “-1 steps left”. To solve this issue, we automatically stop the thinking process once 0 steps are reached and then force the model to transition to answering mode by appending the answer token delimiter (§3). This leads to perfect step adherence (lower half of Table 12), yet problems remain. (2) The model compensates for fewer steps by making each step longer. For example, when forced to use up to 16 steps vs. 256 steps, the model generates an average of 96 tokens per step vs. 56. Despite this issue, more steps still clearly correlate with more total thinking tokens in Table 12 and better performance, leading to a positive slope. (3) Step instructions are more costly than other methods. The step delimiters require around 6 tokens each, which for, e.g., 64 steps adds up to around 380 tokens in total. Even when ignoring the step delimiters in token counts as in Table 12, the model still requires 7551 thinking tokens on average to achieve only 33.3% on AIME24. To compute the Control metric reported in Table 3 for step-conditional control variants, we first set an upper limit of 100 tokens per step and then multiply this number by the number of steps instructed to arrive at a proxy total token limit, e.g. 1600 for 16 steps instructed. We then check whether the thinking tokens in Table 12 fit within the respective limit for each evaluation run (3/5 without intervention; 5/5 with intervention). For the model in Figure 7, we use a model with step-conditional control trained on an earlier version of our data and evaluated with an earlier version of our evaluation codebase.
基于步骤的条件控制 Token 指令失效,因为当前模型无法计数 Token。为了解决这一能力的不足,我们尝试将计数变得更加粗粒度。我们将推理轨迹划分为步骤,并要求模型思考特定数量的步骤,而不是 Token。我们根据手动检查的样本,将推理轨迹按双换行符分割为步骤,这些步骤作为直观的分隔符。我们根据步骤数量将训练样本按 2 的幂次进行分组,并按照图 10(右)中的格式添加相应的步骤指令。这种格式基于早期实验,我们发现模型在倒数(“3 步剩余…2 步剩余”)时比正数(“第 2 步…第 3 步…”)更有可能遵守步骤限制。这可能是因为如果倒数,最后一步始终是 1,这将作为模型完成生成的强烈先验。如果正数,答案前的最后一步会变化,因此如果模型不记得最初的步骤指令,它可能无法停止。我们从表 12 的结果中得出以下结论:(1)模型仍然难以遵守步骤限制。模型有时会简单地继续计数到负步骤,例如“-1 步剩余”。为了解决这个问题,一旦达到 0 步,我们自动停止思考过程,然后通过附加答案 Token 分隔符强制模型切换到回答模式(§3)。这导致了完美的步骤遵守(表 12 的下半部分),但问题依然存在。(2)模型通过将每一步加长来弥补步骤的减少。例如,当被迫使用最多 16 步与 256 步时,模型生成的每步平均 Token 数为 96 与 56。尽管存在这个问题,表 12 中更多的步骤仍然明显与更多的总思考 Token 数和更好的性能相关,导致正斜率。(3)步骤指令比其他方法成本更高。步骤分隔符每个大约需要 6 个 Token,例如 64 步总共有大约 380 个 Token。当像表 12 那样忽略步骤分隔符的 Token 计数时,模型平均仍然需要 7551 个思考 Token 才能在 AIME24 上仅达到 $33.3%$。为了计算表 3 中报告的基于步骤的条件控制变体的 Control,我们首先确定每步 100 个 Token 为上限,然后将该数乘以指示的步骤数以得出代理总 Token 限制,例如,指示 16 步的 1600 个 Token。然后我们检查表 12 中的思考 Token 是否符合每次评估运行的相应限制(无干预时为 3/5;有干预时为 5/5)。对于图 7 中的模型,我们使用一个基于早期版本数据和早期版本评估代码库训练的基于步骤的条件控制模型。
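The preprocessing just described can be sketched as follows (our helper names; the countdown marker wording follows the "N steps left" format above, and the exact training format is an assumption):

```python
# Hypothetical sketch of the step-conditional training format: split the trace
# into steps on double newlines, bucket the step count into a power of two for
# the user instruction, and insert countdown markers so the final step is 1.
def next_power_of_two(n: int) -> int:
    k = 1
    while k < n:
        k *= 2
    return k

def format_step_trace(trace: str) -> tuple[str, str]:
    steps = trace.split("\n\n")
    instruction = f"Think for up to {next_power_of_two(len(steps))} steps."
    remaining = len(steps)
    marked = []
    for step in steps:
        marked.append(f"{remaining} steps left\n{step}")
        remaining -= 1
    return instruction, "\n\n".join(marked)

instr, body = format_step_trace("First part.\n\nSecond part.\n\nThird part.")
```

Counting down rather than up means the marker "1 steps left" always directly precedes the answer, the strong stopping prior the paragraph describes.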
Table 12. Scaling thinking time via step-conditional control. All metrics are averaged over the 30 samples in AIME24. Token counts ignore the thinking and step delimiters.
表 12: 通过步骤条件控制扩展思考时间。所有指标均在 AIME24 的 30 个样本中取平均。Token 计数忽略思考和步骤分隔符。
| Steps instructed (→) | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|
| Test-time without intervention | | | | | |
| Steps used | 123 | 90 | 80 | 82 | 136 |
| Tokens per step | 60 | 70 | 69 | 66 | 56 |
| Thinking tokens | 7252 | 6277 | 5396 | 5552 | 7551 |
| Answer tokens | 665 | 653 | 735 | 777 | 754 |
| AIME24 | 33.3 | 23.3 | 33.3 | 36.7 | 33.3 |
| Thinking forcefully ended once 0 steps are reached | | | | | |
| Steps used | 16 | 32 | 59 | 78 | 136 |
| Tokens per step | 96 | 94 | 80 | 70 | 56 |
| Thinking tokens | 1517 | 2963 | 4636 | 5409 | 7551 |
| Answer tokens | 1111 | 788 | 799 | 794 | 754 |
| AIME24 | 23.3 | 23.3 | 33.3 | 36.7 | 33.3 |
Table 13. Scaling thinking time via class-conditional control. We report “accuracy / average thinking tokens per sample”; the higher the accuracy and the fewer the thinking tokens (inference cost) the better.
表 13: 通过类别条件控制扩展思考时间。我们报告“准确率/每样本平均思考 token”;准确率越高且思考 token(推理成本)越少越好。
| Prompt (appended to the question after two newlines) | AIME24 | MATH500 | GPQA |
|---|---|---|---|
| Answer after a short amount of thinking. Do not spend excessive time double-checking your work. | 0.0% / 8033 | 90.6% / 2537 | 56.6% / 4177 |
| Answer after a long amount of thinking. If you feel like you are finished early, spend the extra time trying to double-check your work until you are absolutely sure that you have the correct answer. | 36.7% / 9651 | 91.4% / 3875 | 51.0% / 4827 |
| No generic prompt appended | 50.0% / 6109 | 93.0% / 3298 | 57.6% / 3510 |
Class-conditional control OpenAI exposes test-time compute control to users via a “reasoning effort” API parameter with three possible settings: low, medium, and high. The OpenAI documentation also states that “Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response,” suggesting that they cannot control test-time compute with guarantees. Thus, OpenAI may simply adjust the prompt or system instruction depending on the desired reasoning effort. In Table 13, we show that separate prompts for short and long thinking allow us to control thinking time to some extent: prompting the model to think for longer leads to longer thinking. However, it does not reliably improve performance, and the control is not precise. This degree of adherence may suffice when there are only three classes, but it might not scale to finer-grained classes. To compute the Control metric reported in Table 3 for this method, we assume that prompting the model to think for a short time in Table 13 should produce fewer tokens than the default for AIME24, while the long prompt should produce more. As $8033>6109$ and $9651>6109$, only one of the two prompts follows the expected direction, so Control is 50%.
类条件控制
OpenAI 通过一个“推理努力” API 参数向用户暴露了测试时计算控制,该参数有三个可能的设置:低、中和高。OpenAI 文档还指出,“减少推理努力可能会导致响应速度更快,并且在响应中使用的推理 Token 更少。” 这表明他们无法保证控制测试时计算。因此,OpenAI 可能只是根据所需的推理努力来调整提示或系统指令。在表 13 中,我们展示了为短思考和长思考分别提供的提示可以在一定程度上控制思考时间:提示模型思考更长时间会导致更长的思考。然而,这并不能可靠地提高性能,且控制并不精确。当我们只有三个类别时,当前的这种控制可能已经足够,但它可能无法扩展到更细粒度的类别。为了计算表 3 中此方法的控制,我们假设在表 13 中提示模型进行短时间思考应该产生比 AIME24 默认值更少的 Token,而长提示应该产生更多的 Token。由于 $8033>6109$ 且 $9651>6109$,两个结果中有一个符合我们的预期控制,因此控制率为 $50%$。
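The Control computation just described can be written down directly; the numbers are the AIME24 thinking-token counts from Table 13:

```python
# Control = fraction of evaluation runs that follow the expected control
# direction. For class-conditional control, "short" should think less than the
# default and "long" should think more.
def control(outcomes):
    return sum(outcomes) / len(outcomes)

default_tokens = 6109       # no generic prompt appended
short_prompt_tokens = 8033  # expected: fewer tokens than default
long_prompt_tokens = 9651   # expected: more tokens than default

c = control([
    short_prompt_tokens < default_tokens,  # fails (8033 > 6109)
    long_prompt_tokens > default_tokens,   # holds (9651 > 6109)
])
```

Only the long prompt behaves as expected, giving a Control of 50%.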
D.2. Examples for rejection sampling ablation
D.2. 拒绝采样消融实验示例
Table 14. Answers generated by an earlier version of s1-32B when rejection sampling on AIME24. The black text is the question and the yellow text is the solution. Based solely on the question, we generate the light blue reasoning trace and blue response with s1-32B using a temperature of 1. We generate until the reasoning trace has fewer than 4000 tokens for the first example and fewer than 8000 tokens for the second. It required on average 199 (1) tries per example until the thinking was $\leq 4000$ ($\leq 8000$) tokens.
表 14: 在 AIME24 上进行拒绝采样时,早期版本的 s1-32B 生成的答案。黑色文字是问题,黄色文字是解答。我们仅根据问题使用温度设置为 1 的 s1-32B 生成了浅蓝色的推理轨迹和蓝色的回答。我们生成直到第一个例子的推理轨迹小于 4000 Token,第二个例子的推理轨迹小于 8000 Token。每个示例平均需要 199 (1) 次尝试,直到思考过程达到 $\leq4000$ $\leq8000$ Token。
Let $O(0,0)$, $A(\frac{1}{2},0)$, and $B(0,\frac{\sqrt{3}}{2})$ be points in the coordinate plane. Let $\mathcal{F}$ be the family of segments $\overline{PQ}$ of unit length lying in the first quadrant with $P$ on the $x$-axis and $Q$ on the $y$-axis. There is a unique point $C$ on $\overline{AB}$, distinct from $A$ and $B$, that does not belong to any segment from $\mathcal{F}$ other than $\overline{AB}$. Then $OC^{2}=\frac{p}{q}$, where $p$ and $q$ are relatively prime positive integers. Find $p+q$.
设 $O(0,0)$、$A(\frac{1}{2},0)$ 和 $B(0,\frac{\sqrt{3}}{2})$ 为坐标平面上的点。设 $\mathcal{F}$ 为单位长度线段 $\overline{PQ}$ 的集合,这些线段位于第一象限,且 $P$ 在 $x$ 轴上,$Q$ 在 $y$ 轴上。在 $\overline{AB}$ 上存在一个唯一的点 $C$,不同于 $A$ 和 $B$,它不属于 $\mathcal{F}$ 中除 $\overline{AB}$ 之外的任何线段。则有 $OC^{2}=\frac{p}{q}$,其中 $p$ 和 $q$ 为互质的正整数。求 $p+q$。
[Correct generation with 3981 thinking tokens and 1055 answer tokens]
[3981个思考Token和1055个回答Token的正确生成]
[Incorrect generation with 5923 thinking tokens and 754 answer tokens]
[错误生成,包含 5923 个思考 Token 和 754 个回答 Token]
We know $x^{2}+y^{2}=1$ .
我们知道 $x^{2}+y^{2}=1$。
We are looking for a point $C$ on $\overline{AB}$ that does not lie on any other segment from $\mathcal{F}$.
我们要寻找 $\overline{AB}$ 上的一个点 $C$,它不位于 $\mathcal{F}$ 中任何其他线段上。
Consider the geometry. The family $\mathcal{F}$ forms a region bounded by the axes and the quarter circle.
考虑几何结构。族 $\mathcal{F}$ 形成由坐标轴和四分之一圆所界定的区域。
The uniqueness of $C$ means that if any other segment passes through $C$, that segment must be $\overline{AB}$.
$C$ 的唯一性意味着如果任何其他线段经过 $C$,该线段必须是 $\overline{AB}$。
Consider the slope of the line containing the segment connecting $(x,0)$ and $(0,y)$. The slope is $-\frac{y}{x}$.
考虑连接 $(x,0)$ 和 $(0,y)$ 的线段所在直线的斜率。斜率为 $-\frac{y}{x}$。
The equation of this line is $Y=-\frac{y}{x}X+y$.
这条直线的方程为 $Y=-\frac{y}{x}X+y$。
If $C$ lies on the segment, then $\frac{1-\lambda}{2x}+\frac{\sqrt{3}\lambda}{2y}=1$. Consider the geometric mean. Let the point $C$ have coordinates $(a,b)$. The segment connecting $(x,0)$ and $(0,y)$ passes through $(a,b)$, so $\frac{a}{x}+\frac{b}{y}=1$. If $C$ is on $\overline{AB}$, then $b=-\sqrt{3}a+\frac{\sqrt{3}}{2}$, with $a=\frac{1}{2}-\frac{1}{2}\lambda$ and $b=\frac{\sqrt{3}}{2}\lambda$. The point $C$ is where $\overline{AB}$ is tangent to the envelope of the family of segments.
若 $C$ 位于该线段上,则 $\frac{1-\lambda}{2x}+\frac{\sqrt{3}\lambda}{2y}=1$。考虑几何平均数。设点 $C$ 的坐标为 $(a,b)$。连接 $(x,0)$ 和 $(0,y)$ 的线段经过 $(a,b)$,因此 $\frac{a}{x}+\frac{b}{y}=1$。如果 $C$ 位于 $\overline{AB}$ 上,则 $b=-\sqrt{3}a+\frac{\sqrt{3}}{2}$,其中 $a=\frac{1}{2}-\frac{1}{2}\lambda$,$b=\frac{\sqrt{3}}{2}\lambda$。点 $C$ 是 $\overline{AB}$ 与线段族包络线相切的位置。
Solution:
解决方案:
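The envelope argument mentioned in the trace can be checked numerically (our sketch, not part of either generation): assuming the standard result that the envelope of the unit segments is the astroid $x^{2/3}+y^{2/3}=1$, the tangency point of $\overline{AB}$ (which corresponds to the angle $60°$) is $C=(\cos^{3}60°,\sin^{3}60°)$.

```python
from fractions import Fraction
from math import sqrt, isclose

# Candidate point C = (cos^3 60deg, sin^3 60deg) = (1/8, 3*sqrt(3)/8).
cx, cy = 1 / 8, 3 * sqrt(3) / 8

# C lies on the astroid envelope x^(2/3) + y^(2/3) = 1 ...
assert isclose(cx ** (2 / 3) + cy ** (2 / 3), 1)
# ... and on line AB: y = -sqrt(3) x + sqrt(3)/2.
assert isclose(cy, -sqrt(3) * cx + sqrt(3) / 2)

# OC^2 = (1/8)^2 + (3*sqrt(3)/8)^2 = 1/64 + 27/64 = 7/16, so p + q = 23.
oc2 = Fraction(1, 64) + Fraction(27, 64)
```

This reproduces $OC^{2}=\frac{7}{16}$ exactly, consistent with the trace's tangency-to-the-envelope reasoning.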
