[论文翻译]小规模大语言模型中的强化学习推理:有效与无效之处


原文地址:https://arxiv.org/pdf/2503.16219

代码地址:https://github.com/knoveleng/open-rs

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t

小规模大语言模型中的强化学习推理:有效与无效之处

Abstract

摘要

Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains—e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview—using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.

提升大语言模型 (LLM) 的推理能力通常依赖于大量的计算资源和广泛的数据集,这在资源受限的环境中限制了其可访问性。我们的研究探讨了强化学习 (RL) 在提升小型 LLM 推理能力方面的潜力,重点关注一个 15 亿参数的模型 DeepSeek-R1-Distill-Qwen-1.5B,在严格的约束条件下:在 4 个 NVIDIA A40 GPU(每个 48 GB VRAM)上训练 24 小时。我们采用了 Group Relative Policy Optimization (GRPO) 算法,并精心策划了一个紧凑且高质量的数学推理数据集,进行了三项实验以探索模型的行为和性能。我们的结果表明,推理能力迅速提升——例如,AMC23 的准确率从 63% 上升到 80%,AIME24 达到了 46.7%,超过了 o1-preview——仅使用了 7,000 个样本和 42 美元的训练成本,而基线模型的成本则高达数千美元。然而,随着训练时间的延长,出现了优化不稳定性和长度限制等挑战。这些发现突显了基于 RL 的微调在小型 LLM 中的有效性,为大规模方法提供了一种经济高效的替代方案。我们发布了代码和数据集作为开源资源,提供了对权衡的见解,并为在资源有限的环境中构建可扩展的、具备推理能力的 LLM 奠定了基础。所有资源均可在 https://github.com/knoveleng/open-rs 获取。

1 Introduction

1 引言

Recent advancements in large language models (LLMs) have significantly advanced the pursuit of artificial general intelligence (AGI), with models such as GPT-4o (OpenAI, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), and Gemini 1.5 (Google, 2024) demonstrating unprecedented capabilities. A pivotal aspect of this progress is the integration of post-training techniques into the training pipeline. These methods—including supervised fine-tuning (SFT) and reinforcement learning (RL)—enhance reasoning accuracy, align models with societal values, and adapt them to user preferences, all while demanding fewer computational resources than pre-training (OpenAI, 2024b). A notable innovation in this domain is OpenAI’s o1 series, which leverages inference-time scaling through extended Chain-of-Thought (CoT) reasoning to achieve remarkable performance in mathematics, coding, and scientific reasoning tasks (OpenAI, 2024b). However, despite these breakthroughs, scaling reasoning capabilities at test time remains a persistent challenge for the broader research community, largely due to limited access to proprietary methodologies and resources.

近年来,大语言模型 (LLM) 的显著进展极大地推动了通用人工智能 (AGI) 的追求,诸如 GPT-4o (OpenAI, 2024a)、Claude 3.5 Sonnet (Anthropic, 2024) 和 Gemini 1.5 (Google, 2024) 等模型展示了前所未有的能力。这一进展的关键在于将训练后技术整合到训练流程中。这些方法——包括监督微调 (SFT) 和强化学习 (RL)——提高了推理准确性,使模型与社会价值观保持一致,并根据用户偏好进行调整,同时比预训练所需的计算资源更少 (OpenAI, 2024b)。该领域的一个显著创新是 OpenAI 的 o1 系列,它通过扩展的思维链 (CoT) 推理利用推理时扩展,在数学、编码和科学推理任务中取得了显著性能 (OpenAI, 2024b)。然而,尽管取得了这些突破,在测试时扩展推理能力仍然是更广泛研究社区面临的持续挑战,这主要是由于对专有方法和资源的访问有限。

Efforts to bolster LLM reasoning have explored diverse strategies. Process-based reward models (Uesato et al., 2022; Lightman et al., 2023a; Wang et al., 2023) guide models toward structured problem-solving, while RL approaches (Kumar et al., 2024) optimize performance through feedback-driven learning. Search algorithms, such as Monte Carlo Tree Search (MCTS) and Beam Search, have also been employed to enhance reasoning depth (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024). Although these methods have driven incremental gains, they fall short of the general reasoning prowess exhibited by the o1 series. Recently, the DeepSeek-R1 model (DeepSeek-AI, 2025) has emerged as a competitive alternative, utilizing RL with the Group Relative Policy Optimization (GRPO) algorithm. Built on the 671-billion-parameter DeepSeek-V3, DeepSeek-R1 matches o1’s reasoning performance (DeepSeek-AI, 2025). Yet, the sheer scale and computational demands of such models—often exceeding hundreds of billions of parameters—render them impractical for self-hosting by most organizations outside major technology firms, limiting their broader adoption.

为了增强大语言模型的推理能力,研究者们探索了多种策略。基于过程的奖励模型 (Uesato et al., 2022; Lightman et al., 2023a; Wang et al., 2023) 引导模型进行结构化问题解决,而强化学习 (Reinforcement Learning, RL) 方法 (Kumar et al., 2024) 则通过反馈驱动的学习来优化性能。搜索算法,如蒙特卡洛树搜索 (Monte Carlo Tree Search, MCTS) 和束搜索 (Beam Search),也被用于增强推理深度 (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024)。尽管这些方法带来了逐步的改进,但它们仍无法与 o1 系列所展现的通用推理能力相媲美。最近,DeepSeek-R1 模型 (DeepSeek-AI, 2025) 作为一种竞争性替代方案出现,它利用强化学习结合群组相对策略优化 (Group Relative Policy Optimization, GRPO) 算法。基于 6710 亿参数的 DeepSeek-V3,DeepSeek-R1 在推理性能上与 o1 相当 (DeepSeek-AI, 2025)。然而,这些模型的庞大规模和计算需求——通常超过数千亿参数——使得大多数非大型科技公司难以自托管,限制了它们的广泛应用。

In contrast, small LLMs, typically ranging from 1 to 10 billion parameters, present a resource-efficient alternative with potential for widespread deployment. Previous studies have demonstrated the feasibility of enhancing small LLMs through RL-based fine-tuning inspired by DeepSeek-R1 (Luo et al., 2025; Team, 2025b). However, these efforts often rely on expansive datasets (hundreds of thousands to millions of samples) or incur significant computational costs, undermining their accessibility for resource-constrained settings. This tension motivates two central research questions:

相比之下,小型大语言模型(LLM),通常参数规模在10亿到100亿之间,提供了一种资源高效的替代方案,具有广泛部署的潜力。先前的研究已经证明了通过基于强化学习(RL)的微调来增强小型大语言模型的可行性,这一方法受到了DeepSeek-R1的启发(Luo等,2025;Team,2025b)。然而,这些努力通常依赖于庞大的数据集(数十万到数百万样本)或产生显著的计算成本,从而削弱了它们在资源受限环境中的可访问性。这种矛盾激发了两个核心研究问题:

These questions naturally extend to a practical inquiry: If viable, how should such an approach be implemented for small LLMs, and if not, what are the fundamental limitations? Addressing these, we investigate the reasoning capacity of a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under stringent constraints: training on a cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each) within a 24-hour window. Our methodology adapts the GRPO-based RL framework from DeepSeek-R1, tailoring it to the resource-limited context of small LLMs. We assess performance on a suite of mathematical reasoning benchmarks, a domain requiring structured, logical problem-solving that serves as a robust testbed for reasoning ability.

这些问题自然延伸出一个实际的探究:如果可行,如何为小型大语言模型实施这种方法?如果不可行,其根本限制是什么?针对这些问题,我们研究了在严格约束条件下,一个拥有15亿参数的模型DeepSeek-R1-Distill-Qwen-1.5B的推理能力:在4台NVIDIA A40 GPU(每台48 GB显存)的集群上,在24小时内完成训练。我们的方法采用了基于GRPO的强化学习框架,并将其调整为适用于资源有限的小型大语言模型。我们在一系列数学推理基准上评估了性能,这一领域需要结构化的逻辑问题解决能力,是推理能力的强大测试平台。


Figure 1: Comparison of zero-shot pass@1 performance versus model size (left) and computational cost (right). Our Open-RS (red point) achieves the highest AIME24 score (46.7%), outperforming o1-preview (44.6%) and other models (green points). Additionally, Open-RS models exhibit the lowest computational cost at approximately $42.

图 1: 零样本 pass@1 性能与模型大小(左)和计算成本(右)的比较。我们的 Open-RS(红点)达到了最高的 AIME24 分数 (46.7%),优于 o1-preview (44.6%) 和其他模型(绿点)。此外,Open-RS 模型的计算成本最低,约为 42 美元。

Our study yields three primary contributions:

我们的研究主要有以下三点贡献:

Our findings illuminate the promise of RL-based methods to enhance small LLMs’ reasoning capabilities, achieving competitive performance with minimal resources (Figure 1). Simultaneously, they reveal critical challenges—such as data efficiency, optimization stability, and length constraints—that must be addressed to fully realize this potential. These insights lay the groundwork for developing lightweight, reasoning-capable LLMs suitable for resource-constrained environments, advancing the democratization of advanced AI technologies.

我们的研究结果揭示了基于强化学习(RL)的方法在提升小型大语言模型推理能力方面的潜力,以最少的资源实现了具有竞争力的性能(图 1)。同时,这些方法也揭示了必须解决的关键挑战——如数据效率、优化稳定性和长度限制——以充分发挥这一潜力。这些见解为开发适合资源受限环境的轻量级、具备推理能力的大语言模型奠定了基础,推动了先进AI技术的民主化进程。

The remainder of this paper is structured as follows: Section 2 details our methodology, including data curation, RL algorithm, and reward design; Section 3 presents three experiments, their results, and comparative analyses; and Section 4 summarizes key findings. Additional details, including related work, discussion, hyperparameter setups, and supplementary results, are provided in the Appendix.

本文的其余部分结构如下:第2节详细介绍了我们的方法,包括数据整理、强化学习算法和奖励设计;第3节展示了三个实验、其结果以及比较分析;第4节总结了主要发现。附录中提供了更多细节,包括相关工作、讨论、超参数设置和补充结果。

2 Methodology

2 方法论

In this section, we outline our approach to optimizing the reasoning capabilities of small large language models (LLMs) under computational constraints. Our methodology comprises two primary components: (1) the curation of a high-quality, mathematics-focused dataset, and (2) the application of a resource-efficient reinforcement learning (RL) algorithm. These components are designed to balance performance gains with practical limitations, such as reduced computational overhead and privacy considerations.

在本节中,我们概述了在计算约束下优化小型大语言模型 (LLM) 推理能力的方法。我们的方法包括两个主要部分:(1) 构建一个高质量、以数学为重点的数据集,(2) 应用一种资源高效的强化学习 (RL) 算法。这些组件的设计旨在平衡性能提升与实际限制,例如减少计算开销和隐私考虑。

2.1 High-Quality Dataset Curation

2.1 高质量数据集构建

To minimize training costs while maximizing reasoning performance, we curate a compact, high-quality dataset tailored to mathematical reasoning. This dataset is derived from two existing sources: the s1 dataset (Muennighoff et al., 2025) and the DeepScaleR dataset (Luo et al., 2025). By filtering and refining these datasets, we ensure that our training data is both relevant and challenging, enabling efficient learning for small LLMs.

为了在最小化训练成本的同时最大化推理性能,我们精心策划了一个紧凑且高质量的数据集,专门针对数学推理。该数据集源自两个现有来源:s1 数据集 (Muennighoff et al., 2025) 和 DeepScaleR 数据集 (Luo et al., 2025)。通过对这些数据集进行过滤和精炼,我们确保训练数据既相关又具有挑战性,从而为小型大语言模型提供高效的学习机会。

s1 Dataset The s1 dataset (Muennighoff et al., 2025) is a general-purpose reasoning corpus comprising 59,029 questions sourced from diverse domains, including NuminaMATH (LI et al., 2024), AIME problems (1983–2021), OlympicArena (Huang et al., 2024), OmniMath (Gao et al., 2024), AGIEval (Zhong et al., 2023), probability questions from Stanford University's Statistics Department PhD Qualifying Exams (https://statistics.stanford.edu), and brain-teasers from PuzzledQuant (https://www.puzzledquant.com). Although the dataset spans multiple disciplines—such as Astronomy, Biology, Chemistry, Computer Science, Geography, Mathematics, and Physics—our focus is exclusively on mathematical reasoning.

s1 数据集
s1 数据集 (Muennighoff 等人, 2025) 是一个通用推理语料库,包含来自多个领域的 59,029 个问题,涵盖 NuminaMATH (LI 等人, 2024)、AIME 问题 (1983–2021)、OlympicArena (Huang 等人, 2024)、OmniMath (Gao 等人, 2024)、AGIEval (Zhong 等人, 2023)、斯坦福大学统计系博士资格考试中的概率问题 (https://statistics.stanford.edu) 以及 PuzzledQuant 的脑筋急转弯 (https://www.puzzledquant.com)。尽管该数据集涵盖多个学科——如天文学、生物学、化学、计算机科学、地理学、数学和物理学——但我们的研究重点仅集中在数学推理上。

To isolate mathematics-specific examples, we adopt a filtering workflow inspired by (Muennighoff et al., 2025). First, we retain only questions with solutions containing the LaTeX command \boxed{}, a common indicator of mathematical answers, reducing the dataset to 31,323 examples. Next, we employ the distilled model DeepSeek-R1-Distill-Qwen-1.5B to eliminate trivial questions, yielding 21,533 examples. Finally, to ensure data quality, we use Qwen2.5-7B-Instruct to remove noisy or multi-part questions, resulting in a final set of 18,615 high-quality mathematical reasoning examples – the open-s1 dataset.

为了隔离数学相关的示例,我们采用了受 (Muennighoff et al., 2025) 启发的过滤工作流程。首先,我们仅保留解决方案中包含 LaTeX 命令 \boxed{} 的问题,这是数学答案的常见指示符,将数据集减少到 31,323 个示例。接下来,我们使用蒸馏模型 DeepSeek-R1-Distill-Qwen-1.5B 来消除简单问题,得到 21,533 个示例。最后,为了确保数据质量,我们使用 Qwen2.5-7B-Instruct 去除噪声或多部分问题,最终得到 18,615 个高质量的数学推理示例——open-s1 数据集。
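To make the three-stage filter concrete, the sketch below expresses it with the Hugging Face datasets library. The file path and the judge_is_trivial/judge_is_clean callables (which would wrap DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct) are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of the open-s1 filtering workflow described above. Assumptions: the raw
# s1 pool lives in a local JSON file with "question"/"solution" fields, and the
# two judge callables wrap the LLM-based difficulty and quality checks.
from datasets import load_dataset

def has_boxed_answer(example):
    # Stage 1: keep only solutions containing the LaTeX \boxed{} marker.
    return "\\boxed" in example["solution"]

def build_open_s1(path, judge_is_trivial, judge_is_clean):
    ds = load_dataset("json", data_files=path, split="train")        # ~59k questions
    ds = ds.filter(has_boxed_answer)                                  # -> ~31.3k
    ds = ds.filter(lambda ex: not judge_is_trivial(ex["question"]))   # -> ~21.5k
    ds = ds.filter(lambda ex: judge_is_clean(ex["question"]))         # -> ~18.6k (open-s1)
    return ds
```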

DeepScaleR Dataset The DeepScaleR dataset (Luo et al., 2025) contains 40,315 mathematics-specific questions drawn from AIME (1984–2023), AMC (prior to 2023), OmniMATH, and the Still dataset. Unlike the s1 dataset, DeepScaleR is pre-filtered to focus solely on mathematics, with redundant questions removed and solutions extracted from raw text using retrieval-augmented generation (RAG) and advanced LLMs like Gemini-1.5-Pro-002. To further refine this dataset, we apply Qwen2.5-Math-7B-Instruct to exclude easy questions, reducing the set to 21,044 examples – open-deepscaler dataset. We opt for Qwen2.5-Math-7B-Instruct over DeepSeek-R1-Distill-Qwen-1.5B—used for the s1 dataset—to introduce diversity in filtering criteria and avoid excessive overlap between the two datasets.

DeepScaleR 数据集
DeepScaleR 数据集 (Luo et al., 2025) 包含从 AIME (1984–2023)、AMC (2023 年之前)、OmniMATH 和 Still 数据集中提取的 40,315 个数学相关题目。与 s1 数据集不同,DeepScaleR 经过预过滤,专注于数学领域,冗余题目已被移除,并通过检索增强生成 (RAG) 和 Gemini-1.5-Pro-002 等先进的大语言模型从原始文本中提取解答。为了进一步优化该数据集,我们使用 Qwen2.5-Math-7B-Instruct 排除了简单题目,将数据集缩减至 21,044 个样本——即 open-deepscaler 数据集。我们选择 Qwen2.5-Math-7B-Instruct 而非 s1 数据集中使用的 DeepSeek-R1-Distill-Qwen-1.5B,以引入过滤标准的多样性,并避免两个数据集之间的过度重叠。

Final Dataset Combining the refined open-s1 dataset (18,615 examples) and open-deepscaler (21,044 examples), we obtain a final high-quality dataset of 39,659 mathematical reasoning questions. This curated corpus strikes a balance between scale and specificity, enabling effective training of small LLMs under resource constraints.

最终数据集结合精炼后的 open-s1 数据集(18,615 个示例)和 open-deepscaler(21,044 个示例),我们获得了包含 39,659 个数学推理问题的高质量最终数据集。这个精选的语料库在规模和特异性之间取得了平衡,能够在资源受限的情况下有效训练小型大语言模型。

2.2 Reinforcement Learning Algorithm

2.2 强化学习算法

To train small LLMs efficiently, we adopt the Group Relative Policy Optimization (GRPO) algorithm Shao et al. (2024), as utilized in DeepSeek-AI (2025). GRPO eliminates the need for a separate critic model—typically as large as the policy model—by estimating baselines from group scores, thereby reducing computational overhead. For each question $q$, GRPO samples a group of $G$ outputs $\{o_{1}, o_{2}, \ldots, o_{G}\}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and optimizes the policy $\pi_{\theta}$ by maximizing the following objective:

为了高效训练小型大语言模型,我们采用了 Group Relative Policy Optimization (GRPO) 算法 Shao et al. (2024),如 DeepSeek-AI (2025) 中所使用的那样。GRPO 通过从组分数中估计基线,消除了对单独的评论家 (critic) 模型的需求——该模型通常与策略模型一样大——从而减少了计算开销。对于每个问题 $q$,GRPO 从旧策略 $\pi_{\theta_{\mathrm{old}}}$ 中采样一组 $G$ 个输出 $\{o_{1}, o_{2}, \ldots, o_{G}\}$,并通过最大化以下目标来优化策略 $\pi_{\theta}$:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q\sim P(Q),\,\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(O\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_{i}\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_{i}\mid q)}A_{i},\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_{i}\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_{i}\mid q)},1-\epsilon,1+\epsilon\right)A_{i}\right)-\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)\right)\right]$$

where the KL-divergence term is defined as:

其中 KL 散度项定义为:

$$\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)=\frac{\pi_{\mathrm{ref}}(o_{i}\mid q)}{\pi_{\theta}(o_{i}\mid q)}-\log\frac{\pi_{\mathrm{ref}}(o_{i}\mid q)}{\pi_{\theta}(o_{i}\mid q)}-1
$$

and the advantage $A_{i}$ is computed from a group of rewards ${r_{1},r_{2},\ldots,r_{G}}$ :

优势 $A_{i}$ 是从一组奖励 ${r_{1},r_{2},\ldots,r_{G}}$ 中计算得出的:

$$A_{i}=\frac{r_{i}-\mathrm{mean}(\{r_{1},r_{2},\ldots,r_{G}\})}{\mathrm{std}(\{r_{1},r_{2},\ldots,r_{G}\})}$$

Here, $\epsilon$ and $\beta$ are hyperparameters controlling the clipping range and KL penalty, respectively.

这里,$\epsilon$ 和 $\beta$ 是分别控制裁剪范围和KL惩罚的超参数。
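For illustration, the snippet below computes the group-relative advantage and the clipped GRPO surrogate for one question at the sequence level. It is a simplified sketch (in practice the objective is applied token-wise), and the default values for $\epsilon$ and $\beta$ are plausible assumptions rather than the paper's settings.

```python
# Minimal sequence-level sketch of the GRPO objective above; logp_* are the
# summed log-probabilities of the G sampled outputs under the current, old,
# and reference policies, and rewards holds r_1..r_G.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    # Group-relative advantage: standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped importance-weighted surrogate against the old policy.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased estimator of the KL term against the reference policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    # GRPO maximizes the objective, so the training loss is its negative mean.
    return -(surrogate - beta * kl).mean()
```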

Reward Models The reward function is critical to guiding RL optimization. We employ a rule-based reward system comprising three components, designed to balance correctness, efficiency, and structure without relying on resource-intensive neural reward models:

奖励模型
奖励函数对于指导强化学习(RL)优化至关重要。我们采用了一个基于规则的奖励系统,该系统由三个部分组成,旨在在不依赖资源密集型神经奖励模型的情况下平衡正确性、效率和结构:

3 Experiments

3 实验

To address the research questions outlined in Section 1—namely, how reinforcement learning (RL) can enhance the reasoning abilities of small large language models (LLMs) and what practical insights emerge under computational constraints—we design three experiments to analyze the training behavior of small LLMs. These experiments aim to provide empirical evidence of performance improvements and offer actionable guidance for future research and industrial applications.

为了解决第1节中概述的研究问题——即强化学习(RL)如何增强小型大语言模型(LLMs)的推理能力,以及在计算限制下会得出哪些实际见解——我们设计了三个实验来分析小型LLMs的训练行为。这些实验旨在提供性能改进的实证证据,并为未来的研究和工业应用提供可操作的指导。

3.1 Experimental Setup

3.1 实验设置

We select DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025) as our base model for training. This 1.5-billion-parameter model, distilled from larger architectures, is chosen for its balance of efficiency and reasoning potential. Notably, we bypass the supervised fine-tuning (SFT) phase—typically a precursor to RL for performance enhancement (Chu et al., 2025)—hypothesizing that the model's pre-training is sufficient to leverage RL directly. For the RL phase, we employ the Group Relative Policy Optimization (GRPO) algorithm, as detailed in Section 2.2, due to its computational efficiency.

我们选择 DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025) 作为训练的基础模型。这个拥有 15 亿参数的模型是从更大的架构中蒸馏出来的,因其效率和推理潜力的平衡而被选中。值得注意的是,我们跳过了监督微调 (SFT) 阶段——通常是强化学习 (RL) 性能增强的前置步骤 (Chu et al., 2025)——假设模型的预训练足以直接利用 RL。在 RL 阶段,我们采用了组相对策略优化 (GRPO) 算法,如第 2.2 节所述,因其计算效率高。

Training is conducted on a cluster of 4 NVIDIA A40 GPUs (48GB VRAM each), imposing constraints that limit us to sampling 6 outputs per step with a maximum completion length of 4096 tokens. To facilitate this, we adapt open-r1 (Hugging Face, 2025), an open-source reproduction of DeepSeek-R1 by the Hugging Face team, customizing it to align with our objectives. The training phase is restricted to 1 epoch, completed within a 24-hour window, reflecting real-world resource limitations. Hyperparameters and additional configurations are detailed in Appendix E.

训练在4个NVIDIA A40 GPU(每个48GB显存)的集群上进行,这限制了我们在每一步只能采样6个输出,且最大完成长度为4096个token。为此,我们采用了Hugging Face团队开源的DeepSeek-R1复现项目open-r1 (Hugging Face, 2025),并对其进行了定制以符合我们的目标。训练阶段限制为1个epoch,并在24小时内完成,反映了现实世界的资源限制。超参数和其他配置详见附录E。
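A minimal training entry point, assuming the TRL GRPOTrainer interface that open-r1 builds on, might look as follows. The dataset file and the toy reward function are placeholders; only the constraints stated above (6 samples per question, 4096-token completions, 1 epoch) are taken from the paper.

```python
# Sketch of a GRPO training run under the stated constraints; the dataset file
# and reward function are hypothetical placeholders, not the authors' exact setup.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Toy rule-based reward in TRL's batch signature: bonus for <think> blocks.
    return [0.5 if re.search(r"<think>.*?</think>", c, re.DOTALL) else 0.0
            for c in completions]

train_dataset = load_dataset("json", data_files="open_rs_mix.json", split="train")

args = GRPOConfig(
    output_dir="data/OpenRS-GRPO",
    num_generations=6,            # G = 6 sampled outputs per question
    max_completion_length=4096,   # completion budget used in Experiment 1
    num_train_epochs=1,           # single epoch within the 24-hour window
    bf16=True,
    use_vllm=True,                # vLLM-backed generation, as in open-r1
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=[format_reward],
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```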

3.2 Benchmark Datasets

3.2 基准数据集

To evaluate the reasoning capabilities of our small LLM, we choose five mathematics-focused benchmark datasets: AIME24, MATH-500 (Lightman et al., 2023b; Hendrycks et al., 2021), AMC23, Minerva (Lewkowycz et al., 2022b) and OlympiadBench (He et al., 2024). Details of the datasets are provided in Appendix C.

为了评估我们的小型大语言模型的推理能力,我们选择了五个以数学为重点的基准数据集:AIME24、MATH-500 (Lightman et al., 2023b; Hendrycks et al., 2021)、AMC23、Minerva (Lewkowycz et al., 2022b) 和 OlympiadBench (He et al., 2024)。数据集的详细信息见附录 C。

3.3 Baseline Models

3.3 基线模型

To contextualize our results, we compare our trained model against a range of baselines: Llama-3.1-70B-Instruct (AI, 2024a), o1-preview (AI, 2024b), Qwen-2.5-Math-7B-Instruct (Yang et al., 2024), rStar-Math-7B (Guan et al., 2025), Eurus-2-7B-PRIME (Cui et al., 2025), Qwen2.5-7B-SimpleRL (Zeng et al., 2025), DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025), DeepScaleR-1.5B-Preview (Luo et al., 2025), Still-3-1.5B-Preview (Team, 2025b).

为了将我们的结果置于上下文中,我们将训练好的模型与一系列基线模型进行了比较:Llama-3.1-70B-Instruct (AI, 2024a)、o1-preview (AI, 2024b)、Qwen-2.5-Math-7B-Instruct (Yang et al., 2024)、rStar-Math-7B (Guan et al., 2025)、Eurus-2-7B-PRIME (Cui et al., 2025)、Qwen2.5-7B-SimpleRL (Zeng et al., 2025)、DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025)、DeepScaleR-1.5B-Preview (Luo et al., 2025)、Still-3-1.5B-Preview (Team, 2025b)。

This selection enables a robust comparison across model sizes, training methodologies, and reasoning strategies, highlighting the efficacy of our approach for small LLMs. Details of the baselines are provided in Appendix D.

这一选择使得我们能够在模型大小、训练方法和推理策略之间进行强有力的比较,突显了我们方法在小规模大语言模型上的有效性。基线模型的详细信息见附录 D。

3.4 Evaluation Metric

3.4 评估指标

We adopt the zero-shot pass@1 metric to measure performance, defined as the proportion of problems correctly solved on the first attempt without prior examples. This metric emphasizes the model's ability to reason independently, aligning with our goal of enhancing intrinsic reasoning capabilities in small LLMs. Final answers are required in \boxed{} format for consistent automated evaluation.

我们采用零样本通过率 (zero-shot pass@1) 指标来衡量性能,该指标定义为在没有先验示例的情况下首次尝试正确解决问题的比例。该指标强调模型的独立推理能力,与我们增强小型大语言模型内在推理能力的目标一致。最终答案需要以 \boxed{} 格式提供,以确保自动化评估的一致性。
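As a concrete illustration of the metric, the helper below extracts the final \boxed{} answer from a single greedy completion and scores exact matches; the actual evaluation in Table 1 is run with the lighteval package, so this is only a sketch.

```python
# Toy pass@1 scorer: one attempt per problem, exact match on the \boxed{} answer.
import re

def extract_boxed(text: str):
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def pass_at_1(predictions, references):
    correct = sum(extract_boxed(p) == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```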

3.5 Process and Results

3.5 过程与结果

In this subsection, we present three experiments designed to enhance the reasoning abilities of small LLMs using reinforcement learning (RL), following the methodology in Section 2. We analyze training progress, evaluate performance across benchmarks, and compare our models against baselines, highlighting key insights and their implications for future work.

在本小节中,我们介绍了三个实验,旨在通过强化学习 (RL) 提升小型大语言模型的推理能力,遵循第 2 节中的方法。我们分析了训练进展,评估了在多个基准测试中的表现,并将我们的模型与基线模型进行了比较,突出了关键见解及其对未来工作的意义。


Figure 2: Performance of the model on AMC23 (left) and MATH-500 (right) across global training steps. The red dashed line indicates the baseline score at the start of training.

图 2: 模型在 AMC23 (左) 和 MATH-500 (右) 上的性能随全局训练步骤的变化。红色虚线表示训练开始时的基线分数。

3.5.1 Experiment 1: Impact of High-Quality Data

3.5.1 实验 1:高质量数据的影响

In Experiment 1, we train the DeepSeek-R1-Distill-Qwen-1.5B model using the open-s1 dataset (18,615 samples) from Section 2.1, with a maximum completion length of 4096 tokens. We employ accuracy and format rewards, as described in Section 2.2. Although the full dataset corresponds to approximately 1500 global steps for one epoch, computational constraints (24-hour limit on 4x A40 GPUs) restrict training to 500 global steps.

在实验1中,我们使用第2.1节中的open-s1数据集(18,615个样本)训练DeepSeek-R1-Distill-Qwen-1.5B模型,最大完成长度为4096个Token。我们采用了第2.2节中描述的准确性和格式奖励。尽管完整数据集对应大约1500个全局步骤(一个epoch),但由于计算限制(4x A40 GPU上的24小时限制),训练被限制在500个全局步骤。

Performance on AMC23 improves from 63% to 70% and on MATH-500 from 83% to 84% within the first 50–100 steps (see Figure 2). However, after 200 steps, accuracy degrades significantly, dropping below 60% on AMC23 and to 80% on MATH-500. Figure 3 illustrates this trend, showing unstable accuracy rewards and completion lengths fluctuating near 4000 tokens initially, then decreasing to around 3000 tokens by 100 global steps (approximately 3000 local steps on a single GPU). Post-200 steps, lengths increase again, accompanied by unreadable content and non-English outputs.

在最初的 50-100 步内,AMC23 上的性能从 63% 提升到 70%,MATH-500 上的性能从 83% 提升到 84%(见图 2)。然而,经过 200 步后,准确率显著下降,AMC23 上的准确率降至 60% 以下,MATH-500 上的准确率降至 80%。图 3 展示了这一趋势,显示准确率奖励不稳定,完成长度最初在 4000 token 附近波动,然后在 100 个全局步数(约相当于单个 GPU 上的 3000 个局部步数)时降至约 3000 token。200 步后,长度再次增加,同时伴随着不可读的内容和非英语输出。


Figure 3: Accuracy reward (left) and completion length (right) of outputs in Experiment 1 across local steps. Note that global steps are distributed across 4 GPUs, with 100 global steps approximating 3000 local steps.

图 3: 实验 1 中输出的准确率奖励(左)和完成长度(右)随局部步数的变化。请注意,全局步数分布在 4 个 GPU 上,100 个全局步数大约相当于 3000 个局部步数。

This degradation suggests that the model struggles with the complexity of open-s1, often exceeding the 4096-token limit before producing a final answer. The initial length reduction reflects adaptation to the format reward, but the subsequent increase and language drift indicate reward misalignment. We derive the following insight:

这种退化表明模型在处理 open-s1 的复杂性时遇到了困难,通常在生成最终答案之前就超过了 4096-token 的限制。初始长度的减少反映了对格式奖励的适应,但随后的增加和语言漂移表明奖励存在偏差。我们得出以下见解:

Insight 1

洞察 1

Small LLMs can achieve rapid reasoning improvements with limited high-quality data within 50–100 steps, but performance degrades with prolonged training under strict length constraints.

小规模大语言模型 (LLM) 可以在 50-100 步内通过有限的高质量数据实现快速推理能力的提升,但在严格的长度限制下,长时间训练会导致性能下降。

3.5.2 Experiment 2: Balancing Easy and Hard Problems

3.5.2 实验 2:平衡简单和困难问题

Building on Experiment 1, we hypothesize that mixing easier problems with challenging ones could stabilize training and reduce completion lengths. We construct a dataset of 7000 samples: 3000 from open-s1, 3000 from open-deepscaler, and 1000 easier problems from the raw DeepScaleR dataset (Section 2.1). The maximum completion length is reduced to 3584 tokens, retaining accuracy and format rewards.

在实验1的基础上,我们假设将简单问题与挑战性问题混合可以稳定训练并减少完成长度。我们构建了一个包含7000个样本的数据集:3000个来自open-s1,3000个来自open-deepscaler,以及1000个来自原始DeepScaleR数据集的简单问题(第2.1节)。最大完成长度减少到3584个token,同时保留了准确性和格式奖励。
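Assembling this mixture is straightforward with the datasets library; the sketch below assumes the three refined pools are available as local JSON files (hypothetical paths) and simply samples and concatenates them.

```python
# Sketch of the Experiment 2 data mix: 3,000 open-s1 + 3,000 open-deepscaler
# + 1,000 easier DeepScaleR problems; file paths and seed are placeholders.
from datasets import load_dataset, concatenate_datasets

def take(path, n, seed=42):
    ds = load_dataset("json", data_files=path, split="train")
    return ds.shuffle(seed=seed).select(range(n))

mix = concatenate_datasets([
    take("open_s1.json", 3000),
    take("open_deepscaler.json", 3000),
    take("deepscaler_easy.json", 1000),
]).shuffle(seed=42)   # 7,000 questions in total
```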

Initial completion lengths drop to approximately 2800 tokens, and performance improves significantly: AMC23 rises from 63% to 80%, and MATH-500 from 83% to 85% within 50–100 steps (Figure 2). However, after 150–200 steps (approximately 4000 local steps), performance declines, and KL divergence becomes unstable (Figure 4), with mixed-language outputs reemerging.

初始完成长度降至约2800个token,性能显著提升:在50-100步内,AMC23从63%上升到80%,MATH-500从83%上升到85%(图2)。然而,在150-200步后(约4000个局部步骤),性能下降,KL散度变得不稳定(图4),混合语言输出再次出现。


Figure 4: KL divergence (left) and completion length (right) of outputs in Experiment 2 across local steps.

图 4: 实验 2 中输出结果的 KL 散度 (左) 和完成长度 (右) 随局部步骤的变化。

The improved initial performance validates our hypothesis, suggesting that easier problems encourage concise reasoning, while harder ones maintain complexity. However, the late-stage instability highlights persistent challenges with length constraints and multilingual tendencies. We note:

改进后的初始性能验证了我们的假设,表明较简单的问题鼓励简洁的推理,而较难的问题则保持复杂性。然而,后期的不稳定性突显了长度限制和多语言倾向的持续挑战。我们注意到:

Insight 2

洞察 2

Incorporating a mix of easy and hard problems under reduced length constraints enhances early performance and stabilizes reasoning behavior, though long-term stability remains elusive.

在减少长度限制的情况下,结合简单和困难的问题可以提升早期表现并稳定推理行为,尽管长期稳定性仍然难以实现。

3.5.3 Experiment 3: Controlling Length with Cosine Reward

3.5.3 实验 3: 使用余弦奖励控制长度

Experiment 3 uses the same 7000-sample dataset as Experiment 2, but replaces the accuracy reward with a cosine reward to better control output length, as outlined in Section 2.2. We also append an instruction to the system prompt: “Reply in English only, do not use other languages”, avoiding a computationally expensive language reward function. The maximum completion length remains 3584 tokens.

实验 3 使用了与实验 2 相同的 7000 样本数据集,但将准确率奖励替换为余弦奖励,以更好地控制输出长度,如第 2.2 节所述。我们还在系统提示中附加了一条指令:“仅用英语回复,不要使用其他语言”,从而避免了计算成本高昂的语言奖励函数。最大完成长度仍为 3584 个 Token。
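For reference, one common way to formulate such a cosine length reward (in the spirit of Yeo et al., 2025) is sketched below; the reward bounds are illustrative assumptions rather than the values used in Experiment 3. Correct answers earn slightly more when they are short, while incorrect answers are penalized less harshly when the model has at least reasoned at length.

```python
# Illustrative cosine length reward; bounds are assumed, not the paper's values.
import math

def cosine_reward(is_correct: bool, length: int, max_len: int = 3584,
                  correct=(2.0, 1.0), wrong=(-10.0, 0.0)) -> float:
    start, end = correct if is_correct else wrong   # values at length 0 and at max_len
    progress = min(length / max_len, 1.0)
    # Cosine interpolation between the short-completion and long-completion values.
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
```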

Completion lengths stabilize between 1000 and 3500 tokens (Figure 5), a marked improvement over Experiment 2's 2000–3500 range. Performance on AMC23 and MATH-500 increases modestly compared to the baseline (63% to 72.5% and 83% to 84.4%, respectively) within 50 steps, though it lags behind Experiment 2's peak (Figure 2). After 200 steps, mixed-language content persists, reflecting the multilingual nature of DeepSeek-R1-Distill-Qwen-1.5B.

完成长度稳定在 1000 到 3500 个 Token 之间(图 5),相比实验 2 的 2000–3500 范围有明显改善。在 50 步内,AMC23 和 MATH-500 的性能相比基线有所提升(分别从 63% 提升到 72.5%,以及从 83% 提升到 84.4%),尽管仍落后于实验 2 的峰值(图 2)。经过 200 步后,混合语言内容仍然存在,这反映了 DeepSeek-R1-Distill-Qwen-1.5B 的多语言特性。

The cosine reward effectively regulates length, but the language issue suggests a need for explicit language constraints or extended completion lengths for complex tasks. We conclude:

余弦奖励有效地调节了长度,但语言问题表明需要对复杂任务施加明确的语言约束或延长完成长度。我们得出结论:

Insight 3

洞察 3

Cosine rewards stabilize completion lengths, improving training consistency, but extending length limits is necessary for extremely hard tasks, particularly with multilingual base models.

余弦奖励稳定了完成长度,提高了训练一致性,但对于极其困难的任务,特别是多语言基础模型,延长长度限制是必要的。


Figure 5: KL divergence (left) and completion length (right) of outputs in Experiment 3 across local steps.

图 5: 实验 3 中输出结果的 KL 散度 (左) 和完成长度 (右) 随局部步骤的变化。

3.5.4 Overall Comparison

3.5.4 总体比较

We select checkpoints at 100, 50, and 50 global steps from Experiments 1, 2, and 3, naming them Open-RS1, Open-RS2, and Open-RS3 (R for Reasoning, S for Small), respectively. These are evaluated against baselines from Section 3.3 across benchmarks from Section 3.2, using zero-shot pass@1 (Table 1).

我们从实验1、2和3中分别选择了100、50和50个全局步骤的检查点,将它们命名为Open-RS1、Open-RS2和Open-RS3(R代表推理,S代表小型)。这些检查点根据第3.3节的基线,在第3.2节的基准上进行评估,使用零样本 pass@1 指标(表1)。

模型 | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench | 平均
通用模型
Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7
o1-preview | 44.6 | 85.5 | – | – | – | –
7B 模型
Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8
rStar-Math-7B | 26.7 | 78.4 | 47.5 | – | 47.1 | –
Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9
Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9
1.5B 模型
DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9
Still-3-1.5B-Preview | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6
DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0
我们的模型
Open-RS1 (100 steps) | 30.0 | 83.8 | 70.0 | 29.0 | 52.4 | 53.0
Open-RS2 (50 steps) | 30.0 | 85.4 | 80.0 | 30.5 | 52.4 | 55.7
Open-RS3 (50 steps) | 46.7 | 84.4 | 72.5 | 26.8 | 51.3 | 56.3

Table 1: Zero-shot pass@1 performance across benchmarks. Bold indicates the highest score per benchmark. Dashes (–) denote unavailable official scores. Scores for o1-preview are sourced from AI (2024b); others from Zeng et al. (2025); Luo et al. (2025). Our models are evaluated using the lighteval package Fourrier et al. (2023).

表 1: 零样本 pass@1 在各个基准测试中的表现。加粗表示每个基准测试中的最高分。短横线 (–) 表示没有可用的官方分数。o1-preview 的分数来自 AI (2024b);其他分数来自 Zeng 等人 (2025);Luo 等人 (2025)。我们的模型使用 lighteval 包 Fourrier 等人 (2023) 进行评估。

Our models outperform most baselines, with average scores of 53.0% (Open-RS1), 55.7% (Open-RS2), and 56.3% (Open-RS3), compared to 57.0% for DeepScaleR-1.5B-Preview. Notably, Open-RS3 achieves the highest AIME24 score (46.7%), surpassing o1-preview (44.6%) and DeepScaleR-1.5B-Preview (43.1%). Open-RS2 excels on AMC23 (80.0%) and ties with Open-RS1 on OlympiadBench (52.4%), both outperforming DeepScaleR-1.5B-Preview. MATH-500 scores remain competitive, though Minerva performance lags behind 7B models, reflecting the complexity of cross-disciplinary reasoning.

我们的模型优于大多数基线,平均得分为 53.0% (Open-RS1)、55.7% (Open-RS2) 和 56.3% (Open-RS3),而 DeepScaleR-1.5B-Preview 的平均得分为 57.0%。值得注意的是,Open-RS3 在 AIME24 上取得了最高分 (46.7%),超过了 o1-preview (44.6%) 和 DeepScaleR-1.5B-Preview (43.1%)。Open-RS2 在 AMC23 上表现出色 (80.0%),并在 OlympiadBench 上与 Open-RS1 并列 (52.4%),两者均优于 DeepScaleR-1.5B-Preview。MATH-500 的得分保持竞争力,尽管 Minerva 的表现落后于 7B 模型,这反映了跨学科推理的复杂性。

We further compare training costs and data efficiency (Tables 2 and 3, and Figure 1). Our approach, using 7000 samples with 6 outputs per step (42,000 total samples), costs approximately $42 on 4x A40 GPUs over 24 hours. In contrast, 7B models like Qwen2.5-7B-SimpleRL ($1633) and Eurus-2-7B-PRIME ($1089), and 1.5B models like DeepScaleR-1.5B-Preview ($3629) and Still-3-1.5B-Preview ($2268), require significantly more resources and data (e.g., 40k × 16 samples for DeepScaleR).

我们进一步比较了训练成本和数据效率(表2和表3,以及图1)。我们的方法使用7000个样本,每个步骤生成6个输出(总计42,000个样本),在4个A40 GPU上运行24小时,成本约为42美元。相比之下,7B模型如Qwen2.5-7B-SimpleRL($1633)和Eurus-2-7B-PRIME($1089),以及1.5B模型如DeepScaleR-1.5B-Preview($3629)和Still-3-1.5B-Preview($2268)需要更多的资源和数据(例如,DeepScaleR需要40k × 16个样本)。
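As a rough sanity check on the stated figure (assuming the $42 is spread evenly over the full 24-hour run on all four GPUs), the implied rental rate is:

$$\frac{\$42}{4\ \text{GPUs}\times 24\ \text{h}}\approx \$0.44\ \text{per A40 GPU-hour}$$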

Table 2: Comparison of data usage and training costs for 7B models. Data are sourced from original papers or GitHub issues addressing the authors' resource constraints.

表 2: 7B 模型的数据使用和训练成本对比。数据来源于原始论文或 GitHub 上作者资源限制的讨论。

 | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL | Open-RS
基础模型 | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | DeepSeek-R1-Distill-Qwen-1.5B
SFT 数据 | 7.3M | 230k | 0 | 0
RM | 无 | Eurus-2-7B-SFT | 无 | 无
RL 数据 | 3.647M × 16 | 150k × 4 | 8k × 8 | 7k × 6
硬件 | 10×8 H100 80GB, 15×4 A100 40GB | 1×8 A100 80GB | 4×6 A100 80GB | 1×4 A40 48GB
时间 | – | 72h | 36h | 24h
成本估算 | – | $1088 | $1633 | $42

Table 3: Comparison of data usage and training costs for 1.5B models. Data are sourced from original papers or GitHub issues addressing the authors' resource constraints.

表 3: 1.5B 模型的数据使用和训练成本对比。数据来源于原始论文或 GitHub 上讨论作者资源限制的问题。

 | DeepScaleR-1.5B-Preview | Still-3-1.5B-Preview | Open-RS
基础模型 | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B
SFT 数据 | 0 | 0 | 0
RL 数据 | 40k × 16 | – | 7k × 6
成本估算 | $3629 | $2268 | $42

Our approach demonstrates that small LLMs can achieve competitive reasoning performance with minimal data and cost, offering a scalable alternative to resource-intensive baselines.

我们的方法表明,小型大语言模型能够以最少的数据和成本实现具有竞争力的推理性能,为资源密集型的基线方法提供了一种可扩展的替代方案。

4 Conclusion

4 结论

Our study investigated enhancing the reasoning abilities of small LLMs using RL, focusing on the 1.5-billion-parameter DeepSeek-R1-Distill-Qwen-1.5B under strict constraints. Adapting the GRPO algorithm and a compact mathematical reasoning dataset, we conducted three experiments to assess behavior and performance under resource limitations. Our findings show small LLMs can achieve significant reasoning gains with minimal resources—e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview—at a cost of $42 versus thousands for baselines. Open-RS variants averaged 53.0%–56.3% on benchmarks, demonstrating RL's viability for small LLMs. Releasing our code and datasets, we provide a framework for lightweight, reasoning-capable models, despite challenges like optimization stability, laying a foundation for future work.

我们的研究探讨了在严格约束下,使用强化学习 (RL) 增强小型大语言模型 (LLM) 的推理能力,重点关注了 15 亿参数的 DeepSeek-R1-Distill-Qwen-1.5B 模型。通过调整 GRPO 算法并使用紧凑的数学推理数据集,我们进行了三项实验,以评估在资源限制下的行为和性能。研究结果表明,小型大语言模型能够在极少的资源下实现显著的推理能力提升——例如,AMC23 的准确率从 63% 提升至 80%,AIME24 达到 46.7%,超过了 o1-preview——而成本仅为 42 美元,远低于基线模型的数千美元。Open-RS 变体在基准测试中的平均表现达到 53.0%–56.3%,证明了强化学习在小型大语言模型中的可行性。我们发布了代码和数据集,为轻量级、具备推理能力的模型提供了一个框架,尽管面临优化稳定性等挑战,但仍为未来的工作奠定了基础。

References

参考文献

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.

Dan Hendrycks、Collin Burns、Saurav Kadavath、Akul Arora、Steven Basart、Eric Tang、Dawn Song 和 Jacob Steinhardt。使用数学数据集测量数学问题解决能力。NeurIPS,2021。

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393.

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, 和 Tatsunori Hashimoto. s1: 简单的测试时缩放, 2025. URL https://arxiv.org/abs/2501.19393.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models, 2021. URL https://arxiv.org/abs/2112.00114.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, 和 Augustus Odena. 展示你的工作:用于大语言模型中间计算的草稿本, 2021. URL https://arxiv.org/abs/2112.00114.

OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.

OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.

OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.

OpenAI. 学习用大语言模型推理, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932–4942, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1487. URL https://aclanthology.org/P19-1487/.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, 和 Richard Socher. 解释你自己!利用语言模型进行常识推理. 在 Anna Korhonen, David Traum, 和 Lluís Màrquez (编), 第57届计算语言学协会年会论文集, 第4932–4942页, 意大利佛罗伦萨, 2019年7月. 计算语言学协会. doi: 10.18653/v1/P19-1487. URL https://aclanthology.org/P19-1487/.

Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380959. doi: 10.1145/3411763.3451760. URL https://doi.org/10.1145/3411763.3451760.

Laria Reynolds 和 Kyle McDonell. 大语言模型的提示编程:超越少样本范式. 在 2021 年 CHI 计算系统人因会议扩展摘要中, CHI EA ’21, 美国纽约州纽约市, 2021. 计算机协会. ISBN 9781450380959. doi: 10.1145/3411763.3451760. URL https://doi.org/10.1145/3411763.3451760.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.

邵志宏, 王培毅, 朱启豪, 徐润新, 宋俊晓, 毕晓, 张浩伟, 张明川, 李永康, 吴毅, 郭大亚. DeepSeek Math: 推动开放语言模型在数学推理中的极限, 2024. URL https://arxiv.org/abs/2402.03300.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, 和 Shunyu Yao. Reflexion: 语言智能体的口头强化学习, 2023.

Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms, 2025a. URL https://arxiv.org/abs/2501.12599.

Kimi Team. Kimi k1.5: 使用大语言模型扩展强化学习, 2025a. URL https://arxiv.org/abs/2501.12599.

RUCAIBox STILL Team. Still-3-1.5b-preview: Enhancing slow thinking abilities of small models through reinforcement learning, 2025b. URL https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.

RUCAIBox STILL 团队。Still-3-1.5b-preview:通过强化学习增强小模型的慢思考能力,2025b。URL https://github.com/RUCAIBox/Slow_Thinking_with_LLMs。

Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

Trieu Trinh, Yuhuai Wu, Quoc Le, He He, 和 Thang Luong. 无需人类演示的奥林匹克几何解题. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

Jonathan Uesato、Nate Kushman、Ramana Kumar、Francis Song、Noah Siegel、Lisa Wang、Antonia Creswell、Geoffrey Irving 和 Irina Higgins。基于过程和结果的反馈解决数学应用题。arXiv 预印本 arXiv:2211.14275,2022。

Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935, 2023.

Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, 和 Zhifang Sui. Math-shepherd: 一种用于大语言模型数学推理的无标签逐步验证器. arXiv 预印本 arXiv:2312.08935, 2023.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, 和 Denny Zhou. 思维链提示在大语言模型中引发推理. 在 S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, 和 A. Oh (编), 神经信息处理系统进展, 第 35 卷, 第 24824–24837 页. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL https://arxiv.org/abs/2408.08152.

Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, 和 Chong Ruan. Deepseek-prover-v1.5: 利用证明助手反馈进行强化学习和蒙特卡洛树搜索, 2024. URL https://arxiv.org/abs/2408.08152.

A Related Work

A 相关工作

A.1 Reasoning in Large Language Models

A.1 大语言模型中的推理

A substantial body of research has investigated methods to enhance the reasoning capabilities and factual accuracy of large language models (LLMs). Early approaches predominantly relied on prompting techniques to elicit structured reasoning. For instance, scratchpad-style prompting encourages models to break down problems into intermediate steps (Nye et al., 2021), while verification mechanisms assess the correctness of generated outputs (Cobbe et al., 2021). Chain-of-thought (CoT) prompting has emerged as a particularly effective strategy, leveraging demonstrations of step-by-step reasoning to improve performance on complex tasks (Wei et al., 2022; Kojima et al., 2022; Reynolds & McDonell, 2021). More recently, techniques such as intermediate self-reflection have been proposed to enable models to iteratively refine their reasoning processes (Shinn et al., 2023; Madaan et al., 2023).

大量研究已经探讨了如何增强大语言模型 (LLMs) 的推理能力和事实准确性。早期方法主要依赖于提示技术来引发结构化推理。例如,草稿式提示鼓励模型将问题分解为中间步骤 (Nye et al., 2021),而验证机制则评估生成输出的正确性 (Cobbe et al., 2021)。思维链 (Chain-of-thought, CoT) 提示作为一种特别有效的策略出现,通过展示逐步推理来提高复杂任务的表现 (Wei et al., 2022; Kojima et al., 2022; Reynolds & McDonell, 2021)。最近,提出了诸如中间自我反思等技术,使模型能够迭代地改进其推理过程 (Shinn et al., 2023; Madaan et al., 2023)。

In parallel, supervised fine-tuning (SFT) has been employed to embed reasoning capabilities directly into LLMs. Studies such as (Lewkowycz et al., 2022a) and (Rajani et al., 2019) demonstrate that fine-tuning on high-quality datasets can enhance problem-solving abilities. Notably, integrating CoT reasoning into SFT has shown significant promise; works like (Zelikman et al., 2022; Muennighoff et al., 2025; Ye et al., 2025) illustrate that fine-tuning with small, carefully curated datasets of CoT examples can yield substantial performance gains. However, these efforts have predominantly focused on large-scale LLMs, typically ranging from 7 billion to over 100 billion parameters. This reliance on massive models limits accessibility and practicality for resource-constrained settings, motivating the exploration of alternative approaches for smaller LLMs.

与此同时,监督微调 (Supervised Fine-Tuning, SFT) 被用于将推理能力直接嵌入到大语言模型中。研究表明,如 (Lewkowycz et al., 2022a) 和 (Rajani et al., 2019) 所示,对高质量数据集进行微调可以增强问题解决能力。值得注意的是,将链式思维 (Chain-of-Thought, CoT) 推理整合到 SFT 中显示出显著的前景;如 (Zelikman et al., 2022; Muennighoff et al., 2025; Ye et al., 2025) 等研究表明,使用少量精心策划的 CoT 示例数据集进行微调可以带来显著的性能提升。然而,这些努力主要集中在规模较大的大语言模型上,通常参数范围从 70 亿到超过 1000 亿。这种对大规模模型的依赖限制了在资源受限环境中的可访问性和实用性,这促使人们探索适用于较小大语言模型的替代方法。

A.2 Reasoning with Reinforcement Learning

A.2 强化学习推理

Reinforcement learning (RL) has emerged as a powerful paradigm for improving reasoning in LLMs, particularly for tackling complex, multi-step problems. Unlike SFT, which often optimizes for imitation of training data, RL enables models to learn from feedback, enhancing generalization to both in-distribution and out-of-distribution tasks (Chu et al., 2025; Yeo et al., 2025). Recent advancements underscore the efficacy of RL in this domain. For example, OpenAI (2024b) and DeepSeek-AI (2025) demonstrate that RL-based training can significantly boost reasoning performance, while Team (2025a) explores scaling laws for RL-driven LLMs. These studies highlight RL’s ability to refine decision-making processes by optimizing for task-specific rewards, such as correctness or logical coherence.

强化学习 (Reinforcement Learning, RL) 已成为提升大语言模型 (LLM) 推理能力的强大范式,尤其是在解决复杂的多步骤问题方面。与通常优化训练数据模仿的监督微调 (SFT) 不同,RL 使模型能够从反馈中学习,从而增强对分布内和分布外任务的泛化能力 (Chu et al., 2025; Yeo et al., 2025)。最近的进展强调了 RL 在这一领域的有效性。例如,OpenAI (2024b) 和 DeepSeek-AI (2025) 展示了基于 RL 的训练可以显著提升推理性能,而 Team (2025a) 则探索了 RL 驱动的大语言模型的扩展规律。这些研究强调了 RL 通过优化任务特定奖励(如正确性或逻辑一致性)来优化决策过程的能力。

Despite these advances, RL-based methods are not without limitations. They typically demand substantial computational resources, often exceeding those required for SFT, and are predominantly applied to large LLMs. This focus on scale renders RL impractical for smaller models and restricts its adoption outside well-resourced organizations, such as major technology firms. Furthermore, privacy concerns arise when deploying such models, as self-hosting becomes infeasible for most academic or industrial entities with limited infrastructure. Consequently, there remains a critical gap in the literature: the application of RL to enhance reasoning in small LLMs under resource and privacy constraints.

尽管取得了这些进展,基于强化学习 (RL) 的方法并非没有局限性。它们通常需要大量的计算资源,往往超过监督微调 (SFT) 所需,并且主要应用于大型大语言模型。这种对规模的关注使得强化学习对于较小的模型不切实际,并限制了其在资源充足的组织(如大型科技公司)之外的采用。此外,部署此类模型时会出现隐私问题,因为对于大多数基础设施有限的学术或工业实体来说,自托管变得不可行。因此,文献中仍然存在一个关键空白:在资源和隐私限制下,应用强化学习来增强小型大语言模型的推理能力。

B Limitations & Discussion

B 限制与讨论

While our study demonstrates the promise of RL-based fine-tuning for enhancing the reasoning abilities of small LLMs, several limitations and broader implications warrant discussion. These insights not only contextualize our findings but also highlight avenues for future research.

虽然我们的研究表明基于强化学习(RL)的微调在增强小型大语言模型的推理能力方面具有潜力,但仍存在一些局限性和更广泛的影响值得讨论。这些见解不仅为我们的发现提供了背景,还突出了未来研究的方向。

B.1 Limitations

B.1 局限性

First, our experiments were constrained by a 24-hour training window on a modest cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each), limiting the number of global steps (e.g., 500 in Experiment 1 versus a potential 1500 for one epoch). This restriction curtailed our ability to fully explore the long-term behavior of the model, particularly beyond 200 steps, where performance degradation and multilingual outputs emerged. Second, the maximum completion length (4096 tokens in Experiment 1, reduced to 3584 in Experiments 2 and 3) proved insufficient for extremely hard problems in the open-s1 dataset, forcing the model to truncate reasoning processes prematurely. This suggests that our methodology may underexploit the potential of small LLMs on complex tasks requiring extended reasoning chains.

首先,我们的实验受到了一组4个NVIDIA A40 GPU(每个48 GB显存)的集群上24小时训练窗口的限制,这限制了全局步数(例如,实验1中的500步,而一个epoch可能达到1500步)。这一限制削弱了我们充分探索模型长期行为的能力,特别是在超过200步之后,性能下降和多语言输出开始出现。其次,最大完成长度(实验1中为4096个token,实验2和3中减少到3584个token)在open-s1数据集的极难问题上显得不足,迫使模型过早地截断推理过程。这表明我们的方法可能在需要扩展推理链的复杂任务上未能充分发挥小型大语言模型的潜力。

Third, the multilingual nature of the base model, DeepSeek-R1-Distill-Qwen-1.5B, introduced unintended language drift after 150–200 steps, despite efforts to enforce English-only outputs via prompts in Experiment 3. This limitation reflects a trade-off in using a pretrained, multilingual foundation, which, while efficient, complicates monolingual optimization. Finally, our evaluation focused exclusively on mathematical reasoning benchmarks, leaving the generalizability of our approach to other domains—such as scientific reasoning or coding—unexplored. These constraints highlight the need for cautious interpretation of our results within the specified scope.

第三,基础模型 DeepSeek-R1-Distill-Qwen-1.5B 的多语言特性在 150-200 步后引入了意外的语言漂移,尽管在实验 3 中通过提示强制输出仅限英语。这一限制反映了使用预训练的多语言基础模型的权衡,虽然高效,但使单语言优化变得复杂。最后,我们的评估仅专注于数学推理基准,未探索我们方法在其他领域(如科学推理或编码)的泛化能力。这些限制强调了在指定范围内谨慎解释我们结果的必要性。

B.2 Discussion

B.2 讨论

Our findings reveal a nuanced trade-off between efficiency and reasoning depth in small LLMs. The rapid performance gains observed in the first 50–100 steps across all experiments (Insight 1) suggest that small, high-quality datasets can effectively bootstrap reasoning capabilities, aligning with prior work on data efficiency in RL (Chu et al., 2025). However, the subsequent degradation underscores a sensitivity to over-optimization under fixed length constraints, a challenge also noted in larger models like DeepSeek-R1 (DeepSeek-AI, 2025). Experiment 2's success with mixed difficulty levels (Insight 2) indicates that curriculum-like strategies could mitigate this. Meanwhile, the cosine reward's stabilizing effect in Experiment 3 (Insight 3) suggests a promising direction for controlling reasoning verbosity, though it sacrifices peak accuracy compared to Experiment 2.

我们的研究揭示了小型大语言模型在效率和推理深度之间的微妙权衡。在所有实验中,前50-100步观察到的快速性能提升(洞察1)表明,小型高质量数据集可以有效引导推理能力,这与之前关于强化学习中数据效率的研究一致(Chu等,2025)。然而,随后的性能下降突显了在固定长度约束下对过度优化的敏感性,这一挑战在DeepSeek-R1等大型模型中也得到了注意(DeepSeekAI,2025)。实验2在混合难度水平上的成功(洞察2)表明,类似课程学习的策略可以缓解这一问题。同时,实验3中余弦奖励的稳定效果(洞察3)表明,控制推理冗长性是一个有前景的方向,尽管与实验2相比,它牺牲了峰值准确性。

Comparatively, our Open-RS variants achieved performance rivaling or exceeding state-of-the-art 1.5B models (e.g., DeepScaleR-1.5B-Preview) and even some 7B models, at a fraction of the cost and data volume. This efficiency challenges the prevailing reliance on massive datasets and computational resources in reasoning enhancement (OpenAI, 2024b; Luo et al., 2025), offering a scalable alternative for resource-constrained environments. However, the persistent multilingual drift and length limitations point to inherent challenges in adapting multilingual base models and optimizing for complex tasks within tight bounds.

相比之下,我们的 Open-RS 变体在性能和成本上均达到了与最先进的 1.5B 模型(例如 DeepScaleR-1.5B-Preview)相当甚至超越的水平,甚至在某些情况下超越了部分 7B 模型,同时仅需极低的成本和数据量。这种效率挑战了当前在推理增强领域对大规模数据集和计算资源的依赖(OpenAI, 2024b; Luo et al., 2025),为资源受限的环境提供了一种可扩展的替代方案。然而,持续的多语言漂移和长度限制表明,在适应多语言基础模型和优化复杂任务时,仍然存在固有的挑战。

B.3 Future Directions

B.3 未来方向

These limitations suggest several research avenues. First, extending training duration or employing multi-stage length schedules could address truncation issues, allowing the model to handle harder problems without compromising stability. Second, incorporating a lightweight language reward or monolingual pre-filtering of the base model might mitigate language drift, enhancing output consistency. Third, expanding the benchmark suite to include non-mathematical domains would test the generalizability of our approach, aligning with broader AGI goals. Finally, exploring hybrid methods—such as combining GRPO with search algorithms like MCTS (Feng et al., 2024)—could further deepen reasoning capacity without significantly increasing resource demands.

这些限制提出了几个研究方向。首先,延长训练时间或采用多阶段长度调度可以解决截断问题,使模型能够处理更复杂的问题而不影响稳定性。其次,引入轻量级的语言奖励或对基础模型进行单语预过滤可能会减轻语言漂移,增强输出的一致性。第三,扩展基准测试套件以包括非数学领域,将测试我们方法的泛化能力,与更广泛的通用人工智能目标保持一致。最后,探索混合方法——例如将GRPO与MCTS等搜索算法结合(Feng et al., 2024)——可以在不显著增加资源需求的情况下进一步深化推理能力。

In conclusion, our work demonstrates that RL-based fine-tuning can unlock substantial reasoning potential in small LLMs, even under stringent constraints. By identifying key tradeoffs and offering practical insights, we pave the way for developing efficient, reasoningcapable models that balance performance and accessibility—a critical step toward democratizing advanced AI technologies.

总之,我们的工作表明,基于强化学习(RL)的微调可以在严格的约束条件下释放小型大语言模型的巨大推理潜力。通过识别关键权衡并提供实用见解,我们为开发高效、具备推理能力的模型铺平了道路,这些模型在性能和可访问性之间取得了平衡——这是实现先进AI技术民主化的关键一步。

C Datasets

C 数据集

This appendix details the benchmark datasets used in Section 3.2.

本附录详细介绍第3.2节中使用的基准数据集。

Table 4 summarizes the datasets and their sample sizes. This diverse collection ensures a comprehensive assessment of the model’s reasoning generalization across problem types and difficulty levels.

表 4 总结了数据集及其样本量。这一多样化的集合确保了模型在问题类型和难度级别上的推理泛化能力的全面评估。

Table 4: Benchmark Datasets and Sample Sizes for Evaluation

表 4: 评估基准数据集及样本量

数据集 | 样本量
AIME24 | 30
MATH-500 | 500
AMC23 | 40
Minerva | 272
OlympiadBench | 675

D Baseline Models

D 基线模型

This appendix describes the baseline models introduced in Section 3.3.

本附录描述第3.3节中介绍的基线模型。

• General-Purpose Large Models:

• 通用大模型

– Llama-3.1-70B-Instruct (AI, 2024a): A 70B-parameter model optimized for instruction-following.
– o1-preview (AI, 2024b): A high-performing reasoning model from OpenAI.

– Llama-3.1-70B-Instruct (AI, 2024a): 一个拥有700亿参数的模型,专为指令跟随任务优化。
– o1-preview (AI, 2024b): OpenAI 推出的高性能推理模型。

• Mathematics-Focused 7B Models:

• 数学导向的 7B 模型:

• Mathematics-Focused 1.5B Models:

• 数学导向的 1.5B 模型:

E Hyperparameter Setup

E 超参数设置

Table 5 shows the parameters used in the training phase.

表 5 展示了训练阶段使用的参数。

Table 5: Hyperparameter Setups for GRPO Trainer

表 5: GRPO 训练器的超参数设置

参数 | 值
通用设置
bf16 | true
use_vllm | true
vllm_device | auto
vllm_enforce_eager | true
vllm_gpu_memory_utilization | 0.7
vllm_max_model_len | 4608
do_eval | false
output_dir | data/OpenRS-GRPO
overwrite_output_dir | true
训练配置
gradient_accumulation_steps | –
gradient_checkpointing | –