[论文翻译]小规模大语言模型中的强化学习推理:有效与无效之处


原文地址:https://arxiv.org/pdf/2503.16219

代码地址:https://github.com/knoveleng/open-rs

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t

小规模大语言模型中的强化学习推理:有效与无效之处

Abstract

摘要

Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains—e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview—using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.

提升大语言模型 (LLM) 的推理能力通常依赖于大量的计算资源和广泛的数据集,这在资源受限的环境中限制了其可访问性。我们的研究探讨了强化学习 (RL) 在提升小型 LLM 推理能力方面的潜力,重点关注一个 15 亿参数的模型 DeepSeek-R1-Distill-Qwen-1.5B,在严格的约束条件下:在 4 个 NVIDIA A40 GPU(每个 48 GB VRAM)上训练 24 小时。我们采用了 Group Relative Policy Optimization (GRPO) 算法,并精心策划了一个紧凑且高质量的数学推理数据集,进行了三项实验以探索模型的行为和性能。我们的结果表明,推理能力迅速提升——例如,AMC23 的准确率从 63% 上升到 80%,AIME24 达到了 46.7%,超过了 o1-preview——仅使用了 7,000 个样本和 42 美元的训练成本,而基线模型的成本则高达数千美元。然而,随着训练时间的延长,出现了优化不稳定性和长度限制等挑战。这些发现突显了基于 RL 的微调在小型 LLM 中的有效性,为大规模方法提供了一种经济高效的替代方案。我们发布了代码和数据集作为开源资源,提供了对权衡的见解,并为在资源有限的环境中构建可扩展的、具备推理能力的 LLM 奠定了基础。所有资源均可在 https://github.com/knoveleng/open-rs 获取。

1 Introduction

1 引言

Recent advancements in large language models (LLMs) have significantly advanced the pursuit of artificial general intelligence (AGI), with models such as GPT-4o (OpenAI, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), and Gemini 1.5 (Google, 2024) demonstrating unprecedented capabilities. A pivotal aspect of this progress is the integration of post-training techniques into the training pipeline. These methods—including supervised fine-tuning (SFT) and reinforcement learning (RL)—enhance reasoning accuracy, align models with societal values, and adapt them to user preferences, all while demanding fewer computational resources than pre-training (OpenAI, 2024b). A notable innovation in this domain is OpenAI’s o1 series, which leverages inference-time scaling through extended Chain-of-Thought (CoT) reasoning to achieve remarkable performance in mathematics, coding, and scientific reasoning tasks (OpenAI, 2024b). However, despite these breakthroughs, scaling reasoning capabilities at test time remains a persistent challenge for the broader research community, largely due to limited access to proprietary methodologies and resources.

近年来,大语言模型 (LLM) 的显著进展极大地推动了通用人工智能 (AGI) 的追求,诸如 GPT-4o (OpenAI, 2024a)、Claude 3.5 Sonnet (Anthropic, 2024) 和 Gemini 1.5 (Google, 2024) 等模型展示了前所未有的能力。这一进展的关键在于将训练后技术整合到训练流程中。这些方法——包括监督微调 (SFT) 和强化学习 (RL)——提高了推理准确性,使模型与社会价值观保持一致,并根据用户偏好进行调整,同时比预训练所需的计算资源更少 (OpenAI, 2024b)。该领域的一个显著创新是 OpenAI 的 o1 系列,它通过扩展的思维链 (CoT) 推理利用推理时扩展,在数学、编码和科学推理任务中取得了显著性能 (OpenAI, 2024b)。然而,尽管取得了这些突破,在测试时扩展推理能力仍然是更广泛研究社区面临的持续挑战,这主要是由于对专有方法和资源的访问有限。

Efforts to bolster LLM reasoning have explored diverse strategies. Process-based reward models (Uesato et al., 2022; Lightman et al., 2023a; Wang et al., 2023) guide models toward structured problem-solving, while RL approaches (Kumar et al., 2024) optimize performance through feedback-driven learning. Search algorithms, such as Monte Carlo Tree Search (MCTS) and Beam Search, have also been employed to enhance reasoning depth (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024). Although these methods have driven incremental gains, they fall short of the general reasoning prowess exhibited by the o1 series. Recently, the DeepSeek-R1 model (DeepSeek-AI, 2025) has emerged as a competitive alternative, utilizing RL with the Group Relative Policy Optimization (GRPO) algorithm. Built on the 671-billion-parameter DeepSeek-V3, DeepSeek-R1 matches o1’s reasoning performance (DeepSeek-AI, 2025). Yet, the sheer scale and computational demands of such models—often exceeding hundreds of billions of parameters—render them impractical for self-hosting by most organizations outside major technology firms, limiting their broader adoption.

为了增强大语言模型的推理能力,研究者们探索了多种策略。基于过程的奖励模型 (Uesato et al., 2022; Lightman et al., 2023a; Wang et al., 2023) 引导模型进行结构化问题解决,而强化学习 (Reinforcement Learning, RL) 方法 (Kumar et al., 2024) 则通过反馈驱动的学习来优化性能。搜索算法,如蒙特卡洛树搜索 (Monte Carlo Tree Search, MCTS) 和束搜索 (Beam Search),也被用于增强推理深度 (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024)。尽管这些方法带来了逐步的改进,但它们仍无法与 o1 系列所展现的通用推理能力相媲美。最近,DeepSeek-R1 模型 (DeepSeek-AI, 2025) 作为一种竞争性替代方案出现,它利用强化学习结合群组相对策略优化 (Group Relative Policy Optimization, GRPO) 算法。基于 6710 亿参数的 DeepSeek-V3,DeepSeek-R1 在推理性能上与 o1 相当 (DeepSeek-AI, 2025)。然而,这些模型的庞大规模和计算需求——通常超过数千亿参数——使得大多数非大型科技公司难以自托管,限制了它们的广泛应用。

In contrast, small LLMs, typically ranging from 1 to 10 billion parameters, present a resource-efficient alternative with potential for widespread deployment. Previous studies have demonstrated the feasibility of enhancing small LLMs through RL-based fine-tuning inspired by DeepSeek-R1 (Luo et al., 2025; Team, 2025b). However, these efforts often rely on expansive datasets (hundreds of thousands to millions of samples) or incur significant computational costs, undermining their accessibility for resource-constrained settings. This tension motivates two central research questions:

相比之下,小型大语言模型(LLM),通常参数规模在10亿到100亿之间,提供了一种资源高效的替代方案,具有广泛部署的潜力。先前的研究已经证明了通过基于强化学习(RL)的微调来增强小型大语言模型的可行性,这一方法受到了DeepSeek-R1的启发(Luo等,2025;Team,2025b)。然而,这些努力通常依赖于庞大的数据集(数十万到数百万样本)或产生显著的计算成本,从而削弱了它们在资源受限环境中的可访问性。这种矛盾激发了两个核心研究问题:

These questions naturally extend to a practical inquiry: If viable, how should such an approach be implemented for small LLMs, and if not, what are the fundamental limitations? Addressing these, we investigate the reasoning capacity of a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under stringent constraints: training on a cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each) within a 24-hour window. Our methodology adapts the GRPO-based RL framework from DeepSeek-R1, tailoring it to the resource-limited context of small LLMs. We assess performance on a suite of mathematical reasoning benchmarks, a domain requiring structured, logical problem-solving that serves as a robust testbed for reasoning ability.

这些问题自然延伸出一个实际的探究:如果可行,如何为小型大语言模型实施这种方法?如果不可行,其根本限制是什么?针对这些问题,我们研究了在严格约束条件下,一个拥有15亿参数的模型DeepSeek-R1-Distill-Qwen-1.5B的推理能力:在4台NVIDIA A40 GPU(每台48 GB显存)的集群上,在24小时内完成训练。我们的方法采用了基于GRPO的强化学习框架,并将其调整为适用于资源有限的小型大语言模型。我们在一系列数学推理基准上评估了性能,这一领域需要结构化的逻辑问题解决能力,是推理能力的强大测试平台。


Figure 1: Comparison of zero-shot pass@1 performance versus model size (left) and computational cost (right). Our Open-RS (red point) achieves the highest AIME24 score (46.7%), outperforming o1-preview (44.6%) and other models (green points). Additionally, Open-RS models exhibit the lowest computational cost at approximately $42.

图 1: 零样本 pass@1 性能与模型大小(左)和计算成本(右)的比较。我们的 Open-RS(红点)达到了最高的 AIME24 分数 (46.7%),优于 o1-preview (44.6%) 和其他模型(绿点)。此外,Open-RS 模型的计算成本最低,约为 42 美元。

Our study yields three primary contributions:

我们的研究主要有以下三点贡献:

Our findings illuminate the promise of RL-based methods to enhance small LLMs’ reasoning capabilities, achieving competitive performance with minimal resources (Figure 1). Simultaneously, they reveal critical challenges—such as data efficiency, optimization stability, and length constraints—that must be addressed to fully realize this potential. These insights lay the groundwork for developing lightweight, reasoning-capable LLMs suitable for resource-constrained environments, advancing the democratization of advanced AI technologies.

我们的研究结果揭示了基于强化学习(RL)的方法在提升小型大语言模型推理能力方面的潜力,以最少的资源实现了具有竞争力的性能(图 1)。同时,这些方法也揭示了必须解决的关键挑战——如数据效率、优化稳定性和长度限制——以充分发挥这一潜力。这些见解为开发适合资源受限环境的轻量级、具备推理能力的大语言模型奠定了基础,推动了先进AI技术的民主化进程。

The remainder of this paper is structured as follows: Section 2 details our methodology, including data curation, RL algorithm, and reward design; Section 3 presents three experiments, their results, and comparative analyses; and Section 4 summarizes key findings. Additional details, including related work, discussion, hyperparameter setups, and supplementary results, are provided in the Appendix.

本文的其余部分结构如下:第2节详细介绍了我们的方法,包括数据整理、强化学习算法和奖励设计;第3节展示了三个实验、其结果以及比较分析;第4节总结了主要发现。附录中提供了更多细节,包括相关工作、讨论、超参数设置和补充结果。

2 Methodology

2 方法论

In this section, we outline our approach to optimizing the reasoning capabilities of small large language models (LLMs) under computational constraints. Our methodology comprises two primary components: (1) the curation of a high-quality, mathematics-focused dataset, and (2) the application of a resource-efficient reinforcement learning (RL) algorithm. These components are designed to balance performance gains with practical limitations, such as reduced computational overhead and privacy considerations.

在本节中,我们概述了在计算约束下优化小型大语言模型 (LLM) 推理能力的方法。我们的方法包括两个主要部分:(1) 构建一个高质量、以数学为重点的数据集,(2) 应用一种资源高效的强化学习 (RL) 算法。这些组件的设计旨在平衡性能提升与实际限制,例如减少计算开销和隐私考虑。

2.1 High-Quality Dataset Curation

2.1 高质量数据集构建

To minimize training costs while maximizing reasoning performance, we curate a compact, high-quality dataset tailored to mathematical reasoning. This dataset is derived from two existing sources: the s1 dataset (Muennighoff et al., 2025) and the DeepScaleR dataset (Luo et al., 2025). By filtering and refining these datasets, we ensure that our training data is both relevant and challenging, enabling efficient learning for small LLMs.

为了在最小化训练成本的同时最大化推理性能,我们精心策划了一个紧凑且高质量的数据集,专门针对数学推理。该数据集源自两个现有来源:s1 数据集 (Muennighoff et al., 2025) 和 DeepScaleR 数据集 (Luo et al., 2025)。通过对这些数据集进行过滤和精炼,我们确保训练数据既相关又具有挑战性,从而为小型大语言模型提供高效的学习机会。

s1 Dataset The s1 dataset (Muennighoff et al., 2025) is a general-purpose reasoning corpus comprising 59,029 questions sourced from diverse domains, including NuminaMATH (LI et al., 2024), AIME problems (1983–2021), OlympicArena (Huang et al., 2024), OmniMath (Gao et al., 2024), AGIEval (Zhong et al., 2023), probability questions from Stanford University's Statistics Department PhD Qualifying Exams (https://statistics.stanford.edu), and brain-teasers from PuzzledQuant (https://www.puzzledquant.com). Although the dataset spans multiple disciplines—such as Astronomy, Biology, Chemistry, Computer Science, Geography, Mathematics, and Physics—our focus is exclusively on mathematical reasoning.

s1 数据集
s1 数据集 (Muennighoff 等人, 2025) 是一个通用推理语料库,包含来自多个领域的 59,029 个问题,涵盖 NuminaMATH (LI 等人, 2024)、AIME 问题 (1983–2021)、OlympicArena (Huang 等人, 2024)、OmniMath (Gao 等人, 2024)、AGIEval (Zhong 等人, 2023)、斯坦福大学统计系博士资格考试中的概率问题 (https://statistics.stanford.edu) 以及 PuzzledQuant 的脑筋急转弯 (https://www.puzzledquant.com)。尽管该数据集涵盖多个学科——如天文学、生物学、化学、计算机科学、地理学、数学和物理学——但我们的研究重点仅集中在数学推理上。

To isolate mathematics-specific examples, we adopt a filtering workflow inspired by (Muennighoff et al., 2025). First, we retain only questions with solutions containing the LaTeX command \boxed{}, a common indicator of mathematical answers, reducing the dataset to 31,323 examples. Next, we employ the distilled model DeepSeek-R1-Distill-Qwen-1.5B to eliminate trivial questions, yielding 21,533 examples. Finally, to ensure data quality, we use Qwen2.5-7B-Instruct to remove noisy or multi-part questions, resulting in a final set of 18,615 high-quality mathematical reasoning examples – the open-s1 dataset.

为了隔离数学相关的示例,我们采用了受 (Muennighoff et al., 2025) 启发的过滤工作流程。首先,我们仅保留解决方案中包含 LaTeX 命令 \boxed{} 的问题,这是数学答案的常见指示符,将数据集减少到 31,323 个示例。接下来,我们使用蒸馏模型 DeepSeek-R1-Distill-Qwen-1.5B 来消除简单问题,得到 21,533 个示例。最后,为了确保数据质量,我们使用 Qwen2.5-7B-Instruct 去除噪声或多部分问题,最终得到 18,615 个高质量的数学推理示例——open-s1 数据集。
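To make the three-stage filter concrete, the sketch below expresses it with the Hugging Face datasets library. The file path and the judge_is_trivial/judge_is_clean callables (which would wrap DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct) are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of the open-s1 filtering workflow described above. Assumptions: the raw
# s1 pool lives in a local JSON file with "question"/"solution" fields, and the
# two judge callables wrap the LLM-based difficulty and quality checks.
from datasets import load_dataset

def has_boxed_answer(example):
    # Stage 1: keep only solutions containing the LaTeX \boxed{} marker.
    return "\\boxed" in example["solution"]

def build_open_s1(path, judge_is_trivial, judge_is_clean):
    ds = load_dataset("json", data_files=path, split="train")        # ~59k questions
    ds = ds.filter(has_boxed_answer)                                  # -> ~31.3k
    ds = ds.filter(lambda ex: not judge_is_trivial(ex["question"]))   # -> ~21.5k
    ds = ds.filter(lambda ex: judge_is_clean(ex["question"]))         # -> ~18.6k (open-s1)
    return ds
```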

DeepScaleR Dataset The DeepScaleR dataset (Luo et al., 2025) contains 40,315 mathematics-specific questions drawn from AIME (1984–2023), AMC (prior to 2023), OmniMATH, and the Still dataset. Unlike the s1 dataset, DeepScaleR is pre-filtered to focus solely on mathematics, with redundant questions removed and solutions extracted from raw text using retrieval-augmented generation (RAG) and advanced LLMs like Gemini-1.5-Pro-002. To further refine this dataset, we apply Qwen2.5-Math-7B-Instruct to exclude easy questions, reducing the set to 21,044 examples – open-deepscaler dataset. We opt for Qwen2.5-Math-7B-Instruct over DeepSeek-R1-Distill-Qwen-1.5B—used for the s1 dataset—to introduce diversity in filtering criteria and avoid excessive overlap between the two datasets.

DeepScaleR 数据集
DeepScaleR 数据集 (Luo et al., 2025) 包含从 AIME (1984–2023)、AMC (2023 年之前)、OmniMATH 和 Still 数据集中提取的 40,315 个数学相关题目。与 s1 数据集不同,DeepScaleR 经过预过滤,专注于数学领域,冗余题目已被移除,并通过检索增强生成 (RAG) 和 Gemini-1.5-Pro-002 等先进的大语言模型从原始文本中提取解答。为了进一步优化该数据集,我们使用 Qwen2.5-Math-7B-Instruct 排除了简单题目,将数据集缩减至 21,044 个样本——即 open-deepscaler 数据集。我们选择 Qwen2.5-Math-7B-Instruct 而非 s1 数据集中使用的 DeepSeek-R1-Distill-Qwen-1.5B,以引入过滤标准的多样性,并避免两个数据集之间的过度重叠。

Final Dataset Combining the refined open-s1 dataset (18,615 examples) and open-deepscaler (21,044 examples), we obtain a final high-quality dataset of 39,659 mathematical reasoning questions. This curated corpus strikes a balance between scale and specificity, enabling effective training of small LLMs under resource constraints.

最终数据集结合精炼后的 open-s1 数据集(18,615 个示例)和 open-deepscaler(21,044 个示例),我们获得了包含 39,659 个数学推理问题的高质量最终数据集。这个精选的语料库在规模和特异性之间取得了平衡,能够在资源受限的情况下有效训练小型大语言模型。

2.2 Reinforcement Learning Algorithm

2.2 强化学习算法

To train small LLMs efficiently, we adopt the Group Relative Policy Optimization (GRPO) algorithm Shao et al. (2024), as utilized in DeepSeek-AI (2025). GRPO eliminates the need for a separate critic model—typically as large as the policy model—by estimating baselines from group scores, thereby reducing computational overhead. For each question $q$, GRPO samples a group of $G$ outputs $\{o_{1}, o_{2}, \ldots, o_{G}\}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and optimizes the policy $\pi_{\theta}$ by maximizing the following objective:

为了高效训练小型大语言模型,我们采用了 Group Relative Policy Optimization (GRPO) 算法 Shao et al. (2024),如 DeepSeek-AI (2025) 中所使用的那样。GRPO 通过从组分数中估计基线,消除了对单独的评论家 (critic) 模型的需求——该模型通常与策略模型一样大——从而减少了计算开销。对于每个问题 $q$,GRPO 从旧策略 $\pi_{\theta_{\mathrm{old}}}$ 中采样一组 $G$ 个输出 $\{o_{1}, o_{2}, \ldots, o_{G}\}$,并通过最大化以下目标来优化策略 $\pi_{\theta}$:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q\sim P(Q),\,\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(O\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_{i}\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_{i}\mid q)}A_{i},\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_{i}\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_{i}\mid q)},1-\epsilon,1+\epsilon\right)A_{i}\right)-\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)\right)\right]$$

where the KL-divergence term is defined as:

其中 KL 散度项定义为:

$$\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)=\frac{\pi_{\mathrm{ref}}(o_{i}\mid q)}{\pi_{\theta}(o_{i}\mid q)}-\log\frac{\pi_{\mathrm{ref}}(o_{i}\mid q)}{\pi_{\theta}(o_{i}\mid q)}-1
$$

and the advantage $A_{i}$ is computed from a group of rewards ${r_{1},r_{2},\ldots,r_{G}}$ :

优势 $A_{i}$ 是从一组奖励 ${r_{1},r_{2},\ldots,r_{G}}$ 中计算得出的:

$$A_{i}=\frac{r_{i}-\mathrm{mean}(\{r_{1},r_{2},\ldots,r_{G}\})}{\mathrm{std}(\{r_{1},r_{2},\ldots,r_{G}\})}$$

Here, $\epsilon$ and $\beta$ are hyperparameters controlling the clipping range and KL penalty, respectively.

这里,$\epsilon$ 和 $\beta$ 是分别控制裁剪范围和KL惩罚的超参数。
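For illustration, the snippet below computes the group-relative advantage and the clipped GRPO surrogate for one question at the sequence level. It is a simplified sketch (in practice the objective is applied token-wise), and the default values for $\epsilon$ and $\beta$ are plausible assumptions rather than the paper's settings.

```python
# Minimal sequence-level sketch of the GRPO objective above; logp_* are the
# summed log-probabilities of the G sampled outputs under the current, old,
# and reference policies, and rewards holds r_1..r_G.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    # Group-relative advantage: standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped importance-weighted surrogate against the old policy.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased estimator of the KL term against the reference policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    # GRPO maximizes the objective, so the training loss is its negative mean.
    return -(surrogate - beta * kl).mean()
```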

Reward Models The reward function is critical to guiding RL optimization. We employ a rule-based reward system comprising three components, designed to balance correctness, efficiency, and structure without relying on resource-intensive neural reward models:

奖励模型
奖励函数对于指导强化学习(RL)优化至关重要。我们采用了一个基于规则的奖励系统,该系统由三个部分组成,旨在在不依赖资源密集型神经奖励模型的情况下平衡正确性、效率和结构:

3 Experiments

3 实验

To address the research questions outlined in Section 1—namely, how reinforcement learning (RL) can enhance the reasoning abilities of small large language models (LLMs) and what practical insights emerge under computational constraints—we design three experiments to analyze the training behavior of small LLMs. These experiments aim to provide empirical evidence of performance improvements and offer actionable guidance for future research and industrial applications.

为了解决第1节中概述的研究问题——即强化学习(RL)如何增强小型大语言模型(LLMs)的推理能力,以及在计算限制下会得出哪些实际见解——我们设计了三个实验来分析小型LLMs的训练行为。这些实验旨在提供性能改进的实证证据,并为未来的研究和工业应用提供可操作的指导。

3.1 Experimental Setup

3.1 实验设置

We select DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025) as our base model for training. This 1.5-billion-parameter model, distilled from larger architectures, is chosen for its balance of efficiency and reasoning potential. Notably, we bypass the supervised fine-tuning (SFT) phase—typically a precursor to RL for performance enhancement (Chu et al., 2025)—hypothesizing that the model's pre-training is sufficient to leverage RL directly. For the RL phase, we employ the Group Relative Policy Optimization (GRPO) algorithm, as detailed in Section 2.2, due to its computational efficiency.

我们选择 DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025) 作为训练的基础模型。这个拥有 15 亿参数的模型是从更大的架构中蒸馏出来的,因其效率和推理潜力的平衡而被选中。值得注意的是,我们跳过了监督微调 (SFT) 阶段——通常是强化学习 (RL) 性能增强的前置步骤 (Chu et al., 2025)——假设模型的预训练足以直接利用 RL。在 RL 阶段,我们采用了组相对策略优化 (GRPO) 算法,如第 2.2 节所述,因其计算效率高。

Training is conducted on a cluster of 4 NVIDIA A40 GPUs (48GB VRAM each), imposing constraints that limit us to sampling 6 outputs per step with a maximum completion length of 4096 tokens. To facilitate this, we adapt open-r1 (Hugging Face, 2025), an open-source reproduction of DeepSeek-R1 by the Hugging Face team, customizing it to align with our objectives. The training phase is restricted to 1 epoch, completed within a 24-hour window, reflecting real-world resource limitations. Hyperparameters and additional configurations are detailed in Appendix E.

训练在4个NVIDIA A40 GPU(每个48GB显存)的集群上进行,这限制了我们在每一步只能采样6个输出,且最大完成长度为4096个token。为此,我们采用了Hugging Face团队开源的DeepSeek-R1复现项目open-r1 (Hugging Face, 2025),并对其进行了定制以符合我们的目标。训练阶段限制为1个epoch,并在24小时内完成,反映了现实世界的资源限制。超参数和其他配置详见附录E。
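A minimal training entry point, assuming the TRL GRPOTrainer interface that open-r1 builds on, might look as follows. The dataset file and the toy reward function are placeholders; only the constraints stated above (6 samples per question, 4096-token completions, 1 epoch) are taken from the paper.

```python
# Sketch of a GRPO training run under the stated constraints; the dataset file
# and reward function are hypothetical placeholders, not the authors' exact setup.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Toy rule-based reward in TRL's batch signature: bonus for <think> blocks.
    return [0.5 if re.search(r"<think>.*?</think>", c, re.DOTALL) else 0.0
            for c in completions]

train_dataset = load_dataset("json", data_files="open_rs_mix.json", split="train")

args = GRPOConfig(
    output_dir="data/OpenRS-GRPO",
    num_generations=6,            # G = 6 sampled outputs per question
    max_completion_length=4096,   # completion budget used in Experiment 1
    num_train_epochs=1,           # single epoch within the 24-hour window
    bf16=True,
    use_vllm=True,                # vLLM-backed generation, as in open-r1
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=[format_reward],
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```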

3.2 Benchmark Datasets

3.2 基准数据集

To evaluate the reasoning capabilities of our small LLM, we choose five mathematics-focused benchmark datasets: AIME24, MATH-500 (Lightman et al., 2023b; Hendrycks et al., 2021), AMC23, Minerva (Lewkowycz et al., 2022b) and OlympiadBench (He et al., 2024). Details of the datasets are provided in Appendix C.

为了评估我们的小型大语言模型的推理能力,我们选择了五个以数学为重点的基准数据集:AIME24、MATH-500 (Lightman et al., 2023b; Hendrycks et al., 2021)、AMC23、Minerva (Lewkowycz et al., 2022b) 和 OlympiadBench (He et al., 2024)。数据集的详细信息见附录 C。

3.3 Baseline Models

3.3 基线模型

To contextualize our results, we compare our trained model against a range of baselines: Llama-3.1-70B-Instruct (AI, 2024a), o1-preview (AI, 2024b), Qwen-2.5-Math-7B-Instruct (Yang et al., 2024), rStar-Math-7B (Guan et al., 2025), Eurus-2-7B-PRIME (Cui et al., 2025), Qwen2.5-7B-SimpleRL (Zeng et al., 2025), DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025), DeepScaleR-1.5B-Preview (Luo et al., 2025), Still-3-1.5B-Preview (Team, 2025b).

为了将我们的结果置于上下文中,我们将训练好的模型与一系列基线模型进行了比较:Llama-3.1-70B-Instruct (AI, 2024a)、o1-preview (AI, 2024b)、Qwen-2.5-Math-7B-Instruct (Yang et al., 2024)、rStar-Math-7B (Guan et al., 2025)、Eurus-2-7B-PRIME (Cui et al., 2025)、Qwen2.5-7B-SimpleRL (Zeng et al., 2025)、DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI, 2025)、DeepScaleR-1.5B-Preview (Luo et al., 2025)、Still-3-1.5B-Preview (Team, 2025b)。

This selection enables a robust comparison across model sizes, training methodologies, and reasoning strategies, highlighting the efficacy of our approach for small LLMs. Details of the baselines are provided in Appendix D.

这一选择使得我们能够在模型大小、训练方法和推理策略之间进行强有力的比较,突显了我们方法在小规模大语言模型上的有效性。基线模型的详细信息见附录 D。

3.4 Evaluation Metric

3.4 评估指标

We adopt the zero-shot pass@1 metric to measure performance, defined as the proportion of problems correctly solved on the first attempt without prior examples. This metric emphasizes the model's ability to reason independently, aligning with our goal of enhancing intrinsic reasoning capabilities in small LLMs. Final answers are required in \boxed{} format for consistent automated evaluation.

我们采用零样本通过率 (zero-shot pass@1) 指标来衡量性能,该指标定义为在没有先验示例的情况下首次尝试正确解决问题的比例。该指标强调模型的独立推理能力,与我们增强小型大语言模型内在推理能力的目标一致。最终答案需要以 \boxed{} 格式提供,以确保自动化评估的一致性。
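As a concrete illustration of the metric, the helper below extracts the final \boxed{} answer from a single greedy completion and scores exact matches; the actual evaluation in Table 1 is run with the lighteval package, so this is only a sketch.

```python
# Toy pass@1 scorer: one attempt per problem, exact match on the \boxed{} answer.
import re

def extract_boxed(text: str):
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def pass_at_1(predictions, references):
    correct = sum(extract_boxed(p) == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```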

3.5 Process and Results

3.5 过程与结果

In this subsection, we present three experiments designed to enhance the reasoning abilities of small LLMs using reinforcement learning (RL), following the methodology in Section 2. We analyze training progress, evaluate performance across benchmarks, and compare our models against baselines, highlighting key insights and their implications for future work.

在本小节中,我们介绍了三个实验,旨在通过强化学习 (RL) 提升小型大语言模型的推理能力,遵循第 2 节中的方法。我们分析了训练进展,评估了在多个基准测试中的表现,并将我们的模型与基线模型进行了比较,突出了关键见解及其对未来工作的意义。


Figure 2: Performance of the model on AMC23 (left) and MATH-500 (right) across global training steps. The red dashed line indicates the baseline score at the start of training.

图 2: 模型在 AMC23 (左) 和 MATH-500 (右) 上的性能随全局训练步骤的变化。红色虚线表示训练开始时的基线分数。

3.5.1 Experiment 1: Impact of High-Quality Data

3.5.1 实验 1:高质量数据的影响

In Experiment 1, we train the DeepSeek-R1-Distill-Qwen-1.5B model using the open-s1 dataset (18,615 samples) from Section 2.1, with a maximum completion length of 4096 tokens. We employ accuracy and format rewards, as described in Section 2.2. Although the full dataset corresponds to approximately 1500 global steps for one epoch, computational constraints (24-hour limit on 4x A40 GPUs) restrict training to 500 global steps.

在实验1中,我们使用第2.1节中的open-s1数据集(18,615个样本)训练DeepSeek-R1-Distill-Qwen-1.5B模型,最大完成长度为4096个Token。我们采用了第2.2节中描述的准确性和格式奖励。尽管完整数据集对应大约1500个全局步骤(一个epoch),但由于计算限制(4x A40 GPU上的24小时限制),训练被限制在500个全局步骤。

Performance on AMC23 improves from 63% to 70% and on MATH-500 from 83% to 84% within the first 50–100 steps (see Figure 2). However, after 200 steps, accuracy degrades significantly, dropping below 60% on AMC23 and to 80% on MATH-500. Figure 3 illustrates this trend, showing unstable accuracy rewards and completion lengths fluctuating near 4000 tokens initially, then decreasing to around 3000 tokens by 100 global steps (approximately 3000 local steps on a single GPU). Post-200 steps, lengths increase again, accompanied by unreadable content and non-English outputs.

在最初的 50-100 步内,AMC23 上的性能从 63% 提升到 70%,MATH-500 上的性能从 83% 提升到 84%(见图 2)。然而,经过 200 步后,准确率显著下降,AMC23 上的准确率降至 60% 以下,MATH-500 上的准确率降至 80%。图 3 展示了这一趋势,显示准确率奖励不稳定,完成长度最初在 4000 token 附近波动,然后在 100 个全局步数(约相当于单个 GPU 上的 3000 个局部步数)时降至约 3000 token。200 步后,长度再次增加,同时伴随着不可读的内容和非英语输出。


Figure 3: Accuracy reward (left) and completion length (right) of outputs in Experiment 1 across local steps. Note that global steps are distributed across 4 GPUs, with 100 global steps approximating 3000 local steps.

图 3: 实验 1 中输出的准确率奖励(左)和完成长度(右)随局部步数的变化。请注意,全局步数分布在 4 个 GPU 上,100 个全局步数大约相当于 3000 个局部步数。

This degradation suggests that the model struggles with the complexity of open-s1, often exceeding the 4096-token limit before producing a final answer. The initial length reduction reflects adaptation to the format reward, but the subsequent increase and language drift indicate reward misalignment. We derive the following insight:

这种退化表明模型在处理 open-s1 的复杂性时遇到了困难,通常在生成最终答案之前就超过了 4096-token 的限制。初始长度的减少反映了对格式奖励的适应,但随后的增加和语言漂移表明奖励存在偏差。我们得出以下见解:

Insight 1

洞察 1

Small LLMs can achieve rapid reasoning improvements with limited high-quality data within 50–100 steps, but performance degrades with prolonged training under strict length constraints.

小规模大语言模型 (LLM) 可以在 50-100 步内通过有限的高质量数据实现快速推理能力的提升,但在严格的长度限制下,长时间训练会导致性能下降。

3.5.2 Experiment 2: Balancing Easy and Hard Problems

3.5.2 实验 2:平衡简单和困难问题

Building on Experiment 1, we hypothesize that mixing easier problems with challenging ones could stabilize training and reduce completion lengths. We construct a dataset of 7000 samples: 3000 from open-s1, 3000 from open-deepscaler, and 1000 easier problems from the raw DeepScaleR dataset (Section 2.1). The maximum completion length is reduced to 3584 tokens, retaining accuracy and format rewards.

在实验1的基础上,我们假设将简单问题与挑战性问题混合可以稳定训练并减少完成长度。我们构建了一个包含7000个样本的数据集:3000个来自open-s1,3000个来自open-deepscaler,以及1000个来自原始DeepScaleR数据集的简单问题(第2.1节)。最大完成长度减少到3584个token,同时保留了准确性和格式奖励。
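Assembling this mixture is straightforward with the datasets library; the sketch below assumes the three refined pools are available as local JSON files (hypothetical paths) and simply samples and concatenates them.

```python
# Sketch of the Experiment 2 data mix: 3,000 open-s1 + 3,000 open-deepscaler
# + 1,000 easier DeepScaleR problems; file paths and seed are placeholders.
from datasets import load_dataset, concatenate_datasets

def take(path, n, seed=42):
    ds = load_dataset("json", data_files=path, split="train")
    return ds.shuffle(seed=seed).select(range(n))

mix = concatenate_datasets([
    take("open_s1.json", 3000),
    take("open_deepscaler.json", 3000),
    take("deepscaler_easy.json", 1000),
]).shuffle(seed=42)   # 7,000 questions in total
```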

Initial completion lengths drop to approximately 2800 tokens, and performance improves significantly: AMC23 rises from 63% to 80%, and MATH-500 from 83% to 85% within 50–100 steps (Figure 2). However, after 150–200 steps (approximately 4000 local steps), performance declines, and KL divergence becomes unstable (Figure 4), with mixed-language outputs reemerging.

初始完成长度降至约2800个token,性能显著提升:在50-100步内,AMC23从63%上升到80%,MATH-500从83%上升到85%(图2)。然而,在150-200步后(约4000个局部步骤),性能下降,KL散度变得不稳定(图4),混合语言输出再次出现。


Figure 4: KL divergence (left) and completion length (right) of outputs in Experiment 2 across local steps.

图 4: 实验 2 中输出结果的 KL 散度 (左) 和完成长度 (右) 随局部步骤的变化。

The improved initial performance validates our hypothesis, suggesting that easier problems encourage concise reasoning, while harder ones maintain complexity. However, the late-stage instability highlights persistent challenges with length constraints and multilingual tendencies. We note:

改进后的初始性能验证了我们的假设,表明较简单的问题鼓励简洁的推理,而较难的问题则保持复杂性。然而,后期的不稳定性突显了长度限制和多语言倾向的持续挑战。我们注意到:

Insight 2

洞察 2

Incorporating a mix of easy and hard problems under reduced length constraints enhances early performance and stabilizes reasoning behavior, though long-term stability remains elusive.

在减少长度限制的情况下,结合简单和困难的问题可以提升早期表现并稳定推理行为,尽管长期稳定性仍然难以实现。

3.5.3 Experiment 3: Controlling Length with Cosine Reward

3.5.3 实验 3: 使用余弦奖励控制长度

Experiment 3 uses the same 7000-sample dataset as Experiment 2, but replaces the accuracy reward with a cosine reward to better control output length, as outlined in Section 2.2. We also append an instruction to the system prompt: “Reply in English only, do not use other languages”, avoiding a computationally expensive language reward function. The maximum completion length remains 3584 tokens.

实验 3 使用了与实验 2 相同的 7000 样本数据集,但将准确率奖励替换为余弦奖励,以更好地控制输出长度,如第 2.2 节所述。我们还在系统提示中附加了一条指令:“仅用英语回复,不要使用其他语言”,从而避免了计算成本高昂的语言奖励函数。最大完成长度仍为 3584 个 Token。
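For reference, one common way to formulate such a cosine length reward (in the spirit of Yeo et al., 2025) is sketched below; the reward bounds are illustrative assumptions rather than the values used in Experiment 3. Correct answers earn slightly more when they are short, while incorrect answers are penalized less harshly when the model has at least reasoned at length.

```python
# Illustrative cosine length reward; bounds are assumed, not the paper's values.
import math

def cosine_reward(is_correct: bool, length: int, max_len: int = 3584,
                  correct=(2.0, 1.0), wrong=(-10.0, 0.0)) -> float:
    start, end = correct if is_correct else wrong   # values at length 0 and at max_len
    progress = min(length / max_len, 1.0)
    # Cosine interpolation between the short-completion and long-completion values.
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
```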

Completion lengths stabilize between 1000 and 3500 tokens (Figure 5), a marked improvement over Experiment 2's 2000–3500 range. Performance on AMC23 and MATH-500 increases modestly compared to the baseline (63% to 72.5% and 83% to 84.4%, respectively) within 50 steps, though it lags behind Experiment 2's peak (Figure 2). After 200 steps, mixed-language content persists, reflecting the multilingual nature of DeepSeek-R1-Distill-Qwen-1.5B.

完成长度稳定在 1000 到 3500 个 Token 之间(图 5),相比实验 2 的 2000–3500 范围有明显改善。在 50 步内,AMC23 和 MATH-500 的性能相比基线有所提升(分别从 63% 提升到 72.5%,以及从 83% 提升到 84.4%),尽管仍落后于实验 2 的峰值(图 2)。经过 200 步后,混合语言内容仍然存在,这反映了 DeepSeek-R1-Distill-Qwen-1.5B 的多语言特性。

The cosine reward effectively regulates length, but the language issue suggests a need for explicit language constraints or extended completion lengths for complex tasks. We conclude:

余弦奖励有效地调节了长度,但语言问题表明需要对复杂任务施加明确的语言约束或延长完成长度。我们得出结论:

Insight 3

洞察 3

Cosine rewards stabilize completion lengths, improving training consistency, but extending length limits is necessary for extremely hard tasks, particularly with multilingual base models.

余弦奖励稳定了完成长度,提高了训练一致性,但对于极其困难的任务,特别是多语言基础模型,延长长度限制是必要的。


Figure 5: KL divergence (left) and completion length (right) of outputs in Experiment 3 across local steps.

图 5: 实验 3 中输出结果的 KL 散度 (左) 和完成长度 (右) 随局部步骤的变化。

3.5.4 Overall Comparison

3.5.4 总体比较

We select checkpoints at 100, 50, and 50 global steps from Experiments 1, 2, and 3, naming them Open-RS1, Open-RS2, and Open-RS3 (R for Reasoning, S for Small), respectively. These are evaluated against baselines from Section 3.3 across benchmarks from Section 3.2, using zero-shot pass@1 (Table 1).

我们从实验1、2和3中分别选择了100、50和50个全局步骤的检查点,将它们命名为Open-RS1、Open-RS2和Open-RS3(R代表推理,S代表小型)。这些检查点根据第3.3节的基线,在第3.2节的基准上进行评估,使用零样本 pass@1 指标(表1)。

模型 | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench | 平均
通用模型
Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7
o1-preview | 44.6 | 85.5 | – | – | – | –
7B 模型
Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8
rStar-Math-7B | 26.7 | 78.4 | 47.5 | – | 47.1 | –
Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9
Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9
1.5B 模型
DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9
Still-3-1.5B-Preview | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6
DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0
我们的模型
Open-RS1 (100 steps) | 30.0 | 83.8 | 70.0 | 29.0 | 52.4 | 53.0
Open-RS2 (50 steps) | 30.0 | 85.4 | 80.0 | 30.5 | 52.4 | 55.7
Open-RS3 (50 steps) | 46.7 | 84.4 | 72.5 | 26.8 | 51.3 | 56.3

Table 1: Zero-shot pass@1 performance across benchmarks. Bold indicates the highest score per benchmark. Dashes (–) denote unavailable official scores. Scores for o1-preview are sourced from AI (2024b); others from Zeng et al. (2025); Luo et al. (2025). Our models are evaluated using the lighteval package Fourrier et al. (2023).

表 1: 零样本 pass@1 在各个基准测试中的表现。加粗表示每个基准测试中的最高分。短横线 (–) 表示没有可用的官方分数。o1-preview 的分数来自 AI (2024b);其他分数来自 Zeng 等人 (2025);Luo 等人 (2025)。我们的模型使用 lighteval 包 Fourrier 等人 (2023) 进行评估。

Our models outperform most baselines, with average scores of 53.0% (Open-RS1), 55.7% (Open-RS2), and 56.3% (Open-RS3), compared to 57.0% for DeepScaleR-1.5B-Preview. Notably, Open-RS3 achieves the highest AIME24 score (46.7%), surpassing o1-preview (44.6%) and DeepScaleR-1.5B-Preview (43.1%). Open-RS2 excels on AMC23 (80.0%) and ties with Open-RS1 on OlympiadBench (52.4%), both outperforming DeepScaleR-1.5B-Preview. MATH-500 scores remain competitive, though Minerva performance lags behind 7B models, reflecting the complexity of cross-disciplinary reasoning.

我们的模型优于大多数基线,平均得分为 53.0% (Open-RS1)、55.7% (Open-RS2) 和 56.3% (Open-RS3),而 DeepScaleR-1.5B-Preview 的平均得分为 57.0%。值得注意的是,Open-RS3 在 AIME24 上取得了最高分 (46.7%),超过了 o1-preview (44.6%) 和 DeepScaleR-1.5B-Preview (43.1%)。Open-RS2 在 AMC23 上表现出色 (80.0%),并在 OlympiadBench 上与 Open-RS1 并列 (52.4%),两者均优于 DeepScaleR-1.5B-Preview。MATH-500 的得分保持竞争力,尽管 Minerva 的表现落后于 7B 模型,这反映了跨学科推理的复杂性。

We further compare training costs and data efficiency (Tables 2 and 3, and Figure 1). Our approach, using 7000 samples with 6 outputs per step (42,000 total samples), costs approximately $42 on 4x A40 GPUs over 24 hours. In contrast, 7B models like Qwen2.5-7B-SimpleRL ($1633) and Eurus-2-7B-PRIME ($1089), and 1.5B models like DeepScaleR-1.5B-Preview ($3629) and Still-3-1.5B-Preview ($2268), require significantly more resources and data (e.g., 40k × 16 samples for DeepScaleR).

我们进一步比较了训练成本和数据效率(表2和表3,以及图1)。我们的方法使用7000个样本,每个步骤生成6个输出(总计42,000个样本),在4个A40 GPU上运行24小时,成本约为42美元。相比之下,7B模型如Qwen2.5-7B-SimpleRL($1633)和Eurus-2-7B-PRIME($1089),以及1.5B模型如DeepScaleR-1.5B-Preview($3629)和Still-3-1.5B-Preview($2268)需要更多的资源和数据(例如,DeepScaleR需要40k × 16个样本)。
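As a rough sanity check on the stated figure (assuming the $42 is spread evenly over the full 24-hour run on all four GPUs), the implied rental rate is:

$$\frac{\$42}{4\ \text{GPUs}\times 24\ \text{h}}\approx \$0.44\ \text{per A40 GPU-hour}$$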

Table 2: Comparison of data usage and training costs for 7B models. Data are sourced from original papers or GitHub issues addressing the authors' resource constraints.

表 2: 7B 模型的数据使用和训练成本对比。数据来源于原始论文或 GitHub 上作者资源限制的讨论。

 | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL | Open-RS
基础模型 | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | DeepSeek-R1-Distill-Qwen-1.5B
SFT 数据 | 7.3M | 230k | 0 | 0
RM | 无 | Eurus-2-7B-SFT | 无 | 无
RL 数据 | 3.647M × 16 | 150k × 4 | 8k × 8 | 7k × 6
硬件 | 10×8 H100 80GB, 15×4 A100 40GB | 1×8 A100 80GB | 4×6 A100 80GB | 1×4 A40 48GB
时间 | – | 72h | 36h | 24h
成本估算 | – | $1088 | $1633 | $42

Table 3: Comparison of data usage and training costs for 1.5B models. Data are sourced from original papers or GitHub issues addressing the authors' resource constraints.

表 3: 1.5B 模型的数据使用和训练成本对比。数据来源于原始论文或 GitHub 上讨论作者资源限制的问题。

 | DeepScaleR-1.5B-Preview | Still-3-1.5B-Preview | Open-RS
基础模型 | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B
SFT 数据 | 0 | 0 | 0
RL 数据 | 40k × 16 | – | 7k × 6
成本估算 | $3629 | $2268 | $42

Our approach demonstrates that small LLMs can achieve competitive reasoning performance with minimal data and cost, offering a scalable alternative to resource-intensive baselines.

我们的方法表明,小型大语言模型能够以最少的数据和成本实现具有竞争力的推理性能,为资源密集型的基线方法提供了一种可扩展的替代方案。

4 Conclusion

4 结论

Our study investigated enhancing the reasoning abilities of small LLMs using RL, focusing on the 1.5-billion-parameter DeepSeek-R1-Distill-Qwen-1.5B under strict constraints. Adapting the GRPO algorithm and a compact mathematical reasoning dataset, we conducted three experiments to assess behavior and performance under resource limitations. Our findings show small LLMs can achieve significant reasoning gains with minimal resources—e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview—at a cost of $42 versus thousands for baselines. Open-RS variants averaged 53.0%–56.3% on benchmarks, demonstrating RL's viability for small LLMs. Releasing our code and datasets, we provide a framework for lightweight, reasoning-capable models, despite challenges like optimization stability, laying a foundation for future work.

我们的研究探讨了在严格约束下,使用强化学习 (RL) 增强小型大语言模型 (LLM) 的推理能力,重点关注了 15 亿参数的 DeepSeek-R1-Distill-Qwen-1.5B 模型。通过调整 GRPO 算法并使用紧凑的数学推理数据集,我们进行了三项实验,以评估在资源限制下的行为和性能。研究结果表明,小型大语言模型能够在极少的资源下实现显著的推理能力提升——例如,AMC23 的准确率从 63% 提升至 80%,AIME24 达到 46.7%,超过了 o1-preview——而成本仅为 42 美元,远低于基线模型的数千美元。Open-RS 变体在基准测试中的平均表现达到 53.0%–56.3%,证明了强化学习在小型大语言模型中的可行性。我们发布了代码和数据集,为轻量级、具备推理能力的模型提供了一个框架,尽管面临优化稳定性等挑战,但仍为未来的工作奠定了基础。

References

参考文献

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.

Dan Hendrycks、Collin Burns、Saurav Kadavath、Akul Arora、Steven Basart、Eric Tang、Dawn Song 和 Jacob Steinhardt。使用数学数据集测量数学问题解决能力。NeurIPS,2021。

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393.

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, 和 Tatsunori Hashimoto. s1: 简单的测试时缩放, 2025. URL https://arxiv.org/abs/2501.19393.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models, 2021. URL https://arxiv.org/abs/2112.00114.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, 和 Augustus Odena. 展示你的工作:用于大语言模型中间计算的草稿本, 2021. URL https://arxiv.org/abs/2112.00114.

OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.

OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.

OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.

OpenAI. 学习用大语言模型推理, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932–4942, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1487. URL https://aclanthology.org/P19-1487/.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, 和 Richard Socher. 解释你自己!利用语言模型进行常识推理. 在 Anna Korhonen, David Traum, 和 Lluís Màrquez (编), 第57届计算语言学协会年会论文集, 第4932–4942页, 意大利佛罗伦萨, 2019年7月. 计算语言学协会. doi: 10.18653/v1/P19-1487. URL https://aclanthology.org/P19-1487/.

Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380959. doi: 10.1145/3411763.3451760. URL https://doi.org/10.1145/3411763.3451760.

Laria Reynolds 和 Kyle McDonell. 大语言模型的提示编程:超越少样本范式. 在 2021 年 CHI 计算系统人因会议扩展摘要中, CHI EA ’21, 美国纽约州纽约市, 2021. 计算机协会. ISBN 9781450380959. doi: 10.1145/3411763.3451760. URL https://doi.org/10.1145/3411763.3451760.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.

邵志宏, 王培毅, 朱启豪, 徐润新, 宋俊晓, 毕晓, 张浩伟, 张明川, 李永康, 吴毅, 郭大亚. DeepSeek Math: 推动开放语言模型在数学推理中的极限, 2024. URL https://arxiv.org/abs/2402.03300.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, 和 Shunyu Yao. Reflexion: 语言智能体的口头强化学习, 2023.

Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms, 2025a. URL https://arxiv.org/abs/2501.12599.

Kimi Team. Kimi k1.5: 使用大语言模型扩展强化学习, 2025a. URL https://arxiv.org/abs/2501.12599.

RUCAIBox STILL Team. Still-3-1.5b-preview: Enhancing slow thinking abilities of small models through reinforcement learning, 2025b. URL https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.

RUCAIBox STILL 团队。Still-3-1.5b-preview:通过强化学习增强小模型的慢思考能力,2025b。URL https://github.com/RUCAIBox/Slow_Thinking_with_LLMs。

Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

Trieu Trinh, Yuhuai Wu, Quoc Le, He He, 和 Thang Luong. 无需人类演示的奥林匹克几何解题. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

Jonathan Uesato、Nate Kushman、Ramana Kumar、Francis Song、Noah Siegel、Lisa Wang、Antonia Creswell、Geoffrey Irving 和 Irina Higgins。基于过程和结果的反馈解决数学应用题。arXiv 预印本 arXiv:2211.14275,2022。

Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935, 2023.

Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, 和 Zhifang Sui. Math-shepherd: 一种用于大语言模型数学推理的无标签逐步验证器. arXiv 预印本 arXiv:2312.08935, 2023.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, 和 Denny Zhou. 思维链提示在大语言模型中引发推理. 在 S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, 和 A. Oh (编), 神经信息处理系统进展, 第 35 卷, 第 24824–24837 页. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL https://arxiv.org/abs/2408.08152.

Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, 和 Chong Ruan. Deepseek-prover-v1.5: 利用证明助手反馈进行强化学习和蒙特卡洛树搜索, 2024. URL https://arxiv.org/abs/2408.08152.

A Related Work

A 相关工作

A.1 Reasoning in Large Language Models

A.1 大语言模型中的推理

A substantial body of research has investigated methods to enhance the reasoning capabilities and factual accuracy of large language models (LLMs). Early approaches predominantly relied on prompting techniques to elicit structured reasoning. For instance, scratchpad-style prompting encourages models to break down problems into intermediate steps (Nye et al., 2021), while verification mechanisms assess the correctness of generated outputs (Cobbe et al., 2021). Chain-of-thought (CoT) prompting has emerged as a particularly effective strategy, leveraging demonstrations of step-by-step reasoning to improve performance on complex tasks (Wei et al., 2022; Kojima et al., 2022; Reynolds & McDonell, 2021). More recently, techniques such as intermediate self-reflection have been proposed to enable models to iteratively refine their reasoning processes (Shinn et al., 2023; Madaan et al., 2023).

大量研究已经探讨了如何增强大语言模型 (LLMs) 的推理能力和事实准确性。早期方法主要依赖于提示技术来引发结构化推理。例如,草稿式提示鼓励模型将问题分解为中间步骤 (Nye et al., 2021),而验证机制则评估生成输出的正确性 (Cobbe et al., 2021)。思维链 (Chain-of-thought, CoT) 提示作为一种特别有效的策略出现,通过展示逐步推理来提高复杂任务的表现 (Wei et al., 2022; Kojima et al., 2022; Reynolds & McDonell, 2021)。最近,提出了诸如中间自我反思等技术,使模型能够迭代地改进其推理过程 (Shinn et al., 2023; Madaan et al., 2023)。

In parallel, supervised fine-tuning (SFT) has been employed to embed reasoning capabilities directly into LLMs. Studies such as (Lewkowycz et al., 2022a) and (Rajani et al., 2019) demonstrate that fine-tuning on high-quality datasets can enhance problem-solving abilities. Notably, integrating CoT reasoning into SFT has shown significant promise; works like (Zelikman et al., 2022; Muennighoff et al., 2025; Ye et al., 2025) illustrate that fine-tuning with small, carefully curated datasets of CoT examples can yield substantial performance gains. However, these efforts have predominantly focused on large-scale LLMs, typically ranging from 7 billion to over 100 billion parameters. This reliance on massive models limits accessibility and practicality for resource-constrained settings, motivating the exploration of alternative approaches for smaller LLMs.

与此同时,监督微调 (Supervised Fine-Tuning, SFT) 被用于将推理能力直接嵌入到大语言模型中。研究表明,如 (Lewkowycz et al., 2022a) 和 (Rajani et al., 2019) 所示,对高质量数据集进行微调可以增强问题解决能力。值得注意的是,将链式思维 (Chain-of-Thought, CoT) 推理整合到 SFT 中显示出显著的前景;如 (Zelikman et al., 2022; Muennighoff et al., 2025; Ye et al., 2025) 等研究表明,使用少量精心策划的 CoT 示例数据集进行微调可以带来显著的性能提升。然而,这些努力主要集中在规模较大的大语言模型上,通常参数范围从 70 亿到超过 1000 亿。这种对大规模模型的依赖限制了在资源受限环境中的可访问性和实用性,这促使人们探索适用于较小大语言模型的替代方法。

A.2 Reasoning with Reinforcement Learning

A.2 强化学习推理

Reinforcement learning (RL) has emerged as a powerful paradigm for improving reasoning in LLMs, particularly for tackling complex, multi-step problems. Unlike SFT, which often optimizes for imitation of training data, RL enables models to learn from feedback, enhancing generalization to both in-distribution and out-of-distribution tasks (Chu et al., 2025; Yeo et al., 2025). Recent advancements underscore the efficacy of RL in this domain. For example, OpenAI (2024b) and DeepSeek-AI (2025) demonstrate that RL-based training can significantly boost reasoning performance, while Team (2025a) explores scaling laws for RL-driven LLMs. These studies highlight RL’s ability to refine decision-making processes by optimizing for task-specific rewards, such as correctness or logical coherence.

强化学习 (Reinforcement Learning, RL) 已成为提升大语言模型 (LLM) 推理能力的强大范式,尤其是在解决复杂的多步骤问题方面。与通常优化训练数据模仿的监督微调 (SFT) 不同,RL 使模型能够从反馈中学习,从而增强对分布内和分布外任务的泛化能力 (Chu et al., 2025; Yeo et al., 2025)。最近的进展强调了 RL 在这一领域的有效性。例如,OpenAI (2024b) 和 DeepSeek-AI (2025) 展示了基于 RL 的训练可以显著提升推理性能,而 Team (2025a) 则探索了 RL 驱动的大语言模型的扩展规律。这些研究强调了 RL 通过优化任务特定奖励(如正确性或逻辑一致性)来优化决策过程的能力。

Despite these advances, RL-based methods are not without limitations. They typically demand substantial computational resources, often exceeding those required for SFT, and are predominantly applied to large LLMs. This focus on scale renders RL impractical for smaller models and restricts its adoption outside well-resourced organizations, such as major technology firms. Furthermore, privacy concerns arise when deploying such models, as self-hosting becomes infeasible for most academic or industrial entities with limited infrastructure. Consequently, there remains a critical gap in the literature: the application of RL to enhance reasoning in small LLMs under resource and privacy constraints.

尽管取得了这些进展,基于强化学习 (RL) 的方法并非没有局限性。它们通常需要大量的计算资源,往往超过监督微调 (SFT) 所需,并且主要应用于大型大语言模型。这种对规模的关注使得强化学习对于较小的模型不切实际,并限制了其在资源充足的组织(如大型科技公司)之外的采用。此外,部署此类模型时会出现隐私问题,因为对于大多数基础设施有限的学术或工业实体来说,自托管变得不可行。因此,文献中仍然存在一个关键空白:在资源和隐私限制下,应用强化学习来增强小型大语言模型的推理能力。

B Limitations & Discussion

B 限制与讨论

While our study demonstrates the promise of RL-based fine-tuning for enhancing the reasoning abilities of small LLMs, several limitations and broader implications warrant discussion. These insights not only contextualize our findings but also highlight avenues for future research.

虽然我们的研究表明基于强化学习(RL)的微调在增强小型大语言模型的推理能力方面具有潜力,但仍存在一些局限性和更广泛的影响值得讨论。这些见解不仅为我们的发现提供了背景,还突出了未来研究的方向。

B.1 Limitations

B.1 局限性

First, our experiments were constrained by a 24-hour training window on a modest cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each), limiting the number of global steps (e.g., 500 in Experiment 1 versus a potential 1500 for one epoch). This restriction curtailed our ability to fully explore the long-term behavior of the model, particularly beyond 200 steps, where performance degradation and multilingual outputs emerged. Second, the maximum completion length (4096 tokens in Experiment 1, reduced to 3584 in Experiments 2 and 3) proved insufficient for extremely hard problems in the open-s1 dataset, forcing the model to truncate reasoning processes prematurely. This suggests that our methodology may underexploit the potential of small LLMs on complex tasks requiring extended reasoning chains.

首先,我们的实验受到了一组4个NVIDIA A40 GPU(每个48 GB显存)的集群上24小时训练窗口的限制,这限制了全局步数(例如,实验1中的500步,而一个epoch可能达到1500步)。这一限制削弱了我们充分探索模型长期行为的能力,特别是在超过200步之后,性能下降和多语言输出开始出现。其次,最大完成长度(实验1中为4096个token,实验2和3中减少到3584个token)在open-s1数据集的极难问题上显得不足,迫使模型过早地截断推理过程。这表明我们的方法可能在需要扩展推理链的复杂任务上未能充分发挥小型大语言模型的潜力。

Third, the multilingual nature of the base model, DeepSeek-R1-Distill-Qwen-1.5B, introduced unintended language drift after 150–200 steps, despite efforts to enforce English-only outputs via prompts in Experiment 3. This limitation reflects a trade-off in using a pretrained, multilingual foundation, which, while efficient, complicates monolingual optimization. Finally, our evaluation focused exclusively on mathematical reasoning benchmarks, leaving the generalizability of our approach to other domains—such as scientific reasoning or coding—unexplored. These constraints highlight the need for cautious interpretation of our results within the specified scope.

第三,基础模型 DeepSeek-R1-Distill-Qwen-1.5B 的多语言特性在 150-200 步后引入了意外的语言漂移,尽管在实验 3 中通过提示强制输出仅限英语。这一限制反映了使用预训练的多语言基础模型的权衡,虽然高效,但使单语言优化变得复杂。最后,我们的评估仅专注于数学推理基准,未探索我们方法在其他领域(如科学推理或编码)的泛化能力。这些限制强调了在指定范围内谨慎解释我们结果的必要性。

B.2 Discussion

B.2 讨论

Our findings reveal a nuanced trade-off between efficiency and reasoning depth in small LLMs. The rapid performance gains observed in the first 50–100 steps across all experiments (Insight 1) suggest that small, high-quality datasets can effectively bootstrap reasoning capabilities, aligning with prior work on data efficiency in RL (Chu et al., 2025). However, the subsequent degradation underscores a sensitivity to over-optimization under fixed length constraints, a challenge also noted in larger models like DeepSeek-R1 (DeepSeek-AI, 2025). Experiment 2's success with mixed difficulty levels (Insight 2) indicates that curriculum-like strategies could mitigate this. Meanwhile, the cosine reward's stabilizing effect in Experiment 3 (Insight 3) suggests a promising direction for controlling reasoning verbosity, though it sacrifices peak accuracy compared to Experiment 2.

我们的研究揭示了小型大语言模型在效率和推理深度之间的微妙权衡。在所有实验中,前50-100步观察到的快速性能提升(洞察1)表明,小型高质量数据集可以有效引导推理能力,这与之前关于强化学习中数据效率的研究一致(Chu等,2025)。然而,随后的性能下降突显了在固定长度约束下对过度优化的敏感性,这一挑战在DeepSeek-R1等大型模型中也得到了注意(DeepSeekAI,2025)。实验2在混合难度水平上的成功(洞察2)表明,类似课程学习的策略可以缓解这一问题。同时,实验3中余弦奖励的稳定效果(洞察3)表明,控制推理冗长性是一个有前景的方向,尽管与实验2相比,它牺牲了峰值准确性。

Comparatively, our Open-RS variants achieved performance rivaling or exceeding state-of-the-art 1.5B models (e.g., DeepScaleR-1.5B-Preview) and even some 7B models, at a fraction of the cost and data volume. This efficiency challenges the prevailing reliance on massive datasets and computational resources in reasoning enhancement (OpenAI, 2024b; Luo et al., 2025), offering a scalable alternative for resource-constrained environments. However, the persistent multilingual drift and length limitations point to inherent challenges in adapting multilingual base models and optimizing for complex tasks within tight bounds.

相比之下,我们的 Open-RS 变体在性能和成本上均达到了与最先进的 1.5B 模型(例如 DeepScaleR-1.5B-Preview)相当甚至超越的水平,甚至在某些情况下超越了部分 7B 模型,同时仅需极低的成本和数据量。这种效率挑战了当前在推理增强领域对大规模数据集和计算资源的依赖(OpenAI, 2024b; Luo et al., 2025),为资源受限的环境提供了一种可扩展的替代方案。然而,持续的多语言漂移和长度限制表明,在适应多语言基础模型和优化复杂任务时,仍然存在固有的挑战。

B.3 Future Directions

B.3 未来方向

These limitations suggest several research avenues. First, extending training duration or employing multi-stage length schedules could address truncation issues, allowing the model to handle harder problems without compromising stability. Second, incorporating a lightweight language reward or monolingual pre-filtering of the base model might mitigate language drift, enhancing output consistency. Third, expanding the benchmark suite to include non-mathematical domains would test the generalizability of our approach, aligning with broader AGI goals. Finally, exploring hybrid methods—such as combining GRPO with search algorithms like MCTS (Feng et al., 2024)—could further deepen reasoning capacity without significantly increasing resource demands.

这些限制提出了几个研究方向。首先,延长训练时间或采用多阶段长度调度可以解决截断问题,使模型能够处理更复杂的问题而不影响稳定性。其次,引入轻量级的语言奖励或对基础模型进行单语预过滤可能会减轻语言漂移,增强输出的一致性。第三,扩展基准测试套件以包括非数学领域,将测试我们方法的泛化能力,与更广泛的通用人工智能目标保持一致。最后,探索混合方法——例如将GRPO与MCTS等搜索算法结合(Feng et al., 2024)——可以在不显著增加资源需求的情况下进一步深化推理能力。

In conclusion, our work demonstrates that RL-based fine-tuning can unlock substantial reasoning potential in small LLMs, even under stringent constraints. By identifying key tradeoffs and offering practical insights, we pave the way for developing efficient, reasoningcapable models that balance performance and accessibility—a critical step toward democratizing advanced AI technologies.

总之,我们的工作表明,基于强化学习(RL)的微调可以在严格的约束条件下释放小型大语言模型的巨大推理潜力。通过识别关键权衡并提供实用见解,我们为开发高效、具备推理能力的模型铺平了道路,这些模型在性能和可访问性之间取得了平衡——这是实现先进AI技术民主化的关键一步。

C Datasets

C 数据集

This appendix details the benchmark datasets used in Section 3.2.

本附录详细介绍第3.2节中使用的基准数据集。

Table 4 summarizes the datasets and their sample sizes. This diverse collection ensures a comprehensive assessment of the model’s reasoning generalization across problem types and difficulty levels.

表 4 总结了数据集及其样本量。这一多样化的集合确保了模型在问题类型和难度级别上的推理泛化能力的全面评估。

Table 4: Benchmark Datasets and Sample Sizes for Evaluation

表 4: 评估基准数据集及样本量

数据集 | 样本量
AIME24 | 30
MATH-500 | 500
AMC23 | 40
Minerva | 272
OlympiadBench | 675

D Baseline Models

D 基线模型

This appendix describes the baseline models introduced in Section 3.3.

本附录描述第3.3节中介绍的基线模型。

• General-Purpose Large Models:

• 通用大模型

– Llama-3.1-70B-Instruct (AI, 2024a): A 70B-parameter model optimized for instruction-following.
– o1-preview (AI, 2024b): A high-performing reasoning model from OpenAI.

– Llama-3.1-70B-Instruct (AI, 2024a): 一个拥有700亿参数的模型,专为指令跟随任务优化。
– o1-preview (AI, 2024b): OpenAI 推出的高性能推理模型。

• Mathematics-Focused 7B Models:

• 数学导向的 7B 模型:

• Mathematics-Focused 1.5B Models:

• 数学导向的 1.5B 模型:

E Hyperparameter Setup

E 超参数设置

Table 5 shows the parameters used in the training phase.

表 5 展示了训练阶段使用的参数。

Table 5: Hyperparameter Setups for GRPO Trainer

表 5: GRPO 训练器的超参数设置

参数 | 值
通用设置
bf16 | true
use_vllm | true
vllm_device | auto
vllm_enforce_eager | true
vllm_gpu_memory_utilization | 0.7
vllm_max_model_len | 4608
do_eval | false
output_dir | data/OpenRS-GRPO
overwrite_output_dir | true
训练配置
gradient_accumulation_steps | –
gradient_checkpointing | –