LIMR: Less is More for RL Scaling
LIMR: 少即是多——强化学习的扩展策略
Abstract
摘要
In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can match or even outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find that they significantly underperform at the 7B scale when applied through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves $16.7\%$ higher accuracy on AIME24 and outperforms LIMO and s1 by $13.0\%$ and $22.2\%$ on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.
在本文中,我们探讨了一个问题:什么真正决定了RL训练数据在提升语言模型推理能力方面的有效性?尽管最近的进展如o1、Deepseek R1和Kimi1.5展示了RL的潜力,但训练数据需求的透明度不足阻碍了系统性的进展。我们直接从基础模型开始,不进行蒸馏处理,挑战了扩大RL训练数据规模必然提升性能的假设。我们证明了仅通过策略性选择的1,389个样本子集,可以超越完整的8,523个样本数据集。我们引入了学习影响度量(Learning Impact Measurement, LIM),这是一种自动化的方法,用于评估并优先排序训练样本,基于它们与模型学习轨迹的对齐程度,从而实现资源的高效利用和可扩展的实施。我们的方法仅使用1,389个样本,就达到了与完整8,523个样本数据集相当甚至更优的性能。值得注意的是,尽管最近的数据高效方法(如LIMO和s1)在32B规模模型上显示出潜力,但我们发现它们在7B规模上通过监督微调(SFT)表现显著不佳。相比之下,我们基于RL的LIMR在AIME24上实现了$16.7,%$的更高准确率,并在MATH500上分别超越了LIMO和s1 $13.0%$ 和 $22.2%$。这些结果从根本上重塑了我们对大语言模型中RL扩展的理解,展示了精确的样本选择,而非数据规模,可能是解锁增强推理能力的关键。为了可重复的研究和未来的创新,我们在https://github.com/GAIR-NLP/LIMR开源了LIMR,包括LIM的实现、训练和评估代码、精选的数据集和训练好的模型。

Figure 1: (a) Accuracy on AIME24 when using different training datasets in RL, without any data distillation or SFT training as a cold start. Our specifically curated LIMR dataset, a strategically selected subset of the full dataset, MATH (level 3-5), achieved comparable accuracy while using less than one-sixth of the data volume. Notably, LIMR significantly outperformed a randomly selected dataset of equivalent size, demonstrating the effectiveness of our selective dataset construction methodology. (b) A comparison of different data-efficient models. The results reveal that directly applying SFT on the LIMO (Ye et al., 2025) and s1 (Muennighoff et al., 2025) datasets with Qwen-Math-7B yields significantly inferior results compared to using RL with LIMR, implying that, for small models, RL is more effective in achieving data efficiency.
图 1: (a) 在 RL 中使用不同训练数据集在 AIME24 上的准确率,没有任何数据蒸馏和 SFT 训练作为冷启动。我们特别策划的 LIMR 数据集,是从完整数据集中战略性地选择的一个子集,MATH (level 3-5),在仅使用不到六分之一数据量的情况下,达到了可比的准确率水平。值得注意的是,LIM 显著优于同等大小的随机选择数据集,证明了我们选择性数据集构建方法的有效性。(b) 不同数据高效模型的比较。结果表明,直接在 LIMO (Ye et al., 2025) 和 s1 (Mu en nigh off et al. 2025) 数据集上应用 SFT 与 Qwen-Math-7B 相比,使用 RL 和 LIMR 的结果显著较差,这意味着对于小模型,RL 在实现数据高效性方面更为有效。
1 Introduction
1 引言
Recent advances in Large Language Models (LLMs) have demonstrated the remarkable effectiveness of reinforcement learning (RL) in enhancing complex reasoning capabilities. Models like o1 (OpenAI, 2024), Deepseek R1 (Guo et al., 2025), and Kimi1.5 (Team et al., 2025) have shown that RL training can naturally induce sophisticated reasoning behaviors, including self-verification, reflection, and extended chains of thought. However, a critical gap exists in our understanding of RL training: these pioneering works provide limited transparency about their training data scale, making it challenging for the research community to build upon their success. Follow-up open-source efforts (Table 1) have explored diverse experimental scenarios, from base models to distilled long-form chain-of-thought models, with RL data volumes ranging from 8K (Zeng et al., 2025) to 150K (Cui et al., 2025), but without clear guidance on optimal data requirements or scaling principles. In this work, we try to explore the scaling dynamics of RL training data by focusing on a foundational scenario: starting directly from base models without distillation (similar to the RL scaling setting of Deepseek R1-zero).
大语言模型 (LLM) 的最新进展展示了强化学习 (RL) 在增强复杂推理能力方面的显著效果。像 o1 (OpenAI, 2024)、Deepseek R1 (Guo et al., 2025) 和 Kimi1.5 (Team et al., 2025) 这样的模型表明,RL 训练可以自然地诱导出复杂的推理行为,包括自我验证、反思和扩展的思维链。然而,我们对 RL 训练的理解存在一个关键空白:这些开创性工作对其训练数据规模提供了有限的透明度,这使得研究界难以在其成功基础上进一步构建。后续的开源努力 (表 1) 探索了多种实验场景,从基础模型到蒸馏的长格式思维链模型,RL 数据量从 8K (Zeng et al., 2025) 到 150K (Cui et al., 2025) 不等,但缺乏关于最佳数据需求或扩展原则的明确指导。在本工作中,我们尝试通过关注一个基础场景来探索 RL 训练数据的扩展动态:直接从基础模型开始,无需蒸馏(类似于 Deepseek R1-zero 的 RL 扩展设置)。
This lack of understanding of RL training data requirements presents several fundamental challenges:
对强化学习 (RL) 训练数据需求的理解不足带来了几个根本性挑战:
More importantly, this uncertainty raises a crucial question: Is scaling up RL training data truly the key to improving model performance, or are we overlooking more fundamental factors such as sample quality and selection criteria?
更重要的是,这种不确定性提出了一个关键问题:增加 RL(强化学习)训练数据是否真的是提升模型性能的关键,还是我们忽略了更根本的因素,比如样本质量和选择标准?
In this work, we challenge the assumption that larger RL training datasets necessarily lead to better performance. Our key insight is that the quality and relevance of training samples matter far more than their quantity. Through extensive empirical analysis, we make several surprising observations that fundamentally change our understanding of RL training dynamics:
在这项工作中,我们挑战了“更大的强化学习 (RL) 训练数据集必然带来更好性能”的假设。我们的核心发现是,训练样本的质量和相关性远比数量重要。通过广泛的实证分析,我们得出了几个令人惊讶的观察结果,从根本上改变了我们对强化学习训练动态的理解:
1. We find that a carefully selected subset of RL training samples (1,389) can achieve comparable or even superior performance compared to training with the full dataset (8,523).
2. Most importantly, we develop an automated quantitative method for evaluating the potential value of RL training samples. Our method, which we call Learning Impact Measurement (LIM), can effectively predict which samples will contribute most significantly to model improvement. This automated approach eliminates the need for manual sample curation and makes our methodology easily scalable.
3. Recent approaches like LIMO and s1 have demonstrated the potential of distilled reasoning data efficiency through supervised fine-tuning with 32B models. We find that at the 7B scale, these methods significantly underperform. Our RL-based LIMR achieves $16.7\%$ higher accuracy on AIME24 ($32.5\%$ vs. $15.8\%$) and surpasses LIMO and s1 by $13.0\%$ and $22.2\%$ on MATH500 ($78.0\%$ vs. $65.0\%$ and $55.8\%$), suggesting that RL may be more effective for enhancing reasoning capabilities in data-sparse scenarios.
- 我们发现,经过精心选择的 RL 训练样本子集(1,389 个)可以达到与使用完整数据集(8,523 个)训练相当甚至更优的性能。
- 最重要的是,我们开发了一种自动化定量方法来评估 RL 训练样本的潜在价值。我们称之为学习影响度量(Learning Impact Measurement,LIM)的方法,能够有效预测哪些样本对模型改进的贡献最大。这种自动化方法消除了手动样本筛选的需求,使我们的方法易于扩展。
- 最近的方法如 LIMO 和 s1 已经展示了通过使用 32B 模型进行监督微调来提升推理数据效率的潜力。我们发现,在 7B 规模下,这些方法的表现显著不佳。我们基于 RL 的 LIMR 在 AIME24 上实现了 $16.7%$ 的准确率提升($32.5%$ VS $15.8%$),并在 MATH500 上分别超过了 LIMO 和 s1 $13.0%$ 和 $22.2%$($78.0%$ VS $65.0%$,$55.8%$),这表明在数据稀疏的场景下,RL 可能更有效地增强推理能力。
Our findings have significant implications for the field of LLM development. They suggest that the path to better reasoning capabilities may not lie in simply scaling up RL training data, but rather in being more selective about which samples to use. This insight could dramatically reduce the computational resources required for effective RL training while potentially improving final model performance. Furthermore, our automated sample evaluation method provides a practical tool for researchers and practitioners to implement these insights in their own work. For reproducible research and future innovation, we release all LIMR artifacts openly, including the LIMR dataset and model, all training and evaluation code, and the implementation details of LIM.
我们的研究结果对大语言模型的开发领域具有重要意义。它们表明,提升推理能力的路径可能并不在于简单地扩大强化学习训练数据的规模,而在于更有选择性地使用样本。这一见解可能显著减少有效强化学习训练所需的计算资源,同时潜在地提高最终模型性能。此外,我们的自动化样本评估方法为研究人员和从业者提供了一个实用工具,以便在他们自己的工作中实施这些见解。为了可重复的研究和未来的创新,我们公开了所有 LIMR 的成果,包括 LIMR 数据集和模型、所有训练和评估代码,以及 LIM 的实现细节。
2 Methodology
2 方法论
We present Learning Impact Measurement (LIM), a systematic approach to quantify and optimize the value of training data in reinforcement learning. Our method addresses the critical challenge of data efficiency in RL training by analyzing learning dynamics to identify the most effective training samples.
我们提出了学习影响度量(Learning Impact Measurement, LIM),这是一种系统化的方法,用于量化和优化强化学习中训练数据的价值。我们的方法通过分析学习动态来识别最有效的训练样本,从而解决RL训练中数据效率的关键挑战。
2.1 Learning Dynamics in RL Training
2.1 RL训练中的学习动态
To understand the relationship between training data and model improvement, we conducted extensive analysis using the MATH-FULL dataset (Hendrycks et al., 2021), which contains 8,523 mathematical problems of varying difficulty levels (3-5). Our investigation reveals that different training samples contribute unequally to model learning, contrary to the conventional approach of treating all samples uniformly. As illustrated in Figure 2a, we
为了理解训练数据与模型改进之间的关系,我们使用 MATH-FULL 数据集(Hendrycks 等人,2021)进行了广泛分析,该数据集包含 8,523 个不同难度级别(3-5)的数学问题。我们的研究表明,不同的训练样本对模型学习的贡献并不相同,这与将所有样本统一处理的传统方法相悖。如图 2a 所示,我们
| Method | Init Model | Long CoT Distillation | #Questions |
|---|---|---|---|
| STILL-3 (Team, 2025b) | Instruct | Yes | 29,925 |
| DeepScaleR (Luo et al., 2025) | Instruct | Yes | 40,314 |
| Sky-T1 (Team, 2025a) | Instruct | Yes | 45,000 |
| THUDM-T1 (Hou et al., 2025) | Instruct | No | 30,000 |
| PRIME (Cui et al., 2025) | Instruct | No | 150,000 |
| SimpleRL (Zeng et al., 2025) | Base | No | 8,523 |
| LIMR | Base | No | 1,389 |
Table 1: Comparison of various methods. “Init Model” refers to the type of the initial actor model; we perform RL directly on the base model. “Long CoT Distillation” indicates whether the initial model was distilled on long CoT data for cold start.
表 1: 各种方法的比较。“初始模型 (init model)”指的是初始参与者模型的类型,我们直接在基础模型上进行强化学习。“长链思维蒸馏 (Long CoT Distillation)”表示初始模型是否通过长链思维蒸馏进行冷启动。

Figure 2: (a) Learning dynamics analysis of training samples from the MATH-FULL dataset across epochs. Solution reward trajectories reveal diverse patterns: samples maintaining near-zero rewards, samples quickly achieving high rewards, and those showing dynamic learning progress with varying improvement rates. (b) Sample learning trajectories compared against the average reward curve (red). Higher LIM scores reflect better alignment with the model's learning trajectory, where trajectories showing similar growth patterns receive higher scores.
图 2: (a) MATH-FULL 数据集中训练样本在不同 epoch 下的学习动态分析。解决方案奖励轨迹展示了多样化的模式:保持接近零奖励的样本、快速达到高奖励的样本,以及表现出动态学习进展且改善速率不同的样本。(b) 样本学习轨迹与平均奖励曲线(红色)的对比。较高的 LIM 分数反映了与模型学习轨迹的更好对齐,表现出相似增长模式的轨迹会获得更高的分数。
observe diverse learning trajectories: some samples exhibit stable performance patterns, while others show complex learning dynamics that appear to drive significant model improvements.
观察到多样化的学习轨迹:一些样本表现出稳定的性能模式,而其他样本则显示出复杂的学习动态,这些动态似乎推动了模型的显著改进。
These observations lead to our key insight: the value of training data in RL can be systematically measured by examining how well individual samples align with the model's overall learning progression. This understanding forms the foundation of LIM, our proposed method for quantifying sample effectiveness.
这些观察得出了我们的关键见解:RL 中训练数据的价值可以通过检查单个样本与模型整体学习进展的契合程度来系统性地衡量。这一理解构成了 LIM 的基础,即我们提出的量化样本有效性的方法。
2.2 Learning Impact Measurement (LIM)
2.2 学习影响测量 (LIM)
LIM centers on a model-aligned trajectory analysis that evaluates training samples based on their contribution to model learning. Our key finding is that samples whose learning patterns complement the model's overall performance trajectory tend to be more valuable for optimization.
LIM 专注于一种模型对齐轨迹分析,该分析根据训练样本对模型学习的贡献进行评估。我们的关键发现是,学习模式与模型整体性能轨迹互补的样本往往对优化更具价值。
2.2.1 Model-aligned Trajectory Analysis
2.2.1 模型对齐轨迹分析
Given that neural network learning typically follows a logarithmic growth pattern, we use the model's average reward curve as a reference for measuring sample effectiveness (Figure 2b):

$$r_{\mathrm{avg}}^{k}=\frac{1}{N}\sum_{i=1}^{N} r_{i}^{k}\quad\text{for each epoch } k$$
考虑到神经网络学习通常遵循对数增长模式,我们使用模型的平均奖励曲线作为衡量样本有效性的参考(图 2b):

where $\boldsymbol{r}_{i}^{k}$ represents the reward of sample $i$ at epoch $k$ , and $N$ is the total number of samples. For each sample, LIM computes a normalized alignment score:

$$s_{i}=1-\frac{\sum_{k=1}^{K}\left(r_{i}^{k}-r_{\mathrm{avg}}^{k}\right)^{2}}{\sum_{k=1}^{K}\left(1-r_{\mathrm{avg}}^{k}\right)^{2}},\quad i=1,\ldots,N$$

Here $K$ denotes the number of training epochs over which rewards are recorded, and $r_{\mathrm{avg}}^{k}$ is the average reward over all samples at epoch $k$.
其中 $\boldsymbol{r}_{i}^{k}$ 表示样本 $i$ 在第 $k$ 个周期中的奖励,$N$ 是样本的总数。对于每个样本,LIM 计算一个归一化的对齐分数:

This score quantifies how well a sample's learning pattern aligns with the model's overall learning trajectory, with higher scores indicating better alignment.
该分数量化了样本的学习模式与模型整体学习轨迹的匹配程度,分数越高表示匹配度越好。
2.2.2 Sample Selection Strategy
2.2.2 样本选择策略
Based on the alignment scores, LIM implements a selective sampling strategy: $s_{i}>\theta$ where $\theta$ serves as a quality threshold that can be adjusted according to specific requirements. In our experiments, setting $\theta=0.6$ yielded an optimized dataset (LIMR) of 1,389 high-value samples from the original dataset.
基于对齐分数,LIM实现了一种选择性采样策略:$s_{i}>\theta$,其中$\theta$作为质量阈值,可以根据特定需求进行调整。在我们的实验中,设置$\theta=0.6$从原始数据集中筛选出了1,389个高价值样本,形成了优化后的数据集(LIMR)。
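The scoring and thresholding steps can be illustrated with a minimal sketch. It assumes that per-epoch rewards are scaled to $[0, 1]$ and that the alignment score is one minus the sample's squared deviation from the average reward curve, normalized by the squared gap between that curve and the maximum reward of 1. The function names `lim_scores` and `select_samples` and the toy trajectories are hypothetical, not taken from the released code.

```python
def lim_scores(rewards):
    """Alignment score per sample; rewards[i][k] is sample i's reward at epoch k."""
    num_samples = len(rewards)
    num_epochs = len(rewards[0])
    # Average reward curve over all samples, one value per epoch.
    r_avg = [sum(row[k] for row in rewards) / num_samples for k in range(num_epochs)]
    # Assumed normalizer: squared gap between the average curve and reward 1.
    normalizer = sum((1.0 - a) ** 2 for a in r_avg)
    if normalizer == 0.0:
        raise ValueError("average reward is already 1 at every epoch")
    return [
        1.0 - sum((r - a) ** 2 for r, a in zip(row, r_avg)) / normalizer
        for row in rewards
    ]


def select_samples(rewards, theta=0.6):
    """Indices of samples whose alignment score exceeds the threshold theta."""
    return [i for i, s in enumerate(lim_scores(rewards)) if s > theta]


# Toy trajectories: never solved, always solved, steadily improving.
toy_rewards = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.2, 0.5, 0.8],
]
print(select_samples(toy_rewards))  # -> [2]
```

In this toy case only the steadily improving sample tracks the average curve closely enough to pass $\theta=0.6$; the always-wrong and always-right samples both score near zero.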
2.3 Baseline Data Selection Methods
2.3 基线数据选择方法
While developing our core methodology, we explored several alternative approaches that helped inform and validate our final method. These approaches provide valuable insights into data selection in RL.
在开发核心方法论的过程中,我们探索了几种替代方法,这些方法有助于为最终方法的确定提供信息和验证。这些方法为强化学习中的数据选择提供了宝贵的见解。
Random Sampling baseline (RAND) randomly selects 1,389 samples from MATH-FULL to match the size of our main approach, providing a fundamental reference point for evaluating selective sampling effectiveness.
随机采样基线 (RAND) 从 MATH-FULL 中随机选择 1,389 个样本,以匹配我们主要方法的规模,为评估选择性采样的有效性提供一个基本参考点。
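As a rough illustration, the RAND baseline amounts to uniform sampling without replacement. The sketch below assumes samples are addressed by integer index; the helper name `random_subset` and the fixed seed are illustrative choices, not details from the paper.

```python
import random


def random_subset(num_total, num_selected, seed=0):
    """Draw num_selected distinct indices uniformly from range(num_total)."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    return sorted(rng.sample(range(num_total), num_selected))


# Match the LIMR subset size when sampling from the 8,523 MATH-FULL problems.
rand_indices = random_subset(8523, 1389)
print(len(rand_indices))  # -> 1389
```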
Linear Progress Analysis method (LINEAR) evaluates samples based on their consistency in showing steady improvements across training epochs. While this approach captures samples with gradual progress, it often misses valuable samples that show rapid early gains followed by stabilization. Using a threshold of $\theta=0.7$, this method yields 1,189 samples.
线性进展分析法 (LINEAR) 根据样本在训练周期中表现出的稳定改进一致性来评估样本。虽然这种方法能够捕捉到逐步进展的样本,但往往会忽略那些在早期迅速提升随后趋于稳定的有价值的样本。使用阈值 $\theta=0.7$,该方法共得到 1,189 个样本。
2.4 Reward Design
2.4 奖励设计
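Similar to Deepseek R1 (Guo et al., 2025), we use a rule-based reward function. Specifically, for a correct answer, the reward is $1$; for an incorrect but properly formatted answer, the reward is $-0.5$; and for an answer with formatting errors, the reward is $-1$. Formally, this can be expressed as:

$$R(\text{answer})=\begin{cases}1 & \text{if the answer is correct,}\\ -0.5 & \text{if the answer is incorrect but well-formatted,}\\ -1 & \text{if the answer has formatting errors.}\end{cases}$$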
与 deepseek r1 (Guo et al., 2025) 类似,我们使用基于规则的奖励函数。具体来说,对于正确答案,奖励为 1;对于错误但格式正确的答案,奖励为 -0.5;对于格式错误的答案,奖励为 $^{-1}$。正式表达式为:
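A minimal sketch of such a rule-based reward follows. The three reward values come from the text; everything else is an assumption made for illustration: that a "properly formatted" answer is one wrapped in `\boxed{...}`, that correctness is an exact string match against the reference, and the helper names `extract_answer` and `rule_based_reward`.

```python
import re
from typing import Optional

# Assumed format rule: the final answer appears inside \boxed{...}.
# This simple pattern does not handle nested braces such as \boxed{\frac{1}{2}}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference: str) -> float:
    """1 for a correct answer, -0.5 for wrong but well-formatted, -1 for bad format."""
    answer = extract_answer(response)
    if answer is None:
        return -1.0
    return 1.0 if answer == reference.strip() else -0.5


print(rule_based_reward(r"Thus the result is \boxed{42}.", "42"))  # -> 1.0
print(rule_based_reward(r"Thus the result is \boxed{41}.", "42"))  # -> -0.5
print(rule_based_reward("Thus the result is 42.", "42"))           # -> -1.0
```

A practical checker would likely need answer normalization (for example, equivalent fractions or units), which this sketch leaves out.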

3 Experiment
3 实验
3.1 Experimental Setup
3.1 实验设置
Training. We conduct RL training using the PPO algorithm (Schulman et al., 2017) as implemented in the OpenRLHF (Hu et al., 2024) framework. Using Qwen2.5-Math-7B (Yang et al., 2024) as our initial policy model, we configure the rollout batch size as 1,024 and generate 8 samples per prompt with a temperature of 1.2 during exploration. The training process uses a batch size of 256, with learning rates set to 5e-7 and 9e-6 for the actor and critic models respectively, and a KL coefficient of 0.01.
训练
我们使用 OpenRLHF (Hu et al., 2024) 框架中实现的 PPO (Schulman et al., 2017) 算法进行 RL 训练。以 Qwen2.5-Math-7B (Yang et al., 2024) 作为初始策略模型,我们将 rollout 批次大小配置为 1,024,并在探索过程中为每个提示生成 8 个样本,温度为 1.2。训练过程使用批次大小为 256,actor 和 critic 模型的学习率分别设置为 5e-7 和 9e-6,KL 系数为 0.01。
Evaluation. We conducted experimental evaluations on multiple challenging benchmarks: (1) MATH500; (2) AIME2024; (3) AMC2023. To accelerate the evaluation process, we utilized the vLLM (Kwon et al., 2023) framework. For AIME24 and AMC23, due to the limited number of questions (30 and 40 respectively), we performed 4 sampling runs per question with a temperature of 0.4. For MATH500, we employed greedy decoding for inference.
评估我们在多个具有挑战性的基准上进行了实验评估,包括:(1) MATH500;(2) AIME2024;(4) AMC2023。为了加速评估过程,我们使用了 vLLM [Kwon et al., 2023] 框架。对于 AIME24 和 AMC23,由于问题数量有限(分别为 30 和 40),我们对每个问题进行了 4 次采样运行,温度为 0.4。对于 MATH500,我们采用了贪婪解码进行推理。
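The multi-sample protocol can be sketched as follows, assuming the reported score is the per-question fraction of correct samples averaged over all questions. The text does not spell out the aggregation, so this is an assumption, and the function name `benchmark_accuracy` is hypothetical. With greedy decoding, each question simply has one run.

```python
def benchmark_accuracy(correct):
    """Percentage accuracy; correct[q][s] is True if run s on question q was right."""
    per_question = [sum(runs) / len(runs) for runs in correct]
    return 100.0 * sum(per_question) / len(per_question)


# 30 questions with 4 runs each: 9 always solved, 1 solved in 3 of 4 runs, 20 never solved.
aime_like = [[True] * 4] * 9 + [[True, True, True, False]] + [[False] * 4] * 20
print(round(benchmark_accuracy(aime_like), 1))  # -> 32.5
```

Under this assumption, a score of 32.5 on a 30-question benchmark corresponds to 39 correct responses out of 120 samples.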
3.2 Main Results
3.2 主要结果
Table 2: Main results on difficult math benchmarks.
表 2: 困难数学基准测试的主要结果。
| Method | #Questions | AIME2024 | MATH500 | AMC2023 | AVG. |
|---|---|---|---|---|---|
| Qwen-Math-7B | - | 16.7 | 52.4 | 52.5 | 40.5 |
| Qwen-Math-7B-FULL | 8,523 | 32.5 | 76.6 | 61.9 | 57.0 |
| Qwen-Math-7B-RAND | 1,389 | 25.8 | 66.0 | 56.3 | 49.4 |
| Qwen-Math-7B-LINEAR | 1,138 | 28.3 | 74.6 | 61.9 | 54.9 |
| LIMR | 1,389 | 32.5 | 78.0 | 63.8 | 58.1 |
As illustrated in Table 2, directly applying RL to Qwen-Math-7B using the MATH-FULL dataset resulted in a significant performance improvement. Different data selection strategies, however, led to notable variations in performance. Training with the MATH-RAND dataset results in an average accuracy drop of $8.1\%$ compared to using the full dataset, whereas MATH-LINEAR incurs only a $2\%$ loss. More notably, LIMR, despite an $80\%$ reduction in dataset size, performs nearly on par with MATH-FULL. This further supports the notion that in RL, only a small subset of questions plays a critical role.
如表 2 所示,直接在 Qwen-Math-7B 上使用 MATH-FULL 数据集应用强化学习 (RL) 带来了显著的性能提升。然而,不同的数据选择策略导致了显著的性能差异。与使用完整数据集相比,使用 MATH-RAND 数据集训练会导致平均准确率下降 $8.1%$,而 MATH-LINEAR 仅带来 $2%$ 的损失。更值得注意的是,LIMR 尽管减少了 $80%$ 的数据集大小,但表现与 MATH-FULL 几乎相当。这进一步支持了在强化学习中,只有一小部分问题起关键作用的观点。
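As a quick arithmetic check, the AVG. column of Table 2 and the data fraction used by LIMR can be recomputed from the reported numbers. The sketch assumes AVG. is the unweighted mean of the three benchmark scores; the variable names are illustrative.

```python
# (AIME2024, MATH500, AMC2023) accuracies as reported in Table 2.
table2 = {
    "Qwen-Math-7B": (16.7, 52.4, 52.5),
    "FULL": (32.5, 76.6, 61.9),
    "RAND": (25.8, 66.0, 56.3),
    "LINEAR": (28.3, 74.6, 61.9),
    "LIMR": (32.5, 78.0, 63.8),
}

# Unweighted mean over the three benchmarks, rounded to one decimal place.
averages = {name: round(sum(scores) / len(scores), 1) for name, scores in table2.items()}

# Share of MATH-FULL that the LIMR subset retains.
limr_fraction = 1389 / 8523

print(averages)
print(f"LIMR uses {100 * limr_fraction:.1f}% of MATH-FULL")
```

The recomputed means agree with the AVG. column, and the retained share of about $16.3\%$ is consistent with "less than one-sixth of the data volume".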

Figure 3: Performance and training dynamics
图 3: 性能与训练动态
Additionally, we analyze the evolution of various metrics during RL training on the MATH-FULL, MATH-RAND, and LIMR datasets. As shown in Figure 3a, the accuracy curves of LIMR and MATH-FULL are nearly identical, both significantly outperforming MATH-RAND. Meanwhile, Figure 3b indicates that the training curve for MATH-FULL exhibits instability in terms of sequence length, whereas the corresponding curve for LIMR initially declines before gradually increasing. Figure 3c further illustrates differences in training rewards: the reward curve for LIMR rises more rapidly and ultimately approaches 1.0. This suggests that during RL, the model effectively utilizes the LIMR dataset for learning.
此外,我们分析了在 MATH-FULL、MATHRAND 和 LIMR 数据集上进行 RL 训练时各种指标的演变。如图 3a 所示,LIMR 和 MATH-FULL 的准确率曲线几乎相同,两者都显著优于 MATH-RAND。同时,图 3b 表明,MATH-FULL 的训练曲线在序列长度方面表现出不稳定性,而 LIMR 的相应曲线最初下降后逐渐上升。图 3c 进一步说明了训练奖励的差异:LIMR 的奖励曲线上升得更快,最终接近 1.0。这表明在 RL 期间,模型有效地利用了 LIMR 数据集进行学习。
Figure 4 presents a comparative analysis of model performance across three challenging benchmarks. The results demonstrate that LIMR achieves performance comparable to MATH-FULL on all three benchmarks, while significantly outperforming the RAND baseline. Notably, LIMR's consistently strong results on both AIME24 and AMC23 provide compelling evidence that its enhanced performance is not attributable to overfitting to a single dataset, but rather reflects a genuine improvement in the model's mathematical reasoning capabilities compared to the RAND baseline.
图 4: 展示了模型在三个具有挑战性的基准测试中的性能对比分析。结果表明,LIMR 在所有三个基准测试中的表现与 MATH-FULL 相当,同时显著优于 RAND 基线。值得注意的是,LIMR 在 AIME24 和 AMC23 数据集上的一致优异表现,有力地证明了其性能的提升并非由于对单一数据集的过拟合,而是反映了相比于 RAND,模型在数学推理能力上的真正提升。

Figure 4: Accuracy on various benchmarks
图 4: 各基准测试的准确率
3.3 RL Outperforms SFT in Data Efficiency
3.3 强化学习在数据效率上优于监督微调
Table 3: Performance difference of three data efficient models.
表 3: 三种数据高效模型的性能差异
| Method | #Questions | AIME2024 | MATH500 | AMC2023 | AVG. |
|---|---|---|---|---|---|
| Qwen-Math-7B | - | 16.7 | 52.4 | 52.5 | 40.5 |
| Qwen-Math-7B-s1 | 1,000 | 15.8 | 55.8 | 42.5 | 38.0 |
| Qwen-Math-7B-LIMO | 817 | 15.8 | 65.0 | 56.3 | 45.7 |
| LIMR | 1,389 | 32.5 | 78.0 | 63.8 | 58.1 |
Both LIMO (Ye et al., 2025) and s1 (Muennighoff et al., 2025) emphasize that only a small amount of data is needed to unlock the reasoning potential of models. However, we found that in scenarios with limited data and small models (e.g., 7B models), using reinforcement learning (RL) is more effective than distilling data from larger models and performing imitation learning.
LIMO (Ye et al., 2025) 和 s1 (Mu en nigh off et al., 2025) 都强调只需少量数据即可解锁模型的推理潜力。然而,我们发现,在数据有限且模型较小(例如 7B 模型)的场景下,使用强化学习 (RL) 比从更大模型中蒸馏数据并进行模仿学习更为有效。
Specifically, we fine-tuned Qwen2.5-Math-7B via supervised fine-tuning on the 1,000 examples from s1 and the 817 examples from LIMO, and compared the resulting models with LIMR. The experimental results show that, with training sets of roughly 1k questions, LIMR achieves a relative improvement of over $100\%$ on AIME compared to LIMO and s1, and at least a $10\%$ accuracy increase on AMC23 and MATH500. This further underscores the importance of selecting data that is suitable for the model rather than blindly opting for more challenging data.
具体来说,我们使用来自 s1 的 1,000 条数据和来自 LIMO 的 817 条数据通过监督微调对 Qwen-2.5-Math-7B 进行了微调,并与 LIMR 进行了比较。实验结果表明,在相同的 1k 问题下,与 LIMO 和 s1 相比,LIMR 在 AIME 上实现了超过 $100%$ 的相对提升,并且在 AMC23 和 MATH500 上至少提升了 $10%$ 的准确率。这进一步强调了选择适合模型的数据的重要性,而不是盲目选择更具挑战性的数据。
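The relative improvement on AIME and the absolute gains on MATH500 can be recomputed from Table 3 with a short sketch. The helper name `relative_gain` is hypothetical; the scores are the reported values.

```python
# Reported accuracies from Table 3.
aime = {"s1": 15.8, "LIMO": 15.8, "LIMR": 32.5}
math500 = {"s1": 55.8, "LIMO": 65.0, "LIMR": 78.0}


def relative_gain(new, base):
    """Relative improvement of new over base, in percent."""
    return 100.0 * (new - base) / base


# Both SFT baselines score 15.8 on AIME, so one comparison covers both.
aime_relative = relative_gain(aime["LIMR"], aime["LIMO"])

# Absolute accuracy gains of LIMR over each SFT baseline on MATH500.
math_gains = {name: round(math500["LIMR"] - math500[name], 1) for name in ("LIMO", "s1")}

print(f"AIME relative gain: {aime_relative:.1f}%")
print(math_gains)
```

The AIME gain of 16.7 points over a 15.8 baseline is a relative improvement of roughly $106\%$, and the MATH500 gaps reproduce the $13.0$ and $22.2$ point margins quoted for LIMO and s1.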
4 Conclusion
4 结论
In this work, we challenge the conventional wisdom that scaling up RL training data is necessary for improving LLM reasoning capabilities. Through the introduction of Learning Impact Measurement (LIM), we demonstrate that a carefully selected subset of 1,389 samples can match or exceed the performance of the full 8,523-sample dataset across multiple challenging mathematical benchmarks. Our automated LIM methodology not only provides a practical, scalable solution for researchers to implement efficient RL training but also reveals that the path to enhanced reasoning capabilities may lie in optimizing sample quality rather than increasing data quantity. Additionally, our comparison with supervised fine-tuning approaches demonstrates that RL, when combined with efficient data selection, can be particularly effective for smaller models with limited data, suggesting potential applications of our methodology beyond mathematical reasoning to other domains where RL is applied in language models.
在本研究中,我们挑战了传统观念,即扩大强化学习(RL)训练数据对于提升大语言模型推理能力是必要的。通过引入学习影响度量(LIM),我们证明了在多个具有挑战性的数学基准测试中,精心选择的1,389个样本子集可以匹配甚至超越完整的8,523个样本数据集的性能。我们的自动化LIM方法不仅为研究人员提供了一种实用、可扩展的解决方案,以实现高效的RL训练,还揭示了提升推理能力的路径可能在于优化样本质量而非增加数据量。此外,我们与监督微调方法的比较表明,当RL与高效的数据选择相结合时,对于数据有限的小型模型尤其有效,这表明我们的方法在数学推理之外的其他领域(如语言模型中应用RL的领域)也具有潜在应用价值。
References
参考文献
[1] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. 2025. Process reinforcement through implicit rewards.
[1] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. 2025. 通过隐式奖励的过程强化。
[2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
[2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, 等. 2025. Deepseek-rl: 通过强化学习增强大语言模型的推理能力. arXiv 预印本 arXiv:2501.12948.
[3] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
[3] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, 和 Jacob Steinhardt. 2021. 使用数学数据集衡量数学问题解决能力. arXiv 预印本 arXiv:2103.03874.
[4] Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. 2025. Advancing language model reasoning through reinforcement learning and inference scaling. arXiv preprint arXiv:2501.11651.
[4] Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, 和 Yuxiao Dong. 2025. 通过强化学习和推理扩展推进语言模型推理能力. arXiv 预印本 arXiv:2501.11651.
[5] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. 2024. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143.
[5] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. 2024. Openrlhf: 一个易用、可扩展且高性能的 RLHF 框架. arXiv preprint arXiv:2405.11143.
[6] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with paged attention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[6] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, 和 Ion Stoica. 2023. 高效内存管理用于大语言模型服务与分页注意力机制。在《ACM SIGOPS 第29届操作系统原理研讨会论文集》中。
[7] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed Notion Blog.
[7] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, 和 Ion Stoica. 2025. Deepscaler: 通过扩展强化学习 (RL) 使用 1.5b 模型超越 ol-preview. https : / /pretty-radio-b75 .notion. site / DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902 c 1468005 bed NotionBlog.
[8] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling.
[8] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, 和 Tatsunori Hashimoto. 2025. s1: 简单测试时缩放。
[9] OpenAI. 2024. Learning to reason with llms, September 2024.
[9] OpenA1. 2024. 学习用大语言模型进行推理,2024年9月。
[10] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms.
[10] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, 和 Oleg Klimov. 2017. 近端策略优化算法。
[11] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, and Zonghan Yang. 2025. Kimi k1.5: Scaling reinforcement learning with llms.
[12] NovaSky Team. 2025a. Sky-t1: Train your own o1 preview model within \$450. https://novaskyai.github.io/posts/sky-t1. Accessed: 2025-01-09.
[12] NovaSky Team. 2025a. Sky-tl: 在450分钟内训练你自己的o1预览模型. https://novaskyai.github.io/posts/sky-t1. 访问日期: 2025-01-09.
[13] RUCAIBox STILL Team. 2025b. Still-3-1.5b-preview: Enhancing slow thinking abilities of small models through reinforcement learning.
[13] RUCAIBox STILL Team. 2025b. Still-3-1.5b-preview: 通过强化学习提升小模型的慢思考能力。
[14] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. 2024. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.
[14] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, 和 Zhenru Zhang. 2024. Qwen2.5-math 技术报告:通过自我改进实现数学专家模型。
[15] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. 2025. Limo: Less is more for reasoning.
[15] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. 2025. Limo: Less is more for reasoning.
[16] Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 2025. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason. Notion Blog.
[16] Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 2025. 7b 模型和 8k 示例:通过强化学习实现的新兴推理既有效又高效。https://hkust-nlp.notion.site/simplerl-reason. Notion 博客。
