LIMR: Less is More for RL Scaling
Abstract
In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find they significantly underperform at the 7B scale through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.
Figure 1: (a) Accuracy on AIME24 when using different training datasets in RL, without any data distillation or SFT training as a cold start. Our specifically curated LIMR dataset, a strategically selected subset of the full MATH (level 3-5) dataset, achieved comparable accuracy while utilizing less than one-sixth of the data volume. Notably, LIMR significantly outperformed a randomly selected dataset of equivalent size, demonstrating the effectiveness of our selective dataset construction methodology. (b) A comparison of different data-efficient models. The results reveal that directly applying SFT on the LIMO (Ye et al., 2025) and s1 (Muennighoff et al., 2025) datasets with Qwen-Math-7B yields significantly inferior results compared to using RL with LIMR, implying that, for small models, RL is more effective in achieving data efficiency.
1 Introduction
Recent advances in Large Language Models (LLMs) have demonstrated the remarkable effectiveness of reinforcement learning (RL) in enhancing complex reasoning capabilities. Models like o1 (OpenAI, 2024), Deepseek R1 (Guo et al., 2025), and Kimi1.5 (Team et al., 2025) have shown that RL training can naturally induce sophisticated reasoning behaviors, including self-verification, reflection, and extended chains of thought. However, a critical gap exists in our understanding of RL training: these pioneering works provide limited transparency about their training data scale, making it challenging for the research community to build upon their success. Follow-up open-source efforts (Table 1) have explored diverse experimental scenarios, from base models to distilled long-form chain-of-thought models, with RL data volumes ranging from 8K (Zeng et al., 2025) to 150K (Cui et al., 2025), but without clear guidance on optimal data requirements or scaling principles. In this work, we try to explore the scaling dynamics of RL training data by focusing on a foundational scenario: starting directly from base models without distillation (similar to the RL scaling setting of Deepseek R1-zero).
This lack of understanding of RL training data requirements presents several fundamental challenges.
More importantly, this uncertainty raises a crucial question: Is scaling up RL training data truly the key to improving model performance, or are we overlooking more fundamental factors such as sample quality and selection criteria?
In this work, we challenge the assumption that larger RL training datasets necessarily lead to better performance. Our key insight is that the quality and relevance of training samples matter far more than their quantity. Through extensive empirical analysis, we make several surprising observations that fundamentally change our understanding of RL training dynamics:
1. We find that a carefully selected subset of RL training samples (1,389) can achieve comparable or even superior performance compared to training with the full dataset (8,523).
2. Most importantly, we develop an automated quantitative method for evaluating the potential value of RL training samples. Our method, which we call Learning Impact Measurement (LIM), can effectively predict which samples will contribute most significantly to model improvement. This automated approach eliminates the need for manual sample curation and makes our methodology easily scalable.
3. Recent approaches like LIMO and s1 have demonstrated the potential of distilled reasoning data efficiency through supervised fine-tuning with 32B models. We find that at the 7B scale, these methods significantly underperform. Our RL-based LIMR achieves 16.7% higher accuracy on AIME24 (32.5% vs. 15.8%) and surpasses LIMO and s1 by 13.0% and 22.2% on MATH500 (78.0% vs. 65.0% and 55.8%), suggesting that RL may be more effective for enhancing reasoning capabilities in data-sparse scenarios.
Our findings have significant implications for the field of LLM development. They suggest that the path to better reasoning capabilities may not lie in simply scaling up RL training data, but rather in being more selective about which samples to use. This insight could dramatically reduce the computational resources required for effective RL training while potentially improving final model performance. Furthermore, our automated sample evaluation method provides a practical tool for researchers and practitioners to implement these insights in their own work. For reproducible research and future innovation, we release all LIMR artifacts openly, including the LIMR dataset and model, all training and evaluation code, and the implementation details of LIM.
2 Methodology
We present Learning Impact Measurement (LIM), a systematic approach to quantify and optimize the value of training data in reinforcement learning. Our method addresses the critical challenge of data efficiency in RL training by analyzing learning dynamics to identify the most effective training samples.
2.1 Learning Dynamics in RL Training
To understand the relationship between training data and model improvement, we conducted extensive analysis using the MATH-FULL dataset (Hendrycks et al., 2021), which contains 8,523 mathematical problems of varying difficulty levels (3-5). Our investigation reveals that different training samples contribute unequally to model learning, contrary to the conventional approach of treating all samples uniformly.
| Method | Init Model | Long CoT Distillation | #Questions |
|---|---|---|---|
| STILL-3 (Team, 2025b) | Instruct | Yes | 29,925 |
| DeepScaleR (Luo et al., 2025) | Instruct | Yes | 40,314 |
| Sky-T1 (Team, 2025a) | Instruct | Yes | 45,000 |
| THUDM-T1 (Hou et al., 2025) | Instruct | No | 30,000 |
| PRIME (Cui et al., 2025) | Instruct | No | 150,000 |
| SimpleRL (Zeng et al., 2025) | Base | No | 8,523 |
| LIMR | Base | No | 1,389 |
Table 1: Comparison of various methods. "Init model" refers to the type of the initial actor model; we perform RL directly on the base model. "Long CoT Distillation" indicates whether the initial model is cold-started by distilling long CoT data.
Figure 2: (a) Learning dynamics analysis of training samples from the MATH-FULL dataset across epochs. Solution reward trajectories reveal diverse patterns: samples maintaining near-zero rewards, samples quickly achieving high rewards, and those showing dynamic learning progress with varying improvement rates. (b) Sample learning trajectories compared against the average reward curve (red). Higher LIM scores reflect better alignment with the model's learning trajectory, where trajectories showing similar growth patterns receive higher scores.
As illustrated in Figure 2a, we observe diverse learning trajectories: some samples exhibit stable performance patterns, while others show complex learning dynamics that appear to drive significant model improvements.
These observations lead to our key insight: the value of training data in RL can be systematically measured by examining how well individual samples align with the model's overall learning progression. This understanding forms the foundation of LIM, our proposed method for quantifying sample effectiveness.
2.2 Learning Impact Measurement (LIM)
LIM centers on a model-aligned trajectory analysis that evaluates training samples based on their contribution to model learning. Our key finding is that samples whose learning patterns complement the model's overall performance trajectory tend to be more valuable for optimization.
2.2.1 Model-aligned Trajectory Analysis
Given that neural network learning typically follows a logarithmic growth pattern, we use the model's average reward curve as a reference for measuring sample effectiveness (Figure 2b):

$$\bar{r}^{k} = \frac{1}{N}\sum_{i=1}^{N} r_{i}^{k},$$
where $r_{i}^{k}$ represents the reward of sample $i$ at epoch $k$, and $N$ is the total number of samples. For each sample, LIM then computes a normalized alignment score $s_{i}$ that quantifies how well the sample's learning pattern aligns with the model's overall learning trajectory, with higher scores indicating better alignment.
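To make the procedure concrete, below is a minimal sketch of how such alignment scores could be computed from logged per-sample reward trajectories. The `rewards` array and the specific normalization (squared deviation from the average curve, scaled by the magnitude of that curve) are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def lim_alignment_scores(rewards: np.ndarray) -> np.ndarray:
    """Score each sample by how closely its reward trajectory tracks the
    model-level average reward curve (higher score = better alignment).

    rewards: array of shape (num_samples, num_epochs), where rewards[i, k]
    is the reward r_i^k of sample i at epoch k.
    """
    avg_curve = rewards.mean(axis=0)                  # \bar{r}^k for every epoch k
    # One plausible normalization: penalize a sample's squared deviation from
    # the average curve, scaled by the magnitude of that curve.
    deviation = ((rewards - avg_curve) ** 2).sum(axis=1)
    scale = (avg_curve ** 2).sum() + 1e-8             # guard against division by zero
    return 1.0 - deviation / scale
```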
2.2.2 Sample Selection Strategy
Based on the alignment scores, LIM implements a selective sampling strategy: a sample $i$ is retained if $s_{i} > \theta$, where $\theta$ serves as a quality threshold that can be adjusted according to specific requirements. In our experiments, setting $\theta = 0.6$ yielded an optimized dataset (LIMR) of 1,389 high-value samples from the original dataset.
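Continuing the sketch above, the selection step is then a simple threshold filter; `dataset` is an assumed list of training problems whose order matches the rows of `rewards`.

```python
theta = 0.6                                   # quality threshold used in the experiments
scores = lim_alignment_scores(rewards)        # rewards logged during a preliminary RL run
limr = [sample for sample, s in zip(dataset, scores) if s > theta]
print(f"Kept {len(limr)} of {len(dataset)} samples")   # 1,389 of 8,523 in the paper
```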
2.3 Baseline Data Selection Methods
While developing our core methodology, we explored several alternative approaches that helped inform and validate our final method. These approaches provide valuable insights into data selection in RL.
Random Sampling baseline (RAND) randomly selects 1,389 samples from MATH-FULL to match the size of our main approach, providing a fundamental reference point for evaluating selective sampling effectiveness.
Linear Progress Analysis method (LINEAR) evaluates samples based on their consistency in showing steady improvements across training epochs. While this approach captures samples with gradual progress, it often misses valuable samples that show rapid early gains followed by stabilization. Using a threshold of $\theta = 0.7$, this method yields 1,189 samples.
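For reference, one purely illustrative way to realize such a linear-progress score is sketched below; the paper does not give the exact criterion, so the slope-and-goodness-of-fit heuristic here is an assumption rather than the actual LINEAR implementation.

```python
import numpy as np

def linear_progress_scores(rewards: np.ndarray) -> np.ndarray:
    """Illustrative LINEAR-style score: trajectories that fit a steadily rising
    straight line well receive high scores; flat or erratic ones do not."""
    num_samples, num_epochs = rewards.shape
    epochs = np.arange(num_epochs)
    scores = np.zeros(num_samples)
    for i in range(num_samples):
        slope, intercept = np.polyfit(epochs, rewards[i], deg=1)
        fitted = slope * epochs + intercept
        ss_res = ((rewards[i] - fitted) ** 2).sum()
        ss_tot = ((rewards[i] - rewards[i].mean()) ** 2).sum() + 1e-8
        r_squared = 1.0 - ss_res / ss_tot
        scores[i] = r_squared if slope > 0 else 0.0   # require steady improvement
    return scores
```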
2.4 Reward Design
Similar to Deepseek R1 (Guo et al., 2025), we use a rule-based reward function. Specifically, for a correct answer, the reward is 1; for an incorrect but properly formatted answer, the reward is -0.5; and for an answer with formatting errors, the reward is -1. Formally, this can be expressed as:

$$R(\text{answer}) = \begin{cases} 1, & \text{if the answer is correct,} \\ -0.5, & \text{if the answer is incorrect but properly formatted,} \\ -1, & \text{if the answer has formatting errors.} \end{cases}$$
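A minimal sketch of such a rule-based reward is shown below. The `\boxed{...}` answer convention and the string-match equivalence check are simplifying assumptions for illustration, not the paper's exact verifier.

```python
import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    """Simplified parser: take the last \\boxed{...} expression in the response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, ground_truth: str) -> float:
    """1 for a correct answer, -0.5 for an incorrect but properly formatted
    answer, -1 when no well-formed final answer can be extracted."""
    answer = extract_final_answer(response)
    if answer is None:                      # formatting error
        return -1.0
    if answer == ground_truth.strip():      # simplified equivalence check
        return 1.0
    return -0.5                             # well-formatted but incorrect
```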
3 Experiment
3.1 Experimental Setup
Training We conduct RL training using the PPO (Schulman et al., 2017) algorithm implemented in the OpenRLHF (Hu et al., 2024) framework. Using Qwen2.5-Math-7B (Yang et al., 2024) as our initial policy model, we set the rollout batch size to 1,024 and generate 8 samples per prompt with a temperature of 1.2 during exploration. The training process uses a batch size of 256, with learning rates of 5e-7 and 9e-6 for the actor and critic models respectively, and a KL coefficient of 0.01.
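For quick reference, the reported hyperparameters are collected below as a plain Python dictionary; this is an illustrative summary only, not OpenRLHF's actual configuration format.

```python
# Reported training hyperparameters, gathered in one place for reference.
ppo_config = {
    "algorithm": "PPO",
    "init_policy": "Qwen2.5-Math-7B",
    "rollout_batch_size": 1024,
    "samples_per_prompt": 8,
    "rollout_temperature": 1.2,
    "train_batch_size": 256,
    "actor_lr": 5e-7,
    "critic_lr": 9e-6,
    "kl_coef": 0.01,
}
```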
Evaluation We conducted experimental evaluations on multiple challenging benchmarks, including (1) MATH500, (2) AIME2024, and (3) AMC2023. To accelerate the evaluation process, we utilized the vLLM (Kwon et al., 2023) framework. For AIME24 and AMC23, due to the limited number of questions (30 and 40 respectively), we performed 4 sampling runs per question with a temperature of 0.4. For MATH500, we employed greedy decoding for inference.
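A minimal vLLM-based sketch of this evaluation setup is shown below; the checkpoint path, the placeholder prompt lists, and the 4096-token generation limit are illustrative assumptions rather than the paper's exact evaluation code.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/limr-checkpoint")    # hypothetical local checkpoint path

aime_prompts = ["<AIME24 problem 1>", "<AIME24 problem 2>"]   # placeholder prompt lists
math500_prompts = ["<MATH500 problem 1>"]

# AIME24 / AMC23: 4 sampled completions per question at temperature 0.4
sampled = SamplingParams(n=4, temperature=0.4, max_tokens=4096)   # assumed length limit
aime_outputs = llm.generate(aime_prompts, sampled)

# MATH500: greedy decoding (temperature 0 in vLLM)
greedy = SamplingParams(temperature=0.0, max_tokens=4096)
math500_outputs = llm.generate(math500_prompts, greedy)
```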
3.2 Main Results
Table 2: Main results on difficult math benchmarks.
| Method | #Questions | AIME2024 | MATH500 | AMC2023 | AVG. |
|---|---|---|---|---|---|
| Qwen-Math-7B | - | 16.7 | 52.4 | 52.5 | 40.5 |
| Qwen-Math-7B-FULL | 8,523 | 32.5 | 76.6 | 61.9 | 57.0 |
| Qwen-Math-7B-RAND | 1,389 | 25.8 | 66.0 | 56.3 | 49.4 |
| Qwen-Math-7B-LINEAR | 1,138 | 28.3 | 74.6 | 61.9 | 54.9 |
| LIMR | 1,389 | 32.5 | 78.0 | 63.8 | 58.1 |
As illustrated in Table 2, directly applying RL to Qwen-Math-7B using the MATH-FULL dataset resulted in a significant performance improvement. Different data selection strategies, however, led to notable variations in performance. Training with the MATH-RAND dataset results in an average accuracy drop of 8.1% compared to using the full dataset, whereas MATH-LINEAR incurs only a 2% loss. More notably, LIMR, despite an 80% reduction in dataset size, performs nearly on par with MATH-FULL. This further supports the notion that in RL, only a small subset of questions plays a critical role.