[Paper Translation] Flaming-hot Initiation with Regular Execution Sampling for Large Language Models


Original paper: https://arxiv.org/pdf/2410.21236v1


Flaming-hot Initiation with Regular Execution Sampling for Large Language Models


Guanlin Liu, Renjie Zheng, Wenlei Shi, Chen Dun, Zheng Wu, Xing Jin, Lin Yan
ByteDance


{guanlin.liu, renjie.zheng, wenlei.shi}@bytedance.com {chen.dun, zheng.wu1, jinxing.9, neil}@bytedance.com


Abstract


Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.


1 Introduction


Large language models (LLMs) have achieved remarkable success in a wide range of tasks since the release of ChatGPT (OpenAI, 2022). In addition to traditional natural language processing tasks such as summarization and sentiment analysis, LLMs have demonstrated effectiveness in many new domains, including code generation (Chen et al., 2023; Roziere et al., 2023), human-computer interaction (Li et al., 2023), and math problem-solving (Wei et al., 2022; Yu et al., 2024). Although standalone LLMs have limited reasoning capabilities (Sun et al., 2023; Valmeekam et al., 2023; Chen et al., 2024b), researchers have tried to enhance them by incorporating tool-use and developing integrated systems known as LLM agents (Xi et al., 2023; Wang et al., 2024), which expands the applications of LLMs to more general domains like robot control (Wang et al., 2023a) and autonomous driving (Mao et al., 2023).


To develop general capabilities, LLMs are typically trained through a three-stage process: pre-training, supervised fine-tuning (SFT), and alignment (Bai et al., 2022; Ouyang et al., 2022). During pre-training, the model learns from a vast array of data gathered from publicly available sources. Then, in the SFT and alignment stages, the model's abilities are further refined, allowing it to increase reasoning abilities and better follow users' instructions. In order to enhance reasoning tasks, a sandbox checker — a tool used to verify the correctness of solutions — is often used during training (Liu et al., 2023b). Therefore, one of the key challenges in achieving effective and efficient training is determining how to obtain more successful samples within a fixed number of trials, particularly when addressing complex problems.


In this paper, we introduce Flaming-hot Initiation with Regular Execution (FIRE), a simple yet effective sampling method for training large language models. Inspired by recent findings on the attention sink (Xiao et al., 2023), our approach samples the initial token at a very high temperature and proceeds with the regular sampling process for the remaining sequence. Our algorithm can be viewed as a simplified and more general version of CoT-decoding (Wang and Zhou, 2024), with a particular focus on training in math and coding domains, where a sandbox checker is available at relatively low cost.


We first show that, at inference time, our method can improve the pass rate within $n$ trials (pass@$n$), also known as best-of-$n$ (BoN) when only the correctness of the final answer is considered; a worked estimator is sketched below. To demonstrate its effectiveness in training, we show that it can be directly integrated into the reinforcement learning process of large language models. Our approach proves effective across multiple open-source models and various LLM capabilities, including mathematical reasoning and coding. We highlight how our method promotes diversity in generated samples, a key factor linked to improvements in pass rate. Importantly, this diversity is maintained even after training with our sampling method, indicating room for further enhancement. We also discuss how simple variations of our method, in which the temperature change occurs mid-response rather than at the start, affect performance.

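For concreteness, pass@$n$ is usually estimated from a larger pool of sampled solutions per problem. The sketch below uses the standard unbiased estimator from the code-generation literature (not code from this paper); it assumes the sandbox checker has already counted `num_correct` passing samples out of `num_samples`.

```python
from math import comb

def pass_at_n(num_samples: int, num_correct: int, n: int) -> float:
    """Unbiased pass@n: probability that at least one of n responses drawn
    (without replacement) from num_samples attempts is correct."""
    if num_samples - num_correct < n:
        return 1.0  # every size-n subset must contain a correct sample
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# Toy usage: 40 samples for one problem, 12 pass the sandbox checker.
print(round(pass_at_n(40, 12, 10), 4))  # per-problem pass@10; average over problems
```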

2 Related Works


Researchers have been exploring two primary directions to efficiently improve response quality under a frozen pre-trained LLM. The first direction focuses on prompting techniques such as Chain-of-Thought (Wei et al., 2022) and Tree-of-Thought (Yao et al., 2023a). The second direction involves letting LLMs fix their own mistakes (Wang et al., 2023b; Yao et al., 2023b; Shinn et al., 2023; Madaan et al., 2023; Chen et al., 2024a). In line with these two directions, there has been increasing focus on controlled decoding in LLMs to enhance reasoning capabilities during inference, ranging from search-based approaches applied to policy models (Mudgal et al., 2023; Huang et al., 2024) to utilizing value models trained in the alignment phase (Liu et al., 2023a; Feng et al., 2023).


In this paper, we also focus on inference time; however, our approach extends to the sampling processes used during the training of large language models, as practiced in InstructGPT (Ouyang et al., 2022). This training paradigm consists of three key stages: pre-training, supervised fine-tuning (SFT), and alignment, also known as reinforcement learning from human feedback (RLHF). Large language models trained in this paradigm can exhibit properties that, while lacking strong theoretical guarantees, hold empirically and are therefore useful. Our work is related to the attention sink (Xiao et al., 2023). An attention sink refers to a token or set of tokens that disproportionately receive attention from other tokens in the attention mechanism of transformer architectures. In their study, the initial token was found to be one of the most identifiable attention sinks. While there are no theoretical guarantees, they offer the intuition that initial tokens are visible to and used in all later token generations, making them more readily trained to act as attention sinks.


Our work is closely related to CoT-decoding (Wang and Zhou, 2024), which uncovers CoT paths by enumerating the top-$k$ alternative tokens and aggregating the decoded responses, scored by confidence in the final answer. However, our approach differs in three key aspects: (1) we introduce a differentiable sampling method that can be directly integrated with existing inference and training frameworks; (2) we focus on improving model performance in scenarios with a sandbox checker, where aggregating responses is less data-efficient; and (3) our method operates without assumptions about the prompts, even when a chain of thought (CoT) is included, extending beyond the scope of CoT-decoding.


3 Flaming-hot Initiation with Regular Execution


3.1 Method


In this work, we propose a sampling method, Flaming-hot Initiation with Regular Execution (FIRE), inspired by the attention sink phenomenon (Xiao et al., 2023) that demonstrates the importance of initial tokens.


FIRE first samples the initial token at a very high temperature $p \gg 1$, combined with top-$k$ filtering to keep the candidate tokens controllable. At such a high temperature, candidate tokens are sampled from a probability distribution that approaches uniform. After the initial token is sampled, FIRE proceeds with the decoding stage using a regular temperature setting; a minimal sketch follows.

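The sketch below implements FIRE in plain Python/NumPy. The helper `logits_fn` (a stand-in for one forward pass of a language model) and the specific temperature and top-k values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0):
    """Draw one token id from next-token logits with temperature and optional top-k."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k > 0:
        kth = np.sort(logits)[-min(top_k, logits.size)]
        logits = np.where(logits >= kth, logits, -np.inf)  # keep only top-k candidates
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(logits.size, p=probs))

def fire_generate(logits_fn, prompt_ids, max_new_tokens,
                  hot_temperature=30.0, hot_top_k=16,
                  temperature=0.7, top_k=32, eos_id=None):
    """Flaming-hot initiation (near-uniform draw over the top-k first tokens),
    then regular execution for the rest of the sequence."""
    ids = list(prompt_ids)
    for step in range(max_new_tokens):
        logits = logits_fn(ids)  # next-token logits given the current context
        if step == 0:
            tok = sample_token(logits, hot_temperature, hot_top_k)
        else:
            tok = sample_token(logits, temperature, top_k)
        ids.append(tok)
        if eos_id is not None and tok == eos_id:
            break
    return ids
```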

Our approach FIRE is similar to CoT-decoding (Wang and Zhou, 2024), which enumerates the top-$k$ candidates of the initial token. However, while CoT-decoding focuses on the decoding stage and on extracting chain-of-thought reasoning without prompting, FIRE serves as a general differentiable sampling method: it can be combined with existing sampling frameworks (see the sketch below) and is more efficient in the training stage, where a sandbox checker that judges whether a specific answer is correct is available at low cost.

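Because FIRE only changes how the first token is drawn, it composes with an off-the-shelf sampling stack. A minimal sketch with Hugging Face transformers follows, using two `generate` calls: a one-token flaming-hot call, then a regular continuation. The model id and hyperparameters are illustrative, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # one of the models evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tok("Please reason step-by-step: what is 17 * 24?", return_tensors="pt")

# Flaming-hot initiation: sample exactly one token near-uniformly from the top 16.
first = model.generate(**inputs, max_new_tokens=1, do_sample=True,
                       temperature=30.0, top_k=16)

# Regular execution: continue from the hot first token with ordinary settings.
full = model.generate(first, max_new_tokens=512, do_sample=True,
                      temperature=0.7, top_p=0.9, top_k=32)
print(tok.decode(full[0], skip_special_tokens=True))
```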

While FIRE can be applied to any token in the decoding stage, we restrict its application to the initial token to prevent the generation of random tokens that are wrong in context. For example, if we applied FIRE after the prefix "1+2=", it would sample, in addition to the token "3", tokens like "4" or "5", which are very likely to be wrong. In contrast, since FIRE is only applied to the initial token, it is unlikely to produce broken sentences or code with syntax errors.


| Dataset | Model | Regular pass rate (%) | Regular #EA | FIRE pass rate (%) | FIRE #EA |
|---|---|---|---|---|---|
| GSM8K | DeepSeek | 97.57 | 2.26 | 98.71 | 2.76 |
| GSM8K | Gemma-2 | 86.81 | 3.87 | 87.57 | 4.01 |
| GSM8K | Qwen2 | 95.90 | 2.58 | 98.25 | 3.17 |
| GSM8K | Qwen2-RL | 96.90 | 2.63 | 97.90 | 3.26 |
| MATH | DeepSeek | 76.16 | 5.63 | 78.16 | 7.89 |
| MATH | Gemma-2 | 49.20 | 9.24 | 51.48 | 10.39 |
| MATH | Qwen2 | 76.60 | 7.44 | 79.08 | 9.03 |
| MATH | Qwen2.5-72B | 79.30 | 2.39 | 80.40 | 2.60 |

Table 1: Inference results for different models on different datasets with the best hyperparameter combinations. Qwen2-RL is a model we fine-tuned ourselves. We report the pass rate (%) over 40 samples and the number of effective answers (#EA) among those 40 samples.


Table 2: Pass rate (%) with different numbers of samples from Qwen2-7B-Instruct on MBPP and MBPP+.


| Benchmark | Regular Pass@1 | Regular Pass@10 | FIRE Pass@1 | FIRE Pass@10 |
|---|---|---|---|---|
| MBPP | 61.2 | 82.8 | 50.6 | 86.6 |
| MBPP+ | 52.7 | 74.2 | 44.1 | 77.0 |

In our empirical experiments, we found that the initial token frequently consists of words like "Let's", "Sure", "So", and "The", which do not directly convey any information. What these initial tokens affect is the reasoning steps that follow, with the same intuition as StreamingLLM (Xiao et al., 2023).


3.2 Experiments


In this section, we evaluate our algorithm, FIRE, by addressing several key research questions that guide our experiments.


How effective is FIRE during inference? We first showcase the effectiveness of FIRE sampling in inference-only scenarios. We tested four open-source models: Qwen2-7B-Instruct (Qwen2) (Yang et al., 2024), Qwen2.5-72B-Instruct (Qwen2.5-72B) (Yang et al., 2024), DeepSeek-coder-v2-Instruct (DeepSeek) (Zhu et al., 2024), and Gemma2-2b-it (Gemma-2) (Team et al., 2024), on a diverse set of datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and MBPP(+) (Austin et al., 2021; Liu et al., 2023c). In GSM8K and MATH, we extend the prompts with the phrase "Please reason step-by-step" to ensure CoT reasoning in the models' responses, a setting where the original motivation of CoT-decoding becomes less meaningful because CoT paths occur naturally. For the regular sampling settings, we use a combination of nucleus sampling and top-$k$ sampling.


Regular FIRE
P k min-p n=10 n=40 n=10 n=40
0.7 16 0.01 0 66.4 66.4 75.8 75.8 70.0 70.0 78.9 78.9
32 0.01 66.2 75.3 70.1 78.9
0.9 16 0 0.01 66.2 66.1 75.2 76.6 70.1 69.5 78.9 78.9
32 0 0.01 0 66.2 66.7 66.8 76.6 76.4 74.4 69.5 69.5 69.1 78.9 79.1 79.0

Table 3: Pass rate (%) for Qwen2-7B-Instruct on the MATH dataset with different hyperparameter combinations. p: nucleus sampling parameter; k: top-$k$ sampling parameter; min-p: minimum probability threshold (0 indicates min-p is not used). n=10 and n=40 denote the number of samples used to compute the pass rate.


To ensure a fair comparison, we conducted a thorough enumeration over hyperparameters, including $p$, $k$, and min-p (Hugging Face, 2023); a sketch of how these three filters compose is given below. Tables 1 and 2 present the aggregated results, where the reported numbers represent the best outcomes from the enumeration. We observe that FIRE consistently improves the pass rate over regular sampling across all models and benchmarks. To further demonstrate the consistent improvement across hyperparameters, we provide an example result for Qwen2-7B-Instruct on the MATH dataset in Table 3; full results for all models and datasets are provided in the appendix. Table 3 reveals that although FIRE may change which hyperparameter combination yields optimal performance, it consistently outperforms regular sampling across all hyperparameter combinations.

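For reference, this is how the three enumerated filters (top-$k$, nucleus $p$, and min-p) compose on next-token logits. The ordering follows common open-source implementations and is an assumption, not the paper's exact pipeline; min-p is applied relative to the top token's probability, matching its usual definition.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def filter_logits(logits, top_k=0, top_p=1.0, min_p=0.0):
    """Apply top-k, nucleus (top-p), and min-p filtering to next-token logits;
    filtered-out tokens get probability zero via -inf logits."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if top_k > 0:
        kth = np.sort(logits)[-min(top_k, logits.size)]
        logits[logits < kth] = -np.inf
    probs = softmax(logits)
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]        # most probable first
        cut = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        logits[order[cut:]] = -np.inf          # keep smallest set covering top_p mass
        probs = softmax(logits)
    if min_p > 0.0:
        logits[probs < min_p * probs.max()] = -np.inf  # threshold relative to top token
    return logits
```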

Why is FIRE effective? FIRE introduces more diversity into the initial token, which is generated at a high temperature, and because of the strong attention scores toward initial tokens (Xiao et al., 2023), this diversity benefits the entire subsequent generation. To measure diversity quantitatively, we use the number of unique answers (effective answers) within a set of responses as our metric; a sketch of this metric follows the paragraph. We choose not to use popular metrics such as n-grams, since we only control the initial token, and in tasks with long reasoning paths, such as math and coding, similar n-grams will almost always appear, making them unsuitable for measuring diversity. As shown in Figure 1 and Table 1 (#EA), FIRE yields increased diversity across various models and datasets, which contributes to the enhanced pass@$n$ performance.

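A sketch of the effective-answers count; `extract_final_answer` is a hypothetical task-specific parser (e.g. pulling the \boxed{...} value from a MATH solution), not a function from the paper.

```python
def effective_answers(responses, extract_final_answer):
    """#EA: number of distinct final answers among a set of sampled responses."""
    return len({extract_final_answer(r) for r in responses})

# Toy usage with a trivial parser that takes the last whitespace-separated token.
samples = ["... so the answer is 42", "... therefore 42", "... giving 45"]
print(effective_answers(samples, lambda r: r.split()[-1]))  # -> 2
```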


Figure 1: Curves for pass rate and number of effective answers with different numbers of samples on GSM8K.


| Dataset | Model | PPO | PPO+FIRE |
|---|---|---|---|
| GSM8K | DeepSeek | 80.64 | 82.16 |
| GSM8K | Qwen2 | 80.16 | 82.02 |
| GSM8K | Gemma | 40.39 | 42.91 |
| MATH | Gemma-2 | 58.07 | 61.20 |
| MATH | Qwen2 | 53.50 | 55.07 |

Table 4: Pass@1 on GSM8K and MATH for different models trained with PPO under different sampling methods.


| Sampling | 1st-line | 2nd-line | 3rd-line | PRM-line |
|---|---|---|---|---|
| Regular | 46.07 | 74.36 | 74.77 | 75.73 |
| FIRE | 64.59 | 74.96 | 75.92 | 78.21 |

Table 5: Pass@10 results from Qwen2-7B-Instruct on the training set of the MATH dataset for FIRE variants with different sampling points, compared to a regular sampling method that does not change the temperature.


As anticipated, FIRE does not improve Pass@1 performance, since it focuses on promoting diversity. However, it consistently delivers improvements when more samples are considered.


Is FIRE helpful when integrated into training? Having established that our method improves pass@$n$ by improving diversity, we directly apply FIRE to boost language model training. To test this, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to fine-tune several models on the GSM8K and MATH datasets, and assess their performance through the final pass rate for single samples (Pass@1). As shown in Table 4, integrating FIRE into the training process leads to an improvement in Pass@1. Notably, even though each data point is sampled only once during PPO training, following common practice (Ouyang et al., 2022; Sheng et al., 2024), our method still yields improvements, and the improvements are consistent across different models. Furthermore, after our RL training, the model still exhibits diversity and continues to benefit from inference-time pass-rate improvements, as evidenced by Qwen2-RL in Table 1. Consequently, FIRE can be applied iteratively to refine the model, leading to an even bigger improvement margin.


Can FIRE sampling work in mid-sequence? Finally, we explore the effect of applying FIRE sampling midway through a response. We first construct a dataset that guarantees the correctness of the initial sentences by using a Process Reward Model (PRM) to identify the first sentence at which a response becomes incorrect. We then evaluate the effect of applying FIRE sampling at the beginning of different sentences (1st, 2nd, and 3rd line) or at the first token deemed incorrect by the PRM ("PRM-line"); a sketch of this variant follows the paragraph. We refer the reader to the appendix for a more detailed description of the construction of this dataset. As shown in Table 5, while FIRE sampling offers benefits across these settings, its advantage diminishes for tokens beyond the initial ones, despite an overall increase in accuracy due to the prefix being guaranteed correct.

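A sketch of the mid-sequence variant, reusing `sample_token` from the FIRE sketch in Section 3.1: the flaming-hot draw is moved to an arbitrary generation step `hot_step` (for instance, the first token of the sentence flagged by the PRM). The parameter names and values are illustrative assumptions.

```python
def fire_generate_at(logits_fn, prefix_ids, max_new_tokens, hot_step=0,
                     hot_temperature=30.0, hot_top_k=16,
                     temperature=0.7, top_k=32):
    """FIRE variant: hot_step=0 recovers the original FIRE; hot_step>0 applies
    the flaming-hot draw mid-sequence, e.g. where a PRM first flags an error."""
    ids = list(prefix_ids)  # a prefix whose correctness is assumed or verified
    for step in range(max_new_tokens):
        logits = logits_fn(ids)
        hot = (step == hot_step)
        ids.append(sample_token(logits,
                                hot_temperature if hot else temperature,
                                hot_top_k if hot else top_k))
    return ids
```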

4 Conclusion


In this paper, we introduced a novel sampling method called Flaming-hot Initiation with Regular Execution (FIRE).