[Paper Translation] Flaming-hot Initiation with Regular Execution Sampling for Large Language Models


Original paper: https://arxiv.org/pdf/2410.21236v1


Flaming-hot Initiation with Regular Execution Sampling for Large Language Models

Guanlin Liu, Renjie Zheng, Wenlei Shi, Chen Dun, Zheng Wu, Xing Jin, Lin Yan ByteDance

{guanlin.liu, renjie.zheng, wenlei.shi}@bytedance.com {chen.dun, zheng.wu1, jinxing.9, neil}@bytedance.com

Abstract

Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.

1 Introduction

Large language models (LLMs) have achieved remarkable success in a wide range of tasks since the release of ChatGPT (OpenAI, 2022). In addition to traditional natural language processing tasks such as summarization and sentiment analysis, LLMs have demonstrated effectiveness in many new domains, including code generation (Chen et al., 2023; Roziere et al., 2023), human-computer interaction (Li et al., 2023), and math problem-solving (Wei et al., 2022; Yu et al., 2024). Although standalone LLMs have limited reasoning capabilities (Sun et al., 2023; Valmeekam et al., 2023; Chen et al., 2024b), researchers have tried to enhance them by incorporating tool use and developing integrated systems known as LLM agents (Xi et al., 2023; Wang et al., 2024), which expand the applications of LLMs to more general domains like robot control (Wang et al., 2023a) and autonomous driving (Mao et al., 2023).

To develop general capabilities, LLMs are typically trained through a three-stage process: pre-training, supervised fine-tuning (SFT), and alignment (Bai et al., 2022; Ouyang et al., 2022). During pre-training, the model learns from a vast array of data gathered from publicly available sources. Then, in the SFT and alignment stages, the model's abilities are further refined, improving its reasoning and its ability to follow users' instructions. To improve performance on reasoning tasks, a sandbox checker (a tool that verifies the correctness of solutions) is often used during training (Liu et al., 2023b). Therefore, one of the key challenges in achieving effective and efficient training is determining how to obtain more successful samples within a fixed number of trials, particularly when addressing complex problems.

In this paper, we introduce Flaming-hot Initiation with Regular Execution (FIRE), a simple yet effective sampling method for training large language models. Inspired by recent findings on attention sink (Xiao et al., 2023), our approach begins by sampling the initial token at a very high temperature and proceeds with the regular sampling process for the remaining sequence. Our algorithm can be viewed as a simplified and more general version of CoT-decoding (Wang and Zhou, 2024), especially with a focus on training in math and coding domains where a sandbox checker is available at a relatively cheap cost.

We first show that our method, at inference time, can improve the pass rate within $n$ trials (pass@$n$), also known as best-of-N (BoN) when only the correctness of the final answer is considered. To demonstrate its effectiveness in training, we show that it can be directly integrated into the reinforcement learning process of large language models. Our approach proves to be effective across multiple open-source models and various LLM capabilities, including mathematical reasoning and coding. We highlight how our method promotes diversity in generated samples, a key factor linked to performance improvements in pass rate. Importantly, this diversity is maintained even after training with our sampling method, indicating room for further enhancement. We also discuss how simple variations of our method, in which the temperature change occurs mid-process rather than at the start, affect performance.

2 Related Works

Researchers have been exploring two primary directions to efficiently improve response quality under a frozen pre-trained LLM. The first direction focuses on prompting techniques such as Chain-of-Thought (Wei et al., 2022) and Tree-of-Thought (Yao et al., 2023a). The second direction involves letting LLMs fix their own mistakes (Wang et al., 2023b; Yao et al., 2023b; Shinn et al., 2023; Madaan et al., 2023; Chen et al., 2024a). In line with these two directions, there has been increasing focus on controlled decoding in LLMs to enhance reasoning capabilities during inference, ranging from search-based approaches applied to policy models (Mudgal et al., 2023; Huang et al., 2024) to utilizing value models trained in the alignment phase (Liu et al., 2023a; Feng et al., 2023).

In this paper, we also focus on inference time; however, our approach extends to the sampling processes used during the training of large language models, as commonly practiced in InstructGPT (Ouyang et al., 2022). This training process consists of three key stages: pre-training, supervised fine-tuning (SFT), and alignment, also known as reinforcement learning with human feedback (RLHF). Large language models trained in this paradigm may have properties that, although lacking strong theoretical guarantees, hold empirically and are therefore useful. Our work is related to attention sink (Xiao et al., 2023). An attention sink is a token or set of tokens that receives a disproportionate share of attention from other tokens in the attention mechanism of transformer architectures. In their study, Xiao et al. found that one of the most identifiable attention sinks is the initial token. While there are no theoretical guarantees, they propose the intuition that initial tokens are visible to, and used in, all later token generations, making them more readily trained to be attention sinks.

Our work is closely related to CoT-decoding (Wang and Zhou, 2024), which uncovers CoT paths by enumerating over the top-$k$ alternative tokens and aggregating the decoded responses by scoring them with the confidence on the final answer. However, our approach differs in three key aspects: (1) we introduce a differentiable sampling method that can be directly integrated with existing inference and training frameworks, (2) we focus on improving model performance in scenarios with a sandbox checker, where aggregating responses is less data-efficient, and (3) our method operates without assumptions about the prompts, even when a chain of thought (CoT) is included, extending beyond the scope of CoT-decoding.

3 Flaming-hot Initiation with Regular Execution

3.1 Method

In this work, we propose a sampling method, Flaming-hot Initiation with Regular Execution (FIRE), inspired by the attention sink phenomenon (Xiao et al., 2023) that demonstrates the importance of initial tokens.

FIRE first samples the initial token at a very high temperature $p\gg1$, combined with top-$k$ filtering to make the candidate tokens more controllable. At higher temperatures, the candidate tokens are sampled from a probability distribution that approaches uniform sampling. After the initial token is sampled, FIRE proceeds with the decoding stage using a regular temperature setting.
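The procedure above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `next_logits` is a hypothetical callable that returns next-token logits for a list of token ids, the regular temperature of 1.0 is an assumption, and the defaults of 30 for the initial-token temperature and 16 for top-$k$ are taken from the values reported in the appendix.

```python
import numpy as np

def sample_token(logits, temperature, top_k, rng):
    """Sample one token id after top-k filtering and temperature scaling."""
    logits = np.asarray(logits, dtype=np.float64)
    k = min(top_k, logits.size)
    # Keep only the k highest-scoring candidates.
    top_ids = np.argsort(logits)[-k:]
    scaled = logits[top_ids] / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

def fire_sample(next_logits, prompt_ids, max_new_tokens, rng,
                hot_temperature=30.0, regular_temperature=1.0, top_k=16):
    """First token at `hot_temperature`, all later tokens at `regular_temperature`.

    `next_logits(ids)` is a hypothetical model interface (an assumption).
    """
    ids = list(prompt_ids)
    for step in range(max_new_tokens):
        temperature = hot_temperature if step == 0 else regular_temperature
        ids.append(sample_token(next_logits(ids), temperature, top_k, rng))
    return ids[len(prompt_ids):]
```

With a temperature this high, the first token is drawn almost uniformly from the top-$k$ candidates, which is why the top-$k$ filter matters: it bounds how unlikely the initial token can be.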

FIRE is similar to CoT-decoding (Wang and Zhou, 2024), which enumerates the top-$k$ candidates of the initial token. However, while CoT-decoding focuses on the decoding stage and on extracting a chain of thought without prompting, FIRE serves as a general differentiable sampling method. It can be combined with existing sampling frameworks and can be more efficient in the training stage, where a sandbox checker that judges whether a specific answer is correct is available at low cost.

While FIRE can be applied to any token in the decoding stage, we restrict its application to the initial token to prevent the generation of random tokens that are wrong in context. For example, if we apply FIRE after the prefix "1+2=", it would sample, in addition to the token "3", other tokens such as "4" or "5", which are very likely to be wrong. In contrast, when FIRE is applied only to the initial token, it is unlikely to produce broken sentences or code with syntax errors. In our empirical experiments, we found that the initial token frequently consists of words like "Let's", "Sure", "So", and "The", which do not directly convey any information. What these initial tokens do affect is the reasoning steps that follow, matching the intuition behind StreamingLLM (Xiao et al., 2023).

| Dataset | Model | Regular: Pass rate (%) | Regular: #EA | FIRE: Pass rate (%) | FIRE: #EA |
|---|---|---|---|---|---|
| GSM8K | DeepSeek | 97.57 | 2.26 | 98.71 | 2.76 |
| GSM8K | Gemma-2 | 86.81 | 3.87 | 87.57 | 4.01 |
| GSM8K | Qwen2 | 95.90 | 2.58 | 98.25 | 3.17 |
| GSM8K | Qwen2-RL | 96.90 | 2.63 | 97.90 | 3.26 |
| MATH | DeepSeek | 76.16 | 5.63 | 78.16 | 7.89 |
| MATH | Gemma-2 | 49.20 | 9.24 | 51.48 | 10.39 |
| MATH | Qwen2 | 76.60 | 7.44 | 79.08 | 9.03 |
| MATH | Qwen2.5-72B | 79.30 | 2.39 | 80.40 | 2.60 |

Table 1: Inference results for different models on different datasets with the best hyperparameter combinations. Qwen2-RL is a fine-tuned model that we trained ourselves. We show the pass rate (%) with 40 samples and the number of effective answers (EA) among the 40 samples.

| Dataset | Regular: Pass@1 | Regular: Pass@10 | FIRE: Pass@1 | FIRE: Pass@10 |
|---|---|---|---|---|
| MBPP | 61.2 | 82.8 | 50.6 | 86.6 |
| MBPP+ | 52.7 | 74.2 | 44.1 | 77.0 |

Table 2: Pass rate (%) with different numbers of samples from Qwen2-7B-Instruct on MBPP and MBPP+.

3.2 Experiments

In this section, we evaluate our algorithm, FIRE, by addressing several key research questions that guide our experiments.

How effective is FIRE during inference? We first showcase the effectiveness of FIRE sampling in inference-only scenarios. We tested four open-source models: Qwen2-7B-Instruct (Qwen2) (Yang et al., 2024), Qwen2.5-72B-Instruct (Qwen2.5-72B) (Yang et al., 2024), DeepSeek-coder-v2-Instruct (DeepSeek) (Zhu et al., 2024), and Gemma2-2b-it (Gemma-2) (Team et al., 2024), on a diverse set of datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and MBPP(+) (Austin et al., 2021; Liu et al., 2023c). In GSM8K and MATH, we extend the prompts with the phrase "Please reason step-by-step" to ensure CoT reasoning in the models' responses, a setting where the original motivation of CoT-decoding becomes less meaningful because CoT paths occur naturally. For the regular sampling settings, we use a combination of nucleus sampling and top-$k$ sampling.
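For readers unfamiliar with the regular sampling filters, the following sketch shows how top-$k$, nucleus (top-$p$), and min-$p$ filtering restrict a next-token distribution. It is an illustration under stated assumptions, not the code used in the experiments: the function name `filter_probs` is hypothetical, all three filters are applied to the original probabilities (inference frameworks may renormalize between filters), and min-$p$ is taken to mean a threshold relative to the most likely token's probability.

```python
import numpy as np

def filter_probs(probs, top_k=16, top_p=0.9, min_p=0.0):
    """Return a renormalized distribution after top-k, top-p, and min-p filtering."""
    probs = np.asarray(probs, dtype=np.float64)
    keep = np.ones(probs.size, dtype=bool)

    # top-k: keep only the k most probable tokens.
    order = np.argsort(probs)[::-1]
    keep[order[top_k:]] = False

    # nucleus: keep the shortest sorted prefix whose cumulative mass reaches top_p.
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep[order[cutoff:]] = False

    # min-p (assumed definition): drop tokens below min_p times the top probability.
    if min_p > 0.0:
        keep &= probs >= min_p * probs.max()

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()
```

Setting `min_p=0.0` disables the min-$p$ filter, which matches the convention used in Table 3.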

| p | k | min-p | Regular, n=10 | Regular, n=40 | FIRE, n=10 | FIRE, n=40 |
|---|---|---|---|---|---|---|
| 0.7 | 16 | 0.01 / 0 | 66.4 / 66.4 | 75.8 / 75.8 | 70.0 / 70.0 | 78.9 / 78.9 |
| | 32 | 0.01 | 66.2 | 75.3 | 70.1 | 78.9 |
| 0.9 | 16 | 0 / 0.01 | 66.2 / 66.1 | 75.2 / 76.6 | 70.1 / 69.5 | 78.9 / 78.9 |
| | 32 | 0 / 0.01 / 0 | 66.2 / 66.7 / 66.8 | 76.6 / 76.4 / 74.4 | 69.5 / 69.5 / 69.1 | 78.9 / 79.1 / 79.0 |

Table 3: Pass rate (%) for Qwen2-7B-Instruct on the MATH dataset with different hyperparameter combinations. p: nucleus sampling parameter; k: top-$k$ sampling parameter; min-p: minimum probability threshold (0 indicates min-p is not used). n=10 and n=40 are the numbers of samples used to calculate the pass rate. Values separated by "/" within a row correspond, in order, to the min-p settings listed in that row.

To ensure a fair comparison, we conducted a thorough enumeration over hyperparameters, including $p$, $k$, and min-$p$ (Hugging Face, 2023). Table 1 and Table 2 present the aggregated results, where the reported numbers represent the best outcomes from the enumeration. We observe that FIRE consistently improves the pass rate compared to regular settings across all models on different benchmarks. To further demonstrate the consistent improvement over different hyperparameters, we provide an example result of Qwen2-7B-Instruct on the MATH dataset in Table 3. Full results for all models and datasets are provided in the appendix. Table 3 reveals that although FIRE may alter the hyperparameter combination that yields optimal performance, it consistently outperforms regular sampling across all hyperparameter combinations.

Figure 1: Curves for pass rate and number of effective answers with different numbers of samples on GSM8K.

| Dataset | Model | PPO | PPO+FIRE |
|---|---|---|---|
| GSM8K | DeepSeek | 80.64 | 82.16 |
| GSM8K | Qwen2 | 80.16 | 82.02 |
| GSM8K | Gemma | 40.39 | 42.91 |
| MATH | Gemma-2 | 58.07 | 61.20 |
| MATH | Qwen2 | 53.50 | 55.07 |

Table 4: Pass@1 on GSM8K and MATH for different models trained with PPO using different sampling methods.

| Sampling | 1st-line | 2nd-line | 3rd-line | PRM-line |
|---|---|---|---|---|
| Regular | 46.07 | 74.36 | 74.77 | 75.73 |
| FIRE | 64.59 | 74.96 | 75.92 | 78.21 |

Table 5: Pass@10 results from Qwen2-7B-Instruct on the training set of the MATH dataset for FIRE variants with different sampling points, compared to the regular sampling method, which does not change the temperature.

Why is FIRE effective? FIRE introduces more diversity in the initial token, which is generated at a high temperature, and because of the strong attention scores toward initial tokens (Xiao et al., 2023), this diversity benefits the entire subsequent generation. To measure diversity quantitatively, we use the number of unique answers (effective answers) within a set of responses as our metric. We choose not to use popular metrics such as n-grams: since we only control the initial token, and responses in tasks with long reasoning paths, such as math and coding, will almost always share similar n-grams, such metrics are unsuitable for measuring diversity here. As shown in Figure 1 and Table 1 (#EA), FIRE demonstrates increased diversity across various models and datasets, which contributes to enhanced pass@$n$ performance. As anticipated, FIRE does not improve Pass@1 performance because of its focus on promoting diversity. However, it consistently delivers improvements when more samples are considered.
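The two quantities discussed in this part, the pass rate over $n$ samples and the number of effective answers, can be computed as in the sketch below. It assumes that final answers have already been extracted and normalized to comparable strings, that string equality with a reference stands in for the sandbox checker, and that #EA is averaged over problems; the function names are hypothetical.

```python
def effective_answers(answers):
    """Number of unique final answers in one set of sampled responses."""
    return len(set(answers))

def passes(answers, reference):
    """True if at least one sampled answer matches the reference."""
    return any(answer == reference for answer in answers)

def evaluate(samples_per_problem, references):
    """Return (pass rate in %, mean number of effective answers) over a dataset."""
    n_problems = len(references)
    solved = sum(passes(s, r) for s, r in zip(samples_per_problem, references))
    mean_ea = sum(effective_answers(s) for s in samples_per_problem) / n_problems
    return 100.0 * solved / n_problems, mean_ea
```

A sampler that produces more distinct answers per problem has more chances that one of them is correct, which is the link between #EA and pass@$n$ drawn in the text.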

Is FIRE helpful when integrated into training? Having established that our method improves pass@$n$ by improving diversity, we directly apply FIRE to boost language model training. To test this, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to fine-tune several models on the GSM8K and MATH datasets, and assess their performance through the final pass rate for single samples (Pass@1). As shown in Table 4, integrating FIRE into the training process leads to an improvement in Pass@1. Notably, even though each data point is sampled only once during PPO training, following common practice (Ouyang et al., 2022; Sheng et al., 2024), our method still yields improvements. The results also show that the improvements are consistent across different models. Furthermore, after our RL training, the model still exhibits diversity and continues to benefit from inference-time pass rate improvements, as evidenced by Qwen2-RL in Table 1. Consequently, FIRE can be applied iteratively to refine the model, leading to an even larger improvement margin.

Can FIRE sampling work in mid-sequence? Finally, we explore the effect of applying FIRE sampling midway through a response. We first construct a dataset that ensures the correctness of the initial sentences, by utilizing a Process Reward Model (PRM) to identify the first sentence at which the response becomes incorrect. We then evaluate the effect of applying FIRE sampling at the beginning of different sentences (1st, 2nd, and 3rd-line) or at the first token deemed incorrect by the PRM ("PRM-line"). We refer the reader to the appendix for a more detailed description of the construction of this dataset. As shown in Table 5, while FIRE sampling offers benefits in all settings, its advantage diminishes for tokens beyond the initial ones, even though overall accuracy increases because the prefix is guaranteed to be correct.

4 Conclusion

In this paper, we introduced a novel sampling method called Flaming-hot Initiation with Regular Execution (FIRE). Through empirical analysis, we demonstrated that FIRE enhances both inference-time performance and reinforcement learning, particularly when a chain of thought is integrated into the prompt. We showed that FIRE improves generation diversity, and we believe that this diversity contributes to its overall effectiveness. Additionally, we explored several variants of FIRE that modify the sampling process not only immediately after the question but also in the middle of the generation, further showcasing its versatility.

5 Limitations

While this work focuses on improving the efficiency of LLM training through better sampling methods, it has two limitations. First, our approach lacks a strong theoretical guarantee, so future models, especially those with different architectures, may not benefit from it. Second, although our method is designed for training LLMs, the inference-time algorithm could potentially bypass safety measures by sampling out-of-distribution data. However, we argue that this concern can be inherently mitigated in models trained with our proposed sampling technique.

References

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Weizhe Chen, Sven Koenig, and Bistra Dilkina. 2024a. Reprompt: Planning by automatic prompt engineering for large language models agents. arXiv preprint arXiv:2406.11132.

Weizhe Chen, Sven Koenig, and Bistra Dilkina. 2024b. Why solving multi-agent path finding with large language model has not succeeded yet. arXiv preprint arXiv:2401.03630.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. NeurIPS.

James Y Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchoff, and Dan Roth. 2024. Deal: Decoding-time alignment for large language models. arXiv preprint arXiv:2402.06147.

Hugging Face. 2023. New sampling strategy dropped in transformers – min p sampling. https://huggingface.co/posts/joaogante/319451541682734. Accessed: 2024-10-15.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with paged attention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems (NeurIPS).

Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2023a. Making ppo even better: Value-guided monte-carlo tree search decoding. arXiv preprint arXiv:2309.15028.

Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023b. Rltf: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023c. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS).

Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. 2023. A language agent for autonomous driving. arXiv preprint arXiv:2311.10813.

Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, et al. 2023. Controlled decoding from language models. arXiv preprint arXiv:2310.17022.

OpenAI. 2022. Introducing chatgpt.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

A Implementation Details

In the paper, we proposed FIRE sampling, which is similar to CoT-decoding but removes the need to calculate a confidence score. One of the biggest benefits of this simplification is an extremely easy implementation. For inference, we use vLLM (Kwon et al., 2023) and perform two-stage sampling: the first stage samples only one token at a high temperature, and the second stage continues the generation with regular sampling. For training, our implementation is based on HybridFlow (Sheng et al., 2024), a newly released RLHF code base that supports sampling with vLLM. Thus, we only changed the sampling part of the code in the RLHF framework. In all experiments, the temperature used for the initial token is set to 30.
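The two-stage procedure described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' vLLM code: `next_logits` is a hypothetical callback standing in for the language model, `fire_sample` and `sample_token` are names introduced here, and the regular temperature of 1.0 is an assumption (only the initial temperature of 30 comes from the text). Top-p, top-k, and min-p filtering are omitted for brevity.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def fire_sample(next_logits, prompt, max_new_tokens, rng,
                init_temperature=30.0, regular_temperature=1.0):
    """Generate tokens with a hot first token and regular sampling afterwards.

    next_logits: hypothetical callable mapping a token list to next-token logits.
    regular_temperature: assumed value; the text only fixes init_temperature.
    """
    tokens = list(prompt)
    for step in range(max_new_tokens):
        # Stage 1 covers only the first generated token; stage 2 covers the rest.
        temperature = init_temperature if step == 0 else regular_temperature
        tokens.append(sample_token(next_logits(tokens), temperature, rng))
    return tokens[len(prompt):]
```

With a toy model whose logits strongly favor one token, the first generated token is nearly uniform under the high initial temperature, while later tokens follow the model's usual distribution. In an inference engine, the same effect could be obtained by issuing one single-token request at high temperature and then continuing each result with the regular sampling settings.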

In our experiments, we enumerate the parameters of top-p sampling, top-k sampling, and min-p sampling, and we list all the values we tried in the next section. Due to computation costs, we did not enumerate the same number of combinations for every model. However, our conclusion that FIRE outperforms regular sampling is consistent, as we show later. Specifically, for MBPP(+) and for Qwen-RL (the model after our fine-tuning), we test a single hyperparameter combination, $\text{top-}p = 0.9$ and $\text{top-}k = 16$, which follows the best configuration from previous trials. For Qwen2.5-72b-Instruct, we follow the recommended hyperparameters of $\text{top-}p = 0.8$ and $\text{top-}k = 16$. For reinforcement learning problems, we use the default parameters in HybridFlow, namely $\text{top-}k = 16$ and $\text{top-}p = 1.0$. For training with FIRE sampling, to enable PPO to accept the relatively out-of-distribution samples, we change the PPO clipping ratio from 0.2 to 0.5. We observe that PPO+FIRE with the original clipping ratio generally matches the original performance, while pure PPO with the higher clipping ratio causes training to fail and converge to a pass rate close to 0.
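To make the role of the clipping ratio concrete, the sketch below evaluates the standard PPO clipped surrogate for a single token. It is an illustration of the textbook objective under stated assumptions, not HybridFlow code: `ppo_clipped_term` and its argument names are hypothetical, and per-token log-probabilities and advantages are assumed to be given.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_ratio):
    """Per-token PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio r
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * advantage, clipped * advantage)


# A sample whose probability rose by 40% under the new policy (r = 1.4):
logp_old, logp_new = 0.0, math.log(1.4)
tight = ppo_clipped_term(logp_new, logp_old, advantage=1.0, clip_ratio=0.2)
loose = ppo_clipped_term(logp_new, logp_old, advantage=1.0, clip_ratio=0.5)
```

With a clipping ratio of 0.2 the term saturates at about 1.2, so this sample no longer contributes gradient; with 0.5 the ratio of 1.4 lies inside the clipping range and the sample still drives the update. This is one way to read the statement that a larger ratio lets PPO accept samples that are further from the current policy's distribution.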

In the paper, we use three different datasets: GSM8K, MATH, and MBPP(+). GSM8K is a dataset with 8.5K total math problems, of which 7.5K are in the training set and 1.3K are in the test set. MATH is a math dataset that is slightly more difficult and more comprehensive than GSM8K, with 7.5K training problems and 5K test problems. MBPP is a benchmark consisting of around 1,000 crowd-sourced Python programming problems, and MBPP+ is a benchmark that enlarges MBPP with some harder problems, reaching around 35K total test problems. Because MBPP+ is still updated regularly, we use version 0.1.0 in our paper.

For the final experiment, on generation in the middle of a sequence, we use a dataset that guarantees that a certain number of prefix sentences are correct. Here, sentences are delimited by '.' in the answer. The dataset is generated from the training set of MATH: we first use Qwen2-7b-Instruct to sample 10 responses for each question. Then, for each response, we enumerate its sentences and sample 20 times using different numbers of sentences as the prefix. This gives an approximation of the point at which the original sample became wrong. Specifically, if a response becomes wrong before the line number enumerated in Table 5, we use the whole prefix up to the point where the response is still correct. For example, if a sample has fewer than 2 correct sentences, its 3rd-line pass rate is calculated in the same way as PRM-line.
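The prefix construction above can be sketched as follows. This is a rough illustration under assumptions, not the authors' data pipeline: `split_sentences` and `prefix_for_line` are hypothetical helpers, `num_correct` (the number of leading sentences judged still correct) is assumed to come from the 20-sample approximation described above, and splitting on every '.' will also break decimals such as "2.5".

```python
def split_sentences(answer: str) -> list[str]:
    """Split an answer into sentences on '.', keeping the delimiter."""
    parts = answer.split(".")
    sentences = [part + "." for part in parts[:-1]]
    if parts[-1].strip():  # trailing text without a final '.'
        sentences.append(parts[-1])
    return sentences


def prefix_for_line(sentences: list[str], line_index: int, num_correct: int) -> str:
    """Prefix used when generation restarts at the 1-based line `line_index`.

    Normally the first `line_index - 1` sentences are kept. If the response
    goes wrong earlier, the prefix is cut back to the still-correct part.
    """
    keep = min(line_index - 1, num_correct, len(sentences))
    return "".join(sentences[:keep])


sentences = split_sentences("We have x = 2. Then y = 4. So the sum is 6.")
full_prefix = prefix_for_line(sentences, line_index=3, num_correct=3)
cut_prefix = prefix_for_line(sentences, line_index=3, num_correct=1)
```

Here `full_prefix` keeps the first two sentences, while `cut_prefix` keeps only the first one, because the second sentence is assumed to be where the response goes wrong.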

B Extra Experiment Results

We provide the full inference results in Table 6, Table 7, and Table 8. Across all hyperparameter combinations, FIRE consistently outperforms regular sampling at Pass@10, Pass@20, and Pass@40. In most settings, FIRE is also superior to regular sampling at Pass@5, and for certain settings on the MATH dataset, FIRE even shows an advantage at Pass@1.
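For readers who want to reproduce Pass@n numbers from a fixed budget of samples, one common choice is the unbiased combinatorial estimator popularized by code-generation benchmarks. The sketch below assumes that estimator; the text does not state how Pass@n was computed, and `pass_at_k` is a name introduced here.

```python
import math


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    Uses 1 - C(n - c, k) / C(n, k), where n samples were drawn for a problem
    and c of them passed the checker.
    """
    if k > num_samples:
        raise ValueError("k must not exceed the number of samples")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - math.comb(num_samples - num_correct, k) / math.comb(num_samples, k)


# Example: 40 samples for one problem, 10 of which are correct.
p1 = pass_at_k(40, 10, 1)
p10 = pass_at_k(40, 10, 10)
```

The benchmark-level score is the mean of this quantity over all problems. With 40 samples per problem, as in the tables, Pass@40 reduces to the fraction of problems with at least one correct sample.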

Table 6: Deepseek-coder-v2-Instruct on different datasets with regular sampling (Reg) and FIRE (ours). We show the pass rate with different numbers of samples (Pass@n) and the effective answers (EA) of the total 40 samples.

Table 7: Gemma-2-2b-it on different datasets with regular sampling (Reg) and FIRE (ours). We show the pass rate with different numbers of samples (Pass@n) and the effective answers (EA) of the total 40 samples.

Table 8: Qwen2-7B-Instruct on different datasets with regular sampling (Reg) and FIRE (ours). We show the pass rate with different numbers of samples (Pass@n) and the effective answers (EA) of the total 40 samples.
