[论文翻译]CodeSteer:通过代码/文本引导的符号增强大语言模型


原文地址:https://arxiv.org/pdf/2502.04350


CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

CodeSteer:通过代码/文本引导的符号增强大语言模型

Abstract

摘要

Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark, SymBench, comprising 37 symbolic tasks with adjustable complexity, and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLMs OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, datasets, and code are available at https://github.com/yongchao98/CodeSteer-v1.0.

现有方法无法在大语言模型(LLMs)的文本推理和代码生成之间有效引导,导致符号计算能力未被充分利用。我们提出了CodeSteer,一种有效引导LLM代码/文本生成的方法。我们构建了一个全面的基准SymBench,包含37个复杂性可调的符号任务,并合成了包含12k多轮引导/生成轨迹和5.5k引导对比对的数据集。我们通过新设计的多轮监督微调(SFT)和直接偏好优化(DPO)对Llama3-8B模型进行微调。最终得到的模型CodeSteer LLM,通过提出的符号和自答检查器增强,能够有效引导更大模型的代码/文本生成。将CodeSteer应用于GPT-4o后,其平均性能得分从53.3提升至86.4,甚至在所有37个任务(28个已知,9个未知)中超过了现有的最佳LLM OpenAI o1(82.7)、o1-preview(74.8)和DeepSeek R1(76.8)。为GPT-4o训练的CodeSteer展示了卓越的泛化能力,在Claude、Mistral和GPT3.5上平均提升了41.8的性能得分。由CodeSteer引导的LLMs充分利用符号计算,在高度复杂的任务中保持强劲表现。模型、数据集和代码可在https://github.com/yongchao98/CodeSteer-v1.0获取。

1. Introduction

1. 引言

While the reasoning and planning capabilities of LLMs have improved significantly (Wang et al., 2024; Chen et al., 2024c; Li et al., 2023), they still fail in ostensibly simple tasks (Zhou et al., 2024a). Crucially, many tasks in existing benchmarks, such as Blocksworld (Valmeekam et al., 2024) and Game 24 (Zhou et al., 2023b), can be completely solved with code solutions. Text-based reasoning excels at semantic understanding and commonsense inference but is less suited for exact computation, symbolic manipulation, optimization, and algorithmic processing (Valmeekam et al., 2022). In contrast, symbolic computing via code generation is adept at handling rigorous operations and can easily leverage specialized tools (e.g., equation solvers). In many tasks, prompting LLMs to generate and execute code outperforms purely textual reasoning (Madaan et al., 2022; Liang et al., 2022; Chen et al., 2022).

尽管大语言模型的推理和规划能力已显著提升(Wang et al., 2024; Chen et al., 2024c; Li et al., 2023),它们在表面上简单的任务中仍然表现不佳(Zhou et al., 2024a)。关键的是,现有基准测试中的许多任务,如积木世界(Blocksworld, Valmeekam et al., 2024)和24点游戏(Game 24, Zhou et al., 2023b),都可以通过代码方案完全解决。基于文本的推理在语义理解和常识推理方面表现出色,但在精确计算、符号操作、优化和算法处理方面则不太适用(Valmeekam et al., 2022)。相比之下,通过代码生成进行符号计算擅长处理严格的操作,并且可以轻松利用专用工具(例如方程求解器)。在许多任务中,提示大语言模型生成并执行代码的表现优于纯文本推理(Madaan et al., 2022; Liang et al., 2022; Chen et al., 2022)。

A key challenge is guiding LLMs to decide when to rely on textual reasoning versus programmatic solutions, given that most input questions lack explicit cues about which approach is best. Recent OpenAI GPT models address this by providing a Code Interpreter module, allowing the model to iteratively generate and execute code, then further reason with the output (Achiam et al., 2023). Multi-agent frameworks like AutoGen (Wu et al., 2023) adopt a specialized system prompt to steer the LLM toward code generation when needed. However, Chen et al. (2024e) recently found that all these existing methods struggle to effectively steer between textual reasoning and code generation, failing to fully leverage symbolic computing capabilities.

一个关键的挑战是如何引导大语言模型决定何时依赖文本推理还是程序化解决方案,因为大多数输入问题都缺乏关于哪种方法最佳的明确提示。最近的 OpenAI GPT 模型通过提供代码解释器模块来解决这个问题,使模型能够迭代生成和执行代码,然后进一步推理输出 (Achiam et al., 2023)。像 AutoGen (Wu et al., 2023) 这样的多智能体框架采用专门的系统提示来在需要时引导大语言模型生成代码。然而,最近 Chen et al. (2024e) 发现,所有这些现有方法在文本推理和代码生成之间难以有效引导,无法充分利用符号计算能力。

Our work tries to bridge this gap by developing an assistant framework (CodeSteer) to guide the code/text generation of the LLM solving the task (TaskLLM). By fine-tuning a small model (Llama-3-8B (Dubey et al., 2024)) to be the assistant, we enable large models (GPT-4o (Achiam et al., 2023)) to fully leverage symbolic computing via code generation while preserving other capabilities. Recognizing that iterative “executing and exploring” is the most effective way to solve tasks, we build CodeSteer to generate prompts that guide the TaskLLM through multiple rounds of interaction before finalizing answers.

我们的工作试图通过开发一个辅助框架(CodeSteer)来填补这一空白,以指导大语言模型生成代码/文本来解决任务(TaskLLM)。通过微调一个小模型(Llama-3-8B (Dubey et al., 2024))作为辅助,我们使大模型(GPT-4o (Achiam et al., 2023))能够通过代码生成充分利用符号计算,同时保留其他能力。认识到迭代“执行和探索”是解决任务的最有效方式,我们构建了CodeSteer,以生成提示,引导TaskLLM在最终确定答案前进行多轮交互。

To achieve a comprehensive evaluation, we gather and develop a benchmark with 37 symbolic tasks, referred to as SymBench. On SymBench, augmenting GPT-4o with CodeSteer greatly improves its average performance score from 53.3 to 86.4, even outperforming the current leading pure-text models, OpenAI o1 (82.7) (Jaech et al., 2024) and DeepSeek R1 (76.8) (Guo et al., 2025). Although trained for GPT-4o, CodeSteer shows great generalizability, delivering an average 41.8 performance gain on Claude-3-5-Sonnet, Mistral-Large, and GPT-3.5. By fully leveraging symbolic computing, CodeSteer-guided LLMs maintain strong performance on highly complex tasks even when o1 fails in all testing cases. Our key contributions are:

为实现全面评估,我们收集并开发了包含37项符号任务的基准测试,称为SymBench。在SymBench上,使用CodeSteer增强的GPT-4o将其平均性能得分从53.3大幅提升至86.4,甚至超越了当前领先的纯文本模型OpenAI o1(82.7)(Jaech等,2024)和DeepSeek R1(76.8)(Guo等,2025)。尽管CodeSteer是为GPT-4o训练的,但它展现了强大的泛化能力,在Claude-3-5-Sonnet、MistralLarge和GPT-3.5上平均提升了41.8的性能。通过充分利用符号计算,CodeSteer引导的大语言模型在高度复杂的任务中保持了强劲的性能,即使o1在所有测试案例中均告失败。我们的主要贡献包括:


Figure 1: Examples and performance of CodeSteer on guiding LLM code/text generation to integrate symbolic computing. At each interaction with TaskLLM, it reviews current and previous answers, then provides guidance for the next round. CodeSteer returns final answers when it deems them ready. With CodeSteer, GPT-4o outperforms OpenAI Code Interpreter, o1, and o1-preview models.

图 1: CodeSteer 在指导大语言模型代码/文本生成以整合符号计算方面的示例和性能。在与 TaskLLM 的每次交互中,它都会审查当前和之前的答案,然后为下一轮提供指导。当 CodeSteer 认为答案准备就绪时,它会返回最终答案。借助 CodeSteer,GPT-4o 的性能优于 OpenAI Code Interpreter、o1 和 o1-preview 模型。

  1. Developing and publishing SymBench: Prior works by Chen et al. (2024e) and Gui et al. (2024) gathered and developed 14 and 31 tasks, respectively, targeting challenges in computation, symbolic manipulation, logic, optimization, spatial reasoning, and constrained planning. However, neither study published the complete code for question/solution synthesis or the full datasets. From these 45 tasks, we select 37 that remain challenging for GPT-4o and redevelop their generation code to produce samples with adjustable complexity. We refer to this newly published benchmark as SymBench.
  1. 开发并发布 SymBench:Chen 等人 (2024e) 和 Gui 等人 (2024) 的先前工作分别收集并开发了 14 项和 31 项任务,针对计算、符号操作、逻辑、优化、空间推理和约束规划等领域的挑战。然而,这两项研究均未发布完整的题目/解答生成代码或完整的数据集。从这 45 项任务中,我们筛选出 37 项仍对 GPT-4o 具有挑战性的任务,并重新开发了它们的生成代码,以生成具有可调复杂度的样本。我们将这一新发布的基准称为 SymBench。
  2. New methods for dataset construction and model fine-tuning of SFT and DPO: We fine-tune Llama-3-8B with the synthesized datasets of 12k multi-round guidance/generation trajectories (SFT) and 5.5k guidance comparison pairs (DPO). Unlike standard multi-step settings, in CodeSteer’s multi-round guidance, the TaskLLM outputs a complete answer each round rather than only at the end. Consequently, we introduce novel components to both the dataset construction and training processes for SFT and DPO, such as data synthesis of dynamic guidance adaptation, emphasis on the final two rounds in SFT, comparison score design, and efficient answer sampling in DPO. These modifications result in better performance. Both the final CodeSteer model and created datasets will be released.
  2. SFT 和 DPO 的数据集构建与模型微调新方法:我们使用合成的 12k 多轮指导/生成轨迹数据集(SFT)和 5.5k 指导对比对数据集(DPO)对 Llama-3-8B 进行微调。与标准的多步骤设置不同,在 CodeSteer 的多轮指导中,TaskLLM 每轮都会输出一个完整的答案,而不仅仅是在最后。因此,我们在 SFT 和 DPO 的数据集构建和训练过程中引入了新的组件,例如动态指导适应的数据合成、在 SFT 中强调最后两轮、对比分数设计以及 DPO 中的高效答案采样。这些改进带来了更好的性能。最终的 CodeSteer 模型和创建的数据集都将发布。
  3. Symbolic checker and self-answer checker: Observing that TaskLLM frequently produces text-like code that hardcodes answers, neglecting efficient symbolic computation, we introduce a Symbolic Checker to help CodeSteerLLM evaluate code complexity and efficiency. Since most reasoning and planning tasks can be better verified with coding, we add a Self-answer Checker for better judgment of answer correctness by CodeSteerLLM. These two new checkers have been proven to significantly improve the efficiency of dataset synthesis and CodeSteerLLM fine-tuning.

  3. 符号检查器和自我回答检查器:观察到 TaskLLM 经常生成硬编码答案的类文本代码,忽视了高效的符号计算,我们引入了符号检查器,帮助 CodeSteerLLM 评估代码的复杂性和效率。由于大多数推理和规划任务可以通过编码更好地验证,我们添加了自我回答检查器,以更好地判断 CodeSteerLLM 的答案正确性。这两个新的检查器已被证明能显著提高数据集合成和 CodeSteerLLM 微调的效率。

  4. Proposed CodeSteer Outperforms Nine Baselines and o1: CodeSteer’s superior performance highlights the importance of enhancing LLM reasoning and planning with symbolic computing. This also demonstrates the potential for steering large models to generate smarter code and text by leveraging specialized smaller models.

  4. 提出的 CodeSteer 优于九个基线方法和 o1:CodeSteer 的卓越性能凸显了通过符号计算增强大语言模型推理和规划的重要性。这也展示了利用专门的小型模型引导大型模型生成更智能代码和文本的潜力。

2. Symbolic Tasks and SymBench

2. 符号任务与 SymBench

Challenges in Code/Text Choices For tasks requiring computation, symbolic manipulation, logic, optimization, spatial reasoning, and constrained planning, coding-based symbolic computing is often more effective than text-based approaches. However, Chen et al. (2024e) found that steering LLM code/text generation poses significant challenges, even in tasks with apparent symbolic characteristics. The main bottlenecks are: 1) Deciding whether code or text is simpler depends on task type, task complexity, and the LLM’s capabilities, which is hard to judge (see Appendix Sec. A). 2) LLM-generated code often appears as text-like scripts that merely hard-code answers rather than enabling efficient symbolic computation, echoing the phenomenon described in Yang et al. (2024) (see Appendix Sec. B).

代码/文本选择中的挑战
对于需要计算、符号操作、逻辑、优化、空间推理和约束规划的任务,基于代码的符号计算通常比基于文本的方法更有效。然而,Chen et al. (2024e) 发现,即使在具有明显符号特征的任务中,引导大语言模型的代码/文本生成也面临重大挑战。主要瓶颈包括:1) 判断代码还是文本更简单取决于任务类型、任务复杂性和大语言模型的能力,这很难判断(见附录 A 节)。2) 大语言模型生成的代码通常表现为类似文本的脚本,这些脚本仅仅是硬编码答案,而不是实现高效的符号计算,这与 Yang et al. (2024) 中描述的现象一致(见附录 B 节)。
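
为更直观地说明第 2 点瓶颈,下面给出一个假设性的对比示例(非论文原文):对于 24 点游戏,大语言模型常生成把推理结论直接硬编码的“类文本代码”,而真正发挥符号计算优势的写法应当枚举搜索所有运算组合:

```python
from itertools import permutations, product

# “类文本代码”:模型把文本推理出的结论直接硬编码进返回值,没有任何符号计算
def game24_hardcoded(nums):
    return "(9 - 5) * (8 - 2) = 24"   # 数字一变就失效

# 符号计算写法:穷举数字排列与运算符组合,真正利用代码的搜索能力
def game24_search(nums, target=24, eps=1e-6):
    ops = ["+", "-", "*", "/"]
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            # 仅示意一种括号结构 ((a o1 b) o2 c) o3 d;完整实现还需枚举其他括号方式
            expr = f"(({a}{o1}{b}){o2}{c}){o3}{d}"
            try:
                if abs(eval(expr) - target) < eps:
                    return expr
            except ZeroDivisionError:
                continue
    return None

print(game24_search([9, 5, 8, 2]))   # 例如输出 ((9+5)+8)+2
```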

SymBench Chen et al. (2024e) and Gui et al. (2024) collected 14 and 31 tasks with symbolic factors from various benchmarks such as Suzgun et al. (2022); Chen et al. (2024d); Yao et al. (2024); Cobbe et al. (2021); Hendrycks et al. (2021), but their question-generation code and complete datasets remain private. We redevelop the generation code to automatically synthesize questions with adjustable complexity. Our resulting set of 37 tasks covers reasoning, planning, and execution, testing competencies in mathematics, spatial reasoning, logic, order reasoning, optimization, and search. Details and categorization are provided in Appendix Sec. C and Table 4.

SymBench
Chen et al. (2024e) 和 Gui et al. (2024) 从多个基准测试中收集了 14 和 31 个带有符号因素的任务,例如 Suzgun et al. (2022); Chen et al. (2024d); Yao et al. (2024); Cobbe et al. (2021); Hendrycks et al. (2021),但他们的题目生成代码和完整数据集仍未公开。我们重新开发了生成代码,以自动合成具有可调整复杂性的题目。我们最终生成的 37 个任务涵盖了推理、规划和执行,测试了数学、空间推理、逻辑、顺序推理、优化和搜索的能力。详细信息和分类见附录 C 节和表 4。

3. CodeSteer Framework

3. CodeSteer 框架

Fig 1 illustrates how CodeSteer guides the LLM’s code/text generation. At each round, CodeSteer reviews the TaskLLM’s current answer and the guidance/answer history, then decides whether to offer new guidance or finalize the response. It performs three key functions:

图1展示了CodeSteer如何引导大语言模型的代码/文本生成。在每一轮中,CodeSteer会审查TaskLLM的当前答案以及指导/答案历史,然后决定是否提供新的指导或最终确定响应。它执行三个关键功能:

  1. Initial Method Selection: In the first round, it chooses whether to solve the task with code or text (e.g., use textual reasoning for small-number multiplication, and code for large-number multiplication in the task Number Multiply).
  2. Dynamic Adaptation: In subsequent rounds, it refines guidance or switches methods if issues arise (e.g., encouraging more sophisticated symbolic approaches in Game 24, or switching to textual reasoning after multiple incorrect code attempts in BoxLift).
  3. Answer Finalization When Ready: CodeSteer returns the final answer once it deems it ready.

  1. 初始方法选择:在首轮中,它选择使用代码还是文本来完成任务(例如,在“数字乘法”任务中,小数字乘法使用文本推理,大数字乘法使用代码)。
  2. 动态调整:在后续轮次中,如果出现问题,它会优化指导或切换方法(例如,在“24点游戏”中鼓励更复杂的符号方法,或在“BoxLift”中多次代码尝试失败后切换为文本推理)。
  3. 答案就绪时最终确定:当 CodeSteer 认为答案已就绪时,返回最终答案。
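
下面用一个极简的 Python 草图示意上述三个功能如何组合成与 TaskLLM 的多轮交互流程(假设性伪实现,`task_llm`、`codesteer_llm`、"FINALIZE" 等名称均为示意,并非论文开源代码中的真实接口):

```python
MAX_ROUNDS = 5   # 与论文设置一致:最多 5 轮引导

def solve_with_codesteer(question, task_llm, codesteer_llm):
    """task_llm / codesteer_llm 为假设的可调用对象:输入提示信息,返回文本。"""
    history, answer = [], None
    for round_idx in range(1, MAX_ROUNDS + 1):
        # CodeSteer 审查当前答案与引导/答案历史,给出新引导或返回 "FINALIZE"
        guidance = codesteer_llm(question, history, answer)
        if guidance.strip() == "FINALIZE":
            return answer            # 认为答案已就绪
        # TaskLLM 按引导(用代码或用文本)重新生成一个完整答案
        answer = task_llm(question, guidance, history)
        history.append((guidance, answer))
    return answer                    # 达到轮数上限,返回当前答案
```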

The main components of CodeSteer are as follows:

CodeSteer 的主要组件如下:

CodeSteerLLM is the primary model fine-tuned and used to guide TaskLLM in code/text generation. The input prompt formats for the first and subsequent rounds are presented in Appendix Sec. D. To facilitate answer evaluation, CodeSteerLLM is equipped with two checkers, Self-answer and Symbolic, whose design is inspired by the inherent features of symbolic tasks.

CodeSteerLLM 是核心模型,经过微调后用于指导 TaskLLM 进行代码/文本生成。第一轮及后续轮次的输入提示格式见附录 D 节。为了便于答案评估,CodeSteerLLM 配备了两个检查器:自答检查器(Self-answer Checker)和符号检查器(Symbolic Checker),其设计灵感来源于符号任务的固有特性。

Self-answer Checker re-queries TaskLLM to generate and execute code for verifying its current answer, then returns the evaluation results and explanations to CodeSteerLLM. Since many symbolic tasks benefit from code-based verification, this approach often provides a more reliable perspective. The prompt format for the Self-answer Checker is provided in Appendix Sec. E.

自我答案检查器重新查询 TaskLLM 以生成并执行用于验证其当前答案的代码,然后将评估结果和解释返回给 CodeSteerLLM。由于许多符号任务受益于基于代码的验证,这种方法通常提供更可靠的视角。自我答案检查器的提示格式参见附录 E 节。
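
下面给出自答检查器工作方式的一个假设性草图(`task_llm`、`run_code` 为示意接口,提示词措辞也仅为示意):

```python
VERIFY_PROMPT = (
    "下面给出一道题目和一个候选答案。请编写并返回一段 Python 代码,"
    "该代码在候选答案正确时打印 True,否则打印 False。\n"
    "题目:{question}\n候选答案:{answer}\n"
)

def self_answer_check(question, answer, task_llm, run_code):
    """task_llm:调用任务大模型;run_code:在限定时间内执行代码并返回标准输出。均为假设接口。"""
    verification_code = task_llm(VERIFY_PROMPT.format(question=question, answer=answer))
    output = run_code(verification_code, timeout=30)
    # 将执行结果与解释一并返回给 CodeSteerLLM,作为其判断依据之一
    return {"verdict": "correct" if "True" in output else "uncertain", "raw_output": output}
```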

Symbolic Checker is a rule-based script that analyzes the generated code for iteration, search, numeric handling, permutations, and combinations, then returns a complexity summary and score. This helps CodeSteerLLM determine whether the code is sufficiently sophisticated for the task at hand. Since TaskLLM often produces text-like code prone to errors, the Symbolic Checker’s complexity assessment aids, but does not solely dictate, CodeSteerLLM’s decisions. Further details on the checking code and prompt are in Appendix Sec. F.

符号检查器(Symbolic Checker)是一个基于规则的脚本,用于分析生成代码中的迭代、搜索、数值处理、排列和组合,然后返回复杂度摘要和评分。这有助于 CodeSteerLLM 判断代码对于当前任务是否足够复杂。由于 TaskLLM 经常生成类似文本且容易出错的代码,符号检查器的复杂度评估会辅助但不会完全决定 CodeSteerLLM 的决策。更多关于检查代码和提示的详细信息参见附录 F 节。
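
符号检查器本质上是对生成代码做静态的规则匹配。下面是一个假设性的最小草图,用 AST 与关键词统计来给代码复杂度打分,仅示意论文所述“检查迭代、搜索、数值处理、排列组合”的思路,权重取值为假设,并非其真实实现:

```python
import ast

def symbolic_complexity(code: str):
    """对生成代码做简单的静态检查,返回 (复杂度得分, 摘要列表);权重为示意值。"""
    score, summary = 0, []
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0, ["代码无法解析"]
    if any(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree)):
        score += 2
        summary.append("包含迭代/循环")
    if any(k in code for k in ("permutations", "combinations", "itertools", "product(")):
        score += 3
        summary.append("使用排列/组合/搜索")
    if any(isinstance(n, ast.BinOp) for n in ast.walk(tree)):
        score += 1
        summary.append("包含数值运算")
    if not summary:
        summary.append("疑似硬编码答案的类文本代码")
    return score, summary

print(symbolic_complexity("print('(9 - 5) * (8 - 2) = 24')"))   # 得分很低,提示可能是硬编码
```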

Beyond enhancing CodeSteerLLM’s performance, the Self-answer and Symbolic Checkers also streamline dataset synthesis for SFT and DPO fine-tuning, as discussed in the following sections.

除了提升 CodeSteerLLM 的性能外,自答检查器和符号检查器还能简化用于 SFT 和 DPO 微调的数据集合成,详见下文。

4. Fine-tuning the CodeSteerLLM

4. 微调 CodeSteerLLM

Among the three modules of CodeSteer, the CodeSteerLLM needs to be fine-tuned to perform the complicated task of steering. The fine-tuning is performed on a subset of SymBench. Specifically, we randomly select 28 of the 37 SymBench tasks, using a distinct set of samples without overlap with the test samples. This setup allows us to evaluate CodeSteer on 28 seen tasks (with different test samples) and on the remaining 9 unseen tasks. The fine-tuning consists of two steps. We first fine-tune the Llama-3.1-8B model with SFT, then further optimize it using DPO. Both processes use full-parameter fine-tuning on 4 H100 GPUs for 4-10 epochs. The detailed parameter and hardware settings for fine-tuning and inference processes are discussed in Appendix Sec. H. We synthesize 12k multi-round guidance/generation trajectories for SFT and 5.5k guidance comparison pairs for DPO. The specific data number for each task is in Appendix Sec. G.

在 CodeSteer 的三个模块中,CodeSteerLLM 需要进行微调以执行复杂的引导任务。微调在 SymBench 的一个子集上进行。具体而言,我们从 37 个 SymBench 任务中随机选择了 28 个,使用一组不与测试样本重叠的不同样本。这种设置使得我们能够在 28 个已见任务(使用不同的测试样本)和剩余的 9 个未见任务上评估 CodeSteer。微调分为两个步骤。我们首先使用 SFT 对 Llama-3.1-8B 模型进行微调,然后使用 DPO 进一步优化。这两个过程都在 4 块 H100 GPU 上进行 4-10 个 epoch 的全参数微调。微调和推理过程的详细参数与硬件设置在附录 H 节中讨论。我们合成了 12k 条多轮引导/生成轨迹用于 SFT,以及 5.5k 条引导比较对用于 DPO。每个任务的具体数据数量见附录 G 节。

4.1. Multi-round SFT

4.1. 多轮SFT

To generate supervision data for SFT, we prompt GPT-4o to serve as both the guiding LLM (i.e., the CodeSteerLLM) and the TaskLLM to generate multiple guidance/generation trajectories. We then filter the trajectories, keeping only those that produce correct answers. To improve success rates, CodeSteerLLM’s prompt is more detailed and includes pre-set knowledge or hints. To increase dataset diversity and enable dynamic adaptation of guided thoughts, this prompt also has different versions. For example, we may let GPT-4o choose all guidance styles, or enforce transitions from code to text or text to code. We set the maximum number of guidance rounds to 5 and return the final answer once that limit is reached.

为 SFT 生成监督数据时,我们提示 GPT-4o 同时作为引导 LLM(即 CodeSteerLLM)和 TaskLLM 生成多条引导/生成轨迹。然后,我们过滤这些轨迹,仅保留那些产生正确答案的轨迹。为了提高成功率,CodeSteerLLM 的提示更为详细,并包含预设的知识或提示。为了增加数据集的多样性并实现引导思路的动态适应,该提示还具有不同的版本。例如,我们可能让 GPT-4o 自由选择所有引导风格,或者强制从代码到文本或从文本到代码的转换。我们将最大引导轮次设置为 5,并在达到该限制时返回最终答案。
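
上述数据合成流程可以概括为下面的假设性草图(`run_trajectory`、`is_correct` 为示意接口,引导风格名称亦为示意):

```python
import random

# 多种引导提示版本,用于增加数据多样性(风格名称为示意)
GUIDANCE_STYLES = ["free_choice", "force_text_to_code", "force_code_to_text"]

def synthesize_sft_data(questions, run_trajectory, is_correct, max_rounds=5):
    """run_trajectory:按给定引导风格生成一条多轮轨迹并返回 (trajectory, final_answer);
    is_correct:判断最终答案是否正确。两者均为假设接口。"""
    dataset = []
    for q in questions:
        style = random.choice(GUIDANCE_STYLES)
        trajectory, final_answer = run_trajectory(q, style=style, max_rounds=max_rounds)
        if is_correct(q, final_answer):          # 只保留产生正确答案的轨迹
            dataset.append(trajectory)
    return dataset
```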

Multi-round Gradient Cancellation Issue In multi-round trajectories, the SFT process incorporates gradients from each round. This can lead to gradient cancellation in the early rounds. For example, in one task, both [code, return answer] and [text, code, return answer] produce correct results, so if both trajectories are used for fine-tuning, the SFT cannot learn that code is the better first step.

多轮梯度抵消问题
在多轮轨迹中,SFT(监督微调)过程会结合每一轮的梯度。这可能导致早期轮次的梯度抵消。例如,在某个任务中,[代码,返回答案] 和 [文本,代码,返回答案] 都会产生正确的结果,因此如果同时使用这两条轨迹进行微调,SFT 无法学到代码是更好的第一步。

Data Augmentation To mitigate this issue, we leverage the fact that the final two rounds of guidance are most influential, as the TaskLLM produces new answers each round while earlier rounds primarily provide background. Consequently, we augment the SFT dataset by doubling the weights of the final two rounds.

数据增强
为了缓解这一问题,我们利用了最后两轮指导最具影响力的事实,因为 TaskLLM 每轮生成新的答案,而早期轮次主要提供背景信息。因此,我们通过将最后两轮的权重加倍来增强 SFT 数据集。
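
“将最后两轮权重加倍”在实现上可以简单地通过复制训练样本来完成。下面是一个示意性草图,假设每条轨迹已被展开为按轮次排列的训练样本列表(数据组织方式为假设):

```python
def augment_last_two_rounds(trajectory_samples):
    """trajectory_samples:同一条轨迹按轮次顺序排列的训练样本列表(假设的数据组织方式)。
    复制最后两轮样本,使其在 SFT 中的有效权重加倍,以缓解多轮梯度抵消。"""
    if len(trajectory_samples) <= 2:
        return trajectory_samples * 2                     # 轨迹很短时整体加倍
    return trajectory_samples + trajectory_samples[-2:]  # 否则仅复制最后两轮
```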

4.2. Multi-round DPO

4.2. 多轮 DPO

Figure 2: Schematic of multi-round DPO data sampling: blue squares represent intermediate (non-final) rounds, and brown ovals mark finalizing rounds. Guidance responses from the same parent node in CodeSteerLLM are compared to generate the DPO data.

图 2: 多轮 DPO 数据采样的示意图:蓝色方块表示中间(非最终)轮次,棕色椭圆标记最终轮次。通过比较 CodeSteerLLM 中同一父节点下的引导响应来生成 DPO 数据。

Because many correct trajectories in the SFT dataset are still suboptimal, we need to further fine-tune the CodeSteerLLM on pairs of trajectories labeled with preferences. Here we use rule-based scores to assign preferences. Figure 2 illustrates our framework for sampling DPO guidance pairs in a multi-round setting. The main challenge is sampling and selecting guidance pairs that exhibit clear performance differences across various rounds while minimizing the number of samples to conserve resources. We use a tree structure where each node represents a guidance, with a branching factor of 2 or 3. To compare guidance pairs from the same parent node, we calculate their Performance Scores using the following equation:

由于 SFT 数据集中的许多正确轨迹仍然并非最优,我们需要在标注了偏好的轨迹对上进一步微调 CodeSteerLLM。这里我们使用基于规则的分数来分配偏好。图 2 展示了我们在多轮设置中采样 DPO 引导对的框架。主要挑战是采样并选出在不同轮次中表现出明显性能差异的引导对,同时最小化样本数量以节省资源。我们使用树结构,其中每个节点代表一个引导,分支因子为 2 或 3。为了比较来自同一父节点的引导对,我们使用以下公式计算它们的性能分数:

$$
\mathrm{Score}_i =
\begin{cases}
15 - i, & \text{if round } i \text{ is final and the answer is correct} \\
-i, & \text{if round } i \text{ is final and the answer is incorrect} \\
\dfrac{1}{|C(i)|}\displaystyle\sum_{j \in C(i)} \mathrm{Score}_j, & \text{otherwise}
\end{cases}
$$

Here, $\mathrm{Score}_i$ represents the score for a node at round $i$, where $i$ is the current round number, and $C(i)$ is the set of child nodes of node $i$. If the current round is the final one, $\mathrm{Score}_i$ is set to $15-i$ for correct answers and $-i$ for incorrect ones. This incentivizes CodeSteerLLM to achieve correct answers in the fewest rounds possible. For non-final rounds, $\mathrm{Score}_i$ is calculated as the average of its child nodes' scores. This ensures that each non-terminal round's score reflects the average performance of its potential subsequent actions, i.e., the expectation.

这里,$\mathrm{Score}_i$ 表示第 $i$ 轮某节点的分数,其中 $i$ 为当前轮数,$C(i)$ 为节点 $i$ 的子节点集合。若当前轮为最终轮,则答案正确时 $\mathrm{Score}_i$ 取 $15-i$,错误时取 $-i$。这会激励 CodeSteerLLM 用尽可能少的轮数得到正确答案。对于非最终轮,$\mathrm{Score}_i$ 取其子节点分数的平均值。这确保了每个非终止轮次的分数反映其潜在后续动作的平均表现,即期望值。
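
上述打分规则可以写成对引导树的一次自底向上递归。下面的草图使用假设的节点字段(is_final、is_correct、round_idx、children),仅用于说明公式含义,并非论文代码中的真实数据结构:

```python
def node_score(node):
    """按上式计算节点分数:终止轮取 15 - i(正确)或 -i(错误),非终止轮取子节点分数的平均值。"""
    if node["is_final"]:
        i = node["round_idx"]
        return (15 - i) if node["is_correct"] else -i
    child_scores = [node_score(c) for c in node["children"]]
    return sum(child_scores) / len(child_scores)

# 例:某第 2 轮节点有两个第 3 轮终止子节点,一对一错,则其分数为 ((15-3) + (-3)) / 2 = 4.5
parent = {"is_final": False, "round_idx": 2, "children": [
    {"is_final": True, "is_correct": True, "round_idx": 3},
    {"is_final": True, "is_correct": False, "round_idx": 3},
]}
print(node_score(parent))   # 4.5
```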

DPO data is collected from guidance pairs within the same parent node at each level that have a score difference greater than 2. To prevent reward hacking (Skalse et al., 2022), where CodeSteerLLM might bypass exploration and return incorrect answers quickly (e.g., preferring a score of 2 over 5), we include only pairs where at least one guidance has a positive score. To obtain diverse guidance answers, we set the inference temperature to 1.5 for the SFT fine-tuned CodeSteerLLM and use three models fine-tuned at different epochs (6, 8, and 10) to compare their guidance responses for the same parent node.

DPO 数据是从每一层同一父节点下得分差异大于 2 的引导对中收集的。为了防止奖励投机 (Skalse et al., 2022),即 CodeSteerLLM 可能绕过探索并快速返回错误答案(例如,偏好得分 2 而不是 5),我们只收录至少有一个引导得分为正的配对。为了获得多样化的引导回答,我们将经 SFT 微调的 CodeSteerLLM 的推理温度设置为 1.5,并使用在不同训练周期(6、8 和 10)微调的三个模型来比较它们对同一父节点的引导响应。
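
偏好对的筛选规则(同一父节点、分差大于 2、至少一方分数为正)可以概括为如下假设性草图,其中子节点的 guidance 与 score 字段为示意:

```python
def collect_dpo_pairs(parent_nodes, min_gap=2):
    """parent_nodes:引导树中的父节点列表;假设每个子节点带有 'guidance' 与 'score' 字段。"""
    pairs = []
    for parent in parent_nodes:
        children = parent["children"]
        for i in range(len(children)):
            for j in range(i + 1, len(children)):
                a, b = children[i], children[j]
                # 分差需大于 2,且至少一方分数为正,以防奖励投机(reward hacking)
                if abs(a["score"] - b["score"]) > min_gap and max(a["score"], b["score"]) > 0:
                    chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
                    pairs.append({"chosen": chosen["guidance"], "rejected": rejected["guidance"]})
    return pairs
```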

5. Experiments

5. 实验

Experimental settings We use GPT-4o as the TaskLLM to test 28 seen and 9 unseen tasks, each with 100 samples of varying complexity. The samples for the 28 seen tasks are different from those used to train CodeSteerLLM. Additionally, we evaluate other LLM types to assess CodeSteer's generalizability.

实验设置
我们使用 GPT-4o 作为任务大语言模型 (TaskLLM) 来测试 28 个已见任务和 9 个未见任务,每个任务包含 100 个不同复杂度的样本。28 个已见任务的样本与用于训练 CodeSteer 大语言模型的样本不同。此外,我们还评估了其他类型的大语言模型,以评估 CodeSteer 的泛化能力。

We compare CodeSteer to six training-free and three training-based baselines, with methods 1, 3–6, and 9 originally proposed in Chen et al. (2024e).

我们将 CodeSteer 与六种无训练和三种有训练的基线方法进行了比较,其中方法 1、3–6 和 9 最初由 Chen 等人 (2024e) 提出。

Training-free Baselines 1) No extra modifications, only inputting the original question (Only Question); 2) Our framework in Sec. 4.1 for synthesizing the SFT dataset, where GPT-4o works as the CodeSteerLLM with extra hints (Symbolic Agent); 3) Prompting LLMs to answer with only text and CoT (All Text + CoT); 4) Prompting LLMs to first analyze the question with CoT and then output the code answer (All Code + CoT); 5) Concatenating the input question with AutoGen's original system prompt in Appendix Sec. L (AutoGen Conca.); 6) Implementing a multi-agent framework that first queries LLMs to answer the question with the All Text + CoT and All Code + CoT methods, respectively; the final solution is then obtained by combining and summarizing both versions of the answers with the same LLM but prompted differently (Code+Text+Sum.1).

无训练基线 1) 仅输入原始问题,不做额外修改(Only Question);2) 使用第 4.1 节中合成 SFT 数据集的框架,其中 GPT-4o 作为带有额外提示的 CodeSteerLLM(Symbolic Agent);3) 提示大语言模型仅使用文本并通过思维链(CoT)回答(All Text + CoT);4) 提示大语言模型先用思维链分析问题,再输出代码答案(All Code + CoT);5) 将输入问题与附录 L 节中 AutoGen 的原始系统提示拼接(AutoGen Conca.);6) 实现一个多智能体框架,先分别用 All Text + CoT 和 All Code + CoT 方法查询大语言模型回答问题,然后由同一大语言模型在不同提示下对两个版本的答案进行合并与总结,得到最终解答(Code+Text+Sum.1)。
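
这些无训练基线的差异主要体现在输入提示的组织方式上。下面用一个假设性草图示意其中几种提示的构造方式(提示措辞为示意,并非论文附录中的原文):

```python
def build_prompt(question: str, mode: str) -> str:
    """按不同无训练基线构造输入提示;具体措辞为示意,非论文原文。"""
    if mode == "only_question":
        return question
    if mode == "all_text_cot":
        return question + "\n请仅用文字逐步推理(CoT),不要编写代码,最后给出答案。"
    if mode == "all_code_cot":
        return question + "\n请先用思维链分析问题,再编写一段可执行的 Python 代码输出最终答案。"
    raise ValueError(f"unknown mode: {mode}")
```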

Table 1: Experimental results on SymBench. Methods with the highest scores are highlighted blue.

Training-based Baselines 7) Fine-tune Llama-3.1-8B as a summarizer based on the Code+Text+Sum.1 method using SFT on correct summary data (Code+Text+Sum.2); 8) We fine-tune Llama-3.1-8B as a one-step evaluator to choose between text or code generation (Code/Text Choice); 9) OpenAI GPT Code Interpreter with the original input question (Code Interpreter). Methods 7 and 8 are fine-tuned on the same data number and task types as CodeSteer.

基于训练的基线方法 7) 基于 Code+Text+Sum.1 方法,在正确的摘要数据上用 SFT 将 Llama-3.1-8B 微调为摘要生成器(Code+Text+Sum.2);8) 将 Llama-3.1-8B 微调为一步评估器,用于在文本生成或代码生成之间进行选择(Code/Text Choice);9) 使用原始输入问题的 OpenAI GPT Code Interpreter(Code Interpreter)。方法 7 和 8 在与 CodeSteer 相同的数据量和任务类型上进行微调。

Comparison with CoT LLMs We also compare with the current best models: OpenAI o1 and o1-preview (Jaech et al., 2024) and DeepSeek R1 (Guo et al., 2025). These models enhance reasoning and planning by using textual search, reflection, and exploration during answer generation. However, our analysis shows that these CoT LLMs have not yet integrated code-based symbolic computing to further improve their performance.

与 CoT 大语言模型的对比
我们还与当前最好的模型进行了对比:OpenAI o1 和 o1-preview (Jaech et al., 2024) 以及 DeepSeek R1 (Guo et al., 2025)。这些模型通过在生成答案时使用文本搜索、反思和探索来增强推理和规划能力。然而,我们的分析表明,这些 CoT 大语言模型尚未集成基于代码的符号计算来进一步提升其性能。

Evaluations Answers are evaluated using predefined rules, with GPT-4o assisting in adjusting formats as needed. Beyond the Code Interpreter method, some approaches have the LLM output code as the final answer. We extract and execute this code using predefined algorithms to obtain the final result or facilitate further reasoning. To prevent infinite loops, code execution is limited to 30 seconds. If this limit is exceeded, the task is marked as failed or returns errors for subsequent rounds. We utilize success rate as the metric for each task. To compare each method, we calculate the Average Normalized Score over all the tested tasks by the following equation:

评估 答案使用预定义规则进行评估,GPT-4o 协助根据需要调整格式。除了代码解释器方法外,某些方法将大语言模型 (LLM) 输出的代码作为最终答案。我们使用预定义的算法提取并执行该代码,以获得最终结果或促进进一步推理。为了防止无限循环,代码执行时间限制在 30 秒内。如果超过此限制,任务将被标记为失败或返回错误以供后续轮次处理。我们使用成功率作为每个任务的指标。为了比较每种方法,我们通过以下公式计算所有测试任务的平均归一化得分:

$$
\mathrm{AveNorm}_j = \frac{1}{N}\sum_{i=1}^{N}\frac{s_{ij}}{\max(s_i)}
$$

where $\mathrm{AveNorm}_j$ is the Average Normalized Score for method $j$, $s_{ij}$ is the score of method $j$ for task $i$, $\max(s_i)$ is the maximum score for task $i$, and $N$ is the total number of tasks. This equation normalizes each score relative to the maximum score in the respective task, and then averages the normalized scores over all tasks. Apart from the task performance, in later sections we also discuss the costs of token lengths and runtime for each method.

其中,$\mathrm{AveNorm}_j$ 是方法 $j$ 的平均归一化得分,$s_{ij}$ 是方法 $j$ 在任务 $i$ 上的得分,$\max(s_i)$ 是任务 $i$ 上的最高得分,$N$ 是任务总数。该公式将每个分数相对于相应任务中的最高分数进行归一化,然后对所有任务的归一化分数取平均。除任务性能外,后文还将讨论每种方法的 Token 长度和运行时间成本。
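
该平均归一化得分可以直接用几行代码计算。下面是一个示意性草图,假设 scores[i][j] 为方法 j 在任务 i 上的成功率,并按论文报告的 0-100 量纲进行缩放(缩放方式为假设):

```python
def average_normalized_score(scores):
    """scores[i][j]:方法 j 在任务 i 上的成功率(0-100);返回各方法的平均归一化得分。"""
    n_tasks, n_methods = len(scores), len(scores[0])
    ave_norm = []
    for j in range(n_methods):
        total = 0.0
        for i in range(n_tasks):
            best = max(scores[i])                    # 任务 i 上所有方法的最高分
            total += scores[i][j] / best if best > 0 else 0.0
        ave_norm.append(100 * total / n_tasks)       # 换算到 0-100 区间(此处为假设的缩放方式)
    return ave_norm

print(average_normalized_score([[80, 40], [50, 100]]))   # [75.0, 75.0]
```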

5.1. Overall Better Performance

5.1. 整体性能更佳

Table 1 presents the full results of all methods on SymBench, including individual task scores and the Average Normalized Score. The key findings are:

表 1 展示了所有方法在 SymBench 上的完整结果,包括各个任务的得分和平均归一化得分。关键发现如下:

1) CodeSteer maintains similar relative performance on seen and unseen tasks, indicating no overfitting. 2) Augmenting GPT-4o with CodeSteer significantly boosts its performance, raising the Ave. Norm. Total Score from 53.3 to 86.4, outperforming all 9 baselines (best baseline: Code/Text Choice at 77.9).

1) CodeSteer 在已见和未见任务上保持相似的相对性能,表明没有过拟合。2) 使用 CodeSteer 增强 GPT-4o 显著提升了其性能,将平均归一化总分从 53.3 提高到 86.4,超过了所有 9 个基线方法(最佳基线:Code/Text Choice,77.9)。


Figure 3: Normalized score distribution of CodeSteer+GPT4o and o1 in 37 SymBench tasks.

图 3: CodeSteer+GPT4o 和 o1 在 37 个 SymBench 任务中的归一化得分分布。

3) GPT-4o + CodeSteer surpasses o1 (82.7), R1 (76.8), and o1-preview (74.8), highlighting the importance of integrating symbolic computing into LLMs. Figure 3 compares the score distribution of GPT-4o + CodeSteer and o1, showing that CodeSteer reduces instances of extremely low scores (near 0), demonstrating its robustness to varied tasks. 4) Compared to other training-based methods (Code+Text+Sum.2 and Code/Text Choice) with the same data number and tasks, CodeSteer's better performance validates the framework's effectiveness (further discussed in Sec. 6).

3) GPT-4o + CodeSteer 超越了 o1(82.7)、R1(76.8)和 o1-preview(74.8),突显了将符号计算集成到大语言模型中的重要性。图 3 比较了 GPT-4o + CodeSteer 和 o1 的得分分布,显示 CodeSteer 减少了极低得分(接近 0)的情况,展示了其对各种任务的鲁棒性。4) 与使用相同数据量和任务的其他基于训练的方法(Code+Text+Sum.2 和 Code/Text Choice)相比,CodeSteer 的更好表现验证了该框架的有效性(详见第 6 节)。

5.2. Scalability and Generalizability

5.2. 可扩展性与泛化能力


Figure 4: Method performance across four representative tasks as task complexity increases from left to right on the X-axis, controlled by value scales. C.S. and Inter. represent CodeSteer and Interpreter.

图 4: 随着任务复杂度从左到右在 X 轴上由值尺度控制的增加,方法在四个代表性任务上的表现。C.S. 和 Inter. 分别代表 CodeSteer 和 Interpreter。

To assess the impact of symbolic computing, Fig. 4 tracks the performance of five methods across four tasks of increasing complexity. As critical task-specific properties escalate, o1, o1-preview, and GPT-4o fail in highly complex cases, while symbolic-augmented methods (CodeSteer, Code Interpreter) sustain performance. Notably, CodeSteer proves more robust across tasks than Code Interpreter.

为了评估符号计算的影响,图 4 跟踪了五种方法在四个复杂度递增任务中的表现。随着关键任务特定属性的增加,o1、o1-preview 和 GPT-4o 在高度复杂的任务中失败,而符号增强方法(CodeSteer、Code Interpreter)保持了性能。值得注意的是,CodeSteer 在任务中表现得比 Code Interpreter 更为稳健。

Table 2: Experimental results of Claude-3-5-sonnet-20241022, Mistral-Large, and GPT-3.5 with or without augmented CodeSteer. Methods with the higher scores of the same model are highlighted blue.

表 2: Claude-3-5-sonnet-20241022, Mistral-Large, 和 GPT-3.5 在有或无增强 CodeSteer 的情况下的实验结果。同一模型中得分较高的方法以蓝色高亮显示。

| 方法 | Claude | Claude+CodeSteer | Mistral | Mistral+CodeSteer | GPT-3.5 | GPT-3.5+CodeSteer |
| --- | --- | --- | --- | --- | --- | --- |
| Combinatorial Calcu. | 48 | 66 | 25 | 34 | 12 | 29 |
| Eight Queen | 4 | 87 | 60 | 41 | 0 | 16 |
| Reversi | 0 | 45 | 0 | 33 | 0 | 32 |
| Cons. Linear Arran. | 73 | 90 | 47 | 48 | 25 | 9 |
| Standard Sudoku | 0 | 100 | 0 | 100 | 0 | 95 |
| Ave. Norm. Score | 29.1 | 92.0 | 31.0 | 59.8 | 8.6 | 42.3 |

In our study, CodeSteerLLM is fine-tuned on synthesized datasets where the TaskLLM is always GPT-4o. To assess its transferability and generalizability, we test it with three popular models: Claude-3-5-Sonnet, Mistral-Large, and GPT-3.5-Turbo. We evaluate them on five representative tasks selected based on GPT-4o's results in Table 1: two where text outperforms code and three where code is superior. CodeSteer has shown apparent effects when guiding GPT-4o on these tasks. The results in Table 2 confirm that CodeSteer generalizes well across other LLM types. This is expected, as its core mechanisms, code/text guidance and dynamic adaptation, are essential to all general-purpose LLMs. Notably, we observe that CodeSteer is particularly effective when applied to stronger LLMs, such as Claude. This is likely because more powerful models possess superior self-reflection capabilities and can generate complex code with greater precision. Thus, they benefit more from CodeSteer's additional structured guidance, unlocking their full potential.

在我们的研究中,CodeSteerLLM 在合成数据集上进行了微调,其中 TaskLLM 始终为 GPT-4o。为了评估其迁移能力和泛化能力,我们使用三种流行模型进行了测试:Claude-3-5-Sonnet、Mistral-Large 和 GPT-3.5-Turbo。我们基于表 1 中 GPT-4o 的结果选择了五个代表性任务进行评估:其中两项任务文本优于代码,三项任务代码更优。CodeSteer 在指导 GPT-4o 完成这些任务时已显示出明显效果。表 2 的结果证实了 CodeSteer 在不同类型的大语言模型上具有良好的泛化能力。这是可以预期的,因为其核心机制(代码/文本引导和动态适应)对所有通用大语言模型都至关重要。值得注意的是,我们观察到 CodeSteer 在应用于更强的模型(如 Claude)时效果尤为显著。这可能是因为更强大的模型具有更优越的自我反思能力,并且能够更精确地生成复杂代码。因此,它们从 CodeSteer 提供的额外结构化指导中受益更多,从而释放出全部潜力。


5.3. Cost of Tokens and Runtime

5.3. Token 成本与运行时间

Figure 5: Score vs. token and runtime costs for each method, highlighting CodeSteer, R1, o1, and o1-preview in red. We display CodeSteer results separately for inferences using single or four H100 GPUs. Specific values are in Table 6.

图 5: 每种方法的得分与 token 和运行时的成本对比,其中红色突出显示了 CodeSteer、R1、o1 和 o1-preview。我们分别展示了 CodeSteer 在使用单个和四个 H100 GPU 进行推理时的结果。具体数值见表 6。

Figure 5 shows Score versus Token Length (including input and output tokens) and Score versus Runtime (covering both LLM inference and code execution) for all methods. Complete data is provided in Appendix Table 6. Token counts include only those used by TaskLLM, excluding small and open-source models fine-tuned on Llama-3.1-8B. For the o1 and o1-preview models, only runtime is plotted since their thinking chains are unavailable. While achieving superior performance, CodeSteer uses more tokens than baseline methods due to its multi-round generations. Most of these tokens are consumed by multiple interaction rounds that ultimately fail. CoT LLM R1 consumes more tokens than CodeSteer due to the inefficient textual iteration.

图 5 展示了所有方法的得分与 Token 长度(包括输入和输出 Token)以及得分与运行时间(涵盖大语言模型推理和代码执行)的关系。完整数据参见附录表 6。Token 计数仅包括 TaskLLM 使用的 Token,不包括在 Llama-3.1-8B 上微调的小型开源模型。对于 o1 和 o1-preview 模型,由于它们的思考链不可用,仅绘制了运行时间。虽然 CodeSteer 在性能上表现优异,但由于其多轮生成,它使用的 Token 比基线方法更多。其中大部分 Token 消耗在最终失败的多轮交互上。CoT 大语言模型 R1 由于低效的文本迭代,消耗的 Token 比 CodeSteer 更多。

In terms of runtime, CodeSteer is faster than o1 and R1 while delivering better performance. Additionally, since most of CodeSteer's runtime comes from the inference of the 8B CodeSteerLLM on our workstation, hardware and system optimizations can significantly reduce it. For example, running CodeSteerLLM on four H100 GPUs instead of one decreases the average runtime from 63.8 to 45.4 seconds. CoT LLMs consume excessive runtime and tokens due to their extensive and often redundant reasoning chains. Textual iteration is inherently inefficient for search. Appendix Sec. J shows examples of text answers of R1 and GPT-4o, in which both models attempt to find the correct equation for the Game 24 task by listing all possible combinations, leading to uncontrolled iterations and endless generation. This highlights the importance of symbolic computing through code generation.

在运行时间方面,CodeSteer 比 o1 和 R1 更快,同时提供了更好的性能。此外,由于 CodeSteer 的大部分运行时间来自我们工作站上 8B CodeSteerLLM 的推理,硬件和系统优化可以显著减少这一时间。例如,在四块 H100 GPU 上而不是一块上运行 CodeSteerLLM,可以将平均运行时间从 63.8 秒减少到 45.4 秒。CoT 大语言模型由于其冗长且常常重复的推理链,消耗了过多的运行时间和 Token。文本迭代在搜索方面本质上是低效的。附录 J 节展示了 R1 和 GPT-4o 的文本答案示例,其中两个模型都试图通过列出所有可能的组合来找到 Game 24 任务的正确方程,导致了不受控制的迭代和无尽的生成。这凸显了通过代码生成进行符号计算的重要性。

6. Ablation Studies

6. 消融实验

The CodeSteer framework comprises SFT and DPO dataset synthesis, CodeSteerLLM fine-tuning, a symbolic checker, and a self-answer checker. Here we conduct ablation studies on these components and their related modifications. The added experimental results are shown in Table 3, with the full result table of 37 SymBench tasks in Appendix Sec. K.

CodeSteer 框架包含 SFT 和 DPO 数据集合成、CodeSteerLLM 微调、符号检查器和自答检查器。我们在此对这些组件及其相关修改进行消融研究。新增的实验结果如表 3 所示,完整的 37 个 SymBench 任务结果表见附录 K 节。

DPO Effects In Table 3, 1. CodeSteer outperforms 2. WO DPO, showing the effectiveness of the DPO process.

DPO 效果 在表 3 中,1. CodeSteer 优于 2. WO DPO,表明了 DPO 过程的有效性。

SFT Data Augmentation As discussed in Sec. 4.1, we apply data augmentation to the last two rounds of each trajectory to prevent multi-round gradient cancellation.

SFT 数据增强 如第 4.1 节所述,我们对每条轨迹的最后两轮进行数据增强,以防止多轮梯度抵消。

Table 3: Ablation studies on CodeSteer. WO DPO: CodeSteer with SFT but without DPO fine-tuning. WO DPO WO Data Augment: Same as WO DPO, but without data augmentation in the last two rounds. Agent represents the Symbolic Agent.

表 3: CodeSteer 的消融研究。WO DPO: 使用 SFT 但不进行 DPO 微调的 CodeSteer。WO DPO WO Data Augment: 与 WO DPO 相同,但在最后两轮没有数据增强。Agent 代表符号智能体 (Symbolic Agent)。

| 方法(任务成功率 %) | 1. CodeSteer | 2. WO DPO | 3. WO DPO WO Data Augment. | 4. WO Symbolic Checker | 5. WO Self-answer Checker | 6. Agent | 7. Agent WO Symbolic Checker | 8. Agent WO Self-answer Checker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ave. Norm., Seen | 88.1 | 80.0 | 79.7 | 80.1 | 78.5 | 77.0 | 71.9 | 70.1 |
| Ave. Norm., Unseen | 81.3 | 76.2 | 70.9 | 68.6 | 64.2 | 67.9 | 62.0 | 57.4 |
| Ave. Norm., Total | 86.4 | 79.1 | 77.6 | 77.3 | 75.0 | 74.8 | 69.5 | 67.0 |

In Table 3, 2. WO DPO achieves a higher score than 3. WO DPO WO Data Augment., which means this extra attention on the last two rounds does enhance the SFT process.

在表 3 中,2. WO DPO 的得分高于 3. WO DPO WO Data Augment.,这意味着对最后两轮的额外关注确实增强了 SFT 过程。

Symbolic and Self-answer Checkers We evaluate the effects of the Symbolic and Self-answer Checker in two parts: 1) Dataset Synthesis Efficiency: Comparing Group 6 with Groups 7 and 8 in Table 3 shows that integrating these two checkers increases the Symbolic Agent’s success rates, thereby enhancing the efficiency of the dataset synthesis process. 2) CodeSteer Performance: Comparing Group 1 with Groups 4 and 5 demonstrates that augmenting with these two checkers improves CodeSteer’s final performance.

符号检查器与自答检查器 我们从两个方面评估符号检查器和自答检查器的作用:1) 数据集合成效率:比较表 3 中第 6 组与第 7、8 组可以看出,集成这两个检查器提高了符号智能体(Symbolic Agent)的成功率,从而提升了数据集合成过程的效率。2) CodeSteer 性能:比较第 1 组与第 4、5 组表明,加入这两个检查器提升了 CodeSteer 的最终性能。

Multi-round Guidance CodeSteer uses a multi-round interaction strategy with TaskLLM. In contrast, the Code/Text Choice method in Table 1 relies on single-step guidance and performs worse than CodeSteer. This demonstrates that the multi-round design enhances guidance effectiveness, aligning with the common intuition that the best methods for many tasks emerge from iterative "executing and exploring" processes accompanied by dynamic adaptation.

多轮指导
CodeSteer采用与大语言模型 (TaskLLM) 的多轮交互策略。相比之下,表1中的代码/文本选择方法依赖于单步指导,其表现不如CodeSteer。这表明多轮设计提高了指导的有效性,与许多任务的优化方法通常来自于伴随动态调整的迭代“执行与探索”过程的常见直觉相吻合。

Guide Not Summarizer CodeSteer primarily serves as the guidance generator for TaskLLM rather than directly generating answers, summarizing, or selecting among multiple answers. This design choice accounts for the limitations of the open-source LLM we use compared to the more capable closed-source LLM that supports TaskLLM. By focusing on guidance, CodeSteer reduces task complexity and data space requirements. The Code+Text+Sum.2 approach in Table 1 attempts to fine-tune an answer summarizer using the same data volume but fails, highlighting that summarization imposes a significant burden on Llama-3.1-8B due to the unique characteristics of each task.

引导者而非总结器 CodeSteer 主要为 TaskLLM 生成引导,而不是直接生成答案、进行总结或在多个答案中选择。这一设计选择考虑到了我们使用的开源大语言模型与支持 TaskLLM 的能力更强的闭源大语言模型之间的差距。通过专注于生成引导,CodeSteer 减少了任务复杂性和数据空间需求。表 1 中的 Code+Text+Sum.2 方法尝试使用相同的数据量微调一个答案总结器,但失败了,这突显了由于每个任务的独特特征,总结对 Llama-3.1-8B 带来了巨大的负担。

7. Related Work

7. 相关工作

Code Generation and Symbolic Computing in LLM Tasks LLMs are widely used for general agent tasks, such as interacting with software and websites (Zhou et al., 2023c; Hao et al., 2024a;b; Xu et al., 2024), planning robot actions (Chen et al., 2024d; Ahn et al., 2022), and inferring with logic (Suzgun et al., 2022). In fact, many test tasks in previous works can be solved with direct coding (Suzgun & Kalai, 2024; Gao et al., 2023). Some recent works also further extend the applications of coding into tasks involving commonsense reasoning and semantic analysis (Li et al., 2023; Weir et al., 2024). Most previous works mainly utilize text (Yao et al., 2024; Ahn et al., 2022; Lin et al., 2023) or code (Liang et al., 2022; Bairi et al., 2024; Zhou et al., 2023a) as the only output modality. Chen et al. (2024e) highlights the importance of smartly switching between code and text generation in LLMs but notes current methods have clear drawbacks.

LLM任务中的代码生成与符号计算
大语言模型被广泛用于通用智能体任务,例如与软件和网站交互 (Zhou et al., 2023c; Hao et al., 2024a;b; Xu et al., 2024)、规划机器人动作 (Chen et al., 2024d; Ahn et al., 2022) 以及逻辑推理 (Suzgun et al., 2022)。实际上,之前工作中的许多测试任务可以通过直接编码来解决 (Suzgun & Kalai, 2024; Gao et al., 2023)。最近的一些工作还进一步将编码的应用扩展到涉及常识推理和语义分析的任务中 (Li et al., 2023; Weir et al., 2024)。大多数之前的工作主要使用文本 (Yao et al., 2024; Ahn et al., 2022; Lin et al., 2023) 或代码 (Liang et al., 2022; Bairi et al., 2024; Zhou et al., 2023a) 作为唯一的输出模态。Chen et al. (2024e) 强调了在大语言模型中智能切换代码和文本生成的重要性,但指出当前方法存在明显缺陷。

LLM Self-reflection and CoT Models LLM-generated feedback via self-evaluation can improve performance on a variety of tasks (Yang et al., 2022; Welleck et al., 2022; Madaan et al., 2023). The OpenAI o1 (Jaech et al., 2024) and DeepSeek R1 (Guo et al., 2025) models demonstrate the potential of agentic LLMs that use Chain-of-Thought (CoT) text generation to explore and self-reflect, enhancing reasoning and planning. However, they lack symbolic computing and code generation capabilities, leading to weaker performance on complex symbolic tasks and consuming substantial tokens and time (Chen et al., 2024a).

LLM 自反思与 CoT 模型
通过自我评估生成的 LLM 反馈可以提升各种任务的表现 (Yang et al., 2022; Welleck et al., 2022; Madaan et al., 2023)。OpenAI o1 (Jaech et al., 2024) 和 DeepSeek R1 (Guo et al., 2025) 模型展示了使用思维链 (Chain-of-Thought, CoT) 文本生成的自主 LLM 在探索和自我反思方面的潜力,增强了推理和规划能力。然而,它们缺乏符号计算和代码生成能力,导致在复杂符号任务上表现较弱,并且消耗大量 Token 和时间 (Chen et al., 2024a)。

LLM Fine-tuning with Multi-step SFT and DPO SFT (Chen et al., 2024f) and DPO (Rafailov et al., 2024) are extensively implemented for LLM fine-tuning. To enhance LLM’s capability in multi-step agent tasks, these methods are further modified with multi-step goals and rewards (Zhou et al., 2024b; Zhai et al., 2024; Zhang et al., 2024). LLM self-generated data have become increasingly important for model improvement when combined with search algorithms and rejection sampling (Zhou et al., 2023b; Guan et al., 2025).

基于多步 SFT 和 DPO 的大语言模型微调 SFT (Chen et al., 2024f) 和 DPO (Rafailov et al., 2024) 已被广泛用于大语言模型微调。为了增强大语言模型在多步智能体任务中的能力,这些方法进一步结合了多步目标和奖励进行改进 (Zhou et al., 2024b; Zhai et al., 2024; Zhang et al., 2024)。当与搜索算法和拒绝采样结合时,大语言模型自我生成的数据在模型改进中变得越来越重要 (Zhou et al., 2023b; Guan et al., 2025)。

8. Discussion

8. 讨论

Our work underlines the significance of augmenting LLM reasoning and planning capabilities with symbolic computing and shows great potential of steering large models for smarter code/text generation with specialized small models. We introduce novel modifications to dataset synthesis and fine-tuning (SFT/DPO) to support a multi-round guidance framework, which has proven effective. Unlike CoT LLMs like OpenAI o1 and DeepSeek R1, which rely solely on textual reasoning for exploration, symbolic computing offers greater efficiency, robustness, and scalability. Since coding is a core LLM capability, generating symbolic tools via code writing preserves generalization across tasks.

我们的工作强调了通过符号计算增强大语言模型推理和规划能力的重要性,并展示了通过专用小模型引导大模型生成更智能代码/文本的巨大潜力。我们引入了对数据集合成和微调(SFT/DPO)的新颖修改,以支持多轮引导框架,该框架已被证明是有效的。与 OpenAI o1 和 DeepSeek R1 等仅依赖文本推理进行探索的 CoT 大语言模型不同,符号计算提供了更高的效率、鲁棒性和可扩展性。由于编码是大语言模型的核心能力,通过代码编写生成符号工具可以保持跨任务的泛化能力。

Impact Statement

影响声明

This paper aims to advance the field of Foundation Models. Steering the generation from language models has the great potential to improve safety and performance to better align with human preferences. Any such work is inherently a double-edged sword; the same techniques used to generate samples from a harmless distribution of text could, with a single sign change, be repurposed for generating samples from a harmful distribution of text. Our method improves language model capability by integrating symbolic computing, which may also be misused for harmful purposes.

本文旨在推动基础模型(Foundation Models)领域的发展。通过引导大语言模型的生成,有望在安全性和性能上取得显著提升,使其更符合人类偏好。然而,任何此类工作本质上都是一把双刃剑;同样的技术既可以用于从无害的文本分布中生成样本,只需稍作调整,便可用于从有害的文本分布中生成样本。我们的方法通过集成符号计算(symbolic computing)来提升大语言模型的能力,但这也可能被误用于有害目的。

Overall, we believe the potential positive social benefits of our work in evaluation and steering language model output towards desired target distributions outweigh the potential negatives stemming primarily from misuse.

总体而言,我们相信在评估和引导大语言模型输出朝向目标分布方面的工作,其潜在的社会效益超过了主要由滥用带来的潜在负面影响。

References

参考文献

Chen, Y., Arkin, J., Hao, Y., Zhang, Y., Roy, N., and Fan, C. Prompt optimization in multi-step tasks (promst): Integrating human feedback and preference alignment. arXiv preprint arXiv:2402.08702, 2024c.

Chen, Y., Arkin, J., Hao, Y., Zhang, Y., Roy, N., and Fan, C. 多步任务中的提示优化 (promst):整合人类反馈与偏好对齐. arXiv preprint arXiv:2402.08702, 2024c.

Chen, Y., Arkin, J., Zhang, Y., Roy, N., and Fan, C. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4311–4317. IEEE, 2024d.

Chen, Y., Arkin, J., Zhang, Y., Roy, N., and Fan, C. 基于大语言模型的可扩展多机器人协作:集中式还是分散式系统?2024 IEEE 国际机器人与自动化会议 (ICRA), pp. 4311–4317. IEEE, 2024d.

Chen, Y., Jhamtani, H., Sharma, S., Fan, C., and Wang, C. Steering large language models between code execution and textual reasoning. arXiv preprint arXiv:2410.03524, 2024e.

Chen, Y., Jhamtani, H., Sharma, S., Fan, C., 和 Wang, C. Steering large language models between code execution and textual reasoning. arXiv preprint arXiv:2410.03524, 2024e.

Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024f.

Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. 自对弈微调将弱语言模型转化为强语言模型。arXiv preprint arXiv:2401.01335, 2024f.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., 等. 训练验证器以解决数学文字题. arXiv 预印本 arXiv:2110.14168, 2021.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., 等. The llama 3 herd of models. arXiv 预印本 arXiv:2407.21783, 2024.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. PAL: 程序辅助语言模型。在国际机器学习会议上,第10764–10799页。PMLR, 2023。

Guan, X., Zhang, L. L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.

Guan, X., Zhang, L. L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M. rstar-math:小规模大语言模型通过自我演化的深度思考掌握数学推理。arXiv 预印本 arXiv:2501.04519, 2025。

Gui, J., Liu, Y., Cheng, J., Gu, X., Liu, X., Wang, H., Dong, Y., Tang, J., and Huang, M. Logicgame: Benchmarking rule-based reasoning abilities of large language models. arXiv preprint arXiv:2408.15778, 2024.

Gui, J., Liu, Y., Cheng, J., Gu, X., Liu, X., Wang, H., Dong, Y., Tang, J., and Huang, M. Logicgame:基准测试大语言模型的基于规则的推理能力。arXiv preprint arXiv:2408.15778, 2024.

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., 等. Deepseek-r1: 通过强化学习激励大语言模型的推理能力. arXiv 预印本 arXiv:2501.12948, 2025.

Hao, Y., Chen, Y., Zhang, Y., and Fan, C. Large language models can plan your travels rigorously with formal verification tools. arXiv preprint arXiv:2404.11891, 2024a.

Hao, Y., Chen, Y., Zhang, Y., and Fan, C. 大语言模型能够使用形式验证工具严格规划您的旅行。arXiv 预印本 arXiv:2404.11891, 2024a.

Hao, Y., Zhang, Y., and Fan, C. Planning anything with rigor: General-purpose zero-shot planning with llm-based formalized programming. arXiv preprint arXiv:2410.12112, 2024b.

Hao, Y., Zhang, Y., and Fan, C. Planning anything with rigor: General-purpose zero-shot planning with llm-based formalized programming. arXiv preprint arXiv:2410.12112, 2024b.

Valmeekam, K., Olmo, A., Sreedharan, S., and Kambhampati, S. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.

Valmeekam, K., Olmo, A., Sreedharan, S., 和 Kambhampati, S. 大语言模型仍然无法规划(关于变化规划和推理的大语言模型基准)。在 NeurIPS 2022 决策基础模型研讨会中,2022。

Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., and Kambhampati, S. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36, 2024.

Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., and Kambhampati, S. Planbench: 用于评估大语言模型在规划和变化推理上的可扩展基准。Advances in Neural Information Processing Systems, 36, 2024.

Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., and Zou, J. Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692, 2024.

Wang, J., Wang, J., At hi war at kun, B., Zhang, C., and Zou, J. Mixture-of-agents 增强大语言模型能力。arXiv 预印本 arXiv:2406.04692, 2024。

Weir, N., Khalifa, M., Qiu, L., Weller, O., and Clark, P. Learning to reason via program generation, emulation, and search. arXiv preprint arXiv:2405.16337, 2024.

Weir, N., Khalifa, M., Qiu, L., Weller, O., 和 Clark, P. 通过程序生成、模拟和搜索学习推理。arXiv 预印本 arXiv:2405.16337, 2024.

Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and Choi, Y. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2022.

Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., 和 Choi, Y. 通过学习自我纠正生成序列。在第十一届国际学习表示会议 (ICLR), 2022.

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. Autogen: 通过多智能体对话框架实现下一代大语言模型应用. arXiv preprint arXiv:2308.08155, 2023.

Xu, T., Chen, L., Wu, D.-J., Chen, Y., Zhang, Z., Yao, X., Xie, Z., Chen, Y., Liu, S., Qian, B., et al. Crab: Cross- environment agent benchmark for multimodal language model agents. arXiv preprint arXiv:2407.01511, 2024.

Xu, T., Chen, L., Wu, D.-J., Chen, Y., Zhang, Z., Yao, X., Xie, Z., Chen, Y., Liu, S., Qian, B., 等. Crab: 跨环境多模态语言模型智能体基准测试. arXiv 预印本 arXiv:2407.01511, 2024.

Yang, K., Tian, Y., Peng, N., and Klein, D. Re3: Generating longer stories with recursive reprompting and revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4393–4479, 2022.

Yang, K., Tian, Y., Peng, N., and Klein, D. Re3: 通过递归重提示与修订生成长篇故事。在2022年自然语言处理实证方法会议论文集中,第4393–4479页,2022。

Yang, Y., Xiong, S., Payani, A., Shareghi, E., and Fekri, F. Can llms reason in the wild with programs? arXiv preprint arXiv:2406.13764, 2024.

Ya