CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Abstract
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark, SymBench, comprising 37 symbolic tasks with adjustable complexity, and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the best existing LLMs OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8-point performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, datasets, and code are available at https://github.com/yongchao98/CodeSteer-v1.0.
1. Introduction
While the reasoning and planning capabilities of LLMs have improved significantly (Wang et al., 2024; Chen et al., 2024c; Li et al., 2023), they still fail in ostensibly simple tasks (Zhou et al., 2024a). Crucially, many tasks in existing benchmarks—such as Blocksworld (Valmeekam et al., 2024) and Game 24 (Zhou et al., 2023b)—can be completely solved with code solutions. Text-based reasoning excels at semantic understanding and commonsense inference but is less suited for exact computation, symbolic manipulation, optimization, and algorithmic processing (Valmeekam et al., 2022). In contrast, symbolic computing via code generation is adept at handling rigorous operations and can easily leverage specialized tools (e.g., equation solvers). In many tasks, prompting LLMs to generate and execute code outperforms purely textual reasoning (Madaan et al., 2022; Liang et al., 2022; Chen et al., 2022).
A key challenge is guiding LLMs to decide when to rely on textual reasoning versus programmatic solutions, given that most input questions lack explicit cues about which approach is best. Recent OpenAI GPT models address this by providing a Code Interpreter module, allowing the model to iteratively generate and execute code, then further reason with the output (Achiam et al., 2023). Multi-agent frameworks like AutoGen (Wu et al., 2023) adopt a specialized system prompt to steer the LLM toward code generation when needed. However, Chen et al. (2024e) recently found that all these existing methods struggle to effectively steer between textual reasoning and code generation, failing to fully leverage symbolic computing capabilities.
Our work tries to bridge this gap by developing an assistant framework (CodeSteer) to guide the code/text generation of the LLM solving the task (TaskLLM). By fine-tuning a small model (Llama-3-8B (Dubey et al., 2024)) to be the assistant, we enable large models (GPT-4o (Achiam et al., 2023)) to fully leverage symbolic computing via code generation while preserving other capabilities. Recognizing that iterative “executing and exploring” is the most effective way to solve tasks, we build CodeSteer to generate prompts that guide the TaskLLM through multiple rounds of interaction before finalizing answers.
To achieve a comprehensive evaluation, we gather and develop a benchmark of 37 symbolic tasks, referred to as SymBench. On SymBench, augmenting GPT-4o with CodeSteer greatly improves its average performance score from 53.3 to 86.4, even outperforming the current leading pure-text models, OpenAI o1 (82.7) (Jaech et al., 2024) and DeepSeek R1 (76.8) (Guo et al., 2025). Although trained for GPT-4o, CodeSteer shows strong generalizability, delivering an average 41.8-point performance gain on Claude-3-5-Sonnet, Mistral-Large, and GPT-3.5. By fully leveraging symbolic computing, CodeSteer-guided LLMs maintain strong performance on highly complex tasks even when o1 fails in all testing cases. Our key contributions are:

Figure 1: Examples and performance of CodeSteer on guiding LLM code/text generation to integrate symbolic computing. At each interaction with TaskLLM, it reviews current and previous answers, then provides guidance for the next round. CodeSteer returns final answers when it deems them ready. With CodeSteer, GPT-4o outperforms OpenAI Code Interpreter, o1, and o1-preview models.
- Developing and publishing SymBench: Prior works by Chen et al. (2024e) and Gui et al. (2024) gathered and developed 14 and 31 tasks, respectively, targeting challenges in computation, symbolic manipulation, logic, optimization, spatial reasoning, and constrained planning. However, neither study published the complete code for question/solution synthesis or the full datasets. From these 45 tasks, we select 37 that remain challenging for GPT-4o and redevelop their generation code to produce samples with adjustable complexity. We refer to this newly published benchmark as SymBench.
- New methods for dataset construction and model fine-tuning with SFT and DPO: We fine-tune Llama-3-8B with the synthesized datasets of 12k multi-round guidance/generation trajectories (SFT) and 5.5k guidance comparison pairs (DPO). Unlike standard multi-step settings, in CodeSteer’s multi-round guidance the TaskLLM outputs a complete answer each round rather than only at the end. Consequently, we introduce novel components to both the dataset construction and training processes for SFT and DPO, such as data synthesis with dynamic guidance adaptation, emphasis on the final two rounds in SFT, comparison score design, and efficient answer sampling in DPO. These modifications result in better performance. Both the final CodeSteer model and the created datasets will be released.
- Symbolic checker and self-answer checker: Observing that the TaskLLM frequently produces text-like code that hardcodes answers, neglecting efficient symbolic computation, we introduce a Symbolic Checker to help CodeSteerLLM evaluate code complexity and efficiency. Since most reasoning and planning tasks can be better verified with coding, we add a Self-answer Checker for better judgment of answer correctness by CodeSteerLLM. These two new checkers have been proven to significantly improve the efficiency of dataset synthesis and CodeSteerLLM fine-tuning.
- Proposed CodeSteer Outperforms Nine Baselines and o1: CodeSteer’s superior performance highlights the importance of enhancing LLM reasoning and planning with symbolic computing. This also demonstrates the potential for steering large models to generate smarter code and text by leveraging specialized smaller models.
2. Symbolic Tasks and SymBench
Challenges in Code/Text Choices For tasks requiring computation, symbolic manipulation, logic, optimization, spatial reasoning, and constrained planning, coding-based symbolic computing is often more effective than text-based approaches. However, Chen et al. (2024e) found that steering LLM code/text generation poses significant challenges, even in tasks with apparent symbolic characteristics. The main bottlenecks are: 1) Deciding whether code or text is simpler depends on task type, task complexity, and the LLM’s capabilities, which is hard to judge (see Appendix Sec. A). 2) LLM-generated code often appears as text-like scripts that merely hard-code answers rather than enabling efficient symbolic computation, echoing the phenomenon described in Yang et al. (2024) (see Appendix Sec. B).
SymBench Chen et al. (2024e) and Gui et al. (2024) collected 14 and 31 tasks with symbolic factors from various benchmarks such as Suzgun et al. (2022); Chen et al. (2024d); Yao et al. (2024); Cobbe et al. (2021); Hendrycks et al. (2021), but their question-generation code and complete datasets remain private. We redevelop the generation code to automatically synthesize questions with adjustable complexity. Our resulting set of 37 tasks covers reasoning, planning, and execution, testing competencies in mathematics, spatial reasoning, logic, order reasoning, optimization, and search. Details and categorization are provided in Appendix Sec. C and Table 4.
3. CodeSteer Framework
Fig 1 illustrates how CodeSteer guides the LLM’s code/text generation. At each round, CodeSteer reviews the TaskLLM’s current answer and the guidance/answer history, then decides whether to offer new guidance or finalize the response. It performs three key functions:
1) Initial Method Selection: in the first round, it chooses whether to solve the task with code or text (e.g., textual reasoning for small-number multiplication and code for large-number multiplication in the Number Multiply task). 2) Dynamic Adaptation: in subsequent rounds, it refines guidance or switches methods if issues arise (e.g., encouraging more sophisticated symbolic approaches in Game 24, or switching to textual reasoning after multiple incorrect code attempts in BoxLift). 3) Answer Finalization: it returns the final answer when it deems it ready.
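The three functions above amount to a decision loop. The following is a minimal sketch with hypothetical `codesteer` and `taskllm` interfaces (all names are illustrative assumptions, not the released API); the 5-round cap mirrors the limit described in Sec. 4.1.

```python
def codesteer_loop(question, codesteer, taskllm, max_rounds=5):
    """Multi-round guidance loop: CodeSteer reviews each complete
    answer from the TaskLLM and either issues new guidance or
    finalizes the response (hypothetical interfaces)."""
    history = []
    guidance = codesteer.guide(question, history)  # initial code/text choice
    for _ in range(max_rounds):
        answer = taskllm.answer(question, guidance, history)
        history.append((guidance, answer))
        guidance = codesteer.guide(question, history)
        if guidance is None:            # CodeSteer deems the answer ready
            return answer
    return history[-1][1]               # round cap reached: return latest answer
```

Note that, unlike standard multi-step settings, the TaskLLM returns a complete answer every round, which is what makes per-round review possible.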
The main components of CodeSteer are as follows:
CodeSteerLLM is the primary model fine-tuned and used to guide the TaskLLM in code/text generation. The input prompt formats for the first and subsequent rounds are presented in Appendix Sec. D. To facilitate answer evaluation, CodeSteerLLM is equipped with two checkers—Self-answer and Symbolic—whose design is inspired by the inherent features of symbolic tasks.
Self-answer Checker re-queries the TaskLLM to generate and execute code for verifying its current answer, then returns the evaluation results and explanations to CodeSteerLLM. Since many symbolic tasks benefit from code-based verification, this approach often provides a more reliable perspective. The prompt format for the Self-answer Checker is provided in Appendix Sec. E.
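A minimal sketch of this checker, with hypothetical `taskllm` and `run_code` interfaces (the prompt wording and verdict format are assumptions; the actual prompt is in Appendix Sec. E):

```python
def self_answer_check(question, answer, taskllm, run_code):
    """Ask the TaskLLM for a verification program, execute it, and
    report the verdict back to CodeSteerLLM (hypothetical interfaces)."""
    prompt = (
        f"Write a Python program that checks whether the answer "
        f"'{answer}' is correct for the question: {question}. "
        f"Print VALID or INVALID followed by a short explanation."
    )
    verifier_code = taskllm.answer(prompt)
    output = run_code(verifier_code)             # sandboxed execution
    verdict = output.strip().startswith("VALID")
    return verdict, output                       # verdict + explanation
```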
Symbolic Checker is a rule-based script that analyzes the generated code for iteration, search, numeric handling, permutations, and combinations, then returns a complexity summary and score. This helps CodeSteerLLM determine whether the code is sufficiently sophisticated for the task at hand. Since the TaskLLM often produces text-like code prone to errors, the Symbolic Checker’s complexity assessment aids, but does not solely dictate, CodeSteerLLM’s decisions. Further details on the checking code and prompt are in Appendix Sec. F.
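An illustrative rule-based check in this spirit (the specific heuristics and weights here are assumptions, not the authors’ exact rules from Appendix Sec. F):

```python
import ast

def symbolic_check(code: str):
    """Rule-based complexity check: walk the AST counting loops,
    comprehensions, and calls to combinatorics utilities, then
    return a score and a summary of what was found."""
    tree = ast.parse(code)
    score, notes = 0, []
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            score += 2; notes.append("iteration")
        elif isinstance(node, (ast.ListComp, ast.GeneratorExp)):
            score += 1; notes.append("comprehension")
        elif isinstance(node, ast.Call):
            name = getattr(node.func, "attr", getattr(node.func, "id", ""))
            if name in {"permutations", "combinations", "product"}:
                score += 3; notes.append(f"combinatorics: {name}")
    # a very low score suggests text-like code that hard-codes the answer
    return score, sorted(set(notes))
```

A program that merely prints a precomputed answer scores 0, flagging it as text-like rather than genuine symbolic computation.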
Beyond enhancing CodeSteerLLM’s performance, the Self-answer and Symbolic Checkers also streamline dataset synthesis for SFT and DPO fine-tuning, as discussed in the following sections.
4. Fine-tuning the CodeSteerLLM
Among the three modules of CodeSteer, the CodeSteerLLM needs to be fine-tuned to perform the complicated task of steering. The fine-tuning is performed on a subset of SymBench. Specifically, we randomly select 28 of the 37 SymBench tasks, using a distinct set of samples without overlap with the test samples. This setup allows us to evaluate CodeSteer on 28 seen tasks (with different test samples) and on the remaining 9 unseen tasks. The fine-tuning consists of two steps. We first fine-tune the Llama-3.1-8B model with SFT, then further optimize it using DPO. Both processes perform full-parameter fine-tuning on 4×H100 GPUs for 4-10 epochs. The detailed parameter and hardware settings for the fine-tuning and inference processes are discussed in Appendix Sec. H. We synthesize 12k multi-round guidance/generation trajectories for SFT and 5.5k guidance comparison pairs for DPO. The specific number of data samples for each task is in Appendix Sec. G.
4.1. Multi-round SFT
To generate supervision data for SFT, we prompt GPT-4o to serve as both the guiding LLM (i.e., the CodeSteerLLM) and the TaskLLM to generate multiple guidance/generation trajectories. We then filter the trajectories, keeping only those that produce correct answers. To improve success rates, CodeSteerLLM’s prompt is more detailed and includes pre-set knowledge or hints. To increase dataset diversity and enable dynamic adaptation of guided thoughts, this prompt also has different versions. For example, we may let GPT-4o choose all guidance styles, or enforce transitions from code to text or text to code. We set the maximum number of guidance rounds to 5 and return the final answer once that limit is reached.
Multi-round Gradient Cancellation Issue In multi-round trajectories, the SFT process incorporates gradients from each round. This can lead to gradient cancellation in the early rounds. For example, in one task, both [code, return answer] and [text, code, return answer] produce correct results, so if both trajectories are used for fine-tuning, the SFT cannot learn that code is the better first step.
Data Augmentation To mitigate this issue, we leverage the fact that the final two rounds of guidance are most influential, as the TaskLLM produces new answers each round while earlier rounds primarily provide background. Consequently, we augment the SFT dataset by doubling the weights of the final two rounds.
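One plausible way to implement this weight doubling is simply to duplicate the final-two-round training samples (a minimal sketch; the per-round sample representation is a hypothetical simplification):

```python
def augment_final_rounds(trajectories):
    """Double the weight of the last two guidance rounds by duplicating
    those (context, guidance) samples in the SFT set (illustrative)."""
    sft_samples = []
    for traj in trajectories:            # traj: list of (context, guidance)
        for idx, sample in enumerate(traj):
            sft_samples.append(sample)
            if idx >= len(traj) - 2:     # final two rounds count twice
                sft_samples.append(sample)
    return sft_samples
```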
4.2. Multi-round DPO


Figure 2: Schematic of multi-round DPO data sampling: blue squares represent intermediate (non-final) rounds, and brown ovals mark finalizing rounds. Guidance responses from the same parent node in CodeSteerLLM are compared to generate the DPO data.
Because many correct trajectories in the SFT dataset are still suboptimal, we further fine-tune the CodeSteerLLM on pairs of trajectories labeled with preferences. Here we use rule-based scores to assign preferences. Figure 2 illustrates our framework for sampling DPO guidance pairs in a multi-round setting. The main challenge is sampling and selecting guidance pairs that exhibit clear performance differences across various rounds while minimizing the number of samples to conserve resources. We use a tree structure where each node represents a guidance, with a branching factor of 2 or 3. To compare guidance pairs from the same parent node, we calculate their Performance Scores using the following equation:
$$
\mathrm{Score}_{i}=\begin{cases}15-i,&\text{final round, correct answer}\\[2pt]-i,&\text{final round, incorrect answer}\\[2pt]\frac{1}{|C(i)|}\sum_{j\in C(i)}\mathrm{Score}_{j},&\text{otherwise}\end{cases}
$$
Here, $\mathrm{Score}_{i}$ represents the score for a node at round $i$, where $i$ is the current round number and $C(i)$ is the set of child nodes of node $i$. If the current round is the final one, $\mathrm{Score}_{i}$ is set to $15-i$ for correct answers and $-i$ for incorrect ones. This incentivizes CodeSteerLLM to achieve correct answers in the fewest rounds possible. For non-final rounds, $\mathrm{Score}_{i}$ is calculated as the average of its child nodes’ scores. This ensures that each non-terminal round’s score reflects the average performance of its potential subsequent actions, i.e., the expectation.
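This scoring rule can be written as a small recursion over the guidance tree (the dictionary-based node representation is a hypothetical simplification):

```python
def node_score(node, round_i):
    """Score a guidance node: finalizing rounds get 15 - i (correct)
    or -i (incorrect); non-final rounds average their children's scores."""
    if not node["children"]:                       # finalizing round
        return 15 - round_i if node["correct"] else -round_i
    kids = node["children"]
    return sum(node_score(c, round_i + 1) for c in kids) / len(kids)
```

For example, a round-1 node whose two round-2 children finalize with one correct and one incorrect answer scores the average of $13$ and $-2$.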
DPO data is collected from guidance pairs within the same parent node at each level that have a score difference greater than 2. To prevent reward hacking (Skalse et al., 2022)—where CodeSteerLLM might bypass exploration and return incorrect answers quickly (e.g., preferring a score of $-2$ over $-5$)—we include only pairs where at least one guidance has a positive score. To obtain diverse guidance answers, we set the inference temperature to 1.5 for the SFT fine-tuned CodeSteerLLM and use three models fine-tuned at different epochs (6, 8, and 10) to compare their guidance responses for the same parent node.
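The pair-collection rules above (score gap greater than 2, at least one positive score) can be sketched as follows; since the chosen guidance always has the higher score, requiring the chosen score to be positive enforces the at-least-one-positive rule (node fields are hypothetical):

```python
def collect_dpo_pairs(siblings, min_gap=2):
    """From sibling guidances under one parent, keep (chosen, rejected)
    pairs whose score gap exceeds `min_gap`, requiring the chosen score
    to be positive to avoid rewarding fast-but-wrong answers."""
    pairs = []
    for a in siblings:
        for b in siblings:
            if a["score"] - b["score"] > min_gap and a["score"] > 0:
                pairs.append((a["guidance"], b["guidance"]))
    return pairs
```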
5. Experiments
Experimental settings We use GPT-4o as the TaskLLM to test 28 seen and 9 unseen tasks, each with 100 samples of varying complexity. The samples for the 28 seen tasks are different from those used to train CodeSteerLLM. Additionally, we evaluate other LLM types to assess CodeSteer’s generalizability.
We compare CodeSteer to six training-free and three training-based baselines, with methods 1, 3–6, and 9 originally proposed in Chen et al. (2024e).
Training-free Baselines 1) No extra modifications, only inputting the original question (Only Question); 2) Our framework in Sec. 4.1 for synthesizing the SFT dataset, where GPT-4o works as CodeSteerLLM with extra hints (Symbolic Agent); 3) Prompting LLMs to answer with only text using CoT (All Text + CoT); 4) Prompting LLMs to first analyze the question with CoT and then output the code answer (All Code + CoT); 5) Concatenating the input question with AutoGen’s original system prompt in Appendix Sec. L (AutoGen Conca.); 6) A multi-agent framework that first queries LLMs to answer the question with the All Text + CoT and All Code + CoT methods, respectively; the final solution is then obtained by combining and summarizing both versions of the answer with the same LLM prompted differently (Code + Text + Sum.1).
Table 1: Experimental results on SymBench. Methods with the highest scores are highlighted blue.
Training-based Baselines 7) Fine-tuning Llama-3.1-8B as a summarizer based on the Code + Text + Sum.1 method, using SFT on correct summary data (Code + Text + Sum.2); 8) Fine-tuning Llama-3.1-8B as a one-step evaluator to choose between text and code generation (Code/Text Choice); 9) OpenAI GPT Code Interpreter with the original input question (Code Interpreter). Methods 7 and 8 are fine-tuned on the same number of data samples and task types as CodeSteer.
Comparison with CoT LLMs We also compare with the current best models: OpenAI o1 and o1-preview (Jaech et al., 2024) and DeepSeek R1 (Guo et al., 2025). These models enhance reasoning and planning by using textual search, reflection, and exploration during answer generation. However, our analysis shows that these CoT LLMs have not yet integrated code-based symbolic computing to further improve their performance.
Evaluations Answers are evaluated using predefined rules, with GPT-4o assisting in adjusting formats as needed. Beyond the Code Interpreter method, some approaches have the LLM output code as the final answer. We extract and execute this code using predefined algorithms to obtain the final result or facilitate further reasoning. To prevent infinite loops, code execution is limited to 30 seconds. If this limit is exceeded, the task is marked as failed or returns errors for subsequent rounds. We utilize success rate as the metric for each task. To compare each method, we calculate the Average Normalized Score over all the tested tasks by the following equation:
$$
\mathrm{AveNorm}_{j}=\frac{1}{N}\sum_{i=1}^{N}\frac{s_{ij}}{\operatorname*{max}(s_{i})}\times 100
$$
where $\mathrm{AveNorm}_{j}$ is the Average Normalized Score for method $j$, $s_{ij}$ is the score of method $j$ on task $i$, $\operatorname*{max}(s_{i})$ is the maximum score on task $i$, and $N$ is the total number of tasks. This equation normalizes each score relative to the maximum score on the respective task, and then averages the normalized scores over all tasks. Apart from task performance, in later sections we also discuss the token-length and runtime costs of each method.
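As a concrete sketch, the Average Normalized Score can be computed as follows (the per-method score dictionary and the 0-100 scaling are assumptions consistent with the scores reported in Table 1):

```python
def ave_norm(scores):
    """Average Normalized Score: each method's task score is divided by
    the best score any method achieved on that task, then averaged over
    tasks and scaled to 0-100.

    scores: {method: [per-task scores]}, aligned by task index."""
    n_tasks = len(next(iter(scores.values())))
    task_max = [max(s[i] for s in scores.values()) for i in range(n_tasks)]
    return {m: 100 * sum(s[i] / task_max[i] for i in range(n_tasks)) / n_tasks
            for m, s in scores.items()}
```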
5.1. Overall Better Performance
Table 1 presents the full results of all methods on SymBench, including individual task scores and the Average Normalized Score. The key findings are:
1) CodeSteer maintains similar relative performance on seen and unseen tasks, indicating no overfitting. 2) Augmenting GPT-4o with CodeSteer significantly boosts its performance, raising the Ave. Norm. Total Score from 53.3 to 86.4, outperforming all 9 baselines (best baseline: Code/Text Choice at 77.9).

Figure 3: Normalized score distribution of CodeSteer+GPT4o and o1 in 37 SymBench tasks.
3) GPT-4o + CodeSteer surpasses o1 (82.7), R1 (76.8), and o1-preview (74.8), highlighting the importance of integrating symbolic computing into LLMs. Figure 3 compares the score distributions of GPT-4o + CodeSteer and o1, showing that CodeSteer reduces instances of extremely low scores (near 0), demonstrating its robustness across varied tasks. 4) Compared to the other training-based methods (Code + Text + Sum.2 and Code/Text Choice) trained with the same number of data samples and tasks, CodeSteer’s better performance validates the framework’s effectiveness (further discussed in Sec. 6).
5.2. Scalability and Generalizability

Figure 4: Method performance across four representative tasks as task complexity increases from left to right on the x-axis, controlled by value scales. C.S. and Inter. represent CodeSteer and Interpreter.
To assess the impact of symbolic computing, Fig. 4 tracks the performance of five methods across four tasks of increasing complexity. As critical task-specific properties escalate, o1, o1-preview, and GPT-4o fail in highly complex cases, while symbolic-augmented methods (CodeSteer, Code Interpreter) sustain performance. Notably, CodeSteer proves more robust across tasks than Code Interpreter.
Table 2: Experimental results of Claude-3-5-Sonnet-20241022, Mistral-Large, and GPT-3.5 with and without augmented CodeSteer. For each model, the method with the higher score is highlighted blue.
| Method | Claude | Claude+CodeSteer | Mistral | Mistral+CodeSteer | GPT-3.5 | GPT-3.5+CodeSteer |
|---|---|---|---|---|---|---|
| Combinatorial Calcu. | 48 | 66 | 25 | 34 | 12 | 29 |
| Eight Queen | 4 | 87 | 60 | 41 | 0 | 16 |
| Reversi | 0 | 45 | 0 | 33 | 0 | 32 |
| Cons. Linear Arran. | 73 | 90 | 47 | 48 | 25 | 9 |
| Standard Sudoku | 0 | 100 | 0 | 100 | 0 | 95 |
| Ave. Norm. Score | 29.1 | 92.0 | 31.0 | 59.8 | 8.6 | 42.3 |
In our study, CodeSteerLLM is fine-tuned on synthesized datasets where the TaskLLM is always GPT-4o. To assess its transferability and generalizability, we test it with three popular models: Claude-3-5-Sonnet, Mistral-Large, and GPT-3.5-Turbo. We evaluate them on five representative tasks chosen based on GPT-4o’s results in Table 1: two where text outperforms code and three where code is superior. CodeSteer has shown clear effects when guiding GPT-4o on these tasks. The results in Table 2 confirm that CodeSteer generalizes well across other LLM types. This is expected, as its core mechanisms—code/text guidance and dynamic adaptation—are essential to all general-purpose LLMs. Notably, we observe that CodeSteer is particularly effective when applied to stronger LLMs, such as Claude. This is likely because more powerful models possess superior self-reflection capabilities and can generate complex code with greater precision. Thus, they benefit more from CodeSteer’s additional structured guidance, unlocking their full potential.

5.3. Cost of Tokens and Runtime
Figure 5: Score vs. token and runtime costs for each method, highlighting CodeSteer, R1, o1, and o1-preview in red. We display CodeSteer results separately for inferences using single or four H100 GPUs. Specific values are in Table 6.
Figure 5 shows Score versus Token Length (including input and output tokens) and Score versus Runtime (covering both LLM inference and code execution) for all methods. Complete data is provided in Appendix Table 6. Token counts include only those used by TaskLLM, excluding the smaller open-source model fine-tuned from Llama-3.1-8B. For the o1 and o1-preview models, only runtime is plotted since their thinking chains are unavailable. While achieving superior performance, CodeSteer uses more tokens than baseline methods because of its multi-round generation; most of these tokens are consumed by interaction rounds that ultimately fail. The CoT LLM R1 consumes even more tokens than CodeSteer due to its inefficient textual iteration.
In terms of runtime, CodeSteer is faster than o1 and R1 while delivering better performance. Additionally, since most of CodeSteer's runtime comes from the inference of the 8B CodeSteerLLM on our workstation, hardware and system optimizations can significantly reduce it. For example, running CodeSteerLLM on four H100 GPUs instead of one decreases the average runtime from 63.8 to 45.4 seconds. CoT LLMs consume excessive runtime and tokens due to their extensive and often redundant reasoning chains. Textual iteration is inherently inefficient for search. Appendix Sec. J shows example text answers from R1 and GPT-4o, in which both models attempt to find the correct equation for the Game 24 task by listing all possible combinations, leading to uncontrolled iterations and endless generation. This highlights the importance of symbolic computing through code generation.
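The Game 24 contrast can be made concrete: a brute-force program enumerates the finite space of number orderings, operators, and groupings and terminates immediately, whereas textual iteration runs on without control. An illustrative sketch (our own example, not the paper's released code; the function name `solve24` is ours):

```python
from itertools import permutations, product

def solve24(nums, target=24, eps=1e-6):
    """Brute-force Game 24: try every number order, operator choice,
    and parenthesization of four operands; return a solving expression."""
    ops = '+-*/'
    for a, b, c, d in permutations(map(float, nums)):
        for o1, o2, o3 in product(ops, repeat=3):
            # The five distinct parenthesizations of four operands.
            exprs = [
                f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                f"({a}{o1}({b}{o2}{c})){o3}{d}",
                f"({a}{o1}{b}){o2}({c}{o3}{d})",
                f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                f"{a}{o1}({b}{o2}({c}{o3}{d}))",
            ]
            for e in exprs:
                try:
                    if abs(eval(e) - target) < eps:
                        return e
                except ZeroDivisionError:
                    continue
    return None
```

The search space is at most 24 orderings x 64 operator triples x 5 groupings, so the program always halts after a few thousand evaluations, in contrast to the unbounded textual enumeration described above.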
6. Ablation Studies
The CodeSteer framework comprises SFT and DPO dataset synthesis, CodeSteerLLM fine-tuning, a symbolic checker, and a self-answer checker. Here we conduct ablation studies on these components and their related modifications. The experimental results are shown in Table 3, with the full result table for all 37 SymBench tasks in Appendix Sec. K.
DPO Effects In Table 3, 1.CodeSteer outperforms 2.WO DPO, showing the effectiveness of the DPO process.
SFT Data Augmentation As discussed in Sec. 4.1, we apply data augmentation to the last two rounds of each trajectory to prevent multi-round gradient cancellation. In
Table 3: Ablation studies on CodeSteer. WO DPO: CodeSteer with SFT but without DPO fine-tuning. WO DPO WO Data Augment: Same as WO DPO, but without data augmentation in the last two rounds. Agent represents the Symbolic Agent.
| Method (Task Success Rate %) | 1. CodeSteer | 2. WO DPO | 3. WO DPO WO Data Augment. | 4. WO Symbolic Checker | 5. WO Self-answer Checker | 6. Agent | 7. Agent WO Symbolic Checker | 8. Agent WO Self-answer Checker |
|---|---|---|---|---|---|---|---|---|
| Ave. Norm., Seen | 88.1 | 80.0 | 79.7 | 80.1 | 78.5 | 77.0 | 71.9 | 70.1 |
| Ave. Norm., Unseen | 81.3 | 76.2 | 70.9 | 68.6 | 64.2 | 67.9 | 62.0 | 57.4 |
| Ave. Norm., Total | 86.4 | 79.1 | 77.6 | 77.3 | 75.0 | 74.8 | 69.5 | 67.0 |
Table 3, 2. WO DPO achieves a higher score than 3. WO DPO WO Data Augment., which means this extra attention on the last two rounds does enhance the SFT process.
Symbolic and Self-answer Checkers We evaluate the effects of the Symbolic and Self-answer Checker in two parts: 1) Dataset Synthesis Efficiency: Comparing Group 6 with Groups 7 and 8 in Table 3 shows that integrating these two checkers increases the Symbolic Agent’s success rates, thereby enhancing the efficiency of the dataset synthesis process. 2) CodeSteer Performance: Comparing Group 1 with Groups 4 and 5 demonstrates that augmenting with these two checkers improves CodeSteer’s final performance.
Multi-round Guidance CodeSteer uses a multi-round interaction strategy with TaskLLM. In contrast, the Code/Text Choice method in Table 1 relies on single-step guidance and performs worse than CodeSteer. This demonstrates that the multi-round design enhances guidance effectiveness, aligning with the common intuition that for many tasks the best methods emerge from iterative "executing and exploring" processes accompanied by dynamic adaptation.
Guide Not Summarizer CodeSteer primarily serves as the guidance generator for TaskLLM rather than directly generating answers, summarizing, or selecting among multiple answers. This design choice accounts for the limitations of the open-source LLM we use compared to the more capable closed-source LLM that serves as TaskLLM. By focusing on guidance, CodeSteer reduces task complexity and data space requirements. The Code+Text+Sum.2 approach in Table 1 attempts to fine-tune an answer summarizer using the same data volume but fails, highlighting that summarization imposes a significant burden on Llama-3.1-8B due to the unique characteristics of each task.
7. Related Work
Code Generation and Symbolic Computing in LLM Tasks LLMs are widely used for general agent tasks, such as interacting with software and websites (Zhou et al., 2023c; Hao et al., 2024a;b; Xu et al., 2024), planning robot actions (Chen et al., 2024d; Ahn et al., 2022), and reasoning with logic (Suzgun et al., 2022). In fact, many test tasks in previous works can be solved with direct coding (Suzgun & Kalai, 2024; Gao et al., 2023). Some recent works further extend the application of coding to tasks involving commonsense reasoning and semantic analysis (Li et al., 2023; Weir et al., 2024). Most previous works use either text (Yao et al., 2024; Ahn et al., 2022; Lin et al., 2023) or code (Liang et al., 2022; Bairi et al., 2024; Zhou et al., 2023a) as the only output modality. Chen et al. (2024e) highlights the importance of smartly switching between code and text generation in LLMs but notes that current methods have clear drawbacks.
LLM Self-reflection and CoT Models LLM-generated feedback via self-evaluation can improve performance on a variety of tasks (Yang et al., 2022; Welleck et al., 2022; Madaan et al., 2023). The OpenAI o1 (Jaech et al., 2024) and DeepSeek R1 (Guo et al., 2025) models demonstrate the potential of agentic LLMs that use Chain-of-Thought (CoT) text generation to explore and self-reflect, enhancing reasoning and planning. However, they lack symbolic computing and code generation capabilities, leading to weaker performance on complex symbolic tasks and consuming substantial tokens and time (Chen et al., 2024a).
LLM Fine-tuning with Multi-step SFT and DPO SFT (Chen et al., 2024f) and DPO (Rafailov et al., 2024) are extensively implemented for LLM fine-tuning. To enhance LLM’s capability in multi-step agent tasks, these methods are further modified with multi-step goals and rewards (Zhou et al., 2024b; Zhai et al., 2024; Zhang et al., 2024). LLM self-generated data have become increasingly important for model improvement when combined with search algorithms and rejection sampling (Zhou et al., 2023b; Guan et al., 2025).
8. Discussion
Our work underlines the significance of augmenting LLM reasoning and planning capabilities with symbolic computing and shows the great potential of steering large models toward smarter code/text generation with specialized small models. We introduce novel modifications to dataset synthesis and fine-tuning (SFT/DPO) to support a multi-round guidance framework, which has proven effective. Unlike CoT LLMs such as OpenAI o1 and DeepSeek R1, which rely solely on textual reasoning for exploration, symbolic computing offers greater efficiency, robustness, and scalability. Since coding is a core LLM capability, generating symbolic tools via code writing preserves generalization across tasks.
Impact Statement
This paper aims to advance the field of Foundation Models. Steering the generation from language models has the great potential to improve safety and performance to better align with human preferences. Any such work is inherently a double-edged sword; the same techniques used to generate samples from a harmless distribution of text could, with a single sign change, be repurposed for generating samples from a harmful distribution of text. Our method improves language model capability by integrating symbolic computing, which may also be misused for harmful purposes.
Overall, we believe the potential positive social benefits of our work in evaluation and steering language model output towards desired target distributions outweigh the potential negatives stemming primarily from misuse.
References
Chen, Y., Arkin, J., Hao, Y., Zhang, Y., Roy, N., and Fan, C. Prompt optimization in multi-step tasks (PROMST): Integrating human feedback and preference alignment. arXiv preprint arXiv:2402.08702, 2024c.
Chen, Y., Arkin, J., Zhang, Y., Roy, N., and Fan, C. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4311–4317. IEEE, 2024d.
Chen, Y., Jhamtani, H., Sharma, S., Fan, C., and Wang, C. Steering large language models between code execution and textual reasoning. arXiv preprint arXiv:2410.03524, 2024e.
Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024f.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
Guan, X., Zhang, L. L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.
Gui, J., Liu, Y., Cheng, J., Gu, X., Liu, X., Wang, H., Dong, Y., Tang, J., and Huang, M. Logicgame: Benchmarking rule-based reasoning abilities of large language models. arXiv preprint arXiv:2408.15778, 2024.
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
Hao, Y., Chen, Y., Zhang, Y., and Fan, C. Large language models can plan your travels rigorously with formal verification tools. arXiv preprint arXiv:2404.11891, 2024a.
Hao, Y., Zhang, Y., and Fan, C. Planning anything with rigor: General-purpose zero-shot planning with llm-based formalized programming. arXiv preprint arXiv:2410.12112, 2024b.
Valmeekam, K., Olmo, A., Sreedharan, S., and Kambhampati, S. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., and Kambhampati, S. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36, 2024.
Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., and Zou, J. Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692, 2024.
Weir, N., Khalifa, M., Qiu, L., Weller, O., and Clark, P. Learning to reason via program generation, emulation, and search. arXiv preprint arXiv:2405.16337, 2024.
Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and Choi, Y. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2022.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
Xu, T., Chen, L., Wu, D.-J., Chen, Y., Zhang, Z., Yao, X., Xie, Z., Chen, Y., Liu, S., Qian, B., et al. Crab: Cross- environment agent benchmark for multimodal language model agents. arXiv preprint arXiv:2407.01511, 2024.
Yang, K., Tian, Y., Peng, N., and Klein, D. Re3: Generating longer stories with recursive reprompting and revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4393–4479, 2022.
Yang, Y., Xiong, S., Payani, A., Shareghi, E., and Fekri, F. Can llms reason in the wild with programs? arXiv preprint arXiv:2406.13764, 2024.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Zhai, Y., Bai, H., Lin, Z., Pan, J., Tong, S., Zhou, Y., Suhr, A., Xie, S., LeCun, Y., Ma, Y., et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. arXiv preprint arXiv:2405.10292, 2024.
Appendices: CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
A. Impacts of task types, task complexities, and LLM capabilities on code/text choices
The phenomenon and challenges of steering LLM code/text generation were first raised by Chen et al. (2024e). Here we discuss these phenomena in detail as motivation for our work. Fig 6 presents two typical examples from recently popular topics, the '9.11' vs. '9.9' numerical comparison and counting the letter 'r' in 'strawberry', where ChatGPT with GPT-4o makes mistakes through direct textual reasoning but easily solves the problem after being prompted to use code. Meanwhile, Fig 7 displays an example where GPT-4o fails to solve the question via code generation but partially solves it via textual reasoning. These two examples show that whether code or text is simpler highly depends on the task type and the LLM's own capabilities and characteristics.
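Both failure cases become one-liners once delegated to code; an illustrative sketch of the kind of program a code-prompted model might write (the function names are ours):

```python
from decimal import Decimal

def larger(x: str, y: str) -> str:
    """Exact numeric comparison, avoiding the '9.11 > 9.9' textual mistake."""
    return x if Decimal(x) > Decimal(y) else y

def letter_positions(word: str, letter: str):
    """1-indexed positions of a letter, e.g. the r's in 'strawberry'."""
    return [i + 1 for i, ch in enumerate(word) if ch == letter]
```

Here `larger("9.11", "9.9")` returns `"9.9"` and `letter_positions("strawberry", "r")` returns three positions, both exactly the questions GPT-4o answers wrongly via pure textual reasoning.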
The OpenAI GPT-4o Code Interpreter is trained to steer LLM code/text generation. However, the study of Chen et al. (2024e) finds many limitations of this method. As shown in Fig 8, they observe an intriguing property of the GPT Code Interpreter: its decision to use code depends on the complexity of the task. The GPT-4o Code Interpreter chooses to handle simple Number Multiplying questions with text and complex questions with code, resulting in correct answers. However, it fails on medium-difficulty questions, where it tends to be overconfident and answers via textual reasoning, which is sometimes wrong. Hence, whether to apply symbolic computing depends on task complexity even within the same type of task.

Figure 6: Cases where GPT-4o makes simple mistakes through direct textual reasoning but reliably solves the problem when prompted to use code.

Figure 7: Representative answers for the BoxLift task. The left figure is the partially correct answer of GPT-4o with the All Text + CoT method. The right figure is the wrong code answer from the All Code + CoT method. The text and code parts are colored in blue and green, respectively. The All Code + CoT method generates wrong code that runs into an infinite loop.

Figure 8: GPT-4o Code Interpreter tends to handle simple Number Multiplying tasks with text and complex tasks with code. However, it often fails with medium-difficulty questions, where it is overconfident and chooses not to use code when needed.
B. Varied code versions of the same LLM


Figure 9: Representative code answers for the Game 24 task. The left figure is the correct code from GPT-4o with the extra AutoGen prompt in Appendix Sec. L for guiding code/text choices. The right figure is the wrong code after prompting GPT-4o to answer with code: 'Think of an algorithm to solve the task and implement it in python'. The text and code parts are colored in blue and green, respectively. In both cases, GPT-4o is prompted to solve this task with code; the only difference is the guiding prompt. However, GPT-4o answers with different types of code, with or without efficient symbolic computing. This phenomenon shows that LLM code generation is unstable across varied prompts, tasks, and LLM types.
C. Description of SymBench tasks
Here we describe the 37 testing tasks. They require strong symbolic, mathematical, logical, geometrical, scientific, and commonsense reasoning capabilities. The first 14 tasks originate from Chen et al. (2024e), while the last 23 are from Gui et al. (2024). Note that neither of these two previous works releases the full question datasets and codes for these 37 tasks; the released question dataset in Gui et al. (2024) contains only 8 or 16 questions per task. Hence, we develop codes to automatically synthesize the questions for each task with tunable complexities. Both our developed codes and question datasets are released.
Number Multiplying This task involves querying LLMs to compute the product among integers. It represents a classic problem that LLMs are not able to solve through pure textual reasoning.
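This task is trivial for code because Python integers have arbitrary precision; a minimal sketch (the function name is ours):

```python
def multiply(nums):
    """Exact product of a list of integers; Python ints never overflow."""
    result = 1
    for n in nums:
        result *= n
    return result
```

Textual reasoning must carry every digit through long multiplication, while the program above is exact at any magnitude.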
Path Plan This task involves querying LLMs to plan the robot trajectory waypoints based on human task instructions and environments. This task originates from AutoTAMP (Chen et al., 2024b).
Letters This task involves querying LLMs to count the total number of specific letters in a long word and specify their positions. An example question can be ’How many r’s in the word strawberry and what are their positions?’. This task has recently gained significant attention because current LLMs struggle to perform it effectively and accurately.
BoxLift This task involves coordinating robots of various types to lift boxes of different sizes and weights. Each robot has a specific lifting capacity and can collaborate with others to lift a single box. A box can only be lifted if the combined lifting capacity of the robots exceeds the box’s weight. The objective is to lift all the boxes in the minimum number of time steps. This task originates from Scalable-Robots (Chen et al., 2024d).
BoxNet This task involves coordinating robot arms to move colored boxes (squares) into corresponding colored goal locations (circles) in the fewest time steps. Each robot arm is assigned and restricted to a cell indicated by the dotted lines. The arms have two possible actions: (1) move a box within their cell to a neighboring cell, or (2) move a box within their cell to a goal location within the same cell. The objective is to ensure all boxes are placed in their matching goal locations efficiently. This task originates from Scalable-Robots (Chen et al., 2024d).
Blocks world In Blocks world, the objective is to stack a set of blocks (brown) according to a specific order. The robot can perform four actions: (1) pick up a block, (2) unstack a block from the top of another block, (3) put down a block, (4) stack a block on top of another block. A robot can only pick up, unstack, or stack a block if it is clear, that is, the block has no other blocks on top and is not currently being held. This task originates from PlanBench (Valmeekam et al., 2024).
Date Understanding Given a small set of sentences referring a specific date, the task involves querying LLMs to answer a provided question based on the information in these sentences (e.g., ‘The concert was scheduled for 06/01/1943, but was delayed by one day to today. What was the date yesterday in MM/DD/YYYY?’). This task originates from BIG-Bench-Hard (Suzgun et al., 2022).
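Such questions reduce to standard-library date arithmetic once expressed in code; a sketch of the quoted example (the variable names are ours):

```python
from datetime import date, timedelta

# The example question above: the concert was scheduled for 06/01/1943
# but delayed by one day to today; what was yesterday's date?
scheduled = date(1943, 6, 1)
today = scheduled + timedelta(days=1)
yesterday = today - timedelta(days=1)
answer = yesterday.strftime("%m/%d/%Y")
```

Delegating the offset bookkeeping to `datetime` removes the off-by-one errors that textual reasoning is prone to.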
Web of Lies This task involves querying LLMs to determine the truth value of a random Boolean function presented as a natural-language word problem. This task originates from BIG-Bench-Hard (Suzgun et al., 2022).
Logical Deduction This task involves querying LLMs to deduce the order of a sequence of objects using clues and information about their spatial relationships and placements. This task originates from BIG-Bench-Hard (Suzgun et al., 2022).
Navigate This task involves querying LLMs to determine whether the agent would return to its initial starting point after following a series of navigation steps. This task originates from BIG-Bench-Hard (Suzgun et al., 2022).
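In code the question is plain vector bookkeeping; a sketch assuming the navigation steps have already been parsed into (dx, dy) moves (our own formulation):

```python
def returns_to_start(steps):
    """steps: list of (dx, dy) moves; True iff the walk is a closed loop."""
    x = y = 0
    for dx, dy in steps:
        x += dx
        y += dy
    return (x, y) == (0, 0)
```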
GSM-Hard (Gao et al., 2023) This is the more challenging version of GSM8K (Cobbe et al., 2021) math reasoning dataset, where the numbers in the original questions of GSM8K are replaced with larger, less common values.
MATH-Geometry This is the math reasoning dataset from MATH dataset (Hendrycks et al., 2021), with specific focus on geometry questions.
MATH-Count&Probability This is the math reasoning dataset from MATH dataset (Hendrycks et al., 2021), with specific focus on counting and probability questions.
The following 23 tasks originate from LogicGame (Gui et al., 2024).
Logical Equation The task is to assign a specific numeric value to each letter from a given set, using a predefined range of numbers and a set of inequalities. Each letter corresponds to a unique number, and the relationships between the letters are defined by mathematical equations or constraints.
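A generic brute-force sketch for this task family, assuming the constraints can be expressed as predicates over the assignment dictionary (this interface is our own illustration, not the benchmark's):

```python
from itertools import permutations

def solve_assignment(letters, values, constraints):
    """Try every one-to-one letter -> value assignment until all
    constraint predicates hold; return the assignment dict or None."""
    for perm in permutations(values, len(letters)):
        assign = dict(zip(letters, perm))
        if all(c(assign) for c in constraints):
            return assign
    return None
```

For example, `solve_assignment("AB", [1, 2], [lambda a: a["A"] > a["B"]])` yields the unique assignment satisfying the inequality.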
Pooling This task involves applying a pooling operation on a numerical $N\times N$ grid. The pooling operation uses an $n\times n$ sliding window $(n<N)$ that moves across the grid from left to right and top to bottom. The results from each window are then arranged based on their positions to create a new output matrix.
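A sketch of the operation, assuming a stride of 1 (the task statement does not fix the stride) and a configurable pooling function:

```python
def pool(grid, n, op=max):
    """Apply an n-by-n sliding-window pooling (stride 1) over an
    N-by-N grid, producing an (N-n+1)-by-(N-n+1) output matrix."""
    N = len(grid)
    return [
        [op(grid[i + di][j + dj] for di in range(n) for dj in range(n))
         for j in range(N - n + 1)]
        for i in range(N - n + 1)
    ]
```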
Light Puzzles In this task, you are given an $n\times n$ grid representing a network of lights, where a lit light is represented by "1" and an unlit light by "0". Several buttons control the state of these lights by turning them on or off at certain positions. The state of each light can be affected by multiple buttons. The task is to follow a series of button presses and determine the final state of the grid.
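A sketch under the assumption that each button toggles its assigned positions; the benchmark's buttons may instead set lights explicitly on or off, so treat the update rule (and the `buttons` mapping) as illustrative:

```python
def press_buttons(grid, buttons, presses):
    """grid: list of lists of 0/1; buttons: dict name -> list of (row, col)
    positions it toggles; presses: sequence of button names in order."""
    state = [row[:] for row in grid]  # work on a copy
    for name in presses:
        for r, c in buttons[name]:
            state[r][c] ^= 1  # XOR flips 0 <-> 1
    return state
```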
Mahjong Given an initial set of letter cards, in each round a new card is added and one card is removed. Certain effects may be triggered when specific combinations of cards appear after the new card is introduced, and a result is determined based on these conditions. The goal is to determine the result after a series of rounds.
Statistical Counting Calculate the total score of a string by scanning it from left to right, where runs of consecutive identical letters earn points (for example, two or more consecutive A's add 1 point, two or more consecutive B's add 2 points, etc.). The task is to start with a score of 0 and return the final sum.
Matrix Transformation Rotate a given matrix of characters based on given instruction (e.g., 90 degrees clockwise), preserving each character’s position relative to others in the transformed output. The input matrix can be of any size and contain any character.
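For the clockwise-90-degree case named in the example, the transformation is a one-liner (a minimal sketch covering only that instruction):

```python
def rotate_cw(matrix):
    """Rotate a matrix of characters 90 degrees clockwise: reverse the rows,
    then transpose, so column j of the input becomes row j of the output."""
    return [list(row) for row in zip(*matrix[::-1])]

print(rotate_cw([["a", "b"], ["c", "d"]]))  # [['c', 'a'], ['d', 'b']]
```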
Logical Puzzle The task involves querying LLMs to select a specified number of different values from a grid of numbers, ensuring that certain mathematical constraints (sum or product) are satisfied for selected numbers for each row and column.
Constrained Linear Arrangement In a two-player card game, the task is to deduce your opponent’s moves based on the game’s rules, your played cards, and the announced results of each round. Each card can only be used once, and the game follows specific interaction rules between different card types, where certain cards can defeat, be defeated by, or draw with others according to predefined relationships.
Pattern Recognition The task involves querying LLMs to find all squares in a character matrix where each square consists of identical characters and has a side length of at least 3.
String Insertion The task is to transform a string by scanning it from left to right and inserting specific characters after certain character patterns (e.g., each pattern WXYZ requires inserting W immediately after it occurs). All operations are performed simultaneously on the original string.
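The "simultaneous" requirement means all matches are located on the original string before any insertion shifts the indices. A sketch (the rule table here is hypothetical):

```python
def insert_after_patterns(s, rules):
    """Insert rules[pattern] right after each occurrence of pattern, with all
    matches found on the ORIGINAL string (simultaneous application)."""
    inserts = []  # (index in the original string after which to insert, char)
    for pattern, ch in rules.items():
        start = 0
        while (idx := s.find(pattern, start)) != -1:
            inserts.append((idx + len(pattern), ch))
            start = idx + 1  # allow overlapping occurrences
    out, prev = [], 0
    for pos, ch in sorted(inserts):
        out.append(s[prev:pos]); out.append(ch); prev = pos
    out.append(s[prev:])
    return "".join(out)

print(insert_after_patterns("WXYZAB", {"WXYZ": "W"}))  # 'WXYZWAB'
```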
Letter Logic Diagram The task is to complete an incomplete grid by selecting from a list of letters, where each row and column must contain each letter exactly once, and all cells on the minor diagonal (top-right to bottom-left) must contain the same letter. Some cells are already filled in as constraints.
String Deletion and Modification The task is to transform a string by repeatedly applying a set of ordered string manipulation rules until no more changes are possible, where each rule modifies the string based on specific patterns or conditions present in the current string state. For example, a modification rule can be “If the string ends with ‘ba’, replace it with ‘ab’.”
String Synthesis Given an initial set of blocks and a set of synthesis rules that combine different types of blocks, the task is to determine the final block(s) after repeatedly applying these rules in order until no more combinations are possible.
Reversi In this game similar to Reversi, players take turns placing pieces on an $n\times n$ grid. After placing a piece, any of the opponent's pieces located between two of the player's pieces (in the same row, column, or diagonal) will be flipped. The task is to determine the state of the board after several rounds, starting from a given configuration.
Standard Sudoku Given a partially filled Sudoku grid, the task is to fill the remaining empty cells with numbers between 1 and 9, ensuring that no number repeats in the same row, column, or $3\times3$ subgrid.
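This is a classic backtracking problem; a minimal solver sketch (not the benchmark's reference implementation):

```python
def solve_sudoku(board):
    """Backtracking solver; board is a 9x9 list of ints, 0 = empty.
    Fills the board in place and returns True if a solution exists."""
    def ok(r, c, v):
        if any(board[r][j] == v for j in range(9)): return False
        if any(board[i][c] == v for i in range(9)): return False
        br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 subgrid
        return all(board[br + i][bc + j] != v for i in range(3) for j in range(3))

    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                for v in range(1, 10):
                    if ok(r, c, v):
                        board[r][c] = v
                        if solve_sudoku(board): return True
                        board[r][c] = 0  # undo and try the next value
                return False  # no value fits this cell
    return True  # no empty cell left
```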
Eight Queen Given a grid with some queens already placed, the task is to place the remaining queens such that no two queens share the same row, column, or diagonal, while avoiding positions with obstacles in the grid.
Cryptanalysis In this task, you are provided with a combination lock consisting of numbers and letters, where neither the numbers nor the letters repeat. Using a series of guesses and feedback, the goal is to deduce the correct password based on the given conditions.
String Splitting A dismantling engineer has old machines and can obtain machine parts through a set of predefined methods. By continuously cycling through these methods in a specific order, the engineer dismantles machines or combines parts to create new components, and the task is to determine the total number of parts and remaining machines after all possible cycles.
Combinatorial Calculation Given a set of integers, the goal is to use arithmetic operations (addition, subtraction, multiplication, division) and parentheses to arrange the numbers in such a way that the final result matches a specified target value. Each number must be used exactly once, and the order of the numbers cannot be changed.
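Because the number order is fixed, every parenthesization corresponds to a binary split of the sequence; the reachable values can be enumerated recursively (a brute-force sketch with exact rational arithmetic, not an efficient solver):

```python
from fractions import Fraction
from itertools import product

def reachable(nums):
    """All values obtainable from nums (order fixed) with + - * / and any
    parenthesization; Fractions avoid floating-point error in division."""
    if len(nums) == 1:
        return {Fraction(nums[0])}
    results = set()
    for k in range(1, len(nums)):  # split point keeps the left-to-right order
        for a, b in product(reachable(nums[:k]), reachable(nums[k:])):
            results |= {a + b, a - b, a * b}
            if b != 0:
                results.add(a / b)
    return results

print(Fraction(24) in reachable([3, 8, 1, 1]))  # 3 * 8 * 1 * 1 = 24 -> True
```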
Synthesis Decomposition A farmer grows various crops and can exchange them for agricultural products. Using a set of methods, he can trade specific combinations of crops for products, following a cyclic pattern until no further exchanges are possible. The goal is to determine the synthesis result for each round.
2048 Similarly to the 2048 game, in a grid, numbers representing powers of 2 can move in any direction, combining when they encounter a matching number to form the next power of 2. Given a starting position and a sequence of movements, the goal is to determine the resulting grid after executing the moves.
Permutation and Combination Given a set of objects with specific positioning constraints, the task is to determine the correct arrangement of the objects on a shelf. Each object must be placed in a position according to the rules provided, ensuring that the conditions on adjacency, order, and specific positions are met. For example, a rule about adjacency could be ‘Book A must be adjacent to book I’.
Table 4: The evaluated capabilities of all tasks, classified as Execution, Planning, and Reasoning tasks.
| Category | Tasks |
|---|---|
| Execution | Number Multiply, New Operator, Pooling, Light Puzzles, Mahjong, Statistical Counting, Matrix Transformation, Pattern Recognition, String Insertion, String Deletion and Modification, String Synthesis, Reversi, String Splitting, Synthesis Decomposition |
| Planning | 2048, Game 24, Path Plan, Letters, BoxLift, BoxNet, Blocksworld, Logical Equation, Logical Puzzle, Constrained Linear Arrangement, Letter Logic Diagram, Standard Sudoku, Eight Queen, Cryptanalysis, Combinatorial Calculation |
| Reasoning | Permutation and Combination, Date Understanding, Web of Lies, Logical Deduction, Navigation, GSM-Hard, MATH-Geometry, MATH-Counting & Probability |
D. Prompt for CodeSteerLLM
The input prompts of CodeSteerLLM follow a multi-round dialogue, i.e., previous rounds of prompts and responses are included as history for the subsequent generation of guidance. Since we set the maximum number of guidance rounds to 5 for each task, the total combined prompt and output length of CodeSteerLLM does not exceed the maximum context window of 8k. The formats of the first-round prompt and the following rounds of prompts are as follows. Note that 'The summary of generated code complexity is: {code complexity summary}' is not included if the answer generated by TaskLLM contains no code.
Round 1 prompt to CodeSteerLLM
You are guiding another TaskLLM to solve a task. You will be presented with a task that can potentially be solved using either pure textual reasoning or coding. Your goal is to determine which method will be most effective for solving the task. Follow these steps:
Following rounds of prompts to CodeSteerLLM
The response from TaskLLM is: {response}
The feedback from the checking agent is: {check result} The summary of generated code complexity is: {code complexity summary} The final returned guidance prompt should be of the format <<
E. Prompt for Self-answer Checker
Prompt for Self-answer Checker
F. Code for Symbolic Checker
The following code checks for the factors of iteration, search, numerics, permutations, and combinations in the code answered by TaskLLM and returns a summary of the code complexity together with a complexity score. We directly return the summary as the 'code complexity summary' to CodeSteerLLM for further guidance. If the complexity score is less than 2.0, the returned 'code complexity summary' is concatenated with 'The generated code may not be complex enough to carry out symbolic computing for solving the task.'

Figure 10: Code for checking the symbolic factors of the generated code by TaskLLM.
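A minimal sketch in the spirit of such a checker, assuming an AST walk over the generated Python code; the specific heuristics and score weights here are illustrative assumptions, not the paper's actual implementation:

```python
import ast

def complexity_report(code):
    """Hypothetical symbolic checker: scores iteration, search, and numeric
    factors in `code`, returning (summary, score) as described in the text."""
    tree = ast.parse(code)
    score, notes = 0.0, []
    loops = sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    if loops:
        score += 1.0; notes.append(f"{loops} loop(s) found")
    calls = set()
    for n in ast.walk(tree):
        if isinstance(n, ast.Call):
            f = n.func
            calls.add(f.attr if isinstance(f, ast.Attribute) else getattr(f, "id", ""))
    if calls & {"permutations", "combinations", "product"}:
        score += 1.0; notes.append("itertools-style search detected")
    if any(isinstance(n, ast.Constant) and isinstance(n.value, (int, float))
           for n in ast.walk(tree)):
        score += 0.5; notes.append("numeric constants present")
    summary = "; ".join(notes) or "no symbolic factors detected"
    if score < 2.0:  # threshold taken from the description above
        summary += (". The generated code may not be complex enough to carry "
                    "out symbolic computing for solving the task.")
    return summary, score
```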
G. Synthesized dataset number of each task for SFT and DPO
Table 5: Synthesized dataset number of each task for SFT and DPO fine-tuning processes.
| Task | SFT successful trajectories | DPO pairs |
|---|---|---|
| Game 24 | 792 | 320 |
| Path Plan | 442 | 215 |
| BoxLift | 345 | 163 |
| BoxNet | 330 | 186 |
| Blocksworld | 406 | 248 |
| Date Understanding | 497 | 238 |
| Web of Lies | 492 | 204 |
| Logical Deduction | 489 | 241 |
| Navigation | 503 | 170 |
| GSM-Hard | 332 | 125 |
| MATH Geometry | 342 | 115 |
| MATH Count&Prob. | 346 | 127 |
| Logical Equation | 396 | 213 |
| New Operator | 394 | 189 |
| Pooling | 404 | 187 |
| Light Puzzles | 406 | 259 |
| Mahjong | 421 | 230 |
| Statistical Counting | 402 | 223 |
| Matrix Transform. | 391 | 214 |
| Logical Puzzle | 454 | 148 |
| Constrained Linear Arrangement | 432 | 155 |
| Pattern Recognition | 414 | 135 |
| String Insertion | 409 | 128 |
| Letter Logic Diagram | 500 | 226 |
| String Deletion & Modification | 504 | 230 |
| String Synthesis | 397 | 185 |
| Reversi | 403 | 194 |
| Standard Sudoku | 400 | 212 |
| Total | 12043 | 5480 |
H. Parameter and hardware settings of SFT/DPO fine-tuning and inference processes
We utilize four H100 80GB GPUs for full-parameter fine-tuning of the Llama-3.1-8B models. The model is trained for 10 epochs in the SFT stage and 6 epochs in the DPO stage. The learning rate is set to $1\times10^{-5}$ for SFT and $5\times10^{-6}$ for DPO. We use a batch size of 4 for training. In DPO, the loss function follows the standard sigmoid loss (Rafailov et al., 2024), with the hyperparameter $\beta$ set to 0.1.
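For one preference pair, the standard sigmoid DPO loss with $\beta=0.1$ can be sketched from summed log-probabilities (a minimal illustration of the objective, not the training code; the log-probability values below are made up):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one pair: -log sigmoid(beta * (reward margin)),
    where rewards are policy-vs-reference log-probability differences."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy prefers the chosen answer more than the reference does,
# the margin is positive and the loss falls below log 2 (the zero-margin value).
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))
```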
In most cases, we perform the inference of CodeSteerLLM using a single H100 80GB GPU. However, to analyze the impact of hardware configurations on CodeSteer runtime, as shown in Fig. 5, we also conduct inference using four H100 GPUs for comparison.
For the generation of guidance answers in the DPO dataset creation, we utilize three different SFT fine-tuned Llama-3.1-8B models, trained for 6, 8, and 10 epochs, respectively. For each question and stage, we query all three models and compare their generated guidance answers.
I. Score-cost table for each method
Table 6: Score-cost table for each method.
| Method | Avg. normalized score (↑) | Avg. token length (↓) | Avg. runtime (s) (↓) |
|---|---|---|---|
| Baseline methods | | | |
| Question only | 53.3 | 566.1 | 8.2 |
| Symbolic Agent | 74.8 | 1192.5 | 27.3 |
| All Text + CoT | 52.1 | 1110.7 | 15.3 |
| All Code + CoT | 69.6 | 949.8 | 8.9 |
| AutoGen Conca. | 69.9 | 1295.9 | 10.6 |
| Code + Text + Sum.1 | 63.1 | 3931.6 | 24.2 |
| Code + Text + Sum.2 | 62.4 | 2808.6 | 32.4 |
| Code/Text Choice | 77.9 | 587.4 | 20.1 |
| Code Interpreter | 70.5 | 1175.9 | 23.8 |
| CoT LLMs | | | |
| DeepSeek R1 | 76.8 | 6396.6 | 68.6 |
| o1 | 82.7 | N/A | 70.5 |
| o1-preview | 74.8 | N/A | 37.7 |
| Proposed methods | | | |
| CodeSteer, 1×H100 | 86.4 | 4693.3 | 63.8 |
| CodeSteer, 4×H100 | 86.4 | 4693.3 | 45.4 |
J. Example Text Answer of DeepSeek R1 and GPT-4o in Game 24
Figure 11: Example text answer of R1 in the task Game 24. R1 searches possible answers with the continuous back-and-forth textual reasoning process. This search process still fails in the end.
Game 24, GPT-4o text answer

Figure 12: Example text answer of GPT-4o in the task Game 24. GPT-4o continues the textual reasoning process until reaching the maximum token generation length but never returns the answer.
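For contrast, the symbolic search that code execution makes straightforward can be sketched as a brute-force enumeration over operand pairs and operators (a hypothetical solver for illustration, not CodeSteer's actual output):

```python
from fractions import Fraction

def solve_24(nums, target=24):
    """Return an expression using each number exactly once that evaluates to
    target, or None; exact arithmetic via Fraction avoids rounding issues."""
    def search(items):  # items: list of (value, expression-string) pairs
        if len(items) == 1:
            return items[0][1] if items[0][0] == target else None
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                cands = [(a + b, f"({ea}+{eb})"), (a - b, f"({ea}-{eb})"),
                         (a * b, f"({ea}*{eb})")]
                if b != 0:
                    cands.append((a / b, f"({ea}/{eb})"))
                for v, e in cands:  # combine the pair, recurse on the rest
                    found = search(rest + [(v, e)])
                    if found:
                        return found
        return None
    return search([(Fraction(n), str(n)) for n in nums])

print(solve_24([4, 7, 8, 8]))  # one valid answer: (7 - 8/8) * 4 = 24
```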
K. Full experimental results of ablation studies
Table 7: Full experimental results of ablation studies on the components in CodeSteer framework.
| Task success rate % | 1. CodeSteer | 2. w/o DPO | 3. w/o DPO, w/o Data Augment. | 4. w/o Symbolic Checker | 5. w/o Self-answer Checker | 6. Agent | 7. Agent, w/o Symbolic Checker | 8. Agent, w/o Self-answer Checker |
|---|---|---|---|---|---|---|---|---|
| Avg. normalized, seen | 88.1 | 80.0 | 79.7 70.9 | 80.1 | 78.5 64.2 | 77.0 67.9 | 71.9 62.0 | 70.1 57.4 |
| Avg. normalized, unseen / total | 81.3 86.4 | 76.2 79.1 | 77.6 | 68.6 77.3 | 75.0 | 74.8 | 69.5 | 67.0 |
| Game 24 | 93 | 93 76 | 46 74 | 62 72 | 57 74 | 37 43 | 41 41 | 28 |
| Path Plan BoxLift | 75 | 65 | 76 | 66 | 72 | 58 | 47 | 29 |
| BoxNet | 77 | 21 | 31 | 13 | 17 | 30 | 24 | 39 15 |
| Blocksworld | 29 52 | 50 | 50 | 54 | 51 | 60 | 45 | 41 |
| Date Understanding | 87 | 83 | 86 | 80 | 83 | 89 | 84 | 92 |
| Web of Lies | 98 | 94 | 92 | 95 | 92 | 99 | 95 | 97 |
| Logical Deduction | 92 | 92 | 95 | 91 | 89 | 93 | 91 | 87 |
| Navigation | 99 | 90 | 95 | 85 | 80 | 93 | 94 | 88 |
| GSM-Hard | 77 | 74 | 72 | 79 | 74 | 76 | 73 | 70 |
| MATH Geometry | 75 | 74 | 70 | 71 | 69 | 73 | 68 | 70 |
| MATHCount&Prob. | 93 | 92 | 86 | 84 | 81 | 88 | 85 | 82 |
| Logical Equation | 78 | 58 | 56 | 61 | 56 | 50 | 52 | 56 |
| New Operator | 40 | 38 | 40 | 24 | 52 | 39 | 28 | 20 |
| Pooling | 46 | 43 | 51 | 47 | 45 | 46 | 44 | 52 |
| Light Puzzles | 68 | 71 | 52 | 51 | 52 | 56 | 56 | 60 |
| Mahjong | 90 | 88 | 88 | 92 | 95 | 77 | 85 | 79 |
| Statistical Counting | 97 | 98 | 92 | 95 | 84 | 93 | 90 | 96 |
| Matrix Transform. | 98 | 100 | 97 | 96 | 95 | 96 | 92 | 96 |
| Logical Puzzle | 70 | 58 | 56 | 52 | 44 | 58 | 53 | 54 |
| Const. Linear Arrange. | 86 | 66 | 65 | 76 | 81 | 71 | 64 | 52 |
| Pattern Recognition | 93 | 96 | 95 | 95 | 93 | 90 | 92 | 100 |
| String Insertion | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Letter Logic Diagram | 45 | 20 | 35 | 35 | 35 | 30 | 25 | 23 |
| String deletion&Modi. | 93 | 88 | 92 | 90 | 88 | 90 | 86 | 76 |
| String Synthesis | 29 | 12 | 21 | 30 | 26 | 20 | 12 | 14 |
| Reversi | 52 | 49 | 39 | 52 | 24 | 36 | 28 | 36 |
| Standard Sudoku | 100 | 100 | 95 | 100 | 100 | 98 | 100 | 100 |
| Letters | 96 | 85 | 88 | 87 | 84 | 91 | 79 | |
| Eight Queen | 78 | 74 | 72 | 72 | 52 | 73 | 64 | 75 52 |
| Number Multiply | 95 | 90 | 92 | 94 | 95 | 87 | 80 | 74 |
| Cryptanalysis | 24 | 22 | 15 | 4 | 12 | 15 | 12 | 7 |
| String Splitting | 56 | 56 | 31 | 43 | 41 | 52 | 42 | 40 |
| Combinatorial Calculation | 88 | 45 | ||||||
| Synthesis Decomposition 2048 | 56 | 56 | 44 | 53 | 44 | 43 | 32 | 40 |
| Permutation and Combina. | 93 | 86 | 80 | 92 | 56 | 89 | 82 | 78 |
L. System prompt of AutoGen
System prompt of AutoGen (Wu et al., 2023)
You are a helpful AI assistant. Solve tasks using your coding and language skills. In the following cases, suggest python code (in a python coding block) or shell script (in a sh coding block) for the user to execute. 1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, get the current date/time, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself. 2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly. Solve the task step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill. When using code, you must indicate the script type in the code block. The user cannot provide any other feedback or perform any other action beyond executing the code you suggest. The user can’t modify your code. So do not suggest incomplete code which requires users to modify. Don’t use a code block if it’s not intended to be executed by the user. If you want the user to save the code in a file before executing it, put # filename: filename inside the code block as the first line. Don’t include multiple code blocks in one response. Do not ask users to copy and paste the result. Instead, use ’print’ function for the output when relevant. Check the execution result returned by the user. If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can’t be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try. 
When you find an answer, verify the answer carefully. Include verifiable evidence in your response if possible. Reply ”TERMINATE” in the end when everything is done.
