[论文翻译]扩展通用数据分析智能体 (Data-Analytic Agents)




SCALING GENERALIST DATA-ANALYTIC AGENTS

扩展通用数据分析智能体 (Data-Analytic Agents)

ABSTRACT

摘要

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering or multi-agent scaffolds over proprietary models, while open-source models still struggle with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DATAMIND, a scalable data synthesis and agent training recipe designed to construct generalist data-analytic agents. DATAMIND tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DATAMIND applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DATAMIND, we curate DATAMIND-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DATAMIND-12K, our DATAMIND-14B achieves state-of-the-art performance with an average score of $71.16\%$ on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DATAMIND-7B also performs best among all open-source models with a score of $68.10\%$. We also report empirical insights from our exploratory trials in the analysis experiments, aiming to give the community actionable guidance on agent training. We will release DATAMIND-12K and DATAMIND-7B/14B to support the community's future research.

数据分析智能体 (Data-analytic Agent) 正成为自动化科学发现和实现创新人工智能愿景的关键催化剂。然而,当前方法严重依赖基于专有模型的提示工程或多智能体框架,而开源模型在处理现实分析所需的多样化格式、大规模数据文件以及长周期多步推理时仍面临困难。本文提出DATAMIND——一个可扩展的数据合成与智能体训练方案,旨在构建通用型数据分析智能体。该方法解决了构建开源数据分析智能体的三个关键挑战:数据资源不足、训练策略不当以及基于代码的多轮执行不稳定。具体而言,DATAMIND采用:1)细粒度任务分类与递归式由易到难任务组合机制,提升合成查询的多样性与难度;2)基于知识增强的轨迹采样策略,辅以模型驱动和规则驱动的过滤机制;3)结合监督微调与强化学习损失的动态可调训练目标;4)内存节约型且稳定的代码多轮执行框架。基于此,我们构建了DATAMIND-12K——一个涵盖多领域、多任务类别及多数据文件格式的高质量数据分析任务轨迹集。在DATAMIND-12K上训练的DATAMIND-14B模型在多项数据分析基准测试中以71.16%的平均得分达到最优性能,超越最强的专有基线DeepSeek-V3.1和GPT-5。我们的DATAMIND-7B模型同样以68.10%的得分在所有开源模型中表现最佳。通过分析实验,我们总结了探索性试验中获得的部分经验性发现,旨在为学界提供可操作的智能体训练洞见。我们将向社区发布DATAMIND-12K及DATAMIND-7B/14B模型以支持未来研究。


(a) Task taxonomy used in DATAMIND for fine-grained and diverse query synthesis. (b) Performance comparison between proprietary models and open-source models on multiple datasets.

(a) DATAMIND中用于细粒度多样化查询合成的任务分类法。 (b) 专有模型与开源模型在多个数据集上的性能比较。

Figure 1: (a) Task Taxonomy. We categorize data analysis tasks into 18 fine-grained categories to enhance the diversity of our synthesized queries. (b) Performance Comparison. Our DATAMIND-14B achieves the best performance among all proprietary models and all open-source models, whether trained for data analysis or not.

图 1: (a) 任务分类法。我们将数据分析任务划分为18个细粒度类别,以增强合成查询的多样性。(b) 性能对比。我们的DATAMIND-14B与所有专有模型及开源训练/未训练模型相比均取得最佳表现。

1 INTRODUCTION

1 引言

Large Language Models (LLMs) have demonstrated formidable performance on a wide spectrum of reasoning tasks spanning math, code, and science (DeepSeek-AI et al., 2025; Kimi et al., 2025; OpenAI, 2025a; Yang et al., 2025). As AI enters its second half (Yao, 2025), a surge of agentic LLM benchmarks targeting increasingly complex and domain-specific scenarios (Jimenez et al., 2024; Starace et al., 2025; Mialon et al., 2024; Phan et al., 2025; Wei et al., 2025a) is emerging. Among these scenarios, Automated Data Analysis (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Majumder et al., 2025), an essential pillar of AI for scientific research, plays a critical role in realizing Innovating AI and has shown promise for boosting research efficiency and accelerating scientific discovery (Chen et al., 2025b; Schmidgall et al., 2025; Lu et al., 2024; Chai et al., 2025).

大语言模型 (LLM) 在数学、代码和科学等广泛推理任务中展现出强大性能 (DeepSeek-AI et al., 2025; Kimi et al., 2025; OpenAI, 2025a; Yang et al., 2025)。随着人工智能进入下半场 (Yao, 2025),针对日益复杂和特定领域场景的智能体基准测试 (Jimenez et al., 2024; Starace et al., 2025; Mialon et al., 2024; Phan et al., 2025; Wei et al., 2025a) 正涌现。其中,自动化数据分析 (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Majumder et al., 2025) 作为科学研究人工智能的重要支柱,在实现创新人工智能方面发挥着关键作用,并展现出提升研究效率和加速科学发现的潜力 (Chen et al., 2025b; Schmidgall et al., 2025; Lu et al., 2024; Chai et al., 2025)。

Data-analytic agents process, model, and compute data by generating code, with the goal of discovering useful information or recurring patterns and thereby furnishing users with insights to support decision-making. However, existing data-analytic agents (Zhang et al., 2023; Hong et al., 2025; Li et al., 2024; Sun et al., 2025; Guo et al., 2024) are overwhelmingly built on proprietary models via prompt engineering and rely on predefined workflows or multi-agent scaffolds. The few open-source trained models (Wu et al., 2025b;c; Su et al., 2024) can only perform simple table understanding tasks (with tables compact enough to fit into the prompt) and easily break down when confronted with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning demanded by real-world tasks.

数据分析智能体 (Data-Analytic Agents) 通过生成代码来处理、建模和计算数据,从而发现有用信息或规律性结论,为用户提供支持决策的洞察。然而,现有的数据分析智能体 (Zhang et al., 2023; Hong et al., 2025; Li et al., 2024; Sun et al., 2025; Guo et al., 2024) 绝大多数通过提示工程基于专有模型构建,并依赖预定义的工作流或多智能体框架。少数开源训练模型 (Wu et al., 2025b;c; Su et al., 2024) 仅能执行简单的表格理解任务(即表格尺寸需小到能放入提示词中),当面对现实任务中多样化格式的大规模数据文件以及长周期、多步骤的推理需求时,这些系统极易失效。

Challenges. In this work, we propose to train a generalist, open-source data-analytic agent. This endeavor entails several intrinsic challenges that must be addressed: 1) Insufficient data resources. Training a specialized agent demands a large-scale, high-quality collection of tasks and corresponding solution trajectories. However, publicly available data analysis benchmarks often only provide a limited test set for evaluation purposes and lack step-by-step trajectory annotations, making it infeasible to assemble an effective training corpus from off-the-shelf resources. 2) Improper training strategy. Current agent training strategies typically follow an SFT-then-RL paradigm. Yet, in a new scenario, it remains unclear how to stabilize long-horizon agent training and how to allocate training steps across SFT and RL to achieve optimal performance. 3) Unstable code-based multi-turn rollout. Data files and code interpreters involve intricate memory management. Parallel agentic rollout and multi-turn code generation with limited memory resources will further exacerbate this situation.

挑战。在这项工作中,我们提出训练一个通用型开源数据分析智能体 (AI Agent) 。这项工作需要解决几个固有挑战:1) 数据资源不足。训练专业智能体需要大规模高质量的任务集合与对应解决轨迹,但公开可用的数据分析基准通常仅提供有限的测试集用于评估目的,且缺乏逐步轨迹标注,使得无法从现有资源中组装有效的训练语料库。2) 训练策略不当。当前智能体训练策略通常遵循SFT后RL的模式,但在新场景中,如何稳定长周期智能体训练以及如何在SFT和RL间分配训练步骤以获得最优性能仍不明确。3) 不稳定的基于代码的多轮展开。数据文件和代码解释器涉及复杂的内存管理,有限内存资源下的并行智能体展开和多轮代码生成将进一步加剧这种情况。

The DATAMIND Pipeline. In response to the above challenges, we introduce DATAMIND, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. To construct a large-scale training corpus, we begin by harvesting a diverse collection of data files in various formats and domains from the Internet and open communities. Then, we apply a fine-grained task taxonomy (see Fig.1a) and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of our synthesized queries. Next, we adopt a knowledge-augmented trajectory sampling strategy to improve both the validity and reliability of synthesized trajectories. A model-based judge performs self-consistency filtering on these trajectories, followed by rule-based checks. The judgment signal is also fed back to the model to encourage refinement, enriching the thinking patterns present in the final training set. During training, we combine the SFT loss and the RL loss with a dynamic coefficient that schedules the relative weight of SFT versus RL across training steps, allowing us to balance exploitation and exploration and thus stabilize training. For parallel multi-turn rollout, we asynchronize agent generation and code execution and use a chunk-wise code maintenance method to reduce peak memory usage. Moreover, we sandbox each trajectory in an isolated environment with strict caps on execution time and memory usage, enabling stable code-based multi-turn rollout.

DATAMIND 流程。针对上述挑战, 我们推出了 DATAMIND, 这是一个可扩展的数据合成和智能体训练方案, 旨在构建通用数据分析智能体。为构建大规模训练语料库, 我们首先从互联网和开放社区收集各种格式和领域的数据文件。接着, 我们采用细粒度任务分类法 (见图1a) 和递归式由易到难的任务组合机制, 以增加合成查询的多样性和难度。随后, 我们采用知识增强的轨迹采样策略, 同时提升合成轨迹的有效性和可靠性。基于模型的评判器对这些轨迹执行自洽性过滤, 并进行基于规则的检查。评判信号还将反馈给模型以促进改进, 从而丰富最终训练集中存在的思维模式。在训练期间, 我们将 SFT 损失和 RL 损失与动态系数相结合, 以调度训练步骤中 SFT 与 RL 的相对权重, 使我们能够平衡利用和探索以稳定训练。对于并行多轮展开, 我们将智能体生成与代码执行异步化, 并利用分块式代码维护方法来降低峰值内存使用量。此外, 我们将每个轨迹沙盒化在隔离环境中, 并严格限制执行时间和内存使用量, 从而实现稳定的基于代码的多轮展开。

Results and Insights. Through the DATAMIND pipeline, we curate DATAMIND-12K, a high-quality training set that spans diverse task categories and data file formats for data-analytic tasks. When trained on DATAMIND-12K, our 14B model, DATAMIND-14B, achieves a new state-of-the-art with an average score of $71.16\%$ on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5 and surpassing all open-source models by a substantial margin (see Fig.1b). Our DATAMIND-7B also performs best among all open-source models with a score of $68.10\%$. Our additional analysis studies yield three insights for the community: 1) self-consistency filtering matters more than best-trajectory selection; 2) the SFT loss can be an effective stabilizer for RL training, but can also be the culprit of unstable training; 3) RL can narrow the performance gap between different base models, but can hardly reverse their order.

结果与洞察。通过DATAMIND流程,我们构建了DATAMIND-12K高质量训练集,该数据集涵盖数据分析任务的多样化任务类别和数据文件格式。基于DATAMIND-12K训练的14B模型DATAMIND-14B在多项数据分析基准测试中取得71.16%的平均分,刷新了最高水平,超越最强商业基线DeepSeek-V3.1与GPT-5,并大幅领先所有开源模型 (见图1b) 。我们的DATAMIND-7B模型同样以68.10%的得分位居开源模型榜首。额外分析研究为社区带来三项重要发现: 1) 自一致性过滤比最优轨迹选择更为关键; 2) SFT损失可作为强化学习训练的有效稳定器,但也可能引发训练不稳定性; 3) 强化学习能缩小不同基础模型的性能差距,但难以逆转优劣排序。


Figure 2: The Pipeline of DATAMIND. DATAMIND applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective including both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework.

图 2: DATAMIND 的流程架构。DATAMIND 采用 1) 细粒度任务分类与递归式由易到难的任务组合机制; 2) 知识增强轨迹采样策略及基于模型与规则的筛选机制; 3) 动态可调训练目标 (包含 SFT 与 RL 损失函数); 4) 基于代码的内存节约型稳定多轮展开框架。

2 PROBLEM DEFINITION

2 问题定义

A data analysis task $u$ is typically represented as a quadruple $\boldsymbol{u}=(q,f,d,a)$, comprising the user query $q$, the data file $f$, the data description $d$, and the answer $a$, where the data file $f$ may be provided in a variety of formats (.csv, .xlsx, .sqlite, etc.), and the data description $d$ is optional.

数据分析任务 $u$ 通常表示为一个四元组 $\boldsymbol{u}=(q,f,d,a)$ ,包含用户查询 $q$ 、数据文件 $f$ 、数据描述 $d$ 和答案 $a$ ,其中数据文件 $f$ 可能以多种格式提供 (.csv, .xlsx, .sqlite 等) ,数据描述 $d$ 为可选项。

Our agent framework adheres to the prevailing ReAct (Yao et al., 2023) paradigm. Upon receiving a task, the agent is required to iterate through multiple rounds of Thought-Action-Observation cycles and finally produce an answer. In the data analysis scenario, Thought denotes the agent's reasoning and reflection process conditioned on the current context; Action refers to the agent's invocation of code to process and compute over the data files, or the generation of the final answer. The code may be written in Python or SQL, depending on the data file format; Observation consists of the execution feedback returned by the environment (i.e., the code interpreter).

我们的智能体框架遵循主流的ReAct范式 (Yao et al., 2023) 。当接收到任务时,智能体需要迭代多轮思考-行动-观察循环,并最终生成答案。在数据分析场景中,思考表示智能体基于当前上下文进行的推理与反思过程;行动指智能体调用代码处理数据文件或生成最终答案,代码可根据数据文件格式选用Python语言或SQL编写;观察则包含环境(即代码解释器)返回的执行反馈。

Given task $u$, let the three parts of a Thought-Action-Observation loop be represented by $(\tau,\alpha,o)$, respectively. Then the agent's historical trajectory $h$ at time step $t$ can be denoted as:

给定任务 $u$ ,令思维-行动-观察循环分别表示为 $(\tau,\alpha,o)$ 。则智能体在时间步 $t$ 的历史轨迹 $h$ 可表示为:

$$
h_{t}=(u,\tau_{0},\alpha_{0},o_{0},\tau_{1},\alpha_{1},o_{1},\ldots,\tau_{t-1},\alpha_{t-1},o_{t-1}). \tag{1}
$$

Conditioned on the history trajectory $h_{t}$, the agent with parameters $\theta$ produces its next thought $\tau_{t}$ and action $\alpha_{t}$ according to the policy $\pi_{\theta}(\tau_{t},\alpha_{t}|h_{t})$, and receives an observation $o_{t}$ from the code interpreter after action $\alpha_{t}$ is executed. The whole trajectory terminates either when the agent emits an answer or when a predefined maximum number of rounds $T$ is reached. For simplicity, in the following sections, we denote the input provided to the agent (including $q$, $f$, and $d$) as $x$ and the trajectory (including the answer $a$) sampled from the agent as $y\sim\pi_{\theta}(\cdot|x)$.

在历史轨迹 $h_{t}$ 的条件下,具有参数 $\theta$ 的智能体将根据策略 $\pi_{\theta}(\tau_{t},\alpha_{t}|h_{t})$ 生成其下一个思考 $\tau_{t}$ 和动作 $\alpha_{t}$,并在执行动作 $\alpha_{t}$ 后从代码解释器接收观察结果 $o_{t}$。当智能体发出答案或达到预定义的最大轮数 $T$ 时,整个轨迹终止。为简化表述,在后续章节中,我们将提供给智能体的输入部分(包括 $q,f$ 和 $d$)记为 $x$,将从智能体采样的轨迹(包括答案 $a$)记为 $y\sim\pi_{\theta}(\cdot|x)$。
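
The Thought-Action-Observation loop defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `Step`, `rollout`, the `policy` callable, and the `interpreter` callable are hypothetical names, and the policy is assumed to signal a final answer with a boolean flag.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    thought: str                       # tau_t
    action: str                        # alpha_t: code to run, or the final answer
    observation: Optional[str] = None  # o_t; stays None for the answering step


def rollout(
    task: str,
    policy: Callable[[str, List[Step]], Tuple[str, str, bool]],
    interpreter: Callable[[str], str],
    max_rounds: int,
) -> Tuple[List[Step], Optional[str]]:
    """Run Thought-Action-Observation rounds until an answer or the round cap."""
    history: List[Step] = []
    for _ in range(max_rounds):
        # The policy sees the task and the full history h_t (hypothetical interface).
        thought, action, is_answer = policy(task, history)
        if is_answer:
            history.append(Step(thought, action))
            return history, action
        history.append(Step(thought, action, interpreter(action)))
    return history, None  # round cap reached without an answer
```

The history list plays the role of $h_t$: each round appends one $(\tau_t, \alpha_t, o_t)$ triple, and the loop stops on an answer or after `max_rounds` rounds.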

3 SCALING DATA-ANALYTIC AGENT DATA

3 数据分析型智能体 (Data-Analytic Agent) 的数据扩展

3.1 FILE COLLECTION AND QUERY SYNTHESIS

3.1 文件收集与查询合成

Data File Collection. First, we need a large number of raw data files $f$ to scale up the potential volume of synthesized tasks. Fortunately, the Internet and open community benchmarks already host a massive reservoir of such files. We first target Kaggle, which contains tens of thousands of .csv and .xlsx spreadsheets. Using the official Kaggle API, we crawl a diverse subset of files spanning multiple domains, and then discard files that i) cannot be loaded, ii) are extremely small ($<20$ rows) or large ($>1{,}000$ rows), or iii) contain anomalous data types. After this pipeline, we retain 3,400 .csv and 560 .xlsx files. For database files, we draw primarily from the training sets of BIRD (Li et al., 2023b) and OmniSQL (Li et al., 2025a), both of which are high-quality corpora widely used in the Text-to-SQL field. We sample from these sources and apply an analogous filtering pipeline, finally obtaining 1,954 .sqlite files.

数据文件收集。首先,我们需要大量原始数据文件 $f$ 以扩大潜在合成任务规模。幸运的是,互联网和开放社区基准已存有海量此类文件。我们首先瞄准Kaggle平台,该平台包含数万份.csv和.xlsx格式电子表格。通过官方Kaggle API接口,我们爬取了跨多个领域的多样化文件子集,随后剔除以下文件:i) 无法加载的文件,ii) 过小($<20$ 行)或过大($>1{,}000$ 行)的文件,或 iii) 包含异常数据类型的文件。经过此流程,我们保留3,400份.csv文件和560份.xlsx文件。对于数据库文件,我们主要从BIRD (Li et al., 2023b) 和OmniSQL (Li et al., 2025a) 的训练集中选取,这两个语料库都是文本到SQL领域广泛使用的高质量数据集。同样地,我们从这些源中抽样并应用类似的过滤流程,最终获得1,954份.sqlite文件。
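
A minimal sketch of the three discard rules for .csv files, using only the standard library. It assumes the size bounds are 20 and 1,000 data rows, and it substitutes a simple "every row has the header's width" test for the anomalous-data-type check, whose exact definition the text does not give. `keep_csv` is a hypothetical helper name.

```python
import csv
from pathlib import Path

MIN_ROWS, MAX_ROWS = 20, 1000  # assumed row bounds for "extremely small" and "large"


def keep_csv(path: Path) -> bool:
    """Return True if a .csv file passes the load, size, and shape checks."""
    try:
        with path.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.reader(fh))
    except (OSError, UnicodeDecodeError, csv.Error):
        return False  # i) the file cannot be loaded
    if len(rows) < 2:
        return False  # no header or no data rows
    header, body = rows[0], rows[1:]
    if not MIN_ROWS <= len(body) <= MAX_ROWS:
        return False  # ii) too small or too large
    # iii) stand-in for the anomalous-type check (an assumption, not the paper's rule)
    return all(len(row) == len(header) for row in body)
```

An analogous filter for .xlsx or .sqlite files would swap the loader and keep the same three-stage structure.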

Query Categorization and Synthesis. To generate specific queries, we devise an automated script to extract the meta-information $d$ of each data file, such as table headers, column names, data types, and representative rows, and then feed this metadata into DeepSeek-V3 (DeepSeek-AI, 2024) to synthesize queries $q$. To ensure both the diversity and the granularity of the generated questions, we refer to and refine the taxonomy in Wu et al. (2025b) and classify data analysis tasks into 18 fine-grained categories (see Fig.1a). For each category, we carefully curate $4\sim6$ exemplar queries that vary in complexity and domain and serve as few-shot demonstrations. Under the guidance of these type-specific contexts, every data file is used to generate a diverse set of queries that span the full spectrum of the proposed taxonomy. To further elevate query complexity, we adopt a recursive easy-to-hard composition scheme that chains multiple task types, i.e., the output of one task is fed as input to the next. By iterating $2\sim5$ times, we progressively amplify the difficulty and create multi-hop analytic challenges that go well beyond the capability required by any single task type. The prompts for query synthesis can be found in Appx.G.2.

查询分类与生成。为生成特定查询,我们设计自动化脚本来提取每个数据文件的元信息$d$(如表头、列名、数据类型和代表性数据行),随后将这些元数据输入DeepSeek-V3(DeepSeek-AI, 2024)以合成查询$q$。为确保生成问题的多样性与细粒度特性,我们参考并改进了Wu等人(2025b)的分类体系,将数据分析任务划分为18个细粒度类别(见图1a)。针对每个类别,我们精心编制$4\sim6$个不同复杂度与领域的示例查询作为少样本示例。在这些类型特定上下文的引导下,每个数据文件被用于生成覆盖完整分类体系的多样化查询集。为提升查询复杂度,我们采用递归式由易到难组合策略,将多个任务类型串联形成链式结构——即前序任务的输出作为后续任务的输入。通过$2\sim5$次迭代,我们逐步增强难度,创建出远超单一任务类型能力的多跳分析挑战。查询合成的提示模板详见附录G.2。
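
To make the recursive easy-to-hard composition concrete, the sketch below builds a synthesis instruction in which each step consumes the previous step's output. It is illustrative only: the actual prompts are in Appx.G.2, the function names are hypothetical, and the task-type strings used in the usage example are placeholders rather than the paper's 18 categories.

```python
import random
from typing import List, Sequence


def compose_chain(task_types: Sequence[str], hops: int, rng: random.Random) -> List[str]:
    """Pick `hops` distinct task types to chain (2 to 5 hops, as in the text)."""
    if not 2 <= hops <= 5:
        raise ValueError("hops must be between 2 and 5")
    return rng.sample(list(task_types), hops)


def composition_prompt(chain: Sequence[str], metadata: str) -> str:
    """Build an instruction where step i operates on the result of step i-1."""
    lines = [
        f"Data file metadata:\n{metadata}",
        "Write ONE query that requires these steps in order:",
    ]
    for i, task in enumerate(chain, start=1):
        source = "the data file" if i == 1 else f"the result of step {i - 1}"
        lines.append(f"{i}. {task} applied to {source}")
    return "\n".join(lines)
```

For example, `composition_prompt(["aggregation", "ranking"], "columns: city, sales")` asks for a query whose ranking step operates on the aggregated result, which is the chaining behavior the paragraph describes.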

3.2 EXPERT TRAJECTORY SAMPLING AND FILTERING

3.2 专家轨迹采样与过滤

Knowledge-Augmented Trajectory Sampling. To guarantee the quality of the synthesized trajectories, we introduce a knowledge-augmented trajectory sampling framework. Initially, for each question category, we manually craft a high-level workflow $k$ that encodes procedural knowledge and steers the model during trajectory synthesis. To further boost answer quality, we impose a self-consistency filter. We sample $\mathcal{N}$ independent trajectories per query and employ a judge model $\mathcal{M}$ powered by GPT-4o-mini (OpenAI, 2024b) to verify whether their final answers are consistent, based on their reasoning rationales. Only trajectories that converge to the same answer are retained; among them, the judge model also selects the most concise and accurate one as our training instance $y$:

知识增强轨迹采样。为保证合成轨迹的质量, 我们引入了知识增强轨迹采样框架。首先, 针对每个问题类别, 我们手动构建一个高层次工作流 $k$, 该工作流编码了程序性知识并在轨迹合成过程中引导模型。为进一步提升答案质量, 我们施加了自一致性过滤器。我们对每个查询采样 $\mathcal{N}$ 条独立轨迹, 并采用由 GPT-4o-mini (OpenAI, 2024b) 驱动的评判模型 $\mathcal{M}$ 来验证其最终答案是否与推理依据一致。仅保留收敛到相同答案的轨迹; 其中, 评判模型还会选择最简洁准确的轨迹作为我们的训练实例 $y$。

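$$
\{c,s,y\}=\mathcal{M}\big(\{y^{i}\}_{i=1}^{\mathcal{N}}\big),\quad \{y^{i}\}_{i=1}^{\mathcal{N}}\sim\pi_{\theta_{\mathrm{expert}}}(\cdot|k,x),\quad y=\begin{cases}y^{i}\in\{y^{i}\}_{i=1}^{\mathcal{N}}, & s=1\\ \text{none}, & s=0,\end{cases} \tag{2}
$$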

where $c$ is the chain-of-thought process by which the judge model reaches the binary conclusion $s$ on whether the sampled trajectories are consistent. We use DeepSeek-V3.1 (DeepSeek-AI, 2025) as our expert policy model $\pi_{\theta_{\mathrm{expert}}}$. During implementation, we set $\mathcal{N}=3$. The prompts used for trajectory sampling and for the judge model $\mathcal{M}$ can be found in Appx.G.3 and Appx.G.4, respectively. We extract the final answer from the selected trajectory as the synthesized answer $a$ for the corresponding query $q$. However, this pipeline inherently biases us toward easier queries, whose answers are more likely to coincide. To counteract this, we refine the high-level workflow knowledge $k$ into more granular, step-by-step instructions for categories that exhibit low inter-trajectory consistency. Moreover, for trajectories that fail the consistency check, we feed the judge model's chain-of-thought back to the agent as an external critique, prompting it to reflect and revise its reasoning path:

其中 $c$ 是评判模型为判断采样轨迹是否一致而得出二元结论 $s$ 的思维链过程。我们采用 DeepSeek-V3.1 (DeepSeek-AI, 2025) 作为专家策略模型 $\pi_{\theta_{\mathrm{expert}}}$。实际实现时设定 $\mathcal{N}=3$。轨迹采样所用的提示词和评判模型 $\mathcal{M}$ 分别详见附录 G.3 和附录 G.4。我们从轨迹中提取最终答案作为对应查询 $q$ 的最终合成答案 $a$。但该方法天然偏向于答案更易趋同的简单查询。为抵消此偏差,我们将高层工作流程知识 $k$ 细化为更细粒度的分步指令,适用于轨迹间一致性较低的类别。此外,对于未通过一致性检验的轨迹,我们将评判模型的思维链作为外部批评反馈给智能体,促使其反思并修正推理路径:

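$$
\{y_{\mathrm{reflected}}^{i}\}_{i=1}^{\mathcal{N}}\sim\pi_{\theta_{\mathrm{expert}}}\big(\cdot|k,x,\{y^{i}\}_{i=1}^{\mathcal{N}},c\big),\quad \text{if } s=0. \tag{3}
$$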

The reflected trajectories $\{y_{\mathrm{reflected}}^{i}\}_{i=1}^{\mathcal{N}}$ will be fed into the judge model $\mathcal{M}$ again to conduct the consistency check and the trajectory selection in Eqn.2. This rescue loop not only salvages additional usable data but also enriches the diversity of thinking patterns embedded in the trajectory pool.

反思后的轨迹 $\{y_{\mathrm{reflected}}^{i}\}_{i=1}^{\mathcal{N}}$ 将被再次输入评判模型 $\mathcal{M}$,以执行公式2中的一致性检验和轨迹选择。该救援循环不仅能挽回更多可用数据,还能增强轨迹池中思维模式的多样性。
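
The sample, judge, reflect, re-judge flow can be sketched as follows. This is a hedged illustration: `sampler` and `judge` are hypothetical callables standing in for the expert policy and the judge model $\mathcal{M}$, and a single rescue round is assumed because the text does not state how many reflection rounds are run.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = str
# judge(trajectories) -> (critique c, consistency flag s, selected trajectory or None)
Judge = Callable[[List[Trajectory]], Tuple[str, bool, Optional[Trajectory]]]
# sampler(x, k, previous_batch, critique) -> one new trajectory
Sampler = Callable[[str, str, Optional[List[Trajectory]], Optional[str]], Trajectory]


def sample_with_rescue(
    x: str, k: str, sampler: Sampler, judge: Judge, n: int = 3
) -> Optional[Trajectory]:
    """Sample n trajectories; keep one if they agree, else reflect once and re-judge."""
    batch = [sampler(x, k, None, None) for _ in range(n)]
    critique, consistent, chosen = judge(batch)
    if consistent:
        return chosen
    # Rescue loop: condition on the failed batch and on the judge's critique.
    reflected = [sampler(x, k, batch, critique) for _ in range(n)]
    _, consistent, chosen = judge(reflected)
    return chosen if consistent else None  # None means the query is dropped
```

Returning `None` corresponds to the $s=0$ branch after the rescue attempt, in which case no training instance is produced for the query.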

Rule-based Trajectory Filtering. In addition to discarding inconsistent trajectories, we apply three further rule-based filtering stages. 1) Format compliance. We drop any trajectory that deviates from the ReAct format, ensuring that every remaining trajectory can be losslessly converted into our target training schema. 2) Length control. We filter out trajectories whose final answer exceeds 1,024 tokens, preventing the model from exploiting spurious hallucinations to artificially hit the correct string. 3) Linguistic integrity. We remove trajectories containing garbled text or intermingled natural languages, eliminating samples that could destabilize agent training. After the full filtering pipeline, we retain 11,707 high-quality trajectories, which we name DATAMIND-12K.

基于规则的轨迹过滤。除了剔除不一致的轨迹外,我们还实施了三个基于规则的过滤阶段:1) 格式合规性。我们舍弃所有偏离ReAct格式的轨迹,确保每条保留的轨迹都能无损转换为目标训练模式;2) 长度控制。我们过滤最终答案超过1024个token的轨迹,防止模型利用虚假幻觉人为匹配正确答案;3) 语言完整性。我们移除包含乱码或混合自然语言的轨迹,消除可能影响智能体训练稳定性的样本。经过完整过滤流程后,我们保留了11,707条高质量轨迹,命名为DATAMIND-12K。
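
A rough sketch of the three rule-based stages on a single trajectory string. Several details are assumptions: the `<think>` and `<answer>` tags follow the output format described in the Reward Design paragraph, whitespace splitting stands in for the real tokenizer, and "intermingled languages" is approximated by rejecting CJK characters in what is assumed to be an English-only trajectory.

```python
import re
from typing import Callable

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
CJK_RE = re.compile(r"[\u4e00-\u9fff]")
MAX_ANSWER_TOKENS = 1024  # answer-length cap from the text


def passes_rules(
    trajectory: str,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> bool:
    """Apply format, length, and linguistic checks to one trajectory."""
    answers = ANSWER_RE.findall(trajectory)
    # 1) Format compliance: exactly one final answer and at least one reasoning block.
    if len(answers) != 1 or "<think>" not in trajectory:
        return False
    # 2) Length control on the final answer (whitespace tokens as a stand-in).
    if count_tokens(answers[0]) > MAX_ANSWER_TOKENS:
        return False
    # 3) Linguistic integrity: reject the Unicode replacement character (garbled text)
    #    and, under the English-only assumption, any CJK characters.
    if "\ufffd" in trajectory or CJK_RE.search(trajectory):
        return False
    return True
```

Passing a real tokenizer's counting function as `count_tokens` would make the length check match the 1,024-token threshold more faithfully.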

4 SCALING DATA-ANALYTIC AGENT TRAINING

4 规模化数据-分析智能体训练

Dynamic Control Between SFT and RL. In this paper, we adopt a combined paradigm of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for the agent training. Empirically, we observe that it is difficult to strike a balance between the two stages: the model needs to absorb sufficient knowledge from expert data during SFT, yet excessive imitation often rigidifies exploration during RL. Hence, following Zhang et al. (2025b), we employ a hybrid strategy that dynamically blends on-policy and off-policy learning, allowing the training procedure to flexibly trade off between exploitation of expert knowledge and continued exploration.

动态调控SFT与RL训练。本文采用监督微调 (SFT) 与强化学习 (RL) 相结合的智能体训练范式。实证研究发现,两个阶段的平衡难以把握:模型需要在SFT阶段充分吸收专家数据知识,但过度模仿往往会导致RL阶段的探索僵化。因此,我们遵循Zhang et al. (2025b) 的方法,采用动态混合在线与离线学习的策略,使训练过程能灵活权衡专家知识利用与持续探索。

Given the training dataset $\mathcal{D}$ , we express our SFT loss as:

给定训练数据集 $\mathcal{D}$,我们将SFT损失函数表示为:
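
$$
\mathcal{L}_{\mathrm{SFT}}(\theta)=-\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|}\mathbb{I}(y_{t}\notin o)\cdot\log\pi_{\theta}(y_{t}|x,y_{<t})\right], \tag{4}
$$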

where $\mathbb{I}(y_{t}\notin o)$ is an indicator function that masks out any tokens produced by the environment feedback, ensuring that the model is optimized only on the agent-generated portion of the trajectory. For RL, we use the Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 2025) algorithm, minimizing the following function:

其中 $\mathbb{I}(y_{t}\notin o)$ 是指示函数,用于屏蔽环境反馈产生的所有 token,确保模型仅在智能体生成的轨迹部分进行优化。对于强化学习 (Reinforcement Learning),我们采用解耦裁剪与动态采样策略优化 (Decoupled Clip and Dynamic Sampling Policy Optimization,DAPO) 算法 (Yu et al., 2025),最小化以下函数:
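
$$
\begin{aligned}
\mathcal{L}_{\mathrm{DAPO}}(\theta)=&-\mathbb{E}_{(x,y)\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot|x)}\left[\frac{1}{\sum_{i=1}^{G}|y_{i}|}\sum_{i=1}^{G}\sum_{t=1}^{|y_{i}|}\min\Big(r_{i,t}(\theta)\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),1-\varepsilon_{\mathrm{low}},1+\varepsilon_{\mathrm{high}}\big)\hat{A}_{i,t}\Big)\right]\\
&\mathrm{s.t.}\quad 0<\big|\{y_{i}\mid\mathrm{is\_equivalent}(y,y_{i})\}\big|<G,
\end{aligned}
\tag{5}
$$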

where $\{y_{i}\}_{i=1}^{G}$ is a group of $G$ trajectories sampled from the agent policy $\pi_{\theta_{\mathrm{old}}}$ and $y$ is the expert trajectory. Similar to SFT, any tokens emitted by the environment are discarded when computing the objective. $r_{i,t}(\theta)$ denotes the per-token importance-sampling ratio, and $\hat{A}_{i,t}$ is the advantage of the $i$-th response, obtained by normalizing the group-level rewards $\{R_{i}\}_{i=1}^{G}$:

其中 $\{y_{i}\}_{i=1}^{G}$ 是从智能体策略 $\pi_{\theta_{\mathrm{old}}}$ 中采样的一组 $G$ 条轨迹,$y$ 是专家轨迹。与监督微调 (SFT) 类似,在计算目标函数时会丢弃环境生成的任何 token。$r_{i,t}(\theta)$ 表示每个 token 的重要性采样比率,$\hat{A}_{i,t}$ 是第 $i$ 个响应经过组级奖励 $\{R_{i}\}_{i=1}^{G}$ 归一化后得到的优势函数。
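
$$
r_{i,t}(\theta)=\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}|x,y_{i,<t})},\quad \hat{A}_{i,t}=\frac{R_{i}-\mathrm{mean}(\{R_{i}\}_{i=1}^{G})}{\mathrm{std}(\{R_{i}\}_{i=1}^{G})}. \tag{6}
$$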

The inequality in Eqn.5 serves as a filtering criterion: it discards groups of trajectories that lack optimization utility, namely those whose rewards are uniformly 0 or uniformly 1, to prevent spurious gradient updates.

公式 5 中的不等式作为过滤标准, 用于丢弃缺乏优化效用的轨迹 (其奖励值全为 0 或全为 1), 以防止虚假梯度更新。
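
The group-normalized advantage, the dynamic-sampling filter, and the clipped per-token term can be sketched in plain Python. This is a sketch under stated assumptions, not the training code: the population standard deviation is assumed (the text does not say which estimator is used), and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(rewards: List[float], correct: List[bool]) -> Optional[List[float]]:
    """Normalize rewards within one group; return None if the group is filtered out."""
    n_correct = sum(correct)
    if not 0 < n_correct < len(correct):
        return None  # all wrong or all right: the constraint in Eqn.5 fails
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return None  # guard against division by zero (an added safety check)
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio: float, adv: float, eps_low: float, eps_high: float) -> float:
    """One token's contribution: min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)
```

A group with rewards `[1, 0, 0, 1]` yields advantages `[1, -1, -1, 1]`, while a group in which every answer is correct returns `None` and contributes no policy-gradient term.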

Finally, unlike the conventional SFT-then-RL pipeline, we jointly optimize the agent by combining the SFT and RL objectives with a dynamically balanced weighting factor:

最后,与传统先进行监督微调 (SFT) 再进行强化学习 (RL) 的流程不同,我们通过结合 SFT 和 RL 目标并采用动态平衡权重因子来联合优化智能体:
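
$$
\mathcal{L}_{\mathrm{Final}}(\theta)=\gamma\mathcal{L}_{\mathrm{SFT}}(\theta)+(1-\gamma)\mathcal{L}_{\mathrm{DAPO}}(\theta), \tag{7}
$$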

where $\gamma\in[0,1]$ varies dynamically throughout training. In our implementation, $\gamma$ is initialized to a large value so that the agent first acquires knowledge from expert data via the SFT loss, and is then annealed to a small value to encourage extensive exploration through RL. Please refer to $\S5.3$ for our analysis of different $\gamma$ settings. Importantly, for any trajectory that is filtered out by the inequality in Eqn.5, we compute only the SFT loss. To increase the likelihood of producing eligible trajectories during the early stage of RL training, we perform a cold start using DATAMIND-12K before the process described above. We also analyze the effect of the cold start on RL training in $\S5.3$.

其中 $\gamma\in[0,1]$ 在训练过程中动态变化。在我们的实现中,$\gamma$ 初始化为较大值,使智能体首先通过 SFT 损失从专家数据获取知识,随后退火至较小值以鼓励通过强化学习进行广泛探索。关于不同 $\gamma$ 设置的分析请参阅 $\S5.3$。重要的是,对于被公式5不等式过滤掉的任何轨迹,我们将仅计算 SFT 损失。为提升强化学习训练早期阶段生成合格轨迹的概率,我们在上述过程之前使用 DATAMIND-12K 执行冷启动。我们还在 $\S5.3$ 中分析了冷启动对强化学习训练的影响。
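
A small sketch of the masked SFT loss and the blended objective, operating on per-token log-probabilities. The function names are hypothetical, and one detail is an assumption: when a group is filtered out, the sketch keeps the $\gamma$ weighting on the SFT term and simply drops the DAPO term, which is one possible reading of "compute only the SFT loss".

```python
from typing import List, Optional


def sft_loss(logprobs: List[float], from_env: List[bool]) -> float:
    """Negative log-likelihood summed over agent-generated tokens only."""
    # Tokens produced by the environment (interpreter output) are masked out.
    return -sum(lp for lp, env in zip(logprobs, from_env) if not env)


def final_loss(gamma: float, l_sft: float, l_dapo: Optional[float]) -> float:
    """Blend the two losses; l_dapo=None marks a group rejected by the filter."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if l_dapo is None:
        return gamma * l_sft  # assumption: DAPO term dropped, SFT weight unchanged
    return gamma * l_sft + (1.0 - gamma) * l_dapo
```

With a large `gamma` the update is dominated by imitation of the expert trajectory, and as `gamma` shrinks the on-policy term takes over.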

Void Turns Filtering. In multi-turn agentic training, the model can experience distributional drift due to external feedback and compounding errors across turns during rollout, which can easily result in trajectory collapse and thereby destabilize RL training (Xue et al., 2025; Baronio et al., 2025; Mai et al., 2025). We also observe this phenomenon in our experiments. To stabilize training, we directly mask out the entire loss contributed by trajectories that contain void turns. Here, a void turn is defined as an agentic loop that fails to produce a valid code snippet or answer.

空轮次过滤。在多轮智能体训练中,由于外部反馈和多轮推演中的误差累积,模型可能经历分布漂移,这容易导致轨迹崩溃,从而使强化学习训练失稳 (Xue et al., 2025; Baronio et al., 2025; Mai et al., 2025)。我们在实验中也观察到这一现象。为稳定训练,我们直接屏蔽包含空轮次的轨迹所产生的全部损失。此处,空轮次被定义为未能生成有效代码片段或答案的智能体循环。
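
Void-turn masking can be sketched as a per-trajectory loss weight. The sketch assumes the `<code>` and `<answer>` tags from the Reward Design paragraph and treats "valid" as "a non-empty tagged block is present", which is a simplification of whatever validity check the authors use. The helper names are hypothetical.

```python
import re
from typing import List


def _has_block(tag: str, text: str) -> bool:
    """True if text contains a non-empty <tag>...</tag> block."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match is not None and match.group(1).strip() != ""


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a code snippet nor an answer."""
    return not (_has_block("code", turn) or _has_block("answer", turn))


def loss_weight(turns: List[str]) -> float:
    """0.0 masks the whole trajectory's loss if any turn is void, else 1.0."""
    return 0.0 if any(is_void_turn(turn) for turn in turns) else 1.0
```

Multiplying each trajectory's loss by `loss_weight(turns)` removes collapsed rollouts from the gradient without changing the rest of the batch.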

Agentic Code-based Multi-turn Rollout. A stable environment plays a key role in stable on-policy RL training. In data-analytic agent training, massive concurrent file I/O and code execution can easily lead to environment crashes, especially with limited memory resources. To prevent this, we implement three optimizations: 1) Asynchronous interaction. We asynchronize model generation and code execution for different data samples, which decouples peak GPU and CPU memory demands and avoids simultaneous file I/O and code-execution spikes. 2) Chunked code maintenance. We implement a lightweight, notebook-style code generation strategy. The model only needs to produce the code snippet required for the current reasoning step, effectively reducing generation latency. Furthermore, whereas conventional notebook systems maintain a global variable pool, which is memory-intensive, we retain only the textual code chunks. At runtime, we concatenate the active snippet with its predecessors, yielding the same global execution effect without the memory overhead. 3) Security control. To ensure secure code execution, we isolate the runtime environment for each trajectory, enforce per-trajectory limits on CPU time and peak memory, and filter any snippet containing insecure function calls before execution. Additionally, we provide an automatic package-installation mechanism that dynamically checks for and installs missing Python packages.

基于代码的多轮推演智能体。稳定环境在稳定的同策略强化学习训练中起着关键作用。在数据分析智能体训练过程中,海量并发文件I/O和代码执行极易导致环境崩溃,特别是在内存资源有限的情况下。为此我们实施了三项优化措施:1) 异步交互。我们对不同数据样本的模型生成与代码执行进行异步处理,从而解耦GPU与CPU的内存峰值需求,避免文件I/O与代码执行同时达到峰值。2) 分块代码维护。我们采用轻量级笔记本式代码生成策略,模型仅需生成当前推理步骤所需的代码片段,有效降低生成延迟。与传统笔记本系统维护全局变量池(内存消耗大)不同,我们仅保留文本化代码块,运行时将当前片段与前置片段拼接,在实现相同全局执行效果的同时避免内存开销。3) 安全控制。为确保代码安全执行,我们为每条轨迹隔离运行环境,实施单轨迹CPU时间和峰值内存限制,并在执行前过滤包含不安全函数调用的代码片段。此外,我们还提供自动包安装机制,动态检查并安装未配置的Python语言包。
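
The chunked code maintenance idea can be illustrated with a small session class that stores only code text and replays earlier chunks in a fresh subprocess. This is a sketch, not the authors' framework: `ChunkedSession` is a hypothetical name, the wall-clock `timeout` stands in for a CPU-time limit, the memory cap relies on the Unix-only `resource` module, and keeping only successful chunks in the history is an assumption.

```python
import subprocess
import sys
from typing import List

try:
    import resource  # Unix only; memory capping is skipped elsewhere
except ImportError:
    resource = None


def _limit_memory(max_bytes: int):
    def apply() -> None:
        try:
            resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
        except (ValueError, OSError):
            pass  # platform refused the limit; rely on the timeout alone
    return apply


class ChunkedSession:
    """Keeps code chunks as text; each step replays earlier chunks in a new process."""

    def __init__(self, timeout_s: float = 10.0, max_bytes: int = 2 << 30):
        self.chunks: List[str] = []
        self.timeout_s = timeout_s
        self.max_bytes = max_bytes

    def run(self, code: str) -> str:
        history = "\n".join(self.chunks)
        script = (
            "import contextlib, io\n"
            f"_history = {history!r}\n"
            "with contextlib.redirect_stdout(io.StringIO()):\n"
            "    exec(_history, globals())\n"  # replay earlier chunks silently
            + code
        )
        try:
            proc = subprocess.run(
                [sys.executable, "-c", script],
                capture_output=True, text=True, timeout=self.timeout_s,
                preexec_fn=_limit_memory(self.max_bytes) if resource else None,
            )
        except subprocess.TimeoutExpired:
            return "TimeoutError: execution exceeded the time limit"
        if proc.returncode == 0:
            self.chunks.append(code)  # assumption: only successful chunks are kept
        return proc.stdout + proc.stderr
```

The trade-off is recomputation for memory: no interpreter state lives between turns, so idle trajectories hold only text, at the cost of re-running earlier chunks (including any side effects they have) on every step.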

Reward Design. Our reward mainly comprises three components: the format reward $r_{\mathrm{format}}$, the answer reward $r_{\mathrm{answer}}$, and the length reward $r_{\mathrm{length}}$. The agent is required to enclose its reasoning process within <think>...</think> tags, place any generated data-processing code between <code> and </code>, and wrap its final answer in <answer>...</answer>. The environment's execution results are placed between <interpreter> and </interpreter>. For the answer reward, as many answers are descriptive and thus resist rule-based verification, we adopt a model-as-judge powered by GPT-4o-mini (OpenAI, 2024b). We engineer a dedicated LLM evaluation prompt, detailed in Appx.G.4. Both $r_{\mathrm{format}}$ and $r_{\mathrm{answer}}$ are binary, taking only the values 0 and 1. To mitigate the risk of the agent hacking the answer reward by hallucinating excessive tokens, we further impose a length-based penalty to discourage overly verbose outputs. We define the length reward and the final reward as:

奖励设计。我们的奖励主要包含三个组成部分:格式奖励 rformat、答案奖励 ranswer 和长度奖励 rlength。智能体需要将其推理过程包裹在 <think> </think> 标签内,将生成的数据处理代码置于 <code></code> 之间,并将最终答案包裹在 <answer>...</answer> 中。环境的执行结果将放置在 <interpreter></interpreter> 之间。对于答案奖励,由于许多答案是描述性的,难以通过基于规则的方法进行验证,我们采用了由 GPT-4o-mini (OpenAI, 2024b) 驱动的模型即评判 (model-as-judge) 方法。我们设计了一个专门的大语言模型评估提示词,详见附录 G.4。$r_{\mathrm{format}}$ 和 $r_{\mathrm{answer}}$ 都是二元的,取值仅为 0 和 1。为了减轻智能体通过产生过多 Token 来攻击答案奖励的风险,我们进一步施加了基于长度的惩罚,以抑制过于冗长的输出。我们定义长度奖励和最终奖励如下:

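$$
R=\begin{cases}r_{\mathrm{length}}\cdot r_{\mathrm{answer}}, & r_{\mathrm{answer}}=1\\ 0, & r_{\mathrm{format}}=1,\ r_{\mathrm{answer}}=0\\ -0.1, & r_{\mathrm{format}}=0,\ r_{\mathrm{answer}}=0\end{cases}\qquad r_{\mathrm{length}}=\begin{cases}1, & l\leq l_{\mathrm{min}}\\ \dfrac{l_{\mathrm{max}}-l}{l_{\mathrm{max}}-l_{\mathrm{min}}}\cdot0.5+0.5, & l_{\mathrm{min}}<l\leq l_{\mathrm{max}}\\ 0.5, & l_{\mathrm{max}}<l\end{cases} \tag{8}
$$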

We incentivize correct outputs: as long as the predicted answer is judged to match the ground truth, the model receives a high reward $(\geq0.5)$. The specific value is length-dependent: we award the full reward if the answer length $l$ does not exceed $l_{\mathrm{min}}$; the reward decays linearly to 0.5 as the length grows from $l_{\mathrm{min}}$ to $l_{\mathrm{max}}$; any sequence longer than $l_{\mathrm{max}}$ receives a fixed length reward of 0.5. Based on our observations, we set $l_{\mathrm{min}}$ and $l_{\mathrm{max}}$ to 256 and 1024, respectively, in our experiments.

我们激励正确的输出。因此只要预测答案与标准答案匹配,模型就会获得高奖励 $(\geq0.5)$ 。具体数值与长度相关:当答案长度 $l$ 不超过 $l_{\mathrm{min}}$ 时给予全额奖励;当长度介于 $l_{\mathrm{min}}$ 和 $l_{\mathrm{max}}$ 之间时线性衰减至0.5;任何超过 $l_{\mathrm{max}}$ 的序列都只获得0.5的固定长度奖励。根据我们的观察,在实验中将 $l_{\mathrm{min}}$ 和 $l_{\mathrm{max}}$ 分别设置为256和1024。
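
The reward described above can be written as two short functions. This is a sketch under assumptions: the format-failure value is read as $-0.1$ (the sign is not legible in the extracted formula), and the binary format and answer rewards are taken as given inputs rather than computed by a judge model.

```python
L_MIN, L_MAX = 256, 1024  # thresholds reported in the text


def length_reward(l: int, l_min: int = L_MIN, l_max: int = L_MAX) -> float:
    """1.0 up to l_min, linear decay to 0.5 at l_max, then a flat 0.5."""
    if l <= l_min:
        return 1.0
    if l <= l_max:
        return (l_max - l) / (l_max - l_min) * 0.5 + 0.5
    return 0.5


def final_reward(r_format: int, r_answer: int, l: int) -> float:
    """Combine the binary format/answer rewards with the length reward."""
    if r_answer == 1:
        return length_reward(l) * r_answer
    # Wrong answer: 0 if the format is valid, a small penalty otherwise
    # (the -0.1 value is an assumption about the garbled source).
    return 0.0 if r_format == 1 else -0.1
```

For a correct answer of 640 tokens, halfway between the two thresholds, the reward is 0.75; a correct answer therefore never scores below 0.5, which keeps it above any incorrect outcome.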

5 EXPERIMENTS

5 实验

5.1 EXPERIMENTAL SETTINGS

5.1 实验设置

Datasets and Metrics. We evaluate our model on three datasets related to data analysis: DABench (Hu et al., 2024), TableBench (Wu et al., 2025b), and BIRD (Li et al., 2023b). Our evaluation protocol aligns with our answer reward method, where a judge model powered by GPT-4o-mini (OpenAI, 2024b) is used to evaluate the correctness of the final answer. We report both pass@1 and pass@3 scores for all methods. Please refer to Appx.D for more details.

数据集与评估指标。我们在三个数据分析相关数据集上评估模型:DABench (Hu et al., 2024)、TableBench (Wu et al., 2025b) 和 BIRD (Li et al., 2023b)。评估方案与答案奖励方法保持一致,采用基于GPT-4o-mini (OpenAI, 2024b) 的裁判模型来评估最终答案的正确性。我们报告所有方法的pass $@1$ 和pass $@3$ 分数,更多细节请参阅附录D。

Models and Baselines. We compare our models with five strong proprietary models and four outstanding open-source models (see Tab. 1). In addition, we select four open-source models that have been explicitly trained for data-analysis-related tasks: TableLLM (Wu et al., 2025b), Table-R1 (Wu et al., 2025c), OmniSQL (Li et al., 2025a), and SQL-R1 (Ma et al., 2025). We use Qwen2.5-Coder-7B and 14B (Hui et al., 2024) as backbone models when comparing the different baselines. Detailed model information and reproduction protocols for all baselines are provided in Appx. E.

Training and Inference Setups. We use LlamaFactory (Zheng et al., 2024) for SFT training and verl (Sheng et al., 2025) for RL training. For SFT, the learning rate is $1\times10^{-5}$ with a warmup ratio of 0.1 and a cosine decay schedule, and the global batch size is 16. For RL, the learning rate is $1\times10^{-6}$, and the batch size is 16 with a mini-batch size of 2. The rollout temperature is 0.7, top-p is 1.0, and the group size $G$ is 4. We schedule $\gamma$ via cosine decay, annealing it from a peak of 0.9 to a floor of 0.05. At test time, we fix the temperature to 0.7, top-p to 0.95, and the inference batch size to 5 for all evaluations. Detailed hyperparameters are listed in Appx. F.
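The cosine annealing of $\gamma$ from 0.9 to 0.05 can be sketched as below. This is an assumption-laden illustration: the function name `gamma_schedule`, the choice to anneal over the total number of RL steps, and the standard half-cosine form are ours; how $\gamma$ enters the training objective is defined by Eqn. 7 and is not reproduced here.

```python
import math


def gamma_schedule(step: int, total_steps: int,
                   gamma_max: float = 0.9, gamma_min: float = 0.05) -> float:
    """Half-cosine decay of the SFT-loss weight from gamma_max to gamma_min."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the range stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return gamma_min + (gamma_max - gamma_min) * cosine


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, round(gamma_schedule(step, 100), 4))
```

With this form the weight stays close to its peak early in training and falls fastest around the midpoint, which fits the stated intent of strong SFT supervision first and more exploration later.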

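Table 1: Main Results. * indicates that the original paper does not report results for the corresponding model, so we train the model with their official data and code for reproduction. + denotes that we directly download their officially released model for fair evaluation. The best results in each model group are highlighted in bold.

| Backbone | Method | DABench pass@1 | DABench pass@3 | TableBench pass@1 | TableBench pass@3 | BIRD pass@1 | BIRD pass@3 | Avg. pass@1 | Avg. pass@3 |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | |
| GPT-4o | ReAct | 76.39 | 84.44 | 64.97 | 75.06 | 50.20 | 62.39 | 63.85 | 73.96 |
| o4-mini | ReAct | 79.12 | 86.77 | 71.03 | 80.15 | 57.04 | 66.88 | 69.06 | 77.93 |
| DeepSeek-R1 | ReAct | 78.73 | 87.55 | 68.96 | 79.52 | 55.80 | 66.17 | 67.83 | 77.75 |
| DeepSeek-V3.1 | ReAct | **81.32** | **89.49** | **72.52** | **81.68** | 57.89 | **68.12** | **70.58** | **79.76** |
| GPT-5 | ReAct | 78.21 | 85.21 | 69.93 | 78.37 | **60.17** | 65.19 | 69.44 | 76.26 |
| Open-source Models | | | | | | | | | |
| Qwen-2.5-Coder-32B | ReAct | 73.15 | 81.32 | 61.11 | 72.26 | 41.20 | 60.17 | 58.49 | 71.25 |
| QwQ-32B | ReAct | 70.17 | 85.21 | 57.79 | 75.19 | 50.30 | 64.21 | 59.42 | 74.87 |
| Llama-3.3-70B | ReAct | 69.78 | 80.16 | 55.47 | 70.36 | 59.10 | 68.58 | 61.45 | 73.03 |
| Qwen-2.5-72B | ReAct | **75.33** | **86.38** | **65.44** | **76.21** | **60.30** | **69.49** | **67.02** | **77.36** |
| Qwen-2.5-Coder-7B | ReAct | 15.05 | 35.41 | 11.70 | 28.63 | 7.02 | 18.71 | 11.26 | 27.58 |
| Qwen-2.5-Coder-7B | TableLLM* | 36.71 | 71.98 | 41.01 | 70.36 | 11.99 | 16.75 | 29.90 | 53.03 |
| Qwen-2.5-Coder-7B | Table-R1* | 42.54 | 78.99 | 56.36 | 63.61 | 10.69 | 13.49 | 36.53 | 52.03 |
| Qwen-2.5-Coder-7B | OmniSQL+ | 26.46 | 36.19 | 39.95 | 50.25 | 57.11 | 66.30 | 41.17 | 50.91 |
| Qwen-2.5-Coder-7B | SQL-R1+ | 24.90 | 34.63 | 40.84 | 50.64 | 56.78 | 66.23 | 40.83 | 50.50 |
| Qwen-2.5-Coder-7B | DATAMIND | **77.30** | **87.94** | **67.60** | **79.39** | **59.41** | **69.88** | **68.10** | **79.07** |
| Qwen-2.5-Coder-14B | ReAct | 71.21 | 83.27 | 56.96 | 69.97 | 41.76 | 59.91 | 56.64 | 71.05 |
| Qwen-2.5-Coder-14B | TableLLM* | 38.26 | 74.71 | 46.44 | 76.08 | 20.99 | 28.88 | 35.23 | 59.89 |
| Qwen-2.5-Coder-14B | Table-R1* | 45.33 | 79.38 | 50.38 | 58.91 | 11.80 | 14.08 | 35.84 | 50.79 |
| Qwen-2.5-Coder-14B | OmniSQL+ | 26.46 | 39.30 | 41.98 | 52.67 | 58.80 | 67.41 | 42.41 | 53.13 |
| Qwen-2.5-Coder-14B | SQL-R1+ | 27.24 | 40.47 | 41.22 | 51.02 | 58.02 | 66.62 | 42.16 | 52.70 |
| Qwen-2.5-Coder-14B | DATAMIND | **80.29** | **88.72** | **70.95** | **81.81** | **62.23** | **70.21** | **71.16** | **80.25** |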

5.2 MAIN RESULTS

As shown in Tab. 1, our 7B model, DATAMIND-7B, achieves the best performance among all open-source models with an average score of $68.10\%$. Our 14B model, DATAMIND-14B, attains an average score of $71.16\%$ across all tasks, surpassing all proprietary models (including the latest GPT-5 and DeepSeek-V3.1) as well as all open-source alternatives. Moreover, our DATAMIND series models demonstrate robust mastery of diverse data formats and exhibit balanced performance across all datasets. By contrast, specialized models degrade sharply when confronted with unseen data. For example, OmniSQL-7B reaches $57.11\%$ on BIRD, yet its performance on TableBench and DABench drops steeply. Note that, to ensure a fair evaluation, we converted all tables in these two benchmarks into .sqlite files; nevertheless, SQL-oriented models still underperform. This observation reflects the breadth of query types and file formats covered by DATAMIND-12K. Furthermore, TableLLM and Table-R1 are limited to small-scale tables. When evaluated on DABench's large-scale tables, they fail to generalize, and their accuracy deteriorates even further on BIRD's multi-table analysis. These results highlight our model's capacity to handle complex tabular data, which can be attributed to the difficulty distribution embedded in DATAMIND-12K. Moreover, all trained baselines are exposed to significantly larger training corpora than ours (20K instances for TableLLM and Table-R1, and 2.5M for OmniSQL and SQL-R1, versus only 12K for DATAMIND), yet we outperform them even on the benchmarks they specialize in. We attribute this gain to the high-quality reasoning trajectories curated in DATAMIND-12K and our stable training strategy. Our model also maintains a high pass@3 score, indicating that it preserves strong generation diversity while ensuring reliability.
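The average scores quoted above are consistent with an unweighted mean over the three benchmarks. The short check below recomputes a few pass@1 averages from the per-benchmark numbers in Tab. 1; the helper name `benchmark_average` is hypothetical, and the unweighted-mean reading is an inference from the reported figures rather than a definition stated in the text.

```python
from typing import Sequence


def benchmark_average(scores: Sequence[float]) -> float:
    """Unweighted mean of per-benchmark scores, rounded to two decimals."""
    if not scores:
        raise ValueError("scores must not be empty")
    return round(sum(scores) / len(scores), 2)


# pass@1 on DABench, TableBench, and BIRD, copied from Tab. 1.
PASS_AT_1 = {
    "DATAMIND-7B": [77.30, 67.60, 59.41],
    "DATAMIND-14B": [80.29, 70.95, 62.23],
    "DeepSeek-V3.1": [81.32, 72.52, 57.89],
}

if __name__ == "__main__":
    for model, scores in PASS_AT_1.items():
        print(model, benchmark_average(scores))
```

The recomputed values agree with the reported averages of 68.10, 71.16, and 70.58, so the margin of DATAMIND-14B over DeepSeek-V3.1 comes mainly from BIRD, where the gap is largest.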

5.3 ANALYSIS

Self-consistency filtering matters more than best-trajectory selection. In Fig. 3, we analyze the impact of the self-consistency trajectory filtering and best-trajectory selection strategies through SFT on the 7B model. Removing the self-consistency filtering (non-con) inflicts the most pronounced degradation on model performance: both pass@1 and pass@3 drop to varying extents across all datasets. This observation suggests that the quality of the answer a trajectory produces is a critical guarantee of the trajectory's overall quality. Provided that the final answers are consistent, randomly selecting a single trajectory for training (random-select) is not necessarily worse than explicitly choosing the best one (con-select), and it even yields a clear improvement on DABench. We hypothesize that the judge model's preference bias may reduce trajectory diversity. This conjecture is further supported by the pass@3 scores of random-select, which are on par with or superior to those of con-select across all three datasets. Moreover, the largest performance gains are obtained by including, without any selection, every trajectory that converges to a consistent answer (non-select). This pattern holds across all datasets and indicates that the diversity of reasoning patterns and problem-solving strategies embedded in the trajectories benefits the model's reasoning capability more than picking a single best trajectory does, which aligns with the findings of Guha et al. (2025), although we cannot fully rule out the contribution of the larger training volume introduced by this selection-free approach.
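The four data-construction variants compared here can be sketched for a single query as follows. This is a hypothetical illustration: the majority-vote criterion used for self-consistency, the `(trajectory, answer)` tuple layout, the `judge_score` callback standing in for the judge model, and the function names are all assumptions, since the filtering procedure itself is defined earlier in the paper.

```python
import random
from collections import Counter
from typing import Callable, List, Optional, Tuple

Trajectory = Tuple[str, str]  # (trajectory text, final answer)


def consistent_subset(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories whose final answer equals the majority answer (assumed criterion)."""
    if not trajs:
        return []
    majority, _ = Counter(answer for _, answer in trajs).most_common(1)[0]
    return [t for t in trajs if t[1] == majority]


def build_training_set(trajs: List[Trajectory], strategy: str,
                       judge_score: Optional[Callable[[Trajectory], float]] = None,
                       rng: Optional[random.Random] = None) -> List[Trajectory]:
    """Select the trajectories of one query that go into the SFT data."""
    if strategy == "non-con":
        return list(trajs)              # no self-consistency filtering at all
    kept = consistent_subset(trajs)
    if not kept or strategy == "non-select":
        return kept                     # every self-consistent trajectory
    if strategy == "random-select":
        return [(rng or random.Random(0)).choice(kept)]
    if strategy == "con-select":
        if judge_score is None:
            raise ValueError("con-select needs a judge_score function")
        return [max(kept, key=judge_score)]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    sampled = [("t1", "42"), ("t2", "42"), ("t3", "7"), ("t4", "42")]
    scores = {"t1": 0.2, "t2": 0.9, "t4": 0.5}  # hypothetical judge scores
    print(build_training_set(sampled, "con-select", lambda t: scores[t[0]]))
    print(build_training_set(sampled, "non-select"))
```

Written this way, non-select differs from con-select only in how many self-consistent trajectories survive, which is why the larger training volume cannot be separated from the diversity effect in this comparison.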


Figure 3: Analysis of Self-Consistency Filtering and Best Trajectory Selection. Con-select is our original setting, which applies self-consistency filtering followed by best-trajectory selection by a judge model $\mathcal{M}$. Non-select keeps all self-consistent trajectories without selecting the best one. Random-select picks one self-consistent trajectory at random instead of the judged best one. Non-con uses all synthesized trajectories without self-consistency filtering.


Figure 4: The Influence of SFT Loss on RL Training. $\gamma=0$ denotes the absence of SFT loss, $\gamma=0.2$ corresponds to a low SFT-loss weight, and dynamic $\gamma$ indicates our default setting.

SFT loss is an effective stabilizer for RL training. While our experiments were still in an exploratory phase, we used DATAMIND-12K to examine how the weight of the SFT loss in Eqn. 7 influences RL training on the 7B model without a cold start. In Fig. 4, we plot the dynamics of the answer reward across training steps under different $\gamma$ settings. When no SFT loss is imposed ($\gamma=0$), the answer reward declines almost monotonically. We attribute this failure to two factors. First, the 7B model's limited multi-step reasoning capability makes it difficult to roll out high-quality trajectory groups for effective learning. Second, the heterogeneity of both data structures and code languages yields highly imbalanced trajectory distributions, resulting in unstable training. Raising $\gamma$ to 0.2 alleviates the problem to some extent: the answer reward initially rises despite large oscillations, yet the SFT loss remains too weak to prevent the policy from eventually drifting away and collapsing. Under our dynamic $\gamma$ schedule, the model first enjoys the stabilizing supervision of a strong SFT loss; after it matures, the SFT coefficient is gradually annealed to encourage exploration, yielding stable training throughout the whole process.

SFT loss can also be the culprit of unstable training. Although the SFT loss serves as an effective stabilizer for RL training, we find that its persistent dominance throughout training can conversely trigger collapse. As shown in Fig. 5, fixing $\gamma$ at a high level also causes the answer reward to rise briefly and then decline gradually. The underlying reason is that overfitting to the SFT loss traps the policy in the rigid thinking patterns embedded in the expert trajectories, especially when these trajectories are synthesized from the same model, thereby crippling exploration. To corroborate this, we track the entropy of the policy during training and observe a pronounced entropy collapse. In contrast, our dynamic $\gamma$ strategy keeps the policy entropy at a relatively high level throughout training. Overall, we find that the training process resembles raising a child. During early childhood, constant parental guidance (a large $\gamma$) is indispensable to keep the child from going astray. As the child grows up, excessive supervision stifles the child's innate drive for self-directed exploration. At that stage, judiciously letting go (a small $\gamma$) enables the child to discover their true capabilities through feedback from the surrounding world.
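For readers unfamiliar with the entropy diagnostic, the sketch below shows one common way to measure it: the Shannon entropy of the next-token distribution, averaged over generated positions. It is a generic illustration in plain Python, not the quantity as logged by the authors' training framework; the function names and the toy logits are hypothetical.

```python
import math
from typing import List


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    peak = max(logits)
    # Subtract the maximum logit before exponentiating for numerical stability.
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(per_token_logits: List[List[float]]) -> float:
    """Average token entropy over all positions of a rollout."""
    if not per_token_logits:
        raise ValueError("need at least one token position")
    return sum(token_entropy(row) for row in per_token_logits) / len(per_token_logits)


if __name__ == "__main__":
    exploratory = [[0.0, 0.0, 0.0, 0.0]] * 3    # uniform over four tokens
    collapsed = [[100.0, 0.0, 0.0, 0.0]] * 3    # nearly deterministic
    print(mean_policy_entropy(exploratory), mean_policy_entropy(collapsed))
```

A value near zero means the policy has become almost deterministic, which is the collapse pattern described for a persistently large $\gamma$.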

RL can narrow the performance gap between different base models, but can hardly reverse the order. Fig. 6 shows the impact of different degrees of cold start on the 7B model's subsequent RL training. For rapid empirical verification, we randomly sample 3,843 training examples from DATAMIND-12K (balanced on query types) and 240 test examples (60 for each of the three test datasets) for evaluation. As the number of cold-start training epochs increases, the marginal gain achieved by RL over the cold-start checkpoint (i.e., the slope of the dashed line) diminishes. This indicates that RL can narrow the performance gap between different base models (Liu et al., 2025). Nevertheless, although the gap is narrowed, post-RL performance remains positively correlated with the capability of the base model. This suggests that the bulk of knowledge is acquired during SFT, whereas RL primarily serves to unlock latent potential rather than explicitly push the model beyond its inherent capacity boundary (Yue et al., 2025a; Chu et al., 2025). Setting aside overfitting, the current trend suggests that a crossover point may emerge at which a sufficiently strong cold start leaves no room for further improvement via RL. Whether such a point truly exists and, if it does, what fundamental mechanisms (e.g., saturation of the policy space, diminishing exploratory signal, or intrinsic limitations of the reward model) render further RL ineffective constitute an important open question for future work.
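One simple way to draw a subset that is balanced on query types is a round-robin draw over shuffled per-type pools, sketched below. The text does not specify the sampling procedure, so this is only an assumed scheme; the function name `balanced_sample`, the `key` callback, and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def balanced_sample(items: List[T], key: Callable[[T], str],
                    n: int, seed: int = 0) -> List[T]:
    """Draw up to n items, cycling over groups so group sizes stay as even as possible."""
    rng = random.Random(seed)
    groups: Dict[str, List[T]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    for members in groups.values():
        rng.shuffle(members)            # random order within each group
    names = sorted(groups)
    sample: List[T] = []
    # Take one item per group in turn until n items are drawn or all pools are empty.
    while len(sample) < n and any(groups[name] for name in names):
        for name in names:
            if groups[name] and len(sample) < n:
                sample.append(groups[name].pop())
    return sample


if __name__ == "__main__":
    pool = [(qtype, i) for qtype in ("aggregation", "filtering", "ranking") for i in range(10)]
    subset = balanced_sample(pool, key=lambda item: item[0], n=9)
    print(subset)
```

When one query type has fewer examples than its share, the round-robin draw simply continues with the remaining types, so the subset stays as balanced as the pool allows.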


Figure 5: Answer Reward and Entropy Dynamics of different $\gamma$ settings.

6 RELATED WORK

Agent Training. The earliest wave of LLM agents (Wang et al., 2023; Xi et al., 2023) leveraged the formidable reasoning capabilities of proprietary models (Qiao et al., 2023; Chen et al., 2025a; Yao et al., 2023; Zhou et al., 2023; Hong et al., 2024; Li et al., 2023a; Wu et al., 2023). As AI entered the second half (Yao, 2025), numerous benchmarks targeting complex, domain-specific agentic tasks have been introduced (Mialon et al., 2024; Phan et al., 2025; Jimenez et al., 2024; Chan et al., 2025; Starace et al., 2025; Wei et al., 2025a). These benchmarks expose the limitations of general-purpose agent architectures and elevate domain-specific agent training to a critical necessity. The release of Large Reasoning Models (OpenAI, 2024c; DeepSeek-AI et al., 2025; Kimi et al., 2025) marks the triumph of Reinforcement Learning (RL) for LLMs. Consequently, a surge of work has sought to adapt RL algorithms to various agent domains (Jin et al., 2025a; Song et al., 2025; Li et al., 2025c; Wu et al., 2025a; Feng et al., 2025; Qian et al., 2025; Li et al., 2025b). Yet these methods presuppose a strong backbone model; researchers are therefore compelled to synthesize copious post-training data to compensate for the backbone's deficiencies. To the best of our knowledge, we are the first to systematically investigate the scaling of agent post-training in the data-analytic scenario, aiming to provide actionable insights for data synthesis and RL-driven training in other complex agent fields.


Figure 6: Performance Gap Between Cold Start and RL with varying cold start epochs.

Data-Analytic Agents and Benchmarks. Data analysis agents harness the reasoning and code-generation capabilities of LLMs to automate the end-to-end processing of data analysis tasks. Virtually all existing data analysis agents rely on closed-source models and are limited to prompt engineering. DS-Agent (Guo et al., 2024) incorporates human insights into data analysis tasks via case-based reasoning. AutoKaggle (Li et al., 2024) decomposes the data analysis pipeline into specialized sub-tasks through a multi-agent architecture. Data-Copilot (Zhang et al., 2023) and AgenticData (Sun et al., 2025) stabilize agent behavior by orchestrating operations within predefined workflows. Data Interpreter (Hong et al., 2025) further enlarges the agent's exploration space by introducing dynamic graph-based workflows. To foster progress in this domain, numerous data analysis datasets have been introduced (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Zhang et al., 2025a; Majumder et al., 2025). Nevertheless, each adopts its own task formulation and evaluation protocol, and the majority rely primarily on human-annotated labels. In this paper, we propose a fully automated pipeline to synthesize data analysis questions and executable code trajectories. Leveraging this synthetic corpus, we train two generalist data-analytic agents with advanced performance.

7 CONCLUSION

This paper introduces DATAMIND, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. Built on DATAMIND, we curate DATAMIND-12K, a high-quality training set that spans diverse task categories and data file formats for data-analytic tasks. Trained on DATAMIND-12K, we obtain DATAMIND-7B and 14B, two advanced data-analytic agents with superior performance on multiple benchmarks compared with various proprietary and open-source baselines. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community.

ETHICS STATEMENT

This study was conducted in full compliance with established ethical standards and research best practices. All data employed are derived and synthesized exclusively from publicly available sources; no proprietary or confidential information was used. Every reference to these data sources is accurately and appropriately cited throughout the paper. We strongly encourage all users of our training dataset to uphold the highest ethical standards, ensuring fairness, transparency, and responsibility in their research. Any use of the dataset that could cause harm or negatively impact society is strictly prohibited.

REPRODUCIBILITY STATEMENT

We have submitted all our training and evaluation code in the Supplementary Material. Due to OpenReview's file size limit, we only upload a 3,843-example subset of DATAMIND-12K. We will fully open DATAMIND-12K and our models DATAMIND-7B and 14B immediately after the double-blind review process. The detailed training data synthesis and agent training methods can be found in $\S3$ and $\S4$. We report the details of our evaluation datasets and metrics in $\S5.1$ and Appx. D. Detailed information about the models and baselines we use, including the versions of proprietary models and the reproduction details of baseline models, can be found in $\S5.1$ and Appx. E. The code frameworks and the training and inference hyperparameters are described in $\S5.1$ and Appx. F. All the prompts used in our paper are presented in Appx. G.

REFERENCES

Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti. Kevin: Multi-turn RL for generating CUDA kernels. CoRR, abs/2507.11948, 2025. doi: 10.48550/ARXIV.2507.11948. URL https://doi.org/10.48550/arXiv.2507.11948.

Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen. SciMaster: Towards general-purpose scientific AI agents, part I. X-Master as foundation: Can we lead on Humanity's Last Exam? CoRR, abs/2507.05241, 2025. doi: 10.48550/ARXIV.2507.05241. URL https://doi.org/10.48550/arXiv.2507.05241.

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https://arxiv.org/abs/2410.07095.

Team Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, et al. Kimi K2: Open agentic intelligence, 2025. URL https://arxiv.org/abs/2507.20534.

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=XmProj9cPs.

Guohao Li, Hasan Hammoud, Hani Itani, Dmitri Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for "mind" exploration of large language model society. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023a. URL http://papers.nips.cc/paper_files/paper/2023/hash/a3621ee907def47c1b952ade25c67698-Abstract-Conference.html.

Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, and Cuiping Li. OmniSQL: Synthesizing high-quality text-to-SQL data at scale. CoRR, abs/2503.02240, 2025a. doi: 10.48550/ARXIV.2503.02240. URL https://doi.org/10.48550/arXiv.2503.02240.

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023b. URL http://papers.nips.cc/paper_files/paper/2023/hash/83fc8fab1710363050bbd1d4b8cc0021-Abstract-Datasets_and_Benchmarks.html.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. WebSailor: Navigating super-human reasoning for web agent, 2025b. URL https://arxiv.org/abs/2507.02592.

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability, 2025c. URL https://arxiv.org/abs/2504.21776.

Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, and Ge Zhang. AutoKaggle: A multi-agent framework for autonomous data science competitions. CoRR, abs/2410.20424, 2024. doi: 10.48550/ARXIV.2410.20424. URL https://doi.org/10.48550/arXiv.2410.20424.

Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng. Are LLMs capable of data-based statistical and causal reasoning? Benchmarking advanced quantitative reasoning with data. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pp. 9215-9235. Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.FINDINGS-ACL.548. URL https://doi.org/10.18653/v1/2024.findings-acl.548.

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron 1.1: Advancing math and code reasoning through SFT and RL synergy, 2025. URL https://arxiv.org/abs/2506.13284.

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. CoRR, abs/2408.06292, 2024. doi: 10.48550/ARXIV.2408.06292. URL https://doi.org/10.48550/arXiv.2408.06292.

Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, and Jian Guo. SQL-R1: training natural language to SQL reasoning model by reinforcement learning. CoRR, abs/2504.08600, 2025. doi: 10.48550/ARXIV.2504.08600. URL https://doi.org/10.48550/arXiv.2504.08600.

Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, and Jian Guo. SQL-R1: 基于强化学习的自然语言转SQL推理模型训练. CoRR, abs/2504.08600, 2025. doi: 10.48550/ARXIV.2504.08600. URL https://doi.org/10.48550/arXiv.2504.08600.

Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Yingying Zhang, and Wenqiang Zhang. Agent RL scaling law: Agent RL with spontaneous code execution for mathematical problem solving. CoRR, abs/2505.07773, 2025. doi: 10.48550/ARXIV.2505.07773. URL https://doi.org/10.48550/arXiv.2505.07773.

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. DiscoveryBench: Towards data-driven discovery with large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=VyflgpwfJw.

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. DiscoveryBench: 面向数据驱动发现的大语言模型. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=VyflgpwfJw.

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=fibxvahvs3.

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun 和 Thomas Scialom. GAIA: 通用人工智能助手的基准测试. 发表于第十二届国际学习表征会议 (ICLR 2024), 奥地利维也纳, 2024年5月7-11日. OpenReview.net, 2024. URL https://openreview.net/forum?id=fibxvahvs3.

OpenAI. Hello gpt-4o, 2024a. https://openai.com/index/hello-gpt-4o/.

OpenAI. Hello gpt-4o, 2024a. https://openai.com/index/hello-gpt-4o/.

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024b. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.

OpenAI. Gpt-4o mini: 推进成本效益型智能发展, 2024b. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.

OpenAI. Introducing openai o1-preview, 2024c. https://openai.com/index/introducing-openai-o1-preview/.

OpenAI. 推出openai o1-preview, 2024c. https://openai.com/index/introducing-openai-o1-preview/.

OpenAI. Introducing gpt-5, 2025a. https://openai.com/index/introducing-gpt-5/.

OpenAI. 推出GPT-5, 2025a. https://openai.com/index/introducing-gpt-5/.

OpenAI. Introducing openai o3 and o4-mini, 2025b. https://openai.com/index/introducing-o3-and-o4-mini/.

OpenAI. 推出 OpenAI O3 和 O4-mini, 2025b. https://openai.com/index/introducing-o3-and-o4-mini/.

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Milind Jagota, Ronak Pradeep, et al. Humanity's last exam, 2025. URL https://arxiv.org/abs/2501.14249.

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Milind Jagota, Ronak Pradeep 等人. 人类终极测试, 2025. URL https://arxiv.org/abs/2501.14249.

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tur, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs, 2025. URL https://arxiv.org/abs/2504.13958.

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tur, Gokhan Tur, and Heng Ji. Toolrl: 奖励即工具学习的全部需求, 2025. URL https://arxiv.org/abs/2504.13958.

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5368-5393. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.294. URL https://doi.org/10.18653/v1/2023.acl-long.294.

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 基于大语言模型提示的推理研究综述. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (编), 《第61届计算语言学协会年会论文集 (第一卷: 长论文) 》, ACL 2023, 加拿大多伦多, 2023年7月9-14日, 第5368-5393页. 计算语言学协会, 2023. doi: 10.18653/V1/2023.ACL-LONG.294. URL https://doi.org/10.18653/v1/2023.acl-long.294.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. Autoact: Automatic agent learning from scratch for QA via self-planning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pp. 3003-3021. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.165.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. AutoAct: 通过自我规划实现问答任务中从零开始的自动智能体学习. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), 计算语言学协会第62届年会论文集 (第一卷: 长论文), ACL 2024, 泰国曼谷, 2024年8月11-16日, pp. 3003-3021. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.165.

A THE USE OF LARGE LANGUAGE MODELS

A 大语言模型 (Large Language Models) 的使用

We affirm that Large Language Models are employed solely as an assistive tool to refine wording and sentence structure during our paper writing process. Their use in the experiments is strictly for scientific research purposes, and all such usage has been explicitly documented in our Experimental Settings and Reproducibility Statement. No other reliance on LLMs is involved in this work.

我们声明,大语言模型 (Large Language Model) 仅作为辅助工具用于论文撰写过程中的措辞润色与句子结构优化。实验中对大语言模型的使用严格遵循科研目的,所有使用情况均已明确记录在实验设置与可复现性声明中。本工作未涉及其他对大语言模型的依赖。

B LIMITATIONS

B 局限性

This paper still has some limitations that must be acknowledged: a) At present, we only incorporate reasoning-oriented data-analysis tasks; training, predictive, and data-visualization tasks are deliberately excluded and reserved as important future work. b) Owing to computational constraints, our experimental backbone is restricted to the Qwen family, with model scale capped at 14B. Furthermore, not all mainstream benchmarks are covered in our evaluation suite. c) Limited by computational resources, we have not exhaustively evaluated all RL training algorithms; moreover, data scarcity constrains our RL runs to $\sim350$ steps. In future work, we will investigate more advanced RL strategies that enable stable, continual learning over substantially larger datasets.

本文仍存在一些必须承认的局限性: a) 目前我们仅纳入面向推理的数据分析任务, 训练, 预测和数据可视化任务被刻意排除, 并保留为我们重要的未来工作. b) 由于计算资源限制, 我们的实验主干网络仅限于Qwen系列, 模型规模上限为140亿参数. 此外, 评估套件并未覆盖所有主流基准. c) 受限于计算资源, 我们尚未详尽评估所有强化学习训练算法; 同时数据稀缺性将我们的强化学习训练步数限制在 $\sim350$ 步. 在未来的工作中, 我们将研究更先进的强化学习策略, 以实现对更大规模数据的稳定持续学习.

C A MORE DETAILED RELATED WORK

C 更详细的相关工作

Agent Training. The earliest wave of LLM Agents (Wang et al., 2023; Xi et al., 2023) leverages the formidable reasoning capabilities of proprietary models (Qiao et al., 2023; Chen et al., 2025a). At that time, researchers primarily boost agent performance through prompt engineering (Yao et al., 2023; Zhou et al., 2023; Hong et al., 2024; Li et al., 2023a; Wu et al., 2023). To equip open-source models with agentic skills, subsequent works introduce agent training (Chen et al., 2023; Zeng et al., 2024; Chen et al., 2024; Qiao et al., 2024) via SFT. Large-scale trajectory data, manually curated or synthetically generated by closed-source models, are used to instruction-tune open-source models. As AI entered the second half (Yao, 2025), numerous benchmarks targeting complex, domain-specific agentic tasks were introduced (Mialon et al., 2024; Phan et al., 2025; Jimenez et al., 2024; Chan et al., 2025; Starace et al., 2025; Wei et al., 2025a), which expose the limitations of general-purpose agent architectures and elevate domain-specific agent training to a critical necessity. However, extensive studies have shown that SFT tends to drive agent models into paradigm overfitting, severely compromising their dynamic generalization ability in sophisticated agent scenarios (Chu et al., 2025; Jin et al., 2025b; Qiao et al., 2025).

AI智能体 (AI Agent) 训练。早期的大语言模型智能体浪潮 (Wang et al., 2023; Xi et al., 2023) 依托于闭源模型的强大推理能力 (Qiao et al., 2023; Chen et al., 2025a)。彼时研究者主要通过提示工程 (Yao et al., 2023; Zhou et al., 2023; Hong et al., 2024; Li et al., 2023a; Wu et al., 2023) 提升智能体性能。为使开源模型具备智能体能力,后续研究开始通过监督微调引入智能体训练 (Chen et al., 2023; Zeng et al., 2024; Chen et al., 2024; Qiao et al., 2024),采用人工标注或闭源模型生成的大规模轨迹数据对开源模型进行指令微调。随着人工智能进入下半场 (Yao, 2025),针对复杂领域特定任务的评测基准相继涌现 (Mialon et al., 2024; Phan et al., 2025; Jimenez et al., 2024; Chan et al., 2025; Starace et al., 2025; Wei et al., 2025a),暴露出通用智能体架构的局限性,使得领域特定智能体训练成为关键需求。然而大量研究表明,监督微调容易导致智能体模型陷入范式过拟合,严重削弱其在复杂智能体场景中的动态泛化能力 (Chu et al., 2025; Jin et al., 2025b; Qiao et al., 2025)。

The release of Large Reasoning Models (OpenAI, 2024c; DeepSeek-AI et al., 2025; Kimi et al., 2025) marks the triumph of Reinforcement Learning (RL) for LLMs. GRPO-style algorithms (Shao et al., 2024; Yu et al., 2025; Yue et al., 2025b; Wang et al., 2025; Dong et al., 2025; Zhang et al., 2025b) enable models to autonomously explore while preserving robust generalization across diverse reasoning patterns. Consequently, a surge of work has sought to adapt GRPO-like algorithms to various agent domains (Jin et al., 2025a; Song et al., 2025; Li et al., 2025c; Wu et al., 2025a; Feng et al., 2025; Qian et al., 2025; Li et al., 2025b; Wei et al., 2025b). Yet these methods presuppose a strong backbone model; researchers are therefore compelled to synthesize copious post-training data to compensate for the backbone's deficiencies in the target agent setting. To the best of our knowledge, we are the first to systematically investigate the scaling of agent post-training data and multi-turn RL training in the data-analytic scenario, aiming to provide actionable insights for data synthesis and RL-driven training in other complex agent fields.

大推理模型 (Large Reasoning Models) 的发布 (OpenAI, 2024c; DeepSeek-AI et al., 2025; Kimi et al., 2025) 标志着强化学习 (Reinforcement Learning) 在大语言模型领域的胜利。GRPO风格算法 (Shao et al., 2024; Yu et al., 2025; Yue et al., 2025b; Wang et al., 2025; Dong et al., 2025; Zhang et al., 2025b) 使模型能够自主探索,同时在不同推理模式间保持强大的泛化能力。因此,大量研究试图将类GRPO算法应用于各种智能体领域 (Jin et al., 2025a; Song et al., 2025; Li et al., 2025c; Wu et al., 2025a; Feng et al., 2025; Qian et al., 2025; Li et al., 2025b; Wei et al., 2025b)。然而这些方法都预设了强大的骨干模型;研究人员因此不得不合成大量后训练数据来弥补骨干模型在目标智能体场景中的缺陷。据我们所知,我们是首个在数据分析场景中系统研究智能体后训练数据扩展和多轮强化学习训练的工作,旨在为其他复杂智能体领域的数据合成和强化学习驱动训练提供可行见解。

Data-Analytic Agents and Benchmarks. Data analysis agents harness the reasoning and code-generation capabilities of LLMs to automate the end-to-end processing of data analysis tasks, constituting a critical component in the pursuit of autonomous scientific discovery (Chen et al., 2025b). Virtually all existing data analysis agents rely on closed-source models and are limited to prompt engineering. InfiAgent (Hu et al., 2024) pioneers the adoption of the ReAct (Yao et al., 2023) paradigm for tackling data analysis problems. DS-Agent (Guo et al., 2024) incorporates human insights into data analysis tasks via case-based reasoning. AutoKaggle (Li et al., 2024) decomposes the data analysis pipeline into specialized sub-tasks through a multi-agent architecture. Data-Copilot (Zhang et al., 2023) and AgenticData (Sun et al., 2025) stabilize agent behavior by orchestrating operations within predefined workflows. Data Interpreter (Hong et al., 2025) further enlarges the agent's exploration space by introducing dynamic graph-based workflows. To foster progress in this domain, numerous data analysis datasets have been introduced (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Zhang et al., 2025a; Majumder et al., 2025; Wu et al., 2025b; Lei et al., 2025). Nevertheless, each adopts its own task formulation and evaluation protocol, and the majority rely primarily on human-annotated labels. In this paper, we propose a fully automated pipeline to synthesize data analysis questions and executable code trajectories. Leveraging this synthetic corpus, we train two generalist data-analytic agents with advanced performance.

数据分析智能体与基准测试。数据分析智能体利用大语言模型的推理能力和代码生成功能,自动化处理数据分析任务的全流程,是实现自主科学发现的关键组成部分 (Chen et al., 2025b)。当前几乎所有数据分析智能体都依赖闭源模型且仅限于提示工程。InfiAgent (Hu et al., 2024) 率先采用ReAct范式 (Yao et al., 2023) 解决数据分析问题。DS-Agent (Guo et al., 2024) 通过案例推理将人类洞察融入数据分析任务。AutoKaggle (Li et al., 2024) 通过多智能体架构将数据分析流程分解为专业子任务。Data-Copilot (Zhang et al., 2023) 和Agentic Data (Sun et al., 2025) 通过在预定义工作流中编排操作来稳定智能体行为。Data Interpreter (Hong et al., 2025) 通过引入基于动态图的工作流进一步扩展了智能体的探索空间。为推动该领域发展,目前已涌现多个数据分析数据集 (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Zhang et al., 2025a; Majumder et al., 2025; Wu et al., 2025b; Lei et al., 2025)。然而,这些数据集各自采用不同的任务构建方式和评估方案,且主要依赖人工标注标签。本文提出全自动流程来合成数据分析问题与可执行代码轨迹,并利用该合成语料训练出两个具有先进性能的通用数据分析智能体。

D DATASETS AND EVALUATION DETAILS

D 数据集和评估细节

We evaluate our model on three datasets related to data analysis. Here, we introduce the details and our evaluation protocols for each dataset:

我们在三个与数据分析相关的数据集上评估我们的模型。在此,我们介绍每个数据集的详细信息及评估方案:

· DABench (Hu et al., 2024). DABench evaluates LLMs in data analysis tasks across 257 challenges from 52 CSV files, covering 7 question categories. The original benchmark uses accuracy as the metric. The model's answer is reformatted by an LLM to a specific structure and compared with the gold label using regular expression matching. Here, we directly utilize the model-as-judge to compare the predicted answer and the gold answer.

· DABench (Hu et al., 2024) 。DABench 通过 52 个 CSV 文件中的 257 项挑战来评估大语言模型 (LLM) 在数据分析任务中的表现,涵盖 7 个问题类别。原始基准使用准确率作为评估指标。模型的答案将由大语言模型重新格式化为特定结构,并通过正则表达式匹配与标准答案进行比对。在此,我们直接采用模型即评判员 (model-as-judge) 的方式来比较预测答案与标准答案。

· TableBench (Wu et al., 2025b). TableBench is a real-world table reasoning benchmark spanning 18 fields and four major categories. The tables in TableBench are organized as .json files, so we first transform them into .csv files. Then, we filter out the trend forecasting and chart generation questions because these questions do not have explicit gold answers. The original benchmark uses Rouge-L as the metric, and we apply model-as-judge instead.

· TableBench (Wu等人, 2025b) 。TableBench是一个涵盖18个领域和四大类别的真实场景表格推理基准测试。该基准中的表格采用.json文件进行组织,因此我们首先将其转换为.csv文件。随后我们过滤了趋势预测和图表生成类问题,因为这些问题没有明确的参考答案。原始基准采用Rouge-L作为评估指标,而我们改用模型即评判官 (model-as-judge) 方法。

· BIRD (Li et al., 2023b). BIRD is a widely used Text-to-SQL benchmark. We use it to evaluate our model's ability to analyze table-based databases. Since the BIRD test set requires official leaderboard submission, we adopt its validation set as our testbed for convenience. As SQL execution typically returns structured and often very large tables that are hard for a model-as-judge to assess, we instead materialize each result into a .csv file and perform an exact match comparison against the gold label.

· BIRD (Li et al., 2023b). BIRD是一个广泛使用的Text-to-SQL基准测试。我们用它来评估模型分析基于表格的数据库的能力。由于BIRD测试集需要官方排行榜提交,为方便起见我们采用其验证集作为测试平台。由于SQL执行通常返回结构化且通常体量庞大的表格,难以通过模型即评判 (model-as-judge) 方式评估,我们改为将每个结果物化为.csv文件并与标准答案进行精确匹配比较。
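The exact-match step described for BIRD can be illustrated with a minimal sketch. The helper names `load_rows` and `exact_match` are hypothetical, and the comparison rule used here (identical rows in identical order, after trimming cell whitespace) is an assumption; the paper does not spell out its normalization.

```python
import csv
import io


def load_rows(csv_text: str) -> list[list[str]]:
    """Parse CSV text into rows, trimming whitespace around each cell."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[cell.strip() for cell in row] for row in reader if row]


def exact_match(pred_csv: str, gold_csv: str) -> bool:
    """Assumed rule: both results must hold the same rows in the same order."""
    return load_rows(pred_csv) == load_rows(gold_csv)


# Illustrative data only; real inputs would be the materialized .csv results.
pred = "name,total\nalice,3\nbob,5\n"
gold = "name,total\nalice,3\nbob,5\n"
wrong = "name,total\nalice,3\nbob,6\n"

print(exact_match(pred, gold))
print(exact_match(pred, wrong))
```

A stricter or looser rule (for example, ignoring row order) would change only `exact_match`.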

We adopt accuracy as the final metric. For every model, we run three independent trials. We take the average score of the three trials as pass@1, while the union of the three trials is taken as pass@3 (i.e., success on any single trial counts as an overall success).

我们采用准确率作为最终评估指标。每个模型均进行三次独立实验。将三次实验的平均得分记为 pass@1,而将三次实验的并集结果记为 pass@3 (即在任意单次实验中成功即视为整体成功)。
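As a small worked example of the two metrics, the sketch below computes pass@1 as the mean accuracy over trials and pass@3 as the fraction of questions solved in at least one trial. The function names and the toy outcome matrix are illustrative assumptions, not taken from the paper's code.

```python
def pass_at_1(trials: list[list[bool]]) -> float:
    """Average per-trial accuracy over all independent trials."""
    per_trial = [sum(trial) / len(trial) for trial in trials]
    return sum(per_trial) / len(per_trial)


def pass_at_k(trials: list[list[bool]]) -> float:
    """Fraction of questions solved in at least one trial (union over trials)."""
    num_questions = len(trials[0])
    solved = [any(trial[i] for trial in trials) for i in range(num_questions)]
    return sum(solved) / num_questions


# Three trials over four questions; True marks a correct answer.
trials = [
    [True, False, False, True],
    [True, True, False, False],
    [True, False, False, True],
]

print(pass_at_1(trials))  # 0.5
print(pass_at_k(trials))  # 0.75
```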

E BASELINES AND REPRODUCTION DETAILS

E 基线方法与复现细节

Models and Baselines. We compare our DATAMIND with five strong proprietary models: GPT-4o (gpt-4o-2024-0806) (OpenAI, 2024a), o4-mini (o4-mini-2025-04-16) (OpenAI, 2025b), DeepSeek-R1 (deepseek-r1-2025-0528) (DeepSeek-AI et al., 2025), DeepSeek-V3.1 (deepseek-v3.1-nothinking) (DeepSeek-AI, 2025), and GPT-5 (gpt-5-2025-08-07) (OpenAI, 2025a). We also include four outstanding open-source models: QwQ-32B (Qwen, 2025), Qwen-2.5-Coder-32B (Hui et al., 2024), Llama-3.3-70B (Dubey et al., 2024), and Qwen-2.5-72B (Yang et al., 2024). In addition, we select four open-source models that have been explicitly trained for data-analysis-related tasks: TableLLM (Wu et al., 2025b) and Table-R1 (Wu et al., 2025c) are specialized for tabular reasoning, whereas OmniSQL (Li et al., 2025a) and SQL-R1 (Ma et al., 2025) are optimized for Text-to-SQL generation. We include Qwen-2.5-Coder-7B and 14B (Hui et al., 2024) as backbone models to compare different baselines. We use the Instruct version for all open-source models. Here we introduce the baselines we compare with and our reproduction details:

模型与基线。我们将DATAMIND与五个强大的专有模型进行比较:GPT-4o (gpt-4o-2024-0806) (OpenAI, 2024a) 、o4-mini (o4-mini-2025-04-16) (OpenAI, 2025b) 、DeepSeek-R1 (deepseek-r1-2025-0528) (DeepSeek-AI et al., 2025) 、DeepSeek-V3.1 (deepseek-v3.1-nothinking) (DeepSeek-AI, 2025) 、GPT-5 (gpt-5-2025-08-07) (OpenAI, 2025a) 。同时纳入了四个优秀的开源模型:QwQ-32B (Qwen, 2025) 、Qwen-2.5-Coder-32B (Hui et al., 2024) 、Llama-3.3-70B (Dubey et al., 2024) 和Qwen-2.5-72B (Yang et al., 2024) 。此外,我们选择了四个经过显式训练用于数据分析相关任务的开源模型:TableLLM (Wu et al., 2025b) 和Table-R1 (Wu et al., 2025c) 专精于表格推理,而OmniSQL (Li et al., 2025a) 和SQL-R1 (Ma et al., 2025) 针对Text-to-SQL生成进行了优化。我们引入Qwen-2.5-Coder-7B和14B (Hui et al., 2024) 作为骨干模型来比较不同基线。所有开源模型均使用指令微调版本。此处介绍对比基线及我们的复现细节:

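Table 2: Detailed hyperparameters used in our paper.

| Stage | Hyperparameter | Value |
| --- | --- | --- |
| SFT | learning rate | 1e-5 |
| SFT | lr scheduler type | cosine |
| SFT | warmup ratio | 0.1 |
| SFT | batch size | 16 |
| SFT | training epoch | 3 |
| SFT+RL | cutoff length | 8192 |
| SFT+RL | learning rate | 1e-6 |
| SFT+RL | lr warmup style | constant |
| SFT+RL | lr warmup steps | 20 |
| SFT+RL | batch size | 16 |
| SFT+RL | mini batch size | 2 |
| SFT+RL | training epoch | 1 |
| SFT+RL | max prompt length | 2048 |
| SFT+RL | max response length | 8192 |
| SFT+RL | clip ratio low $\epsilon_{low}$ | 0.2 |
| SFT+RL | clip ratio high $\epsilon_{high}$ | 0.28 |
| SFT+RL | rollout temperature | 0.7 |
| SFT+RL | rollout top-p | 1.0 |
| SFT+RL | rollout group size $G$ | 4 |
| SFT+RL | scheduler type | cosine |
| SFT+RL | peak value | 0.9 |
| SFT+RL | valley value | 0.05 |
| SFT+RL | length reward $l_{min}$ | 256 |
| SFT+RL | length reward $l_{max}$ | 1024 |
| Inference | temperature | 0.7 |
| Inference | top-p | 0.95 |
| Inference | batch size | 5 |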

表 2: 本文使用的详细超参数。

阶段 超参数 值
SFT 学习率 1e-5
SFT 学习率调度器类型 cosine
SFT 预热比例 0.1
SFT 批次大小 16
SFT 训练轮数 3
SFT+RL 截断长度 8192
SFT+RL 学习率 1e-6
SFT+RL 学习率预热风格 constant
SFT+RL 学习率预热步数 20
SFT+RL 批次大小 16
SFT+RL 小批次大小 2
SFT+RL 训练轮数 1
SFT+RL 最大提示长度 2048
SFT+RL 最大响应长度 8192
SFT+RL 裁剪比率下限 0.2
SFT+RL 裁剪比率上限 0.28
SFT+RL 生成温度 0.7
SFT+RL 生成 top-p 1.0
SFT+RL 生成组大小 G 4
SFT+RL 调度器类型 cosine
SFT+RL 峰值 0.9
SFT+RL 谷值 0.05
SFT+RL 长度奖励最小长度 256
SFT+RL 长度奖励最大长度 1024
推理 温度 0.7
推理 top-p 0.95
推理 批次大小 5

F TRAINING AND INFERENCE DETAILS

F 训练和推理细节

We use Llama Factory (Zheng et al., 2024) for SFT training and verl (Sheng et al., 2025) for RL training. For SFT, our learning rate is 1e-5 with a warmup ratio of 0.1 and a cosine decay schedule. Our global batch size is set to 16. For RL, we use a learning rate of 1e-6. The batch size is 16 with a mini batch size of 2. The rollout temperature is 0.7, the top-p is 1.0, and the group size $G$ is 4. We schedule $\gamma$ via cosine decay, annealing from a peak of 0.9 to a valley of 0.05. At test time, we fix the temperature to 0.7, top-p to 0.95, and the inference batch size to 5 for all evaluations. The detailed hyperparameters employed in DATAMIND are presented in Table 2.

我们使用 Llama Factory (Zheng et al., 2024) 进行监督微调 (SFT) 训练, 使用 verl (Sheng et al., 2025) 进行强化学习 (RL) 训练。对于 SFT, 我们的学习率为 $1e-5$, 预热比例为 0.1, 并采用余弦衰减调度。全局批处理大小设置为 16。对于 RL, 我们使用 $1e-6$ 的学习率, 批处理大小为 16, 小批量大小为 2。推演温度为 0.7, top-p 值为 1.0, 组大小 $G$ 为 4。我们通过余弦衰减调度 $\gamma$, 从峰值 0.9 退火至谷值 0.05。在测试阶段, 我们将温度固定为 0.7, top-p 值固定为 0.95, 所有评估的推理批处理大小均为 5。DATAMIND 采用的详细超参数见表 2。
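To make the $\gamma$ schedule concrete, the sketch below anneals a coefficient from the peak (0.9) to the valley (0.05) with a standard half-cosine curve. The exact functional form, the `total_steps` value of 350 (taken from the rough RL step count mentioned in the limitations), and the way `combined_loss` mixes the two losses are assumptions for illustration; the precise objective is defined in the main text of the paper, not in this appendix.

```python
import math


def gamma_schedule(step: int, total_steps: int,
                   peak: float = 0.9, valley: float = 0.05) -> float:
    """Half-cosine decay from `peak` at step 0 to `valley` at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return valley + 0.5 * (peak - valley) * (1.0 + math.cos(math.pi * progress))


def combined_loss(sft_loss: float, rl_loss: float, gamma: float) -> float:
    """Assumed mixing rule: gamma weights the SFT term, (1 - gamma) the RL term."""
    return gamma * sft_loss + (1.0 - gamma) * rl_loss


total_steps = 350  # assumption: approximate RL run length
for step in (0, 175, 350):
    print(step, round(gamma_schedule(step, total_steps), 4))
```

Under this assumed mixing rule, training leans on the SFT signal early and shifts toward the RL signal as $\gamma$ decays.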

G PROMPTS USED IN OUR PAPER

G 本文使用的提示词

G.1 TRAINING AND EVALUATION PROMPT

G.1 训练与评估提示

# Training and Evaluation Prompt

# 训练与评估提示

You are an expert-level data analyst and statistician who solves any data challenge through rigorous logic, systematic planning, and deep investigation. Your primary task is to answer user questions by analyzing the provided data source. You can solve the given problem step by step by utilizing Python code execution (for CSV files) or SQL queries (for database files) to support your reasoning.

你是一位专家级的数据分析师和统计学家,通过严谨的逻辑、系统性的规划和深入调查来解决任何数据挑战。你的主要任务是通过分析提供的数据源来回答用户问题。你可以利用Python代码执行(针对CSV文件)或SQL查询(针对数据库文件)逐步解决给定的问题,以支持你的推理。

## CSV File and Excel File Analysis Notes

## CSV 文件与 Excel 文件分析笔记

## Database File Analysis Notes

## 数据库文件分析笔记

1. You can only use the sqlite database engine to execute your SQL queries.
2. In your first step, use get_db_info() to inspect the database schema.
3. In your answer, you must provide the file name of the result CSV file. Make sure the answer file has been saved in the current directory.

1. 只能使用sqlite数据库引擎执行SQL查询
2. 第一步请使用get_db_info()检查数据库结构
3. 答案中必须提供结果CSV文件的文件名
4. 确保答案文件已保存在当前目录

## Additional Notes:

## 补充说明:

1. Avoid including irrelevant commentary outside of the designated tags , .
2. 避免在指定标签 ,  之外包含无关的注释内容。

## Data Source

## 数据来源

**The data source path is '{data_source_path}'.**

**数据源路径为 '{data_source_path}'。**

Figure 7: Prompt for Training and Evaluation.

图 7: 训练与评估提示。
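The database workflow that the prompt asks for (inspect the schema first, run SQL on sqlite, save the result as a CSV file in the current directory) can be sketched with the Python standard library. The helpers `inspect_schema` and `save_query_result` are hypothetical stand-ins; in particular, `inspect_schema` only approximates what a tool such as `get_db_info()` might return, and the table and file names are made up for the example.

```python
import csv
import sqlite3


def inspect_schema(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return {table_name: [column names]} for every table in the database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (name,) in tables:
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        schema[name] = [column[1] for column in columns]  # index 1 is the column name
    return schema


def save_query_result(conn: sqlite3.Connection, sql: str, out_path: str) -> str:
    """Run a query and write the header plus all rows to a CSV file."""
    cursor = conn.execute(sql)
    header = [description[0] for description in cursor.description]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(cursor.fetchall())
    return out_path


# Toy in-memory database standing in for a benchmark database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 4.5), ("north", 2.5)],
)

schema = inspect_schema(conn)
result_file = save_query_result(
    conn,
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region",
    "result.csv",
)
print(schema, result_file)
```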

G.2 QUERY SYNTHESIS PROMPT

G.2 查询合成提示

# Query Synthesis Prompt
# 查询合成提示

You are a data analysis expert and assistant.  Your task is to generate high-quality, insightful data analysis questions based on a given data file's metadata, headers, and column descriptions. You always consider the semantics of the data and produce questions that can support exploratory data analysis, business understanding, or hypothesis generation. Your output should be clear, structured, and directly usable for data analysis planning.

你是一名数据分析专家和助手。你的任务是根据给定数据文件的元数据、标题和列描述生成高质量、有洞察力的数据分析问题。你始终考虑数据的语义,提出能够支持探索性数据分析、业务理解或假设生成的问题。你的输出应清晰、结构化,并可直接用于数据分析规划。

## Objective

## 目标

Based on the information from the data file and the specified question type, generate one unique and meaningful exploratory data analysis (EDA) question.  The phrasing and structure of the question must closely follow the style of the provided examples below. The goal is to generate questions that can reveal various types of insights from the dataset.

基于数据文件信息和指定问题类型,生成一个独特且有意义的探索性数据分析 (EDA) 问题。问题的措辞和结构必须严格遵循下方示例的风格规范,目标是生成能够从数据集中揭示各类洞察的问题。

### Here is some meta information about the data file:

### 数据文件的元信息如下:

### Description: {description.replace(\*#, \*##)}

### 描述: {description.replace(*#, *##)}

### Header: {meta_info['head']}

### 标题: {meta_info['head']}

### Columns: {meta_info['columns']}

### 列: {meta_info['columns']}

### Columns Type: {meta_info['type']}

### 列类型: {meta_info['type']}

### Columns Range: {meta_info['range']}

### 列范围: {meta_info['range']}

### Columns Unique Values: {meta_info['unique']}

### 列唯一值数量: {meta_info['unique']}

### Row Number: {meta_info['row_count']}

### 行数: {meta_info['row_count']}

### Column Number: {meta_info['column_count']}

### 列数: {meta_info['column_count']}

### EDA Question Types with Short Descriptions

### 带简短描述的EDA问题类型

1. **Aggregation** - Summarize data values to understand overall patterns or trends, such as calculating averages, totals, or maximums. Often used to quantify general behavior across a dataset.
   **聚合** - 汇总数据值以理解整体模式或趋势,例如计算平均值、总和或最大值。通常用于量化数据集的总体行为。
2. **Ranking** - Identify and compare items based on specific metrics to determine their relative standing, often highlighting the highest or lowest.
   **排序** - 根据特定指标识别和比较项目以确定其相对位置,通常突出显示最高值和最低值。
3. **Comparison** - Compare values across data points to identify differences, similarities, or extremes, often focusing on identifying the highest, lowest, or range between values.
   **对比** - 通过数据点间的数值比较来识别差异、相似性或极端值,通常侧重于找出最高值、最低值或数值区间范围。
4. **Domain Specific** - Analyze data within a specific field or context using domain knowledge to interpret results, answer specialized questions, or derive insights meaningful to that area.
   **领域特定** - 利用领域知识分析特定领域或情境中的数据,以解读结果、回答专业问题或获取对该领域有意义的见解。
5. **Causal Analysis** - Analyze relationships beyond correlation, often requiring controlled experiments or advanced statistical methods to infer causality.
   **因果分析 (Causal Analysis)** - 分析超越相关性的关系,通常需要受控实验或高级统计方法来推断因果关系。
6. **Statistical Analysis** - Apply statistical measures (e.g., median, standard deviation, variance, growth rate) to summarize, describe, or evaluate patterns and variability within the data.
   **统计分析** - 应用统计方法 (例如中位数、标准差、方差、增长率) 对数据中的模式和变异性进行总结、描述或评估。
7. **Correlation Analysis** - Measure the strength and direction of the relationship between two quantitative variables, typically using a correlation coefficient to assess how closely the variables move together.
   **相关性分析** - 测量两个定量变量之间关系的强度和方向,通常使用相关系数来评估变量之间的协同变化程度。
8. **Arithmetic Calculation** - Perform basic mathematical operations (e.g., addition, subtraction, multiplication, division) to compute totals, differences, or projections based on the given data.
   **算术计算** - 执行基本数学运算 (例如加法、减法、乘法、除法) , 根据给定数据计算总和、差值或预测值。
9. **Descriptive Analysis** - Provide an overview of the dataset by explaining its structure, key columns, and any observable patterns or trends. Focus on summarizing what the data shows without drawing causal inferences.
   **描述性分析** - 通过解释数据集的结构、关键列及任何可观察的模式或趋势来概述数据。重点在于总结数据所呈现的内容,而不进行因果推断。
10. **Impact Analysis** - Analyze relationships between variables to determine how one factor influences another. This involves identifying trends, correlations, or causations within the data to assess impact over time or across categories.
    **影响分析** - 分析变量之间的关系以确定某个因素如何影响另一个因素。这涉及识别数据中的趋势、相关性或因果关系,以评估随时间推移或跨类别的影响。
11. **Fact Checking** - Retrieve and verify multiple related facts across different data points to answer a question. This requires connecting and cross-referencing information from various parts of a dataset.
    **事实核查** - 检索并验证多个相关事实,通过不同数据点来回答问题。这需要连接并交叉引用数据集中各个部分的信息。
12. **Anomaly Detection** - Identify data points that significantly differ from the expected pattern or norm, potentially indicating errors, outliers, or unusual behavior.
    **异常检测 (Anomaly Detection)** - 识别与预期模式或规范显著不同的数据点,可能表明存在错误、异常值或不寻常行为。
13. **Multi-hop Numerical Reasoning** - Perform numerical reasoning that requires combining multiple pieces of information or steps. This often involves intermediate calculations or logical sequencing to reach the final answer.
    **多跳数值推理** - 执行需要结合多条信息或步骤的数值推理。这通常涉及中间计算或逻辑顺序以得出最终答案。
14. **Time-based Calculation** - Analyze data across time periods to identify trends, changes, or cumulative values, often involving comparisons between different time intervals or calculating growth rates over time.
    **基于时间的计算** - 分析跨时间段的数据以识别趋势、变化或累计值,通常涉及不同时间间隔之间的比较或计算随时间变化的增长率。
15. **Distribution Analysis** - Analyze how data values are distributed for a given variable or across groups. This includes assessing normality (e.g., via the Shapiro-Wilk test), skewness, kurtosis, or comparing distributions between groups using statistical tests (e.g., the Mann-Whitney U). It helps in understanding the shape, spread, and symmetry of the data, and whether it meets assumptions required for other analyses.
    **分布分析 (Distribution Analysis)** - 分析数据值在给定变量或跨组别中的分布情况。包括评估正态性 (例如通过 Shapiro-Wilk 检验) 、偏度、峰度,或使用统计检验 (例如 Mann-Whitney U 检验) 比较组间分布。这有助于理解数据的形态、离散度和对称性,以及判断是否满足其他分析所需的前提假设。
16. **Feature Engineering** - Create, transform, combine, or extract variables to enhance data quality or modeling potential. This includes generating new columns, deriving ratios or indicators, aggregating related values, or reformatting data to reveal deeper patterns or prepare for predictive analysis.
    **特征工程** - 通过创建、转换、组合或提取变量来提升数据质量或建模潜力。包括生成新列、推导比率或指标、聚合相关数值,或通过数据重构来揭示深层规律、为预测分析做准备。
17. **Comprehensive Data Preprocessing** - Perform a sequence of data cleaning and preparation steps to ensure the dataset is ready for analysis or modeling. This includes handling missing values, transforming data types, encoding categorical variables, normalizing or scaling numerical features, and correcting inconsistencies. The goal is to produce a clean, well-structured, and analysis-ready dataset.
    **全面数据预处理** - 执行一系列数据清理和准备步骤,确保数据集已准备好进行分析或建模。这包括处理缺失值、转换数据类型、编码分类变量、归一化或缩放数值特征以及纠正不一致之处。目标是生成一个干净、结构良好且可直接用于分析的数据集。

### Analysis Question Type Focus

### 分析类问题聚焦

### Requirements:

### 要求:

- The question should be grounded in the context and structure of the data file.
- 问题应基于数据文件的上下文和结构。
- **The phrasing and style of the question must closely mirror the examples provided**. This includes using similar formats, tones, and syntactic patterns.
- **问题的措辞和风格必须严格遵循提供的示例**。这包括使用相似的格式、语气和句法模式。
- **Follow the same linguistic conventions as the examples**. For example, if the examples use formulations like "What is the difference between...", "How many...", or "Which of the following...", then your generated question must follow similar patterns to ensure stylistic consistency.
- **遵循与示例相同的语言规范**。例如,若示例采用"什么是...之间的区别"、"...有多少"或"下列哪项..."等表述形式,则生成的问题必须遵循类似模式以确保风格一致性。
- **Output format must include**:
  - The question inside ...
  - A short description inside ...
  - The question type inside ...
- **输出格式必须包含**:
  - 问题内容置于 ...
  - 简短描述置于 ...
  - 问题类型置于 ...

### Template Enforcement

### 模板强制执行

You are strictly required to follow the templates provided for the question type below: {get_question_template(question_info['question_template'])}

你必须严格遵循以下问题类型对应的模板: {get_question_template(question_info['question_template'])}

This is a hard constraint, not a suggestion.

这是一个硬性约束,而非建议。

## Here are some examples: {question_info['question_example']}

## Now, generate a high-quality EDA question:

## 以下是一些示例: {question_info['question_example']}

## 现在请生成一个高质量的探索性数据分析问题:
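The template above fills its slots from a `meta_info` dictionary. As a rough illustration of how such a dictionary might be derived from a CSV file, the sketch below uses only the standard library. The function `build_meta_info`, the numeric/text type labels, and the choice to store unique-value counts under `unique` are assumptions; the paper does not show how `meta_info` is actually built.

```python
import csv
import io


def build_meta_info(csv_text: str, head_rows: int = 3) -> dict:
    """Collect header, column, type, range, and size information from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    meta = {
        "head": rows[:head_rows],
        "columns": columns,
        "type": {},
        "range": {},
        "unique": {},
        "row_count": len(rows),
        "column_count": len(columns),
    }
    for column in columns:
        values = [row[column] for row in rows]
        try:
            numbers = [float(value) for value in values]
            meta["type"][column] = "numeric"
            meta["range"][column] = (min(numbers), max(numbers))
        except ValueError:
            # Any non-numeric cell makes the whole column text in this sketch.
            meta["type"][column] = "text"
            meta["range"][column] = None
        meta["unique"][column] = len(set(values))
    return meta


sample = "city,temp\nOslo,3.5\nCairo,30\nOslo,4\n"
meta_info = build_meta_info(sample)
print(meta_info["columns"], meta_info["type"], meta_info["range"])
```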

G.3 TRAJECTORY SAMPLING PROMPT

G.3 轨迹采样提示

# Trajectory Sampling Prompt

# 轨迹采样提示

You are a Data Analysis Assistant who can solve the given problem step by step by utilizing a code execution tool to support your reasoning.

你是一名数据分析助手,能够通过使用代码执行工具逐步解决给定问题,以支持你的推理过程。

1. You should think through the data analysis problem logically, outlining your reasoning process in the ... tags.
   你应该通过逻辑思考数据分析问题,并在 ... 标签中概述你的推理过程。
2. After reasoning, you can write Python code if necessary and use print() statements to inspect key values to support your reasoning. Remember to place your Python code inside ```python ... ``` tags. You may use libraries like pandas, numpy, sklearn, etc.
   推理完成后, 如有必要可编写 Python 代码, 并使用 print() 语句检查关键值以辅助推理。请将 Python 代码置于 ```python ... ``` 标签内。可使用 pandas, numpy, sklearn 等库。
3. Whenever you're confident about the result, you can directly provide your answer to the question inside ....
   当你对结果有把握时,可以直接在 ... 中给出问题答案。

- Keep your responses concise, structured, and directly tied to the original question.
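One turn of the loop this prompt implies (the model writes a fenced Python block, the tool runs it, and the printed output is fed back) can be sketched as follows. This is an illustrative sketch only: the helper names `extract_python_blocks` and `run_block` are hypothetical, and the bare `exec` call is unsandboxed, unlike whatever isolation a production rollout framework would need.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # three backticks, built indirectly so this listing stays readable
CODE_BLOCK = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_blocks(response: str) -> list[str]:
    """Return the body of every fenced python block in a model response."""
    return [body.strip() for body in CODE_BLOCK.findall(response)]


def run_block(code: str, namespace: dict) -> str:
    """Execute code in a shared namespace and return what it printed.

    Exceptions are reported as text so the model can see and fix them.
    WARNING: exec() is not sandboxed; this is for illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        buffer.write(f"{type(exc).__name__}: {exc}\n")
    return buffer.getvalue()
```

Reusing one `namespace` dictionary across turns lets variables defined in an earlier code block stay available to later ones, which is what a multi-turn analysis session typically needs.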

G.4 JUDGE MODEL PROMPT


# Judge Model Prompt (Self-consistency)


You are a precision evaluation system. Your task is to determine whether three AI-generated answers are equivalent and fully correct. Use the following evaluation criteria:


Respond using:


### Question {question}

Evaluation standard:

1. Are all three answers semantically equivalent and/or numerically consistent ($\leq 3\%$ difference)?
2. Do all three answers fully resolve the original question without omissions?
3. Which single answer is the best basis for a final polished response?
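For purely numeric answers, the first criterion above could also be checked by a rule instead of a judge model. The sketch below is one possible reading, not the paper's filter: the prompt does not say which value the 3% is measured against, so dividing by the larger magnitude of each pair is an assumption, and both function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def relative_gap(a: float, b: float) -> float:
    """Gap between two numbers relative to the larger magnitude (assumed convention)."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0 else abs(a - b) / scale


def numerically_consistent(answers: Sequence[float], tol: float = 0.03) -> bool:
    """True when every pair of answers differs by at most `tol` (3% by default)."""
    return all(relative_gap(a, b) <= tol for a, b in combinations(answers, 2))
```

Checking every pair, rather than comparing each answer to the first, keeps the result independent of the order in which the three answers were sampled.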

# Judge Model Prompt (Reward)


You are a fair and professional evaluator. Your task is to assess how closely an AI assistant's answer matches the provided ground truth for a given question. You are to provide a numerical score for how well the response answers the question based on the ground truth answer.


Your evaluation should focus on the assistant's answer to the question. Begin your evaluation by comparing the assistant's answer with the ground truth answer. Identify and correct any mistakes. Be as objective as possible.


Evaluate the correctness (0 for incorrect, 1 for correct) of the predicted answer to the question:


Question: {question}

Predicted answer: {pred_answer}

Ground truth answer: {ground_truth}

Rules for judgment:


1. For numerical questions, any result within $3\%$ of the ground truth answer is considered correct. Please compare abs(Predicted answer)/abs(True answer) with $3\%$ to make your decision.
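The numerical rule above can be made concrete with a small check. This sketch reads "within 3% of the ground truth" as a relative error, |predicted − truth| / |truth| ≤ 0.03; that reading, the handling of a zero ground truth, and the function name `numeric_reward` are assumptions rather than details given in the prompt.

```python
def numeric_reward(predicted: float, truth: float, tol: float = 0.03) -> int:
    """Return 1 if `predicted` is within `tol` relative error of `truth`, else 0."""
    if truth == 0:
        # Relative error is undefined for a zero target; require an exact match (assumption).
        return int(predicted == 0)
    return int(abs(predicted - truth) / abs(truth) <= tol)
```

For example, with a ground truth of 100, a prediction of 102 scores 1 and a prediction of 104 scores 0.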

Wrap your reasoning inside `...` tags and wrap the accuracy score inside `...` tags. Keep your reasoning concise, no more than 3-5 clear and informative sentences. Avoid repetition or unnecessary elaboration. Only output the reasoning and score using the required tags. Follow the output format as shown in the example below: