[Paper Translation] Scaling Generalist Data-Analytic Agents




SCALING GENERALIST DATA-ANALYTIC AGENTS


ABSTRACT


Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering or multi-agent scaffolds over proprietary models, while open-source models still struggle with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DATAMIND, a scalable data synthesis and agent training recipe designed to construct generalist data-analytic agents. DATAMIND tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DATAMIND applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DATAMIND, we curate DATAMIND-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DATAMIND-12K, our DATAMIND-14B achieves state-of-the-art performance with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DATAMIND-7B also performs best among all open-source models with a score of 68.10%. We also report empirical insights gained from our exploratory trials in the analysis experiments, aiming to provide actionable guidance on agent training for the community. We will release DATAMIND-12K and DATAMIND-7B/14B for the community's future research.


(a) Task taxonomy used in DATAMIND for fine-grained and diverse query synthesis. (b) Performance comparison between proprietary models and open-source models on multiple datasets.

Figure 1: (a) Task Taxonomy. We categorize data analysis tasks into 18 fine-grained categories to enhance the diversity of our synthesized queries. (b) Performance Comparison. Our DATAMIND-14B achieves the best performance among all proprietary models and all open-source models, whether trained for data analysis or not.

1 INTRODUCTION


Large Language Models (LLMs) have demonstrated formidable performance on a wide spectrum of reasoning tasks spanning math, code, and science (DeepSeek-AI et al., 2025; Kimi et al., 2025; OpenAI, 2025a; Yang et al., 2025). As AI enters its second half (Yao, 2025), a surge of LLM agentic benchmarks targeting increasingly complex and domain-specific scenarios (Jimenez et al., 2024; Starace et al., 2025; Mialon et al., 2024; Phan et al., 2025; Wei et al., 2025a) is emerging. Among them, Automated Data Analysis (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Majumder et al., 2025), an essential pillar of AI for scientific research, plays a critical role in realizing Innovating AI and has shown promise in boosting research efficiency and accelerating scientific discovery (Chen et al., 2025b; Schmidgall et al., 2025; Lu et al., 2024; Chai et al., 2025).

Data-Analytic Agents process, model, and compute data by generating code to discover useful information or regular conclusions, thereby furnishing users with insights to support decision-making. However, existing data-analytic agents (Zhang et al., 2023; Hong et al., 2025; Li et al., 2024; Sun et al., 2025; Guo et al., 2024) are overwhelmingly built on proprietary models via prompt engineering and rely on predefined workflows or multi-agent scaffolds. The few open-source trained models (Wu et al., 2025b;c; Su et al., 2024) can only perform simple table understanding tasks (tables compact enough to fit into the prompt) and can easily break down when confronted with diverse-format, large-scale data files and long-horizon, multi-step reasoning demanded by real-world tasks.


Challenges. In this work, we propose to train a generalist, open-source data-analytic agent. This endeavor entails several intrinsic challenges that must be addressed: 1) Insufficient data resources. Training a specialized agent demands a large-scale, high-quality collection of tasks and corresponding solution trajectories. However, publicly available data analysis benchmarks often only provide a limited test set for evaluation purposes and lack step-by-step trajectory annotations, making it infeasible to assemble an effective training corpus from off-the-shelf resources. 2) Improper training strategy. Current agent training strategies typically follow an SFT-then-RL paradigm. Yet, in a new scenario, it remains unclear how to stabilize long-horizon agent training and how to allocate training steps across SFT and RL to achieve optimal performance. 3) Unstable code-based multi-turn rollout. Data files and code interpreters involve intricate memory management. Parallel agentic rollout and multi-turn code generation with limited memory resources will further exacerbate this situation.


The DATAMIND Pipeline. In response to the above challenges, we introduce DATAMIND, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. To construct a large-scale training corpus, we begin by harvesting a diverse collection of data files in various formats and domains from the Internet and open communities. Then, we apply a fine-grained task taxonomy (see Fig. 1a) and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of our synthesized queries. Next, we adopt a knowledge-augmented trajectory sampling strategy to improve both the validity and reliability of synthesized trajectories. A model-based judge performs self-consistency filtering on these trajectories, followed by rule-based checks. The judgment signal is also fed back to the model to encourage refinement, enriching the thinking patterns present in the final training set. During training, we combine the SFT loss and the RL loss with a dynamic coefficient that schedules the relative weight of SFT versus RL across training steps, allowing us to balance exploitation and exploration and thereby stabilize training. For parallel multi-turn rollout, we run agent generation and code execution asynchronously and use a chunk-wise code maintenance method to reduce peak memory usage. Moreover, we sandbox each trajectory in an isolated environment with strict caps on execution time and memory usage, enabling stable code-based multi-turn rollout.

Results and Insights. Through the DATAMIND pipeline, we curate DATAMIND-12K, a high-quality training set that spans diverse task categories and data file formats for data-analytic tasks. When trained on DATAMIND-12K, our 14B model, DATAMIND-14B, achieves a new state of the art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5 and surpassing all open-source models by a substantial margin (see Fig. 1b). Our DATAMIND-7B also performs best among all open-source models with a score of 68.10%. Our additional analysis studies yield three valuable insights for the community: 1) Self-consistency filtering is more non-trivial than the best trajectory selection; 2) SFT loss can be an effective stabilizer for RL training, but can also be the culprit of unstable training; 3) RL can narrow the performance gap between different base models, but can hardly reverse the order.


Figure 2: The Pipeline of DATAMIND. DATAMIND applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective including both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework.


2 PROBLEM DEFINITION


A data analysis task $u$ is typically represented as a quadruple $u=(q,f,d,a)$, comprising the user query $q$, the data file $f$, the data description $d$, and the answer $a$, where the data file $f$ may be provided in a variety of formats (.csv, .xlsx, .sqlite, etc.), and the data description $d$ is optional.

Our agent framework adheres to the prevailing ReAct (Yao et al., 2023) paradigm. Upon receiving a task, the agent is required to iterate through multiple rounds of Thought-Action-Observation cycles and finally produce an answer. In the data analysis scenario, Thought denotes the agent's reasoning and reflection process conditioned on the current context; Action refers to the agent's invocation of code to process and compute over the data files, or to the generation of the final answer. The code may be written in Python or SQL, depending on the data file format; Observation consists of the execution feedback returned by the environment (i.e., the code interpreter).

Given task $u$, let one Thought-Action-Observation loop be represented by $(\tau,\alpha,o)$. Then the agent's historical trajectory $h_{t}$ at time step $t$ can be denoted as:

$$
h_{t}=(u,\tau_{0},\alpha_{0},o_{0},\tau_{1},\alpha_{1},o_{1},\ldots,\tau_{t-1},\alpha_{t-1},o_{t-1}).
$$


Conditioned on the history trajectory $h_{t}$, the agent with parameters $\theta$ produces its next thought $\tau_{t}$ and action $\alpha_{t}$ according to the policy $\pi_{\theta}(\tau_{t},\alpha_{t}|h_{t})$ and receives an observation $o_{t}$ from the code interpreter after action $\alpha_{t}$ is executed. The whole trajectory terminates either when the agent emits an answer or when a predefined maximum number of rounds $T$ is reached. For simplicity, in the following sections, we denote the input provided to the agent (including $q$, $f$, and $d$) as $x$ and the trajectory (including answer $a$) sampled from the agent as $y\sim\pi_{\theta}(\cdot|x)$.
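The loop defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Step`, `run_episode`, and the `policy` and `interpreter` callables are hypothetical names, not part of the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: str        # a code snippet, or the final answer when is_answer is True
    observation: str   # interpreter feedback; empty for the final-answer step
    is_answer: bool = False


def run_episode(
    task: str,
    policy: Callable[[str, List[Step]], Tuple[str, str, bool]],
    interpreter: Callable[[str], str],
    max_rounds: int,
) -> Tuple[List[Step], Optional[str]]:
    """Extend the history until the policy answers or max_rounds is reached."""
    history: List[Step] = []
    for _ in range(max_rounds):
        # The policy sees the task and the full history h_t.
        thought, action, is_answer = policy(task, history)
        if is_answer:
            history.append(Step(thought, action, "", True))
            return history, action
        # Otherwise the action is code; the environment returns an observation.
        history.append(Step(thought, action, interpreter(action)))
    return history, None  # round budget exhausted without an answer
```

Returning `None` when the round budget runs out keeps the "no answer" case explicit, so a caller can score it separately from a wrong answer.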

3 SCALING DATA-ANALYTIC AGENT DATA


3.1 FILE COLLECTION AND QUERY SYNTHESIS


Data File Collection. First, we need a large number of raw data files $f$ to scale up the potential volume of synthesized tasks. Fortunately, the Internet and open community benchmarks already host a massive reservoir of such files. We first target Kaggle, which contains tens of thousands of .csv and .xlsx spreadsheets. Using the official Kaggle API, we crawl a diverse subset of files spanning multiple domains, and then discard files that i) cannot be loaded, ii) are extremely small (<20 rows) or large (>1,000 rows), or iii) contain anomalous data types. After this pipeline, we retain 3,400 .csv and 560 .xlsx files. For database files, we draw primarily from the training sets of BIRD (Li et al., 2023b) and OmniSQL (Li et al., 2025a), both of which are high-quality corpora widely used in the Text-to-SQL field. We sample from these sources and apply an analogous filtering pipeline, finally obtaining 1,954 .sqlite files.
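A minimal sketch of the first two filtering criteria for .csv files, using only the Python standard library. The function name `keep_csv` is hypothetical, and reading the size bounds as 20 and 1,000 data rows (header excluded) is an assumption. The third criterion (anomalous data types) is left out because the text does not define it.

```python
import csv
from pathlib import Path

MIN_ROWS, MAX_ROWS = 20, 1000  # bounds taken from the text; header excluded (assumption)


def keep_csv(path: Path) -> bool:
    """Return True if the file loads and its data-row count is within bounds."""
    try:
        with path.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.reader(fh))
    except (OSError, UnicodeDecodeError, csv.Error):
        return False  # criterion i: the file cannot be loaded
    n_data_rows = max(len(rows) - 1, 0)  # do not count the header row
    return MIN_ROWS <= n_data_rows <= MAX_ROWS  # criterion ii: size bounds
```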

Query Categorization and Synthesis. To generate specific queries, we devise an automated script to extract the meta-information $d$ of each data file, such as table headers, column names, data types, and representative rows, and then feed this metadata into DeepSeek-V3 (DeepSeek-AI, 2024) to synthesize queries $q$. To ensure both the diversity and the granularity of the generated questions, we refer to and refine the taxonomy in Wu et al. (2025b) and classify data analysis tasks into 18 fine-grained categories (see Fig. 1a). For each category, we carefully curate 4 to 6 exemplar queries that vary in complexity and domain and serve as few-shot demonstrations. Under the guidance of these type-specific contexts, every data file is used to generate a diverse set of queries that span the full spectrum of the proposed taxonomy. To further elevate query complexity, we adopt a recursive easy-to-hard composition scheme that chains multiple task types, i.e., the output of one task is fed as input to the next. By iterating 2 to 5 times, we progressively amplify the difficulty and create multi-hop analytic challenges that go well beyond the capability required by any single task type. The prompts for query synthesis can be found in Appx. G.2.
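The metadata extraction step might look like the sketch below for a .csv file. The paper's script is not shown, so `describe_csv`, `infer_type`, and the output layout are all hypothetical, and the type inference is deliberately crude.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Rough column type: 'int', 'float', or 'str'."""
    def all_cast(cast) -> bool:
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if values and all_cast(int):
        return "int"
    if values and all_cast(float):
        return "float"
    return "str"


def describe_csv(text: str, n_samples: int = 3) -> Dict[str, object]:
    """Collect column names, inferred types, and a few representative rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = {
        name: infer_type([r[i] for r in data if i < len(r)])
        for i, name in enumerate(header)
    }
    return {"columns": columns, "n_rows": len(data), "sample_rows": data[:n_samples]}
```

A dictionary like this can be serialized into the query-synthesis prompt, so the model sees the schema and a few rows without the whole file being loaded into context.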

3.2 EXPERT TRAJECTORY SAMPLING AND FILTERING

Knowledge-Augmented Trajectory Sampling. To guarantee the quality of the synthesized trajectories, we introduce a knowledge-augmented trajectory sampling framework. Initially, for each question category, we manually craft a high-level workflow $k$ that encodes procedural knowledge and steers the model during trajectory synthesis. To further boost answer quality, we impose a self-consistency filter. We sample $\mathcal{N}$ independent trajectories per query and employ a judge model $\mathcal{M}$ powered by GPT-4o-mini (OpenAI, 2024b) to verify whether their final answers are consistent with their reasoning rationales. Only trajectories that converge to the same answer are retained; among them, the judge model also selects the most concise and accurate one as our training instance $y$:

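$$
\{c,s,y\}=\mathcal{M}\left(\{y^{i}\}_{i=1}^{\mathcal{N}}\right),\quad \{y^{i}\}_{i=1}^{\mathcal{N}}\sim\pi_{\theta_{\mathrm{expert}}}(\cdot|k,x),\quad y=\begin{cases} y^{i}\in\{y^{i}\}_{i=1}^{\mathcal{N}}, & s=1 \\ \text{none}, & s=0 \end{cases} \tag{2}
$$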

where $c$ is the chain-of-thought process by which the judge model reaches the binary conclusion $s$ on whether the sampled trajectories are consistent. We use DeepSeek-V3.1 (DeepSeek-AI, 2025) as our expert policy model $\pi_{\theta_{\mathrm{expert}}}$. In our implementation, we set $\mathcal{N}=3$. The prompts used for trajectory sampling and for the judge model $\mathcal{M}$ can be found in Appx. G.3 and Appx. G.4, respectively. We extract the final answer from the trajectory as the final synthesized answer $a$ for the corresponding query $q$. However, this pipeline inherently biases us toward easier queries, whose answers are more likely to coincide. To counteract this, we refine the high-level workflow knowledge $k$ into more granular, step-by-step instructions for categories that exhibit low inter-trajectory consistency. Moreover, for trajectories that fail the consistency check, we feed the judge model's chain-of-thought back to the agent as external critique, prompting it to reflect and revise its reasoning path:

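$$
\{y_{\mathrm{reflected}}^{i}\}_{i=1}^{\mathcal{N}}\sim\pi_{\theta_{\mathrm{expert}}}\left(\cdot|k,x,\{y^{i}\}_{i=1}^{\mathcal{N}},c\right),\quad \text{if } s=0. \tag{3}
$$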

The reflected trajectories $\{y_{\mathrm{reflected}}^{i}\}_{i=1}^{\mathcal{N}}$ are fed into the judge model $\mathcal{M}$ again to repeat the consistency check and trajectory selection in Eqn. 2. This rescue loop not only salvages additional usable data but also enriches the diversity of thinking patterns embedded in the trajectory pool.
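The sample, judge, and reflect flow can be summarized as a small control loop. In the sketch below, `sample` and `judge` are hypothetical callables standing in for the expert policy and the judge model, and a single reflection pass is assumed because the text does not say how many rescue rounds are run.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = str
# judge(trajectories) -> (critique c, consistency flag s, selected trajectory or None)
Judge = Callable[[List[Trajectory]], Tuple[str, bool, Optional[Trajectory]]]
# sample(previous trajectories, critique) -> N new trajectories
Sampler = Callable[[Optional[List[Trajectory]], Optional[str]], List[Trajectory]]


def sample_with_rescue(sample: Sampler, judge: Judge) -> Optional[Trajectory]:
    """One sampling pass plus one reflection pass if the first is inconsistent."""
    candidates = sample(None, None)
    critique, consistent, chosen = judge(candidates)
    if consistent:
        return chosen
    # Feed the judge's chain-of-thought back as critique and re-sample.
    reflected = sample(candidates, critique)
    _, consistent, chosen = judge(reflected)
    return chosen if consistent else None  # None: the query is dropped
```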

Rule-based Trajectory Filtering. In addition to discarding inconsistent trajectories, we apply three further rule-based filtering stages. 1) Format compliance. We drop any trajectory that deviates from the ReAct format, ensuring that every remaining trajectory can be losslessly converted into our target training schema. 2) Length control. We filter out trajectories whose final answer exceeds 1,024 tokens, preventing the model from exploiting spurious hallucinations to artificially hit the correct string. 3) Linguistic integrity. We remove trajectories containing garbled text or intermingled natural languages, eliminating samples that could destabilize agent training. After the full filtering pipeline, we retain 11,707 high-quality trajectories, which we name DATAMIND-12K.
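The three rules could be approximated as follows. This is a rough sketch, not the paper's filter: the tag names follow the format described under Reward Design, whitespace-separated words stand in for model tokens, and the garbled-text check only looks for the Unicode replacement character. All three simplifications are assumptions.

```python
import re

MAX_ANSWER_TOKENS = 1024  # limit stated in the text
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def passes_rules(trajectory: str) -> bool:
    """Apply simplified format, length, and integrity checks to one trajectory."""
    # 1) Format compliance: reasoning is present and there is exactly one answer.
    answers = ANSWER_RE.findall(trajectory)
    if "<think>" not in trajectory or len(answers) != 1:
        return False
    # 2) Length control: whitespace tokens approximate model tokens here.
    if len(answers[0].split()) > MAX_ANSWER_TOKENS:
        return False
    # 3) Linguistic integrity: U+FFFD is a common sign of garbled text.
    if "\ufffd" in trajectory:
        return False
    return True
```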

4 SCALING DATA-ANALYTIC AGENT TRAINING


Dynamic Control Between SFT and RL. In this paper, we adopt a combined paradigm of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for the agent training. Empirically, we observe that it is difficult to strike a balance between the two stages: the model needs to absorb sufficient knowledge from expert data during SFT, yet excessive imitation often rigidifies exploration during RL. Hence, following Zhang et al. (2025b), we employ a hybrid strategy that dynamically blends on-policy and off-policy learning, allowing the training procedure to flexibly trade off between exploitation of expert knowledge and continued exploration.


Given the training dataset $\mathcal{D}$ , we express our SFT loss as:

$$
\mathcal{L}_{\mathrm{SFT}}(\theta)=-\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|}\mathbb{I}(y_{t}\notin o)\cdot\log\pi_{\theta}(y_{t}|x,y_{<t})\right], \tag{4}
$$

where $\mathbb{I}(y_{t}\notin o)$ is an indicator function that masks out any tokens produced by the environment feedback, ensuring that the model is optimized only on the agent-generated portion of the trajectory. For RL, we use the Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 2025) algorithm, minimizing the following function:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{DAPO}}(\theta)=&-\mathbb{E}_{(x,y)\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot|x)}\left[\frac{1}{\sum_{i=1}^{G}|y_{i}|}\sum_{i=1}^{G}\sum_{t=1}^{|y_{i}|}\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\ \mathrm{clip}\left(r_{i,t}(\theta),1-\varepsilon_{\mathrm{low}},1+\varepsilon_{\mathrm{high}}\right)\hat{A}_{i,t}\right)\right] \\
&\text{s.t.}\quad 0<\left|\{y_{i}\mid \texttt{is\_equivalent}(y,y_{i})\}\right|<G,
\end{aligned} \tag{5}
$$

where $\{y_{i}\}_{i=1}^{G}$ is a group of $G$ trajectories sampled from the agent policy $\pi_{\theta_{\mathrm{old}}}$ and $y$ is the expert trajectory. As with SFT, any tokens emitted by the environment are discarded when computing the objective. $r_{i,t}(\theta)$ denotes the per-token importance-sampling ratio, and $\hat{A}_{i,t}$ is the advantage of the $i$-th response, obtained by normalizing the group-level rewards $\{R_{i}\}_{i=1}^{G}$:

$$
r_{i,t}(\theta)=\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}|x,y_{i,<t})},\quad \hat{A}_{i,t}=\frac{R_{i}-\mathrm{mean}(\{R_{i}\}_{i=1}^{G})}{\mathrm{std}(\{R_{i}\}_{i=1}^{G})}. \tag{6}
$$

The inequality in Eqn. 5 serves as a filtering criterion: it discards trajectories that lack optimization utility (those from groups whose answers are all incorrect or all correct, i.e., whose rewards are uniformly 0 or uniformly 1) to prevent spurious gradient updates.
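To make the group-level quantities concrete, here is a small pure-Python sketch of the group filter, the normalized advantage, and the clipped per-token term. The function names are hypothetical, and using the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev
from typing import List


def keep_group(correct: List[bool]) -> bool:
    """Group filter: keep only groups with 0 < #correct < G."""
    return 0 < sum(correct) < len(correct)


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one rollout group.

    Call keep_group first: a kept group mixes correct and incorrect
    rollouts, so the standard deviation is non-zero.
    """
    mu, sigma = mean(rewards), pstdev(rewards)  # population std (assumption)
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio: float, adv: float, eps_low: float, eps_high: float) -> float:
    """Per-token surrogate: min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * adv, clipped * adv)
```

With a group size of 4 and rewards `[1, 0, 0, 1]`, the advantages come out as `[1, -1, -1, 1]`, while a group of four correct rollouts is rejected by `keep_group` before any advantage is computed.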

Finally, unlike the conventional SFT-then-RL pipeline, we jointly optimize the agent by combining the SFT and RL objectives with a dynamically balanced weighting factor:

$$
\mathcal{L}_{\mathrm{Final}}(\theta)=\gamma\,\mathcal{L}_{\mathrm{SFT}}(\theta)+(1-\gamma)\,\mathcal{L}_{\mathrm{DAPO}}(\theta), \tag{7}
$$

where $\gamma\in[0,1]$ varies dynamically throughout training. In our implementation, $\gamma$ is initialized to a large value so that the agent first acquires knowledge from expert data via the SFT loss, and is then annealed to a small value to encourage extensive exploration through RL. Please refer to §5.3 for our analysis of different $\gamma$ settings. Importantly, for any trajectory that is filtered out by the inequality in Eqn. 5, we compute only the SFT loss. To increase the likelihood of producing eligible trajectories during the early stage of RL training, we perform a cold start using DATAMIND-12K before the process described above. We also analyze the effect of the cold start on RL training in §5.3.
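A sketch of the annealing schedule and the blended objective, using the cosine decay from 0.9 to 0.05 reported in the experimental settings. The exact formula is an assumption (a standard half-cosine), and so is the handling of filtered groups: the sketch simply drops the DAPO term and keeps the $\gamma$-weighted SFT term, while the text only says that the SFT loss alone is computed.

```python
import math


def gamma_at(step: int, total_steps: int, peak: float = 0.9, valley: float = 0.05) -> float:
    """Cosine-anneal gamma from peak (first step) to valley (last step)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return valley + 0.5 * (peak - valley) * (1 + math.cos(math.pi * progress))


def final_loss(sft_loss: float, dapo_loss: float, gamma: float, group_kept: bool) -> float:
    """Blend the two losses; use only the SFT term when the group was filtered."""
    if not group_kept:
        return gamma * sft_loss  # assumption: DAPO term dropped, weight unchanged
    return gamma * sft_loss + (1 - gamma) * dapo_loss
```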

Void Turns Filtering. In multi-turn agentic training, the model can experience distributional drift caused by external feedback and compounding errors across turns during rollout, which easily results in trajectory collapse and thereby destabilizes RL training (Xue et al., 2025; Baronio et al., 2025; Mai et al., 2025). We also observe this phenomenon in our experiments. To stabilize training, we directly mask out the entire loss contributed by trajectories that contain void turns. Here, a void turn is defined as an agentic loop that fails to produce a valid code snippet or answer.
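Void-turn masking reduces to a per-trajectory loss weight. The sketch below assumes that each agent turn is available as a string and that "valid" only means a non-empty `<code>` or `<answer>` block is present; the tag names follow the format described under Reward Design, and both function names are hypothetical.

```python
import re
from typing import List

# A turn is treated as valid if it contains a non-empty code or answer block.
VALID_TURN = re.compile(r"<code>.+?</code>|<answer>.+?</answer>", re.DOTALL)


def has_void_turn(turns: List[str]) -> bool:
    """True if any agent turn yields neither a code snippet nor an answer."""
    return any(VALID_TURN.search(turn) is None for turn in turns)


def trajectory_loss_weights(batch: List[List[str]]) -> List[float]:
    """Weight 0.0 masks a trajectory's entire loss; 1.0 keeps it."""
    return [0.0 if has_void_turn(turns) else 1.0 for turns in batch]
```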

Agentic Code-based Multi-turn Rollout. A stable environment plays a key role in stable on-policy RL training. In data-analytic agent training, massive concurrent file I/O and code execution can easily lead to environment crashes, especially with limited memory resources. To prevent this, we implement three optimizations: 1) Asynchronous interaction. We run model generation and code execution asynchronously for different data samples, which decouples peak GPU and CPU memory demands and avoids simultaneous spikes in file I/O and code execution. 2) Chunked code maintenance. We implement a lightweight, notebook-style code generation strategy. The model only needs to produce the code snippet required for the current reasoning step, effectively reducing generation latency. Furthermore, whereas conventional notebook systems maintain a global variable pool, which is memory-intensive, we retain only the textual code chunks. At runtime, we concatenate the active snippet with its predecessors, yielding the same global execution effect without the memory overhead. 3) Security control. To ensure secure code execution, we isolate the runtime environment for each trajectory, enforce per-trajectory limits on CPU time and peak memory, and filter any snippet containing insecure function calls before execution. Additionally, we provide an automatic package-installation mechanism that dynamically checks for and installs missing Python packages.
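Chunked code maintenance can be illustrated with a session object that stores code as text and replays it in a fresh process on every turn. This is a sketch under several assumptions rather than the paper's implementation: `ChunkedSession` is a hypothetical name, only a wall-clock timeout is enforced (no memory cap or sandbox), failed snippets are not kept, and earlier chunks are assumed to print the same output on every replay so that their output can be stripped.

```python
import subprocess
import sys
from typing import List


class ChunkedSession:
    """Keeps code as text only; no interpreter state survives between turns."""

    def __init__(self, timeout_s: float = 10.0):
        self.chunks: List[str] = []
        self.replayed_stdout = ""  # output already shown for earlier chunks
        self.timeout_s = timeout_s

    def run(self, snippet: str) -> str:
        # Concatenate the new snippet with its predecessors and run from scratch.
        program = "\n".join(self.chunks + [snippet])
        try:
            proc = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True, text=True, timeout=self.timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "TimeoutError: execution exceeded the time limit"
        if proc.returncode != 0:
            return proc.stderr  # failed snippets are not replayed (assumption)
        new_output = proc.stdout[len(self.replayed_stdout):]
        self.chunks.append(snippet)
        self.replayed_stdout = proc.stdout
        return new_output
```

The trade-off is CPU time for memory: every turn re-executes earlier chunks, but no process or variable pool stays resident while the model is generating the next step.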

Reward Design. Our reward mainly comprises three components: the format reward $r_{\mathrm{format}}$, the answer reward $r_{\mathrm{answer}}$, and the length reward $r_{\mathrm{length}}$. The agent is required to enclose its reasoning process within <think>...</think> tags, place any generated data-processing code between <code> and </code>, and wrap its final answer in <answer>...</answer>. The environment's execution results are placed between <interpreter> and </interpreter>. For the answer reward, as many answers are descriptive and thus resist rule-based verification, we adopt a model-as-judge powered by GPT-4o-mini (OpenAI, 2024b). We engineer a dedicated LLM evaluation prompt, detailed in Appx. G.4. Both $r_{\mathrm{format}}$ and $r_{\mathrm{answer}}$ are binary, taking only the values 0 and 1. To mitigate the risk of the agent hacking the answer reward by hallucinating excessive tokens, we further impose a length-based penalty to discourage overly verbose outputs. We define the length reward and the final reward as:

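$$
R=\begin{cases} r_{\mathrm{length}}\cdot r_{\mathrm{answer}}, & r_{\mathrm{answer}}=1 \\ 0, & r_{\mathrm{format}}=1,\ r_{\mathrm{answer}}=0 \\ -0.1, & r_{\mathrm{format}}=0,\ r_{\mathrm{answer}}=0 \end{cases} \qquad r_{\mathrm{length}}=\begin{cases} 1, & l\leq l_{\min} \\ \frac{l_{\max}-l}{l_{\max}-l_{\min}}\cdot 0.5+0.5, & l_{\min}<l\leq l_{\max} \\ 0.5, & l_{\max}<l \end{cases} \tag{8}
$$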

We incentivize correct outputs: as long as the predicted answer matches the ground truth, the model receives a high reward $(\geq0.5)$. The specific value is length-dependent: we award the full reward if the answer length $l$ does not exceed $l_{\min}$; the reward decays linearly to 0.5 as the length grows from $l_{\min}$ to $l_{\max}$; any sequence longer than $l_{\max}$ receives a fixed reward of 0.5. Based on our observations, we set $l_{\min}$ and $l_{\max}$ to 256 and 1024, respectively, in our experiments.
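The reward definition above translates directly into code. The sketch uses hypothetical function names; the negative sign on the format-failure reward is an assumption, since the extracted text shows only the magnitude 0.1.

```python
L_MIN, L_MAX = 256, 1024  # values reported in the text


def length_reward(l: int, l_min: int = L_MIN, l_max: int = L_MAX) -> float:
    """1.0 up to l_min, linear decay to 0.5 at l_max, then a flat 0.5."""
    if l <= l_min:
        return 1.0
    if l <= l_max:
        return (l_max - l) / (l_max - l_min) * 0.5 + 0.5
    return 0.5


def final_reward(r_format: int, r_answer: int, l: int) -> float:
    """Combine the binary format/answer rewards with the length reward."""
    if r_answer == 1:
        return length_reward(l) * r_answer
    if r_format == 1:
        return 0.0
    return -0.1  # assumption: sign restored; the source shows only "0.1"
```

For example, a correct answer of length 640 sits halfway between the two thresholds and receives 0.75, while a correct answer of 2,000 tokens still earns 0.5, so correctness always outranks brevity.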

5 EXPERIMENTS


5.1 EXPERIMENTAL SETTINGS


Datasets and Metrics. We evaluate our model on three datasets related to data analysis: DABench (Hu et al., 2024), TableBench (Wu et al., 2025b), and BIRD (Li et al., 2023b). Our evaluation protocol aligns with our answer reward method: a judge model powered by GPT-4o-mini (OpenAI, 2024b) evaluates the correctness of the final answer. We report both pass@1 and pass@3 scores for all methods. Please refer to Appx. D for more details.

Models and Baselines. We compare our models with five strong proprietary models and four outstanding open-source models (see Tab.1). In addition, we select four open-source models that have been explicitly trained for data-analysis-related tasks: TableLLM (Wu et al., 2025b), Table-R1 (Wu et al., 2025c), OmniSQL (Li et al., 2025a), and SQL-R1 (Ma et al., 2025). We include Qwen2.5-Coder-7B and 14B (Hui et al., 2024) as our backbone models to compare different baselines. Detailed model information and reproduction protocols for all baselines are provided in Appx.E.

模型与基线。我们将自身模型与五个强大的专有模型及四个优秀的开源模型进行对比 (参见表1) 。此外,我们选取了四个经过数据分析相关任务显式训练的开源模型:TableLLM (Wu et al., 2025b) 、Table-R1 (Wu et al., 2025c) 、OmniSQL (Li et al., 2025a) 以及SQL-R1 (Ma et al., 2025) 。我们引入Qwen2.5-Coder-7B和14B (Hui et al., 2024) 作为骨干模型以比较不同基线。所有基线的详细模型信息与复现方案详见附录E。

Training and Inference Setups. We use LlamaFactory (Zheng et al., 2024) for SFT training and verl (Sheng et al., 2025) for RL training. For SFT, the learning rate is $1 \times 10^{-5}$ with a warmup ratio of 0.1 and a cosine decay schedule, and the global batch size is 16. For RL, we use a learning rate of $1 \times 10^{-6}$ and a batch size of 16 with a mini-batch size of 2. The rollout temperature is 0.7, the top-p is 1.0, and the group size $G$ is 4. We schedule $\gamma$ via cosine decay, annealing from a peak of 0.9 to a valley of 0.05. At test time, we fix the temperature to 0.7, the top-p to 0.95, and the inference batch size to 5 for all evaluations. Detailed hyperparameter information can be found in Appx. F.

训练与推理设置。我们使用 LlamaFactory (Zheng et al., 2024) 进行 SFT 训练,使用 verl (Sheng et al., 2025) 进行 RL 训练。对于 SFT,我们的学习率为 $1 \times 10^{-5}$,预热比例为 0.1,并采用余弦衰减调度。全局批次大小设置为 16。对于 RL,我们使用 $1 \times 10^{-6}$ 的学习率,批次大小为 16,小批次 (mini-batch) 大小为 2。推演温度为 0.7,top-p 值为 1.0,组大小 $G$ 为 4。我们通过余弦衰减调度 $\gamma$,从峰值 0.9 退火至谷值 0.05。在测试阶段,所有评估均固定温度 0.7、top-p 0.95,推理批次大小为 5。详细超参数信息可见附录 F。
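The cosine annealing of $\gamma$ from 0.9 to 0.05 can be sketched as follows. The exact way $\gamma$ enters the objective is defined by Eqn. 7 and is not reproduced here. The function name, the absence of a warmup phase, and annealing over the full number of training steps are assumptions made for illustration.

```python
import math


def gamma_at(step: int, total_steps: int,
             gamma_max: float = 0.9, gamma_min: float = 0.05) -> float:
    """Cosine-decayed SFT-loss weight.

    Assumption: gamma is annealed over the whole run with no warmup,
    starting at gamma_max (step 0) and ending at gamma_min.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return gamma_min + (gamma_max - gamma_min) * cosine


# Hypothetical 100-step run: strong SFT supervision early, weak late.
for s in (0, 25, 50, 75, 100):
    print(s, round(gamma_at(s, 100), 4))
```

The schedule keeps the SFT weight high while the policy is still unstable and lowers it smoothly afterwards, which is the behavior the analysis in Sec. 5.3 relies on.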

Table 1: Main Results. * indicates that the original paper does not report results for the corresponding model and we use their official data and code to train the model for reproduction. + denotes that we directly download their official trained model for fair evaluation. The best results for each model group are highlighted in bold.

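| Backbone | Method | DABench pass@1 | DABench pass@3 | TableBench pass@1 | TableBench pass@3 | BIRD pass@1 | BIRD pass@3 | Avg. pass@1 | Avg. pass@3 |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | |
| GPT-4o | ReAct | 76.39 | 84.44 | 64.97 | 75.06 | 50.20 | 62.39 | 63.85 | 73.96 |
| o4-mini | ReAct | 79.12 | 86.77 | 71.03 | 80.15 | 57.04 | 66.88 | 69.06 | 77.93 |
| DeepSeek-R1 | ReAct | 78.73 | 87.55 | 68.96 | 79.52 | 55.80 | 66.17 | 67.83 | 77.75 |
| DeepSeek-V3.1 | ReAct | 81.32 | 89.49 | 72.52 | 81.68 | 57.89 | 68.12 | 70.58 | 79.76 |
| GPT-5 | ReAct | 78.21 | 85.21 | 69.93 | 78.37 | 60.17 | 65.19 | 69.44 | 76.26 |
| Open-source Models | | | | | | | | | |
| Qwen-2.5-Coder-32B | ReAct | 73.15 | 81.32 | 61.11 | 72.26 | 41.20 | 60.17 | 58.49 | 71.25 |
| QwQ-32B | ReAct | 70.17 | 85.21 | 57.79 | 75.19 | 50.30 | 64.21 | 59.42 | 74.87 |
| Llama-3.3-70B | ReAct | 69.78 | 80.16 | 55.47 | 70.36 | 59.10 | 68.58 | 61.45 | 73.03 |
| Qwen-2.5-72B | ReAct | 75.33 | 86.38 | 65.44 | 76.21 | 60.30 | 69.49 | 67.02 | 77.36 |
| Qwen-2.5-Coder-7B | ReAct | 15.05 | 35.41 | 11.70 | 28.63 | 7.02 | 18.71 | 11.26 | 27.58 |
| Qwen-2.5-Coder-7B | TableLLM* | 36.71 | 71.98 | 41.01 | 70.36 | 11.99 | 16.75 | 29.90 | 53.03 |
| Qwen-2.5-Coder-7B | Table-R1* | 42.54 | 78.99 | 56.36 | 63.61 | 10.69 | 13.49 | 36.53 | 52.03 |
| Qwen-2.5-Coder-7B | OmniSQL+ | 26.46 | 36.19 | 39.95 | 50.25 | 57.11 | 66.30 | 41.17 | 50.91 |
| Qwen-2.5-Coder-7B | SQL-R1+ | 24.90 | 34.63 | 40.84 | 50.64 | 56.78 | 66.23 | 40.83 | 50.50 |
| Qwen-2.5-Coder-7B | DATAMIND | 77.30 | 87.94 | 67.60 | 79.39 | 59.41 | 69.88 | 68.10 | 79.07 |
| Qwen-2.5-Coder-14B | ReAct | 71.21 | 83.27 | 56.96 | 69.97 | 41.76 | 59.91 | 56.64 | 71.05 |
| Qwen-2.5-Coder-14B | TableLLM* | 38.26 | 74.71 | 46.44 | 76.08 | 20.99 | 28.88 | 35.23 | 59.89 |
| Qwen-2.5-Coder-14B | Table-R1 | 45.33 | 79.38 | 50.38 | 58.91 | 11.80 | 14.08 | 35.84 | 50.79 |
| Qwen-2.5-Coder-14B | OmniSQL+ | 26.46 | 39.30 | 41.98 | 52.67 | 58.80 | 67.41 | 42.41 | 53.13 |
| Qwen-2.5-Coder-14B | SQL-R1+ | 27.24 | 40.47 | 41.22 | 51.02 | 58.02 | 66.62 | 42.16 | 52.70 |
| Qwen-2.5-Coder-14B | DATAMIND | 80.29 | 88.72 | 70.95 | 81.81 | 62.23 | 70.21 | 71.16 | 80.25 |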

表 1: 主要结果。* 表示原始论文未报告相应模型的结果,我们使用其官方数据和代码训练模型进行复现。+ 表示我们直接下载其官方训练模型进行公平评估。每个模型组的最佳结果以粗体标出。

骨干网络 方法 DABench DABench TableBench TableBench BIRD BIRD Avg. Avg.
pass@1 pass@3 pass@1 pass@3 pass@1 pass@3 pass@1 pass@3
专有模型
GPT-4o ReAct 76.39 84.44 64.97 75.06 50.20 62.39 63.85 73.96
o4-mini ReAct 79.12 86.77 71.03 80.15 57.04 66.88 69.06 77.93
DeepSeek-R1 ReAct 78.73 87.55 68.96 79.52 55.80 66.17 67.83 77.75
DeepSeek-V3.1 ReAct 81.32 89.49 72.52 81.68 57.89 68.12 70.58 79.76
GPT-5 ReAct 78.21 85.21 69.93 78.37 60.17 65.19 69.44 76.26
开源模型
Qwen-2.5-Coder-32B ReAct 73.15 81.32 61.11 72.26 41.20 60.17 58.49 71.25
QwQ-32B ReAct 70.17 85.21 57.79 75.19 50.30 64.21 59.42 74.87
Llama-3.3-70B ReAct 69.78 80.16 55.47 70.36 59.10 68.58 61.45 73.03
Qwen-2.5-72B ReAct 75.33 86.38 65.44 76.21 60.30 69.49 67.02 77.36
Qwen-2.5-Coder-7B ReAct 15.05 35.41 11.70 28.63 7.02 18.71 11.26 27.58
Qwen-2.5-Coder-7B TableLLM* 36.71 71.98 41.01 70.36 11.99 16.75 29.90 53.03
Qwen-2.5-Coder-7B Table-R1* 42.54 78.99 56.36 63.61 10.69 13.49 36.53 52.03
Qwen-2.5-Coder-7B OmniSQL+ 26.46 36.19 39.95 50.25 57.11 66.30 41.17 50.91
Qwen-2.5-Coder-7B SQL-R1+ 24.90 34.63 40.84 50.64 56.78 66.23 40.83 50.50
Qwen-2.5-Coder-7B DATAMIND 77.30 87.94 67.60 79.39 59.41 69.88 68.10 79.07
Qwen-2.5-Coder-14B ReAct 71.21 83.27 56.96 69.97 41.76 59.91 56.64 71.05
Qwen-2.5-Coder-14B TableLLM* 38.26 74.71 46.44 76.08 20.99 28.88 35.23 59.89
Qwen-2.5-Coder-14B Table-R1 45.33 79.38 50.38 58.91 11.80 14.08 35.84 50.79
Qwen-2.5-Coder-14B OmniSQL+ 26.46 39.30 41.98 52.67 58.80 67.41 42.41 53.13
Qwen-2.5-Coder-14B SQL-R1+ 27.24 40.47 41.22 51.02 58.02 66.62 42.16 52.70
Qwen-2.5-Coder-14B DATAMIND 80.29 88.72 70.95 81.81 62.23 70.21 71.16 80.25

5.2 MAIN RESULTS

5.2 主要结果

As shown in Tab. 1, our 7B model, DATAMIND-7B, achieves the best performance among all open-source models with an average score of $68.10\%$. Our 14B model, DATAMIND-14B, attains an average score of $71.16\%$ across all tasks, surpassing all proprietary models (including the latest GPT-5 and DeepSeek-V3.1) as well as all open-source alternatives. Moreover, our DATAMIND series models demonstrate robust mastery of diverse data formats and exhibit balanced performance across all datasets. By contrast, specialized models degrade sharply when confronted with unseen data. For example, OmniSQL-7B reaches $57.11\%$ on BIRD, yet its performance on TableBench and DABench drops steeply. Note that to ensure a fair evaluation, we have converted all tables in these two benchmarks into .sqlite files. Nevertheless, SQL-oriented models still underperform. This observation underscores the breadth of query types and file formats covered by DATAMIND-12K. Furthermore, TableLLM and Table-R1 are limited to small-scale tables. When evaluated on DABench's large-scale tables, they fail to generalize, and their accuracy deteriorates even further on BIRD's multi-table analysis. These results highlight our model's capacity to handle complex tabular data, which can be attributed to the difficulty distribution embedded in DATAMIND-12K. Moreover, all trained baselines are exposed to significantly larger training corpora than ours (20K instances for TableLLM and Table-R1, and 2.5M for OmniSQL and SQL-R1, versus only 12K for DATAMIND), yet we outperform them even on the benchmarks they specialize in. This gain is attributable to the high-quality reasoning trajectories curated in DATAMIND-12K and our stable training strategy. Our model also maintains a high pass@3 score, indicating that it preserves strong generation diversity while ensuring reliability.

如表 1 所示, 我们的 7B 模型 DATAMIND-7B 以 $68.10\%$ 的平均得分在所有开源模型中表现最佳。我们的 14B 模型 DATAMIND-14B 在所有任务中取得了 $71.16\%$ 的平均得分, 超越了所有闭源模型 (包括最新的 GPT-5 和 DeepSeek-V3.1) 以及所有开源替代方案。此外, 我们的 DATAMIND 系列模型展现出对多样化数据格式的稳健掌握能力, 并在所有数据集上表现出均衡的性能。相比之下, 专用模型在面对未见数据时性能急剧下降。例如, OmniSQL-7B 在 BIRD 上达到 $57.11\%$, 但其在 TableBench 和 DABench 上的表现大幅下滑。需要注意的是, 为确保公平评估, 我们已将这两个基准测试中的所有表格转换为 .sqlite 文件。尽管如此, 面向 SQL 的模型仍然表现不佳。这一观察结果印证了 DATAMIND-12K 所涵盖的查询类型和文件格式的广度。此外, TableLLM 和 Table-R1 仅限于处理小规模表格。当在 DABench 的大规模表格上进行评估时, 它们无法泛化, 在 BIRD 的多表分析中其准确性更是进一步恶化。这些结果凸显了我们模型处理复杂表格数据的能力, 这归因于 DATAMIND-12K 中嵌入的难度分布。值得注意的是, 所有经过训练的基线模型所使用的训练语料规模都显著大于我们的模型 (TableLLM 和 Table-R1 使用 20K 实例, OmniSQL 和 SQL-R1 使用 2.5M 实例, 而 DATAMIND 仅使用 12K), 但即使在它们擅长的基准测试上, 我们仍然表现更优。这一优势得益于 DATAMIND-12K 中精心设计的高质量推理轨迹以及我们稳定的训练策略。我们的模型还保持了较高的 pass@3 得分, 表明它在确保可靠性的同时保持了强大的生成多样性。
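The Avg. column in Tab. 1 is consistent with an unweighted mean of the three benchmark scores. The short check below illustrates this for a few pass@1 rows copied from the table; the dictionary layout and the helper name are only illustrative.

```python
# pass@1 scores (DABench, TableBench, BIRD) copied from Tab. 1.
PASS_AT_1 = {
    "DATAMIND-7B": (77.30, 67.60, 59.41),
    "DATAMIND-14B": (80.29, 70.95, 62.23),
    "DeepSeek-V3.1": (81.32, 72.52, 57.89),
    "GPT-5": (78.21, 69.93, 60.17),
}


def average_score(scores):
    """Unweighted mean over benchmarks, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for name, scores in PASS_AT_1.items():
    print(f"{name}: {average_score(scores):.2f}")
```

Because the mean is unweighted, each benchmark contributes equally regardless of its number of test questions.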

5.3 ANALYSIS

5.3 分析

Self-consistency filtering matters more than best-trajectory selection. In Fig. 3, we analyze the impact of the self-consistency trajectory filtering and best trajectory selection strategies through SFT on the 7B model. Removing the self-consistency filtering (non-con) inflicts the most pronounced degradation on model performance: both pass@1 and pass@3 drop to varying extents across all datasets. This observation suggests that the quality of the answers produced by a trajectory is a critical guarantee of the trajectory's overall quality. Provided that the final answers are consistent, we observe that randomly selecting a single trajectory for training is not necessarily worse than explicitly choosing the best one, and it even yields a clear improvement on DABench. We hypothesize that the judge model's preference bias may reduce trajectory diversity. This conjecture is further supported by the pass@3 scores of random-select, which are on par with or superior to those of con-select across all three datasets. Moreover, the largest performance gains are obtained by including, without any selection, every trajectory that converges to a consistent answer. This pattern holds across all datasets and indicates that the diversity of reasoning patterns and problem-solving strategies embedded in the trajectories is more beneficial to the model's reasoning capability, which aligns with the findings in Guha et al. (2025), although we cannot fully rule out the contribution of the larger training volume introduced by this unfiltered approach.

自洽性过滤比最佳轨迹选择更为关键。图3中,我们通过SFT分析了自洽性轨迹过滤和最佳轨迹选择策略对7B模型的影响。显然,移除自洽性过滤(non-con)会导致模型性能出现最显著的下降:在所有数据集上,pass@1 和 pass@3 指标均出现不同程度降低。这一观察表明,轨迹生成答案的质量是保证轨迹整体质量的关键。只要最终答案保持一致,我们观察到随机选择单条轨迹进行训练并不一定比显式选择最佳轨迹更差,甚至在DABench上产生了明显提升。我们推测评判模型的偏好偏差可能会降低轨迹多样性。这一猜想可进一步通过random-select的 pass@3 得分得到印证——在全部三个数据集中,该策略的表现均持平或优于con-select。此外,最大性能增益来自于不加选择地纳入所有收敛到一致答案的轨迹。该模式在所有数据集中均成立,表明轨迹中蕴含的推理模式和解题策略的多样性对模型推理能力更具增益,这与Guha等人 (2025) 的研究发现一致,尽管我们无法完全排除这种无过滤方法因训练数据量增加所带来的贡献。


Figure 3: Analysis of Self-Consistency Filtering and Best Trajectory Selection. Con-select is our original setting, which includes self-consistency filtering and best trajectory selection by a judge model $\mathcal{M}$. Non-select uses all the sampled trajectories that pass self-consistency filtering, without selecting the best one. Random-select randomly selects a trajectory instead of the best one. Non-con directly uses all the synthesized trajectories without self-consistency filtering.

图 3: 自洽性过滤与最优轨迹选择分析。Con-select 为我们的原始设置,包含自洽性过滤和通过评判模型 $\mathcal{M}$ 进行的最优轨迹选择。Non-select 使用所有采样轨迹但不进行最优选择。Random-select 表示随机选择轨迹而非最优选择。Non-con 直接利用所有合成轨迹而不进行自洽性过滤。
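The four settings compared in Fig. 3 can be sketched as one selection routine applied to the trajectories sampled for a single query. This is a rough illustration under stated assumptions, not the authors' pipeline. Agreement is approximated by a majority vote over final answers, queries without at least two agreeing answers are dropped, and `judge_score` is a hypothetical stand-in for the judge model $\mathcal{M}$.

```python
import random
from collections import Counter


def consistent_subset(trajs):
    """Keep trajectories whose final answer matches the most common one.

    Assumption: self-consistency is approximated by majority vote; if no
    two trajectories agree, the query contributes nothing.
    """
    counts = Counter(answer for _, answer in trajs)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count < 2:
        return []
    return [t for t in trajs if t[1] == top_answer]


def select(trajs, strategy, judge_score=None, rng=None):
    """trajs: list of (trajectory_text, final_answer) for one query."""
    if strategy == "non-con":        # no self-consistency filtering
        return list(trajs)
    kept = consistent_subset(trajs)
    if not kept or strategy == "non-select":   # keep every consistent one
        return kept
    if strategy == "random-select":  # one consistent trajectory at random
        rng = rng or random.Random(0)
        return [rng.choice(kept)]
    if strategy == "con-select":     # judge picks the best consistent one
        return [max(kept, key=judge_score)]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy example; the scoring function (prefer shorter text) is a placeholder.
sampled = [("a long trajectory", "42"), ("short", "42"), ("x", "7")]
print(select(sampled, "con-select", judge_score=lambda t: -len(t[0])))
```

Seen this way, non-select and random-select differ from con-select only in the final step, which is why the comparison isolates the effect of the judge model's preference.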


Figure 4: The Influence of SFT Loss on RL Training. $\gamma=0$ denotes the absence of SFT loss, $\gamma=0.2$ corresponds to a low SFT-loss weight, and dynamic $\gamma$ indicates our default setting.

图 4: SFT损失对强化学习训练的影响。$\gamma=0$ 表示未使用SFT损失,$\gamma=0.2$ 对应较低SFT损失权重,动态 $\gamma$ 代表我们的基础设置。

SFT loss is an effective stabilizer for RL training. While our experiments were still in an exploratory phase, we used DATAMIND-12K to examine how the weight of the SFT loss in Eqn. 7 influences RL training on the 7B model without a cold start. In Fig. 4, we plot the dynamics of the answer reward across training steps under different $\gamma$ settings. As can be seen, when no SFT loss is imposed ($\gamma=0$), the answer reward declines almost monotonically. We attribute this failure to two factors. First, the 7B model's limited multi-step reasoning capability makes it difficult to roll out high-quality trajectory groups for effective learning. Second, the heterogeneity of both data structures and programming languages yields highly imbalanced trajectory distributions, resulting in unstable training. Raising $\gamma$ to 0.2 alleviates the problem to some extent: the answer reward initially rises despite large oscillations, yet the SFT loss remains too weak to prevent the policy from eventually drifting away and collapsing. Under our dynamic $\gamma$ schedule, the model first benefits from the stabilizing supervision of a strong SFT loss; after it matures, the SFT coefficient is gradually annealed to encourage exploration, yielding stable training throughout the whole process.

SFT损失是强化学习训练的有效稳定器。在实验仍处于探索阶段时,我们使用DATAMIND-12K数据集检验公式7中SFT损失的权重对70亿参数模型非冷启动强化学习训练的影响。图4展示了不同$\gamma$设置下答案奖励随训练步数的动态变化。可以看出,当不施加SFT损失时$(\gamma=0)$,答案奖励几乎单调递减。我们将此失败归因于两个因素:首先,70亿参数模型有限的多步推理能力难以生成高质量的轨迹组进行有效学习;其次,数据结构和编程语言的异质性导致轨迹分布高度不平衡,造成训练不稳定。将$\gamma$提升至0.2能在一定程度上缓解问题——尽管存在较大振荡,答案奖励初始阶段仍会上升,但SFT损失仍过于微弱,无法阻止策略最终偏离并崩溃。在我们的动态$\gamma$调度机制下,模型首先受益于强SFT损失的稳定监督,待其成熟后逐渐退火SFT系数以鼓励探索,从而在整个过程中实现稳定训练。

SFT loss can also be the culprit of unstable training. Although SFT loss serves as an effective stabilizer for RL training, we find that its persistent dominance throughout training can conversely trigger collapse. As shown in Fig. 5, fixing $\gamma$ at a high level also causes the answer reward to rise briefly and then decline gradually. The underlying reason is that overfitting to the SFT loss traps the policy in the rigid thinking patterns embedded in the expert trajectories, especially when these trajectories are synthesized from the same model, thereby crippling exploration. To corroborate this, we track the entropy of the policy during training and observe a pronounced entropy collapse. In contrast, our dynamic $\gamma$ strategy keeps the policy entropy at a relatively high level throughout training. Overall, we find that the training process resembles raising a child. During early childhood, constant parental guidance (a large $\gamma$) is indispensable to keep the child from going astray. As the child grows up, excessive supervision stifles the child's innate drive for self-directed exploration. At that stage, judiciously letting go (a small $\gamma$) enables the child to discover their true capabilities through feedback from the surrounding world.

SFT损失也可能是训练不稳定的罪魁祸首。尽管SFT损失作为强化学习训练的有效稳定器,但我们发现其在训练过程中持续占据主导地位反而会引发崩溃。如图5所示,将$\gamma$固定在高位同样会导致答案奖励短暂上升后逐渐下降。根本原因在于对SFT损失的过度拟合会使策略陷入专家轨迹中固化的思维模式,当这些轨迹源自同一模型时尤为严重,从而阻碍探索过程。为验证这一点,我们追踪训练过程中的策略熵值,观察到明显的熵崩溃现象。相比之下,我们的动态$\gamma$策略能使策略熵在整个训练期间持续保持在较高水平。总体而言,我们发现训练过程如同养育孩童:幼年时期持续的家长引导(较大$\gamma$)对防止孩子误入歧途至关重要;随着孩子成长,过度监督会压制其与生俱来的自主探索动力,此时适时放手(较小$\gamma$)能让孩子通过环境反馈发掘自身真正潜力。
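Policy entropy, the quantity tracked in Fig. 5, is typically measured as the Shannon entropy of the next-token distribution averaged over generated tokens. The sketch below shows that computation from raw logits in plain Python. The function names are illustrative and the averaging scheme is an assumption, since the section does not state how the entropy is aggregated.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_logits):
    """Average entropy over all token positions of a rollout.

    Assumption: a simple unweighted mean over positions.
    """
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)


# A uniform distribution has maximal entropy; a peaked one is close to zero.
print(token_entropy([0.0, 0.0, 0.0, 0.0]))
print(token_entropy([50.0, 0.0, 0.0]))
```

An entropy curve that falls toward zero during training signals that the policy has become nearly deterministic, which is the collapse described above.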

RL can narrow the performance gap between different base models, but can hardly reverse the order. Fig. 6 shows how different degrees of cold start affect subsequent RL training of the 7B model. For rapid empirical verification, we randomly sample 3,843 training samples from DATAMIND-12K (balanced across query types) and 240 test samples (80 for each of the three test datasets) for evaluation. As the number of cold start training epochs increases, the marginal gain achieved by RL over the cold start checkpoint (i.e., the slope of the dashed line) diminishes. This indicates that RL can narrow the performance gap between different base models (Liu et al., 2025). Nevertheless, although the gap is narrowed, post-RL performance remains positively correlated with the capability of the base model. This suggests that the bulk of knowledge is acquired during SFT, whereas RL primarily serves to unlock latent potential rather than explicitly push the model beyond its inherent capacity boundary (Yue et al., 2025a; Chu et al., 2025). Setting aside overfitting, the current trend suggests that a crossover point may emerge at which a sufficiently strong cold start leaves no room for further improvement via RL. Whether such a point truly exists, and, if it does, what fundamental mechanisms (e.g., saturation of the policy space, diminishing exploratory signal, or intrinsic limitations of the reward model) render further RL ineffective, constitutes an important open question for future work.

强化学习 (RL) 可以缩小不同基础模型之间的性能差距,但很难逆转优劣顺序。图 6 显示了不同程度的冷启动对 7B 模型后续强化学习训练的影响。为快速进行实证验证,我们从 DATAMIND-12K 中随机抽取 3,843 条训练数据(按查询类型平衡)和 240 条测试数据(三个测试数据集各 60 条)进行评估。随着冷启动训练轮次的增加,强化学习相对于冷启动检查点获得的边际收益(即虚线斜率)逐渐减小。这表明强化学习可以缩小不同基础模型之间的性能差距 (Liu et al., 2025)。然而,尽管差距缩小,后强化学习性能仍与基础模型能力呈正相关。这表明主要知识是在监督微调 (SFT) 期间获得的,而强化学习主要起释放潜在能力的作用,而非显式推动模型突破其固有能力边界 (Yue et al., 2025a; Chu et al., 2025)。若不考虑过拟合,当前趋势表明可能会出现一个交叉点:当冷启动足够强时,强化学习将无法带来进一步改进。这样的临界点是否真实存在?如果存在,是何种根本机制(例如策略空间饱和、探索信号衰减或奖励模型的内在局限性)导致强化学习失效?这构成了未来工作中重要的开放性问题。


Figure 5: Answer Reward and Entropy Dynamics under Different $\gamma$ Settings.

图 5: 不同 $\gamma$ 设置下的答案奖励与熵动态。

6 RELATED WORK

6 相关工作

Agent Training. The earliest wave of LLM Agents (Wang et al., 2023; Xi et al., 2023) leverages the formidable reasoning capabilities of proprietary models (Qiao et al., 2023; Chen et al., 2025a; Yao et al., 2023; Zhou et al., 2023; Hong et al., 2024; Li et al., 2023a; Wu et al., 2023). As AI entered the second half (Yao, 2025), numerous benchmarks targeting complex, domain-specific agentic tasks were introduced (Mialon et al., 2024; Phan et al., 2025; Jimenez et al., 2024; Chan et al., 2025; Starace et al., 2025; Wei et al., 2025a). These benchmarks expose the limitations of general-purpose agent architectures and make domain-specific agent training a critical necessity. The release of Large Reasoning Models (OpenAI, 2024c; DeepSeek-AI et al., 2025; Kimi et al., 2025) marks the triumph of Reinforcement Learning (RL) for LLMs. Consequently, a surge of work has sought to adapt RL algorithms to various agent domains (Jin et al., 2025a; Song et al., 2025; Li et al., 2025c; Wu et al., 2025a; Feng et al., 2025; Qian et al., 2025; Li et al., 2025b). Yet these methods presuppose a strong backbone model; researchers are therefore compelled to synthesize copious post-training data to compensate for the backbone's deficiencies. To the best of our knowledge, we are the first to systematically investigate the scaling of agent post-training in the data-analytic scenario, aiming to provide actionable insights for data synthesis and RL-driven training in other complex agent fields.

智能体训练。早期的大语言模型智能体浪潮 (Wang et al., 2023; Xi et al., 2023) 利用了专有模型强大的推理能力 (Qiao et al., 2023; Chen et al., 2025a; Yao et al., 2023; Zhou et al., 2023; Hong et al., 2024; Li et al., 2023a; Wu et al., 2023)。随着人工智能进入下半场 (Yao, 2025),针对复杂领域特定智能体任务的众多基准测试被引入 (Mialon et al., 2024; Phan et al., 2025; Jimenez et al., 2024; Chan et al., 2025; Starace et al., 2025; Wei et al., 2025a),这些基准暴露了通用智能体架构的局限性,将领域特定智能体训练提升到关键必要性的高度。大型推理模型 (OpenAI, 2024c; DeepSeek-AI et al., 2025; Kimi et al., 2025) 的发布标志着强化学习 (RL) 在大语言模型领域的胜利。随之涌现出大量工作致力于将强化学习算法适配到各种智能体领域 (Jin et al., 2025a; Song et al., 2025; Li et al., 2025c; Wu et al., 2025a; Feng et al., 2025; Qian et al., 2025; Li et al., 2025b)。然而这些方法都预设了强大的骨干模型;研究人员因此被迫合成大量后训练数据以弥补骨干模型的不足。据我们所知,我们是首个在数据分析场景中系统研究智能体后训练扩展的工作,旨在为其他复杂智能体领域的数据合成和强化学习驱动训练提供可操作的见解。


Figure 6: Performance Gap Between Cold Start and RL with varying cold start epochs.

图 6: 冷启动与强化学习在不同冷启动周期下的性能差距。

Data-Analytic Agents and Benchmarks. Data analysis agents harness the reasoning capabilities and code-generation facility of LLMs to automate the end-to-end processing of data analysis tasks. Virtually all existing data analysis agents rely on closed-source models and are limited to prompt engineering. DS-Agent (Guo et al., 2024) incorporates human insights into data analysis tasks via case-based reasoning. AutoKaggle (Li et al., 2024) decomposes the data analysis pipeline into specialized sub-tasks through a multi-agent architecture. Data-Copilot (Zhang et al., 2023) and AgenticData (Sun et al., 2025) stabilize agent behavior by orchestrating operations within predefined workflows. Data Interpreter (Hong et al., 2025) further enlarges the agent's exploration space by introducing dynamic graph-based workflows. To foster progress in this domain, numerous data analysis datasets have been introduced (Hu et al., 2024; Jing et al., 2025; Liu et al., 2024; Zhang et al., 2025a; Majumder et al., 2025)