Data Formulator 2: Iteratively Creating Rich Visualizations with AI

CHENGLONG WANG, Microsoft Research, USA; BONGSHIN LEE, Yonsei University, Korea; STEVEN DRUCKER, Microsoft Research, USA; DAN MARSHALL, Microsoft Research, USA; JIANFENG GAO, Microsoft Research, USA

Fig. 1. With Data Formulator 2, analysts can navigate the iteration history in Data Threads and select previous designs to be reused towards new ones; then, using Concept Encoding Shelf, analysts specify their chart design using blended UI and natural language inputs, delegating data transformation effort to AI. When new charts are created, data threads are updated for future reference. Data Formulator 2 is available at https://github.com/microsoft/data-formulator.


To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification. Doing so requires not only proficiency in data transformation and visualization tools but also effort to manage a branching history consisting of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs’ code generation ability. However, these systems do not work well for iterative visualization authoring, because they often require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization task to be performed, which is unrealistic for both users and models in many cases. In this paper, we present Data Formulator 2, an LLM-powered visualization system that addresses these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformation is delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don’t need to start from scratch every time. In a user study with eight participants, we observed that Data Formulator 2 allows participants to develop their own iteration strategies to complete challenging data exploration sessions.

1 Introduction


From an initial design idea, data analysts often need to go back and forth on a variety of charts before reaching their goals. Throughout this iterative process, besides updating the chart specifications, analysts face the challenge of transforming and managing different data formats to support these visualization designs. Iterative chart authoring is prevalent in exploratory data analysis [42], where analysts often discover new directions from initial charts. For example, after noticing that the line chart of renewable energy percentage in Figure 1 is too dense for comparing different countries’ trends, the analyst may want to filter it to show only the top 5 CO2 emitters’ trends, or visualize the ranks of these countries each year instead. These two designs need different data transformations: the former requires filtering the data by each country’s aggregated CO2 emission values, and the latter requires partitioning the data by year to compute each country’s ranking. Similar challenges arise in the data-driven storytelling context [40, 41], where authors need to derive new data to refine chart designs (e.g., annotation). For example, to highlight which countries are leading in renewable energy adoption, the author would superimpose a trend line of global median adoption rates over the line chart; the author may later convert the chart into small multiples to tell a story about the most sustainable countries from each continent. Again, these new designs require data transformation from the current results.

Managing different data and chart designs together in these iterative authoring processes is challenging. As analysts come up with new chart designs, they need not only to understand the data format expected by the chart and tool, but also to know how to use diverse transformation operators (e.g., reshaping, aggregation, window functions, string processing) in data transformation tools or libraries to prepare the data. Many AI-powered tools have been developed to tackle these visualization challenges (e.g., [2, 9, 26, 31, 54, 55]). These tools let users describe their goals using natural language, and they leverage the underlying AI models’ code generation ability [1, 5] to automatically write code that transforms the data and creates the visualization. Despite their success, current tools do not work well in the iterative visualization authoring context. Most of them require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization authoring task to be performed, which is usually unrealistic for both users and models.

To overcome these limitations, we design a new interaction approach for iterative chart authoring. Our key idea is to blend GUI and natural language (NL) inputs so that users can specify charts both precisely and flexibly, and to give users an interface for controlling contexts, so that they can navigate and reuse previous designs towards new ones instead of starting from scratch each time. We realize the design with the concept encoding shelf for specifying charts beyond data format constraints and data threads for managing the user’s non-linear authoring history (Figure 1).

Chart specification with blended UI and NL inputs. Resembling shelf-configuration UIs [40, 55], the concept encoding shelf allows users to drag existing data fields they wish to visualize and drop them onto visual channels to specify chart designs. Unlike those UIs, the concept encoding shelf also lets users type new data field names into the chart configuration to express their intent to visualize fields that do not yet exist and must come from transformed data. They can then provide a supplemental NL instruction to explain the new fields and ask the AI to transform the data and instantiate the chart. This blended UI and NL approach to chart specification makes user inputs both precise and flexible. Since Data Formulator 2 can precisely extract the chart specification from the encoding shelf, the user doesn’t need a verbose prompt to explain the design. By conveying data semantics using NL inputs, the user delegates data transformation to AI and thus doesn’t need to worry about data preparation. This approach also improves the task success rate of AI models. Because Data Formulator 2 can infer the visualization script directly from UI input, the AI model only needs to generate data transformation code. With the chart design provided as context, the model has more information to ground the user’s instruction for better code generation.

Managing and leveraging iteration contexts with data threads. Data Formulator 2 presents the user’s non-linear iteration history as data threads and lets them manage the data and charts created throughout the process. With data threads, users can easily navigate to an earlier result, fork a new branch, and reuse its context to create new charts. This way, users only need to tell the model how to update the previous result (e.g., “show only top 5 CO2 emission countries’ trends”, Figure 1) as opposed to re-describing the whole chart from scratch. When the user decides to reuse a result, Data Formulator 2 tailors the conversation history to include only the contexts relevant to that data when deriving the new result, allowing the AI to generate code with clear contextual information, free from irrelevant messages from other threads. Besides general navigation and branching support, data threads also provide a shortcut for users to quickly backtrack and revise prompts to update recently created charts, which is useful for exploring alternative designs or correcting errors made by AI.
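
One way to picture the context tailoring described above is a tree of derived results in which each node remembers its parent; the context for a new request is then just the path from the root to the selected node. The sketch below is only an illustration under that assumption: `ThreadNode` and `thread_context` are hypothetical names, not the tool's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThreadNode:
    """One derived table/chart in the iteration history (hypothetical structure)."""
    name: str
    instruction: Optional[str] = None      # user input that produced this node
    parent: Optional["ThreadNode"] = None  # None for the original dataset


def thread_context(node: Optional[ThreadNode]) -> List[str]:
    """Collect instructions on the path from the root to `node`, oldest first.

    Sibling branches are never visited, so their messages stay out of the context.
    """
    path: List[str] = []
    while node is not None:
        if node.instruction:
            path.append(node.instruction)
        node = node.parent
    return path[::-1]


# A small history: two branches forked from the renewable percentage result.
source = ThreadNode("energy data")
percentage = ThreadNode("renewable percentage", "derive Renewable Energy Percentage", source)
rank = ThreadNode("rank", "derive Rank", percentage)
top5 = ThreadNode("top 5", "show only top 5 CO2 emission countries' trends", percentage)

context_for_top5 = thread_context(top5)
```

Here `context_for_top5` contains the percentage and top-5 instructions but not the one from the sibling rank branch, which is the property the paragraph above relies on.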

Based on these two key designs, we developed Data Formulator 2, an AI-powered visualization tool for iterative visualization authoring. Data Formulator 2 supports the diverse visualizations provided by Vega-Lite marks and encodings, and the AI can transform data flexibly to accommodate different designs, supporting operators like reshaping, filtering, aggregation, window functions, and column derivation. Like other AI tools [9, 55], Data Formulator 2 also provides panels for viewing the generated data, transformation code, and code explanations so that users can inspect and verify AI outputs.

To understand how Data Formulator 2’s multi-modal interaction benefits analysts in solving challenging data visualization tasks, we conducted a user study with eight participants of varying data science expertise. They were asked to reproduce two professional data scientists’ analysis sessions, creating a total of 16 visualizations, 12 of which require non-trivial data transformations (e.g., rank categories by a criterion and combine low-ranked ones into one category with the label “Others”). The study shows that participants can quickly learn to use Data Formulator 2 to solve these complex tasks, and that Data Formulator 2’s flexibility and expressiveness allow participants to develop their own verification, error correction, and iteration strategies to complete the tasks. Our inductive analysis of the study sessions reveals interesting patterns in how users’ experiences and expectations about the AI system affected their work styles.

In summary, the main contributions of this paper are as follows:


• We design a multi-modal UI, composed of the concept encoding shelf and data threads, that blends UI and NL interactions so users can specify their intent for iterative chart authoring.

Fig. 2. A data analysis session where the analyst explores energy from different sources, renewable percentage trends, and ranks of countries by their renewable percentages from a dataset about $\mathrm{CO}_{2}$ and electricity of 20 countries between 2000 and 2020 (table 1). The analyst has to create five versions of the data to support different chart designs in three branches. Data Formulator 2 lets users manage the iteration contexts and create rich visualizations beyond the initial data with blended UI and natural language inputs.

• We implement our design in an interactive visualization tool, Data Formulator 2, which enables users to iteratively create rich visualizations that require multiple rounds of data transformation along the way.
• We conducted a user study to learn how the blended UI and NL interaction benefits analysts in data exploration sessions. We observed that analysts can easily develop their own strategies for working with the AI system on data analysis and visualization tasks, strategies that reflect their personal experience with and expectations of the AI model.

2 Illustrative Scenarios: Exploring Renewable Energy Trends


In this section, we describe scenarios to illustrate users’ experiences of creating a series of visualizations to explore global sustainability from a dataset of 20 countries’ energy from 2000 to 2020. The initial dataset, shown in Figure 2-$\textcircled{1}$, includes each country’s energy produced from three sources (fossil fuel, renewables, and nuclear) each year and its annual $\mathrm{CO}_{2}$ emission value (the $\mathrm{CO}_{2}$ emission data only ranges from 2000 to 2019). We compare a professional data analyst’s experience with computational notebooks and a journalist’s experience using Data Formulator 2 to complete the analysis session shown in Figure 2.

2.1 Exploration with computational notebooks


Heather is an analyst who is proficient with computational notebooks and the R libraries ggplot2 and tidyverse. Because ggplot2 expects every data field mapped to a visual channel (e.g., $x$- and $y$-axes, color, facet) to be a column in the input data, Heather uses tidyverse for data transformation.

Basic charts. To start, Heather wants to visualize the amount of electricity produced from renewables per country over the years with a line chart to see “if our planet is sustainable.” Since the input data (table $\textcircled{1}$) includes all required fields, Heather creates the line chart with ease by mapping the columns Year $\rightarrow$ $x$, Electricity from renewables (TWh) $\rightarrow$ $y$, and Entity $\rightarrow$ color (chart $\textcircled{1}$-A). She then creates another line chart for $\mathrm{CO}_{2}$ emission trends, mapping CO2 emissions (kt) to the $y$-axis (chart $\textcircled{1}$-B). Heather is puzzled that China, the country with a considerable increase in renewable energy use, also has the biggest increase in $\mathrm{CO}_{2}$ emissions. This is counterintuitive because renewables themselves would not cause a $\mathrm{CO}_{2}$ emission increase. Thus, Heather decides to dive deeper.

Renewable energy versus other sources. Heather suspects the $\mathrm{CO}_{2}$ emission increase is caused by a surge in fossil fuel consumption. To compare fossil fuel usage against renewables, she wants a faceted line chart that shows electricity from each energy source side by side (chart $\textcircled{2}$). To create the chart, Heather needs a data table with four columns (Year, Electricity, Entity, and Energy Source) and maps them to $x$, $y$, color, and facet, respectively, so that the chart is divided into subplots based on values from the Energy Source column. Because table $\textcircled{1}$ stores electricity values across three columns in the wide format, Heather unpivots table $\textcircled{1}$ into the long format, folding the specified column names into values of the Energy Source field and the corresponding values into the Electricity field. She then creates the desired chart $\textcircled{2}$ with the transformed data $\textcircled{2}$ and verifies her assumption: despite the increase in renewables usage, the usage of fossil fuel also grows significantly, leading to the $\mathrm{CO}_{2}$ emission increase. This motivates Heather to explore renewable trends by visualizing the percentage of electricity from renewables over all three sources.
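
To make the wide-to-long reshaping concrete, the following is a minimal plain-Python sketch of an unpivot. It stands in for, and is not, Heather's tidyverse code; the toy values and the short column names `fossil`, `nuclear`, and `renewables` are assumptions made for illustration.

```python
# Toy stand-in for the wide table: one electricity column per energy source.
# Values and the short column names are made up for this sketch.
wide_table = [
    {"Year": 2000, "Entity": "A", "fossil": 10.0, "nuclear": 2.0, "renewables": 1.0},
    {"Year": 2001, "Entity": "A", "fossil": 11.0, "nuclear": 2.0, "renewables": 3.0},
]


def unpivot(rows, id_fields, value_fields, name_field, value_field):
    """Fold each column in `value_fields` into its own (name, value) row."""
    long_rows = []
    for row in rows:
        for column in value_fields:
            folded = {f: row[f] for f in id_fields}  # keep identifying fields
            folded[name_field] = column              # column name becomes a value
            folded[value_field] = row[column]        # cell value moves to one field
            long_rows.append(folded)
    return long_rows


long_table = unpivot(
    wide_table,
    id_fields=["Year", "Entity"],
    value_fields=["fossil", "nuclear", "renewables"],
    name_field="Energy Source",
    value_field="Electricity",
)
```

Each input row yields three output rows, one per energy source, so the long table has exactly the Year, Entity, Energy Source, and Electricity columns that the faceted chart expects.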

Renewable energy percentage and ranks. To visualize renewable energy percentage, Heather goes back to table $\textcircled{1}$ to derive a new column, Renewable Percentage, by dividing Electricity from renewables (TWh) by the total electricity produced for each country per year. With the new data $\textcircled{3}$, Heather visualizes the renewable percentage trends in chart $\textcircled{3}$, which shows that the percentages increase more slowly than the absolute values (as shown in chart $\textcircled{1}$).
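
The column derivation described above amounts to one division per row. A minimal sketch follows, assuming the same made-up short column names as placeholders and assuming the percentage is expressed on a 0 to 100 scale (the source does not state the scale).

```python
def add_renewable_percentage(rows):
    """Return copies of `rows` with a derived Renewable Percentage column."""
    derived = []
    for row in rows:
        total = row["fossil"] + row["nuclear"] + row["renewables"]
        # Guard against a zero total so the sketch never divides by zero.
        share = 100.0 * row["renewables"] / total if total else None
        derived.append({**row, "Renewable Percentage": share})
    return derived


# Made-up values for illustration only.
table1 = [
    {"Year": 2000, "Entity": "A", "fossil": 6.0, "nuclear": 2.0, "renewables": 2.0},
    {"Year": 2000, "Entity": "B", "fossil": 0.0, "nuclear": 0.0, "renewables": 0.0},
]
table3 = add_renewable_percentage(table1)
```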

Because many countries share similar renewable percentages, it is quite difficult to compare different countries’ trends. Heather thus decides to create a visualization of countries’ renewable percentage ranks to complement the existing charts. To calculate each country’s rank among the others per year, Heather uses a window function on table $\textcircled{3}$ that partitions the table by Year and applies the rank() function to Renewable Percentage to derive a new column, Rank. With Rank mapped to the $y$-axis, chart $\textcircled{4}$ allows Heather to clearly examine how different countries’ ranks changed over the last two decades; for example, Germany and the UK are the two top-ranked countries, having emerged from the bottom of the pack in 2000.
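
The window function described above can be sketched in plain Python as "group by the partition field, then rank within each group". This is an illustration only: it assumes rank 1 goes to the highest percentage and that ties share the smallest rank, neither of which the text specifies.

```python
from collections import defaultdict


def add_rank(rows, partition_field, order_field, rank_field="Rank"):
    """Rank rows within each partition, highest `order_field` first."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_field]].append(row)

    ranked = []
    for group in partitions.values():
        ordered = sorted((r[order_field] for r in group), reverse=True)
        for r in group:
            # Position of the first equal value gives ties the same (smallest) rank.
            ranked.append({**r, rank_field: ordered.index(r[order_field]) + 1})
    return ranked


# Made-up percentages for illustration only.
table3 = [
    {"Year": 2000, "Entity": "A", "Renewable Percentage": 10.0},
    {"Year": 2000, "Entity": "B", "Renewable Percentage": 30.0},
    {"Year": 2000, "Entity": "C", "Renewable Percentage": 30.0},
    {"Year": 2001, "Entity": "A", "Renewable Percentage": 50.0},
    {"Year": 2001, "Entity": "B", "Renewable Percentage": 20.0},
]
table4 = add_rank(table3, "Year", "Renewable Percentage")
ranks = {(r["Year"], r["Entity"]): r["Rank"] for r in table4}
```

Because ranks are computed per year, entity A can be last in 2000 and first in 2001, which is the kind of change the rank chart is meant to reveal.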

Renewable trends from top CO2 emitters. Finally, Heather wants to focus on renewable percentage trends from the top $\mathrm{CO}_{2}$ emission countries, which have the greatest influence on global sustainability. Although table $\textcircled{3}$ contains all columns to be visualized, Heather needs to filter it based on the countries’ $\mathrm{CO}_{2}$ emissions. To do so, Heather goes back to table $\textcircled{1}$ to aggregate each country’s total $\mathrm{CO}_{2}$ emission, sort the totals, and find the top five. Heather then uses this intermediate result to filter table $\textcircled{3}$ to obtain the renewable percentage of the top five $\mathrm{CO}_{2}$ emitters (shown as table $\textcircled{5}$) and creates chart $\textcircled{5}$.
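
The two-step filter described above (aggregate on one table, then filter another) might look like the following sketch. The column name `CO2`, the toy numbers, and the use of the top two rather than the top five are assumptions made to keep the example small.

```python
from collections import defaultdict


def top_emitters(rows, n):
    """Return the set of the n entities with the largest total CO2 emissions."""
    totals = defaultdict(float)
    for row in rows:
        if row["CO2"] is not None:  # some years may have no CO2 value
            totals[row["Entity"]] += row["CO2"]
    return set(sorted(totals, key=totals.get, reverse=True)[:n])


def keep_entities(rows, entities):
    """Keep only the rows whose Entity is in `entities`."""
    return [r for r in rows if r["Entity"] in entities]


# Made-up values; table1 carries emissions, table3 carries the derived percentage.
table1 = [
    {"Year": 2000, "Entity": "A", "CO2": 5.0},
    {"Year": 2001, "Entity": "A", "CO2": 6.0},
    {"Year": 2000, "Entity": "B", "CO2": 1.0},
    {"Year": 2001, "Entity": "B", "CO2": 1.0},
    {"Year": 2000, "Entity": "C", "CO2": 9.0},
    {"Year": 2001, "Entity": "C", "CO2": None},
]
table3 = [
    {"Year": 2000, "Entity": "A", "Renewable Percentage": 10.0},
    {"Year": 2000, "Entity": "B", "Renewable Percentage": 30.0},
    {"Year": 2000, "Entity": "C", "Renewable Percentage": 20.0},
]

top = top_emitters(table1, n=2)
table5 = keep_entities(table3, top)
```

The aggregation runs on the original table because the derived percentage table no longer needs to carry the emission values.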

Fig. 3. Data Formulator 2 overview. The user creates visualizations by providing fields (dragging and dropping existing fields or typing in new ones) and NL instructions to the Concept Encoding Shelf, delegating data transformation to AI. Data View shows the derived data. The user can navigate the data derivation history using Data Threads. They can then locate the desired point to refine or create new charts by providing follow-up instructions in the Concept Encoding Shelf.

From this chart, it is clear that the top $\mathrm{CO}_{2}$ emitters are indeed heading in the right direction towards sustainability, even though total $\mathrm{CO}_{2}$ emissions are still increasing as total energy produced also increases each year. To publish this visualization, Heather decides to annotate the plot with the median global renewable percentage. On top of table $\textcircled{5}$, Heather appends the median renewable percentage for each year, calculated from table $\textcircled{3}$, and includes a new column, Global Median?, used as a flag to assist plotting so that the global median can be rendered with a different opacity. Chart $\textcircled{6}$ shows the final result: by including Global Median as an Entity and mapping Global Median? $\rightarrow$ opacity, the median renewable percentage is visualized alongside the countries with a different opacity. Heather is satisfied with the results and concludes the session.
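
The annotation step combines an aggregation over all countries with a union onto the filtered table. The sketch below is one plausible plain-Python reading of that description, with made-up values; only the field names Entity, Renewable Percentage, and Global Median? come from the text.

```python
from collections import defaultdict
from statistics import median


def with_global_median(top_rows, all_rows):
    """Append one 'Global Median' row per year and flag it for plotting."""
    by_year = defaultdict(list)
    for r in all_rows:
        by_year[r["Year"]].append(r["Renewable Percentage"])

    # Existing country rows are flagged False, appended median rows True.
    combined = [{**r, "Global Median?": False} for r in top_rows]
    for year in sorted(by_year):
        combined.append({
            "Year": year,
            "Entity": "Global Median",
            "Renewable Percentage": median(by_year[year]),
            "Global Median?": True,
        })
    return combined


# Made-up values: the median is taken over all countries, not just the top emitters.
table3 = [
    {"Year": 2000, "Entity": "A", "Renewable Percentage": 10.0},
    {"Year": 2000, "Entity": "B", "Renewable Percentage": 20.0},
    {"Year": 2000, "Entity": "C", "Renewable Percentage": 60.0},
]
table5 = [table3[0]]
table6 = with_global_median(table5, table3)
```

Treating the median as just another Entity keeps the chart a single line chart, while the boolean flag gives the opacity channel something to map.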

2.2 Exploration with Data Formulator 2


Megan is a journalist who has a solid understanding of data visualization. She uses visualizations effectively in her work, but she doesn’t program. Megan can create and refine rich visualizations iteratively with Data Formulator 2 (Figure 3), which inherits the basic experience of shelf-configuration style tools. She can specify charts by mapping data fields to visual channels of the selected chart and provide additional context using natural language.

Basic charts. Megan starts with line charts to visualize trends of electricity from renewables (Figure 2-$\textcircled{1}$-A). Since all three required fields are available in the input data, Megan simply selects the chart type “line chart” in the encoding shelf and drags and drops fields onto their corresponding visual channels (Figure 4-$\textcircled{1}$). Data Formulator 2 then generates the desired visualization. To visualize the $\mathrm{CO}_{2}$ emission trends, Megan swaps the $y$-axis encoding to CO2 emissions (kt) $\rightarrow$ $y$.

Fig. 4. Experiences with Data Formulator 2: (1) creating the basic renewable energy chart using drag-and-drop to encode fields; (2 and 3) creating charts that require new fields by providing field names and optional natural language instructions to derive new data.

Renewable energy vs. other sources. Megan now needs to create the faceted line chart to compare electricity from all energy sources, which requires the new fields Electricity and Energy Source. With Data Formulator 2, Megan can specify the chart using fields that do not exist yet together with NL instructions in the Concept Encoding Shelf (Figure 3-$\textcircled{2}$) and delegate data transformation to AI.

As Figure 4-$\textcircled{2}$ shows, Megan first drags and drops the existing fields Year and Entity onto the $x$-axis and color, respectively; then, she types the names of the new fields Electricity and Energy Source into the $y$-axis and column channels, respectively, to tell the AI agent that she expects two new fields to be derived for these properties; finally, Megan provides the instruction “compare electricity from all three sources” to further clarify the intent and clicks the formulate button. To create the chart, Data Formulator 2 first generates a Vega-Lite spec skeleton from the encoding (to be completed based on information from the transformed data); it then summarizes the data, encodings, and NL instructions into a prompt that asks an LLM to generate data transformation code producing data with all the necessary fields, which is then used to instantiate the chart skeleton. After reviewing the generated chart and data, Megan is satisfied and moves on to the next task.
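
As a rough illustration of the two steps described above, the sketch below builds a Vega-Lite-style skeleton from the shelf contents and assembles a text prompt for the transformation request. The function names, the prompt wording, and the short input column names are all hypothetical; the source does not show Data Formulator 2's actual prompt or internal format.

```python
import json


def spec_skeleton(mark, encodings):
    """Vega-Lite-style spec with fields only; data and field types come later."""
    return {
        "mark": mark,
        "encoding": {channel: {"field": name} for channel, name in encodings.items()},
    }


def build_prompt(columns, encodings, instruction):
    """Summarize input columns, encodings, and the NL instruction as plain text."""
    to_derive = [name for name in encodings.values() if name not in columns]
    return "\n".join([
        "Input columns: " + ", ".join(columns),
        "Chart encodings: " + json.dumps(encodings),
        "Fields to derive: " + ", ".join(to_derive),
        "Instruction: " + instruction,
    ])


# Megan's shelf from the scenario; the input column names are placeholders.
columns = ["Year", "Entity", "fossil", "nuclear", "renewables"]
encodings = {"x": "Year", "color": "Entity", "y": "Electricity", "column": "Energy Source"}

skeleton = spec_skeleton("line", encodings)
prompt = build_prompt(columns, encodings, "compare electricity from all three sources")
```

Since the skeleton is derived from the shelf alone, the model in this sketch is only asked for a table containing the listed fields, never for chart code.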

Data Formulator 2 also updates data threads (Figure 3-$\textcircled{5}$) with the newly derived data and chart so that Megan can manage and leverage data provenance. For example, Megan can delete the chart, create or fork new charts from either the original or the new data, or select an existing chart to iterate from.

Renewable energy percentage and ranks. Megan proceeds to visualize renewable energy percentage. Although it requires a different data transformation, Megan enjoys the same experience as in the previous task: she drags and drops Year and Entity onto the $x$-axis and color (Figure 4-$\textcircled{3}$) and enters the name of the new field “Renewable Energy Percentage” in the $y$-axis; then, since Megan believes the field names are self-explanatory, she proceeds to formulate the new data without an additional NL instruction. Data Formulator 2 generates the desired visualization (Figure 5-$\textcircled{1}$).

To visualize the ranks of countries based on their renewable percentage, Megan decides to continue from the previous chart, which already computes renewable percentage. To do so, Megan duplicates the renewable percentage chart, updates the $y$-axis field to another new field, Rank, and clicks “derive.” As shown in Figure 4-$\textcircled{3}$, Megan’s interaction positions the Concept Encoding Shelf in the context of the prior result as opposed to the original data, which conveys her intent to reuse that data for the new chart. With the context information, the AI model successfully derives the desired chart (Figure 2-$\textcircled{4}$) even with only Megan’s simple inputs.

Renewable trends from top CO2 emitters. Next, Megan decides to visualize the top 5 $\mathrm{CO}_{2}$ emitters’ renewable percentage trends. Megan again decides to iterate on the previous chart; otherwise, she would need more effort to write a longer prompt that specifies the whole task at once. Megan first uses data threads (Figure 3-$\textcircled{5}$) to locate the renewable percentage chart. On top of that, Megan provides a new instruction below the local data thread, “show only top 5 CO2 emission countries’ trends,” and clicks the “derive” button (Figure 5-$\textcircled{1}$). Data Formulator 2 updates the previous code to include a filter clause to produce the new data and visualization (Figure 5-$\textcircled{2}$).

Fig. 5. Iteration with Data Formulator 2: (1) provide a new instruction on top of the renewable energy percentage chart to show only the top $\mathrm{CO}_{2}$ countries, (2) update the chart with a new field Global Median? and instruct Data Formulator 2 to add the global median alongside the top 5 $\mathrm{CO}_{2}$ countries’ trends, and (3) move Global Median? from column to opacity to update the chart design without deriving new data.

Megan continues the iteration process to include global median trends alongside the top 5 $\mathrm{CO}_{2}$ countries’ trends. Since this new chart requires different encodings and she wants to keep both visualizations around, Megan forks a branch by copying the previous chart. Then, she updates the Concept Encoding Shelf by (1) adding a new encoding Global Median? $\rightarrow$ column and (2) providing the edit instruction “include global median as an entity” (Figure 5-$\textcircled{2}$). Once she clicks the derive button, Data Formulator 2 generates the new chart (Figure 5-$\textcircled{3}$). Upon inspection, Megan prefers to combine the two views into one, with the global median rendered in a different opacity. Since these two charts require the same data fields, she simply selects a new chart type, “custom line” (which exposes more chart properties than the basic line chart), and moves Global Median? to the opacity channel. Since this requires no data transformation, Data Formulator 2 doesn’t need to invoke AI and directly renders the chart. With all desired charts created, Megan concludes the analysis session. Figure 3-$\textcircled{5}$ shows all three threads created by Megan that lead to the final designs.

2.3 Comparison of Exploration Experiences

2.3 探索体验比较

The experience of Heather and Megan exploring global sustainability (using data $\textcircled{1}$) demonstrates an inherently iterative process. Both of them started with a high-level goal without concrete designs in mind and gradually formed their designs through explorations in various branches. This iterative exploration process required a series of data transformations and the management of provenance, and is thus challenging for people not proficient in data transformation and programming. Here, we compare their exploration experiences to highlight how Data Formulator 2 bridges Megan’s skill gap, enabling her to achieve the analysis that Heather, an experienced data analyst, performed.

Heather 和 Megan 探索全球可持续性（使用数据 $\textcircled{1}$ ）的经历展示了一个本质上迭代的过程。她们两人最初都设定了一个高层目标，但没有具体的计划，随后通过在各分支中的探索逐渐形成了设计。这种迭代的探索过程需要一系列的数据转换和溯源管理，因此对于不精通数据转换和编程的人来说具有挑战性。在这里，我们比较她们的探索经历，以突出 Data Formulator 2 如何弥补 Megan 的技能差距，使她能够完成经验丰富的数据分析师 Heather 所执行的分析。

2.3.1 Data transformation and chart creation. When new designs are considered, Heather needs to prepare new data to accommodate each design, even when some designs are seemingly close (e.g., charts $\textcircled{3}$ and $\textcircled{5}$). This requires her to understand the data shape expected by the charts, choose the right transformation idiom (e.g., unpivot for table $\textcircled{2}$, join and union for table $\textcircled{6}$), and implement it with proper operators. Once the data is prepared, Heather can easily specify a chart by mapping data columns to visual channels of the selected chart type. Her proficiency in data transformation is essential for creating rich visualizations beyond the initial dataset.

2.3.1 数据转换与图表创建
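The three idioms named in this subsection (unpivot, join, union) can be illustrated with pandas. This is a minimal sketch on made-up data: the column names and values are hypothetical and are not taken from Heather's notebook or from the paper's dataset.

```python
import pandas as pd

# Hypothetical wide table: one row per country and year, one column per energy source.
wide = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2000, 2020, 2000, 2020],
    "Solar": [1.0, 5.0, 2.0, 6.0],
    "Wind": [3.0, 7.0, 4.0, 8.0],
})

# Unpivot: fold the per-source columns into (Source, Value) rows so that
# "Source" can be mapped to a single visual channel such as color.
long = wide.melt(id_vars=["Entity", "Year"], var_name="Source", value_name="Value")

# Join: attach a column from a second table by matching on shared keys.
co2 = pd.DataFrame({"Entity": ["A", "B"], "Year": [2020, 2020], "CO2": [10.0, 20.0]})
joined = wide.merge(co2, on=["Entity", "Year"], how="inner")

# Union: append rows with the same columns, here a computed per-year median
# that is treated as one more entity.
median = wide.groupby("Year", as_index=False)[["Solar", "Wind"]].median()
median["Entity"] = "Global Median"
unioned = pd.concat([wide, median], ignore_index=True)
```

Each idiom changes the table shape in a different way (more rows, more columns, or appended rows), which is why a chart that looks similar on screen can still require a different preparation step.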

To bridge Megan’s skill gap in data transformation, Data Formulator 2 lets Megan specify her intent in a unified interaction that combines chart encodings and natural language, regardless of the types of data transformation required behind the scenes; the data transformation itself is delegated to AI. Because the Concept Encoding Shelf resembles the shelf-configuration UI, Megan’s experience from Power BI translates well to Data Formulator 2. Furthermore, since Megan communicates the chart design using concept encodings, she only needs to provide a short supplementary NL instruction to clarify her intent; based on these inputs, Data Formulator 2 assembles a detailed prompt to communicate with the AI model. If Megan were to use a text-only interface to interact with AI, she would need a more detailed prompt to explain her intent without ambiguity, including a description of the chart encodings that she can otherwise create easily through drag-and-drop interactions.

为了弥补 Megan 在数据转换方面的技能差距，Data Formulator 2 允许 Megan 在一个统一的交互中指定她的意图，该交互结合了图表编码和自然语言，无论背后需要哪种类型的数据转换，数据转换都会被委托给 AI。由于概念编码架类似于配置架 UI，Megan 从 Power BI 中获得的使用经验可以很好地转移到 Data Formulator 2 中。此外，由于 Megan 使用概念编码来传达图表设计，她只需要提供简短的补充自然语言指令来澄清她的意图；基于这些输入，Data Formulator 2 会组装一个详细的提示来与 AI 模型进行通信。如果 Megan 使用纯文本界面与 AI 交互，她需要更详细的提示来解释她的意图以避免歧义，包括解释她通过拖放交互轻松创建的图表编码。

2.3.2 Managing branching contexts. During the exploration, Heather backtracks several times to reuse previous results toward new designs (e.g., chart $\textcircled{4}$ $\rightarrow$ data $\textcircled{3}$ $\rightarrow$ chart $\textcircled{5}$), creating three branches along the way. Because Heather programs in a notebook, she can either copy and adapt previous code snippets or reuse variables computed in previous iterations for new designs. This way, Heather lowers her specification effort even though the new designs are more complex. Heather’s programming expertise is essential for managing the branching contexts in a linearly structured programming environment.

2.3.2 管理分支上下文

For Megan, managing branching contexts with different versions of data could be challenging without Data Formulator 2. Were Megan to use a chat-based AI interface, she would need to prepare a verbose prompt each turn to explain the context and data transformation goals in detail and avoid extra disambiguation effort, especially when multiple branches are mixed in the chat history and the task becomes more complicated later on. Data Formulator 2’s data threads address this challenge. Data threads not only provide a visualization for Megan to review the history, but also let her visit previous states and reuse them toward new branches, as Heather did. This way, Megan only needs to specify the updates to be applied, as opposed to describing the full design from scratch in one shot, and the AI model can generate results more reliably by leveraging the context Megan provides. Should Megan spot undesired results, she can also use data threads (Figure 3) to rerun or backtrack one step to revise her instructions, as opposed to restarting from scratch.

对于 Megan 来说，如果没有 Data Formulator 2，管理不同版本数据的分支上下文可能会很具挑战性。如果 Megan 使用基于聊天的 AI 界面，她每次都需要准备一个冗长的提示，详细解释上下文和数据转换目标，以避免额外的歧义消除工作，尤其是当聊天历史中混合了多个分支且任务后来变得更为复杂时。Data Formulator 2 的数据线程解决了这一挑战。数据线程不仅为 Megan 提供了历史回顾的可视化，还让她能够访问先前的状态并将其重用于新的分支，正如 Heather 所做的那样。这样一来，Megan 只需指定要应用的更新，而不是一次性从头描述完整的设计，AI 模型可以更可靠地利用 Megan 提供的上下文生成结果。如果 Megan 发现不理想的结果，她还可以使用数据线程（图 3- $\langle{\mathfrak{H}}\rangle$ ）重新运行或回溯一步以修改指令，而不是从头开始。

3 The Data Formulator 2 System Design

3 The Data Formulator 2 系统设计

As described earlier, Data Formulator 2 combines UI and NL interactions in a multi-modal UI to reduce analysts’ visualization authoring effort, and it provides data threads for users to navigate the iteration history and specify new designs on top of previous ones. Data Formulator 2 employs the following system designs to support such interactions:

如前所述，Data Formulator 2 在多模态 UI 中结合了 UI 和 NL 交互，以减少分析师的可视化创作工作量，并提供了数据线程，供用户导航迭代历史并在先前设计的基础上指定新设计。Data Formulator 2 采用了以下系统设计来支持此类交互：

• First, to allow users to specify chart design and data transformation goals from different paradigms (shelf-configuration UI versus NL inputs), Data Formulator 2 decouples chart specification and data transformation and solves them with different techniques (template instantiation versus AI code generation).
• Second, to support reuse, Data Formulator 2 organizes the iteration history as data threads, treating data as first-class objects. Data Formulator 2 enables users either to locate a chart from a different branch and follow up, or to quickly revise and rerun the most recent instruction leading to the current chart.

We next detail how Data Formulator 2 realizes these designs, along with additional features designed to help users understand AI-generated results.

• 首先，为了允许用户从不同范式（货架配置 UI 与自然语言输入）中指定图表设计和数据转换目标，Data Formulator 2 将图表规范和数据转换解耦，并使用不同技术（模板实例化与 AI 代码生成）分别解决。• 其次，为了支持重用，Data Formulator 2 将迭代历史组织为数据线程，将数据视为一等对象。Data Formulator 2 使用户能够从不同分支中定位图表并继续跟进，或快速修改并重新运行导致当前图表的最新指令。接下来，我们将详细说明 Data Formulator 2 如何实现这些设计，以及为帮助用户理解 AI 生成结果而设计的附加功能。

3.1 Multi-modal UI: Decoupling chart specification and data transformation

3.1 多模态用户界面：解耦图表规范与数据转换

Data Formulator 2 decouples chart specification and data transformation so that users can benefit from both the precision of UI interaction to configure chart designs and the expressiveness of NL descriptions to specify data transformation goals. As shown in Figure 6, given a user specification in the concept encoding shelf, Data Formulator 2 generates the desired chart in three steps: (1) generating a Vega-Lite script from the selected chart type, (2) compiling a prompt and delegating data transformation to AI, and (3) using the generated data to instantiate the Vega-Lite script and render the desired chart.

Data Formulator 2 将图表规范与数据转换解耦，使用户既能通过 UI 交互的精确性来配置图表设计，又能通过自然语言描述的表达能力来指定数据转换目标。如图 6 所示，在概念编码栏中给定用户规范后，Data Formulator 2 通过三个步骤生成所需图表：(1) 从选定的图表类型生成 Vega-Lite 脚本，(2) 编译提示并委托 AI 进行数据转换，(3) 使用生成的数据实例化 Vega-Lite 脚本以渲染所需图表。

Fig. 6. Data Formulator 2’s workflow. (1) Given the user specification in the concept encoding shelf, Data Formulator 2 first generates a Vega-Lite spec skeleton from the selected chart type. (2) When the chart requires new fields (e.g., Rank), Data Formulator 2 compiles a prompt from the concept encoding shelf and asks its AI model to generate data transformation code to produce the desired data. (3) Upon completion, the Vega-Lite skeleton is instantiated with the new data to produce the desired chart.

图 6: Data Formulator 2 的工作流程。(1) 根据概念编码架中的用户规范，Data Formulator 2 首先从选定的图表类型生成一个 Vega-Lite 规范框架。(2) 当图表需要新字段 (例如 Rank) 时，Data Formulator 2 从概念编码架中编译提示，并要求其 AI 模型生成数据转换代码以生成所需数据。(3) 完成后，Vega-Lite 框架将使用新数据进行实例化，以生成所需的图表。
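The three-step workflow of Figure 6 can be sketched as a small function. This is an illustration under stated assumptions, not the system's implementation: `create_chart`, `fake_derive`, and the sample rows are hypothetical, and the AI step is replaced by a plain Python callable so that the example runs on its own.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, object]]

def create_chart(mark: str, shelf: Dict[str, str], table: Table,
                 derive: Callable[[Table, List[str]], Table]) -> dict:
    """Build a Vega-Lite-style spec from a shelf of channel -> field entries."""
    # Step 1: spec skeleton from the chart type and the shelf.
    spec = {"mark": mark,
            "encoding": {channel: {"field": field} for channel, field in shelf.items()}}
    # Step 2: delegate to `derive` only when the shelf names fields the table lacks.
    known = set(table[0]) if table else set()
    missing = [field for field in shelf.values() if field not in known]
    data = derive(table, missing) if missing else table
    # Step 3: attach the (possibly derived) data to the skeleton.
    spec["data"] = {"values": data}
    return spec

calls = []

def fake_derive(table: Table, missing: List[str]) -> Table:
    """Stand-in for AI-generated code: ranks rows by Value within each Year."""
    calls.append(missing)
    out = []
    for row in table:
        same_year = [r["Value"] for r in table if r["Year"] == row["Year"]]
        rank = 1 + sum(v > row["Value"] for v in same_year)
        out.append({**row, "Rank": rank})
    return out

rows = [{"Year": 2000, "Entity": "A", "Value": 3.0},
        {"Year": 2000, "Entity": "B", "Value": 5.0}]

# "Rank" is not in the table, so the derive step is invoked.
ranked = create_chart("line", {"x": "Year", "y": "Rank", "color": "Entity"}, rows, fake_derive)
# Every field already exists, so the chart is built without calling derive.
plain = create_chart("line", {"x": "Year", "y": "Value", "opacity": "Entity"}, rows, fake_derive)
```

The second call mirrors the case in the scenario where moving a field to another channel needs no data transformation, so no AI call is made.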

Chart specification generation. Data Formulator 2 adopts a chart type-based approach to represent visualizations, supporting five categories of charts: scatter (scatter plot, ranged dot plot), line (line chart, dotted line chart), bar (bar chart, stacked bar chart, grouped bar chart), statistics (histogram, heatmap, linear regression, boxplot), and custom (custom scatter, line, bar, area, and rectangle, where all available visual channels are exposed for advanced users). Each chart type is represented as a Vega-Lite template with a set of predefined visual channels, including position channels ($x$, $y$), legends (color, size, shape, opacity), and facet channels (column, row), shown to the user in the concept encoding shelf. For example, a line chart is represented as the Vega-Lite template {"mark": "line", "encoding": {"x": null, "y": null, "color": null, "column": null, "row": null}}, and when the user selects line chart, the channels $x$, $y$, color, column, and row are displayed in the concept encoding shelf. Using this chart type-based design, Data Formulator 2 supports predefined layered charts (e.g., the ranged dot plot and linear regression plot, which are composed from line and scatter; Figure 7-right). Additional chart types (e.g., bullet chart) can be supported by adding new Vega-Lite templates with their respective channels to the library.

图表规范生成。Data Formulator 2采用基于图表类型的方法来表示可视化，支持五种图表类别：散点图（散点图、范围点图）、折线图（折线图、点线图）、柱状图（柱状图、堆叠柱状图、分组柱状图）、统计图（直方图、热力图、线性回归、箱线图）和自定义（自定义散点图、折线图、柱状区域、矩形，所有可用的视觉通道都暴露给高级用户）。每种图表类型都表示为一个Vega-Lite模板，带有一组预定义的视觉通道，包括位置通道 $(x,y)$ 、图例（颜色、大小、形状、不透明度）和分面通道（列、行），这些通道在概念编码架上展示给用户。例如，折线图表示为一个Vega-Lite模板{"mark": "line", "encoding": ${\mathbf{\nabla}^{\prime\prime}\mathbf{x}^{\prime\prime}\colon$ : null, "y": null, "color": null, "column": null, "row": null}}，当用户选择折线图时，通道$ x,y.$、颜色、列和行会显示在概念编码架上。通过基于图表类型的设计，Data Formulator 2支持预定义的分层图表（例如，由折线图和散点图组成的范围点图和线性回归图，图7-右）。通过向库中添加带有相应通道的新Vega-Lite模板，可以支持其他图表类型（例如，子弹图）。
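A template library of this kind can be sketched as a dictionary keyed by chart type. The line chart entry follows the template quoted in the text; the helper names and the channel list given for "custom line" are assumptions made for illustration.

```python
# Template library keyed by chart type; None marks a channel the user has not filled.
TEMPLATES = {
    "line": {
        "mark": "line",
        "encoding": {"x": None, "y": None, "color": None, "column": None, "row": None},
    },
}

def shelf_channels(chart_type: str) -> list:
    """Channels to display in the concept encoding shelf for a chart type."""
    return list(TEMPLATES[chart_type]["encoding"])

def register_template(chart_type: str, mark: str, channels: list) -> None:
    """Support an additional chart type by adding a template to the library."""
    TEMPLATES[chart_type] = {"mark": mark, "encoding": {c: None for c in channels}}

# Hypothetical channel list for a custom chart that exposes more properties.
register_template("custom line", "line",
                  ["x", "y", "color", "opacity", "size", "column", "row"])
```

Because the shelf is derived from the template, adding a chart type changes both what is rendered and which channels the user sees, with no other code involved.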

As the user inputs fields into the concept encoding shelf, either by dragging and dropping them from existing data fields or by typing in new fields they wish to visualize, Data Formulator 2 instantiates the Vega-Lite template with the provided fields. For example, as shown in Figure 6-$\textcircled{1}$, when the user drags Year $\rightarrow x$ and Entity $\rightarrow$ color, and types Rank in $y$, the line chart template mentioned above is instantiated with the provided fields: if a field is available in the current data table, both the field name and the encoding type are instantiated (e.g., Year with type “temporal”); otherwise, the encoding type is left as “” and instantiated later, once the data transformation completes.

当用户通过从现有数据字段中拖放或输入他们希望可视化的新字段将字段输入到概念编码栏时，Data Formulator 2 会使用提供的字段实例化 Vega-Lite 模板。例如，如图 6 所示，当用户拖动 Year $\rightarrow x$ 和 Entity $\rightarrow y$ 并在 $y_{\cdot}$ 中输入 Rank 时，上述折线图模板会使用提供的字段进行实例化：如果字段在当前数据表中可用，则字段名称和编码类型都会被实例化（例如，Year 的类型为“temporal”），否则编码类型将保留为“”，以便在数据转换完成后进行实例化。
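Template instantiation with a deferred encoding type can be sketched as follows. The function names are hypothetical, the field types are passed in as known metadata rather than inferred, and placing Entity on the color channel is an assumption of this example.

```python
import copy

LINE_TEMPLATE = {
    "mark": "line",
    "encoding": {"x": None, "y": None, "color": None, "column": None, "row": None},
}

def instantiate(template: dict, shelf: dict, known_types: dict) -> dict:
    """Fill a template with shelf fields; unknown fields get an empty type."""
    spec = copy.deepcopy(template)  # never mutate the library template
    for channel, field in shelf.items():
        if channel not in spec["encoding"]:
            raise KeyError(f"chart type has no channel {channel!r}")
        spec["encoding"][channel] = {"field": field, "type": known_types.get(field, "")}
    # Channels the user left empty are dropped from the spec.
    spec["encoding"] = {c: e for c, e in spec["encoding"].items() if e is not None}
    return spec

def pending_fields(spec: dict) -> list:
    """Fields whose type must wait for the data transformation."""
    return [e["field"] for e in spec["encoding"].values() if e["type"] == ""]

def resolve_types(spec: dict, derived_types: dict) -> dict:
    """Fill in types for new fields once the derived data is available."""
    for enc in spec["encoding"].values():
        if enc["type"] == "":
            enc["type"] = derived_types[enc["field"]]
    return spec

spec = instantiate(LINE_TEMPLATE,
                   {"x": "Year", "y": "Rank", "color": "Entity"},
                   {"Year": "temporal", "Entity": "nominal"})
pending = pending_fields(spec)              # Rank is not in the table yet
resolve_types(spec, {"Rank": "quantitative"})
```

The list returned by `pending_fields` is exactly the set of fields that the data transformation step has to produce.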

The shelf-configuration design provides users with simple yet precise interaction. The concept encoding shelf saves users the effort of writing prompts to explain the chart design. Figure 7 further illustrates how the specification in the concept encoding shelf interacts with the underlying Vega-Lite scripts. In Figure 7-left, the user specifies an “avg” operator on the $y$-axis to transform the axis, and the operator is instantiated as the “aggregate” property of the $y$-axis in the script. In addition, Figure 7-right shows another example, in which the user works with a layered chart (ranged dot plot): as the user fills in fields in the UI, Data Formulator 2 routes the corresponding fields to different parameters of the predefined chart template.

层架配置设计为用户提供了简单而精确的交互方式。概念编码层架减少了用户编写提示来解释图表设计的麻烦。图 7 进一步展示了概念编码层架中的规范如何与底层的 Vega-Lite 脚本进行交互。在图 7 左侧，用户可以在 $y_{\mathrm{,}} $轴上指定一个 "avg" 操作符来转换轴，该操作符在脚本中被实例化为$ y_{\mathrm{,}}$ 轴的 "aggregate" 属性。此外，图 7 右侧展示了另一个用户处理分层图表（范围点图）的示例：当用户在 UI 中填写字段时，Data Formulator 2 会将相应的字段填充到预定义图表模板中的不同参数中。
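Both cases in Figure 7 can be sketched in a few lines. The mapping from the shelf label "avg" to Vega-Lite's `mean` aggregate, the routing table, and the choice of Year for the color channel are assumptions of this example; the source only states that the operator becomes the "aggregate" property and that the $x$ encoding is routed to both the $x$ and detail channels.

```python
# Assumed mapping from shelf operator labels to Vega-Lite aggregate names.
AGGREGATES = {"avg": "mean", "sum": "sum", "min": "min", "max": "max"}

def encode(field: str, type_: str, op: str = None) -> dict:
    """Build one channel encoding, optionally with an aggregate operator."""
    enc = {"field": field, "type": type_}
    if op is not None:
        enc["aggregate"] = AGGREGATES[op]
    return enc

# Figure 7-left: bar chart of average rank per country.
bar = {"mark": "bar",
       "encoding": {"x": encode("Entity", "nominal"),
                    "y": encode("Rank", "quantitative", "avg")}}

# Figure 7-right: hypothetical routing for a two-layer ranged dot plot.
# Each shelf channel maps to (layer index, Vega-Lite channel) targets.
RANGED_DOT_ROUTES = {
    "x": [(0, "x"), (0, "detail"), (1, "x")],
    "y": [(0, "y"), (1, "y")],
    "color": [(1, "color")],
}

def fill_layers(marks: list, routes: dict, shelf: dict) -> dict:
    """Copy each shelf encoding into every layer channel it is routed to."""
    layers = [{"mark": mark, "encoding": {}} for mark in marks]
    for shelf_channel, enc in shelf.items():
        for layer_index, channel in routes[shelf_channel]:
            layers[layer_index]["encoding"][channel] = dict(enc)
    return {"layer": layers}

ranged_dot = fill_layers(
    ["line", "point"], RANGED_DOT_ROUTES,
    {"x": encode("Entity", "nominal"),
     "y": encode("Rank", "quantitative"),
     "color": encode("Year", "nominal")})
```

The user fills three shelf slots, while the template decides that the line layer groups by Entity through the detail channel and only the point layer is colored.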

Fig. 7. The concept encoding shelf instantiates users’ encodings as a Vega-Lite specification. The user creates a bar chart showing the average rank of countries with an “avg” operator on the $y$-axis, and a ranged dot plot to compare the ranks of each country in 2000 and 2020 (the chart template routes the user’s $x$-axis encoding “Entity” to both the $x$ and detail channels).

图 7: 概念编码层将用户的编码实例化为 Vega-Lite 规范。用户创建了一个条形图，显示国家在 $y$ 轴上使用“avg”操作符的平均排名，以及一个范围点图来比较每个国家在 2000 年和 2020 年的排名（图表模板将用户的 $x\cdot$ 轴编码“Entity”路由到 $x$ 和详细通道）。

Data transformation with AI. From the concept encoding shelf, Data Formulator 2 assembles a prompt and queries the LLM to generate Python code that transforms the data. The data transformation prompt contains three segments: the system prompt, the data transformation context, and the goal (illustrated in Figure 6-$\textcircled{2}$; the full prompt is shown in the Appendix):

使用AI进行数据转换。从概念编码库中，Data Formulator 2 组装一个提示并查询大语言模型以生成用于数据转换的Python语言代码。数据转换提示包含三个部分：系统提示、数据转换上下文和目标（如图 $6-\textcircled{2}$ 所示，完整提示见附录）：

• Finally, Data Formulator 2 assembles a goal prompt section, combining the NL instruction provided in the text box and the field names used in the encodings. When the user skips the NL instruction (as shown in Figure 4), the instruction part is simply left blank. This goal will be refined by the LLM, as instructed by the system prompt, before it attempts to generate the data transformation code.

• 最后，Data Formulator 2会组装一个目标提示部分，结合文本框中提供的自然语言 (NL) 指令和编码中使用的字段名称。当用户跳过 NL 指令时（如图 4- $\mathbf{\mathcal{O}}$ 所示），指令部分将留空。在尝试生成数据转换代码之前，LLM 会根据系统提示将此目标作为指令进行优化。
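Assembling the goal section can be sketched as below. The `[GOAL]` marker, the JSON layout, and the key names are hypothetical; the actual prompt format is given in the paper's Appendix and is not reproduced here.

```python
import json

def build_goal(instruction: str, shelf: dict) -> str:
    """Combine the NL instruction and the encoded field names into one prompt section."""
    goal = {
        # Left blank when the user skips the instruction.
        "instruction": instruction.strip(),
        # Field names from the shelf, de-duplicated in shelf order.
        "visualization_fields": list(dict.fromkeys(shelf.values())),
    }
    return "[GOAL]\n" + json.dumps(goal, indent=2)

shelf = {"x": "Year", "y": "Rank", "color": "Entity"}
with_text = build_goal("rank countries by renewable percentage", shelf)
without_text = build_goal("", shelf)
```

Even with an empty instruction, the field names alone tell the model which columns the output table must contain.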

With the full input, Data Formulator 2 prompts the LLM to generate a response consisting of the refined objective and the code. The LLM’s refined objective for the task in Figure 6 is shown below, and the generated code is shown in Figure 6-$\textcircled{2}$.

在完整输入的情况下，Data Formulator 2 会提示大语言模型生成响应，包括优化后的目标和代码。下图展示了大语言模型对图 6 中任务的优化目标，生成的代码如图 6-2 所示。

Data Formulator 2 then runs the code on the input data. If the code executes without errors, the output data is used to instantiate the Vega-Lite script generated in the previous step: Data Formulator 2 first infers the semantic types of newly generated columns (to determine their encoding types) and then assembles the data with the script to render the visualization (Figure 6-$\textcircled{3}$). Occasionally, the generated code causes runtime errors, due to attempts to use libraries that are not imported, references to invalid column names, or incorrect handling of undefined or NaN values. When errors occur, before asking users to retry, Data Formulator 2 tries to correct them by querying the LLM with the error message and a follow-up instruction to repair its mistakes [8, 33]. When the repair completes, the visualization is generated in the same way. Either way, Data Formulator 2 updates the data threads and presents the results to the user.

Data Formulator 2 随后在输入数据上运行代码。如果代码执行无误，输出数据将用于实例化上一步生成的 Vega-Lite 脚本，首先推断新生成列的语义类型（以确定其编码类型），然后将数据与脚本组装以渲染可视化（图 6- $\circledcirc$ ）。偶尔，生成的代码可能会引发运行时错误，原因可能是尝试使用未导入的库、引用无效的列名或错误处理未定义或 NaN 值。当发生错误时，在要求用户重试之前，Data Formulator 2 会尝试通过向大语言模型发送错误消息和后续指令来修复错误 [8, 33]。修复完成后，同样会生成可视化。无论哪种情况，Data Formulator 2 都会更新数据线程并向用户展示结果。
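The run-then-repair behavior can be sketched as a small loop. Everything here is illustrative: `run_with_repair` and `fake_llm` are hypothetical names, the LLM is replaced by a stub that returns corrected code, and the generated code is assumed to define a function named `transform`.

```python
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, object]]

def run_with_repair(code: str, table: Table,
                    ask_llm: Callable[[str, str], str],
                    max_repairs: int = 1) -> Tuple[Table, str]:
    """Run generated code; on a runtime error, ask for a repaired version and retry."""
    for attempt in range(max_repairs + 1):
        try:
            scope: dict = {}
            # exec runs arbitrary code; a real system would need sandboxing.
            exec(code, scope)
            return scope["transform"](table), code
        except Exception as err:
            if attempt == max_repairs:
                raise  # repairs exhausted: surface the error so the user can retry
            code = ask_llm(code, f"{type(err).__name__}: {err}")
    raise RuntimeError("not reached")

BROKEN = '''
def transform(table):
    return [row for row in table if row["co2"] > 5]
'''
FIXED = BROKEN.replace('row["co2"]', 'row["CO2"]')

errors = []

def fake_llm(code: str, error: str) -> str:
    """Stub for the repair query: records the error and returns fixed code."""
    errors.append(error)
    return FIXED

table = [{"Entity": "A", "CO2": 10}, {"Entity": "B", "CO2": 1}]
result, final_code = run_with_repair(BROKEN, table, fake_llm)
```

Here the first run fails on an invalid column name, one of the error classes listed above, and the second run succeeds with the repaired code.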

3.2 Data threads: navigating the iteration history

3.2 数据线程：导航迭代历史记录

During the iterative visualization process, the analyst needs to navigate their authoring history to locate relevant artifacts (data or charts) and act on them (delete, duplicate, or follow up). Data Formulator 2 introduces data threads to represent the tree-structured iteration history and support these navigation tasks. In data threads, we treat data as first-class objects (nodes in data threads) that are connected according to the user’s instructions provided to the AI model (edges), and visualizations are attached to the version of data they are created from. Centering the iteration history around data benefits user navigation because it directly reflects the sequence of user actions that created these new data. This design also benefits the AI model: when the user issues a follow-up instruction, Data Formulator 2 automatically retrieves its conversation history with the AI leading to the current data and then instructs the AI model to rewrite the code toward the new goal based on the retrieved history. This way, the AI model does not risk using conversation history from other branches and producing an incorrect data transformation. As shown in Figure 8, the code and the conversation history are attached to each data node. Each time the user provides a follow-up instruction, the AI model generates new code by updating the previous code (through deletion, addition, or both) to achieve the user’s goal; this way, the code always takes the original data as input, with all information accessible.

Compared to an alternative design in which we pass only the current data to the AI model and ask it to write new code to further transform it (i.e., reusing the data as opposed to reusing the computation leading to the data), our design has more flexibility to accommodate different styles of follow-up instructions, whether the user wants to further update the data (e.g., “now, calculate average rank for each country”), revise the previous computation (e.g., “also consider nuclear as renewable energy”), or create an alternative (e.g., “rank by CO2 instead”), since the AI has access to the full dialog history and the full dataset. In contrast, the data-only reuse approach restricts the AI model’s access to only the current data, limiting its ability to support “backtracking” or “alternative design” style instructions.

在迭代的可视化过程中，分析师需要浏览其创作历史，以定位相关的工件（数据或图表）并采取行动（删除、复制或跟进）。Data Formulator 2 引入了数据线程来表示树结构的迭代历史，以支持导航任务。在数据线程中，我们将数据视为一等对象（数据线程中的节点），这些节点根据用户提供给 AI 模型的指令（边）连接，并且可视化内容附加到它们所创建的数据版本上。将迭代历史围绕数据为中心有助于用户导航，因为它直接反映了用户创建这些新数据的操作序列。这种设计也有利于 AI 模型：当用户发出后续指令时，Data Formulator 2 会自动检索其与 AI 的对话历史，然后指示 AI 模型根据检索到的历史重写代码以实现新目标。这样，AI 模型不会误用其他分支的对话历史而导致错误的数据转换。如图 8 所示，代码和对话历史附加到每个数据节点上。每当用户提供后续指令时，AI 模型通过更新先前的代码（可能是删除、添加或两者兼有）来生成新代码，以实现用户的目标；这样，代码始终以原始数据作为输入，并且可以访问所有信息。与另一种设计相比，我们只将当前数据传递给 AI 模型并要求它编写新代码以进一步转换数据（即重用数据而不是重用导致数据的计算），我们的设计更具灵活性，能够适应不同风格的后续指令——无论是用户希望进一步更新数据（例如，“现在，计算每个国家的平均排名”）、修改先前的计算（例如，“也将核能视为可再生能源”）还是创建替代方案（例如，“按 CO2 排名”）——因为 AI 可以访问完整的对话历史和完整的数据集。相比之下，仅重用数据的方法限制了 AI 模型仅能访问当前数据，限制了其支持“回溯”或“替代设计”风格指令的能力。
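The tree structure and the branch-scoped history retrieval can be sketched with a small data class. The names `DataNode`, `branch_history`, and `derive` are hypothetical, and `rewrite` stands in for the LLM call that updates the previous code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class DataNode:
    name: str
    instruction: str = ""              # edge label: the instruction that produced this data
    code: str = ""                     # code that derives this data from the ORIGINAL table
    parent: Optional["DataNode"] = None
    charts: List[str] = field(default_factory=list)  # charts attached to this data version

def branch_history(node: Optional[DataNode]) -> List[str]:
    """Instructions on the path from the original data to `node`, oldest first."""
    steps = []
    while node is not None and node.parent is not None:
        steps.append(node.instruction)
        node = node.parent
    return steps[::-1]

def derive(parent: DataNode, name: str, instruction: str,
           rewrite: Callable[[str, List[str], str], str]) -> DataNode:
    """Create a child node whose code is a rewrite of the parent's code."""
    history = branch_history(parent)   # only this branch, never sibling branches
    return DataNode(name, instruction, rewrite(parent.code, history, instruction), parent)

seen = []

def fake_rewrite(code: str, history: List[str], instruction: str) -> str:
    """Stub for the LLM: records what it was shown and appends a marker line."""
    seen.append(list(history))
    return code + f"# {instruction}\n"

root = DataNode("original")
ranked = derive(root, "ranked", "rank countries by renewable percentage", fake_rewrite)
avg = derive(ranked, "avg", "now, calculate average rank for each country", fake_rewrite)
alt = derive(ranked, "alt", "rank by CO2 instead", fake_rewrite)
```

When `alt` is derived, the stub sees only the instruction that led to `ranked`; the sibling instruction that produced `avg` is never part of its history.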

Fig. 8. Data threads and local data threads (right). In data threads, the user can create new charts from previous versions of data, and open previous charts in the main panel to create new branches; when creating new data, the AI model is instructed to revise previous code based on user instructions. In local data threads, the user can easily (1) rerun the previous instruction, (2) issue a follow-up instruction or (3) expand the previous card to revise and rerun the instruction.

图 8: 数据线程和本地数据线程（右侧）。在数据线程中，用户可以从数据的先前版本创建新图表，并在主面板中打开先前的图表以创建新分支；在创建新数据时，AI 模型会根据用户指令修改先前的代码。在本地数据线程中，用户可以轻松地 (1) 重新运行先前的指令，(2) 发出后续指令，或 (3) 扩展先前的卡片以修改并重新运行指令。

During iteration, the analyst needs both to (1) locate a dataset or chart further from the current one to create a new branch in the derivation tree and (2) perform quick follow-ups or revisions of the latest instruction from the latest data. To accommodate these different needs, Data Formulator 2 presents both (global) data threads and local data threads. For navigation, the key challenge is to help the user distinguish the desired content from the rest, and thus data threads are located in a separate panel with previews of data, instructions, and charts to assist navigation (Figure 3). This supports users’ different navigation styles, whether they want to navigate by provenance (i.e., using instruction cards to locate the desired data) or by artifacts (i.e., using visualization snapshots to recall data semantics). Once the user locates the desired data, they can click and open a previous chart to display it in the main panel for further updates, or create a new chart directly from the data (Figure 8-$\textcircled{1}$). To support quick updates from the current result, Data Formulator 2 aims to minimize users’ interaction overhead. Thus, the local data thread is designed as part of the main authoring panel (Figure 3). The local data thread visualizes only the history leading from the initial data to the current one and omits chart snapshots to minimize distraction (the full history is still available in the global data thread). By integrating the local data thread with the concept encoding shelf, Data Formulator 2 helps the user understand the authoring context and enables them to perform quick local updates. As shown in Figure 8, the user can rerun the previous instruction (e.g., when the AI produces an incorrect result and they would like to quickly retry before updating instructions), provide a follow-up instruction to refine the data, or quickly open the previous instruction to modify and rerun the command.

在迭代过程中，分析师需要 (1) 定位一个远离当前数据或图表的数据或图表，以在推导树中创建新分支，以及 (2) 对最新数据执行快速跟进/修订。为了满足这些不同需求，Data Formulator 2 提供了全局数据线程和本地数据线程。在导航方面，关键挑战是帮助用户从其他内容中区分出所需内容，因此数据线程位于一个独立面板中，其中包含数据、指令和图表的预览，以辅助导航 (图 3)。这支持用户不同的导航风格，无论是通过来源导航 (即使用指令卡定位所需数据) 还是通过工件导航 (即使用可视化快照回忆数据语义)。一旦用户定位到所需数据，他们可以点击并打开之前的图表，将其显示在主面板中以进行进一步更新，也可以直接从数据创建新图表 (图 8-1)。为了支持从当前结果进行快速更新，Data Formulator 2 旨在最小化用户的交互开销。因此，本地数据线程被设计为主要的创作面板的一部分 (图 3)。本地数据线程仅可视化从初始数据到当前数据的历史，并省略图表快照以最小化干扰 (完整历史仍然在全局数据线程中可用)。通过将本地数据线程与概念编码架集成，Data Formulator 2 帮助用户理解创作上下文，并使他们能够执行快速的本地更新。如图 8 所示，用户可以重新运行之前的指令 (例如，当 AI 生成错误结果时，他们希望在更新指令之前快速重试， $\pmb{\oslash}$ )，提供跟进指令以优化数据 $(\pmb{\otimes})$ ，以及快速打开之前的指令进行修改并重新运行命令 $(\pmb{\oslash})$ 。
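The three local actions differ only in which node the new step hangs from, which a short sketch makes concrete. `Step` and the function names are hypothetical, and a step whose parent is `None` is assumed to derive directly from the initial data.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Step:
    instruction: str
    parent: Optional["Step"] = None    # None: derived directly from the initial data

def local_thread(current: Optional[Step]) -> List[str]:
    """Instructions from the initial data to the current step, oldest first."""
    path = []
    while current is not None:
        path.append(current.instruction)
        current = current.parent
    return path[::-1]

def rerun(current: Step) -> Step:
    """Same instruction from the same parent, for example to retry a bad result."""
    return Step(current.instruction, current.parent)

def follow_up(current: Step, instruction: str) -> Step:
    """New instruction applied on top of the current result."""
    return Step(instruction, current)

def revise(current: Step, instruction: str) -> Step:
    """Replace the latest instruction: backtrack one step, then rerun."""
    return Step(instruction, current.parent)

first = Step("show renewable percentage")
second = follow_up(first, "show only top 5 CO2 emission countries' trends")
retried = rerun(second)
revised = revise(second, "show only top 3 CO2 emission countries' trends")
```

A follow-up lengthens the local thread by one step, while a rerun or a revision keeps its length and replaces only the last step.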

With data threads, analysts can manage and navigate the history and perform iterative updates from previous results, similar to how data analysts reuse code and data in computational notebooks. Otherwise, if the analyst had to start from scratch and ask the AI to achieve the goal at once, they would need to prepare rather detailed prompts to reduce ambiguity, especially when describing the more complex charts needed in later analysis stages.

通过数据线程，分析师可以管理和导航历史记录，并从之前的结果中进行迭代更新，类似于数据分析师在计算笔记本中重用代码和数据的方式。否则，如果分析师需要从头开始并让AI一次性实现目标，他们需要花费精力准备相当详细的提示，以减少歧义，特别是在描述后续分析阶段所需的更复杂图表时。

[论文翻译]Data Formulator 2: 使用AI迭代创建丰富的可视化

原文地址：https://arxiv.org/pdf/2408.16119v1

