[论文翻译]Data Formulator 2: 使用AI迭代创建丰富的可视化


原文地址:https://arxiv.org/pdf/2408.16119v1


Data Formulator 2: Iterative ly Creating Rich Visualization s with AI

Data Formulator 2: 使用AI迭代创建丰富的可视化

CHENGLONG WANG, Microsoft Research, USA BONGSHIN LEE, Yonsei University, Korea STEVEN DRUCKER, Microsoft Research, USA DAN MARSHALL, Microsoft Research, USA JIANFENG GAO, Microsoft Research, USA

CHENGLONG WANG, Microsoft Research, USA BONGSHIN LEE, Yonsei University, Korea STEVEN DRUCKER, Microsoft Research, USA DAN MARSHALL, Microsoft Research, USA JIANFENG GAO, Microsoft Research, USA

Fig. 1. With Data Formulator 2, analysts can navigate the iteration history in Data Threads and select previous designs to be reused towards new ones; then, using Concept Encoding Shelf, analysts specify their chart design using blended UI and natural language inputs, delegating data transformation effort to AI. When new charts are created, data threads are updated for future reference. Data Formulator 2 is available at https://github.com/microsoft/data-formulator.

图 1: 使用 Data Formulator 2,分析师可以在数据线程中浏览迭代历史,并选择之前的设计以重新用于新设计;然后,使用概念编码架,分析师通过混合 UI 和自然语言输入来指定他们的图表设计,将数据转换工作委托给 AI。当创建新图表时,数据线程会被更新以供将来参考。Data Formulator 2 可在 https://github.com/microsoft/data-formulator 获取。

To create rich visualization s, data analysts often need to iterate back and forth among data processing and chart specification to achieve their goals. To achieve this, analysts need not only proficiency in data transformation and visualization tools but also efforts to manage the branching history consisting of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs’ code generation ability. However, these systems do not work well for iterative visualization authoring, because they often require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization task to be performed, which is unrealistic to both users and models in many cases. In this paper, we present Data Formulator 2, an LLM-powered visualization system to address these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformation are delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don’t need to start from scratch every time. In a user study with eight participants, we observed that Data Formulator 2 allows participants to develop their own iteration strategies to complete challenging data exploration sessions.

为了创建丰富的可视化效果,数据分析师通常需要在数据处理和图表规范之间反复迭代以实现目标。为了实现这一点,分析师不仅需要精通数据转换和可视化工具,还需要努力管理由许多不同版本的数据和图表组成的分支历史。最近由大语言模型驱动的 AI 系统极大地改善了可视化创作体验,例如通过大语言模型的代码生成能力减轻手动数据转换的障碍。然而,这些系统在迭代可视化创作方面表现不佳,因为它们通常要求分析师在一次交互中提供一个仅包含文本的提示,以完全描述要执行的复杂可视化任务,这在许多情况下对用户和模型来说都是不现实的。在本文中,我们提出了 Data Formulator 2,一个由大语言模型驱动的可视化系统,以应对这些挑战。通过 Data Formulator 2,用户可以使用混合的 UI 和自然语言输入来描述他们的可视化意图,而数据转换则委托给 AI。为了支持迭代,Data Formulator 2 允许用户浏览他们的迭代历史并重用之前的设计,以便他们不需要每次都从头开始。在一项有八名参与者参与的用户研究中,我们观察到 Data Formulator 2 允许参与者开发自己的迭代策略,以完成具有挑战性的数据探索会话。

1 Introduction

1 引言

From an initial design idea, data analysts often need to go back and forth on a variety of charts before reaching their goals. Throughout this iterative process, besides updating the chart specifications, analysts face the challenges to transform and manage different data formats to support these visualization designs. Iterative chart authoring is prevalent in exploratory data analysis [42], where analysts often discover new directions from initial charts. For example, after noticing that the line chart about renewable energy percentage in Figure 1 are quite dense for comparing different countries’ trends, the analysts may want to filter it to show only top 5 CO2 emitter’s trends, or visualize ranks of these countries each year instead. To achieve these, the analysts need different data transformations: the former requires filtering the data with each country’s aggregated CO2 emission values, and the latter requires partitioning the data by year to compute each country’s ranking. Similar challenges are also relevant in the data-driven storytelling context [40, 41], where authors needs to derive new data to refine chart designs (e.g., annotation). For example, to highlight which countries are leading in renewable energy adoption, the author would superimpose a trend line of global median adoption rates over the line chart; the author may later convert the chart into small multiples to tell a story about most sustainable countries from each continent. Again, these new designs require data transformation from the current results.

从最初的设计构思出发,数据分析师通常需要反复尝试多种图表才能达成目标。在这一迭代过程中,除了更新图表的规格,分析师还面临转换和管理不同数据格式的挑战,以支持这些可视化设计。迭代式的图表创作在探索性数据分析中很常见[42],分析师通常从初始图表中发现新的方向。例如,在注意到图1中关于可再生能源百分比的折线图在比较不同国家的趋势时过于密集后,分析师可能希望对其进行过滤,仅显示前5个二氧化碳排放国的趋势,或者每年可视化这些国家的排名。为了实现这些目标,分析师需要进行不同的数据转换:前者需要根据每个国家的二氧化碳排放总量对数据进行过滤,后者则需要按年份划分数据以计算每个国家的排名。类似的挑战在数据驱动的叙事场景中也存在[40, 41],作者需要推导新数据以优化图表设计(例如注释)。例如,为了突出哪些国家在可再生能源采用方面领先,作者可能会在全球折线图上叠加全球中位数采用率的趋势线;作者随后可能将图表转换为小多图,以讲述每个大陆最具可持续性国家的故事。同样,这些新设计需要从当前结果中进行数据转换。

Managing different data and chart designs together in these iterative authoring processes is challenging. As the analyst comes up with new chart designs, they need not only to understand the data format expected by the chart and tool, but also need to know how to use diverse transformation operators (e.g., reshaping, aggregation, window functions, string processing) in data transformation tools or libraries to prepare the data. Many AI-powered tools have been developed to tackle these visualization challenges (e.g., [2, 9, 26, 31, 54, 55]). These tools let users describe their goals using natural language, and they leverage the underlying AI models’ code generation ability [1, 5] to automatically write code to transform the data and create the visualization. Despite their success, current tools do not work well in the iterative visualization authoring context. Most of them require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization authoring task to be performed, which is usually unrealistic to both users and models.

在这些迭代的创作过程中,管理不同的数据和图表设计具有挑战性。当分析师提出新的图表设计时,他们不仅需要理解图表和工具所期望的数据格式,还需要知道如何在数据转换工具或库中使用各种转换操作符(例如,重塑、聚合、窗口函数、字符串处理)来准备数据。许多基于 AI 的工具已经被开发出来,以应对这些可视化挑战(例如,[2, 9, 26, 31, 54, 55])。这些工具允许用户使用自然语言描述他们的目标,并利用底层 AI 模型的代码生成能力 [1, 5] 自动编写代码来转换数据并创建可视化。尽管这些工具取得了成功,但当前的工具在迭代的可视化创作环境中表现不佳。大多数工具要求分析师一次性提供一个完全描述要执行的复杂可视化创作任务的纯文本提示,这对用户和模型来说通常是不现实的。

To overcome these limitations, we design a new interaction approach for iterative ly chart authoring. Our key idea is to blend GUI and natural language (NL) inputs so that users can specify charts both precisely and flexibly, and we design an interface for users to control the contexts, so that users can navigate and reuse previous design towards new ones, as opposed to starting from scratch each time. We realize the design with the concept encoding shelf for specifying charts beyond data format constraints and data threads for managing the user’s non-linear authoring history (Figure 1).

为克服这些限制,我们设计了一种新的交互方法用于迭代式图表创作。我们的核心思想是融合图形用户界面(GUI)和自然语言(NL)输入,使用户能够既精确又灵活地指定图表,并设计了一个界面让用户控制上下文,从而用户可以在现有设计的基础上进行导航和重用,而不是每次都从零开始。我们通过概念编码货架(concept encoding shelf)来实现这一设计,用于在数据格式约束之外指定图表,并通过数据线程(data threads)来管理用户的非线性创作历史(图 1)。

Chart specification with blended UI and NL inputs. Resembling shelf-configuration UIs [40, 55], the concept encoding shelf allows users to drag existing data fields they wish to visualize and drop them to visual channels to specify chart designs. Differently, with concept encoding shelf, users can also input new data field names in the chart configuration to express their intent to visualize fields that they want from a transformed data. Then, they can provide a supplemental NL instruction to explain the new fields and ask the AI to transform data and instantiate the chart. This blended UI and NL approach for chart specification makes user inputs both precise and flexible. Since Data Formulator 2 can precisely extract chart specification from the encoding shelf, the user doesn’t need verbose prompt to explain the design. By conveying data semantics using NL inputs, the user delegates data transformation to AI, and thus they doesn’t need to worry about data preparation. This approach also improves the task success rate of AI models. Because Data Formulator 2 can infer the visualization script directly from UI input, the AI model only needs to generate data transformation code. With the chart design provided as contexts to the AI model, the model has more information to ground the user’s instruction for better code generation. Managing and leveraging iteration contexts with data threads. Data Formulator 2 presents the user’s non-linear iteration history as data threads and lets them manage data and charts created throughout the process. With data threads, users can easily navigate to an earlier result, fork a new branch, and reuse its context to create new charts. This way, users only need to inform the model how to update the previous result (e.g., “show only top 5 CO2 emission countries’ trends”, Figure 1) as opposed to re-describing the whole chart from scratch. When the user decides to reuse, the Data Formulator 2 tailors the conversation history to include only contexts relevant to that data to derive new result, allowing the AI to generate code with clear contextual information free from (irrelevant) messages from other threads. Besides general navigation and branching supports, data threads also provide shortcut for users to quickly backtrack and revise prompts to update recently created charts, which can be useful for analysts to explore alternative designs or correct errors made by AI.

混合界面与自然语言输入的图表规范。类似于货架配置界面 [40, 55],概念编码货架允许用户拖拽他们希望可视化的现有数据字段并将其放置到视觉通道中,以指定图表设计。不同的是,通过概念编码货架,用户还可以在图表配置中输入新的数据字段名称,以表达他们希望从转换后的数据中可视化字段的意图。然后,他们可以提供补充的自然语言指令来解释新字段,并要求 AI 转换数据并实例化图表。这种混合界面与自然语言的图表规范方法使用户输入既精确又灵活。由于 Data Formulator 2 可以精确地从编码货架中提取图表规范,用户无需冗长的提示来解释设计。通过使用自然语言输入传达数据语义,用户将数据转换委托给 AI,因此他们无需担心数据准备。这种方法还提高了 AI 模型的任务成功率。由于 Data Formulator 2 可以直接从界面输入推断可视化脚本,AI 模型只需要生成数据转换代码。通过将图表设计作为上下文提供给 AI 模型,模型有更多信息来支撑用户的指令,从而实现更好的代码生成。通过数据线程管理和利用迭代上下文。Data Formulator 2 将用户的非线性迭代历史呈现为数据线程,并允许他们管理在此过程中创建的数据和图表。通过数据线程,用户可以轻松导航到早期结果,分叉一个新分支,并重用其上下文来创建新图表。这样,用户只需告知模型如何更新先前的结果(例如,“仅显示前 5 个 CO2 排放国家的趋势”,图 1),而不是从头重新描述整个图表。当用户决定重用时,Data Formulator 2 会调整对话历史,仅包含与该数据相关的上下文以得出新结果,从而使 AI 能够生成具有清晰上下文信息的代码,不受其他线程中(不相关)消息的干扰。除了通用的导航和分支支持外,数据线程还为用户提供了快捷方式,可以快速回溯和修订提示以更新最近创建的图表,这对分析师探索替代设计或纠正 AI 的错误非常有用。

Based on these two key designs, we developed Data Formulator 2, an AI-powered visualization tool for iterative visualization authoring. Data Formulator 2 supports diverse visualization s provided by Vega-Lite marks and encodings, and the AI can transform data flexibly to accommodate different designs, supporting operators like reshaping, filtering, aggregation, window functions, and column derivation. Like other AI tools [9, 55], Data Formulator 2 also provides users panels to view generated data, transformation code and code explanations to inspect and verify AI outputs.

基于这两项关键设计,我们开发了 Data Formulator 2,这是一款支持迭代式可视化创作的 AI 驱动可视化工具。Data Formulator 2 支持由 Vega-Lite 标记和编码提供的多样化可视化,AI 可以灵活地转换数据以适应不同的设计,支持重塑、过滤、聚合、窗口函数和列派生等操作。与其他 AI 工具 [9, 55] 类似,Data Formulator 2 也为用户提供了面板,用于查看生成的数据、转换代码和代码解释,以便检查和验证 AI 的输出。

To understand how Data Formulator 2’s multi-modal interaction benefits analysts in solving challenging data visualization s tasks, we conducted a user study consisting of eight participants with varying data science expertise. They were asked to reproduce two professional data scientists’ analysis sessions to create a total of 16 visualization s, 12 of which require non-trivial data transformations (e.g., rank categories by a criterion and combine low-ranked ones into one category with the label, “Others”). The study shows that participants can quickly learn to use Data Formulator 2 to solve these complex tasks, and that Data Formulator 2’s flexibility and expressiveness allow participants to develop their own verification, error correction, and iteration strategies to complete the tasks. Our inductive analysis of study sessions reveals interesting patterns of how users’ experiences and expectations about the AI system affected their work styles.

为了理解 Data Formulator 2 的多模态交互如何帮助分析师解决具有挑战性的数据可视化任务,我们进行了一项用户研究,共有八名具有不同数据科学专业背景的参与者。他们被要求复现两名专业数据科学家的分析会话,以创建总共 16 个可视化,其中 12 个需要非平凡的数据转换(例如,按标准对类别进行排序,并将排名较低的类别合并为一个标签为“其他”的类别)。研究表明,参与者可以快速学会使用 Data Formulator 2 来解决这些复杂任务,并且 Data Formulator 2 的灵活性和表达能力使参与者能够开发自己的验证、纠错和迭代策略以完成任务。我们对研究会话的归纳分析揭示了用户的经验和他们对 AI 系统的期望如何影响其工作风格的有趣模式。

In summary, the main contributions of this paper are as follows:

总而言之,本文的主要贡献如下:

We design a multi-modal UI, composed of concept encoding shelf and data threads, to blend UI and NL interactions for users to specify their intent for iterative chart authoring.

我们设计了一个多模态用户界面,由概念编码架和数据线程组成,将用户界面和自然语言交互相结合,使用户能够指定其意图以进行迭代图表创作。

Fig. 2. A data analysis session where the analyst explores energy from different sources, renewable percentage trends, and ranks of countries by their renewable percentages from a dataset about $\mathsf{C O}_{2}$ and electricity of 20 countries between 2000 and 2020 (table 1). The analyst has to create five versions of the data to support different chart designs in three branches. Data Formulator 2 lets users manage the iteration contexts and create rich visualization s beyond the initial data with blended UI and natural language inputs.

图2: 数据分析会话,分析师从2000年至2020年间20个国家的$\mathsf{C O}_{2}$和电力数据集中探索不同能源的能源、可再生能源百分比趋势以及按可再生能源百分比排名的国家(表1)。分析师必须创建五个版本的数据以支持三个分支中的不同图表设计。Data Formulator 2允许用户管理迭代上下文,并通过混合UI和自然语言输入创建超越初始数据的丰富可视化。

• We implement our design with an interactive visualization tool, Data Formulator 2, which enables users to iterative ly create rich visualization s that requires multiple rounds of data transformations along the way. • We conducted a user study to learn how the blended UI and NL interaction benefits analysts in data exploration sessions. We observed that analysts can easily develop their own strategies to work the AI system to perform data analysis and visualization tasks that best reflect their personal experience and expectation with the AI model.

• 我们通过一个互动可视化工具 Data Formulator 2 实现了我们的设计,该工具使用户能够迭代创建丰富的可视化效果,这些效果在此过程中需要多轮数据转换。
• 我们进行了一项用户研究,以了解混合 UI 和自然语言交互如何在数据探索会话中使分析师受益。我们观察到,分析师可以轻松制定自己的策略,利用 AI 系统执行数据分析和可视化任务,这些任务最能反映他们对 AI 模型的个人经验和期望。

2 Illustrative Scenarios: Exploring Renewable Energy Trends

2 示例场景:探索可再生能源趋势

In this section, we describe scenarios to illustrate users’ experiences of creating a series of visualization s to explore global sustainability from a dataset of 20 countries’ energy from 2000 to 2020. The initial dataset, shown in Figure 2- $\cdot!\textcircled{1}.$ , includes each country’s energy produced from three sources (fossil fuel, renewables, and nuclear)

在本节中,我们描述了用户通过一系列可视化探索全球可持续性的场景,数据集涵盖了 20 个国家从 2000 年到 2020 年的能源数据。初始数据集如图 2- $\cdot!\textcircled{1}.$ 所示,包括每个国家从三种来源(化石燃料、可再生能源和核能)产生的能源。

each year and annual $\mathrm{CO}{2}$ emission value (the $\mathrm{CO}{2}$ emission data only ranges from 2000 to 2019). We compare a professional data analyst’s experience with computational notebooks and a journalist’s experience using Data Formulator 2 to complete the analysis session shown in Figure 2.

每年和年度 $\mathrm{CO}{2}$ 排放值( $\mathrm{CO}{2}$ 排放数据仅涵盖2000年至2019年)。我们对比了一位专业数据分析师使用计算笔记本的经验和一位记者使用Data Formulator 2完成图2所示分析会话的经验。

2.1 Exploration with computational notebooks

2.1 使用计算笔记本进行探索

Heather is an analyst who is proficient with a computational notebook and R libraries, ggplot2 and tidyverse. Because ggplot2 expects all data fields to be visualized on visual channels (e.g., $x,y$ -axes, color, facet) are columns in the input data, Header uses tidyverse for data transformation.

Heather 是一位精通计算笔记本和 R 库(ggplot2 和 tidyverse)的分析师。由于 ggplot2 要求所有要可视化的数据字段(例如 $x,y$ 轴、颜色、分面)都是输入数据中的列,因此她使用 tidyverse 进行数据转换。

Basic charts. To start, Heather wants to visualize the amount of electricity produced from renewables per country over the years with a line chart to see “if our planet is sustainable.” Since the input data (table $\cdot!!\bigcirc)$ includes all required fields, Heather creates the line chart with ease, by mapping columns $\mathsf{Y e a r}!\to!x$ , Electricity from renewables $(\mathsf{T w h}){\to\y$ and Entity $\rightarrow$ color (chart $\textcircled{1}$ -A). She then creates another line chart for $\mathrm{CO}{2}$ emission trends, mapping CO2 emissions (kt) to the $y$ axis (chart $\textcircled{1}$ -B). Heather is puzzled that China, the country with considerable increased use of renewable energy, also has the biggest increase in $\mathrm{CO}{2}$ emissions. This is counter intuitive because renewables themselves would not cause $\mathrm{CO}_{2}$ emission increase. Thus, Heather decides to dive deeper.

基本图表。首先,Heather 想通过折线图可视化各国多年来可再生能源发电量,以查看“我们的星球是否可持续”。由于输入数据(表 $\cdot!!\bigcirc)$ 包含了所有必要的字段,Heather 轻松地创建了折线图,将列 $\mathsf{Y e a r}!\to!x$、可再生能源发电量 $(\mathsf{T w h}){\to\y$ 和国家 $\rightarrow$ 颜色映射到图表中(图 $\textcircled{1}$ -A)。随后,她为 $\mathrm{CO}_{2}$ 排放趋势创建了另一个折线图,将 CO2 排放量 (kt) 映射到 $y$ 轴(图 $\textcircled{1}$ -B)。Heather 感到困惑的是,中国在可再生能源使用量大幅增加的同时,CO2 排放量的增幅也是最大的。这与直觉相悖,因为可再生能源本身不会导致 CO2 排放量的增加。因此,Heather 决定进一步深入研究。

Renewable energy versus other sources. Heather suspects the $\mathbf{CO}{2}$ emission increase is caused by a surge of fossil fuel consumption. To compare fossil fuel usage against renewables, she wants a faceted line chart that shows electricity from each energy source side by side (chart $\textcircled{2})$ ). To create the chart, Heather needs to have a data table with columns–Year, Electricity, Entity, and Energy Source–and map the columns to $x,y$ , color, and facet, respectively so that the chart is divided into subplots based on values from the Energy Source column. Because table $\textcircled{1}$ stores electricity values across three columns in the wide format, Heather unpivots table $\textcircled{1}$ into the long format, to fold specified column names into values in the Energy Source field and corresponding values into the Electricity field. She then creates the desired chart $\textcircled{2}$ with the transformed data ${\mathcal{O}}.$ and verifies her assumption: despite the increase of renewables usage, the usage of fossil fuel also grows significantly, leading to $\mathrm{CO}{2}$ emission increase. This motivates Heather to explore renewable trends by visualizing trends of the percentage of electricity from renewables over all three resources.

可再生能源与其他能源的比较。Heather怀疑$\mathbf{CO}{2}$排放量的增加是由化石燃料消耗的激增引起的。为了比较化石燃料和可再生能源的使用情况,她希望有一个分面折线图,并列显示每种能源的发电量(图表$\textcircled{2}$)。为了创建该图表,Heather需要一个包含Year、Electricity、Entity和Energy Source列的数据表,并分别将这些列映射到$x,y$、颜色和分面,以便根据Energy Source列中的值将图表划分为子图。由于表$\textcircled{1}$以宽格式存储了三个列中的发电量值,Heather将表$\textcircled{1}$转换为长格式,将指定的列名折叠为Energy Source字段中的值,并将相应的值折叠为Electricity字段。然后,她使用转换后的数据${\mathcal{O}}$创建了所需的图表$\textcircled{2}$,并验证了她的假设:尽管可再生能源的使用量有所增加,但化石燃料的使用量也显著增长,导致$\mathrm{CO}{2}$排放量增加。这促使Heather通过可视化可再生能源在三种资源中的发电量占比趋势来探索可再生能源的趋势。

Renewable energy percentage and ranks. To visualize renewable energy percentage, Heather goes back to table $\textcircled{1}$ to derive a new column Renewable Percentage, by dividing Electricity from renewables (TWh) from the total produced electricity for each country per year. With the new data $\textcircled{3}$ , Heather visualizes the renewable percentage trends in chart $\circled{3}$ , which shows that the percentage increase is slower than their absolute value increase (as shown in chart $\textcircled{1})$ .

可再生能源占比及排名。为了可视化可再生能源占比,Heather回到表 $\textcircled{1}$,通过将每个国家每年的可再生能源发电量(TWh)除以总发电量,得出一个新列“可再生能源占比”。通过新数据 $\textcircled{3}$,Heather在图 $\circled{3}$中可视化了可再生能源占比趋势,结果显示其占比增长速度慢于其绝对值增长速度(如图 $\textcircled{1}$所示)。

Because many countries share similar renewable percentage, it is quite difficult to compare different countries’ trends. Heather thus decides to create a visualization of countries’ renewable percentage ranks to complement existing charts. To calculate ranks of each country among others per year, Heather uses a window function on table $\circled{3}$ to partition the table based on Year, and apply the rank() function to Renewable Percentage to derive a new column Rank. With Rank mapped to $y$ -axis, chart $\circled{4}$ allows Heather to clearly examine how different countries’ ranks change in the last two decades; for example, Germany and UK are the two top ranked countries emerge from the bottom pack in 2000.

由于许多国家的可再生能源比例相似,因此很难比较不同国家的趋势。Heather 决定创建一个国家可再生能源比例排名的可视化图表,以补充现有图表。为了计算每年各国在其中的排名,Heather 在表 $\circled{3}$ 上使用窗口函数,按年份对表进行分区,并将 rank() 函数应用于可再生能源比例,从而生成一个新列 Rank。将 Rank 映射到 $y$ 轴后,图表 $\circled{4}$ 使 Heather 能够清楚地查看不同国家在过去二十年中的排名变化;例如,德国和英国是 2000 年从底部群体中脱颖而出的两个排名最高的国家。

Renewable trends from top CO2 emitters. Finally, Heather wants to focus on renewable percentage trends from top $\mathrm{CO}{2}$ emission countries, which make most influences to global sustainability. Despite table $\textcircled{3}$ contains all columns to be visualized, Heather needs to filter it based on the countries’ CO2 emission. To do so, Heather goes back to table $\textcircled{1}$ to aggregate each country’s total $\mathrm{CO}{2}$ emission, sort it and find top five. Heather then uses this intermediate result to filter table $\circled{3}$ to obtain renewable percentage from top five CO2 emitters (shown as table $\textcircled{5}$ ) and creates chart $\circled{5}$ .

主要二氧化碳排放国的可再生能源趋势


Fig. 3. Data Formulator 2 overview. The user creates visualization s by providing fields (drag-and-drop existing fields or type in new ones) and NL instructions to Concept Encoding Shelf and delegates data transformation to AI. Data View shows the derived data. The user can navigate data derivation history using Data Threads. They can then locate the desired point to refine or create new charts by providing follow-up instructions in Concept Encoding Shelf.

图 3: Data Formulator 2 概览。用户通过提供字段(拖放现有字段或输入新字段)和自然语言指令到概念编码架,并委托 AI 进行数据转换来创建可视化。数据视图显示派生数据。用户可以使用数据线程导航数据派生历史。然后,他们可以通过在概念编码架中提供后续指令来定位所需的点以进行细化或创建新图表。

From this chart, it is clear that top $\mathrm{CO}{2}$ emitters are indeed heading in the right direction towards sustainability, despite total $\mathrm{CO}{2}$ emissions are still increasing with total energy produced also increasing each year. To publish this visualization, Heather decides to add an annotation to the plot with the median global renewable percentage. On top of table $\circled{5}$ , Heather appends the median renewable percentage each year calculated from table $\circled{3}$ and includes a new column Global Median?, used as a flag to assist plotting so that global median can be colored in a different opacity. Chart $\circled{6}$ shows the final result, by including Global Median as an Entity and mapping Global Median? $\rightarrow$ opacity, median renewable percentage is visualized along other countries in a different opacity. Heather is satisfied with the results and concludes the session.

从这张图表中可以明显看出,尽管总的 $\mathrm{CO}{2}$ 排放量随着每年能源产量的增加而增加,但主要的 $\mathrm{CO}{2}$ 排放国确实正在朝着可持续发展的正确方向前进。为了发布这一可视化结果,Heather 决定在图表中添加一个注释,标注全球可再生能源比例的中位数。在表格 $\circled{5}$ 的基础上,Heather 添加了从表格 $\circled{3}$ 中计算出的每年可再生能源比例的中位数,并包含了一个新列 Global Median?,用作辅助绘图的标志,以便全球中位数可以用不同的透明度着色。图表 $\circled{6}$ 显示了最终结果,通过将 Global Median 作为一个实体并映射 Global Median? $\rightarrow$ 透明度,全球中位数的可再生能源比例与其他国家以不同的透明度一起可视化。Heather 对结果感到满意,并结束了这次会话。

2.2 Exploration with Data Formulator 2

2.2 使用数据构建工具 (Data Formulator) 进行探索

Megan is a journalist who has a solid understanding about data visualization. She utilizes visualization s effectively in her work but she doesn’t program. Megan can create and refine rich visualization s iterative ly with Data Formulator 2 (Figure 3), which inherits the basic experience of shelf-configuration style tools. She can specify charts by mapping data fields to visual channels of the selected chart and provide additional contexts using natural language.

Megan 是一名对数据可视化有深入了解的记者。她在工作中有效地使用可视化工具,但并不编程。Megan 可以通过 Data Formulator 2(图 3)迭代地创建和完善丰富的可视化效果,该工具继承了货架配置风格工具的基本体验。她可以通过将数据字段映射到所选图表的视觉通道来指定图表,并使用自然语言提供额外的上下文。

Basic charts. Megan starts with line charts to visualize trends of electricity from renewables (Figure $2{\sqrt{\textstyle\mathrm{1}\mathrm{)A}}},$ ). Since all three required fields are available from the input data, Megan simply selects chart type “line chart” in the encoding shelf and drags-and-drops fields to their corresponding visual channels (Figure 4- $\cdot!\textcircled{1},$ ). Data Formulator 2 then generates the desired visualization. To visualize the $\mathrm{CO}_{2}$ emission trends, Megan swaps $y$ -axis encoding with CO2 emissions $(\mathsf{k t})!\to!y$ .

基本图表。Megan 首先使用折线图来可视化可再生能源的电力趋势(图 2{\sqrt{\textstyle\mathrm{1}\mathrm{)A}}})。由于输入数据中所有三个必填字段都可用,Megan 只需在编码栏中选择图表类型“折线图”,并将字段拖放到相应的视觉通道(图 4- $\cdot!\textcircled{1}$)。然后 Data Formulator 2 生成所需的可视化图表。为了可视化 $\mathrm{CO}_{2}$ 排放趋势,Megan 将 $y$ 轴编码与 CO2 排放 $(\mathsf{k t})!\to!y$ 进行交换。


Fig. 4. Experiences with Data Formulator 2: (1) creating the basic renewable energy chart using drag-and-drop to encode fields; (2 and 3) creating charts that requiring new fields by providing field names and optional natural language instructions to derive new data.

图 4: 使用 Data Formulator 2 的体验:(1) 通过拖放字段编码创建基本可再生能源图表;(2 和 3) 通过提供字段名称和可选的自然语言指令来派生新数据,创建需要新字段的图表。

Renewable energy vs other sources. Megan now needs to create the faceted line chart to compare electricity from all energy sources, which requires new fields Electricity and Energy Source. With Data Formulator 2, Megan can specify the chart using future fields and NL instructions in Concept Encoding Shelf (Figure 3-2) and delegate data transformation to AI.

可再生能源与其他能源的对比

As Figure 4- $\circled{2}$ shows, Megan first drags-and-drops existing fields Year and Entity to $x$ -axis and color, respectively; then, she types in names of new fields Electricity and Energy Source in $y$ -axis and column, respectively, to tell the AI agent that she expects two new fields to be derived for these properties; finally, Megan provides an instruction “compare electricity from all three sources” to further clarify the intent and clicks the formulate button. To create the chart, Data Formulator 2 first generates a Vega-Lite spec skeleton from the encoding (to be completed based on information from the transformed data); it then summarizes the data, encodings, and NL instructions into a prompt to ask an LLM to generate a data transformation code to prepare the data that fulfils all necessary fields, which is then used to instantiate the chart skeleton. After reviewing the generated chart and data, Megan is satisfied and moves to the next task.

如图 4- $\circled{2}$ 所示,Megan 首先将现有的字段 Year 和 Entity 分别拖放到 $x$ 轴和颜色上;然后,她在 $y$ 轴和列中分别输入新字段 Electricity 和 Energy Source 的名称,以告诉 AI 智能体她希望为这些属性派生两个新字段;最后,Megan 提供了指令“比较所有三种来源的电力”,以进一步澄清意图并点击 formulate 按钮。为了创建图表,Data Formulator 2 首先从编码生成一个 Vega-Lite 规范骨架(根据转换后的数据中的信息完成);然后,它将数据、编码和自然语言指令总结为一个提示,要求大语言模型生成一个数据转换代码,以准备满足所有必要字段的数据,然后用于实例化图表骨架。在查看了生成的图表和数据后,Megan 感到满意并继续下一个任务。

Data Formulator 2 also updates data threads (Figure 3- $\cdot{\circled{5}}).$ ) with the newly derived data and chart so that Megan can manage and leverage data provenance. For example, Megan can delete the chart, create/fork new charts from either the original or new data, or select an existing chart to iterate from.

Data Formulator 2 还会使用新派生的数据和图表更新数据线程(图 3- $\cdot{\circled{5}}).$ ),以便 Megan 可以管理和利用数据来源。例如,Megan 可以删除图表、从原始数据或新数据中创建/分叉新图表,或选择现有图表进行迭代。

Renewable energy percentage and ranks. Megan proceeds to visualizing renewable energy percentage. Despite it required a different data transformation, Megan enjoys the same experience as the previous task: Megan dragsand-drops Year and Entity to $x_{\mathrm{{}}}$ -axis and color (Figure 4- $\langle{\mathfrak{3}}\rangle$ ), and enters the name of the new field “Renewable Energy Percentage” to $y$ -axis; then, since Megan believes the field names are self-explanatory, she proceeds to formulate the new data without an additional NL instruction. Data Formulator 2 generates the desired visualization (Figure 5- $\textcircled{1}$ ).

可再生能源比例及排名。Megan 继续可视化可再生能源比例。尽管这需要不同的数据转换,但 Megan 享受到了与之前任务相同的体验:Megan 将 Year 和 Entity 拖放到 $x_{\mathrm{{}}}$ 轴和颜色 (图 4- $\langle{\mathfrak{3}}\rangle$ ),并在 $y$ 轴输入新字段名称“可再生能源比例”;然后,由于 Megan 认为字段名称不言自明,她继续在没有额外自然语言指令的情况下制定新数据。Data Formulator 2 生成了所需的可视化 (图 5- $\textcircled{1}$ )。

To visualize the ranks of countries based on their renewable percentage, Megan decides to continue it from the previous chart, which already computes renewable percentage. To do so, Megan duplicates the renewable percentage chart and update the $y$ -axis field to another new field Rank and clicks “derive.” As shown in Figure 4- $\circled{3}$ , Megan’s interaction positions the Concept Encoding Shelf in the contexts prior result as opposed to the original data, which conveys her intent of reusing the data towards the new one. With the context information, the AI model successfully derives the desired chart (Figure 2- $\cdot!\leftmoon\right)$ ) even only with Megan’s simple inputs.

为了可视化基于可再生能源占比的国家排名,Megan 决定从之前的图表继续,该图表已经计算了可再生能源占比。为此,Megan 复制了可再生能源占比图表,并将 $y$ 轴字段更新为另一个新字段 Rank,然后点击“推导”。如图 4- $\circled{3}$ 所示,Megan 的交互将 Concept Encoding Shelf 定位在之前结果的上下文中,而不是原始数据,这传达了她将数据重用于新数据的意图。借助上下文信息,AI 模型成功推导出了所需的图表(图 2- $\cdot!\leftmoon\right)$),即使 Megan 的输入非常简单。

Renewable trends from top CO2 emitters. Next, Megan decides to visualize top-5 $\mathrm{CO}_{2}$ countries’ renewable percentage trends. Megan again decides to iterate on the previous chart, otherwise she needs more effort to create a longer prompt to specify the task at once. Megan first use data threads (Figure 3- $\circled{5}$ ) to locate renewable percentage chart. On top of that, Megan provides a new instruction below the local data thread, “show only top $5:C O2$ emission countries’ trends,” and clicks the “derive” button (Figure 5- $\cdot!\bigcirc!)$ . Data Formulator 2 updates the previous code to include a filter clause to produce the new data and visualization (Figure 5- $\langle{\mathcal{D}}\rangle$ .

主要二氧化碳排放国的可再生能源趋势


Fig. 5. Iteration with Data Formulator 2: (1) provide a new instruction on top of the renewable energy percentage chart to filter by only top $C{\mathrm{O}{2}}$ countries, (2) update the chart with a new field Global Median? and instruct Data Formulator 2 to add global median besides top $5,{\mathrm{CO}}{2}$ countries’ trends, and (3) move Global Median? from column to opacity to update chart design without deriving new data.

图 5: 使用 Data Formulator 2 进行迭代:(1) 在可再生能源百分比图表上提供新指令,仅过滤前 $C{\mathrm{O}{2}}$ 国家,(2) 使用新字段 Global Median? 更新图表,并指示 Data Formulator 2 在前 $5,{\mathrm{CO}}{2}$ 国家趋势旁边添加全球中位数,(3) 将 Global Median? 从列移动到不透明度,以在不派生新数据的情况下更新图表设计。

Megan continues on the iteration process, to include global median trends besides top $5,\mathrm{CO}_{2}$ countries’ trends. Since this new chart requires different encodings and she wants to keep both visualization s around, Megan forks a branch by copying the previous chart. Then, she updates the Concept Encoding Shelf by (1) adding a new encoding Global Median? $\rightarrow$ column and (2) providing the edit instruction “include global median as an entity” (Figure 5- $\langle{\mathcal{D}}\rangle$ . Once she clicks the derive button, Data Formulator 2 generates the new chart (Figure 5- $\langle{\mathfrak{H}}\rangle$ ). Upon inspection, Megan prefers to combine two views in one, with global average rendered in a different opacity. Since these two charts require the same data fields, she simply selects a new chart type “custom line” (which exposes more chart properties than the basic line chart) and moves Global Median? to the opacity channel. Since it requires no data transformation, Data Formulator 2 doesn’t need to invoke AI and directly renders the chart. With all desired chart created, Megan concludes the analysis session. Figure 3- $\cdot\textcircled{5}$ shows all three threads created by Megan that lead to the final designs.

Megan 继续迭代过程,除了前五大二氧化碳排放国的趋势外,还包括全球中位数趋势。由于这个新图表需要不同的编码,并且她希望保留两个可视化,Megan 通过复制之前的图表来创建一个分支。然后,她更新了概念编码架,通过 (1) 添加一个新的编码 Global Median? → 列,以及 (2) 提供编辑指令“将全球中位数作为一个实体包括进来” (图 5- ⟨D⟩)。一旦她点击派生按钮,Data Formulator 2 就会生成新图表 (图 5- ⟨H⟩)。经过检查,Megan 更倾向于将两个视图合并为一个,并以不同的不透明度呈现全球平均值。由于这两个图表需要相同的数据字段,她只需选择一个新的图表类型“自定义折线图”(它比基本折线图暴露了更多的图表属性),并将 Global Median? 移动到不透明度通道。由于不需要数据转换,Data Formulator 2 不需要调用 AI 并直接渲染图表。创建完所有所需的图表后,Megan 结束了分析会话。图 3- ⋅⑤ 显示了 Megan 创建的所有三个线程,这些线程最终导致了最终的设计。

2.3 Comparison of Exploration Experiences

2.3 探索体验比较

The experience of Heather and Megan exploring global sustainability (using data $\textcircled{1})$ demonstrates an inherently iterative process. Both of them started with a high-level goal without concrete designs in mind and gradually formed the design from explorations in various branches. This iterative exploration process required a series of data transformation and the management of provenance, and thus is challenging for people not proficient in data transformation and programming. Here, we compare their exploration experiences to highlight how Data Formulator 2 bridges Megan’s skill gap, enabling her to achieve the analysis Heather, an experienced data analysis, performed.

Heather 和 Megan 探索全球可持续性(使用数据 $\textcircled{1}$)的经历展示了一个本质上迭代的过程。她们两人最初都设定了一个高层目标,但没有具体的计划,随后通过在各分支中的探索逐渐形成了设计。这种迭代的探索过程需要一系列的数据转换和溯源管理,因此对于不精通数据转换和编程的人来说具有挑战性。在这里,我们比较她们的探索经历,以突出 Data Formulator 2 如何弥补 Megan 的技能差距,使她能够完成经验丰富的数据分析师 Heather 所执行的分析。

2.3.1 Data transformation and chart creation. When new designs are considered, Heather needs to prepare new data to accommodate the design, even when some designs are seemingly close (e.g., charts $\circled{3}$ and $\textcircled{5},$ ). This requires her to understand the data shape expected by the charts, choose the right transformation idiom (e.g., unpivot for table $\textcircled{2}$ , join and union for table $\textcircled{6},$ ), and implement them with proper operators. Once the data is prepared, Heather can easily specify chart by mappings data columns to visual channels of the selected chart type. Her proficiency in data transformation is essential for her to create rich visualization s beyond the initial dataset.

2.3.1 数据转换与图表创建

To bridge Megan’s skill gap in data transformation, Data Formulator 2 lets Megan specify her intents in a unified interaction that combines chart encodings and natural language, regardless of the types of data transformation required behind the scene, and data transformation is delegated to AI. Because the Concept Encoding Shelf resembles the shelf-configuration UI, Megan’s experience from Power BI translates well into Data Formulator 2. Furthermore, since Megan communicates the chart design using concept encodings, she only needs to provide a short supplementary NL instruction to clarify her intents; based on these inputs, Data Formulator 2 assembles a detailed prompt to communicate with the AI model. If Megan were to use text-only interface to interact with AI, she needs more detailed prompt to explain her intent to avoid ambiguity, including explaining chart encodings she created from drag-and-drop interactions easily.

为了弥补 Megan 在数据转换方面的技能差距,Data Formulator 2 允许 Megan 在一个统一的交互中指定她的意图,该交互结合了图表编码和自然语言,无论背后需要哪种类型的数据转换,数据转换都会被委托给 AI。由于概念编码架类似于配置架 UI,Megan 从 Power BI 中获得的使用经验可以很好地转移到 Data Formulator 2 中。此外,由于 Megan 使用概念编码来传达图表设计,她只需要提供简短的补充自然语言指令来澄清她的意图;基于这些输入,Data Formulator 2 会组装一个详细的提示来与 AI 模型进行通信。如果 Megan 使用纯文本界面与 AI 交互,她需要更详细的提示来解释她的意图以避免歧义,包括解释她通过拖放交互轻松创建的图表编码。

2.3.2 Managing branching contexts. During the exploration, Heather backtracks several times to reuse previous results toward new designs (e.g., chart $\langle!!\langle4\rangle!!\rightarrow$ data $\circled{3}\rightarrow$ chart $\textcircled{5},$ ), creating three branches along the way. Because Heather programs in a notebook, she can either copy and adapt previous code snippets or reuse variables computed in previous iterations for new designs. This way, Heather lowers her specification efforts despite new designs are more complex. Heather’s programming expertise is essential for her to manage the branching contexts in a linear programming environment.

2.3.2 管理分支上下文

For Megan, managing branching contexts with different version of data could be challenging without Data Formulator 2. Should Megan use a chat-based AI interface, she would need to prepare a verbose prompt to explain contexts and data transformation goals in detail each turn to avoid extra disambiguation efforts, especially when multiple branches are mixed in the chat history and the task becomes more complicated later on. Data Formulator 2’s data threads address this challenge. Data threads not only provide a visualization for Megan to review history, but also let her visit previous states and reuse them towards new branches as Heather did. This way, Megan only needs to specify updates to be applied as opposed to describing the full design from scratch in one shot, and the AI model can generate results more reliably leveraging the contexts Megan provided. Shall Megan spot undesired results, she could also use data threads (Figure 3- $\langle{\mathfrak{H}}\rangle$ ) to rerun or backtrack one step to revise instructions, as opposed to restarting from the scratch.

对于 Megan 来说,如果没有 Data Formulator 2,管理不同版本数据的分支上下文可能会很具挑战性。如果 Megan 使用基于聊天的 AI 界面,她每次都需要准备一个冗长的提示,详细解释上下文和数据转换目标,以避免额外的歧义消除工作,尤其是当聊天历史中混合了多个分支且任务后来变得更为复杂时。Data Formulator 2 的数据线程解决了这一挑战。数据线程不仅为 Megan 提供了历史回顾的可视化,还让她能够访问先前的状态并将其重用于新的分支,正如 Heather 所做的那样。这样一来,Megan 只需指定要应用的更新,而不是一次性从头描述完整的设计,AI 模型可以更可靠地利用 Megan 提供的上下文生成结果。如果 Megan 发现不理想的结果,她还可以使用数据线程(图 3- $\langle{\mathfrak{H}}\rangle$)重新运行或回溯一步以修改指令,而不是从头开始。

3 The Data Formulator 2 System Design

3 The Data Formulator 2 系统设计

As described earlier, Data Formulator 2 combines UI and NL interactions in a multi-modal UI to reduce analysts visualization authoring efforts, and it provides data threads for users to navigate iteration history and specify new designs on top of previous ones. Data Formulator 2 employs the following system designs to support such interactions:

如前所述,Data Formulator 2 在多模态 UI 中结合了 UI 和 NL 交互,以减少分析师的可视化创作工作量,并提供了数据线程,供用户导航迭代历史并在先前设计的基础上指定新设计。Data Formulator 2 采用了以下系统设计来支持此类交互:

• First, to allow users to specify chart design and data transformation goals from different paradigms (shelfconfiguration UI versus NL inputs), Data Formulator 2 decouples chart specification and data transformation and solve them with different techniques (template instantiation versus AI code generation). • Second, to support reusing, Data Formulator 2 organizes the iteration history as data threads, treating data as first class objects. Data Formulator 2 enables users either to locate a chart from a different branch and follow up, or to quickly revise and rerun the most recent instruction leading to the current chart. We next detail how Data Formulator 2 realizes these designs, and additional features designed to assist users to understand AI-generated results.

• 首先,为了允许用户从不同范式(货架配置 UI 与自然语言输入)中指定图表设计和数据转换目标,Data Formulator 2 将图表规范和数据转换解耦,并使用不同技术(模板实例化与 AI 代码生成)分别解决。• 其次,为了支持重用,Data Formulator 2 将迭代历史组织为数据线程,将数据视为一等对象。Data Formulator 2 使用户能够从不同分支中定位图表并继续跟进,或快速修改并重新运行导致当前图表的最新指令。接下来,我们将详细说明 Data Formulator 2 如何实现这些设计,以及为帮助用户理解 AI 生成结果而设计的附加功能。

3.1 Multi-modal UI: Decoupling chart specification and data transformation

3.1 多模态用户界面:解耦图表规范与数据转换

Data Formulator 2 decouples chart specification and data transformation so that users can benefit from both the precision of UI interaction to configure chart designs and the expressiveness of NL descriptions to specify

Data Formulator 2 将图表规范与数据转换解耦,使用户既能通过 UI 交互的精确性来配置图表设计,又能通过自然语言描述的表达能力来指定数据转换。

Fig. 6. Data Formulator 2’s workflow. (1) Given the user specification in the concept encoding shelf, Data Formulator 2 first generates a Vega-Lite spec skeleton from the selected chart type. (2) When the chart requires new fields (e.g., Rank), Data Formulator 2 compiles a prompt from the concept encoding shelf and asks its AI model to generate a data transformation code to produce the desired data. (3) Upon completion, the Vega-Lite skeleton is instantiated with the new data to produce the desired chart.

图 6: Data Formulator 2 的工作流程。(1) 根据概念编码架中的用户规范,Data Formulator 2 首先从选定的图表类型生成一个 Vega-Lite 规范框架。(2) 当图表需要新字段 (例如 Rank) 时,Data Formulator 2 从概念编码架中编译提示,并要求其 AI 模型生成数据转换代码以生成所需数据。(3) 完成后,Vega-Lite 框架将使用新数据进行实例化,以生成所需的图表。

data transformation goals. As shown in Figure 6, given a user specification in the concept encoding shelf, Data Formulator 2 generates the desired chart in three steps: (1) generating a Vega-Lite script from the selected chart type, (2) compiling a prompt and delegate data transformation to AI, and (3) using the generated data to instantiate Vega-Lite script to render the desired chart.

数据转换目标。如图 6 所示,在概念编码栏中给定用户规范后,Data Formulator 2 通过三个步骤生成所需图表:(1) 从选定的图表类型生成 Vega-Lite 脚本,(2) 编译提示并委托 AI 进行数据转换,(3) 使用生成的数据实例化 Vega-Lite 脚本以渲染所需图表。

Chart specification generation. Data Formulator 2 adopts a chart type-based approach to represent visualization s, supporting five categories of charts: scatter (scatter plot, ranged dot plot), line (line chart, dotted line chart), bar (bar chart, stacked bar chart, grouped bar chart), statistics (histogram, heatmap, linear regression, boxplot) and custom (custom scatter, line, bar area, rectangle where all available visual channels are exposed for advanced users). Each chart type is represented as a Vega-Lite template with a set of predefined visual channels, including position channels $(x,y)$ , legends (color, size, shape, opacity), and facet channels (column, row) shown to the user in the concept encoding shelf. For example, a line chart is represented as a Vega-Lite template { "mark": "line", "encoding" : ${\mathbf{\nabla}^{\prime\prime}\mathbf{x}^{\prime\prime}\colon$ : null, "y": null, "color": null, "column": null, "row": null}}, and when the user selects line chart, channels $x,y.$ , color, column, and row are displayed in the concept encoding shelf. Using chart type-based design, Data Formulator 2 supports predefined layered chart (e.g., ranged dot plot and linear regression plot that are composed from line and scatter, Figure 7-right). Additional chart types (e.g., bullet chart) can be supported by adding new Vega-Lite templates with respected channels to the library.

图表规范生成。Data Formulator 2采用基于图表类型的方法来表示可视化,支持五种图表类别:散点图(散点图、范围点图)、折线图(折线图、点线图)、柱状图(柱状图、堆叠柱状图、分组柱状图)、统计图(直方图、热力图、线性回归、箱线图)和自定义(自定义散点图、折线图、柱状区域、矩形,所有可用的视觉通道都暴露给高级用户)。每种图表类型都表示为一个Vega-Lite模板,带有一组预定义的视觉通道,包括位置通道$(x,y)$、图例(颜色、大小、形状、不透明度)和分面通道(列、行),这些通道在概念编码架上展示给用户。例如,折线图表示为一个Vega-Lite模板{"mark": "line", "encoding": ${\mathbf{\nabla}^{\prime\prime}\mathbf{x}^{\prime\prime}\colon$ : null, "y": null, "color": null, "column": null, "row": null}},当用户选择折线图时,通道$x,y.$、颜色、列和行会显示在概念编码架上。通过基于图表类型的设计,Data Formulator 2支持预定义的分层图表(例如,由折线图和散点图组成的范围点图和线性回归图,图7-右)。通过向库中添加带有相应通道的新Vega-Lite模板,可以支持其他图表类型(例如,子弹图)。

As the user inputs fields into the concept encoding shelf, either by dragging and dropping it from existing data fields or by typing in new fields they wish to visualize, Data Formulator 2 instantiate s the Vega-Lite template with provided fields. For example, as shown in Figure 6- $\bullet$ , when the user drags Year $\rightarrow x$ , Entity $\rightarrow y$ and types Rank in $y_{\cdot}$ , the line chart template mentioned above is instantiated with provided fields: if the field is available in the current data table, both field name and encoding type are instantiated (e.g., Year with type “temporal”), otherwise the encoding type is left as a “” to be instantiated later when data transformation completes.

当用户通过从现有数据字段中拖放或输入他们希望可视化的新字段将字段输入到概念编码栏时,Data Formulator 2 会使用提供的字段实例化 Vega-Lite 模板。例如,如图 6 所示,当用户拖动 Year $\rightarrow x$ 和 Entity $\rightarrow y$ 并在 $y_{\cdot}$ 中输入 Rank 时,上述折线图模板会使用提供的字段进行实例化:如果字段在当前数据表中可用,则字段名称和编码类型都会被实例化(例如,Year 的类型为“temporal”),否则编码类型将保留为“”,以便在数据转换完成后进行实例化。

The shelf-configuration design provides users with simple yet precise interaction. The concept encoding shelf saves users efforts from writing prompts to explain the chart design. Figure 7 further illustrates how the specification in the concept encoding shelf interacts with the underlying Vega-Lite scripts. In Figure 7-left, the user can specify an ‘avg’ operator the $y_{\mathrm{,}}$ -axis to transform the axis and the operator is instantiated as the “aggregate” property of $y_{\mathrm{,}}$ -axis in the script. In addition, Figure 7-right shows another example of the user working with a layered chart (ranged dot plot): as the user fills fields in the UI, Data Formulator 2 populates corresponding fields to different parameters in the predefined chart template.

层架配置设计为用户提供了简单而精确的交互方式。概念编码层架减少了用户编写提示来解释图表设计的麻烦。图 7 进一步展示了概念编码层架中的规范如何与底层的 Vega-Lite 脚本进行交互。在图 7 左侧,用户可以在 $y_{\mathrm{,}}$ 轴上指定一个 "avg" 操作符来转换轴,该操作符在脚本中被实例化为 $y_{\mathrm{,}}$ 轴的 "aggregate" 属性。此外,图 7 右侧展示了另一个用户处理分层图表(范围点图)的示例:当用户在 UI 中填写字段时,Data Formulator 2 会将相应的字段填充到预定义图表模板中的不同参数中。


Fig. 7. Concept encoding shelf instantiate s users’ encodings as a Vega-Lite specification. The user creates a bar chart showing average rank of countries with an “avg” operator on $y.$ -axis, and a ranged dot plot to compare ranks of each country in 2000 and 2020 (the chart template routes users’ $x\cdot$ -axis encoding “Entity” to both $x$ and detail channels).

图 7: 概念编码层将用户的编码实例化为 Vega-Lite 规范。用户创建了一个条形图,显示国家在 $y$ 轴上使用“avg”操作符的平均排名,以及一个范围点图来比较每个国家在 2000 年和 2020 年的排名(图表模板将用户的 $x\cdot$ 轴编码“Entity”路由到 $x$ 和详细通道)。

Data transformation with AI. From the concept encoding shelf, Data Formulator 2 assembles a prompt and queries LLM to generate a python code to transform data. The data transformation prompt contains three segments: the system prompt, the data transformation context and the goal (illustrated Figure $6-\textcircled{2}$ , full prompt is shown in the Appendix):

使用AI进行数据转换。从概念编码库中,Data Formulator 2 组装一个提示并查询大语言模型以生成用于数据转换的Python语言代码。数据转换提示包含三个部分:系统提示、数据转换上下文和目标(如图 $6-\textcircled{2}$ 所示,完整提示见附录):

• Finally, Data Formulator 2 assembles a goal prompt section, combining the NL instruction provided in the text box and field names used in the encodings. When user skips NL instruction (as shown in Figure 4- $\mathbf{\mathcal{O}}$ ), the instruction part is simply left blank. This goal will be refined by the LLM as instruction by the system prompt before attempting to generate the data transformation code.

• 最后,Data Formulator 2会组装一个目标提示部分,结合文本框中提供的自然语言 (NL) 指令和编码中使用的字段名称。当用户跳过 NL 指令时(如图 4- $\mathbf{\mathcal{O}}$ 所示),指令部分将留空。在尝试生成数据转换代码之前,LLM 会根据系统提示将此目标作为指令进行优化。

With the full input, Data Formulator 2 prompts the LLM to generate a response, consisting of the refined objective and the code. Below shows the LLM’s refined objective for the task in Figure 6, and the generated code is shown in Figure 6- 2 .

在完整输入的情况下,Data Formulator 2 会提示大语言模型生成响应,包括优化后的目标和代码。下图展示了大语言模型对图 6 中任务的优化目标,生成的代码如图 6-2 所示。

Data Formulator 2 then runs the code on the input data. If the code executes without errors, the output data is used to instantiate the Vega-Lite script generated in the previous step, by first inferring semantic types of newly generated columns (to determine their encoding type), and then assembling the data with the script to render the visualization (Figure 6- $\circledcirc$ ). Occasionally, the generated code may cause runtime errors, either due to attempting to use libraries that are not imported, references to invalid columns names, or incorrectly handling of undefined or NaN values. When errors occur, before asking users to retry, Data Formulator 2 tries to correct the errors, by querying the LLM with the the error message and a follow-up instruction to repair its mistakes [8, 33]. When repair completes, the visualization is similarly generated. Either way, Data Formulator 2 updates the data threads and presents the results to the user.

Data Formulator 2 随后在输入数据上运行代码。如果代码执行无误,输出数据将用于实例化上一步生成的 Vega-Lite 脚本,首先推断新生成列的语义类型(以确定其编码类型),然后将数据与脚本组装以渲染可视化(图 6- $\circledcirc$)。偶尔,生成的代码可能会引发运行时错误,原因可能是尝试使用未导入的库、引用无效的列名或错误处理未定义或 NaN 值。当发生错误时,在要求用户重试之前,Data Formulator 2 会尝试通过向大语言模型发送错误消息和后续指令来修复错误 [8, 33]。修复完成后,同样会生成可视化。无论哪种情况,Data Formulator 2 都会更新数据线程并向用户展示结果。

3.2 Data threads: navigating the iteration history

3.2 数据线程:导航迭代历史记录

During the iterative visualization process, the analyst needs to navigate their authoring history to locate relevant artifacts (data or charts) to take actions (delete, duplicate or followup). Data Formulator 2 introduces data threads to represent the tree-structured iteration history to support navigation tasks. In data threads, we treat data as the first class objects (nodes in data threads) that are connected according to the user’s instruction provided to the AI model (edges), and visualization s are attached to the version of data they are created from. Centering the iteration history around data benefits user navigation because it directly reflects the sequence of user actions in creating these new data. This design also benefits the AI model: when user issues a follow-up instruction, Data Formulator 2 automatically retrieves its conversation history with the AI towards the current data and then instruct the AI model to rewrite the code towards new goals based on the retrieved history. This way, the AI model does not pose risk of incorrectly using conversation history from other branches to make incorrect data transformation. As shown in Figure 8, the code and the conversation history is attached to each data nodes. Each turn when the user provides a follow-up instruction, the AI model generates new code by updating the previous code (could be deletion, addition or both) to achieve the user’s goal; this way, the code always takes the original data as the input with all information accessible. Comparing to an alternative design where we only pass current data to the AI model and asks it to write a new code to further transform it (i.e., reusing the data as opposed to reusing the computation leading to the data), our design has more flexibility to accommodate different styles of followup instructions — either the user wants to further update the data (e.g., “now, calculate average rank for each country”), revise previous the computation (e.g., “also consider nuclear as renewable energy”) or creating an alternatives (e.g., “rank by CO2 instead”) — since the AI has access to the full dialog history and the full dataset. In contrast, the data-only reuse approach restricts the AI model’s access to only the current data, limiting its ability to support “backtracking” or “alternative design” styles instructions.

在迭代的可视化过程中,分析师需要浏览其创作历史,以定位相关的工件(数据或图表)并采取行动(删除、复制或跟进)。Data Formulator 2 引入了数据线程来表示树结构的迭代历史,以支持导航任务。在数据线程中,我们将数据视为一等对象(数据线程中的节点),这些节点根据用户提供给 AI 模型的指令(边)连接,并且可视化内容附加到它们所创建的数据版本上。将迭代历史围绕数据为中心有助于用户导航,因为它直接反映了用户创建这些新数据的操作序列。这种设计也有利于 AI 模型:当用户发出后续指令时,Data Formulator 2 会自动检索其与 AI 的对话历史,然后指示 AI 模型根据检索到的历史重写代码以实现新目标。这样,AI 模型不会误用其他分支的对话历史而导致错误的数据转换。如图 8 所示,代码和对话历史附加到每个数据节点上。每当用户提供后续指令时,AI 模型通过更新先前的代码(可能是删除、添加或两者兼有)来生成新代码,以实现用户的目标;这样,代码始终以原始数据作为输入,并且可以访问所有信息。与另一种设计相比,我们只将当前数据传递给 AI 模型并要求它编写新代码以进一步转换数据(即重用数据而不是重用导致数据的计算),我们的设计更具灵活性,能够适应不同风格的后续指令——无论是用户希望进一步更新数据(例如,“现在,计算每个国家的平均排名”)、修改先前的计算(例如,“也将核能视为可再生能源”)还是创建替代方案(例如,“按 CO2 排名”)——因为 AI 可以访问完整的对话历史和完整的数据集。相比之下,仅重用数据的方法限制了 AI 模型仅能访问当前数据,限制了其支持“回溯”或“替代设计”风格指令的能力。


Fig. 8. Data threads and local data threads (right). In data threads, the user can create new charts from previous versions of data, and open previous charts in the main panel to create new branches; when creating new data, the AI model is instructed to revise previous code based on user instructions. In local data threads, the user can easily (1) rerun the previous instruction, (2) issue a follow-up instruction or (3) expand the previous card to revise and rerun the instruction.

图 8: 数据线程和本地数据线程(右侧)。在数据线程中,用户可以从数据的先前版本创建新图表,并在主面板中打开先前的图表以创建新分支;在创建新数据时,AI 模型会根据用户指令修改先前的代码。在本地数据线程中,用户可以轻松地 (1) 重新运行先前的指令,(2) 发出后续指令,或 (3) 扩展先前的卡片以修改并重新运行指令。

During iteration, the analyst needs both (1) locating a data or a chart further from the current one to create new branch in derivation tree and (2) performing quick follow-up/revisions of latest instruction from the latest data. To accommodate these different needs, Data Formulator 2 presents both (global) data threads and local data threads. For navigation, the key challenge is assist user to distinguish the desired content from others, and thus data threads are located in a separate panel with previews of data, instruction and charts to assist navigation (Figure 3). This support users different navigation styles, either they want to navigate by provenance (i.e., using instruction cards to locate desired data) or navigate by artifacts (i.e., using visualization snapshots to recall data semantics). Once the user locates the desired data, they can click and open a previous chart to display it in the main panel for further updates as well as create a new chart directly from the data Figure 8- 1 . To support quick updates from the current result, Data Formulator 2 aims to minimize users’ interaction overhead. Thus, the local data thread is designed as part of the main authoring panel Figure 3. The local data thread visualizes only the history leading from the initial data to the current one and omits chart snapshots to minimize distraction (the full history is still available in the global data thread). By integrating the local data thread with the concept encoding shelf, Data Formulator 2 helps the user understand the authoring contexts and enables them to perform quick local updates. As shown in Figure 8, the user can rerun the previous instruction (e.g., when the AI produces an incorrect result and they would like to quickly retry before updating instructions, $\pmb{\oslash}$ ), provide a follow-up instruction to refine the data $(\pmb{\otimes})$ , as well as quickly open the previous instruction to modify and rerun the command $(\pmb{\oslash})$ .

在迭代过程中,分析师需要 (1) 定位一个远离当前数据或图表的数据或图表,以在推导树中创建新分支,以及 (2) 对最新数据执行快速跟进/修订。为了满足这些不同需求,Data Formulator 2 提供了全局数据线程和本地数据线程。在导航方面,关键挑战是帮助用户从其他内容中区分出所需内容,因此数据线程位于一个独立面板中,其中包含数据、指令和图表的预览,以辅助导航 (图 3)。这支持用户不同的导航风格,无论是通过来源导航 (即使用指令卡定位所需数据) 还是通过工件导航 (即使用可视化快照回忆数据语义)。一旦用户定位到所需数据,他们可以点击并打开之前的图表,将其显示在主面板中以进行进一步更新,也可以直接从数据创建新图表 (图 8-1)。为了支持从当前结果进行快速更新,Data Formulator 2 旨在最小化用户的交互开销。因此,本地数据线程被设计为主要的创作面板的一部分 (图 3)。本地数据线程仅可视化从初始数据到当前数据的历史,并省略图表快照以最小化干扰 (完整历史仍然在全局数据线程中可用)。通过将本地数据线程与概念编码架集成,Data Formulator 2 帮助用户理解创作上下文,并使他们能够执行快速的本地更新。如图 8 所示,用户可以重新运行之前的指令 (例如,当 AI 生成错误结果时,他们希望在更新指令之前快速重试,$\pmb{\oslash}$),提供跟进指令以优化数据 $(\pmb{\otimes})$,以及快速打开之前的指令进行修改并重新运行命令 $(\pmb{\oslash})$。

With data threads, analysts can manage and navigate the history and perform iterative updates from previous results, similar to how data analysts reuse code and data in computation notebooks. Otherwise, if the analyst needs to start from scratch and ask the AI to achieve the goal at once, they need efforts to prepare rather detailed prompts to reduce ambiguity, especially for describing more complex charts they need in later analysis stages.

通过数据线程,分析师可以管理和导航历史记录,并从之前的结果中进行迭代更新,类似于数据分析师在计算笔记本中重用代码和数据的方式。否则,如果分析师需要从头开始并让AI一次性实现目标,他们需要花费精力准备相当详细的提示,以减少歧义,特别是在描述后续分析阶段所需的更复杂图表时。

3.3 Miscellaneous: inspecting results and styling charts

3.3 杂项:检查结果和样式化图表

As an AI-powered tool, Data Formulator 2 lets the user verify AI-generated results and resolve mistakes made by AI. It displays transformed data, visualization, the code, and an explanation of the code in the main panel. This design accommodates various user verification styles identified by prior work [12, 54]: e.g., viewing high-level correctness from chart, inspecting corner cases in data, inspecting the transformation output, as well as understanding the transformation process from code. Data Formulator 2 utilizes a code explanation module to query the AI model to translate code into step-by-step explanations assist users to understand the process. Furthermore, despite data transformations generated in the later iteration stages can be complex, users only need to verify its correctness against its predecessor because Data Formulator 2 users create visualization s increment ally. This considerably lowers users’ verification efforts, as we discovered in our study in Section 4. As previously mentioned in Figure 8, when the user discovers errors, they can take advantages of the data thread’s iterative mechanism to rerun, follow up or revise instructions to correct results.

作为一款AI驱动的工具,Data Formulator 2允许用户验证AI生成的结果并纠正AI犯下的错误。它在主面板中展示了转换后的数据、可视化图表、代码以及代码的解释。这一设计适应了先前研究[12, 54]中识别的多种用户验证风格:例如,从图表中查看高层次正确性、检查数据中的极端情况、检查转换输出,以及通过代码理解转换过程。Data Formulator 2利用代码解释模块查询AI模型,将代码转化为逐步解释,帮助用户理解过程。此外,尽管在后期迭代阶段生成的数据转换可能很复杂,但由于Data Formulator 2用户逐步创建可视化,用户只需验证其与前一版本的正确性即可。正如我们在第4节的研究中所发现的,这大大降低了用户的验证工作量。如前文图8所述,当用户发现错误时,他们可以利用数据线程的迭代机制重新运行、跟进或修改指令以纠正结果。

Benefiting from the decoupled chart specification and data transformation processes, when the user wants to update visualization styles (e.g., change color scheme, change sort order of an axis, or swap encodings) that do not require additional data transformation, they can directly perform edits in the concept encoding shelf, by expanding the channel property and update parameters or swapping encoded fields. These updates are directly reflected in the Vega-Lite script and rendered in the main panel. Unlike interactions with AI which has a slightly delayed response time, this approach allows the user to achieve quick and precise edits with immediate visual feedback to refine the design.

得益于解耦的图表规范和数据转换过程,当用户想要更新不需要额外数据转换的可视化样式(例如更改配色方案、更改轴的排序顺序或交换编码)时,他们可以直接在概念编码架中进行编辑,通过扩展通道属性并更新参数或交换编码字段。这些更新会直接反映在 Vega-Lite 脚本中,并在主面板中呈现。与 AI 交互相比,这种方式允许用户实现快速且精确的编辑,并立即获得视觉反馈以优化设计。

3.4 Implementation

3.4 实现

Data Formulator 2 is implemented as a React web application, with a backend Python server running on a Dv2-series CPU with 3.5 GiB RAM. Data Formulator 2 has been tested with different versions with OpenAI models, including GPT-3.5-turbo, GPT-4, GPT-4o and GPT-4o-mini (we used GPT-3.5-turbo in our user study) all of which except GPT-4 can generally response within 10 seconds. Since the LLM generates code to manipulate data as opposed to directly consume data, data size does not affect its response time. Data Formulator 2 can sometimes be slow due to Vega-Lite rendering overhead (e.g., large dataset with $>20{,}000$ rows, long data threads with $>20$ charts), we envision that on-demand re-rendering of charts can improve its performance in deployment.

Data Formulator 2 实现为一个 React 网页应用,后端 Python 服务器运行在具有 3.5 GiB 内存的 Dv2 系列 CPU 上。Data Formulator 2 已与不同版本的 OpenAI 模型进行了测试,包括 GPT-3.5-turbo、GPT-4、GPT-4o 和 GPT-4o-mini(我们在用户研究中使用了 GPT-3.5-turbo),除 GPT-4 外,其他模型通常能在 10 秒内响应。由于大语言模型生成代码来操作数据,而不是直接消费数据,因此数据大小不会影响其响应时间。Data Formulator 2 有时可能会因为 Vega-Lite 渲染开销(例如,行数超过 20,000 的大型数据集,包含超过 20 个图表的较长数据线程)而变慢,我们设想在部署中按需重新渲染图表可以提高其性能。

4 Evaluation: Iterative Exploratory Analysis

4 评估:迭代探索性分析

We conducted a user study to understand potential benefits and usability issues of Data Formulator 2, as well as strategies developed by users when iterative ly creating visualization s in an exploratory data analysis session.

我们进行了一项用户研究,以了解Data Formulator 2的潜在优势和可用性问题,以及用户在探索性数据分析会话中迭代创建可视化时开发的策略。

4.1 Study Design

4.1 研究设计

Participants. After piloting and refining the design of user study and Data Formulator 2 with three volunteers, we recruited eight participants from a large company. Participants self rated their skills (Figure 9) on a scale (“Novice,” “Intermediate,” “Proficient,” and “Expert”) in the following aspects: (1) chart creation – experience with chart authoring tools or libraries, (2) data transformation – experience with data transformation tools and library expertise, (3) programming, and (4) AI assistants – experience with large language models (e.g., ChatGPT [1]) and prompting.

参与者。在与三名志愿者进行用户研究和Data Formulator 2设计的试点和优化后,我们从一家大公司招募了八名参与者。参与者自评了他们在以下方面的技能(图 9),评分等级为(“新手”、“中级”、“熟练”和“专家”):(1) 图表创建——使用图表制作工具或库的经验,(2) 数据转换——使用数据转换工具和库的专业知识,(3) 编程,以及 (4) AI助手——使用大语言模型(例如 ChatGPT [1])和提示的经验。

Setup and procedure. Each study session, conducted remotely with screen sharing, consisted of four sections within a 2-hour slot. After a brief introduction of the study goal, participants were asked to follow step-by-step instructions in a tutorial presented in slides ${\sim}25\$ minutes). To make sure that they understood the tool and process, practice tasks ( $_{\sim15}$ minutes) were presented during and after the tutorial, where participants could ask questions as they worked through the tasks. Then, participants were given two study tasks to complete, where only clarification questions were allowed – we recorded hints participants requested about the tool when they got stuck. The two study tasks involved 16 visualization s to be created in total, with 12 of them requiring data transformation. Participants were encouraged to think aloud as they performed the tasks. We concluded the session with a debriefing with participants to (1) compare their experiences with Data Formulator 2 with the tools they have been using for data analysis, (2) understand strategies behind the way they used Data Formulator 2, and (3) gather impressions and suggestions for improvements to the tool. Participants were encouraged to take breaks between phases.

设置与流程。每项研究会话通过屏幕共享远程进行,包含四个部分,总时长2小时。在简要介绍研究目标后,参与者被要求按照幻灯片中的教程逐步操作(约25分钟)。为了确保他们理解工具和流程,教程期间和之后会提供练习任务(约15分钟),参与者在完成任务时可以提问。随后,参与者需完成两项研究任务,期间仅允许提出澄清问题——我们记录了他们在遇到困难时请求的工具提示。这两项研究任务共涉及创建16个可视化,其中12个需要进行数据转换。鼓励参与者在执行任务时进行出声思考。最后,我们与参与者进行总结,以(1)比较他们使用Data Formulator 2与他们以往用于数据分析工具的经验,(2)了解他们使用Data Formulator 2背后的策略,以及(3)收集对工具改进的印象和建议。鼓励参与者在各阶段之间休息。

Data Formulator 2: Iterative ly Creating Rich Visualization s with AI •

Data Formulator 2: 使用 AI 迭代创建丰富的可视化

ID Role Chart Data Programming Alassistants Dataset1 Dataset2 Hints
P1 Developer Proficient Expert Expert Intermediate 1047s 1666s 1
P2 DataScientist Proficient Expert Expert Expert 1636s 1886s 0
P3 DataArchitect Proficient Expert Expert Expert 715s 2207s 0
P4 Developer Novice Intermediate Proficient Intermediate 1036s 1521s 1
P5 Developer Intermediate Intermediate Proficient Novice 1251s 2937s 1
P6 DataScientist Intermediate Expert Intermediate Proficient 856s 1148s 3
P7 DataScientist Proficient Expert Proficient Proficient 1638s 2372s 1
P8 Developer Proficient Proficient Expert Novice 1043s 1987s 2

Tutorial and practice tasks. We use the global energy dataset (described in Section 2) for tutorial and practice tasks. In the tutorial, participants follow detailed instructions to recreate the six visualization s, all but one (chart $\pmb{\mathbb{Q}}$ ) in Figure 2. Besides, participants also learned to inspect results and work with AI’s mistakes. In the practice tasks, participants were asked to do similar analysis but focusing on the electricity generated from nuclear, with an additional task of creating a bar chart to compare the difference of nuclear between 2000 and 2020, which requires both table pivoting and calculation.

教程与实践任务。我们使用全球能源数据集(在第2节中描述)进行教程和实践任务。在教程中,参与者按照详细说明重新创建六个可视化,除了图2中的一个图表 ($\pmb{\mathbb{Q}}$) 之外。此外,参与者还学习了如何检查结果并处理AI的错误。在实践任务中,参与者被要求进行类似的分析,但重点关注核能发电,并附加了一个任务,即创建一个条形图来比较2000年和2020年核能的差异,这需要表格旋转和计算。

Study tasks. To focus on peoples’ iterative processes rather than their ability either to create a single chart or to gain insights in exploring data, we decided to employ an exploration session reproduction approach, which asks participants to reproduce two data exploration sessions conducted by an experienced data scientist. We wanted to see if participants could iterative ly create charts with Data Formulator 2, without requiring them to come up with exploration objectives on the fly (otherwise we would have limit our participants to only highly skilled data scientists). We took two exploration sessions from David Robinson’s live stream analysis of Tidy Tuesday datasets.

研究任务。为了关注人们的迭代过程,而不是他们创建单一图表或在探索数据时获得洞察的能力,我们决定采用探索会话复现方法,要求参与者复现由经验丰富的数据科学家进行的两次数据探索会话。我们希望看到参与者能否使用 Data Formulator 2 迭代地创建图表,而不需要他们临时提出探索目标(否则我们将只能限制参与者为高技能的数据科学家)。我们从 David Robinson 的 Tidy Tuesday 数据集直播分析中选取了两次探索会话。

Figure $10{-}\Phi$ shows the first data exploration session: given a dataset on college majors and income data (173 rows $\times\ 7$ columns), participants were asked to create seven visualization s (2 basic charts $+\ 5$ charts requiring data transformation) that progressively explore top earning majors and the relationship between women ratio and major salary. The exploration process required participants to derive new fields (e.g., women ratio), filter data (filter by top 20 earning majors), derive new data (aggregate to obtain major categories with top earnings) and perform conditional formatting (color by top 4 categories and “others”). In our task presentation, we provided the description of the task and reference chart (similar to the chart reproduction study [39, 41]) for all but the last two visualization s. We hid the reference charts for the final two visualization s and asked participants to verify the correctness, so that we could use them to probe participants’ verification strategies. We did not provide the iteration direction (i.e., which charts should they base on to create a new one) in the task description, which let participants develop a variety of iteration techniques with only the high-level task guidance.

图 $10{-}\Phi$ 展示了第一次数据探索会话:给定一个关于大学专业和收入的数据集(173行 $\times\ 7$ 列),参与者被要求创建七个可视化图表(2个基本图表 $+\ 5$ 个需要数据转换的图表),逐步探索收入最高的专业以及女性比例与专业薪资之间的关系。探索过程要求参与者推导新字段(例如,女性比例)、过滤数据(按收入最高的20个专业进行过滤)、推导新数据(聚合以获得收入最高的专业类别)并执行条件格式设置(按前4个类别和“其他”进行颜色标注)。在我们的任务展示中,我们为除最后两个可视化图表外的所有图表提供了任务描述和参考图表(类似于图表复制研究 [39, 41])。我们隐藏了最后两个可视化图表的参考图表,并要求参与者验证其正确性,以便我们可以使用它们来探究参与者的验证策略。我们在任务描述中没有提供迭代方向(即他们应该基于哪些图表来创建新图表),这使得参与者仅凭高级任务指导就能开发出多种迭代技术。

Figure $^{10-\oplus}$ shows the second data exploration session: given a dataset of movies with their budget and gross (3281 rows $\times\ 8$ columns), participants were asked to explore movies and genres with highest return-on-investment values, comparing profit and profit ratio as metrics with 9 visualization s created along the way. Besides two basic box plots to show budget and worldwide gross distribution, the other seven charts require data transformation, including calculation and aggregation (average profit / profit ratio for each genre), string processing (extract year

图 $^{10-\oplus}$ 展示了第二次数据探索会话:给定一个包含电影预算和票房的数据集(3281 行 $\times\ 8$ 列),参与者被要求探索具有最高投资回报率的电影和类型,比较利润和利润率作为指标,并在此过程中创建了 9 个可视化图表。除了两个基本的箱线图用于显示预算和全球票房分布外,其他七张图表需要进行数据转换,包括计算和聚合(每种类型的平均利润/利润率)、字符串处理(提取年份)。

Fig. 10. Study tasks. Dataset 1: Understanding top earning majors and the relation between salary and women percentage. Dataset 2: Exploring movies genres with best return-on-investment values (profit vs. profit ratio) and top movies. The branching directions here are only for illustration and are not provided to participants; participants developed different iteration strategies themselves.

图 10: 研究任务。数据集 1: 理解高收入专业以及薪资与女性比例之间的关系。数据集 2: 探索具有最佳投资回报率(利润 vs. 利润率)的电影类型以及顶级电影。这里的分支方向仅用于说明,并未提供给参与者;参与者自己开发了不同的迭代策略。

for trends), filtering $\mathrm{year}>2000\mathrm{)}$ ), and partitioning and ranking (top 20 movies for each metric). We again hide references of the final two charts to probe participants’ verification process.

为了趋势分析 (for trends)、筛选 $\mathrm{year}>2000\mathrm{)}$ ),以及分区和排名(每个指标的前20部电影)。我们再次隐藏最后两张图表的引用,以探究参与者的验证过程。

4.2 Results

4.2 结果

Task completion. All participants successfully completed all 16 visualization s (Figure 9): participants took less than 20 mins on average to finish the seven charts in task 1, and about 33 mins for the nine charts in task 2. Since we let participants deviate from the main exploration task (e.g., in task 2, P4 asked to sort the bar chart for top profitable movies are based on their profits, even though it was not required), the recorded completion time is an overestimate of the actual task time. During the study, six participants asked for hints to get unstuck during tasks; we categorize them as follows:

任务完成。所有参与者都成功完成了所有 16 个可视化任务(图 9):参与者在任务 1 中完成七张图表平均耗时不到 20 分钟,在任务 2 中完成九张图表平均耗时约 33 分钟。由于我们允许参与者偏离主要探索任务(例如,在任务 2 中,P4 要求根据利润对最赚钱电影的条形图进行排序,尽管这不是必须的),因此记录的完成时间是对实际任务时间的高估。在研究过程中,有六名参与者在任务中请求提示以解决问题;我们将其分类如下:

• Task clarification: P1 didn’t realize top movies are restricted to movies after 2000; P4 and P6 required hints about the difference between profit and profit ratio in task 2; P6 also asked about whether $x$ -axis should be Year or Date when plotting movie profit trends. • Data clarification: P6 an P8 were hinted to notice the difference between fields Major and Major Category in task 1. • System performance: P5 encountered a performance issue, as they created large sized charts: in task 2, they created multiple bar charts with Movie mapped to the $x_{\mathrm{{}}}$ -axis, resulting bar charts containing 1300 categorical values causing rendering issues. They were suggested to reset the exploration session. • Chart encoding: P7 and P8 required hints on “why the chart didn’t render color lengends” when they didn’t put a field in the color encoding; they expected to specify it only in NL input but not in the concept encoding shelf. 1

• 任务澄清:P1没有意识到顶级电影仅限于2000年以后的电影;P4和P6在任务2中需要关于利润和利润率区别的提示;P6还询问了在绘制电影利润趋势时,$x$轴应该是年份还是日期。• 数据澄清:P6和P8在任务1中被提示注意Major和Major Category字段之间的区别。• 系统性能:P5遇到了性能问题,因为他们创建了大尺寸的图表:在任务2中,他们创建了多个条形图,将Movie映射到$x_{\mathrm{{}}}$轴,导致条形图包含1300个分类值,从而引发渲染问题。建议他们重置探索会话。• 图表编码:P7和P8在未将字段放入颜色编码时需要关于“为什么图表没有渲染颜色图例”的提示;他们期望仅在自然语言输入中指定,而不在概念编码栏中指定。

During the debriefing, participants commented that these tasks were much more difficult to complete with tools they are familiar with. P1 mentioned that they were “obviously much faster” with Data Formulator 2 as it helped with data transformation despite being an programming expert. When asked about their experience comparing against chat-based AI assistants, participants noted (1) the iteration support makes it easy to create more charts and (2) the $\mathrm{UI}+\mathrm{NL}$ approach in Data Formulator 2 is more effective for communicating and constraining intent. For example, P2 mentioned “with ChatGPT, I would have to put a bit more effort to specify the instructions to get what I want, iterations here is much faster with UI”, and P4 mentioned that “with ChatGPT, you need to much more contexts, I need to describe in detail about what x,y-axes should be, but here I can just provide with UI.”

在复盘过程中,参与者评论称,使用他们熟悉的工具完成这些任务要困难得多。P1 提到,使用 Data Formulator 2“明显快得多”,因为它帮助他们进行数据转换,尽管他们自己是编程专家。当被问及与基于聊天的 AI 助手相比的体验时,参与者指出:(1) 迭代支持使得创建更多图表变得容易,(2) Data Formulator 2 中的 $\mathrm{UI}+\mathrm{NL}$ 方法在传达和约束意图方面更为有效。例如,P2 提到“使用 ChatGPT,我需要花更多精力来指定指令才能得到我想要的结果,而在这里使用 UI 进行迭代要快得多”,P4 提到“使用 ChatGPT,你需要更多的上下文,我需要详细描述 x、y 轴应该是什么,但在这里我只需通过 UI 提供信息。”

Iteration styles. Data Formulator 2 lets users develop their own iteration strategies. We observed three major distinct styles of iteration, in terms of which tables or charts participants chose to derive a new chart.

迭代风格。Data Formulator 2 允许用户开发自己的迭代策略。我们观察到用户在从表格或图表中选择生成新图表时有三种主要的迭代风格。

The first type of users preferred to achieve a particular chart through small, incremental changes from an existing chart that shared either similar data fields or similar chart configuration. For example, P2 and P3 chose to create the line chart showing profit ratio trends overtime on top of the bar chart showing the average profit ratio per genre, and next visualized movies with highest profit ratio further on top, since they share the same derived field profit ratio. P2 mentioned “I definitely like to be able to just work on top of that and like going forward by just giving a new prompt, because it remembers the context prior to the last one, it ends up generating the right data and visualization.” P2 further commented that they did not like too much branching: “...felt that it would be harder to go back to the source and fix every single time.” P7 also preferred incremental changes, but with a focus on visual similarity as opposed to data similarity.

第一类用户倾向于通过与现有图表进行小而渐进的更改来实现特定图表,这些现有图表要么共享相似的数据字段,要么共享相似的图表配置。例如,P2 和 P3 选择在显示每个类型平均利润率的柱状图之上创建显示利润率随时间变化的折线图,然后在上面进一步可视化利润率最高的电影,因为它们共享相同的派生字段利润率。P2 提到:“我肯定希望能够在此基础上继续工作,只需给出一个新的提示,因为它会记住上一个提示的上下文,最终生成正确的数据和可视化。”P2 进一步评论说,他们不喜欢太多的分支:“……感觉每次回到源头修复会更困难。”P7 也倾向于渐进式更改,但更注重视觉相似性而非数据相似性。

In contrast, the second type of users preferred to go back and re-issue a prompt to achieve all the changes from the initial data as succinctly as possible. For example, P1 mentioned that “[I] like keeping it as terse as possible that will get me the right result.” P4 also felt that sometimes it was more productive to just start over from the original dataset throwing out all iterations, especially when they failed to produce a desired outcome: “when we had all of those failures, I went back to the original base dataset and then frame my question there.”

相比之下,第二类用户倾向于返回并重新发出提示,以尽可能简洁的方式实现初始数据的所有更改。例如,P1提到“我喜欢尽可能简洁地得到正确的结果。” P4也认为,有时从头开始处理原始数据集,抛弃所有迭代,尤其是当它们未能产生预期结果时,反而更有效率:“当我们遇到所有那些失败时,我回到了原始的基础数据集,然后在那里重新构建我的问题。”

The third type of users primarily think about the iterations in terms for adding (or retrieving) columns from the dataset. P5 preferred to first instruct Data Formulator 2 to add/remove columns from an existing data (e.g., bring back fields that might have been dropped in previous iterations as needed, or add a new field required for the desired chart), and then create visualization from the right data.

第三类用户主要考虑在数据集中添加(或检索)列的迭代。P5 倾向于首先指示 Data Formulator 2 从现有数据中添加/删除列(例如,根据需要带回之前迭代中可能被删除的字段,或添加所需图表所需的新字段),然后从正确的数据创建可视化。

Organization of iteration history. When asked about their rationale behind branching strategies, all participants agreed data threads are essential for managing iteration histories. Regarding their preferred organization style, P1 mentioned “I don’t like to pollute my workspace” and “I’d like to keep my workspace as clean as possible” and thus they always chose to backtrack and fix previous instruction when encountering undesired results. P2, who mentioned “going back created too much branching” instead preferred follow through. P4 used prompts to help navigate iterations to find the one they were looking for: “I was using the prompts as my anchor to figure out where

迭代历史的组织。当被问及分支策略背后的理由时,所有参与者都认为数据线程对于管理迭代历史至关重要。关于他们偏好的组织风格,P1提到“我不喜欢污染我的工作空间”和“我希望尽可能保持工作空间的整洁”,因此他们总是选择回溯并修复之前遇到不理想结果时的指令。P2则提到“回退会导致分支过多”,因此更倾向于继续推进。P4使用提示来帮助导航迭代,以找到他们想要的版本:“我使用提示作为我的锚点,来确定在哪里”。

I wanted to go.” P8 found it sometimes difficult to iterate in Data Formulator 2 because data threads were “linear instead of hierarchical”: they preferred a tree-view data thread organization, where they could scan quickly through the entire branching tree for a dataset, its transformations and visualization s and then collapse branches that were not of interest for the current goals.

P8 发现有时在 Data Formulator 2 中迭代很困难,因为数据线程是“线性的而非分层的”:他们更喜欢树状视图的数据线程组织,这样他们可以快速扫描整个分支树来查看数据集、其转换和可视化,然后折叠与当前目标无关的分支。

Verification. To proceed through iterative exploration, or repeat/correct a step, participants needed to verify that the chart or transformation was performed correctly. Some used the explanations of the code, some (even non-python programmers) used the actual code, and some used the result tables to validate the impact of the transformations. P3 mentioned “as an expert, I like to see the prompt to the model, and then the code generated; but as a business user, I would imagine using more data, chart, and explanations.” P4 mentioned “[explanation] steps were really, really helpful in terms of figuring out whether it is doing the right thing as to what I’m asking it to do. That and also the data chart underneath.” Interestingly, P7 stated that they preferred to use code rather than explanations of the code, but in the study, they used almost exclusively the explanations. They stated that they felt some pressure from the study environment not to spend too much time understanding code for which they were not familiar with, but they would trust code more. We also observed participants who developed trust in a workflow (by examining code and data tables) when it was straightforward, and then, they assumed the more complicated transformations built on top of these steps worked.

验证。为了通过迭代探索或重复/纠正某个步骤,参与者需要验证图表或转换是否正确执行。有些人使用代码的解释,有些人(甚至非 Python 程序员)使用实际代码,还有一些人使用结果表来验证转换的影响。P3 提到:“作为专家,我喜欢看到模型的提示,然后生成代码;但作为业务用户,我会想象使用更多的数据、图表和解释。”P4 提到:“[解释]步骤在弄清楚它是否按照我的要求做正确的事情方面非常有帮助。还有下面的数据图表。”有趣的是,P7 表示他们更喜欢使用代码而不是代码的解释,但在研究中,他们几乎只使用了解释。他们表示,在研究环境中,他们感到一些压力,不想花太多时间理解他们不熟悉的代码,但他们更信任代码。我们还观察到,当工作流程简单时,参与者会通过检查代码和数据表来建立信任,然后他们假设基于这些步骤的更复杂的转换是有效的。

Miscellaneous. Several users noted potential improvements of Data Formulator 2 for iterative chart authoring. P1 commented on how small interface variations might give different afford ances. For instance, “if there was a large view for data threads, it would encourage me to do more transformations and do more branching.” P3 mentioned that they prefer AI to ask the user to disambiguate when the intent is unclear rather than trying to solve the task with unclear specification. P7 used instructions that were very detailed and sometimes incorrect, which in turn, made iteration more difficult, since it was difficult to increment ally modify these instructions. We discussed the potential of having templates or AI feedback for instruction crafting to reduce errors.

杂项。几位用户指出了 Data Formulator 2 在迭代图表创作方面的潜在改进。P1 评论了界面细微变化可能带来的不同效果。例如,“如果有一个数据线程的大视图,它会鼓励我进行更多的转换和分支。” P3 提到,他们更希望 AI 在意图不明确时要求用户澄清,而不是尝试用不明确的规范解决问题。P7 使用了非常详细且有时不正确的指令,这使得迭代变得更加困难,因为很难逐步修改这些指令。我们讨论了使用模板或 AI 反馈来减少指令编写错误的潜力。

5 Related Work

5 相关工作

Data Formulator 2 builds on top of existing work on data transformation, chart authoring, and AI-powered visualization tools.

Data Formulator 2 建立在现有数据转换、图表创作和 AI 驱动的可视化工具工作之上。

LLM-powered visualization tools. Large language models’ code generation ability [1, 5, 24, 50] motivates the designs of new AI-powered visualization s tools [10, 26, 49, 55] that allows users to create visualization using high-level natural language descriptions. For example, given a dataset and a visualization prompt, LIDA [9] automatically generates a data summary and prompts the LLM to generate python code to transform data and generate visualization s. Because LLMs can struggle in understanding complex chart logics, ChartGPT [49] decomposes visualization tasks into fine-grained reasoning pipelines (e.g., data column selection, filtering logic, chart type, visual encoding), using chain-of-thoughts prompting [56] to guide LLMs to generate code step by step. Data Formulator [55] leverages LLMs to derive new data columns that can be used in traditional shelf-configuration UI. Because these tools focuses on single-turn user interaction with abstract NL descriptions, they are not suitable for iterative analysis where the analyst may branch or revise designs throughout. For multi-turn interactions, users can directly have conversation with LLMs in Code Interpreter [1] or Chat2Vis [26]: Code Interpreter equips the LLM with a Python interpreter so that the model can generate and execute code to transform data and create visualization s with the user interactively; Chat2Vis further includes visualization-specific prompts to help the model generate visualization s more reliably. Since these tools organize the dialog linearly, when the context contains branches, the user needs extract efforts to explain the task so that the model can retrieve the correct context, otherwise the model are more likely to produce undesired results [14, 21, 65]. Besides, since these tools are based on NL inputs, when the user has concrete designs in mind, they need additional efforts to elaborate the design clearly (especially when the design is complex) so that the model can produce their desired results.

大语言模型驱动的可视化工具。大语言模型的代码生成能力 [1, 5, 24, 50] 推动了新型 AI 驱动的可视化工具 [10, 26, 49, 55] 的设计,这些工具允许用户使用高级自然语言描述来创建可视化。例如,给定一个数据集和一个可视化提示,LIDA [9] 会自动生成数据摘要,并提示大语言模型生成 Python 代码来转换数据并生成可视化。由于大语言模型在理解复杂图表逻辑时可能会遇到困难,ChartGPT [49] 将可视化任务分解为细粒度的推理管道(例如,数据列选择、过滤逻辑、图表类型、视觉编码),使用思维链提示 [56] 引导大语言模型逐步生成代码。Data Formulator [55] 利用大语言模型推导出可在传统货架配置 UI 中使用的新数据列。由于这些工具专注于单轮用户与抽象自然语言描述的交互,它们不适合分析师在整个过程中可能分支或修改设计的迭代分析。对于多轮交互,用户可以直接在 Code Interpreter [1] 或 Chat2Vis [26] 中与大语言模型进行对话:Code Interpreter 为大语言模型配备了 Python 解释器,使模型能够生成并执行代码,与用户交互式地转换数据和创建可视化;Chat2Vis 进一步包含了特定于可视化的提示,以帮助模型更可靠地生成可视化。由于这些工具以线性方式组织对话,当上下文包含分支时,用户需要额外努力解释任务,以便模型能够检索到正确的上下文,否则模型更有可能产生不理想的结果 [14, 21, 65]。此外,由于这些工具基于自然语言输入,当用户心中有具体的设计时,他们需要额外努力清晰地阐述设计(尤其是当设计复杂时),以便模型能够生成他们期望的结果。

Data Formulator 2 is a also LLM-powered tool that shares similar prompt designs like LIDA and Chat2Vis (e.g., using data summary to explain the authoring context) and supports NL interaction. Instead of using only NL inputs, Data Formulator 2 blends UI and NL inputs for chart specification so that users can communicate their intent both precisely and flexibly. This design is different from Data Formulator, whose UI and NL inputs work independently: Data Formulator’s NL interface restricts data transformation to column-wise computation, and the user needs additional efforts to complete other transformations like reshaping and aggregation separately using UI with a different paradigm (programming by example); with unified UI and NL interaction in Data Formulator 2, the user can achieve more expressive data transformation with less input efforts. Data Formulator 2’s data threads generalize linear contexts used in existing dialog systems, it allows users to navigate branching contexts and reuse previous results to better support iterative visualization authoring.

Data Formulator 2 也是一款由大语言模型驱动的工具,它与 LIDA 和 Chat2Vis 共享类似的提示设计(例如,使用数据摘要来解释创作背景),并支持自然语言交互。与仅使用自然语言输入不同,Data Formulator 2 结合了用户界面和自然语言输入来进行图表规范,使用户能够既精确又灵活地传达他们的意图。这种设计与 Data Formulator 不同,后者的用户界面和自然语言输入是独立工作的:Data Formulator 的自然语言接口将数据转换限制为列式计算,用户需要额外的努力来分别完成其他转换,如重塑和聚合,使用不同范式(示例编程)的用户界面;而在 Data Formulator 2 中,统一了用户界面和自然语言交互,用户可以用更少的输入努力实现更具表达力的数据转换。Data Formulator 2 的数据线程泛化了现有对话系统中使用的线性上下文,它允许用户导航分支上下文并重用之前的结果,以更好地支持迭代的可视化创作。

Other AI and synthesis-powered tools. Besides LLM-powered tools above, neural semantic parsing [6, 29, 31], and program synthesis-based tools [54] have also been developed to address the visualization challenge. For example, NL4DV [31] and NcNet [25] are natural language interfaces (NLIs) based on recurrent neural networks trained from parallel NL and chart specification corpus that can generate charts from NL queries. NL2Vis [62] and Graphy [6] use semantic parser to extract entities from the user’s NL query and apply program synthesis techniques to compose chart specifications. Unlike LLM-based tools that can generate general purpose python programs to support expressive data transformation and visualization from abstract instructions, semantic parsing based NLIs are less expressive, requiring more concrete descriptions from the user and supporting only limited data transformation. In particular, these tools require tidy data inputs [58], and they do not support transformations like string processing, column derivation and reshaping. While programming-by-examples (PBE) techniques are developed to tackle data reshaping challenges in chart authoring (e.g., Falx [54] and Data Formulator [55]’s reshaping module), these tools require users to prepare low-level examples to demonstrate the transformation intent, which can be difficult for new users as it deviates from the high-level visualization workflow. Unlike LLM-based tools where the user can directly have conversation with the model to disambiguate inputs, semantic parsing and PBE-based tools develop special techniques for resolving ambiguous user intent. For example, DataTone [11] introduces disambiguation widgets to allow users to select alternative extracted entities in the generated queries to resolve ambiguity, and it paraphrases the generated query in NL to explain the result. Falx [54] renders charts from multiple versions of data consistent with user examples for user inspection.

其他AI和合成驱动的工具。除了上述的大语言模型工具外,神经语义解析 [6, 29, 31] 和基于程序合成的工具 [54] 也被开发出来以应对可视化挑战。例如,NL4DV [31] 和 NcNet [25] 是基于循环神经网络的自然语言接口(NLIs),它们通过并行自然语言和图表规范语料库训练,可以从自然语言查询生成图表。NL2Vis [62] 和 Graphy [6] 使用语义解析器从用户的自然语言查询中提取实体,并应用程序合成技术来组合图表规范。与基于大语言模型的工具不同,这些工具可以生成通用的 Python 程序,以支持从抽象指令中进行表达性数据转换和可视化,而基于语义解析的 NLIs 表达能力较弱,需要用户提供更具体的描述,并且仅支持有限的数据转换。特别是,这些工具需要整洁的数据输入 [58],并且不支持字符串处理、列派生和重塑等转换。虽然通过示例编程(PBE)技术被开发出来以应对图表创作中的数据重塑挑战(例如,Falx [54] 和 Data Formulator [55] 的重塑模块),但这些工具要求用户准备低级别的示例来演示转换意图,这对于新用户来说可能会很困难,因为它偏离了高级可视化工作流。与基于大语言模型的工具不同,用户可以直接与模型对话以消除输入歧义,而基于语义解析和 PBE 的工具则开发了特殊技术来解析用户的模糊意图。例如,DataTone [11] 引入了消歧小部件,允许用户在生成的查询中选择替代提取的实体以消除歧义,并用自然语言解释生成的查询以说明结果。Falx [54] 从与用户示例一致的多版本数据中渲染图表,供用户检查。

Benefit from LLMs, Data Formulator 2 supports a much wider range of data transformation and does not limit inputs to tidy data. Inspired by how prior work displays candidate results and explains code to help users understand system outputs [11, 12, 55], Data Formulator 2 displays generated code, data, chart and code explanation to assist user inspection. To resolve ambiguous outputs, the user can use data threads to follow up or backtrack and revise the their instructions.

得益于大语言模型,Data Formulator 2 支持更广泛的数据转换,且不限制输入为整洁数据。受之前工作展示候选结果并解释代码以帮助用户理解系统输出的启发 [11, 12, 55],Data Formulator 2 展示了生成的代码、数据、图表和代码解释,以协助用户检查。为解决模糊输出,用户可以使用数据线程进行跟进或回溯并修改其指令。

Visualization grammars and interactive tools. The grammar of graphics [60] inspired many modern visualization grammars (e.g., ggplot2 [57], Vega-Lite [45], Altair [53]), where visualization s are built from mapping data columns to visual channels and lower-level chart properties. Comparing to more flexible and expressive languages like D3 [3] and Atlas [22], high-level grammars hide the computation process of linking data items to visual objects to reduce visualization efforts. Powered by these high-level grammars, interactive tools like Lyra [44], Data Illustrator [23], Chart i cul at or [40], Tableau [47]) are introduced. With the shelf-configuration interface, users of these tools specify chart designs by mapping data columns to visual encoding “shelves” using the intuitive drag-and-drop interaction. Despite these tools make chart specification easy, they require tidy input data [58]: every variable to be visualized should be organized in a column in the input data, and every mark should come from a row. Thus, users need skills and additional efforts to prepare data in the right format with data transformation tools [7, 15–18, 35–37, 59].

可视化语法与交互工具。图形语法 [60] 启发了许多现代可视化语法(例如 ggplot2 [57]、Vega-Lite [45]、Altair [53]),在这些语法中,可视化是通过将数据列映射到视觉通道和底层图表属性来构建的。与 D3 [3] 和 Atlas [22] 等更灵活、表达能力更强的语言相比,高级语法隐藏了将数据项与视觉对象链接的计算过程,从而减少了可视化的工作量。在这些高级语法的支持下,引入了诸如 Lyra [44]、Data Illustrator [23]、Chart i cul at or [40]、Tableau [47] 等交互工具。通过这些工具的货架配置界面,用户可以通过直观的拖放交互将数据列映射到视觉编码“货架”来指定图表设计。尽管这些工具使图表规范变得简单,但它们需要整洁的输入数据 [58]:要可视化的每个变量应组织在输入数据的一列中,每个标记应来自一行。因此,用户需要具备技能和额外的努力,使用数据转换工具 [7, 15–18, 35–37, 59] 准备正确格式的数据。

Data Formulator 2 internally represents charts using Vega-Lite, and it benefits from Vega-Lite’s expressiveness to support rich visualization designs. Data Formulator 2 inherits the shelf-configuration design from existing interactive tools and blends it with NL inputs for chart specification. This way, users can specify chart designs easily with UI in Data Formulator 2, yet they do not need to worry about data transformation, as Data Formulator 2 delegates data transformation to AI.

Data Formulator 2 内部使用 Vega-Lite 表示图表,并受益于 Vega-Lite 的表现力,以支持丰富的可视化设计。Data Formulator 2 继承了现有交互工具的 shelf-configuration 设计,并将其与自然语言输入结合用于图表规范。这样,用户可以在 Data Formulator 2 中轻松使用 UI 指定图表设计,同时无需担心数据转换,因为 Data Formulator 2 将数据转换委托给 AI。

Exploration history. Graphical history [13] and data provenance [4] are essential in visualization authoring, especially in exploration tasks where branching and iterations are common. In computation notebooks, the exploration history is organized based on code blocks [28, 32]. Data transformation tools like somnus [63] and Tableau Prep visualize data provenance based on transformation operators. Directed-graph models [19, 46] based on visual similarity are also used for visualization organization. Data Formulator 2’s data threads draws inspirations from these systems. The key difference is that Data Formulator 2 organizes history around highlevel user interactions with AI and hides operator-level details to enhance navigation and reuse. In future, Data Formulator 2 could render data threads as hierarchical trees [19] to support navigation of large data threads in multiple granularity.

探索历史。在可视化创作中,图形历史[13]和数据溯源[4]至关重要,尤其是在分支和迭代常见的探索任务中。在计算笔记本中,探索历史是基于代码块组织的[28, 32]。像somnus[63]和Tableau Prep这样的数据转换工具基于转换操作符可视化数据溯源。基于视觉相似性的有向图模型[19, 46]也用于可视化组织。Data Formulator 2的数据线程从这些系统中汲取了灵感。关键区别在于,Data Formulator 2围绕用户与AI的高级交互组织历史,并隐藏操作级别的细节以增强导航和重用。未来,Data Formulator 2可以将数据线程呈现为层次树[19],以支持多粒度的大数据线程导航。

Multi-modal interaction. Despite natural language provides flexible and expressive interaction between human and AI, NL-only interaction is not always optimal for the users to clearly convey their intent, especially for conveying designs they pictured in their mind. To address this limitation, multi-modal models like ChatGPT [1] and Gemini [38] are introduced, allowing users to provide audios and images in their conversation with AI. New interactive tools are also developed to support multi-modal interaction. For example, DirectGPT [27] allows users to direct point and click on a canvas to specify contexts or objects that NL instruction is based on to reduce prompting efforts, DynaVis [52] generates UI widgets dynamically based on user’s NL inputs for chart editing so that they can explore and repeat edits and see instant visual feedback from edits. Data Formulator 2’s concept encoding shelf bridges the precision and affordance of GUI interaction with flexibility of NL inputs, and it contributes a sample of multi-modal UI design for visualization authoring.

多模态交互。尽管自然语言为人与AI之间提供了灵活且富有表现力的交互方式,但仅依赖自然语言的交互并不总是用户清晰传达意图的最佳方式,尤其是在表达他们心中设想的设计时。为了解决这一局限,像ChatGPT [1]和Gemini [38]这样的多模态模型被引入,允许用户在与AI的对话中提供音频和图像。新的交互工具也被开发出来以支持多模态交互。例如,DirectGPT [27]允许用户直接在画布上点击以指定上下文或对象,基于这些上下文或对象生成自然语言指令,从而减少提示的难度;DynaVis [52]根据用户的自然语言输入动态生成UI控件,用于图表编辑,使用户能够探索和重复编辑,并立即看到编辑的视觉反馈。Data Formulator 2的概念编码架桥接了GUI交互的精确性和可操作性,以及自然语言输入的灵活性,它为可视化创作提供了一个多模态UI设计的样本。

Others. Data Formulator 2 focuses on visualization authoring, where AI completes tasks planned by the user. There are potential to combine Data Formulator 2 with exploration and recommendation systems like Voyager [61], Draco [30], and Lux [20] for suggesting visualization goals to assist users “cold-start” their analysis. Data Formulator 2 currently focuses on grammar-of-graphics-based charts (provided by Vega-Lite), and provides limited supports of custom chart designs (e.g., new layouts, interaction, animation, or annotation). Data Formulator 2’s data transformation and history management could also assist animation designs [68] and interactive visualization authoring [67]. Data Formulator 2 in future can incorporate canvases for editing layouts [43, 51] and marks [40] to expand its design space.

其他。Data Formulator 2 专注于可视化创作,其中 AI 完成用户计划的任务。有潜力将 Data Formulator 2 与探索和推荐系统(如 Voyager [61]、Draco [30] 和 Lux [20])结合,以建议可视化目标,帮助用户“冷启动”分析。Data Formulator 2 目前专注于基于图形语法的图表(由 Vega-Lite 提供),并提供有限的自定义图表设计支持(例如新布局、交互、动画或注释)。Data Formulator 2 的数据转换和历史管理也可以辅助动画设计 [68] 和交互式可视化创作 [67]。未来的 Data Formulator 2 可以整合用于编辑布局 [43, 51] 和标记 [40] 的画布,以扩展其设计空间。

6 Conclusion

6 结论

Visualization authors often create visualization s in an iterative fashion, going back and forth between data transformation and visualization steps. To achieve such iterative analysis process, authors not only needs to to be proficient with data transformation and visualization tools, but also needs to spend considerable efforts managing the branching history consisting of many different versions of data and charts. Despite AI-powered tools have been developed to reduce users efforts, they do not work well for iterative analysis, because they often expect users to specify their intent all at once with only NL inputs. We presented Data Formulator 2, an interactive system for iterative creation of rich visualization s. Data Formulator 2 features a multi-modal UI that lets users to specify visualization with blended UI and NL inputs. Benefiting from both the precision of UI interaction and expressiveness of NL descriptions, users can more precisely convey complex designs without verbose prompts. To support management of the iteration history, Data Formulator 2 introduces data threads, where users can navigate, branch and reuse previous designs towards new ones as opposed to creating everything from scratch. In the user study, we invited eight participants to reproduce two challenging data exploration sessions consisting of 16 visualization s. We observed that Data Formulator 2 let participants develop their own iteration and verification strategies to solve the task with confidence with minimal hints.

可视化作者通常以迭代的方式创建可视化,在数据转换和可视化步骤之间来回切换。为了实现这种迭代分析过程,作者不仅需要精通数据转换和可视化工具,还需要花费大量精力管理由许多不同版本的数据和图表组成的分支历史。尽管已经开发了 AI 驱动的工具来减少用户的工作量,但它们并不适合迭代分析,因为它们通常期望用户一次性通过自然语言输入指定他们的意图。我们提出了 Data Formulator 2,一个用于迭代创建丰富可视化的交互式系统。Data Formulator 2 具有多模态用户界面,允许用户通过混合用户界面和自然语言输入来指定可视化。得益于用户界面交互的精确性和自然语言描述的表现力,用户可以更精确地传达复杂的设计,而不需要冗长的提示。为了支持迭代历史的管理,Data Formulator 2 引入了数据线程,用户可以导航、分支和重用以前的设计,而不是从头开始创建所有内容。在用户研究中,我们邀请了八位参与者重现两个具有挑战性的数据探索会话,包含 16 个可视化。我们观察到,Data Formulator 2 让参与者开发了自己的迭代和验证策略,以最小的提示自信地完成任务。

7 Acknowledgement

7 致谢

We would like to thank John Thompson, Jeevana Priya Inala, Dave Brown, Gonzalo Ramos, and Kori Inkpen’s contributions and feedback to this work.

我们要感谢John Thompson、Jeevana Priya Inala、Dave Brown、Gonzalo Ramos和Kori Inkpen对本工作的贡献和反馈。

References

参考文献

阅读全文(20积分)