Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models
Abstract
We present a Chain-of-Action (CoA) framework for multimodal and retrieval-augmented Question Answering (QA). Compared to the literature, CoA overcomes two major challenges of current QA applications: (i) unfaithful hallucination that is inconsistent with real-time or domain facts and (ii) weak reasoning performance over compositional information. Our key contribution is a novel reasoning-retrieval mechanism that decomposes a complex question into a reasoning chain via systematic prompting and pre-designed actions. Methodologically, we propose three types of domain-adaptable ‘Plug-and-Play’ actions for retrieving real-time information from heterogeneous sources. We also propose a multi-reference faith score (MRFS) to verify and resolve conflicts in the answers. Empirically, we exploit both public benchmarks and a Web3 case study to demonstrate the capability of CoA over other methods.
1. Introduction
This work proposes a new reasoning-retrieval framework to enhance the quality of Large Language Model (LLM) question answering without additional training or querying costs. As exemplified in Figure 1, this work overcomes two major drawbacks in applying LLMs to answer complex questions: (i) unfaithful generation, where the response may not align with real-time or domain-specific facts (e.g., failing to localize relevant facts in Figure 1(b)), and (ii) weak reasoning, where LLMs struggle to aggregate heterogeneous information sources, resolve their conflicts, and adequately reason over the information to provide useful, tailored responses (such as the prematurely stopped analysis in Figure 1(c), despite having successfully localized relevant search results).
To enhance faithfulness and multi-step reasoning, previous approaches such as chain-of-thought-based work (Wang et al., 2022; Saparov & He, 2022; Yao et al., 2023a) encourage LLMs to think step-by-step to break down complex questions. However, only pushing models to continue thinking may not be ideal. Models are expected to learn to pause to verify results and decide whether they need more information before continuing to generate. Recent work therefore explores integrating information retrieval (Yao et al., 2022; Xu et al., 2023; Li et al., 2023b) into the reasoning chain. However, we argue that seeking external information is not only retrieval, but should manifest as configurable ‘Plug-and-Play’ actions: querying web text, encoding domain knowledge, analyzing tabular and numerical data, etc. The key challenge of such heterogeneous multimodal data is to automatically decide when to cease generation to solicit information, what types of external sources to leverage, and how to cross-validate conflicting insights.
To that end, we propose a new universal framework, Chain-of-Action (CoA), which equips LLMs to proactively initiate information-seeking actions. We design three ‘Plug-and-Play’ actions in this paper: (i) web-querying to extract real-time information as discrete text tokens, (ii) knowledge-encoding to embed domain-specific knowledge concepts as continuous vectors, and (iii) data-analyzing for accessing and interpreting numeric tabular sources. A key advantage of this framework is its extensibility to diverse modalities, e.g., images in the future. Beyond adapting across data modalities, new actions can be introduced to handle emerging domains or data processing techniques.
In detail, as illustrated in Figure 2, the CoA framework first injects the question and action descriptions into the pre-designed prompting template through in-context learning. Then the LLM constructs an action chain (AC), where each action node comprises a sub-question, a missing-data flag indicating the need for additional information, and an initial answer. After that, we perform action execution and monitoring to address retrieval demands in three steps: (i) retrieving related information, (ii) verifying conflicts between the initial answer and the retrieved information, and (iii) inferring missing content with retrieved information when necessary. To verify information conflicts, we design a verification module utilizing our multi-reference faith score (MRFS). If the score of the LLM-generated answer is below a certain threshold, the corresponding action incorporates the retrieved information for answer correction. In this way, LLMs can effectively generate a final answer that is sound and externally grounded.
[Figure 1: example outputs for a Bitcoin-investment question. (a) Chain-of-Action grounds its answer in the token's current price, news sentiment, and technical indicators (e.g., RSI); (b) Chain-of-Thought gives generic step-by-step investment advice; (c) the ReAct agent retrieves relevant search results but stops without reaching a conclusion.]
We show the example outputs in Figure 1, where the outputs exhibit significant improvement. A key feature of CoA is that it automatically solicits external information, in the form of tokens, vectors, or numbers, for integration into model reasoning. Rather than hard-coding their connections, actions are designed as dataset-agnostic modules that LLMs invoke selectively.
The significant improvement of CoA is not only shown in experiments on multiple QA datasets but also validated by the notable success of a real-world deployment. Upon integration into a Web3 QA application, key metrics including active users and positive feedback volumes increased remarkably within a few months. This performance highlights CoA's effectiveness in real-world applications.
In summary, our main contributions are as follows:
• We present CoA, which integrates a novel reasoning-retrieval mechanism to decompose complex questions into reasoning chains of configurable actions via systematic prompting. It can retrieve heterogeneous multimodal information and reduce information conflicts.
• We propose three types of ‘Plug-and-Play’ domain-adaptable actions to address retrieval of real-time information, domain knowledge, and tabular data. The actions are flexible to incorporate additional sources.
• We propose a novel metric, the multi-reference faith score (MRFS), to identify and resolve conflicts between retrieved information and LLM-generated answers, enhancing answer reliability.
• Our experimental results demonstrate that our framework surpasses existing methods on public benchmarks.
• Additionally, a real-world application in our Web3 QA product has shown significant user engagement and positive feedback, validating the CoA framework’s effectiveness and practicality in real-world scenarios.
2. Methodology
As shown in Figure 2, we first introduce how the LLM generates the action chain (Sec. 2.1). Then, the actions address the multimodal retrieval demands of the chain's nodes in three processes: (i) retrieving related information, (ii) verifying whether the LLM-generated answer is good enough or requires more information from retrieval, and (iii) checking whether the initial answer of each node's sub-question is missing, in which case we fill in the missing content with the retrieved information. Finally, the LLM produces the final answer based on this refined and processed action chain.
2.1. Action Chain Generation
We use in-context learning to generate an action chain with the LLM. As shown in Figure 2(a), we design a prompt template to decompose the user's question into multiple sub-questions, together with the corresponding Missing Flags (MF) and guess answers, as shown in Figure 2(b). Then, we assign one of the actions to solve each sub-question.
Prompt design. We design a prompt template, shown in Figure 4, starting with "Construct an action reasoning chain for [questions]..." to prompt the LLM to generate an Action Chain $AC$ rather than answer the question $Q$ directly.
Each action node comprises four elements: the action $\mathsf{Action}_{i}$, the content of the sub-question $\mathsf{Sub}_{i}$, the missing data flag $\mathsf{MF}_{i}$, and the guess answer from the LLM $\mathsf{A}_{i}$, where $i\in\{1,\ldots,n\}$. When the inner knowledge of the LLM is enough to answer the sub-question, the LLM generates an initial answer as the value of "guess answer". Otherwise, the value of "missing flag" becomes "True", followed by a blank "guess answer".
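To make the node structure concrete, the following is a minimal sketch of parsing the JSON action chain produced by the prompt in Figure 4; the class and function names are illustrative and not part of the paper's released code.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class ActionNode:
    action: str         # "Web-querying", "Knowledge-encoding", or "Data-analyzing"
    sub: str            # sub-question text
    missing_flag: bool  # True when the LLM lacks the knowledge to answer
    guess_answer: str   # initial answer; empty when missing_flag is True

def parse_action_chain(llm_output: str) -> List[ActionNode]:
    """Parse the JSON action chain emitted by the prompt template (Figure 4)."""
    payload = json.loads(llm_output)
    return [
        ActionNode(
            action=step["Action"],
            sub=step["Sub"],
            missing_flag=step.get("Missing flag", "False") == "True",
            guess_answer=step.get("Guess answer", ""),
        )
        for step in payload["Chain"]
    ]
```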
Figure 2. Overview of the Chain-of-Action framework. We first use in-context learning to prompt the LLM to generate the action chain. The chain has many nodes consisting of sub-questions (Sub), missing flags (MF), and LLM-generated guess answers (A). Then, the actions address the multimodal retrieval demands of the nodes in three steps: (i) retrieving related information, (ii) verifying whether the LLM-generated answer needs correction by retrieval, and (iii) checking whether we need to fill in missing contents with the retrieved information. Finally, we generate the final answer by the LLM based on the refined and processed action chain.
Figure 3. Two samples from our Chain-of-Action framework.
2.2. Actions Implementation
We propose three types of actions to address multimodal retrieval demands. Each of them has three working steps: (i) Information Retrieval, (ii) Answering Verification, and (iii) Missing Detection. We first introduce the design of the three actions. Then, we describe the details of the three common steps.
2.2.1. ACTIONS DESIGN
Action 1: Web-querying. The web-querying action utilizes existing search engines (e.g., Google Search) and follows our query strategy to get the relevant content from the Internet. In detail, it first searches for the keywords of the given sub-question $\mathsf{Sub}_{n}$ to obtain the result list. If the corresponding "Missing flag" is "True", we choose the top-$k$ results and extract their contents from their page sources. Otherwise, we combine the titles T and snippets Sn of the top M pages. Then, we transfer each pair of title and snippet $\{\mathsf{T}_{m},\mathsf{Sn}_{m}\}$ into a 1536-dimension vector $Emb_{\mathsf{T}_{m}|\mathsf{Sn}_{m}}$ using the embedding model (text-embedding-ada-002 from OpenAI (OpenAI, 2023b)). Meanwhile, we also transfer the sub-question and guess answer $\{\mathsf{Sub}_{n},\mathsf{A}_{n}\}$ into $Emb_{\mathsf{Sub}_{n}|\mathsf{A}_{n}}$. Next, we calculate the similarity between each $Emb_{\mathsf{T}_{m}|\mathsf{Sn}_{m}}$ and $Emb_{\mathsf{Sub}_{n}|\mathsf{A}_{n}}$ to filter out the pages whose similarities are lower than 0.8. Then, we extract the contents of the high-similarity pages and calculate the similarity between them and $Emb_{\mathsf{Sub}_{n}|\mathsf{A}_{n}}$ to rank and get the top-$k$ final pages. The contents of these $k$ final pages are the final information retrieved by the action.
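A minimal sketch of this two-stage filter-and-rank procedure follows. The `embed`, `search_web`, and `fetch_page` callables are placeholders for the embedding model, the search-engine API, and a page scraper; the 0.8 threshold and top-$k$ cut follow the description above, and the default `k` and `m` values are illustrative.

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def web_query(sub, guess_answer, embed, search_web, fetch_page,
              k=3, m=10, sim_threshold=0.8):
    """Sketch of the web-querying action for a node whose missing flag is False."""
    query_emb = embed(f"{sub} | {guess_answer}")    # Emb_{Sub_n | A_n}
    results = search_web(sub)[:m]                   # top-M search hits
    # Stage 1: filter on title+snippet similarity.
    kept = [r for r in results
            if cosine(embed(f"{r['title']} | {r['snippet']}"), query_emb) >= sim_threshold]
    # Stage 2: rank full page contents and keep the top-k.
    pages = [fetch_page(r["url"]) for r in kept]
    pages.sort(key=lambda text: cosine(embed(text), query_emb), reverse=True)
    return pages[:k]
```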
Action 2: Knowledge-encoding. The knowledge-encoding action utilizes a vector database (e.g., ChromaDB) as data storage for domain information and the corresponding embedded vectors. For example, we collect Web3 domain information from different sources (X, experts' blogs, white papers, and trending strategies) to support our QA case study. After data collection, we split each document into many chunks based on length. Then, we encode each chunk of content into an embedded vector and store it in our vector database with its index. When we need to execute this engine to retrieve domain information, we forward $Emb_{\mathsf{Sub}_{n}|\mathsf{A}_{n}}$ to compute the similarity between the input and each chunk and obtain the top-$k$ results.
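As a rough illustration, the snippet below indexes chunked documents in ChromaDB and queries them with the node embedding; the collection name, chunk length, and the `embed` callable are assumptions made for this example rather than details taken from the paper.

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("web3_knowledge")

def index_documents(docs: dict, embed, chunk_len: int = 500) -> None:
    """Split each document into fixed-length chunks, embed them, and store them."""
    for doc_id, text in docs.items():
        chunks = [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]
        collection.add(
            ids=[f"{doc_id}-{j}" for j in range(len(chunks))],
            documents=chunks,
            embeddings=[embed(c) for c in chunks],
        )

def knowledge_query(sub: str, guess_answer: str, embed, k: int = 3):
    """Return the top-k stored chunks most similar to Emb_{Sub_n | A_n}."""
    res = collection.query(query_embeddings=[embed(f"{sub} | {guess_answer}")], n_results=k)
    return res["documents"][0]
```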
Construct an action reasoning chain for this complex [Question]: "$QUESTION" in JSON format. For each step of the reasoning chain, choose an action from three choices: [Web-querying Engine (search real-time news or new words), Knowledge-encoding Engine (search existing domain information in the local knowledge base), Data-analyzing Engine (query real-value data and calculate some results)] as the value of the element "Action", and also generate a sub-question for each action to get one of [web-search keywords, needed information description, data features description] as the value of the element "Sub". Also, generate an initial answer for each Sub as the value of the element "Guess answer" if you are sure it is correct. In addition, if you cannot answer some sub-question, make the element "Missing flag" value "True"; otherwise, make it "False". You need to try to generate the final answer for the [Question] by referring to the "Action"-"Sub"-"Guess answer"-"Missing flag" in "Chain", as the value of the element "Final answer". For example:
{"Question":"ls it good to invest in Bitcoin now? A. It is a good time. B. It is not a good time.", "Chain": [ ("Action": “Knowledge-encoding", "Sub": "what is bitcoin","Guess answer": "Bitcoin is one of the crypto currencies.", "Missing flag" : "False"), to YY..", "Missing_ flag" : "False"),
现在是投资比特币的好时机吗?A. 是好时机。B. 不是好时机。
Figure 4. Prompt to Generate Action Chain in Chain-of-Action (CoA). This template integrates the user’s question along with a description of each available action. The resulting action chain comprises elements such as actions, subs, guess answers and missing flags. This prompt not only decomposes complex questions into multiple sub-questions, guided by the features of the actions but also allows the LLM to answer certain sub-questions using its existing inner-knowledge. This process exemplifies our proposed reasoning-retrieval mechanism.
Action 3: Data-analyzing. The data-analyzing action aims to retrieve data from real-value data sources (e.g., market data of digital currencies). In some situations, we can directly retrieve the relevant values from our deployed API when sub-questions demand up-to-date or historical value data. Furthermore, we can also use the LLM to compute more sophisticated features by generating Python or SQL code to execute. It is flexible and compatible with various situations. In this paper, we only design it to retrieve market data for the Web3 case.
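A minimal sketch of such an action is given below; the endpoint URL and response fields are placeholders for the deployed market-data API, and the derived feature is deliberately simple (the case study also reports indicators such as RSI, which the LLM could compute via generated code).

```python
import requests

def data_analyze(symbol: str, api_url: str = "https://example.com/api/market") -> dict:
    """Sketch of the data-analyzing action: fetch market prices and derive a feature."""
    resp = requests.get(api_url, params={"symbol": symbol}, timeout=10)
    prices = resp.json()["closing_prices"]          # placeholder response field
    change_pct = (prices[-1] - prices[0]) / prices[0] * 100
    return {"latest_price": prices[-1], "window_change_pct": round(change_pct, 2)}
```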
2.2.2. ACTIONS WORKFLOW
In the action chain, the framework executes the action workflow for each node until it finishes the whole chain, as shown in Algorithm 1.
Information Retrieval. In the information retrieval stage, we need to find the most relevant and similar contents from different knowledge/data sources. At first, we take both the sub-question and the guess answer of each node as a query section, $QS_{n}$. Then, with the encoding of the LLM's embedding model, we transfer our query $QS_{n}=\{\mathsf{Sub}_{n}|\mathsf{A}_{n}\}$ into a 1536-dimension vector $Emb_{QS_{n}}$. With this embedded vector, we can perform information retrieval and then rank the results by calculating the similarity. Finally, the actions return the top-$k$ results $R_{QS}$:
$$
R_{QS}=(r_{1}\mid r_{2}\mid\ldots\mid r_{k}).
$$
Answering Verification. After the information retrieval, we verify the information conflicts between the guess answer $A_{n}$ and the retrieved facts $R_{QS}$. Inspired by ROUGE (Lin, 2004), we propose a multi-reference faith score, MRFS. To get the MRFS, we compute the pairwise faith score $S$ between a candidate summary and every reference, then take the maximum of the faith scores. $S$ is a composite metric computed from three individual components: Precision (P), Recall (Rcl), and Average Word Length (AWL) in the candidate summary. The mathematical representation of the score is given by:
$$
S=\alpha\times P+\beta\times Rcl+\gamma\times AWL
$$
Where:
• $\alpha,\beta,\gamma$ are weights corresponding to the importance of Precision, Recall, and Average Word Length, respectively. Their values can be adjusted based on specific requirements but should sum up to 1 for normalization purposes.
• $P$ (Precision) is the fraction of relevant instances among the retrieved instances.
• $Rcl$ (Recall) is the fraction of relevant instances that were retrieved.
• $AWL$ (Average Word Length in the candidate summary) is the mean length of the words present in the summarized content.
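For concreteness, one ROUGE-style reading of these components, over the word overlap between the candidate (the guess answer) and a reference (a retrieved result), is sketched below; the exact token-level definitions are an assumption of this sketch rather than formulas given in the text.

$$
P=\frac{|\,\text{candidate}\cap\text{reference}\,|}{|\,\text{candidate}\,|},\qquad
Rcl=\frac{|\,\text{candidate}\cap\text{reference}\,|}{|\,\text{reference}\,|},\qquad
AWL=\frac{1}{|\,\text{candidate}\,|}\sum_{w\in\text{candidate}}\mathrm{len}(w),
$$

where $|\cdot|$ counts words.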
Adjusting the weights $\alpha,\beta,\gamma$ will allow for emphasizing different aspects (Precision, Recall, or Word Length) depending on the specific evaluation criteria or context.
Algorithm 1: Description of the actions workflow.
Initialization: action chain (AC); question (Q); LLM (M); query section (QS); sub-question (Sub); guess answer (A); faith score (S); multi-reference faith score (MRFS); retrieved results (R); missing flag (MF). Output: the final generated answer.
Function IR(Sub, A, MF):
  QS_n = Concat[Sub_i | A_i];
  R = Retrieve(QS_n);
  MRFS = max_k S(r_k, A_i);
  if MF == True then
    AC.Add(Sub, r_1);  // add the top-1 result
  end if
  if MRFS < T then
    AC.Correct(Sub_i, r_k);
  end if
  AC.Add(Sub, r_1);
end Function
Function Main(Q, M):
  AC = ChainGeneration(Q, M);
  for each (Sub_i, A_i, MF) in AC do
    IR(Sub, A, MF);
  end for
  FinalAnswerGeneration(AC, M);
  return "Done";
end Function
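A compact Python rendering of this per-node workflow is sketched below; `retrieve` and `mrfs` stand in for the three actions of Sec. 2.2.1 and the MRFS scoring described below, the node fields mirror the chain elements (Sub, A, MF), and the default threshold value is illustrative.

```python
def run_actions(chain, retrieve, mrfs, threshold=0.5):
    """Per-node workflow of Algorithm 1.

    chain     : list of nodes with .action, .sub, .missing_flag, .guess_answer
    retrieve  : callable (action_type, query) -> ranked list of retrieved strings
    mrfs      : callable (retrieved_list, guess_answer) -> float faith score
    threshold : the faithfulness threshold T (value here is illustrative)
    """
    for node in chain:
        query = f"{node.sub} | {node.guess_answer}"   # QS_n = Sub_n | A_n
        results = retrieve(node.action, query)        # top-k facts r_1..r_k
        if node.missing_flag:
            node.guess_answer = results[0]            # fill the missing answer (top-1)
        elif mrfs(results, node.guess_answer) < threshold:
            node.guess_answer = results[0]            # correct an unfaithful answer
    return chain
```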
After getting the MRFS through:
$$
\mathrm{MRFS}=\max_{k}S(r_{k},A_{i}),
$$
we set up a threshold $T$ to decide whether the answer $A_{i}$ is faithful. If the MRFS is greater than $T$, we keep the answer; otherwise, we replace the answer $A_{i}$ with the reference contents.
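A simple implementation of the score and the multi-reference maximum might look as follows; the word-overlap components and the AWL normalization follow the ROUGE-style reading assumed above, and the weights are illustrative.

```python
def faith_score(reference: str, candidate: str,
                alpha: float = 0.4, beta: float = 0.4, gamma: float = 0.2) -> float:
    """Composite score S = alpha*P + beta*Rcl + gamma*AWL (AWL scaled into [0, 1])."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = len(set(cand) & set(ref))
    p = overlap / len(cand)
    rcl = overlap / len(ref)
    awl = min(sum(len(w) for w in cand) / (10 * len(cand)), 1.0)  # assumed scaling
    return alpha * p + beta * rcl + gamma * awl

def mrfs(retrieved: list, guess_answer: str) -> float:
    """Multi-reference faith score: the maximum pairwise score over all references."""
    return max(faith_score(r, guess_answer) for r in retrieved)
```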
Missing Detection. The last stage of each action is detecting whether the guess answer $A_{i}$ is complete. When a sub-question needs special or real-time information, the corresponding guess answer $A_{i}$ can be incomplete, with the Missing Flag $MF_{i}$ set to "True". If a guess answer's MF is "True", we inject the retrieved information into $A_{i}$ to fill in the blank "Guess answer".
2.3. Final answer generation
After all actions are executed, we propose a prompt template, shown in Figure 5, to integrate all corrected answers and corresponding sub-questions of the AC. It then prompts the LLM to refer to the newest retrieved information and generate the final answer, starting with "[Final Content]", through the corrected reasoning chain.
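The prompt can be assembled mechanically from the corrected chain; a sketch follows, where the node fields mirror the chain elements and the wording follows Figure 5.

```python
def build_final_prompt(question: str, chain) -> str:
    """Assemble the final-answer prompt (Figure 5) from the corrected action chain."""
    steps = "\n".join(
        f"[Sub-question]: {node.sub}\n[Solved Answer]: {node.guess_answer}"
        for node in chain
    )
    return (
        f'Here is the corrected reasoning chain for this complex [Question]: "{question}". '
        "Each step of the reasoning chain has [Sub-question] and [Solved Answer]. "
        f'Answer the [Question]: "{question}" starting with [Final Content] '
        "through the reasoning chain.\n" + steps
    )
```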
Here is the corrected reasoning chain for this complex [Question]: "[how's buying bitcoin]". Each step of the reasoning chain has [Sub-question] and [Solved Answer]. Answer the [Question]: "[how's buying bitcoin]" starting with [Final Content] through the reasoning chain.
And its price has become higher and higher recently [2]. Also, there is a lot of news promoting Bitcoin, such as... [3]. So the answer is: It is a good time to invest in Bitcoin now, but you need to consider the risk of investing in cryptocurrency.
Figure 5. Prompt for final answer generation in CoA. We use the processed chain to prompt LLM to reanswer the user’s question.
3. Experiments
In this section, we initially compare the performance of our Chain-of-Action framework with recent state-of-the-art baselines across various public benchmarks, followed by an in-depth analysis of these comparisons. Subsequently, we provide a detailed analysis of our launched case study: a Question Answering (QA) application in the Web3 domain.
3.1. Experiments with Benchmarks
Datasets and Evaluation Metric. We select classic QA tasks that include web-based QA (Web Questions QA (WQA) (Berant et al., 2013)), general QA (DATE, General Knowledge, Social QA (SoQA)), TruthQA (Srivastava et al., 2022), StrategyQA (SQA) (Geva et al., 2021), and fact checking (FEVER (Thorne et al., 2018)).
For the evaluation metric, we use cover-EM (Rosset et al., 2020) to represent whether the generated answer contains the ground truth.
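Under this metric, an answer counts as correct if it contains a ground-truth string; a minimal implementation is shown below, with the text normalization being our assumption rather than the exact normalization used in the paper.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def cover_em(prediction: str, ground_truths: list) -> bool:
    """Cover-EM: True if the generated answer contains any ground-truth answer."""
    pred = normalize(prediction)
    return any(normalize(gt) in pred for gt in ground_truths)
```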
We categorize our baseline methods into two types: the first type focuses on reasoning, prompting the LLM to solve complex questions (Few-shot Prompting, Chain-of-Thought (CoT) (Wei et al., 2022), Self Consistency (SC) (Wang et al., 2022), Tree of Thought (ToT) (Yao et al., 2023a), Least-to-Most (Zhou et al., 2022), and Auto-Chain-of-Thought (Auto-CoT) (Zhang et al., 2023)); the second is a Retrieval-Augmented Generation (RAG) type that integrates information retrieval to enhance reasoning capabilities (ToolFormer (Schick et al., 2023a), Self-Ask (Press et al., 2022), React (Yao et al., 2023b), Search Chain (SeChain) (Xu et al., 2023), and DSP (Khattab et al., 2022)). We conduct a thorough functional comparison between these baseline methods and our Chain-of-Action (CoA), as presented in Table 1.
Implementation. Our experimental framework incorporates the data preprocessing techniques of Google's Bigbench (Srivastava et al., 2022) and Auto-CoT (Zhang et al., 2023).
Table 1. The functional comparison of Chain-of-Thought baselines with our method CoA.
Method | Few-shot | CoT | SC | ToT | Auto-CoT | Least-to-Most | ToolFormer | Self-Ask | React | DSP | SearchChain | CoA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Multi-step Reasoning | √ | √ | √ | √ | / | √ | | | | | | |
Retrieval | √ | √ | | | | | | | | | | |
Multimodal | √ | | | | | | | | | | | |
Verification | √ | | | | | | | | | | | |
Table 2. We conduct a comprehensive evaluation of accuracy for six question-answering and one fact-checking dataset. Our study involves the implementation of 11 baseline methods alongside our own Chain-of-Action (CoA) method. We assess the performance of these methods across seven tasks, considering both information retrieval and non-retrieval scenarios. The results, averaged over three runs, are presented with variance values omitted (all ≤ 2%). Our presentation format involves bolding the best results and underlining the second-best results. Our findings highlight the superior performance of CoA, which achieved the highest accuracy in 12 out of 14 test scenarios. Notably, CoA consistently outperforms all baseline methods, even when external memory was not employed, demonstrating its robust and top-tier performance.
Method | WebQA | DATE | GK | SocialQA | TruthQA | StrategyQA | Fact Checking (FEVER) |
---|---|---|---|---|---|---|---|
Without information retrieval | | | | | | | |
Zero-shot | 43.0 | 43.6 | 91.0 | 73.8 | 65.9 | 66.3 | 50.0 |
Few-shot | 44.7 | 49.5 | 91.1 | 74.2 | 68.9 | 65.9 | 50.7 |
CoT (Wei et al., 2022) | 42.5 | 43.7 | 88.1 | 71.0 | 66.2 | 65.8 | 40.4 |
SC (Wang et al., 2022) | 36.5 | 50.0 | 87.5 | 60.0 | 66.7 | 70.8 | 53.3 |
ToT (Yao et al., 2023a) | 32.3 | 47.1 | 85.1 | 68.5 | 66.6 | 43.3 | 41.2 |
Auto-CoT (Zhang et al., 2023) | 42.1 | 52.3 | 89.7 | 59.1 | 61.6 | 65.4 | 32.5 |
Least-to-Most (Zhou et al., 2022) | 44.0 | 42.1 | 80.8 | 68.1 | 59.5 | 65.8 | 43.4 |
SeChain w/o IR | 50.8 | 44.7 | 75.0 | 64.9 | 54.1 | 75.6 | 39.2 |
CoA w/o actions | 64.7 | 55.3 | 91.4 | 80.2 | 63.3 | 70.6 | 54.2 |
With information retrieval | | | | | | | |
ToolFormer (Schick et al., 2023a) | 34.5 | 53.9 | 72.3 | 48.1 | 57.5 | 69.4 | 60.2 |
Self-Ask (Press et al., 2022) | 31.1 | 55.1 | 79.7 | 52.1 | 60.5 | 67.7 | 64.2 |
React (Yao et al., 2023b) | 38.3 | / | 85.1 | 65.8 | 59.9 | 70.4 | 43.9 |
DSP (Khattab et al., 2022) | 59.4 | 48.8 | 85.1 | 68.2 | 58.4 | 72.4 | 62.2 |
SearchChain (Xu et al., 2023) | 65.3 | 51.0 | 87.6 | 69.4 | 61.7 | 77.0 | 65.9 |
CoA | 70.7 | 57.4 | 98.6 | 83.1 | 67.3 | 79.2 | 68.9 |
- w/o verification | 66.9 | 56.8 | 95.7 | 81.5 | 65.0 | 75.2 | 65.7 |
- w/o imputation | 67.4 | 56.3 | 97.1 | 82.9 | 65.8 | 76.5 | 65.3 |
For generating multiple reasoning steps, we employ OpenAI's gpt-3.5-turbo (OpenAI, 2023a) model, accessed via API, as our primary LLM. Additionally, to tackle the challenges of controlling response formats with black-box models like gpt-3.5-turbo, we establish an advanced evaluation pipeline utilizing GPT-4 (Bevilacqua et al., 2023).
3.1.1. EXPERIMENTAL ANALYSIS
Our comprehensive evaluation, detailed in Table 2, compares the effectiveness of our CoA framework and eleven baseline methods across six question-answering datasets and one fact-checking dataset. We evaluate the framework's performance in information retrieval and non-retrieval scenarios separately. The sole exception pertains to React, implemented with LangChain (Topsakal & Akinci, 2023), which exhibits unresponsive behavior on the DATE dataset; as a result, we omit the comparison involving React on the DATE dataset. Our CoA framework demonstrates superior performance metrics in 12 of 14 test scenarios. Our method achieves a significant 3.42% improvement on the test tasks without information retrieval compared to the state-of-the-art baseline (Search Chain without IR), and a 6.14% increase on the test tasks with information retrieval over its state-of-the-art baseline (Search Chain). This is a significant outcome, as it underscores the effectiveness of our framework. It also demonstrates that CoA is well suited to various question-answering tasks. In particular, the enhancement in performance is consistent regardless of the integration of IR. This indicates that our framework has intrinsic robustness and comprehensive understanding that does not rely on external information.
In a further analysis detailed in Table 3, we delve into the complexity of reasoning processes in various methods. Our framework exhibits a higher average number of reasoning steps when decomposing complex questions. This metric is vital, highlighting the framework’s capability to engage in a multi-step inference process, a capability that is essential for solving intricate problems that require more than surface-level understanding. The fact that our framework outperforms others in this measure suggests that it can better understand and navigate the layers of complexity within questions, which is a testament to the sophisticated reasoning algorithms it employs.
Additionally, Table 6 explores the average frequency of LLM usage per question. Our framework shows a reduced frequency, reflect