[论文翻译]使用强化学习在开放式对话中进行动态规划


原文地址:PDF/RLHF论文集/Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning.pdf


Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning

使用强化学习在开放式对话中进行动态规划

Deborah Cohen, Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor, Craig Boutilier, Gal Elidan {debbycohen,mkryu,yinlamchow,orgad,ido greenberg,avinatan,fink,yossi,szpektor,cboutilier,elidan}@google.com Google Research

ABSTRACT

摘要

Despite recent advances in natural language understanding and generation, and decades of research on the development of conversational bots, building automated agents that can carry on rich open-ended conversations with humans “in the wild” remains a formidable challenge. In this work we develop a real-time, open-ended dialogue system that uses reinforcement learning (RL) to power a bot’s conversational skill at scale. Our work pairs the succinct embedding of the conversation state generated using SOTA (supervised) language models with RL techniques that are particularly suited to a dynamic action space that changes as the conversation progresses. Trained using crowd-sourced data, our novel system substantially exceeds the (strong) baseline supervised model with respect to several metrics of interest in a live experiment with real users of the Google Assistant.

尽管自然语言理解和生成领域最近取得了进展,并且对话机器人的开发已经研究了数十年,但构建能够“在野外”与人类进行丰富开放式对话的自动化智能体仍然是一个巨大的挑战。在这项工作中,我们开发了一个实时、开放式的对话系统,该系统使用强化学习 (RL) 来大规模提升机器人的对话能力。我们的工作将使用 SOTA(监督式)语言模型生成的对话状态的简洁嵌入与特别适合随着对话进展而变化的动态动作空间的 RL 技术相结合。通过使用众包数据进行训练,我们的新颖系统在与 Google Assistant 真实用户的实时实验中,在多个关键指标上显著超过了(强大的)基线监督模型。

1 INTRODUCTION

1 引言

With tremendous advances in AI and ML techniques to recognize speech and perform high quality natural language understanding (NLU) and generation (NLG), increased attention is being directed toward the task of carrying out real-time, rich conversations between humans and bots (e.g., [19, 32, 46]). Realistic interactions generally span complex topic spaces and are relatively open-ended, and often have an underlying goal (e.g., task completion, knowledge sharing). Thus, carrying them out effectively requires not just powerful bots that learn to generate favorable responses, but also demands that bots have the ability to plan and adapt on the fly.

随着人工智能(AI)和机器学习(ML)技术在语音识别、高质量自然语言理解(NLU)和生成(NLG)方面的巨大进步,越来越多的注意力被转向实现人类与机器人之间实时、丰富的对话任务(例如 [19, 32, 46])。真实的交互通常涉及复杂的主题空间,并且相对开放,通常具有潜在的目标(例如任务完成、知识共享)。因此,有效地执行这些任务不仅需要强大的机器人来学习生成有利的响应,还要求机器人具备即时规划和适应的能力。

The framework of reinforcement learning (RL) is a natural approach for this task. Indeed, research on Markov decision processes (MDPs) and RL for spoken-dialogue systems spans well over two decades, ranging from early work using MDPs and RL [18, 35], to methods based on partially observable MDP (POMDP) models (e.g., [42]), to more recent approaches adopting deep learning representations (e.g., [19]). Despite this, deployments of RL “in the wild” in large-scale dialogue systems, such as smart assistants like Alexa, Siri or Google Assistant, are rare (though exceptions exist, e.g., [32]; see Related Work below). Indeed, building such systems remains a formidable challenge.

强化学习 (Reinforcement Learning, RL) 的框架是解决这一任务的天然方法。事实上,针对语音对话系统的马尔可夫决策过程 (Markov Decision Processes, MDPs) 和 RL 的研究已经跨越了二十多年,从早期使用 MDPs 和 RL 的工作 [18, 35],到基于部分可观测 MDP (Partially Observable MDP, POMDP) 模型的方法(例如 [42]),再到最近采用深度学习表示的方法(例如 [19])。尽管如此,在大规模对话系统(如 Alexa、Siri 或 Google Assistant 等智能助手)中“实际应用”RL 的情况仍然罕见(尽管存在例外,例如 [32];见下文的相关工作)。事实上,构建这样的系统仍然是一个巨大的挑战。

Aside from the infrastructure hurdles associated with scalable, real-time systems, there are inherent modeling challenges when building open-ended conversational bots. First, the state space of such bots is massive, even within specific verticals, and care is needed to craft effective state representations for RL algorithms. Second, the action space is also in principle “unbounded,” and imposing reasonable limitations on actions comes with its own difficulties, including the fact that the set of candidate actions may vary as the conversation progresses. Finally, the design of suitable reward functions for open-ended dialogue can be quite subtle. In this work, we rely on crowd-sourced labels.

除了与可扩展的实时系统相关的基础设施障碍外,构建开放式对话机器人还存在固有的建模挑战。首先,即使是在特定垂直领域内,这类机器人的状态空间也非常庞大,需要精心设计有效的状态表示以适用于强化学习(RL)算法。其次,动作空间在原则上也是“无界的”,对动作施加合理的限制本身就存在困难,包括候选动作集可能随着对话的进行而变化。最后,为开放式对话设计合适的奖励函数可能非常微妙。在本研究中,我们依赖于众包标签。

We present a real-time, open-ended dialogue system that uses RL to power a bot’s conversational skill at scale. We address the challenges above using a novel RL construction. We first exploit powerful supervised models—specifically, RNNs and transformers— to provide a succinct embedding of the conversation state. Second, we use the fact that a relatively small set of “reasonable” candidate actions can be generated at each conversation turn [38]. From an RL perspective, this can be viewed as a stochastic realization of the full action space, so we use an RL approach tailored to such stochastic action sets [3]. We also explore the use of alternative SOTA RL techniques and training methods, including continuous-action optimization [29], conservative Q-learning [17] and novel off-policy evaluation algorithms [28].

我们提出了一种实时、开放式的对话系统,该系统使用强化学习(RL)来大规模提升机器人的对话能力。我们通过一种新颖的强化学习构建方法来解决上述挑战。首先,我们利用强大的监督模型——特别是RNN和Transformer——来提供对话状态的简洁嵌入。其次,我们利用了一个事实,即在每个对话轮次中可以生成相对较小的一组“合理”候选动作 [38]。从强化学习的角度来看,这可以被视为完整动作空间的随机实现,因此我们使用了一种针对此类随机动作集定制的强化学习方法 [3]。我们还探索了使用其他最先进的强化学习技术和训练方法,包括连续动作优化 [29]、保守Q学习 [17] 以及新颖的离策略评估算法 [28]。

We first evaluate our methods using offline data. We then describe the deployment of our system “in the wild” in the Google Assistant, specifically as part of the animal domain experience described by Szpektor et al. [38]. We demonstrate the effectiveness of our RL-based approach at dynamic planning and driving open-ended dialogue: relative to a SOTA non-RL (transformer) baseline, our bot substantially improves a number of key metrics, including conversation length, cooperative responses and explicit positive feedback. We perform a novel and extensive comparison of many RL architectures in real-world settings and generate unique insights into their relative performance. An example dialogue is shown in Fig. 1, showcasing the rich pivoting of our best-performing RL model vs. the supervised approach. Our model, now deployed in the Google Assistant, marks an important milestone in the use of RL for driving real-time, engaging, open-ended, conversations.

我们首先使用离线数据评估我们的方法。然后,我们描述了我们的系统在 Google Assistant 中的实际部署,特别是作为 Szpektor 等人 [38] 描述的动物领域体验的一部分。我们展示了基于强化学习 (RL) 的方法在动态规划和驱动开放式对话方面的有效性:相对于最先进的非强化学习 (Transformer) 基线,我们的机器人显著改善了多个关键指标,包括对话长度、合作响应和明确的正面反馈。我们在现实环境中对多种强化学习架构进行了新颖且广泛的比较,并生成了关于它们相对性能的独特见解。图 1 展示了一个示例对话,展示了我们表现最佳的强化学习模型与监督学习方法之间的丰富对比。我们的模型现已部署在 Google Assistant 中,标志着使用强化学习驱动实时、引人入胜、开放式对话的重要里程碑。

The key ingredients and insights of our approach are threefold. First, we use an effective representation of the RL state, leveraging pre-trained supervised models that encode the conversation history. Second, we limit the action space to a small set of generated candidate actions while allowing multiple actions in a single turn to compose rich bot responses. This granularity decouples content generation and dialogue planning. Finally, we adapt recent state-of-the-art RL algorithms that are well-suited to our dynamic action space to the candidate-generation decomposition we adopt here.

我们方法的关键要素和见解有三点。首先,我们使用强化学习状态的有效表示,利用预训练的监督模型对对话历史进行编码。其次,我们将动作空间限制为一小组生成的候选动作,同时允许在单轮对话中执行多个动作以构建丰富的机器人响应。这种粒度将内容生成和对话规划解耦。最后,我们采用了最近的最先进的强化学习算法,这些算法非常适合我们的动态动作空间,并将其应用于我们采用的候选生成分解方法。

2 RELATED WORK

2 相关工作

The use of RL for dialogue dates back more than two decades. Statistical research focuses on task-oriented dialogues and uses MDPs [18, 35, 36, 41] or POMDPs [42, 44]. Henderson et al. [11] introduce function approximation to reduce the resulting large slot-filling state space. Casanueva et al. [4] propose a feudal RL

使用强化学习 (RL) 进行对话的研究可以追溯到二十多年前。统计研究主要集中在任务导向型对话,并使用马尔可夫决策过程 (MDPs) [18, 35, 36, 41] 或部分可观测马尔可夫决策过程 (POMDPs) [42, 44]。Henderson 等人 [11] 引入了函数逼近来减少由此产生的大规模槽填充状态空间。Casanueva 等人 [4] 提出了一种封建强化学习方法。



Figure 1: Example conversations conducted by (a) an RL model and (b) a supervised model, previously deployed in the Google Assistant, showcasing the rich pivoting of the RL model vs. the supervised approach.

图 1: (a) 强化学习 (RL) 模型和 (b) 监督学习模型进行的对话示例,这些模型曾部署在 Google Assistant 中,展示了强化学习模型与监督学习方法在丰富对话转向方面的对比。

model which decomposes the decision by first selecting a subset of primitive actions, then choosing the actual action. These methods each model the state and action spaces using handcrafted semantic representations, such as slots and dialogue acts. This restricts such approaches to simple domains with limited slot-filling.

一种模型,通过首先选择一组基本动作子集,然后选择实际动作来分解决策。这些方法各自使用手工设计的语义表示(如槽位和对话行为)来建模状态和动作空间。这限制了这些方法仅适用于具有有限槽位填充的简单领域。

More recent work leverages deep neural networks (DNNs) to obviate the need for these so-called summary states and actions [9]. Fatemi et al. [8] consider a low-dimensional continuous-valued state space. Liu et al. [23] encode the dialogue state using an RNN but translate the representation to slot-value pairs. In both cases, the action space is restricted to a small number of dialogue acts and the approaches remain limited to specific task-based domains, with no clear extension to open-ended dialogues.

最近的研究利用深度神经网络 (DNNs) 来消除对这些所谓的摘要状态和动作的需求 [9]。Fatemi 等人 [8] 考虑了一个低维连续值状态空间。Liu 等人 [23] 使用 RNN 对对话状态进行编码,但将表示转换为槽值对。在这两种情况下,动作空间都被限制在少量的对话行为中,并且这些方法仍然局限于特定的任务型领域,没有明显的扩展到开放式对话的途径。

Building on advances in neural generative models, another line of work applies RL to directly learn a response generation model conditioned on the dialogue history. Actions are often defined at the word level, so the action space is the entire vocabulary [14, 19, 20, 33]. This approach suffers from several drawbacks: the action space is very large; word-level RL performs credit assignment poorly at an unnatural level for dialogue planning; and this may affect decoder performance, leading to incomprehensible utterances [45].

基于神经生成模型的进展,另一项工作应用强化学习(RL)直接学习基于对话历史的响应生成模型。动作通常在词级别定义,因此动作空间是整个词汇表 [14, 19, 20, 33]。这种方法存在几个缺点:动作空间非常大;词级别的强化学习在对话规划中在非自然级别上表现不佳;这可能会影响解码器的性能,导致生成难以理解的语句 [45]。

Related approaches model actions as latent variables, inducing a latent action space, thus decoupling the discourse-level decision making from NLG [30, 45]. However, they focus on specific domains (e.g., price negotiation, slot-filling or chitchat) rather than grounded open-ended dialogues. Serban et al. [32] proposed MILABOT, as part of the Amazon Alexa Prize competition, where a DM selects a response from several generated candidates. Our approach is similar in spirit, but allows one to combine several candidates in the same bot turn to compose a richer response. This seemingly small difference is vitally important as it allows our RL model to make decisions at an effective granularity. We note that while MILABOT was restricted to an A/B testing evaluation within a competition, our RL model is deployed in the Google Assistant.

相关方法将动作建模为潜在变量,引入潜在动作空间,从而将话语层面的决策与自然语言生成 (NLG) 解耦 [30, 45]。然而,这些方法专注于特定领域(例如价格谈判、槽填充或闲聊),而不是基于开放领域的对话。Serban 等人 [32] 提出了 MILABOT,作为 Amazon Alexa Prize 竞赛的一部分,其中对话管理器 (DM) 从多个生成的候选响应中选择一个。我们的方法在精神上与之相似,但允许在同一轮对话中组合多个候选响应,以生成更丰富的回复。这一看似微小的差异至关重要,因为它使我们的强化学习 (RL) 模型能够在有效的粒度上做出决策。我们注意到,虽然 MILABOT 仅限于竞赛中的 A/B 测试评估,但我们的 RL 模型已部署在 Google Assistant 中。

3 DYNAMIC COMPOSITION

3 动态组合

In this work, we build on the dynamic composition approach introduced by Szpektor et al. [38]. This dialogue management model limits the action space using specific content providers to propose candidate utterances, which are dynamically selected by the dialogue manager (DM). We adopt this scheme to manage action complexity in our RL approaches.

在本工作中,我们基于 Szpektor 等人 [38] 提出的动态组合方法进行构建。该对话管理模型通过特定的内容提供者来限制动作空间,以提出候选话语,这些话语由对话管理器 (DM) 动态选择。我们采用这一方案来管理我们强化学习方法中的动作复杂性。

Dynamic composition decouples content (utterance) generation from selection. Given candidate utterances proposed by each of several providers, the DM scores and selects a suitable utterance as (part of) a response given the current conversation history. The bot response can be composed of several utterances, which are generated and selected sequentially and dynamically (see Fig. 2). Additional components include an NLU module and a sentence fusion module that merges the selected utterances into a coherent response. We describe each of these components in turn.

动态组合将内容(话语)生成与选择解耦。给定由多个提供者提出的候选话语,对话管理器(DM)根据当前的对话历史对它们进行评分并选择合适的话语作为(部分)响应。机器人响应可以由多个话语组成,这些话语是顺序且动态生成和选择的(见图 2)。其他组件包括一个自然语言理解(NLU)模块和一个句子融合模块,后者将选定的语句合并为一个连贯的响应。我们将依次描述这些组件。
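The select-and-append composition loop described above can be sketched as follows. The provider/scorer interfaces and the explicit `<stop>` action are illustrative assumptions, not the production system; in the deployed bot the DM itself decides when the response is complete.

```python
# Sketch of the dynamic composition loop: providers propose candidate
# utterances, the DM scores and selects one, until the DM decides to stop.
from typing import Callable, List

def compose_response(
    history: List[str],
    providers: List[Callable[[List[str], List[str]], List[str]]],
    score: Callable[[List[str], List[str], str], float],
    max_utterances: int = 3,
    stop_token: str = "<stop>",
) -> List[str]:
    """Repeatedly gather candidates and let the DM pick one, until it stops."""
    selected: List[str] = []
    for _ in range(max_utterances):
        # Each provider proposes candidates given the history and the
        # partial response constructed so far in this turn.
        candidates = [c for p in providers for c in p(history, selected)]
        candidates.append(stop_token)  # the DM may end the turn here
        best = max(candidates, key=lambda c: score(history, selected, c))
        if best == stop_token:
            break
        selected.append(best)
    return selected

# Toy providers and a scorer that accepts up to two utterances, then stops.
fact = lambda h, s: ["Lions weigh 420 lbs."] if not s else []
quiz = lambda h, s: ["Can you guess how tall lions are?"]
scorer = lambda h, s, c: 0.5 if c == "<stop>" else (1.0 if len(s) < 2 else 0.0)
print(compose_response(["tell me about lions"], [fact, quiz], scorer))
```

The selected utterances would then be passed to sentence fusion to form the final bot response.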

Natural Language Understanding. At each turn of the conversation, user input is analyzed by an NLU module, comprising two components: a focus tracker and a user answer interpreter. The focus of the conversation is the set of entities currently being discussed. If the last turn ended with a bot question, the answer interpreter classifies the user answer w.r.t. the question type. For example, the response types expected for yes/no questions (e.g., “Do you want to hear more about cheetahs?”), are ‘yes’ (e.g., “sure,” “I want to hear about cheetahs”), ’no’, and ignoring the question; list selection questions (e.g., “Which animal do you want to hear about next?”)

自然语言理解。在对话的每一轮中,用户输入由一个自然语言理解(NLU)模块进行分析,该模块包括两个组件:焦点跟踪器和用户回答解释器。对话的焦点是当前正在讨论的实体集合。如果上一轮以机器人的问题结束,回答解释器会根据问题类型对用户回答进行分类。例如,对于是非问题(例如,“你想了解更多关于猎豹的信息吗?”),预期的回答类型包括“是”(例如,“当然”,“我想了解猎豹”)、“否”以及忽略问题;对于列表选择问题(例如,“接下来你想了解哪种动物?”)


Figure 2: Dynamic composition flow diagram. A few candidates from different providers are shown in each step and the one selected by the DM is highlighted in blue.

图 2: 动态组合流程图。每一步展示了来自不同提供者的几个候选方案,DM 选择的方案用蓝色高亮显示。

Table 1: Illustration of the concepts of focus, user answer interpretation and dialog acts for a sample conversation.

表 1: 示例对话中的焦点、用户回答解释和对话行为的说明。

| 说话者 | 对话 | 焦点 | 用户回答解释 | 对话行为 |
| --- | --- | --- | --- | --- |
| 用户 | 北极熊的声音 | 北极熊 | | |
| 机器人 | 这是北极熊的声音。 | 北极熊 | | |
| 机器人 | 嘿,你想听听企鹅的声音吗? | 企鹅 | | |
| 用户 | 是 | 企鹅 | 合作 | |
| 机器人 | 太好了。 | 企鹅 | | |
| 机器人 | 这是企鹅的声音。 | 企鹅 | | |
| 机器人 | 你想了解哪种动物? | | | |
| 用户 | 告诉我关于北极熊的事情 | 北极熊 | 合作 | |
| 机器人 | 酷。Churchillwild.com 说“北极熊想玩的时候会左右摇摆头部”。 | 北极熊 | | 事实 |
and quizzes (e.g.,“Can you guess which animal this is, a lion or a tiger?”) include animals as potential responses. Table 1 illustrates the concepts of focus and answer interpretation.

以及测验(例如,“你能猜出这是哪种动物吗,狮子还是老虎?”)的预期回答则包括动物。表 1 展示了焦点和答案解释的概念。
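A minimal, rule-based sketch of the answer interpreter for yes/no and list-selection questions. The keyword lists and the `ignored` label are illustrative assumptions; the production interpreter is a learned model.

```python
# Toy user-answer interpreter: classifies a user reply with respect to
# the type of question the bot asked in the previous turn.
def interpret_answer(question_type: str, answer: str, options=()):
    text = answer.lower()
    words = set(text.replace(",", " ").replace(".", " ").split())
    if question_type == "yes_no":
        if words & {"yes", "sure", "yeah", "ok", "okay"}:
            return "yes"
        if words & {"no", "nope", "nah"}:
            return "no"
        return "ignored"  # the user ignored the question
    if question_type == "list_selection":
        for option in options:  # e.g., the animal names the bot offered
            if option.lower() in text:
                return option
        return "ignored"
    raise ValueError(f"unknown question type: {question_type}")

print(interpret_answer("yes_no", "sure, I want to hear about cheetahs"))
print(interpret_answer("list_selection", "the polar bear please",
                       options=("penguin", "polar bear")))
```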

Content Providers. Given the conversation history and the selected utterances so far in the current turn, content providers propose candidate utterances w.r.t. this context. They rely on different sources (e.g., news, knowledge graph) to extract relevant content, which includes both structured and unstructured data. Structured providers generate text for their candidate utterances using templates, while unstructured content is quoted verbatim with attribution. Content is expressed in different forms via dialogue acts such as statements, preference queries, navigation questions and quizzes. Some of these, referred to as conversational drivers, aim to proactively increase user engagement (e.g., focus changing, questions). Examples of such dialogue acts are provided in Table 1.

内容提供者。根据对话历史和当前轮次中已选择的话语,内容提供者会针对这一上下文提出候选话语。它们依赖不同的来源(例如新闻、知识图谱)来提取相关内容,这些内容包括结构化和非结构化数据。结构化提供者使用模板为其候选话语生成文本,而非结构化内容则直接引用并注明出处。内容通过对话行为(如陈述、偏好查询、导航问题和测验)以不同形式表达。其中一些被称为对话驱动者,旨在主动提高用户参与度(例如焦点转移、提问)。表1中提供了此类对话行为的示例。

Dialogue Manager. In each step of the bot composition loop, providers generate candidates for the next utterance to be appended to the response constructed so far. Given a set of candidates and the conversation context, utterance selection is performed by a learned DM. This step is repeated until the DM assesses that the response is a relevant and engaging bot response. In [38], the DM is implemented as an RNN encoder, trained in a supervised fashion (see Sec. 4.2). In this work, we develop DMs trained using RL.

对话管理器 (Dialogue Manager)。在机器人组合循环的每一步中,提供者会生成候选话语,这些话语将被附加到目前已构建的响应中。给定一组候选话语和对话上下文,话语选择由学习型对话管理器执行。此步骤会重复,直到对话管理器评估出响应是相关且引人入胜的机器人响应。在 [38] 中,对话管理器被实现为一个 RNN 编码器,以监督学习的方式进行训练(见第 4.2 节)。在本工作中,我们开发了使用强化学习 (RL) 训练的对话管理器。

Sentence Fusion. The output of the composition loop is a sequence of utterances, which still needs to be fused into a coherent bot response. Simple concatenation of the utterances typically results in cumbersome, unnatural, verbose responses, such as “On average, male lions weigh 420 lbs. On average, male lions are 3.9 feet tall. That means that lions are about as tall as a piano.” Sentence fusion combines the selected utterances into a single cohesive response [2, 10, 25], such as “On average, male lions weigh 420 lbs and are 3.9 feet tall. That means that they’re about as tall as a piano.” This module uses the following techniques: (a) pronominalization, (b) removal of repetitive context mentions and (c) introduction of a discourse marker between sentences. Our fusion model is based on LaserTagger [24], a sequence labeling architecture in which each token in the input text is classified to be either copied as-is, deleted or substituted with a phrase taken from a small, predefined vocabulary, typically containing pronouns and connectives.

句子融合。组合循环的输出是一系列话语,这些话语仍需融合成一个连贯的机器人响应。简单地将话语连接起来通常会导致冗长、不自然的响应,例如“平均而言,雄性狮子的体重为420磅。平均而言,雄性狮子的身高为3.9英尺。这意味着狮子的身高大约与钢琴相当。”句子融合将选定的句子组合成一个连贯的响应 [2, 10, 25],例如“平均而言,雄性狮子的体重为420磅,身高为3.9英尺。这意味着它们的身高大约与钢琴相当。”该模块使用以下技术:(a) 代词化,(b) 删除重复的上下文提及,以及 (c) 在句子之间引入话语标记。我们的融合模型基于LaserTagger [24],这是一种序列标记架构,其中输入文本中的每个Token被分类为直接复制、删除或用取自小型预定义词汇表的短语替换,该词汇表通常包含代词和连接词。
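A LaserTagger-style fusion step can be illustrated by applying per-token edit tags. The tag vocabulary (`KEEP` / `DELETE` / `SWAP:<phrase>`) and the hand-written tags below are simplified stand-ins for the trained tagger, shown on the lion example above.

```python
# Apply LaserTagger-style edit tags to fuse two concatenated utterances.
def apply_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(token)          # copy the token as-is
        elif tag == "DELETE":
            continue                   # drop repeated context mention
        elif tag.startswith("SWAP:"):
            out.append(tag.split(":", 1)[1])  # substitute from vocabulary
        else:
            raise ValueError(f"unknown tag: {tag}")
    return " ".join(out)

tokens = ["On", "average,", "male", "lions", "weigh", "420", "lbs.",
          "On", "average,", "male", "lions", "are", "3.9", "feet", "tall."]
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "SWAP:lbs",
        "DELETE", "DELETE", "DELETE", "DELETE", "SWAP:and are", "KEEP",
        "KEEP", "KEEP"]
print(apply_tags(tokens, tags))
# "On average, male lions weigh 420 lbs and are 3.9 feet tall."
```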

4 REINFORCEMENT LEARNING FOR THE DM

4 针对 DM 的强化学习

Dynamic composition is realized by Szpektor et al. [38] with supervised training of an RNN-based DM. This limits the construction of the next bot response to be myopic, as it is optimized for maximal immediate reward. However, since the main goal of the DM is to conduct complex, engaging multi-turn conversations, it should target the natural complexity of human-to-human conversations, which are typically not conducted in a myopic, turn-by-turn manner, but rather reflect some degree of look ahead and dynamic planning. For example, a conversation may comprise several steps leading to an intended goal, such as knowledge transfer; or to make a conversation more engaging, one might intersperse interesting facts throughout, or build tension towards an eventual resolution. Such capabilities require a bot be able to choose responses that lead toward such non-myopic ends, and adapt to user responses/queries by dynamically re-planning its trajectory accordingly.

Szpektor 等人 [38] 通过监督训练基于 RNN 的对话管理器 (DM) 实现了动态组合。这种方法限制了下一个机器人响应的构建,使其变得短视,因为它被优化为追求最大的即时奖励。然而,由于 DM 的主要目标是进行复杂、引人入胜的多轮对话,它应该瞄准人与人之间对话的自然复杂性,这些对话通常不是以短视的、逐轮的方式进行,而是反映了一定程度的预见性和动态规划。例如,对话可能包含多个步骤,以达到预期的目标,如知识传递;或者为了使对话更具吸引力,可能会在其中穿插有趣的事实,或者为最终解决方案制造紧张感。这些能力要求机器人能够选择导向这些非短视目标的响应,并通过动态重新规划其轨迹来适应用户的响应/查询。

To this end, we develop an RL framework for open-ended dialogue that builds on the dynamic composition architecture, and propose a number of concrete instantiations of it. In Sec. 4.1, we formulate the underlying MDP that captures the spirit of the content-provider decomposition. In Sec. 4.2, we discuss the use of the underlying supervised model for state representation. We then propose a two-step Q-learning approach in Sec. 4.3, with algorithmic variants motivated by specific properties of the MDP and our representations. We also develop an end-to-end Q-learning method in Sec. 4.4 that does not require the language encoders for state generation.

为此,我们开发了一个基于动态组合架构的开放式对话强化学习(RL)框架,并提出了该框架的多个具体实例。在4.1节中,我们构建了一个捕捉内容提供者分解精神的底层马尔可夫决策过程(MDP)。在4.2节中,我们讨论了使用底层监督模型进行状态表示的方法。随后,在4.3节中,我们提出了一种两步Q学习方法,其算法变体受到MDP特定属性和我们表示方法的启发。此外,在4.4节中,我们还开发了一种端到端的Q学习方法,该方法不需要语言编码器来生成状态。

4.1 MDP with a Stochastic Action Space

4.1 具有随机动作空间的 MDP

We begin with an MDP formulation of the DM problem upon which our RL methods operate. We assume a state space $\mathcal{X}$, action space $\mathcal{A}$, transition kernel $P$, reward function $R$, initial state distribution $\beta$ and discount factor $\gamma$, and aim to optimize the cumulative discounted return $J(\pi) := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t} \mid P,R,\beta,\pi]$, which captures the long-term value of the conversation. The RL DM policy $\pi$ is an action distribution conditioned on state $x\in\mathcal{X}$. An optimal DM, $\pi^{*}$, is found by solving $\max_{\pi\in\Pi} J(\pi)$, where $\Pi$ is the space of all DM policies. We discuss each of these elements in turn.

我们从 DM(对话管理)问题的 MDP(马尔可夫决策过程)公式化开始,我们的强化学习方法基于此进行操作。我们假设状态空间为 $\mathcal{X}$,动作空间为 $\mathcal{A}$,转移核为 $P$,奖励函数为 $R$,初始状态分布为 $\beta$,折扣因子为 $\gamma$,目标是优化累积折扣回报 $J(\pi) := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t} \mid P,R,\beta,\pi]$,它捕捉了对话的长期价值。RL DM 策略 $\pi$ 是基于状态 $x\in\mathcal{X}$ 的动作分布。通过求解 $\max_{\pi\in\Pi} J(\pi)$ 来找到最优的 DM 策略 $\pi^{*}$,其中 $\Pi$ 是所有 DM 策略的空间。我们依次讨论这些元素。
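As a concrete illustration of the objective $J(\pi)$, a single sampled conversation contributes the discounted sum of its per-utterance rewards. A minimal sketch (the reward values are made up):

```python
# One sampled trajectory's contribution to J(π): the discounted sum of
# the per-utterance rewards r_t, weighted by gamma^t.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical crowd-sourced ratings for three selected utterances:
print(discounted_return([1.0, 0.5, 1.0], gamma=0.9))  # ≈ 2.26
```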

Much RL research in dialogue management defines the state and action spaces to be the tokenized language space [1, 13, 19]. For instance, the state $x$ is the tokenized user-agent conversation history to that point, while an action $a$ is the DM’s output sentence (generated token by token). However, since the state and action spaces in this formulation are both combinatorial, even with a medium-sized vocabulary the corresponding tokenized spaces grow exponentially, thus making RL intractable. We handle the combinatorics of state space by leveraging state-of-the-art language models, such as RNNs or transformers, to encode conversation history $x$ with a $d$-dimensional embedding $\phi_{x}\in\mathbb{R}^{d}$ (see Sec. 4.2 for details), thus replacing the large discrete state space by the continuous embedding space $\mathbb{R}^{d}$.

在对话管理的强化学习研究中,许多工作将状态和动作空间定义为 Token 化的语言空间 [1, 13, 19]。例如,状态 $x$ 是到当前为止的用户-智能体对话历史的 Token 化表示,而动作 $a$ 是对话管理模块输出的句子(逐个 Token 生成)。然而,由于这种定义下的状态和动作空间都是组合性的,即使使用中等规模的词汇表,相应的 Token 化空间也会呈指数级增长,从而使强化学习变得难以处理。我们通过利用最先进的语言模型(如 RNN 或 Transformer)来处理状态空间的组合性问题,将对话历史 $x$ 编码为一个 $d$ 维的嵌入 $\phi_{x}\in\mathbb{R}^{d}$(详见第 4.2 节),从而用连续的嵌入空间 $\mathbb{R}^{d}$ 替代了庞大的离散状态空间。

We also differ from typical RL dialogue models in our treatment of the action space. Rather than a generative language model that directly outputs sentences, we leverage the dynamic composition framework (Sec. 3) to render the action space tractable. Specifically, at any turn, each content provider proposes a small set of utterances. This dynamic-composition, content-provider (DCCP) action decomposition ensures the DM policy need only score and select from a (relatively small) discrete set $\mathcal{A}_{x}$ at state $x$ , i.e. the set of candidate utterances. Note that by working at the utterance level rather than at the level of the full bot response (a fused concatenation of K such utterances), we remove the small-scale combinatorics of the action space, at the cost of extending the horizon of the RL problem. But this does not sacrifice the optimality of the policy.

我们在动作空间的处理上也与典型的强化学习(RL)对话模型不同。我们不是直接输出句子的生成式语言模型,而是利用动态组合框架(第3节)来使动作空间变得易于处理。具体来说,在每一轮对话中,每个内容提供者会提出一小部分话语。这种动态组合、内容提供者(DCCP)动作分解确保了对话管理(DM)策略只需在状态 $x$ 下从一个(相对较小的)离散集合 $\mathcal{A}_{x}$ 中评分和选择,即候选话语的集合。需要注意的是,通过在话语级别而不是完整机器人响应(K个话语的融合拼接)级别工作,我们消除了动作空间的小规模组合问题,但代价是延长了RL问题的时间范围。但这并不会牺牲策略的最优性。

Importantly, since the providers use potentially arbitrary logic to generate candidates, the realization of $\mathcal{A}_{x}$ may differ with each occurrence of $x$ . This puts us in the realm of non-standard MDPs with stochastic action sets [3]. Fortunately, we consider Q-learning methods below that handle this directly.

重要的是,由于提供者可能使用任意的逻辑来生成候选,$\mathcal{A}_{x}$ 的实现可能会随着每次 $x$ 的出现而不同。这使我们进入了具有随机动作集的非标准 MDP (Markov Decision Process) 领域 [3]。幸运的是,我们在下面考虑的 Q-learning 方法可以直接处理这种情况。

The training data, further described in Sec. 4.5, is composed of crowd-sourced conversations, generated with a supervised DM. The human evaluators provide a rating for each selected utterance, which are used as rewards that measure the rater’s immediate value of the action given the conversation history. As we shall see, the RL model is able to leverage these to learn dynamic planning strategies to improve user engagement.

训练数据(在第4.5节中进一步描述)由众包对话组成,这些对话是通过有监督的对话管理器(DM)生成的。人类评估者为每个选定的语句提供评分,这些评分被用作奖励,用于衡量评估者在给定对话历史的情况下对动作的即时价值。正如我们将看到的,强化学习(RL)模型能够利用这些奖励来学习动态规划策略,从而提高用户参与度。

In the existing dialogue RL literature [6, 19, 22] most algorithms are based on policy-gradient methods [37] because (i) learning an optimal state-action value function with a combinatorial action space is intractable, and (ii) the resulting DM is a sequence-tosequence policy model that generates bot responses. The simplification of the action space afforded by our DCCP reformulation allows us to use value-based Q-learning [3, 27], a relatively uncommon approach in large-scale dialogue systems.

在现有的对话强化学习文献 [6, 19, 22] 中,大多数算法都基于策略梯度方法 [37],因为 (i) 在组合动作空间中学习最优状态-动作值函数是不可行的,(ii) 生成的对话管理器 (DM) 是一个序列到序列的策略模型,用于生成机器人的响应。我们的 DCCP 重构简化了动作空间,使我们能够使用基于值的 Q 学习 [3, 27],这是在大规模对话系统中相对不常见的方法。

4.2 Supervised Model as a State Encoder

4.2 监督模型作为状态编码器

To encode the conversation history into a $d$ -dimensional embedding vector, we consider two supervised learning architectures: an improved version of the RNN model from [38] and a transformerbased approach using a pre-trained BERT model.

为了将对话历史编码为 $d$ 维嵌入向量,我们考虑了两种监督学习架构:一种是从 [38] 改进的 RNN 模型,另一种是基于预训练 BERT 模型的 Transformer 方法。

Supervised RNN Architecture. We modify the two-level hierarchical RNN encoder [38] by replacing the first-level gated recurrent unit (GRU), that encodes the user and bot utterances, by a pre-trained sentence embedding module [5], which is fixed during training. These embeddings provide a clear advantage over the first-level GRU, which we attribute to training on a large, general corpus. The resulting sentence embeddings are fed to a GRU along with non-textual metadata features, including the conversation turn index, the candidate dialogue act, the number of tokens in the constructed response, and whether the candidate offers to change the focus.

监督RNN架构。我们通过将第一层的门控循环单元(GRU)替换为预训练的句子嵌入模块 [5] 来修改两级分层RNN编码器 [38],该模块在训练期间是固定的。这些句子嵌入模块相比第一层GRU具有明显优势,我们将其归因于在大型通用语料库上的训练。生成的句子嵌入与非文本元数据特征一起输入到GRU中,这些特征包括:对话轮次索引、候选对话行为、构建响应中的Token数量、候选是否提供改变焦点的选项。

Supervised Transformer-based Architecture. Our second supervised model uses a transformer architecture [40]. We consider two variants: a text-only BERT model and a combined text-metadata model. In both, the input is the concatenated sequence of user and bot utterances (i.e., conversation history) and a candidate utterance to be scored. In the second, we concatenate the per-token contextual representation vectors produced by the BERT model and an embedded representation of the metadata features for the utterance of which the token is a part. The resulting concatenated vectors are fed into another small transformer. Experiments on manually annotated data show the text-only variant outperforms the second, corroborating the hypothesis that the transformer’s better use of the input text—specifically, the ability to attend to the history when processing candidates—obviates the need for additional features.

基于监督学习的 Transformer 架构。我们的第二个监督模型使用了 Transformer 架构 [40]。我们考虑了两种变体:一种是仅文本的 BERT 模型,另一种是结合文本和元数据的模型。在这两种模型中,输入都是用户和机器人对话的拼接序列(即对话历史)以及待评分的候选对话。在第二种模型中,我们将 BERT 模型生成的每个 Token 的上下文表示向量与 Token 所属对话的元数据特征的嵌入表示进行拼接。生成的拼接向量被输入到另一个小型 Transformer 中。在手动标注数据上的实验表明,仅文本的变体优于第二种模型,这验证了 Transformer 更好地利用输入文本(特别是在处理候选对话时能够关注历史对话)的假设,从而无需额外的特征。

State Representation. Our state representation uses the output of the dialogue-history encoder, either the RNN hidden state or the pooled output of the last transformer layer. The state is constructed by concatenating this encoding with the sentence embedding of the last user input and context features, e.g., the conversation turn index and the composition turn step index.

状态表示。我们的状态表示使用对话历史编码器的输出,即 RNN 的隐藏状态或最后一个 Transformer 层的池化输出。状态是通过将此编码与最后一个用户输入的句子嵌入以及上下文特征(例如,对话轮次索引和组合轮次步骤索引)连接起来构建的。
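The state construction described above can be sketched as a simple concatenation. The embedding dimensions and the two context features below are illustrative assumptions, not the deployed configuration.

```python
import numpy as np

# RL state = dialogue-history encoding ++ last user-utterance embedding
# ++ scalar context features (turn index, composition step index).
def build_state(history_encoding, user_embedding, turn_index, step_index):
    context = np.array([turn_index, step_index], dtype=np.float32)
    return np.concatenate([history_encoding, user_embedding, context])

# Hypothetical dimensions: 256-d history encoding, 128-d sentence embedding.
state = build_state(np.zeros(256, np.float32), np.zeros(128, np.float32),
                    turn_index=4, step_index=1)
print(state.shape)  # (386,)
```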

Figure 3: Two-step Q-learning schema. The state is the concatenation of (i) the output of the dialogue history encoder, either RNN or transformer (denoted by (1)), (ii) the embedding of the last user input and (iii) context features, including the conversation turn index and the composition turn step index. The action is represented by its embedding. We consider both stochastic-action and continuous-action Q-learning approaches (denoted by (2)), potentially with the added CQL regularization (denoted by (3)).

图 3: 两步 Q-learning 架构。状态由以下三部分拼接而成:(i) 对话历史编码器(RNN 或 Transformer)的输出(记为 (1)),(ii) 最后用户输入的嵌入表示,以及 (iii) 上下文特征,包括对话轮次索引和组合轮次步骤索引。动作由其嵌入表示。我们考虑了随机动作和连续动作 Q-learning 方法(记为 (2)),并可能增加 CQL 正则化(记为 (3))。

4.3 The Two Step Q-model Architecture

4.3 两步 Q 模型架构

We now develop several RL approaches for the DM which rely on Q-learning. Our first approaches use a two-step model in which the state is first encoded by a language model (either a pre-trained RNN or transformer) before being passed to the DM policy. Figure 3 illustrates how these building blocks come together in the two-step approach. Given a pre-trained state encoder $\phi:\mathcal{X}\to\mathbb{R}^{d}$ and a sentence encoder $\psi:\mathcal{A}\to\mathbb{R}^{h}$, we apply two different Q-learning techniques using the encoded state space (using $\phi_{x}$ rather than $x$) and action space (using $\psi_{a}$ rather than $a$).

我们现在为 DM 开发了几种基于 Q-learning 的强化学习方法。我们的第一种方法使用两步模型,其中状态首先由语言模型(预训练的 RNN 或 Transformer)编码,然后再传递给 DM 策略。图 3 展示了这些构建块如何在两步方法中结合。给定一个预训练的状态编码器 $\phi:\mathcal{X}\to\mathbb{R}^{d}$ 和一个句子编码器 $\psi:\mathcal{A}\to\mathbb{R}^{h}$,我们使用编码后的状态空间(使用 $\phi_{x}$ 而不是 $x$)和动作空间(使用 $\psi_{a}$ 而不是 $a$)应用两种不同的 Q-learning 技术。

Stochastic Action Q-learning (SAQL) [3]. Our first RL technique applies Q-learning directly to the discrete, stochastic action sets $\mathcal{A}_{x}$ as determined by the DCCP decomposition. We adopt the general deep Q-network (DQN) approach [27], using a DNN to represent the Q-function. Specifically, $Q_{\theta}:\mathbb{R}^{d}\times\mathbb{R}^{h}\rightarrow\mathbb{R}$ is a feed-forward DNN with parameters $\theta$, which represents the cumulative discounted value of taking action (or bot utterance) $\psi_{a}\in\mathbb{R}^{h}$ in state (i.e., conversation history encoding) $\phi_{x}\in\mathbb{R}^{d}$. The parameters are trained by minimizing the Bellman error over a batch $B$ of transitions $(x_{i},a_{i},r_{i},x_{i}^{\prime})$:

随机动作 Q 学习 (Stochastic Action Q-learning, SAQL) [3]。我们的第一个强化学习技术直接将 Q 学习应用于由 DCCP 分解确定的离散随机动作集 $\mathcal{A}_{x}$。我们采用通用的深度 Q 网络 (Deep Q Network, DQN) 方法 [27],使用深度神经网络 (DNN) 来表示 Q 函数。具体来说,$Q_{\theta}:\mathbb{R}^{d}\times\mathbb{R}^{h}\rightarrow\mathbb{R}$ 是一个具有参数 $\theta$ 的前馈 DNN,它表示在状态(即对话历史编码)$\phi_{x}\in\mathbb{R}^{d}$ 中采取动作(或机器人话语)$\psi_{a}\in\mathbb{R}^{h}$ 的累积折扣价值。参数通过在转移样本批次 $B$(样本为 $(x_{i},a_{i},r_{i},x_{i}^{\prime})$)上最小化 Bellman 误差来训练:

$$
\min_{\theta}\sum_{i=1}^{|B|}\Big(Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})-r_{i}-\gamma\max_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\Big)^{2},
$$

$$
\min_{\theta}\sum_{i=1}^{|B|}\Big(Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})-r_{i}-\gamma\max_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\Big)^{2},
$$

where $Q_{\theta^{\mathrm{target}}}$ is a target Q-function, used to improve training stability in DQN [26] (note the use of the realized action set $\mathcal{A}_{x_{i}^{\prime}}$ in the maximization). Under this loss, RL is $\ell_{2}$-regression of $Q_{\theta}$ w.r.t. target labels $r_{i}+\gamma\max_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})$, which tries to match the value function $Q_{\theta}$ with its Bellman backup.

其中 $Q_{\theta^{\mathrm{target}}}$ 是目标 $Q$ 函数,用于提高 DQN [26] 的训练稳定性(注意在最大化时使用了实现的动作集 $\mathcal{A}_{x_{i}^{\prime}}$)。在此损失下,RL 是 $Q_{\theta}$ 相对于目标标签 $r_{i}+\gamma\operatorname*{max}_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})$ 的 $\ell_{2}$ 回归,试图将值函数 $Q_{\theta}$ 与其 Bellman 备份匹配。

We refer to this approach as stochastic action $Q$-learning (SAQL) to reflect the stochastic action sets used in training. Once SAQL converges, the DM policy is $\pi^{*}(x)\in\arg\operatorname*{max}_{a\in\mathcal{A}_{x}}Q_{\theta^{*}}(\phi_{x},\psi_{a})$. That is, at inference time, the $Q$-model is applied to each candidate action, and the DM responds with the action with the greatest Q-value given the current dialogue state.

我们将这种方法称为随机动作 Q 学习 (Stochastic Action Q-learning, SAQL),以反映训练中使用的随机动作集。一旦 SAQL 收敛,DM 策略为 $\pi^{*}(x)\in\arg\operatorname*{max}_{a\in\mathcal{A}_{x}}Q_{\theta^{*}}(\phi_{x},\psi_{a})$。也就是说,在推理时,Q 模型应用于每个候选动作,DM 根据当前对话状态返回具有最大 Q 值的动作。
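For illustration, one term of the SAQL loss above can be sketched with a toy linear $Q$-function; the dimensions, weights, and transition below are hypothetical stand-ins for the paper's DNN and logged data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 8  # state / action embedding sizes (illustrative)

# Toy linear Q-function: Q_theta(phi_x, psi_a) = w . [phi_x; psi_a]
w = rng.normal(size=d + h) * 0.1
w_target = w.copy()  # target-network weights, synced periodically in DQN

def q(weights, phi_x, psi_a):
    return float(np.concatenate([phi_x, psi_a]) @ weights)

def saql_td_target(r, gamma, phi_x_next, candidates):
    # Maximize only over the *realized* stochastic action set A_{x'}
    return r + gamma * max(q(w_target, phi_x_next, psi) for psi in candidates)

# One logged transition (x, a, r, x') plus the next turn's candidate utterances
phi_x, psi_a = rng.normal(size=d), rng.normal(size=h)
phi_x_next = rng.normal(size=d)
next_candidates = [rng.normal(size=h) for _ in range(5)]

y = saql_td_target(r=1.0, gamma=0.95, phi_x_next=phi_x_next, candidates=next_candidates)
td_sq_error = (q(w, phi_x, psi_a) - y) ** 2  # one summand of the l2-regression loss
```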

Continuous Action Q-learning (CAQL) [29]. In the SAQL formulation, action maximization takes place over the discrete set of candidates $\mathcal{A}_{x}$. However, the embedding representation means that the action space can also be treated as continuous, and we can consider maximization over this wider space. Continuous-action RL problems are common in areas like robotics [15], and typically policy-gradient algorithms are used to learn a return-maximizing policy [34, 37]. However, such methods are often data-inefficient and impractical when faced with high-dimensional action spaces, both issues present in dialogue systems. Instead, we consider the use of continuous action $Q$-learning (CAQL) [29] to solve the continuous-action variant of our DM policy.

连续动作 Q 学习 (Continuous Action Q-learning, CAQL) [29]。在 SAQL 的公式中,动作最大化是在离散候选集 $\mathcal{A}_{x}$ 上进行的。然而,嵌入表示意味着动作空间也可以被视为连续的,我们可以考虑在这个更广泛的空间上进行最大化。连续动作的强化学习问题在机器人等领域很常见 [15],通常使用策略梯度算法来学习回报最大化的策略 [34, 37]。然而,面对高维动作空间时,这些方法通常数据效率低下且不切实际,而对话系统中也存在这些问题。相反,我们考虑使用连续动作 Q 学习 (CAQL) [29] 来解决我们的对话管理策略的连续动作变体。

Roughly speaking, using CAQL, when faced with a next state $x^{\prime}$ while training the Q-function $Q_{\theta}$, we do not restrict ourselves to maximizing over the discrete action set $\mathcal{A}(x^{\prime})$, but instead maximize over the entire embedding space $\psi$, minimizing:

粗略地说,使用 $\mathrm{CAQL}$ 时,在训练 Q 函数 $Q_{\theta}$ 时,面对下一个状态 $x^{\prime}$,我们不会限制自己在离散动作集 $\mathcal{A}(x^{\prime})$ 上最大化,而是在整个嵌入空间 $\psi$ 上最大化,最小化:

$$
\operatorname*{min}_{\theta}\sum_{i=1}^{|B|}\Big(r_{i}+\gamma Q_{\theta^{\mathrm{target}}}\big(\phi_{x_{i}^{\prime}},\arg\operatorname*{max}_{\psi}Q_{\theta}(\phi_{x_{i}^{\prime}},\psi)\big)-Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})\Big)^{2}.
$$

This approach has advantages over SAQL: one need not record the realization $\mathcal{A}_{x^{\prime}}$ of the stochastic action sets in the data set, and continuous action maximization (see below) can be more effective when the set of candidate actions (utterances) is moderate or large in size. However, CAQL will generally overestimate the true value of its policy, since it hypothesizes the use of embedded actions that are never generated by any content provider. Indeed, once $Q_{\theta}$ is trained using CAQL, we restrict the realized policy to scoring (and using) only provider-generated candidate actions at inference/serving time.

这种方法相较于 SAQL 具有优势:无需在数据集中记录随机动作集 $\mathcal{A}_{x^{\prime}}$ 的实现,并且在候选动作(话语)集规模适中或较大时,连续动作最大化(见下文)可能更为有效。然而,CAQL 通常会高估其策略的真实价值,因为它假设使用了任何内容提供商都未生成的嵌入动作。实际上,一旦使用 CAQL 训练了 $Q_{\theta}$,我们在推理/服务时会将实现的策略限制为仅对提供商生成的候选动作进行评分(和使用)。

When $Q_{\theta}$ is represented by a DNN, the inner maximization is typically differentiable but non-convex. It can be solved optimally for certain classes of DNNs using a mixed-integer program, or approximately using a first-order method such as gradient ascent (GA) [29]. We use GA in this work: starting from an initial embedded action $\psi^{(0)}$, the optimal embedded action $\arg\operatorname*{max}_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)$ is computed iteratively by $\psi^{(t+1)}\gets\psi^{(t)}+\epsilon_{\mathrm{GA}}\nabla_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)\big|_{\psi=\psi^{(t)}}$, where $\epsilon_{\mathrm{GA}}>0$ is a tunable step size.

当 $Q_{\theta}$ 由深度神经网络 (DNN) 表示时,内部最大化问题通常是可微但非凸的。对于某些类别的 DNN,可以使用混合整数规划来最优地求解,或使用一阶方法(如梯度上升 (GA) [29])来近似求解。在本工作中,我们使用 GA:从初始嵌入动作 $\psi^{(0)}$ 开始,通过迭代 $\psi^{(t+1)}\gets\psi^{(t)}+\epsilon_{\mathrm{GA}}\nabla_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)\big|_{\psi=\psi^{(t)}}$ 来计算最优嵌入动作 $\arg\operatorname*{max}_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)$,其中 $\epsilon_{\mathrm{GA}}>0$ 是可调的步长。
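The GA inner maximization can be sketched on a toy concave $Q$ (a hypothetical quadratic, chosen so the maximizer is known in closed form); the paper's $Q_{\theta}$ is a DNN, for which GA only finds a local optimum.

```python
import numpy as np

rng = np.random.default_rng(1)
d = h = 4
M = rng.normal(size=(h, d))  # defines the toy Q below (hypothetical)
phi = rng.normal(size=d)     # next-state embedding phi_{x'}

def q(phi_x, psi):
    # Concave toy Q with known maximizer psi* = M @ phi_x
    return -float(np.sum((psi - M @ phi_x) ** 2))

def grad_psi(phi_x, psi):
    return -2.0 * (psi - M @ phi_x)

# CAQL inner maximization: gradient ascent over the action embedding psi
psi, eps_ga = np.zeros(h), 0.1
for _ in range(200):
    psi = psi + eps_ga * grad_psi(phi, psi)  # psi^(t+1) = psi^(t) + eps * grad Q
```

For this concave toy objective, the iterates converge geometrically to the known maximizer $M\phi$.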

Conservative Q-learning (CQL) [17]. Our DM problem is an application of offline RL, where a model is learned using previously collected user-bot conversations with no further (online) interaction. Offline RL is prone to overestimation errors induced by the distributional shift between the offline data and that generated by the learned policy [43]. This is especially problematic if certain bot actions are rare in the offline data, making their learned $Q$-values very noisy. To alleviate this, we can apply conservative $Q$-learning (CQL) [17], a regularization scheme which learns a “conservative”

保守 Q 学习 (CQL) [17]。我们的 DM 问题是一个离线强化学习的应用,其中模型是通过之前收集的用户-机器人对话进行学习的,没有进一步的(在线)交互。离线强化学习容易受到离线数据与学习策略生成的数据之间的分布偏移引起的过高估计误差的影响 [43]。如果某些机器人动作在离线数据中很少见,这尤其成问题,因为它们学习到的 $Q$ 值会非常嘈杂。为了缓解这个问题,我们可以应用保守 $Q$ 学习 (CQL) [17],这是一种正则化方案,它学习一个“保守的”

$Q$-function that lower bounds the true Q-function. CQL can be applied to both SAQL and CAQL (we illustrate it only for SAQL). In CQL one augments the Q-learning loss with a behavior regularizer:
$$
\operatorname*{min}_{\theta}\sum_{i=1}^{|B|}\alpha\big(\mathbb{E}_{a\sim\mu}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]-\mathbb{E}_{a\sim\pi_{\beta}}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]\big)+\Big(r_{i}+\gamma Q_{\theta^{\mathrm{target}}}\big(\phi_{x_{i}^{\prime}},\arg\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x_{i}^{\prime})}Q_{\theta}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\big)-Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})\Big)^{2},
$$
where $\pi_{\beta}$ is a behavior policy (DM) that approximates the data-generating policy, $\alpha>0$ is a tunable regularization parameter, and $\mu$ is the target policy to be learned. Intuitively, the CQL regularization minimizes the difference between the Q-values of actions generated by our learned RL DM policy and those of the behavior (training-data generating) policy. We use target $\mu(a|x)\propto\exp(Q_{\theta}(\phi_{x},\psi_{a}))$, which corresponds to the optimal policy of entropy-regularized Q-learning [31].

$Q$ 函数,它是真实 Q 函数的下界。CQL 可以应用于 SAQL 和 CAQL(我们仅以 SAQL 为例进行说明)。在 CQL 中,我们通过行为正则化器来增强 Q 学习损失:
$$
\operatorname*{min}_{\theta}\sum_{i=1}^{|B|}\alpha\big(\mathbb{E}_{a\sim\mu}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]-\mathbb{E}_{a\sim\pi_{\beta}}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]\big)+\Big(r_{i}+\gamma Q_{\theta^{\mathrm{target}}}\big(\phi_{x_{i}^{\prime}},\arg\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x_{i}^{\prime})}Q_{\theta}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\big)-Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})\Big)^{2},
$$
其中 $\pi_{\beta}$ 是近似数据生成策略的行为策略 (DM),$\alpha>0$ 是一个可调的正则化参数,$\mu$ 是要学习的目标策略。直观上,CQL 正则化最小化了由我们学习的 RL DM 策略生成的动作与行为(训练数据生成)策略生成的动作之间的 Q 值差异。我们使用目标 $\mu(a|x)\propto\exp(Q_{\theta}(\phi_{x},\psi_{a}))$,它对应于熵正则化 Q 学习的最优策略 [31]。
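The CQL regularizer over a finite candidate set can be sketched as follows; the Q-values are made up, `mu` is the softmax target policy $\mu(a|x)\propto\exp(Q)$, and the behavior expectation is approximated here by a point mass on the single logged action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def cql_penalty(q_values, logged_idx, alpha=1.0):
    """alpha * (E_{a~mu}[Q] - E_{a~pi_beta}[Q]) over one state's candidate set."""
    mu = softmax(q_values)                # target policy mu(a|x) proportional to exp(Q)
    e_mu = float(mu @ q_values)           # expectation under mu
    e_beta = float(q_values[logged_idx])  # point-mass approximation of pi_beta
    return alpha * (e_mu - e_beta)

q_vals = np.array([1.0, 2.0, 0.5])  # made-up Q-values for 3 candidates
# The penalty pushes Q down on actions mu favors relative to what the data took:
pen_logged_best = cql_penalty(q_vals, logged_idx=1)   # logged action has max Q
pen_logged_worst = cql_penalty(q_vals, logged_idx=2)  # logged action has min Q
```

When the logged action already has the highest Q-value the penalty is negative (no push-down); when the data took a low-valued action the penalty is positive, shrinking the Q-values of unseen actions.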

4.4 End-to-end Architecture

4.4 端到端架构

We now outline an end-to-end (E2E) RL approach that jointly trains the language encoder and the $Q$-function. In contrast to our two-step approaches, by not constraining the DM to using a pre-trained encoder, E2E RL can tune the encoder (hence its representations) to the dialogue task at hand. This approach is similar in spirit to the original DQN model [27], in which the $Q$-network consists of both a convolutional DNN that encodes pixel frames (states) and a feed-forward NN that learns the Q-values.

我们现在概述一种端到端 (E2E) 强化学习方法,该方法联合训练语言编码器和 $\boldsymbol{Q}$ 函数。与我们的两步方法不同,E2E RL 不限制对话管理器使用预训练的编码器,因此可以针对当前对话任务调整编码器(从而调整其表示)。这种方法在精神上与原始的 DQN 模型 [27] 相似,其中 $\boldsymbol{Q}$ 网络由编码像素帧(状态)的卷积 DNN 和学习 Q 值的前馈神经网络组成。

To learn the $Q$-function in E2E fashion, we apply DQN to $Q(x,a)=Q_{\theta}(c(x,a))$, where $c(x,a)$ is the concatenation of the conversation history and the current candidate action, and $Q_{\theta}:X\rightarrow\mathbb{R}$ is a trainable language encoder (e.g., a transformer initialized with pre-trained weights), followed by a feed-forward DNN. This $Q$-model jointly encodes the raw input conversation and assigns a $Q$-value to each candidate action. This allows us to learn the $Q$-function E2E, without relying on fixed pre-trained language encoders. Specifically, with the target network $Q_{\theta^{\mathrm{target}}}$ updated as above, in E2E learning we train $Q_{\theta}$ by minimizing the mean squared Bellman error. We use SAQL and formulate the inner maximization as $\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x^{\prime})}Q_{\theta^{\mathrm{target}}}(x^{\prime},a^{\prime})$.

为了以端到端 (E2E) 的方式学习 $Q$ 函数,我们将 DQN 应用于 $Q(x,a)=Q_{\theta}(c(x,a))$,其中 $c(x,a)$ 是对话历史和当前候选动作的连接,$Q_{\theta}:X\rightarrow\mathbb{R}$ 是一个可训练的语言编码器(例如,使用预训练权重初始化的 Transformer),后接一个前馈深度神经网络 (DNN)。这个 $Q$ 模型联合编码原始输入对话,并为每个候选动作分配一个 $Q$ 值。这使得我们能够以端到端的方式学习 $Q$ 函数,而不依赖于固定的预训练语言编码器。具体来说,在目标网络 $Q_{\theta^{\mathrm{target}}}$ 如上所述更新的情况下,在端到端学习中,我们通过最小化均方贝尔曼误差来训练 $Q_{\theta}$。我们使用 SAQL 并将内部最大化问题表述为 $\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x^{\prime})}Q_{\theta^{\mathrm{target}}}(x^{\prime},a^{\prime})$。
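A minimal sketch of this E2E scoring scheme, with a deterministic bag-of-trigrams stand-in for the trainable transformer encoder (the encoder, dimensions, and candidate utterances are all illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 64  # encoder output size (illustrative)

def encode(text):
    # Deterministic bag-of-character-trigrams "encoder": a stand-in for the
    # trainable transformer; NOT part of the paper's method.
    v = np.zeros(D)
    for i in range(len(text) - 2):
        v[sum(ord(c) for c in text[i:i + 3]) % D] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# Feed-forward head on top of the encoding of the concatenation c(x, a)
W1 = rng.normal(size=(D, 16)) * 0.1
w2 = rng.normal(size=16) * 0.1

def q_e2e(history, candidate):
    z = np.maximum(encode(history + " [SEP] " + candidate) @ W1, 0.0)  # ReLU layer
    return float(z @ w2)

candidates = ["Lions roar loudly.", "Do you want a quiz?", "Elephants are big."]
scores = [q_e2e("Tell me about lions", a) for a in candidates]
best = candidates[int(np.argmax(scores))]  # the DM's selected utterance
```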

4.5 Training Data

4.5 训练数据

The DM models are trained on crowd-sourced data, generated by human evaluators. Each evaluator converses with the bot until the dialogue derails or comes to a natural end. They then rate the bot responses, assessing each utterance in the composition loop, including those selected and unselected by the DM. Although evaluators were provided with a set of guidelines for assessing bot response quality, the resulting data is noisy and some level of rater-specific subjectivity is included in the ratings.

DM模型在由人类评估者生成的众包数据上进行训练。每个评估者与机器人对话,直到对话脱轨或自然结束。然后他们对机器人的回答进行评分,评估组合循环中的每个话语,包括被DM选中和未选中的那些。尽管评估者被提供了一套评估机器人回答质量的指南,但生成的数据仍然存在噪声,评分中包含了一定程度的评估者主观性。

A dozen evaluators generated ${\sim}20\mathrm{K}$ conversations with an average of 3 bot responses, each with 1 to 4 utterances, and with up to 30 candidates per utterance. For the supervised models, this results in ${\sim}1.5\mathrm{M}$ training examples, as each (selected and unselected) candidate corresponds to a training example. By contrast, RL models only use the labels on selected candidates, giving 150K labels.

十几名评估者生成了约 20K 次对话,平均每次对话有 3 个机器人响应,每个响应包含 1 到 4 句话,每句话最多有 30 个候选回复。对于监督模型来说,这产生了约 1.5M 个训练样本,因为每个(选中和未选中的)候选回复都对应一个训练样本。相比之下,强化学习模型仅使用选中候选回复的标签,因此有 150K 个标签。
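A back-of-envelope check ties these counts together; the per-response utterance and per-utterance candidate averages below are assumptions inferred from the stated ranges, not reported values:

```python
# Counts reported in Sec. 4.5; the two averages are assumptions chosen to be
# consistent with the stated ranges ("1 to 4 utterances", "up to 30 candidates").
conversations = 20_000
responses_per_conv = 3
utterances_per_response = 2.5   # assumed average in [1, 4]
candidates_per_utterance = 10   # assumed average, at most 30

rl_labels = conversations * responses_per_conv * utterances_per_response
supervised_examples = rl_labels * candidates_per_utterance  # every candidate labeled
```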

Each candidate utterance is rated on a scale of -3 to 7, with no 0 rating made available. The negative ratings reflect candidates that do not reply to a user question, are out of context, or repeat content that was already mentioned in the conversation. The positive scores correspond to candidates that fit the conversation context well.

每个候选话语的评分范围为-3到7,不提供0分。负分表示候选话语没有回应用户问题、脱离上下文或重复对话中已提及的内容。正分则表示候选话语与对话上下文契合良好。

5 INITIAL OFFLINE & ONLINE EVALUATION

5 初始离线与在线评估

Before deploying our models in a live experiment, we conducted preliminary evaluation of our RL-based DM policies. We describe both (i) off-policy counterfactual inference [21, 39] evaluation and (ii) on-policy human (rater) evaluation. Off-policy evaluation can be performed on the existing datasets used to train our models (in our case, generated with the supervised DM). While often easier—hence especially useful for initial model development and tuning—it is less reliable than on-policy evaluation. So we use both methods.

在将我们的模型部署到实际实验之前,我们对基于强化学习 (RL) 的对话管理 (DM) 策略进行了初步评估。我们描述了 (i) 离策略反事实推理 [21, 39] 评估和 (ii) 在策略人类(评分者)评估。离策略评估可以在用于训练我们模型的现有数据集上进行(在我们的案例中,使用监督式 DM 生成)。虽然通常更容易——因此特别适用于初始模型开发和调优——但它比在策略评估的可靠性较低。因此,我们同时使用了这两种方法。

DM Models. We evaluate the following variants of our SAQL and CAQL algorithms, using either the supervised RNN or transformer for state representation, with or without CQL regularization. SAQL-RNN, CAQL-RNN, SAQL-Transformer, and CAQL-Transformer are the Q-learned models trained with the two-step RL approaches SAQL and CAQL using RNN and transformer encoders, respectively. SAQL-Reg-RNN, CAQL-Reg-RNN, SAQL-Reg-Transformer, and CAQL-Reg-Transformer are the same Q-learned models trained with CQL regularization. SAQL-Reg-E2E denotes the E2E Q-learned model with CQL regularization.

DM 模型。我们评估了以下 SAQL 和 CAQL 算法的变体,使用监督 RNN 或 Transformer 进行状态表示,带有或不带有 CQL 正则化。SAQL-RNN、CAQL-RNN、SAQL-Transformer 和 CAQL-Transformer 是使用两步 RL 方法 SAQL 和 CAQL 分别通过 RNN 和 Transformer 编码器训练的 Q-learning 模型。SAQL-Reg-RNN、CAQL-Reg-RNN、SAQL-Reg-Transformer 和 CAQL-Reg-Transformer 是使用 CQL 正则化训练的相同 Q-learning 模型。SAQL-Reg-E2E 表示带有 CQL 正则化的端到端 Q-learning 模型。

The RNN architecture includes a GRU layer with 200 units. The supervised transformer model uses the publicly available, pre-trained BERT-Medium checkpoint. The training regime roughly follows that for BERT. For RL models, we use a discount factor $\gamma=0.95$ for SAQL, $\gamma=0.9$ for CAQL and CQL-Reg, and $\gamma=0.8$ for E2E RL, and a fully connected feed-forward DNN for the $Q$-function. Hyper-parameters are provided in the appendix.

RNN 架构包括一个具有 200 个单元的 GRU 层。监督式 Transformer 模型使用了公开可用的预训练 BERT-Medium 检查点。训练方案大致遵循 BERT 的训练方案。对于强化学习模型,我们为 SAQL 使用折扣因子 $\gamma=0.95$,为 CAQL 和 CQL-Reg 使用 $\gamma=0.9$,为 E2E RL 使用 $\gamma=0.8$,并为 $Q$ 函数使用全连接前馈 DNN。超参数在附录中提供。

DM Off-policy Evaluation. The main goal of off-policy evaluation is to assess the performance of our RL-based DM policy using existing conversational data. Unlike (supervised) myopic models, RL requires evaluating the reward of full trajectories. Since an RL-based DM can drive sequential conversations that follow a very different distribution than that of the training data, off-policy correction using propensity scoring or related methods is needed [39]. We use DualDICE [28] for this purpose, a recent SOTA method for off-policy estimation of RL policy values that directly estimates the stationary distribution correction ratio, i.e., the ratio of the steady-state probabilities of specific state-action pairs $(\phi_{x_{i}},\psi_{a_{i}})$ generated by the RL policy $\pi$ and the data-generating (or behavior) policy $\pi_{B}$ (which can be estimated from the training data). We provide a high-level overview.

DM 离策略评估。离策略评估的主要目标是利用现有的对话数据评估我们基于强化学习 (RL) 的 DM 策略的性能。与(监督式)短视模型不同,RL 需要评估完整轨迹的奖励。由于基于 RL 的 DM 可以驱动与训练数据分布截然不同的序列对话,因此需要使用倾向评分或相关方法进行离策略校正 [39]。为此,我们使用 DualDICE [28],这是一种用于 RL 策略值离策略估计的最新 SOTA 方法,它直接估计稳态分布校正比率,即由 RL 策略 $\pi$ 和数据生成(或行为)策略 $\pi_{B}$(可以从训练数据中估计)生成的特定状态-动作对 $(\phi_{x_{i}},\psi_{a_{i}})$ 的稳态概率比率。我们提供一个高层次的概述。

Given a batch $B=\{(\phi_{x_{i}},\psi_{a_{i}},r_{i},\phi_{x_{i}^{\prime}})\}_{i}$ of (embedded conversation history) training data and a DM policy $\pi$, DualDICE learns a feed-forward DNN $\nu_{\rho}:\mathbb{R}^{d}\times\mathbb{R}^{h}\to\mathbb{R}$, parameterized by $\rho$, where $\nu_{\rho}(\phi_{x_{i}},\psi_{a_{i}})$ is a proto-value function whose Bellman residuals are estimates of the required stationary distribution ratios [28]. Given a trained $\nu_{\rho^{*}}$, the value of the RL-based DM’s policy $\pi$ can be estimated by
$$
J_{\mathrm{DD}}(\pi):=\sum_{i=1}^{|B|}r_{i}\cdot\Big(\nu_{\rho^{*}}(\phi_{x_{i}},\psi_{a_{i}})-\gamma\frac{\pi(a_{i}^{\prime}|x_{i}^{\prime})}{\pi_{B}(a_{i}^{\prime}|x_{i}^{\prime})}\nu_{\rho^{*}}(\phi_{x_{i}^{\prime}},\psi_{a_{i}^{\prime}})\Big).
$$
Notice that this estimator assumes knowledge of the behavior (data-generating) policy $\pi_{B}$ (the supervised DM in our setting). However, since the supervised DM is trained to optimize a myopic reward, it can be overly deterministic. This can drive large fluctuations in the propensity scores $\pi/\pi_{B}$, and high variance in $J_{\mathrm{DD}}$. Instead, we use a behavior-agnostic form of DualDICE which requires no estimation of $\pi_{B}$ (see [28] for details).

给定一批训练数据 $B=\{(\phi_{x_{i}},\psi_{a_{i}},r_{i},\phi_{x_{i}^{\prime}})\}_{i}$(嵌入的对话历史)和一个 DM 策略 $\pi$,DualDICE 学习一个前馈 DNN $\nu_{\rho}:\mathbb{R}^{d}\times\mathbb{R}^{h}\to\mathbb{R}$,由 $\rho$ 参数化,其中 $\nu_{\rho}(\phi_{x_{i}},\psi_{a_{i}})$ 是一个原型值函数,其 Bellman 残差是所需稳态分布比的估计 [28]。给定训练好的 $\nu_{\rho^{*}}$,基于 RL 的 DM 策略 $\pi$ 的值可以通过
$$
J_{\mathrm{DD}}(\pi):=\sum_{i=1}^{|B|}r_{i}\cdot\Big(\nu_{\rho^{*}}(\phi_{x_{i}},\psi_{a_{i}})-\gamma\frac{\pi(a_{i}^{\prime}|x_{i}^{\prime})}{\pi_{B}(a_{i}^{\prime}|x_{i}^{\prime})}\nu_{\rho^{*}}(\phi_{x_{i}^{\prime}},\psi_{a_{i}^{\prime}})\Big)
$$
来估计。请注意,这个估计器假设已知行为(数据生成)策略 $\pi_{B}$(在我们的设置中是监督 DM)。然而,由于监督 DM 被训练为优化短视奖励,它可能过于确定性。这可能导致倾向得分 $\pi/\pi_{B}$ 的大幅波动,以及 $J_{\mathrm{DD}}$ 的高方差。相反,我们使用一种行为无关的 DualDICE 形式,不需要估计 $\pi_{B}$(详见 [28])。
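The resulting value estimate has the familiar importance-reweighting form; the sketch below assumes the stationary-distribution correction ratios are already given, whereas learning them (via $\nu_{\rho}$) is precisely what DualDICE does:

```python
import numpy as np

# Stationary-distribution correction ratios w ~ d^pi / d^pi_B for four logged
# (state, action) pairs; in DualDICE these come from the learned nu_rho, here
# they are hypothetical numbers for illustration only.
ratios = np.array([1.2, 0.8, 0.5, 1.5])
rewards = np.array([1.0, 0.0, 2.0, 1.0])  # logged per-turn rewards (made up)

j_hat = float(np.mean(ratios * rewards))  # off-policy estimate of the policy value
```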

Table 2: Off-policy and on-policy rater evaluation results.


表 2: 离策略和同策略评分者的评估结果

| 模型类型 | 模型 | 同策略 | 离策略 |
| --- | --- | --- | --- |
| 监督学习 | RNN | 5.75 ± 1.99 | 5.28 ± 0.44 |
| 监督学习 | Transformer | 5.67 ± 2.18 | 4.55 ± 0.37 |
| 随机动作 | RNN | 5.17 ± 2.29 | 5.13 ± 0.52 |
| 随机动作 | Reg-RNN | 6.48 ± 1.13 | 5.51 ± 0.39 |
| 随机动作 | Transformer | 5.71 ± 2.26 | 4.76 ± 0.40 |
| 随机动作 | Reg-Transformer | 5.58 ± 2.02 | 4.73 ± 0.45 |
| 连续动作 | RNN | 6.04 ± 1.63 | 5.49 ± 0.41 |
| 连续动作 | Transformer | 5.86 ± 1.79 | 4.91 ± 0.38 |
| 连续动作 | Reg-Transformer | 5.46 ± 2.25 | 4.78 ± 0.46 |
| E2E | Reg-Transformer | 6.53 ± 1.44 | -- |

The off-policy evaluation results generated by DualDICE are presented in Table 2 (second column). Since our use of DualDICE depends on language encoders $\phi$ and $\psi$, it cannot be used to evaluate the E2E model. Note that our off-policy value estimates of the RNN-based and the transformer-based models are generated using an RNN-based and a transformer-based behavior policy, respectively. DualDICE off-policy results show that SAQL-Reg-RNN and CAQL-RNN are among the best-performing RL-based policies (this coincides with on-policy evaluation results, see below). The offline performance of RNN-based models is consistently better than that of transformer-based models. This too is somewhat corroborated by on-policy evaluation, though this performance difference may be due in part to proto-function approximation error in DualDICE caused by the inherent bias of the transformer-based data.

DualDICE 生成的离策略评估结果如表 2(第二列)所示。由于我们使用 DualDICE 依赖于语言编码器 $\phi$ 和 $\psi$,因此它不能用于评估 E2E 模型。请注意,我们对基于 RNN 和基于 Transformer 的模型的离策略值估计是分别使用基于 RNN 和基于 Transformer 的行为策略生成的。DualDICE 的离策略结果表明,SAQL-Reg-RNN 和 CAQL-RNN 是表现最好的基于 RL 的策略之一(这与在策略评估结果一致,见下文)。基于 RNN 的模型的离线性能始终优于基于 Transformer 的模型。这在某种程度上也得到了在策略评估的证实,尽管这种性能差异可能部分是由于基于 Transformer 的数据的固有偏差导致的 DualDICE 中的原型函数近似误差。

DM On-policy Evaluation. We next conducted on-policy evaluation. Human evaluators were asked to conduct dialogues with our bot and rate the overall conversation experience on the same -3 to 7 scale used to collect training data. Evaluation was blind—evaluators did not know which model they were conversing with. Overall 200 dialogues for each model were rated. The results are presented in Table 2 (first column). SAQL-Reg-E2E and SAQL-Reg-RNN received the highest rating while SAQL-RNN performed worst. Notice that the models rated best by evaluators are trained with lower discount factors. We conjecture raters may be inherently biased to value myopic quality. We also note the high-variance in rater evaluations across all models (a point discussed further below).

DM 在线策略评估。接下来我们进行了在线策略评估。人类评估者被要求与我们的机器人进行对话,并使用与收集训练数据相同的 -3 到 7 的评分标准对整体对话体验进行评分。评估是盲测的——评估者不知道他们正在与哪个模型对话。每个模型总共评估了 200 个对话。结果如表 2(第一列)所示。SAQL-Reg-E2E 和 SAQL-Reg-RNN 获得了最高评分,而 SAQL-RNN 表现最差。请注意,评估者评分最高的模型是使用较低的折扣因子训练的。我们推测评估者可能天生偏向于重视短视的质量。我们还注意到所有模型的评估者评分存在高方差(这一点将在下面进一步讨论)。

6 LIVE EXPERIMENT

6 实时实验

Evaluation by human raters facilitates policy assessment in controlled settings and is necessary before deployment in a user-facing commercial product. However, dedicated human evaluators typically behave differently than real users. Specifically, the impact of conversation planning might be quite different with raters vs. real users. For example, raters might continue a conversation even after it reaches an awkward stage or they might not reflect a potential increase in user engagement after a successful focus change initiated by the bot. To gain an in-depth understanding of their impact on real users, we conducted a live experiment with our RL models. The Q-learning model that achieved the largest improvement in terms of user engagement was then fully deployed in the Google Assistant.

通过人类评估员进行评估有助于在受控环境中进行策略评估,并且在面向用户的商业产品部署之前是必要的。然而,专职的人类评估员通常与真实用户的行为不同。具体来说,对话规划的影响在评估员和真实用户之间可能会有很大差异。例如,评估员可能会在对话达到尴尬阶段后继续对话,或者他们可能不会反映出由机器人成功发起焦点转换后用户参与度的潜在提升。为了深入了解它们对真实用户的影响,我们使用强化学习模型进行了实时实验。在用户参与度方面取得最大改进的Q-learning模型随后被完全部署到Google Assistant中。

6.1 Experimental Setup

6.1 实验设置

To conduct a live experiment, we build on the dynamic composition bot from [38] (Sec. 3). This bot is integrated with the Google Assistant, dubbed the assistant below, and interacts with users in a real-time online setting.

为了进行实时实验,我们基于[38](第3节)中的动态组合机器人进行构建。该机器人与Google Assistant集成,下文简称为助手,并在实时在线环境中与用户互动。

The experiment was conducted using an A/B testing protocol, in which a small percentage of assistant users were randomly sampled to interact with the bot using an RL-based DM while other users (same percentage) interact with the vanilla bot using a supervised DM. More precisely, the experiment was conducted with one control arm, with the transformer-based supervised model, and eight experiment arms with the architectures listed in Sec. 5. We use the supervised transformer model as a baseline as it was shown to outperform the supervised RNN in a previous live experiment.

实验采用 A/B 测试协议进行,其中随机抽取一小部分助手用户与基于强化学习 (RL) 的对话管理 (DM) 机器人进行交互,而其他用户(相同比例)则与使用监督式 DM 的普通机器人进行交互。更准确地说,实验包括一个对照组,使用基于 Transformer 的监督模型,以及八个实验组,其架构如第 5 节所列。我们使用监督式 Transformer 模型作为基线,因为在之前的实时实验中,它已被证明优于监督式 RNN。

Our experiment spanned the months of December 2021 and January 2022, during which user assignment to control/experiments remained constant. The experiment was transparent to the users, who could not distinguish between the different DMs. A conversation starts when a user triggers the experience by asking an animal related query (e.g., “how does a lion sound?”). Once initiated, a conversation with a user could end if the bot predicted that its response is not of sufficient quality (i.e., the DM score is too low), if the user issued a query outside of the animal domain (e.g., about the weather), or if the user issued a standard stop command. The last two options were handled by the assistant.

我们的实验跨越了2021年12月和2022年1月,在此期间用户被分配到控制组/实验组的情况保持不变。实验对用户是透明的,他们无法区分不同的对话模型 (DM)。当用户通过提出与动物相关的问题(例如,“狮子是怎么叫的?”)触发体验时,对话开始。一旦开始,如果机器人预测其响应质量不足(即 DM 分数太低),或者用户提出了动物领域之外的问题(例如,关于天气),或者用户发出了标准的停止命令,与用户的对话可能会结束。后两种情况由助手处理。

6.2 Evaluation Metrics

6.2 评估指标

We measured daily user interaction with the assistant in the animal domain in both the experiment and control arms. To assess user engagement, we use several surrogate metrics that are directly measurable in the interaction logs. We define a conversation to be the succession of user and bot turns, starting with a triggering user turn. The conversation length is the number of turns (combined user and bot turns) in a conversation. We consider followup feedback after each bot response, where followup refers to the next query, if any, after the bot response. Specifically, we distinguish:

我们在实验组和对照组中测量了用户每天在动物领域与助手的互动情况。为了评估用户参与度,我们使用了几个可以直接从互动日志中测量的替代指标。我们将对话定义为用户和机器人轮次的连续,从触发用户轮次开始。对话长度是指对话中的轮次数量(用户和机器人的轮次总和)。我们考虑了每次机器人响应后的后续反馈,其中后续反馈指的是机器人响应后的下一个查询(如果有的话)。具体来说,我们区分了以下几种情况:

• Cooperative responses to bot questions, such as “yes” in response to a question proposing additional content (e.g., “do you want to hear more?”) or “Tell me about lions” in response to a list selection question (e.g., “which animal do you want to hear about next?”).
• Non-cooperative responses to bot questions, such as “no” in response to a question proposing additional content (e.g., “do you want to learn about cheetahs?”).
• Explicit positive feedback, which captures followup user queries with explicit gratitude, e.g., “thank you” or “wonderful”.

  • 对机器人问题的合作性回应,例如在提议额外内容的问题(如“你想听更多吗?”)中回答“是”,或在列表选择问题(如“接下来你想听哪种动物?”)中回答“告诉我关于狮子的事”。
  • 对机器人问题的非合作性回应,例如在提议额外内容的问题(如“你想了解猎豹吗?”)中回答“否”。
  • 明确的正面反馈,即用户在后续查询中表达明确的感激之情,例如“谢谢”或“太棒了”。

Table 3: Mean relative change in metrics for each experiment arm vs. the control. Here, T stands for transformer; green changes are desirable, red changes less so (to varying degrees).

表 3: 实验与对照指标的相对变化均值。其中,T 代表 Transformer;绿色变化是期望的,红色变化则不太理想(程度不一)。

| 指标 | SAQL-RNN | SAQL-T | CAQL-RNN | CAQL-T | SAQL-Reg-E2E |
| --- | --- | --- | --- | --- | --- |
| 对话长度 | +30% | +23% | +14% | +18% | -0.7% |
| 合作性回应 | +8% | -6.8% | -5.8% | -4% | -8% |
| 非合作性回应 | +112% | +178% | +54% | +120% | +41% |
| 明确正面反馈 | +32% | +9.7% | -20% | +6.8% | -9% |
| 明确负面反馈 | -18% | +8.6% | +1% | -14% | +27% |

• Explicit negative feedback, reflecting followup user queries that contain negative feedback, such as “stop” or “shut up”.

• 显式负面反馈,反映包含负面反馈的后续用户查询,例如“stop”或“shut up”。

For the last two metrics, we use predefined lists of positive and negative feedback phrases collected from user logs.

对于最后两个指标,我们使用从用户日志中收集的预定义正面和负面反馈短语列表。
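A minimal sketch of this followup classification; the phrase lists and rules below are illustrative fragments, not the production lists mined from user logs:

```python
# Illustrative phrase lists; the production lists are collected from user logs.
POSITIVE = {"thank you", "wonderful"}
NEGATIVE = {"stop", "shut up"}
COOPERATIVE = {"yes", "sure"}

def classify_followup(query):
    q = query.lower().strip()
    if q in POSITIVE:
        return "explicit_positive"
    if q in NEGATIVE:
        return "explicit_negative"
    if q in COOPERATIVE or q.startswith("tell me about"):
        return "cooperative"
    if q == "no":
        return "non_cooperative"
    return "other"
```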

6.3 Main Results

6.3 主要结果

The average relative change in metrics across all experiments w.r.t. the control is shown in Table 3 (for CQL variants, see Table 4 in the appendix). Interestingly, these results differ from the rater online evaluations in Table 2, demonstrating the substantial distinction between the behaviors of raters and real users. This discrepancy may be, in part, due to the high variance in rater evaluations. Surprisingly, SAQL-Reg-E2E performs worst and is slightly outperformed by the supervised baseline. The E2E model behaves quite conservatively, similar to a supervised model, avoiding pivoting to other animals and changing the type of offered content (e.g., sounds, facts, quizzes). Such conservative behavior may be caused by the lower discount factor $\gamma=0.8$ used, making its expected trajectory horizon shorter. This might be preferred by raters who tend to evaluate bot responses more myopically; but at the same time, it provides a more boring, less engaging experience for users. The different SAQL models outperform their CAQL counterparts. A similar conclusion was drawn in [45], where discrete latent actions were deemed to be more suitable than continuous actions for dialogue agents.

所有实验中指标相对于对照组的平均相对变化如表 3 所示(CQL 变体见附录中的表 4)。有趣的是,这些结果与表 2 中的评分者在线评估结果不同,表明评分者和真实用户行为之间存在显著差异。这种差异可能部分是由于评分者评估的高方差所致。令人惊讶的是,SAQL-Reg-E2E 表现最差,略逊于监督基线。E2E 模型表现得相当保守,类似于监督模型,避免转向其他动物并改变提供的内容类型(例如,声音、事实、测验)。这种保守行为可能是由于使用了较低的折扣因子 $\gamma=0.8$,使其预期的轨迹范围更短。这可能更受评分者的青睐,他们倾向于更短视地评估机器人的响应;但与此同时,它为用户提供了更无聊、吸引力较低的体验。不同的 SAQL 模型优于其对应的 CAQL 模型。[45] 中得出了类似的结论,其中离散的潜在动作被认为比连续动作更适合对话智能体。

Overall, we find that SAQL-RNN performs best w.r.t. our main metrics, conducting longer, more engaging conversations. It increases conversation length by 32%, while also increasing user engagement as captured by multiple metrics. We see an increase of 8% in cooperative responses to bot questions. While there is also a large increase in non-cooperative responses (112%), this is expected, as the SAQL-RNN agent takes more risks by asking pivoting questions, generating many more occasions for non-cooperative user reactions. While the user may not be interested in the conversational direction proposed by the bot (e.g., pivoting to another animal), the user often continues engaging in a dialogue about animals. For example, in Fig. 1a, the user provides a non-cooperative answer in the 3rd turn. As a result, the bot modifies its plan and asks the user to choose the next conversation focus, to which the user responds positively. In addition, some followup user queries contain explicit positive or negative feedback. While an order of magnitude fewer than other followups, they offer a direct measure of user (dis)satisfaction. SAQL-RNN increases explicit positive feedback by 32% and reduces negative feedback by 18%.

总体而言,我们发现 SAQL-RNN 在我们的主要指标上表现最佳,能够进行更长、更具吸引力的对话。它将对话长度增加了 32%,同时通过多个指标捕捉到的用户参与度也有所提高。我们看到用户对机器人问题的合作性回应增加了 8%。尽管非合作性回应也有大幅增加 (112%),但这是预期的,因为 SAQL-RNN 智能体通过提出转向问题承担了更多风险,从而产生了更多用户非合作性反应的机会。虽然用户可能对机器人提出的对话方向不感兴趣(例如转向另一个动物),但用户通常会继续参与关于动物的对话。例如,在图 1a 中,用户在第三轮提供了一个非合作性回答。结果,机器人修改了其计划,并让用户选择下一个对话焦点,用户对此做出了积极的回应。此外,一些后续用户查询包含明确的正面或负面反馈。虽然数量级上比其他后续查询少,但它们提供了用户(不)满意度的直接衡量标准。SAQL-RNN 将明确的正面反馈增加了 32%,并将负面反馈减少了 18%。

Our CQL variants of the different models indeed behave more “conservatively,” closer to supervised model behavior. This translates into smaller changes in conversation length and user feedback metrics (see appendix). Interestingly, using the transformer (vs. RNN) state encoding does not improve SAQL performance, unlike in the supervised setting, where transformer-based candidate selection is superior: the RNN state representation seems sufficient for RL. For this reason, we focus our analysis on SAQL-RNN below.

我们不同模型的 CQL 变体确实表现得更加“保守”,更接近监督模型的行为。这体现在对话长度和用户反馈指标的较小变化上(见附录)。有趣的是,使用 Transformer(相对于 RNN)状态编码并没有提高 SAQL 的性能,这与监督设置不同,在监督设置中,基于 Transformer 的候选选择表现更优:RNN 状态表示似乎足以满足 RL 的需求。因此,我们在下文中将分析重点放在 SAQL-RNN 上。

6.4 Qualitative Analysis of the RL DM

6.4 RL DM 的定性分析

To improve user engagement while conducting longer conversations, SAQL-RNN uses several planning strategies. First, it ends 20% more turns in questions relative to the control, prompting the user to choose additional content (e.g., learn more animal facts, hear another animal sound). While we observe an increase in cooperative responses, the cooperation rate to bot questions drops by 9.5%. Although this may seem problematic, this is actually a result of a favorable policy learned by our bot: by taking more risks in eliciting a user’s preference for the next steps, SAQL-RNN achieves an overall improved user experience, as measured via increased conversation length, combined with a noticeable increase in explicit positive feedback and a decrease in negative feedback.

为了提高用户在进行较长对话时的参与度,SAQL-RNN 采用了多种规划策略。首先,它比对照组多出 20% 的回合以问题结束,提示用户选择更多内容(例如,了解更多动物知识,听另一种动物的声音)。虽然我们观察到合作回应的增加,但对机器人问题的合作率下降了 9.5%。尽管这看似有问题,但这实际上是我们的机器人学习到的一种有利策略的结果:通过在引导用户对下一步的偏好时承担更多风险,SAQL-RNN 实现了整体用户体验的提升,这通过对话长度的增加、明显增加的明确正面反馈以及负面反馈的减少来衡量。

A second planning strategy is to better exploit content diversity, including facts, sounds, quizzes, yes/no questions, open questions, etc. On average, SAQL-RNN uses 26% more unique providers per conversation than the supervised transformer-based model.

第二种规划策略是更好地利用内容的多样性,包括事实、声音、测验、是/否问题、开放性问题等。平均而言,SAQL-RNN 在每次对话中使用的独特提供者比基于监督的 Transformer 模型多 26%。

Two additional planning strategies are related to the existence of two sub-dialogues with different characteristics. Dialogues around animal sounds are poorer in content and exhibit entity pivoting at every turn (after playing the sound of a given animal, we can either suggest the sound of a different animal or quiz the user about other animal sounds). In contrast, dialogues around animal facts typically contain richer content and a greater conversation depth. We observe that SAQL-RNN favors the richer experience of the latter, selecting 31% more fact-related content.

两种额外的规划策略与存在两种不同特征的子对话相关。围绕动物声音的对话内容较为贫乏,并且在每一轮对话中都会出现实体转换(在播放某种动物的声音后,我们可以建议播放另一种动物的声音,或者向用户提问其他动物的声音)。相比之下,围绕动物事实的对话通常包含更丰富的内容和更深的对话层次。我们观察到 SAQL-RNN 更倾向于后者的丰富体验,多选择了 31% 的事实相关内容。

Lastly, we observe that the average conversation breadth of dialogues conducted by SAQL-RNN is lower (it generates 13% fewer focus-pivoting turns). This is a consequence of fact dialogues having less breadth. However, when restricting analysis to fact dialogues, SAQL-RNN exhibits 60% more focus-pivoting turns.

最后,我们观察到 SAQL-RNN 所进行对话的平均广度较低(其焦点转移轮次减少了 13%)。这是事实类对话广度较小的结果。然而,当将分析限制在事实类对话时,SAQL-RNN 的焦点转移轮次多出 60%。
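Counting focus-pivoting turns reduces to comparing each turn's focus entity with the previous one. A minimal sketch, with a hypothetical per-turn focus annotation:

```python
def focus_pivot_count(focus_per_turn):
    """Number of focus-pivoting turns: turns whose focus entity differs
    from the previous turn's."""
    return sum(prev != cur for prev, cur in zip(focus_per_turn, focus_per_turn[1:]))

# Hypothetical focus entities per turn in one dialogue.
print(focus_pivot_count(["lion", "lion", "tiger", "tiger", "zebra"]))  # 2
```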

Some of these strategies are exemplified by the sample conversation in Fig. 1a, generated by the SAQL-RNN model, which we contrast with Fig. 1b, conducted by the supervised transformer. Both conversations start with the same two turns. In the third turn, after a non-cooperative user response, the transformer pivots back to sounds to maximize "immediate" user interest. By contrast, the RL model tries to pivot to facts for a richer conversational experience, suggesting that the user choose the next animal. We also observe that the RL conversation includes more types of content, such as sounds, facts, quizzes, yes/no and open questions.

这些策略中的一些在图 1a 的示例对话中得到了体现,该对话由 SAQL-RNN 模型生成,我们将其与图 1b 中由监督式 Transformer 生成的对话进行对比。两个对话都以相同的两轮对话开始。在第三轮对话中,当用户给出不合作的回应后,Transformer 转向声音内容以最大化“即时”用户兴趣。相比之下,RL 模型则尝试转向事实内容以提供更丰富的对话体验,建议用户选择下一个动物。我们还观察到,RL 对话包含了更多类型的内容,例如声音、事实、测验、是非问题和开放性问题。

7 CONCLUSION

7 结论

In this work we tackled the formidable task of building a rich, open-ended conversational bot that is deployed in the challenging setting of a real-time, global commercial assistant. Our approach relies on the framework of reinforcement learning, using a novel state representation based on the succinct embedding of a supervised language model and an RL algorithm that allows for a dynamic action space at each stage of the conversation. Ours is one of the few examples of RL-based conversational systems deployed in the wild at scale, and the substantial advantages demonstrated over the SOTA supervised model validate the decades-long premise that the dynamic planning ability of RL is a natural fit for the design of rich dialogue agents.

在本工作中,我们致力于构建一个丰富的、开放式的对话机器人,并将其部署在实时、全球商业助手的挑战性环境中。我们的方法依赖于强化学习框架,使用了一种基于监督语言模型简洁嵌入的新型状态表示,以及一种允许在对话的每个阶段动态调整动作空间的强化学习算法。我们的系统是少数几个大规模部署在真实环境中的基于强化学习的对话系统之一,其相对于最先进的监督模型所展现出的显著优势,验证了数十年来强化学习的动态规划能力是设计丰富对话智能体的自然选择这一前提。

An interesting insight from our live experiment highlights the power of RL to take counter-intuitive actions: an increase in noncooperative responses, a seemingly negative phenomenon, is simply a tool with which the agent may elicit a user’s preference for the next phase of the conversation. This leads to a positive conversational experience on average, with a measurable increase both in conversation length and positive feedback. We hope to discover other dialogue strategies that drive “great” conversations as we shift to learning models directly from rich user signals.

我们的实时实验揭示了一个有趣的见解,展示了强化学习(RL)采取反直觉行动的力量:非合作回应的增加,虽然看似是负面现象,但实际上只是智能体用来引出用户在对话下一阶段偏好的工具。这平均带来了积极的对话体验,对话长度和正面反馈均有可衡量的增加。随着我们转向直接从丰富的用户信号中学习模型,我们希望发现其他能够推动“优秀”对话的策略。

REFERENCES

参考文献

A APPENDIX

A 附录

A.1 DM Training Hyper-Parameters

A.1 DM 训练超参数

Our supervised and RL models were trained with the following hyper-parameters.

我们的监督学习和强化学习模型使用了以下超参数进行训练。

The supervised RNN model is trained with a learning rate of 0.0001, a batch size of 16, a dropout probability of 0.2, and 200K training steps. Its architecture includes a GRU layer with 200 units.

监督式RNN模型以0.0001的学习率、16的批量大小、0.2的dropout概率和20万次训练步数进行训练。其架构包括一个200个单元的GRU层。
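The single architectural detail given above, a GRU layer with 200 units, can be illustrated with a minimal NumPy forward pass. This is a sketch only: the input dimension, weight shapes, and initialization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 200   # GRU units, as stated in the paper
X_DIM = 32     # hypothetical input embedding size

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h @ Uz)                 # update gate
    r = sigmoid(x @ Wr + h @ Ur)                 # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)     # candidate hidden state
    return (1 - z) * h + z * h_tilde

# Illustrative random weights: (Wz, Uz, Wr, Ur, Wh, Uh).
params = [rng.normal(scale=0.1, size=s)
          for s in [(X_DIM, HIDDEN), (HIDDEN, HIDDEN)] * 3]
h = np.zeros(HIDDEN)
x = rng.normal(size=X_DIM)
h = gru_step(x, h, *params)
print(h.shape)  # (200,)
```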

For the supervised transformer model, we use the BERT-Medium checkpoint⁸ with an uncased vocabulary, hidden dimension $H=512$, $L=8$ transformer layers, and $A=8$ attention heads per layer. This model was trained for 20,000 steps with a global batch size of 768 divided among 8 TPUv3 chips, using the Adam optimizer with an initial learning rate of $\epsilon=5\cdot10^{-5}$ (decayed to zero), $\beta_1=0.9$, $\beta_2=0.999$, and $\epsilon=10^{-6}$.

对于监督式 Transformer 模型,我们使用了 BERT-Medium 检查点⁸,其词汇表为小写形式,隐藏维度为 $H=512$,包含 $L=8$ 个 Transformer 层,每层有 $A=8$ 个注意力头。该模型在 8 个 TPUv3 芯片上训练了 20000 步,全局批量大小为 768,使用 Adam 优化器,初始学习率为 $\epsilon=5\cdot10^{-5}$(衰减至零),$\beta_1=0.9$,$\beta_2=0.999$,以及 $\epsilon=10^{-6}$。
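The paper only says the learning rate is "decayed to zero"; assuming the common linear schedule used in BERT training, the rate at a given step could be computed as follows (the linear shape is our assumption, not stated in the paper):

```python
def linear_decay(step, total_steps=20_000, init_lr=5e-5):
    """Learning rate linearly decayed from init_lr at step 0 to zero
    at total_steps (assumed linear schedule)."""
    return init_lr * max(0.0, 1.0 - step / total_steps)

print(linear_decay(0))       # 5e-05
print(linear_decay(20_000))  # 0.0
```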

For the RL models, SAQL-RNN, CAQL-RNN, SAQL-Reg-RNN, and CAQL-Reg-RNN use a fully connected feed-forward network for $Q$-function approximation. These networks are composed of 3 layers, each with 1024 ReLU units. The SAQL-RNN network is trained using a learning rate $\epsilon=7\cdot10^{-5}$ and $k=2M$ steps. Warm-started with the SAQL-RNN weights, the SAQL-Reg-RNN network is trained with $\epsilon=5\cdot10^{-6}$ and $k=2.4M$. The CAQL-RNN network is trained with $\epsilon=5\cdot10^{-5}$ and $k=2M$, where the inner maximization problem uses a GA learning rate of $\epsilon_{\mathrm{GA}}=1\cdot10^{-6}$ and runs for a maximum of $k_{\mathrm{GA}}=25$ steps. Warm-started with the CAQL-RNN weights, the CAQL-Reg-RNN network is trained with $\epsilon=3\cdot10^{-6}$ and $k=2M$. SAQL-Transformer, SAQL-Reg-Transformer, CAQL-Transformer, and CAQL-Reg-Transformer follow almost the same settings as SAQL-RNN, SAQL-Reg-RNN, CAQL-RNN, and CAQL-Reg-RNN, but are trained with $\epsilon=3\cdot10^{-4}$ and $k=4M$, $\epsilon=1\cdot10^{-5}$ and $k=2.5M$, $\epsilon=3\cdot10^{-5}$ and $k=3M$, and $\epsilon=1\cdot10^{-5}$ and $k=3M$, respectively. The models are trained with batch size $|B|=32$, and in all two-step CQL-regularized models the regularization coefficient $\alpha$ is $0.1$.⁹ SAQL-Reg-E2E was trained with $k=320{,}000$, $\epsilon=5\cdot10^{-5}$, $|B|=48$, $\gamma=0.8$ and $\alpha=0.01$. All these hyper-parameters are chosen from the best settings of their corresponding grid-search optimization.

对于强化学习模型,SAQL-RNN、CAQL-RNN、SAQL-Reg-RNN 和 CAQL-Reg-RNN 使用全连接前馈网络进行 $Q$ 函数近似。这些网络由 3 层组成,每层包含 1024 个 ReLU 单元。SAQL-RNN 网络使用学习率 $\epsilon=7\cdot10^{-5}$ 和 $k=2M$ 步进行训练。以 SAQL-RNN 权重为初始值,SAQL-Reg-RNN 网络使用 $\epsilon=5\cdot10^{-6}$ 和 $k=2.4M$ 进行训练。CAQL-RNN 网络使用 $\epsilon=5\cdot10^{-5}$ 和 $k=2M$ 进行训练,其中内部最大化问题使用 GA 学习率 $\epsilon_{\mathrm{GA}}=1\cdot10^{-6}$,并最多运行 $k_{\mathrm{GA}}=25$ 步。以 CAQL-RNN 权重为初始值,CAQL-Reg-RNN 网络使用 $\epsilon=3\cdot10^{-6}$ 和 $k=2M$ 进行训练。SAQL-Transformer、SAQL-Reg-Transformer、CAQL-Transformer 和 CAQL-Reg-Transformer 几乎遵循与 SAQL-RNN、SAQL-Reg-RNN、CAQL-RNN 和 CAQL-Reg-RNN 相同的设置,但分别使用 $\epsilon=3\cdot10^{-4}$ 和 $k=4M$、$\epsilon=1\cdot10^{-5}$ 和 $k=2.5M$、$\epsilon=3\cdot10^{-5}$ 和 $k=3M$、$\epsilon=1\cdot10^{-5}$ 和 $k=3M$ 进行训练。模型使用批量大小 $|B|=32$ 进行训练,在所有两步 CQL 正则化模型中,正则化系数 $\alpha$ 为 $0.1$。⁹ SAQL-Reg-E2E 使用 $k=320{,}000$、$\epsilon=5\cdot10^{-5}$、$|B|=48$、$\gamma=0.8$ 和 $\alpha=0.01$ 进行训练。所有这些超参数都是从其对应的网格搜索优化中选择的最佳设置。
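The $Q$-function architecture described above (3 fully connected layers of 1024 ReLU units) can be sketched with a NumPy forward pass. The input dimension, initialization, and scalar output head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def q_network(state_action, weights, biases):
    """Feed-forward Q-function approximator: hidden layers use ReLU,
    the final layer outputs a scalar Q-value."""
    h = state_action
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]

IN_DIM = 256  # hypothetical state-action embedding size
dims = [IN_DIM, 1024, 1024, 1024, 1]  # 3 hidden layers of 1024 units
weights = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(dims, dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

q = q_network(rng.normal(size=IN_DIM), weights, biases)
print(q.shape)  # (1,)
```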

A.2 Full Live Experiment Results

A.2 完整实时实验结果

The average relative change in the live-experiment metrics of each experiment arm w.r.t. the control is shown in Table 4 for all models, including the CQL variants.

所有模型(包括CQL变体)的实验指标相对于对照组的平均相对变化如表4所示。

Table 4: Mean relative change of experiments vs. the control metrics. Here, T stands for transformer.

表 4: 实验与对照指标的相对变化均值。其中,T 代表 Transformer。

| 指标 | SAQL-RNN | SAQL-Reg-RNN | SAQL-T | SAQL-Reg-T | CAQL-RNN | CAQL-T | CAQL-Reg-T | SAQL-Reg-E2E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 对话长度 | +30% | +3.6% | +23% | +19% | +14% | +18% | +8.9% | -0.7% |
| 合作性回应 | +8% | -20% | -6.8% | -0.3% | -5.8% | -4% | -4.3% | -8% |
| 非合作性回应 | +112% | +75% | +178% | +125% | +54% | +120% | +130% | +41% |
| 明确正面反馈 | +32% | +42% | +9.7% | +0.4% | -20% | +6.8% | +5.8% | -9% |
| 明确负面反馈 | -18% | -7% | +8.6% | -7.7% | +1% | -14% | +15% | +27% |
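Each table cell is a mean relative change of an experiment metric against the control, which could be computed as in this small sketch (function name and sample values are illustrative):

```python
def relative_change(experiment, control):
    """Relative change of an experiment metric vs. the control, in percent."""
    return 100.0 * (experiment - control) / control

# E.g., a conversation-length statistic 30% above the control's:
print(relative_change(1.3, 1.0))
```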