[论文翻译]使用强化学习在开放式对话中进行动态规划


原文地址:PDF/RLHF论文集/Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning.pdf


Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning

使用强化学习在开放式对话中进行动态规划

Deborah Cohen, Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor, Craig Boutilier, Gal Elidan {debbycohen,mkryu,yinlamchow,orgad,ido greenberg,avinatan,fink,yossi,szpektor,cboutilier,elidan}@google.com Google Research

ABSTRACT

摘要

Despite recent advances in natural language understanding and generation, and decades of research on the development of conversational bots, building automated agents that can carry on rich open-ended conversations with humans “in the wild” remains a formidable challenge. In this work we develop a real-time, open-ended dialogue system that uses reinforcement learning (RL) to power a bot’s conversational skill at scale. Our work pairs the succinct embedding of the conversation state generated using SOTA (supervised) language models with RL techniques that are particularly suited to a dynamic action space that changes as the conversation progresses. Trained using crowd-sourced data, our novel system substantially exceeds the (strong) baseline supervised model with respect to several metrics of interest in a live experiment with real users of the Google Assistant.

尽管自然语言理解和生成领域最近取得了进展,并且对话机器人的开发已经研究了数十年,但构建能够“在野外”与人类进行丰富开放式对话的自动化智能体仍然是一个巨大的挑战。在这项工作中,我们开发了一个实时、开放式的对话系统,该系统使用强化学习 (RL) 来大规模提升机器人的对话能力。我们的工作将使用 SOTA(监督式)语言模型生成的对话状态的简洁嵌入与特别适合随着对话进展而变化的动态动作空间的 RL 技术相结合。通过使用众包数据进行训练,我们的新颖系统在与 Google Assistant 真实用户的实时实验中,在多个关键指标上显著超过了(强大的)基线监督模型。

1 INTRODUCTION

1 引言

With tremendous advances in AI and ML techniques to recognize speech and perform high quality natural language understanding (NLU) and generation (NLG), increased attention is being directed toward the task of carrying out real-time, rich conversations between humans and bots (e.g., [19, 32, 46]). Realistic interactions generally span complex topic spaces and are relatively open-ended, and often have an underlying goal (e.g., task completion, knowledge sharing). Thus, carrying them out effectively requires not just powerful bots that learn to generate favorable responses, but also demands that bots have the ability to plan and adapt on the fly.

随着人工智能(AI)和机器学习(ML)技术在语音识别、高质量自然语言理解(NLU)和生成(NLG)方面的巨大进步,越来越多的注意力被转向实现人类与机器人之间实时、丰富的对话任务(例如 [19, 32, 46])。真实的交互通常涉及复杂的主题空间,并且相对开放,通常具有潜在的目标(例如任务完成、知识共享)。因此,有效地执行这些任务不仅需要强大的机器人来学习生成有利的响应,还要求机器人具备即时规划和适应的能力。

The framework of reinforcement learning (RL) is a natural approach for this task. Indeed, research on Markov decision processes (MDPs) and RL for spoken-dialogue systems spans well over two decades, ranging from early work using MDPs and RL [18, 35], to methods based on partially observable MDP (POMDP) models (e.g., [42]), to more recent approaches adopting deep learning representations (e.g., [19]). Despite this, deployments of RL “in the wild” in large-scale dialogue systems, such as smart assistants like Alexa, Siri or Google Assistant, are rare (though exceptions exist, e.g., [32]; see Related Work below). Indeed, building such systems remains a formidable challenge.

强化学习 (Reinforcement Learning, RL) 的框架是解决这一任务的天然方法。事实上,针对语音对话系统的马尔可夫决策过程 (Markov Decision Processes, MDPs) 和 RL 的研究已经跨越了二十多年,从早期使用 MDPs 和 RL 的工作 [18, 35],到基于部分可观测 MDP (Partially Observable MDP, POMDP) 模型的方法(例如 [42]),再到最近采用深度学习表示的方法(例如 [19])。尽管如此,在大规模对话系统(如 Alexa、Siri 或 Google Assistant 等智能助手)中“实际应用”RL 的情况仍然罕见(尽管存在例外,例如 [32];见下文的相关工作)。事实上,构建这样的系统仍然是一个巨大的挑战。

Aside from the infrastructure hurdles associated with scalable, real-time systems, there are inherent modeling challenges when building open-ended conversational bots. First, the state space of such bots is massive, even within specific verticals, and care is needed to craft effective state representations for RL algorithms. Second, the action space is also in principle “unbounded,” and imposing reasonable limitations on actions comes with its own difficulties, including the fact that the set of candidate actions may vary as the conversation progresses. Finally, the design of suitable reward functions for open-ended dialogue can be quite subtle. In this work, we rely on crowd-sourced labels.

除了与可扩展的实时系统相关的基础设施障碍外,构建开放式对话机器人还存在固有的建模挑战。首先,即使是在特定垂直领域内,这类机器人的状态空间也非常庞大,需要精心设计有效的状态表示以适用于强化学习(RL)算法。其次,动作空间在原则上也是“无界的”,对动作施加合理的限制本身就存在困难,包括候选动作集可能随着对话的进行而变化。最后,为开放式对话设计合适的奖励函数可能非常微妙。在本研究中,我们依赖于众包标签。

We present a real-time, open-ended dialogue system that uses RL to power a bot’s conversational skill at scale. We address the challenges above using a novel RL construction. We first exploit powerful supervised models—specifically, RNNs and transformers— to provide a succinct embedding of the conversation state. Second, we use the fact that a relatively small set of “reasonable” candidate actions can be generated at each conversation turn [38]. From an RL perspective, this can be viewed as a stochastic realization of the full action space, so we use an RL approach tailored to such stochastic action sets [3]. We also explore the use of alternative SOTA RL techniques and training methods, including continuous-action optimization [29], conservative Q-learning [17] and novel off-policy evaluation algorithms [28].

我们提出了一种实时、开放式的对话系统,该系统使用强化学习(RL)来大规模提升机器人的对话能力。我们通过一种新颖的强化学习构建方法来解决上述挑战。首先,我们利用强大的监督模型——特别是RNN和Transformer——来提供对话状态的简洁嵌入。其次,我们利用了一个事实,即在每个对话轮次中可以生成相对较小的一组“合理”候选动作 [38]。从强化学习的角度来看,这可以被视为完整动作空间的随机实现,因此我们使用了一种针对此类随机动作集定制的强化学习方法 [3]。我们还探索了使用其他最先进的强化学习技术和训练方法,包括连续动作优化 [29]、保守Q学习 [17] 以及新颖的离策略评估算法 [28]。

We first evaluate our methods using offline data. We then describe the deployment of our system “in the wild” in the Google Assistant, specifically as part of the animal domain experience described by Szpektor et al. [38]. We demonstrate the effectiveness of our RL-based approach at dynamic planning and driving open-ended dialogue: relative to a SOTA non-RL (transformer) baseline, our bot substantially improves a number of key metrics, including conversation length, cooperative responses and explicit positive feedback. We perform a novel and extensive comparison of many RL architectures in real-world settings and generate unique insights into their relative performance. An example dialogue is shown in Fig. 1, showcasing the rich pivoting of our best-performing RL model vs. the supervised approach. Our model, now deployed in the Google Assistant, marks an important milestone in the use of RL for driving real-time, engaging, open-ended, conversations.

我们首先使用离线数据评估我们的方法。然后,我们描述了我们的系统在 Google Assistant 中的实际部署,特别是作为 Szpektor 等人 [38] 描述的动物领域体验的一部分。我们展示了基于强化学习 (RL) 的方法在动态规划和驱动开放式对话方面的有效性:相对于最先进的非强化学习 (Transformer) 基线,我们的机器人显著改善了多个关键指标,包括对话长度、合作响应和明确的正面反馈。我们在现实环境中对多种强化学习架构进行了新颖且广泛的比较,并生成了关于它们相对性能的独特见解。图 1 展示了一个示例对话,展示了我们表现最佳的强化学习模型与监督学习方法之间的丰富对比。我们的模型现已部署在 Google Assistant 中,标志着使用强化学习驱动实时、引人入胜、开放式对话的重要里程碑。

The key ingredients and insights of our approach are threefold. First, we use an effective representation of the RL state, leveraging pre-trained supervised models that encode the conversation history. Second, we limit the action space to a small set of generated candidate actions while allowing multiple actions in a single turn to compose rich bot responses. This granularity decouples content generation and dialogue planning. Finally, we adapt recent state-of-the-art RL algorithms that are well-suited to our dynamic action space to the candidate-generation decomposition we adopt here.

我们方法的关键要素和见解有三点。首先,我们使用强化学习状态的有效表示,利用预训练的监督模型对对话历史进行编码。其次,我们将动作空间限制为一小组生成的候选动作,同时允许在单轮对话中执行多个动作以构建丰富的机器人响应。这种粒度将内容生成和对话规划解耦。最后,我们采用了最近的最先进的强化学习算法,这些算法非常适合我们的动态动作空间,并将其应用于我们采用的候选生成分解方法。

2 RELATED WORK

2 相关工作

The use of RL for dialogue dates back more than two decades. Statistical research focuses on task-oriented dialogues and uses MDPs [18, 35, 36, 41] or POMDPs [42, 44]. Henderson et al. [11] introduce function approximation to reduce the resulting large slot-filling state space. Casanueva et al. [4] propose a feudal RL

使用强化学习 (RL) 进行对话的研究可以追溯到二十多年前。统计研究主要集中在任务导向型对话,并使用马尔可夫决策过程 (MDPs) [18, 35, 36, 41] 或部分可观测马尔可夫决策过程 (POMDPs) [42, 44]。Henderson 等人 [11] 引入了函数逼近来减少由此产生的大规模槽填充状态空间。Casanueva 等人 [4] 提出了一种封建强化学习方法。



Figure 1: Example conversations conducted by (a) an RL model and (b) a supervised model, previously deployed in the Google Assistant, showcasing the rich pivoting of the RL model vs. the supervised approach.

图 1: (a) 强化学习 (RL) 模型和 (b) 监督学习模型进行的对话示例,这些模型曾部署在 Google Assistant 中,展示了强化学习模型与监督学习方法在丰富对话转向方面的对比。

model which decomposes the decision by first selecting a subset of primitive actions, then choosing the actual action. These methods each model the state and action spaces using handcrafted semantic representations, such as slots and dialogue acts. This restricts such approaches to simple domains with limited slot-filling.

一种模型,通过首先选择一组基本动作子集,然后选择实际动作来分解决策。这些方法各自使用手工设计的语义表示(如槽位和对话行为)来建模状态和动作空间。这限制了这些方法仅适用于具有有限槽位填充的简单领域。

More recent work leverages deep neural networks (DNNs) to obviate the need for these so-called summary states and actions [9]. Fatemi et al. [8] consider a low-dimensional continuous-valued state space. Liu et al. [23] encode the dialogue state using an RNN but translate the representation to slot-value pairs. In both cases, the action space is restricted to a small number of dialogue acts and the approaches remain limited to specific task-based domains, with no clear extension to open-ended dialogues.

最近的研究利用深度神经网络 (DNNs) 来消除对这些所谓的摘要状态和动作的需求 [9]。Fatemi 等人 [8] 考虑了一个低维连续值状态空间。Liu 等人 [23] 使用 RNN 对对话状态进行编码,但将表示转换为槽值对。在这两种情况下,动作空间都被限制在少量的对话行为中,并且这些方法仍然局限于特定的任务型领域,没有明显的扩展到开放式对话的途径。

Building on advances in neural generative models, another line of work applies RL to directly learn a response generation model conditioned on the dialogue history. Actions are often defined at the word level, so the action space is the entire vocabulary [14, 19, 20, 33]. This approach suffers from several drawbacks: the action space is very large; word-level RL performs credit assignment poorly at an unnatural level for dialogue planning; and this may affect decoder performance, leading to incomprehensible utterances [45].

基于神经生成模型的进展,另一项工作应用强化学习(RL)直接学习基于对话历史的响应生成模型。动作通常在词级别定义,因此动作空间是整个词汇表 [14, 19, 20, 33]。这种方法存在几个缺点:动作空间非常大;词级别的强化学习在对话规划中在非自然级别上表现不佳;这可能会影响解码器的性能,导致生成难以理解的语句 [45]。

Related approaches model actions as latent variables, inducing a latent action space, thus decoupling the discourse-level decision making from NLG [30, 45]. However, they focus on specific domains (e.g., price negotiation, slot-filling or chitchat) rather than grounded open-ended dialogues. Serban et al. [32] proposed MILABOT, as part of the Amazon Alexa Prize competition, where a DM selects a response from several generated candidates. Our approach is similar in spirit, but allows one to combine several candidates in the same bot turn to compose a richer response. This seemingly small difference is vitally important as it allows our RL model to make decisions at an effective granularity. We note that while MILABOT was restricted to an A/B testing evaluation within a competition, our RL model is deployed in the Google Assistant.

相关方法将动作建模为潜在变量,引入潜在动作空间,从而将话语层面的决策与自然语言生成 (NLG) 解耦 [30, 45]。然而,这些方法专注于特定领域(例如价格谈判、槽填充或闲聊),而不是基于开放领域的对话。Serban 等人 [32] 提出了 MILABOT,作为 Amazon Alexa Prize 竞赛的一部分,其中对话管理器 (DM) 从多个生成的候选响应中选择一个。我们的方法在精神上与之相似,但允许在同一轮对话中组合多个候选响应,以生成更丰富的回复。这一看似微小的差异至关重要,因为它使我们的强化学习 (RL) 模型能够在有效的粒度上做出决策。我们注意到,虽然 MILABOT 仅限于竞赛中的 A/B 测试评估,但我们的 RL 模型已部署在 Google Assistant 中。

3 DYNAMIC COMPOSITION

3 动态组合

In this work, we build on the dynamic composition approach introduced by Szpektor et al. [38]. This dialogue management model limits the action space using specific content providers to propose candidate utterances, which are dynamically selected by the dialogue manager (DM). We adopt this scheme to manage action complexity in our RL approaches.

在本工作中,我们基于 Szpektor 等人 [38] 提出的动态组合方法进行构建。该对话管理模型通过特定的内容提供者来限制动作空间,以提出候选话语,这些话语由对话管理器 (DM) 动态选择。我们采用这一方案来管理我们强化学习方法中的动作复杂性。

Dynamic composition decouples content (utterance) generation from selection. Given candidate utterances proposed by each of several providers, the DM scores and selects a suitable utterance as (part of) a response given the current conversation history. The bot response can be composed of several utterances, which are generated and selected sequentially and dynamically (see Fig. 2). Additional components include an NLU module and a sentence fusion module that merges the selected utterances into a coherent response. We describe each of these components in turn.

动态组合将内容(话语)生成与选择解耦。给定由多个提供者提出的候选话语,对话管理器(DM)根据当前的对话历史对它们进行评分并选择合适的话语作为(部分)响应。机器人响应可以由多个话语组成,这些话语是顺序且动态生成和选择的(见图 2)。其他组件包括一个自然语言理解(NLU)模块和一个句子融合模块,后者将选定的语句合并为一个连贯的响应。我们将依次描述这些组件。
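The select-and-append composition loop described above can be sketched as follows. The provider/scorer interfaces and the explicit `<stop>` action are illustrative assumptions, not the production system; in the deployed bot the DM itself decides when the response is complete.

```python
# Sketch of the dynamic composition loop: providers propose candidate
# utterances, the DM scores and selects one, until the DM decides to stop.
from typing import Callable, List

def compose_response(
    history: List[str],
    providers: List[Callable[[List[str], List[str]], List[str]]],
    score: Callable[[List[str], List[str], str], float],
    max_utterances: int = 3,
    stop_token: str = "<stop>",
) -> List[str]:
    """Repeatedly gather candidates and let the DM pick one, until it stops."""
    selected: List[str] = []
    for _ in range(max_utterances):
        # Each provider proposes candidates given the history and the
        # partial response constructed so far in this turn.
        candidates = [c for p in providers for c in p(history, selected)]
        candidates.append(stop_token)  # the DM may end the turn here
        best = max(candidates, key=lambda c: score(history, selected, c))
        if best == stop_token:
            break
        selected.append(best)
    return selected

# Toy providers and a scorer that accepts up to two utterances, then stops.
fact = lambda h, s: ["Lions weigh 420 lbs."] if not s else []
quiz = lambda h, s: ["Can you guess how tall lions are?"]
scorer = lambda h, s, c: 0.5 if c == "<stop>" else (1.0 if len(s) < 2 else 0.0)
print(compose_response(["tell me about lions"], [fact, quiz], scorer))
```

The selected utterances would then be passed to sentence fusion to form the final bot response.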

Natural Language Understanding. At each turn of the conversation, user input is analyzed by an NLU module, comprising two components: a focus tracker and a user answer interpreter. The focus of the conversation is the set of entities currently being discussed. If the last turn ended with a bot question, the answer interpreter classifies the user answer w.r.t. the question type. For example, the response types expected for yes/no questions (e.g., “Do you want to hear more about cheetahs?”), are ‘yes’ (e.g., “sure,” “I want to hear about cheetahs”), ’no’, and ignoring the question; list selection questions (e.g., “Which animal do you want to hear about next?”)

自然语言理解。在对话的每一轮中,用户输入由一个自然语言理解(NLU)模块进行分析,该模块包括两个组件:焦点跟踪器和用户回答解释器。对话的焦点是当前正在讨论的实体集合。如果上一轮以机器人的问题结束,回答解释器会根据问题类型对用户回答进行分类。例如,对于是非问题(例如,“你想了解更多关于猎豹的信息吗?”),预期的回答类型包括“是”(例如,“当然”,“我想了解猎豹”)、“否”以及忽略问题;对于列表选择问题(例如,“接下来你想了解哪种动物?”)


Figure 2: Dynamic composition flow diagram. A few candidates from different providers are shown in each step and the one selected by the DM is highlighted in blue.

图 2: 动态组合流程图。每一步展示了来自不同提供者的几个候选方案,DM 选择的方案用蓝色高亮显示。

Table 1: Illustration of the concepts of focus, user answer interpretation and dialog acts for a sample conversation.

表 1: 示例对话中的焦点、用户回答解释和对话行为的说明。

| 说话者 | 对话 | 焦点 | 用户回答解释 | 对话行为 |
| --- | --- | --- | --- | --- |
| 用户 | 北极熊的声音 | 北极熊 | | |
| 机器人 | 这是北极熊的声音。 | 北极熊 | | |
| 机器人 | 嘿,你想听听企鹅的声音吗? | 企鹅 | | |
| 用户 | 是 | 企鹅 | 合作 | |
| 机器人 | 太好了。 | 企鹅 | | |
| 机器人 | 这是企鹅的声音。 | 企鹅 | | |
| 机器人 | 你想了解哪种动物? | | | |
| 用户 | 告诉我关于北极熊的事情 | 北极熊 | 合作 | |
| 机器人 | 酷。Churchillwild.com 说“北极熊想玩的时候会左右摇摆头部”。 | 北极熊 | | 事实 |
and quizzes (e.g.,“Can you guess which animal this is, a lion or a tiger?”) include animals as potential responses. Table 1 illustrates the concepts of focus and answer interpretation.

以及测验(例如,“你能猜出这是哪种动物吗,狮子还是老虎?”)的预期回答则包括动物。表 1 展示了焦点和答案解释的概念。
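A minimal, rule-based sketch of the answer interpreter for yes/no and list-selection questions. The keyword lists and the `ignored` label are illustrative assumptions; the production interpreter is a learned model.

```python
# Toy user-answer interpreter: classifies a user reply with respect to
# the type of question the bot asked in the previous turn.
def interpret_answer(question_type: str, answer: str, options=()):
    text = answer.lower()
    words = set(text.replace(",", " ").replace(".", " ").split())
    if question_type == "yes_no":
        if words & {"yes", "sure", "yeah", "ok", "okay"}:
            return "yes"
        if words & {"no", "nope", "nah"}:
            return "no"
        return "ignored"  # the user ignored the question
    if question_type == "list_selection":
        for option in options:  # e.g., the animal names the bot offered
            if option.lower() in text:
                return option
        return "ignored"
    raise ValueError(f"unknown question type: {question_type}")

print(interpret_answer("yes_no", "sure, I want to hear about cheetahs"))
print(interpret_answer("list_selection", "the polar bear please",
                       options=("penguin", "polar bear")))
```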

Content Providers. Given the conversation history and the selected utterances so far in the current turn, content providers propose candidate utterances w.r.t. this context. They rely on different sources (e.g., news, knowledge graph) to extract relevant content, which includes both structured and unstructured data. Structured providers generate text for their candidate utterances using templates, while unstructured content is quoted verbatim with attribution. Content is expressed in different forms via dialogue acts such as statements, preference queries, navigation questions and quizzes. Some of these, referred to as conversational drivers, aim to proactively increase user engagement (e.g., focus changing, questions). Examples of such dialogue acts are provided in Table 1.

内容提供者。根据对话历史和当前轮次中已选择的话语,内容提供者会针对这一上下文提出候选话语。它们依赖不同的来源(例如新闻、知识图谱)来提取相关内容,这些内容包括结构化和非结构化数据。结构化提供者使用模板为其候选话语生成文本,而非结构化内容则直接引用并注明出处。内容通过对话行为(如陈述、偏好查询、导航问题和测验)以不同形式表达。其中一些被称为对话驱动者,旨在主动提高用户参与度(例如焦点转移、提问)。表1中提供了此类对话行为的示例。

Dialogue Manager. In each step of the bot composition loop, providers generate candidates for the next utterance to be appended to the response constructed so far. Given a set of candidates and the conversation context, utterance selection is performed by a learned DM. This step is repeated until the DM assesses that the response is a relevant and engaging bot response. In [38], the DM is implemented as an RNN encoder, trained in a supervised fashion (see Sec. 4.2). In this work, we develop DMs trained using RL.

对话管理器 (Dialogue Manager)。在机器人组合循环的每一步中,提供者会生成候选话语,这些话语将被附加到目前已构建的响应中。给定一组候选话语和对话上下文,话语选择由学习型对话管理器执行。此步骤会重复,直到对话管理器评估出响应是相关且引人入胜的机器人响应。在 [38] 中,对话管理器被实现为一个 RNN 编码器,以监督学习的方式进行训练(见第 4.2 节)。在本工作中,我们开发了使用强化学习 (RL) 训练的对话管理器。

Sentence Fusion. The output of the composition loop is a sequence of utterances, which still needs to be fused into a coherent bot response. Simple concatenation of the utterances typically results in cumbersome, unnatural, verbose responses, such as “On average, male lions weigh 420 lbs. On average, male lions are 3.9 feet tall. That means that lions are about as tall as a piano.” Sentence fusion combines the selected utterances into a single cohesive response [2, 10, 25], such as “On average, male lions weigh 420 lbs and are 3.9 feet tall. That means that they’re about as tall as a piano.” This module uses the following techniques: (a) pronominalization, (b) removal of repetitive context mentions and (c) introduction of a discourse marker between sentences. Our fusion model is based on LaserTagger [24], a sequence labeling architecture in which each token in the input text is classified to be either copied as-is, deleted or substituted with a phrase taken from a small, predefined vocabulary, typically containing pronouns and connectives.

句子融合。组合循环的输出是一系列话语,这些话语仍需融合成一个连贯的机器人响应。简单地将话语连接起来通常会导致冗长、不自然的响应,例如“平均而言,雄性狮子的体重为420磅。平均而言,雄性狮子的身高为3.9英尺。这意味着狮子的身高大约与钢琴相当。”句子融合将选定的句子组合成一个连贯的响应 [2, 10, 25],例如“平均而言,雄性狮子的体重为420磅,身高为3.9英尺。这意味着它们的身高大约与钢琴相当。”该模块使用以下技术:(a) 代词化,(b) 删除重复的上下文提及,以及 (c) 在句子之间引入话语标记。我们的融合模型基于LaserTagger [24],这是一种序列标记架构,其中输入文本中的每个Token被分类为直接复制、删除或用取自小型预定义词汇表的短语替换,该词汇表通常包含代词和连接词。
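A LaserTagger-style fusion step can be illustrated by applying per-token edit tags. The tag vocabulary (`KEEP` / `DELETE` / `SWAP:<phrase>`) and the hand-written tags below are simplified stand-ins for the trained tagger, shown on the lion example above.

```python
# Apply LaserTagger-style edit tags to fuse two concatenated utterances.
def apply_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(token)          # copy the token as-is
        elif tag == "DELETE":
            continue                   # drop repeated context mention
        elif tag.startswith("SWAP:"):
            out.append(tag.split(":", 1)[1])  # substitute from vocabulary
        else:
            raise ValueError(f"unknown tag: {tag}")
    return " ".join(out)

tokens = ["On", "average,", "male", "lions", "weigh", "420", "lbs.",
          "On", "average,", "male", "lions", "are", "3.9", "feet", "tall."]
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "SWAP:lbs",
        "DELETE", "DELETE", "DELETE", "DELETE", "SWAP:and are", "KEEP",
        "KEEP", "KEEP"]
print(apply_tags(tokens, tags))
# "On average, male lions weigh 420 lbs and are 3.9 feet tall."
```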

4 REINFORCEMENT LEARNING FOR THE DM

4 针对 DM 的强化学习

Dynamic composition is realized by Szpektor et al. [38] with supervised training of an RNN-based DM. This limits the construction of the next bot response to be myopic, as it is optimized for maximal immediate reward. However, since the main goal of the DM is to conduct complex, engaging multi-turn conversations, it should target the natural complexity of human-to-human conversations, which are typically not conducted in a myopic, turn-by-turn manner, but rather reflect some degree of look ahead and dynamic planning. For example, a conversation may comprise several steps leading to an intended goal, such as knowledge transfer; or to make a conversation more engaging, one might intersperse interesting facts throughout, or build tension towards an eventual resolution. Such capabilities require a bot be able to choose responses that lead toward such non-myopic ends, and adapt to user responses/queries by dynamically re-planning its trajectory accordingly.

Szpektor 等人 [38] 通过监督训练基于 RNN 的对话管理器 (DM) 实现了动态组合。这种方法限制了下一个机器人响应的构建,使其变得短视,因为它被优化为追求最大的即时奖励。然而,由于 DM 的主要目标是进行复杂、引人入胜的多轮对话,它应该瞄准人与人之间对话的自然复杂性,这些对话通常不是以短视的、逐轮的方式进行,而是反映了一定程度的预见性和动态规划。例如,对话可能包含多个步骤,以达到预期的目标,如知识传递;或者为了使对话更具吸引力,可能会在其中穿插有趣的事实,或者为最终解决方案制造紧张感。这些能力要求机器人能够选择导向这些非短视目标的响应,并通过动态重新规划其轨迹来适应用户的响应/查询。

To this end, we develop an RL framework for open-ended dialogue that builds on the dynamic composition architecture, and propose a number of concrete instantiations of it. In Sec. 4.1, we formulate the underlying MDP that captures the spirit of the content-provider decomposition. In Sec. 4.2, we discuss the use of the underlying supervised model for state representation. We then propose a two-step Q-learning approach in Sec. 4.3, with algorithmic variants motivated by specific properties of the MDP and our representations. We also develop an end-to-end Q-learning method in Sec. 4.4 that does not require the language encoders for state generation.

为此,我们开发了一个基于动态组合架构的开放式对话强化学习(RL)框架,并提出了该框架的多个具体实例。在4.1节中,我们构建了一个捕捉内容提供者分解精神的底层马尔可夫决策过程(MDP)。在4.2节中,我们讨论了使用底层监督模型进行状态表示的方法。随后,在4.3节中,我们提出了一种两步Q学习方法,其算法变体受到MDP特定属性和我们表示方法的启发。此外,在4.4节中,我们还开发了一种端到端的Q学习方法,该方法不需要语言编码器来生成状态。

4.1 MDP with a Stochastic Action Space

4.1 具有随机动作空间的 MDP

We begin with an MDP formulation of the DM problem upon which our RL methods operate. We assume a state space $\mathcal{X}$, action space $\mathcal{A}$, transition kernel $P$, reward function $R$, initial state distribution $\beta$ and discount factor $\gamma$, and aim to optimize the cumulative discounted return $J(\pi) := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t} \mid P,R,\beta,\pi]$, which captures the long-term value of the conversation. The RL DM policy $\pi$ is an action distribution conditioned on state $x\in\mathcal{X}$. An optimal DM, $\pi^{*}$, is found by solving $\max_{\pi\in\Pi} J(\pi)$, where $\Pi$ is the space of all DM policies. We discuss each of these elements in turn.

我们从 DM(对话管理)问题的 MDP(马尔可夫决策过程)公式化开始,我们的强化学习方法基于此进行操作。我们假设状态空间为 $\mathcal{X}$,动作空间为 $\mathcal{A}$,转移核为 $P$,奖励函数为 $R$,初始状态分布为 $\beta$,折扣因子为 $\gamma$,目标是优化累积折扣回报 $J(\pi) := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t} \mid P,R,\beta,\pi]$,它捕捉了对话的长期价值。RL DM 策略 $\pi$ 是基于状态 $x\in\mathcal{X}$ 的动作分布。通过求解 $\max_{\pi\in\Pi} J(\pi)$ 来找到最优的 DM 策略 $\pi^{*}$,其中 $\Pi$ 是所有 DM 策略的空间。我们依次讨论这些元素。
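As a concrete illustration of the objective $J(\pi)$, a single sampled conversation contributes the discounted sum of its per-utterance rewards. A minimal sketch (the reward values are made up):

```python
# One sampled trajectory's contribution to J(π): the discounted sum of
# the per-utterance rewards r_t, weighted by gamma^t.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical crowd-sourced ratings for three selected utterances:
print(discounted_return([1.0, 0.5, 1.0], gamma=0.9))  # ≈ 2.26
```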

Much RL research in dialogue management defines the state and action spaces to be the tokenized language space [1, 13, 19]. For instance, the state $x$ is the tokenized user-agent conversation history to that point, while an action $a$ is the DM’s output sentence (generated token by token). However, since the state and action spaces in this formulation are both combinatorial, even with a medium-sized vocabulary the corresponding tokenized spaces grow exponentially, thus making RL intractable. We handle the combinatorics of state space by leveraging state-of-the-art language models, such as RNNs or transformers, to encode conversation history $x$ with a $d$-dimensional embedding $\phi_{x}\in\mathbb{R}^{d}$ (see Sec. 4.2 for details), thus replacing the large discrete state space by the continuous embedding space $\mathbb{R}^{d}$.

在对话管理的强化学习研究中,许多工作将状态和动作空间定义为 Token 化的语言空间 [1, 13, 19]。例如,状态 $x$ 是到当前为止的用户-智能体对话历史的 Token 化表示,而动作 $a$ 是对话管理模块输出的句子(逐个 Token 生成)。然而,由于这种定义下的状态和动作空间都是组合性的,即使使用中等规模的词汇表,相应的 Token 化空间也会呈指数级增长,从而使强化学习变得难以处理。我们通过利用最先进的语言模型(如 RNN 或 Transformer)来处理状态空间的组合性问题,将对话历史 $x$ 编码为一个 $d$ 维的嵌入 $\phi_{x}\in\mathbb{R}^{d}$(详见第 4.2 节),从而用连续的嵌入空间 $\mathbb{R}^{d}$ 替代了庞大的离散状态空间。

We also differ from typical RL dialogue models in our treatment of the action space. Rather than a generative language model that directly outputs sentences, we leverage the dynamic composition framework (Sec. 3) to render the action space tractable. Specifically, at any turn, each content provider proposes a small set of utterances. This dynamic-composition, content-provider (DCCP) action decomposition ensures the DM policy need only score and select from a (relatively small) discrete set $\mathcal{A}_{x}$ at state $x$ , i.e. the set of candidate utterances. Note that by working at the utterance level rather than at the level of the full bot response (a fused concatenation of K such utterances), we remove the small-scale combinatorics of the action space, at the cost of extending the horizon of the RL problem. But this does not sacrifice the optimality of the policy.

我们在动作空间的处理上也与典型的强化学习(RL)对话模型不同。我们不是直接输出句子的生成式语言模型,而是利用动态组合框架(第3节)来使动作空间变得易于处理。具体来说,在每一轮对话中,每个内容提供者会提出一小部分话语。这种动态组合、内容提供者(DCCP)动作分解确保了对话管理(DM)策略只需在状态 $x$ 下从一个(相对较小的)离散集合 $\mathcal{A}_{x}$ 中评分和选择,即候选话语的集合。需要注意的是,通过在话语级别而不是完整机器人响应(K个话语的融合拼接)级别工作,我们消除了动作空间的小规模组合问题,但代价是延长了RL问题的时间范围。但这并不会牺牲策略的最优性。

Importantly, since the providers use potentially arbitrary logic to generate candidates, the realization of $\mathcal{A}_{x}$ may differ with each occurrence of $x$ . This puts us in the realm of non-standard MDPs with stochastic action sets [3]. Fortunately, we consider Q-learning methods below that handle this directly.

重要的是,由于提供者可能使用任意的逻辑来生成候选,$\mathcal{A}_{x}$ 的实现可能会随着每次 $x$ 的出现而不同。这使我们进入了具有随机动作集的非标准 MDP (Markov Decision Process) 领域 [3]。幸运的是,我们在下面考虑的 Q-learning 方法可以直接处理这种情况。

The training data, further described in Sec. 4.5, is composed of crowd-sourced conversations, generated with a supervised DM. The human evaluators provide a rating for each selected utterance, which are used as rewards that measure the rater’s immediate value of the action given the conversation history. As we shall see, the RL model is able to leverage these to learn dynamic planning strategies to improve user engagement.

训练数据(在第4.5节中进一步描述)由众包对话组成,这些对话是通过有监督的对话管理器(DM)生成的。人类评估者为每个选定的语句提供评分,这些评分被用作奖励,用于衡量评估者在给定对话历史的情况下对动作的即时价值。正如我们将看到的,强化学习(RL)模型能够利用这些奖励来学习动态规划策略,从而提高用户参与度。

In the existing dialogue RL literature [6, 19, 22] most algorithms are based on policy-gradient methods [37] because (i) learning an optimal state-action value function with a combinatorial action space is intractable, and (ii) the resulting DM is a sequence-tosequence policy model that generates bot responses. The simplification of the action space afforded by our DCCP reformulation allows us to use value-based Q-learning [3, 27], a relatively uncommon approach in large-scale dialogue systems.

在现有的对话强化学习文献 [6, 19, 22] 中,大多数算法都基于策略梯度方法 [37],因为 (i) 在组合动作空间中学习最优状态-动作值函数是不可行的,(ii) 生成的对话管理器 (DM) 是一个序列到序列的策略模型,用于生成机器人的响应。我们的 DCCP 重构简化了动作空间,使我们能够使用基于值的 Q 学习 [3, 27],这是在大规模对话系统中相对不常见的方法。

4.2 Supervised Model as a State Encoder

4.2 监督模型作为状态编码器

To encode the conversation history into a $d$ -dimensional embedding vector, we consider two supervised learning architectures: an improved version of the RNN model from [38] and a transformerbased approach using a pre-trained BERT model.

为了将对话历史编码为 $d$ 维嵌入向量,我们考虑了两种监督学习架构:一种是从 [38] 改进的 RNN 模型,另一种是基于预训练 BERT 模型的 Transformer 方法。

Supervised RNN Architecture. We modify the two-level hierarchical RNN encoder [38] by replacing the first-level gated recurrent unit (GRU), that encodes the user and bot utterances, by a pre-trained sentence embedding module [5], which is fixed during training. These embeddings provide a clear advantage over the first-level GRU, which we attribute to training on a large, general corpus. The resulting sentence embeddings are fed to a GRU along with non-textual metadata features, including the conversation turn index, the candidate dialogue act, the number of tokens in the constructed response, and whether the candidate offers to change the focus.

监督RNN架构。我们通过将第一层的门控循环单元(GRU)替换为预训练的句子嵌入模块 [5] 来修改两级分层RNN编码器 [38],该模块在训练期间是固定的。这些句子嵌入模块相比第一层GRU具有明显优势,我们将其归因于在大型通用语料库上的训练。生成的句子嵌入与非文本元数据特征一起输入到GRU中,这些特征包括:对话轮次索引、候选对话行为、构建响应中的Token数量、候选是否提供改变焦点的选项。

Supervised Transformer-based Architecture. Our second supervised model uses a transformer architecture [40]. We consider two variants: a text-only BERT model and a combined text-metadata model. In both, the input is the concatenated sequence of user and bot utterances (i.e., conversation history) and a candidate utterance to be scored. In the second, we concatenate the per-token contextual representation vectors produced by the BERT model and an embedded representation of the metadata features for the utterance of which the token is a part. The resulting concatenated vectors are fed into another small transformer. Experiments on manually annotated data show the text-only variant outperforms the second, corroborating the hypothesis that the transformer’s better use of the input text—specifically, the ability to attend to the history when processing candidates—obviates the need for additional features.

基于监督学习的 Transformer 架构。我们的第二个监督模型使用了 Transformer 架构 [40]。我们考虑了两种变体:一种是仅文本的 BERT 模型,另一种是结合文本和元数据的模型。在这两种模型中,输入都是用户和机器人对话的拼接序列(即对话历史)以及待评分的候选对话。在第二种模型中,我们将 BERT 模型生成的每个 Token 的上下文表示向量与 Token 所属对话的元数据特征的嵌入表示进行拼接。生成的拼接向量被输入到另一个小型 Transformer 中。在手动标注数据上的实验表明,仅文本的变体优于第二种模型,这验证了 Transformer 更好地利用输入文本(特别是在处理候选对话时能够关注历史对话)的假设,从而无需额外的特征。

State Representation. Our state representation uses the output of the dialogue-history encoder, either the RNN hidden state or the pooled output of the last transformer layer. The state is constructed by concatenating this encoding with the sentence embedding of the last user input and context features, e.g., the conversation turn index and the composition turn step index.

状态表示。我们的状态表示使用对话历史编码器的输出,即 RNN 的隐藏状态或最后一个 Transformer 层的池化输出。状态是通过将此编码与最后一个用户输入的句子嵌入以及上下文特征(例如,对话轮次索引和组合轮次步骤索引)连接起来构建的。
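The state construction described above can be sketched as a simple concatenation. The embedding dimensions and the two context features below are illustrative assumptions, not the deployed configuration.

```python
import numpy as np

# RL state = dialogue-history encoding ++ last user-utterance embedding
# ++ scalar context features (turn index, composition step index).
def build_state(history_encoding, user_embedding, turn_index, step_index):
    context = np.array([turn_index, step_index], dtype=np.float32)
    return np.concatenate([history_encoding, user_embedding, context])

# Hypothetical dimensions: 256-d history encoding, 128-d sentence embedding.
state = build_state(np.zeros(256, np.float32), np.zeros(128, np.float32),
                    turn_index=4, step_index=1)
print(state.shape)  # (386,)
```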

Figure 3: Two-step Q-learning schema. The state is the concatenation of (i) the output of the dialogue history encoder, either RNN or transformer (denoted by (1)), (ii) the embedding of the last user input and (iii) context features, including the conversation turn index and the composition turn step index. The action is represented by its embedding. We consider both stochastic-action and continuous-action Q-learning approaches (denoted by (2)), potentially with the added CQL regularization (denoted by (3)).

图 3: 两步 Q-learning 架构。状态由以下三部分拼接而成:(i) 对话历史编码器(RNN 或 Transformer)的输出(记为 (1)),(ii) 最后用户输入的嵌入表示,以及 (iii) 上下文特征,包括对话轮次索引和组合轮次步骤索引。动作由其嵌入表示。我们考虑了随机动作和连续动作 Q-learning 方法(记为 (2)),并可能增加 CQL 正则化(记为 (3))。

4.3 The Two Step Q-model Architecture

4.3 两步 Q 模型架构

We now develop several RL approaches for the DM which rely on Q-learning. Our first approaches use a two-step model in which the state is first encoded by a language model (either a pre-trained RNN or transformer) before being passed to the DM policy. Figure 3 illustrates how these building blocks come together in the two-step approach. Given a pre-trained state encoder $\phi:\mathcal{X}\to\mathbb{R}^{d}$ and a sentence encoder $\psi:\mathcal{A}\to\mathbb{R}^{h}$, we apply two different Q-learning techniques using the encoded state space (using $\phi_{x}$ rather than $x$) and action space (using $\psi_{a}$ rather than $a$).

我们现在为 DM 开发了几种基于 Q-learning 的强化学习方法。我们的第一种方法使用两步模型,其中状态首先由语言模型(预训练的 RNN 或 Transformer)编码,然后再传递给 DM 策略。图 3 展示了这些构建块如何在两步方法中结合。给定一个预训练的状态编码器 $\phi:\mathcal{X}\to\mathbb{R}^{d}$ 和一个句子编码器 $\psi:\mathcal{A}\to\mathbb{R}^{h}$,我们使用编码后的状态空间(使用 $\phi_{x}$ 而不是 $x$)和动作空间(使用 $\psi_{a}$ 而不是 $a$)应用两种不同的 Q-learning 技术。

Stochastic Action Q-learning (SAQL) [3]. Our first RL technique applies Q-learning directly to the discrete, stochastic action sets $\mathcal{A}_{x}$ as determined by the DCCP decomposition. We adopt the general deep Q-network (DQN) approach [27], using a DNN to represent the Q-function. Specifically, $Q_{\theta}:\mathbb{R}^{d}\times\mathbb{R}^{h}\rightarrow\mathbb{R}$ is a feed-forward DNN with parameters $\theta$, which represents the cumulative discounted value of taking action (or bot utterance) $\psi_{a}\in\mathbb{R}^{h}$ in state (i.e., conversation history encoding) $\phi_{x}\in\mathbb{R}^{d}$. The parameters are trained by minimizing the Bellman error over a batch $B$ of transitions $(x_{i},a_{i},r_{i},x_{i}^{\prime})$:

随机动作 Q 学习 (Stochastic Action Q-learning, SAQL) [3]。我们的第一个强化学习技术直接将 Q 学习应用于由 DCCP 分解确定的离散随机动作集 $\mathcal{A}_{x}$。我们采用通用的深度 Q 网络 (Deep Q Network, DQN) 方法 [27],使用深度神经网络 (DNN) 来表示 Q 函数。具体来说,$Q_{\theta}:\mathbb{R}^{d}\times\mathbb{R}^{h}\rightarrow\mathbb{R}$ 是一个具有参数 $\theta$ 的前馈 DNN,它表示在状态(即对话历史编码)$\phi_{x}\in\mathbb{R}^{d}$ 中采取动作(或机器人话语)$\psi_{a}\in\mathbb{R}^{h}$ 的累积折扣价值。参数通过在转移样本批次 $B$(样本为 $(x_{i},a_{i},r_{i},x_{i}^{\prime})$)上最小化 Bellman 误差来训练:

$$
\min_{\theta}\sum_{i=1}^{|B|}\Big(Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})-r_{i}-\gamma\max_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\Big)^{2},
$$

$$
\min_{\theta}\sum_{i=1}^{|B|}\Big(Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})-r_{i}-\gamma\max_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\Big)^{2},
$$

where $Q_{\theta^{\mathrm{target}}}$ is a target Q-function, used to improve training stability in DQN [26] (note the use of the realized action set $\mathcal{A}_{x_{i}^{\prime}}$ in the maximization). Under this loss, RL is $\ell_{2}$-regression of $Q_{\theta}$ w.r.t. target labels $r_{i}+\gamma\max_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})$, which tries to match the value function $Q_{\theta}$ with its Bellman backup.

其中 $Q_{\theta^{\mathrm{target}}}$ 是目标 $Q$ 函数,用于提高 DQN [26] 的训练稳定性(注意在最大化时使用了实现的动作集 $\mathcal{A}_{x_{i}^{\prime}}$)。在此损失下,RL 是 $Q_{\theta}$ 相对于目标标签 $r_{i}+\gamma\operatorname*{max}_{a^{\prime}\in\mathcal{A}_{x_{i}^{\prime}}}Q_{\theta^{\mathrm{target}}}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})$ 的 $\ell_{2}$ 回归,试图将值函数 $Q_{\theta}$ 与其 Bellman 备份匹配。

We refer to this approach as stochastic action $Q$-learning (SAQL) to reflect the stochastic action sets used in training. Once SAQL converges, the DM policy is $\pi^{*}(x)\in\arg\operatorname*{max}_{a\in\mathcal{A}_{x}}Q_{\theta^{*}}(\phi_{x},\psi_{a})$. That is, at inference time, the $Q$-model is applied to each candidate action, and the DM responds with the action with the greatest Q-value given the current dialogue state.

我们将这种方法称为随机动作 Q 学习 (Stochastic Action Q-learning, SAQL),以反映训练中使用的随机动作集。一旦 SAQL 收敛,DM 策略为 $\pi^{*}(x)\in\arg\operatorname*{max}_{a\in\mathcal{A}_{x}}Q_{\theta^{*}}(\phi_{x},\psi_{a})$。也就是说,在推理时,Q 模型应用于每个候选动作,DM 根据当前对话状态返回具有最大 Q 值的动作。
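For illustration, one term of the SAQL loss above can be sketched with a toy linear $Q$-function; the dimensions, weights, and transition below are hypothetical stand-ins for the paper's DNN and logged data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 8  # state / action embedding sizes (illustrative)

# Toy linear Q-function: Q_theta(phi_x, psi_a) = w . [phi_x; psi_a]
w = rng.normal(size=d + h) * 0.1
w_target = w.copy()  # target-network weights, synced periodically in DQN

def q(weights, phi_x, psi_a):
    return float(np.concatenate([phi_x, psi_a]) @ weights)

def saql_td_target(r, gamma, phi_x_next, candidates):
    # Maximize only over the *realized* stochastic action set A_{x'}
    return r + gamma * max(q(w_target, phi_x_next, psi) for psi in candidates)

# One logged transition (x, a, r, x') plus the next turn's candidate utterances
phi_x, psi_a = rng.normal(size=d), rng.normal(size=h)
phi_x_next = rng.normal(size=d)
next_candidates = [rng.normal(size=h) for _ in range(5)]

y = saql_td_target(r=1.0, gamma=0.95, phi_x_next=phi_x_next, candidates=next_candidates)
td_sq_error = (q(w, phi_x, psi_a) - y) ** 2  # one summand of the l2-regression loss
```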

Continuous Action Q-learning (CAQL) [29]. In the SAQL formulation, action maximization takes place over the discrete set of candidates $\mathcal{A}_{x}$. However, the embedding representation means that the action space can also be treated as continuous, and we can consider maximization over this wider space. Continuous-action RL problems are common in areas like robotics [15], and typically policy-gradient algorithms are used to learn a return-maximizing policy [34, 37]. However, such methods are often data-inefficient and impractical when faced with high-dimensional action spaces, both issues present in dialogue systems. Instead, we consider the use of continuous action $Q$-learning (CAQL) [29] to solve the continuous-action variant of our DM policy.

连续动作 Q 学习 (Continuous Action Q-learning, CAQL) [29]。在 SAQL 的公式中,动作最大化是在离散候选集 $\mathcal{A}_{x}$ 上进行的。然而,嵌入表示意味着动作空间也可以被视为连续的,我们可以考虑在这个更广泛的空间上进行最大化。连续动作的强化学习问题在机器人等领域很常见 [15],通常使用策略梯度算法来学习回报最大化的策略 [34, 37]。然而,面对高维动作空间时,这些方法通常数据效率低下且不切实际,而对话系统中也存在这些问题。相反,我们考虑使用连续动作 Q 学习 (CAQL) [29] 来解决我们的对话管理策略的连续动作变体。

Roughly speaking, using CAQL, when faced with a next state $x^{\prime}$ while training the Q-function $Q_{\theta}$, we do not restrict ourselves to maximizing over the discrete action set $\mathcal{A}(x^{\prime})$, but instead maximize over the entire embedding space $\psi$, minimizing:

粗略地说,使用 $\mathrm{CAQL}$ 时,在训练 Q 函数 $Q_{\theta}$ 时,面对下一个状态 $x^{\prime}$,我们不会限制自己在离散动作集 $\mathcal{A}(x^{\prime})$ 上最大化,而是在整个嵌入空间 $\psi$ 上最大化,最小化:

$$
\operatorname*{min}_{\theta}\sum_{i=1}^{|B|}\Big(r_{i}+\gamma Q_{\theta^{\mathrm{target}}}\big(\phi_{x_{i}^{\prime}},\arg\operatorname*{max}_{\psi}Q_{\theta}(\phi_{x_{i}^{\prime}},\psi)\big)-Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})\Big)^{2}.
$$

This approach has advantages over SAQL: one need not record the realization $\mathcal{A}_{x^{\prime}}$ of the stochastic action sets in the data set, and continuous action maximization (see below) can be more effective when the set of candidate actions (utterances) is moderate or large in size. However, CAQL will generally overestimate the true value of its policy, since it hypothesizes the use of embedded actions that are never generated by any content provider. Indeed, once $Q_{\theta}$ is trained using CAQL, we restrict the realized policy to scoring (and using) only provider-generated candidate actions at inference/serving time.

这种方法相较于 SAQL 具有优势:无需在数据集中记录随机动作集 $\mathcal{A}_{x^{\prime}}$ 的实现,并且在候选动作(话语)集规模适中或较大时,连续动作最大化(见下文)可能更为有效。然而,CAQL 通常会高估其策略的真实价值,因为它假设使用了任何内容提供商都未生成的嵌入动作。实际上,一旦使用 CAQL 训练了 $Q_{\theta}$,我们在推理/服务时会将实现的策略限制为仅对提供商生成的候选动作进行评分(和使用)。

When $Q_{\theta}$ is represented by a DNN, the inner maximization is typically differentiable but non-convex. It can be solved optimally for certain classes of DNNs using a mixed-integer program, or approximately using a first-order method such as gradient ascent (GA) [29]. We use GA in this work: starting from an initial embedded action $\psi^{(0)}$, the optimal embedded action $\arg\operatorname*{max}_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)$ is computed iteratively by $\psi^{(t+1)}\gets\psi^{(t)}+\epsilon_{\mathrm{GA}}\nabla_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)\big|_{\psi=\psi^{(t)}}$, where $\epsilon_{\mathrm{GA}}>0$ is a tunable step size.

当 $Q_{\theta}$ 由深度神经网络 (DNN) 表示时,内部最大化问题通常是可微但非凸的。对于某些类别的 DNN,可以使用混合整数规划来最优地求解,或使用一阶方法(如梯度上升 (GA) [29])来近似求解。在本工作中,我们使用 GA:从初始嵌入动作 $\psi^{(0)}$ 开始,通过迭代 $\psi^{(t+1)}\gets\psi^{(t)}+\epsilon_{\mathrm{GA}}\nabla_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)\big|_{\psi=\psi^{(t)}}$ 来计算最优嵌入动作 $\arg\operatorname*{max}_{\psi}Q_{\theta}(\phi_{x^{\prime}},\psi)$,其中 $\epsilon_{\mathrm{GA}}>0$ 是可调的步长。
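The GA inner maximization can be sketched on a toy concave $Q$ (a hypothetical quadratic, chosen so the maximizer is known in closed form); the paper's $Q_{\theta}$ is a DNN, for which GA only finds a local optimum.

```python
import numpy as np

rng = np.random.default_rng(1)
d = h = 4
M = rng.normal(size=(h, d))  # defines the toy Q below (hypothetical)
phi = rng.normal(size=d)     # next-state embedding phi_{x'}

def q(phi_x, psi):
    # Concave toy Q with known maximizer psi* = M @ phi_x
    return -float(np.sum((psi - M @ phi_x) ** 2))

def grad_psi(phi_x, psi):
    return -2.0 * (psi - M @ phi_x)

# CAQL inner maximization: gradient ascent over the action embedding psi
psi, eps_ga = np.zeros(h), 0.1
for _ in range(200):
    psi = psi + eps_ga * grad_psi(phi, psi)  # psi^(t+1) = psi^(t) + eps * grad Q
```

For this concave toy objective, the iterates converge geometrically to the known maximizer $M\phi$.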

Conservative Q-learning (CQL) [17]. Our DM problem is an application of offline RL, where a model is learned using previously collected user-bot conversations with no further (online) interaction. Offline RL is prone to overestimation errors induced by the distributional shift between the offline data and that generated by the learned policy [43]. This is especially problematic if certain bot actions are rare in the offline data, making their learned $Q$-values very noisy. To alleviate this, we can apply conservative $Q$-learning (CQL) [17], a regularization scheme which learns a “conservative”

保守 Q 学习 (CQL) [17]。我们的 DM 问题是一个离线强化学习的应用,其中模型是通过之前收集的用户-机器人对话进行学习的,没有进一步的(在线)交互。离线强化学习容易受到离线数据与学习策略生成的数据之间的分布偏移引起的过高估计误差的影响 [43]。如果某些机器人动作在离线数据中很少见,这尤其成问题,因为它们学习到的 $Q$ 值会非常嘈杂。为了缓解这个问题,我们可以应用保守 $Q$ 学习 (CQL) [17],这是一种正则化方案,它学习一个“保守的”

$Q$-function that lower bounds the true Q-function. CQL can be applied to both SAQL and CAQL (we illustrate it only for SAQL). In CQL one augments the Q-learning loss with a behavior regularizer:
$$
\operatorname*{min}_{\theta}\sum_{i=1}^{|B|}\alpha\big(\mathbb{E}_{a\sim\mu}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]-\mathbb{E}_{a\sim\pi_{\beta}}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]\big)+\Big(r_{i}+\gamma Q_{\theta^{\mathrm{target}}}\big(\phi_{x_{i}^{\prime}},\arg\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x_{i}^{\prime})}Q_{\theta}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\big)-Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})\Big)^{2},
$$
where $\pi_{\beta}$ is a behavior policy (DM) that approximates the data-generating policy, $\alpha>0$ is a tunable regularization parameter, and $\mu$ is the target policy to be learned. Intuitively, the CQL regularization minimizes the difference between the Q-values of actions generated by our learned RL DM policy and those of the behavior (training-data generating) policy. We use target $\mu(a|x)\propto\exp(Q_{\theta}(\phi_{x},\psi_{a}))$, which corresponds to the optimal policy of entropy-regularized Q-learning [31].

$Q$ 函数,它是真实 Q 函数的下界。CQL 可以应用于 SAQL 和 CAQL(我们仅以 SAQL 为例进行说明)。在 CQL 中,我们通过行为正则化器来增强 Q 学习损失:
$$
\operatorname*{min}_{\theta}\sum_{i=1}^{|B|}\alpha\big(\mathbb{E}_{a\sim\mu}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]-\mathbb{E}_{a\sim\pi_{\beta}}[Q_{\theta}(\phi_{x_{i}},\psi_{a})]\big)+\Big(r_{i}+\gamma Q_{\theta^{\mathrm{target}}}\big(\phi_{x_{i}^{\prime}},\arg\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x_{i}^{\prime})}Q_{\theta}(\phi_{x_{i}^{\prime}},\psi_{a^{\prime}})\big)-Q_{\theta}(\phi_{x_{i}},\psi_{a_{i}})\Big)^{2},
$$
其中 $\pi_{\beta}$ 是近似数据生成策略的行为策略 (DM),$\alpha>0$ 是一个可调的正则化参数,$\mu$ 是要学习的目标策略。直观上,CQL 正则化最小化了由我们学习的 RL DM 策略生成的动作与行为(训练数据生成)策略生成的动作之间的 Q 值差异。我们使用目标 $\mu(a|x)\propto\exp(Q_{\theta}(\phi_{x},\psi_{a}))$,它对应于熵正则化 Q 学习的最优策略 [31]。
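The CQL regularizer over a finite candidate set can be sketched as follows; the Q-values are made up, `mu` is the softmax target policy $\mu(a|x)\propto\exp(Q)$, and the behavior expectation is approximated here by a point mass on the single logged action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def cql_penalty(q_values, logged_idx, alpha=1.0):
    """alpha * (E_{a~mu}[Q] - E_{a~pi_beta}[Q]) over one state's candidate set."""
    mu = softmax(q_values)                # target policy mu(a|x) proportional to exp(Q)
    e_mu = float(mu @ q_values)           # expectation under mu
    e_beta = float(q_values[logged_idx])  # point-mass approximation of pi_beta
    return alpha * (e_mu - e_beta)

q_vals = np.array([1.0, 2.0, 0.5])  # made-up Q-values for 3 candidates
# The penalty pushes Q down on actions mu favors relative to what the data took:
pen_logged_best = cql_penalty(q_vals, logged_idx=1)   # logged action has max Q
pen_logged_worst = cql_penalty(q_vals, logged_idx=2)  # logged action has min Q
```

When the logged action already has the highest Q-value the penalty is negative (no push-down); when the data took a low-valued action the penalty is positive, shrinking the Q-values of unseen actions.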

4.4 End-to-end Architecture

4.4 端到端架构

We now outline an end-to-end (E2E) RL approach that jointly trains the language encoder and the $Q$-function. In contrast to our two-step approaches, by not constraining the DM to using a pre-trained encoder, E2E RL can tune the encoder (hence its representations) to the dialogue task at hand. This approach is similar in spirit to the original DQN model [27], in which the $Q$-network consists of both a convolutional DNN that encodes pixel frames (states) and a feed-forward NN that learns the Q-values.

我们现在概述一种端到端 (E2E) 强化学习方法,该方法联合训练语言编码器和 $\boldsymbol{Q}$ 函数。与我们的两步方法不同,E2E RL 不限制对话管理器使用预训练的编码器,因此可以针对当前对话任务调整编码器(从而调整其表示)。这种方法在精神上与原始的 DQN 模型 [27] 相似,其中 $\boldsymbol{Q}$ 网络由编码像素帧(状态)的卷积 DNN 和学习 Q 值的前馈神经网络组成。

To learn the $Q$-function in E2E fashion, we apply DQN to $Q(x,a)=Q_{\theta}(c(x,a))$, where $c(x,a)$ is the concatenation of the conversation history and the current candidate action, and $Q_{\theta}:X\rightarrow\mathbb{R}$ is a trainable language encoder (e.g., a transformer initialized with pre-trained weights), followed by a feed-forward DNN. This $Q$-model jointly encodes the raw input conversation and assigns a $Q$-value to each candidate action. This allows us to learn the $Q$-function E2E, without relying on fixed pre-trained language encoders. Specifically, with the target network $Q_{\theta^{\mathrm{target}}}$ updated as above, in E2E learning we train $Q_{\theta}$ by minimizing the mean squared Bellman error. We use SAQL and formulate the inner maximization as $\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x^{\prime})}Q_{\theta^{\mathrm{target}}}(x^{\prime},a^{\prime})$.

为了以端到端 (E2E) 的方式学习 $Q$ 函数,我们将 DQN 应用于 $Q(x,a)=Q_{\theta}(c(x,a))$,其中 $c(x,a)$ 是对话历史和当前候选动作的连接,$Q_{\theta}:X\rightarrow\mathbb{R}$ 是一个可训练的语言编码器(例如,使用预训练权重初始化的 Transformer),后接一个前馈深度神经网络 (DNN)。这个 $Q$ 模型联合编码原始输入对话,并为每个候选动作分配一个 $Q$ 值。这使得我们能够以端到端的方式学习 $Q$ 函数,而不依赖于固定的预训练语言编码器。具体来说,在目标网络 $Q_{\theta^{\mathrm{target}}}$ 如上所述更新的情况下,在端到端学习中,我们通过最小化均方贝尔曼误差来训练 $Q_{\theta}$。我们使用 SAQL 并将内部最大化问题表述为 $\operatorname*{max}_{a^{\prime}\in\mathcal{A}(x^{\prime})}Q_{\theta^{\mathrm{target}}}(x^{\prime},a^{\prime})$。
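A minimal sketch of this E2E scoring scheme, with a deterministic bag-of-trigrams stand-in for the trainable transformer encoder (the encoder, dimensions, and candidate utterances are all illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 64  # encoder output size (illustrative)

def encode(text):
    # Deterministic bag-of-character-trigrams "encoder": a stand-in for the
    # trainable transformer; NOT part of the paper's method.
    v = np.zeros(D)
    for i in range(len(text) - 2):
        v[sum(ord(c) for c in text[i:i + 3]) % D] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# Feed-forward head on top of the encoding of the concatenation c(x, a)
W1 = rng.normal(size=(D, 16)) * 0.1
w2 = rng.normal(size=16) * 0.1

def q_e2e(history, candidate):
    z = np.maximum(encode(history + " [SEP] " + candidate) @ W1, 0.0)  # ReLU layer
    return float(z @ w2)

candidates = ["Lions roar loudly.", "Do you want a quiz?", "Elephants are big."]
scores = [q_e2e("Tell me about lions", a) for a in candidates]
best = candidates[int(np.argmax(scores))]  # the DM's selected utterance
```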

4.5 Training Data

4.5 训练数据

The DM models are trained on crowd-sourced data, generated by human evaluators. Each evaluator converses with the bot until the dialogue derails or comes to a natural end. They then rate the bot responses, assessing each utterance in the composition loop, including those selected and unselected by the DM. Although evaluators were provided with a set of guidelines for assessing bot response quality, the resulting data is noisy and some level of rater-specific subjectivity is included in the ratings.

DM模型在由人类评估者生成的众包数据上进行训练。每个评估者与机器人对话,直到对话脱轨或自然结束。然后他们对机器人的回答进行评分,评估组合循环中的每个话语,包括被DM选中和未选中的那些。尽管评估者被提供了一套评估机器人回答质量的指南,但生成的数据仍然存在噪声,评分中包含了一定程度的评估者主观性。

A dozen evaluators generated ${\sim}20\mathrm{K}$ conversations with an average of 3 bot responses, each with 1 to 4 utterances, and with up to 30 candidates per utterance. For the supervised models, this results in ${\sim}1.5\mathrm{M}$ training examples, as each (selected and unselected) candidate corresponds to a training example. By contrast, RL models only use the labels on selected candidates, giving 150K labels.

十几名评估者生成了约 20K 次对话,平均每次对话有 3 个机器人响应,每个响应包含 1 到 4 句话,每句话最多有 30 个候选回复。对于监督模型来说,这产生了约 1.5M 个训练样本,因为每个(选中和未选中的)候选回复都对应一个训练样本。相比之下,强化学习模型仅使用选中候选回复的标签,因此有 150K 个标签。
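A back-of-envelope check ties these counts together; the per-response utterance and per-utterance candidate averages below are assumptions inferred from the stated ranges, not reported values:

```python
# Counts reported in Sec. 4.5; the two averages are assumptions chosen to be
# consistent with the stated ranges ("1 to 4 utterances", "up to 30 candidates").
conversations = 20_000
responses_per_conv = 3
utterances_per_response = 2.5   # assumed average in [1, 4]
candidates_per_utterance = 10   # assumed average, at most 30

rl_labels = conversations * responses_per_conv * utterances_per_response
supervised_examples = rl_labels * candidates_per_utterance  # every candidate labeled
```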

Each candidate utterance is rated on a scale of -3 to 7, with no 0 rating made available. The negative ratings reflect candidates that do not reply to a user question, are out of context, or repeat content that was already mentioned in the conversation. The positive scores correspond to candidates that fit the conversation context well.

每个候选话语的评分范围为-3到7,不提供0分。负分表示候选话语没有回应用户问题、脱离上下文或重复对话中已提及的内容。正分则表示候选话语与对话上下文契合良好。

5 INITIAL OFFLINE & ONLINE EVALUATION

5 初始离线与在线评估

Before deploying our models in a live experiment, we conducted preliminary evaluation of our RL-based DM policies. We describe both (i) off-policy counterfactual inference [21, 39] evaluation and (ii) on-policy human (rater) evaluation. Off-policy evaluation can be performed on the existing datasets used to train our models (in our case, generated with the supervised DM). While often easier—hence especially useful for initial model development and tuning—it is less reliable than on-policy evaluation. So we use both methods.

在将我们的模型部署到实际实验之前,我们对基于强化学习 (RL) 的对话管理 (DM) 策略进行了初步评估。我们描述了 (i) 离策略反事实推理 [21, 39] 评估和 (ii) 在策略人类(评分者)评估。离策略评估可以在用于训练我们模型的现有数据集上进行(在我们的案例中,使用监督式 DM 生成)。虽然通常更容易——因此特别适用于初始模型开发和调优——但它比在策略评估的可靠性较低。因此,我们同时使用了这两种方法。

DM Models. We evaluate the following variants of our SAQL and CAQL algorithms, using either the supervised RNN or transformer for state representation, with or without CQL regularization. SAQL-RNN, CAQL-RNN, SAQL-Transformer, and CAQL-Transformer are the Q-learned models trained with the two-step RL approaches SAQL and CAQL using RNN and transformer encoders, respectively. SAQL-Reg-RNN, CAQL-Reg-RNN, SAQL-Reg-Transformer, and CAQL-Reg-Transformer are the same Q-learned models trained with CQL regularization. SAQL-Reg-E2E denotes the E2E Q-learned model with CQL regularization.

DM 模型。我们评估了以下 SAQL 和 CAQL 算法的变体,使用监督 RNN 或 Transformer 进行状态表示,带有或不带有 CQL 正则化。SAQL-RNN、CAQL-RNN、SAQL-Transformer 和 CAQL-Transformer 是使用两步 RL 方法 SAQL 和 CAQL 分别通过 RNN 和 Transformer 编码器训练的 Q-learning 模型。SAQL-Reg-RNN、CAQL-Reg-RNN、SAQL-Reg-Transformer 和 CAQL-Reg-Transformer 是使用 CQL 正则化训练的相同 Q-learning 模型。SAQL-Reg-E2E 表示带有 CQL 正则化的端到端 Q-learning 模型。

The RNN architecture includes a GRU layer with 200 units. The supervised transformer model uses the publicly available, pre-trained BERT-Medium checkpoint. The training regime roughly follows that for BERT. For RL models, we use a discount factor $\gamma=0.95$ for SAQL, $\gamma=0.9$ for CAQL and CQL-Reg, and $\gamma=0.8$ for E2E RL, and a fully connected feed-forward DNN for the $Q$-function. Hyper-parameters are provided in the appendix.

RNN 架构包括一个具有 200 个单元的 GRU 层。监督式 Transformer 模型使用了公开可用的预训练 BERT-Medium 检查点。训练方案大致遵循 BERT 的训练方案。对于强化学习模型,我们为 SAQL 使用折扣因子 $\gamma=0.95$,为 CAQL 和 CQL-Reg 使用 $\gamma=0.9$,为 E2E RL 使用 $\gamma=0.8$,并为 $Q$ 函数使用全连接前馈 DNN。超参数在附录中提供。

DM Off-policy Evaluation. The main goal of off-policy evaluation is to assess the performance of our RL-based DM policy using existing conversational data. Unlike (supervised) myopic models, RL requires evaluating the reward of full trajectories. Since an RL-based DM can drive sequential conversations that follow a very different distribution than that of the training data, off-policy correction using propensity scoring or related methods is needed [39]. We use DualDICE [28] for this purpose, a recent SOTA method for off-policy estimation of RL policy values that directly estimates the stationary distribution correction ratio, i.e., the ratio of the steady-state probabilities of specific state-action pairs $(\phi_{x_{i}},\psi_{a_{i}})$ generated by the RL policy $\pi$ and the data-generating (or behavior) policy $\pi_{B}$ (which can be estimated from the training data). We provide a high-level overview.

DM 离策略评估。离策略评估的主要目标是利用现有的对话数据评估我们基于强化学习 (RL) 的 DM 策略的性能。与(监督式)短视模型不同,RL 需要评估完整轨迹的奖励。由于基于 RL 的 DM 可以驱动与训练数据分布截然不同的序列对话,因此需要使用倾向评分或相关方法进行离策略校正 [39]。为此,我们使用 DualDICE [28],这是一种用于 RL 策略值离策略估计的最新 SOTA 方法,它直接估计稳态分布校正比率,即由 RL 策略 $\pi$ 和数据生成(或行为)策略 $\pi_{B}$(可以从训练数据中估计)生成的特定状态-动作对 $(\phi_{x_{i}},\psi_{a_{i}})$ 的稳态概率比率。我们提供一个高层次的概述。

Given a batch $B=\{(\phi_{x_{i}},\psi_{a_{i}},r_{i},\phi_{x_{i}^{\prime}})\}_{i}$ of (embedded conversation history) training data and a DM policy $\pi$, DualDICE learns a feed-forward DNN $\nu_{\rho}:\mathbb{R}^{d}\times\mathbb{R}^{h}\to\mathbb{R}$, parameterized by $\rho$, where $\nu_{\rho}(\phi_{x_{i}},\psi_{a_{i}})$ is a proto-value function whose Bellman residuals are estimates of the required stationary distribution ratios [28]. Given a trained $\nu_{\rho^{*}}$, the value of the RL-based DM’s policy $\pi$ can be estimated by
$$
J_{\mathrm{DD}}(\pi):=\sum_{i=1}^{|B|}r_{i}\cdot\Big(\nu_{\rho^{*}}(\phi_{x_{i}},\psi_{a_{i}})-\gamma\frac{\pi(a_{i}^{\prime}|x_{i}^{\prime})}{\pi_{B}(a_{i}^{\prime}|x_{i}^{\prime})}\nu_{\rho^{*}}(\phi_{x_{i}^{\prime}},\psi_{a_{i}^{\prime}})\Big).
$$
Notice that this estimator assumes knowledge of the behavior (data-generating) policy $\pi_{B}$ (the supervised DM in our setting). However, since the supervised DM is trained to optimize a myopic reward, it can be overly deterministic. This can drive large fluctuations in the propensity scores $\pi/\pi_{B}$, and high variance in $J_{\mathrm{DD}}$. Instead, we use a behavior-agnostic form of DualDICE which requires no estimation of $\pi_{B}$ (see [28] for details).

给定一批训练数据 $B=\{(\phi_{x_{i}},\psi_{a_{i}},r_{i},\phi_{x_{i}^{\prime}})\}_{i}$(嵌入的对话历史)和一个 DM 策略 $\pi$,DualDICE 学习一个前馈 DNN $\nu_{\rho}:\mathbb{R}^{d}\times\mathbb{R}^{h}\to\mathbb{R}$,由 $\rho$ 参数化,其中 $\nu_{\rho}(\phi_{x_{i}},\psi_{a_{i}})$ 是一个原型值函数,其 Bellman 残差是所需稳态分布比的估计 [28]。给定训练好的 $\nu_{\rho^{*}}$,基于 RL 的 DM 策略 $\pi$ 的值可以通过
$$
J_{\mathrm{DD}}(\pi):=\sum_{i=1}^{|B|}r_{i}\cdot\Big(\nu_{\rho^{*}}(\phi_{x_{i}},\psi_{a_{i}})-\gamma\frac{\pi(a_{i}^{\prime}|x_{i}^{\prime})}{\pi_{B}(a_{i}^{\prime}|x_{i}^{\prime})}\nu_{\rho^{*}}(\phi_{x_{i}^{\prime}},\psi_{a_{i}^{\prime}})\Big)
$$
来估计。请注意,这个估计器假设已知行为(数据生成)策略 $\pi_{B}$(在我们的设置中是监督 DM)。然而,由于监督 DM 被训练为优化短视奖励,它可能过于确定性。这可能导致倾向得分 $\pi/\pi_{B}$ 的大幅波动,以及 $J_{\mathrm{DD}}$ 的高方差。相反,我们使用一种行为无关的 DualDICE 形式,不需要估计 $\pi_{B}$(详见 [28])。
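The resulting value estimate has the familiar importance-reweighting form; the sketch below assumes the stationary-distribution correction ratios are already given, whereas learning them (via $\nu_{\rho}$) is precisely what DualDICE does:

```python
import numpy as np

# Stationary-distribution correction ratios w ~ d^pi / d^pi_B for four logged
# (state, action) pairs; in DualDICE these come from the learned nu_rho, here
# they are hypothetical numbers for illustration only.
ratios = np.array([1.2, 0.8, 0.5, 1.5])
rewards = np.array([1.0, 0.0, 2.0, 1.0])  # logged per-turn rewards (made up)

j_hat = float(np.mean(ratios * rewards))  # off-policy estimate of the policy value
```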

Table 2: Off-policy and on-policy rater evaluation results.


表 2: 离策略和同策略评分者的评估结果

| 模型类型 | 模型 | 同策略 | 离策略 |
| --- | --- | --- | --- |
| 监督学习 | RNN | 5.75 ± 1.99 | 5.28 ± 0.44 |
| 监督学习 | Transformer | 5.67 ± 2.18 | 4.55 ± 0.37 |
| 随机动作 | RNN | 5.17 ± 2.29 | 5.13 ± 0.52 |
| 随机动作 | Reg-RNN | 6.48 ± 1.13 | 5.51 ± 0.39 |
| 随机动作 | Transformer | 5.71 ± 2.26 | 4.76 ± 0.40 |
| 随机动作 | Reg-Transformer | 5.58 ± 2.02 | 4.73 ± 0.45 |
| 连续动作 | RNN | 6.04 ± 1.63 | 5.49 ± 0.41 |
| 连续动作 | Transformer | 5.86 ± 1.79 | 4.91 ± 0.38 |
| 连续动作 | Reg-Transformer | 5.46 ± 2.25 | 4.78 ± 0.46 |
| E2E | Reg-Transformer | 6.53 ± 1.44 | -- |

The off-policy evaluation results generated by DualDICE are presented in Table 2 (second column). Since our use of DualDICE depends on language encoders $\phi$ and $\psi$, it cannot be used to evaluate the E2E model. Note that our off-policy value estimates of the RNN-based and the transformer-based models are generated using an RNN-based and a transformer-based behavior policy, respectively. DualDICE off-policy results show that SAQL-Reg-RNN and CAQL-RNN are among the best-performing RL-based policies (this coincides with on-policy evaluation results, see below). The offline performance of RNN-based models is consistently better than that of transformer-based models. This too is somewhat corroborated by on-policy evaluation, though this performance difference may be due in part to proto-function approximation error in DualDICE caused by the inherent bias of the transformer-based data.

DualDICE 生成的离策略评估结果如表 2(第二列)所示。由于我们使用 DualDICE 依赖于语言编码器 $\phi$ 和 $\psi$,因此它不能用于评估 E2E 模型。请注意,我们对基于 RNN 和基于 Transformer 的模型的离策略值估计是分别使用基于 RNN 和基于 Transformer 的行为策略生成的。DualDICE 的离策略结果表明,SAQL-Reg-RNN 和 CAQL-RNN 是表现最好的基于 RL 的策略之一(这与在策略评估结果一致,见下文)。基于 RNN 的模型的离线性能始终优于基于 Transformer 的模型。这在某种程度上也得到了在策略评估的证实,尽管这种性能差异可能部分是由于基于 Transformer 的数据的固有偏差导致的 DualDICE 中的原型函数近似误差。

DM On-policy Evaluation. We next conducted on-policy evaluation. Human evaluators were asked to conduct dialogues with our bot and rate the overall conversation experience on the same -3 to 7 scale used to collect training data. Evaluation was blind—evaluators did not know which model they were conversing with. Overall 200 dialogues for each model were rated. The results are presented in Table 2 (first column). SAQL-Reg-E2E and SAQL-Reg-RNN received the highest rating while SAQL-RNN performed worst. Notice that the models rated best by evaluators are trained with lower discount factors. We conjecture raters may be inherently biased to value myopic quality. We also note the high-variance in rater evaluations across all models (a point discussed further below).

DM 在线策略评估。接下来我们进行了在线策略评估。人类评估者被要求与我们的机器人进行对话,并使用与收集训练数据相同的 -3 到 7 的评分标准对整体对话体验进行评分。评估是盲测的——评估者不知道他们正在与哪个模型对话。每个模型总共评估了 200 个对话。结果如表 2(第一列)所示。SAQL-Reg-E2E 和 SAQL-Reg-RNN 获得了最高评分,而 SAQL-RNN 表现最差。请注意,评估者评分最高的模型是使用较低的折扣因子训练的。我们推测评估者可能天生偏向于重视短视的质量。我们还注意到所有模型的评估者评分存在高方差(这一点将在下面进一步讨论)。

6 LIVE EXPERIMENT

6 实时实验

Evaluation by human raters facilitates policy assessment in controlled settings and is necessary before deployment in a user-facing commercial product. However, dedicated human evaluators typically behave differently than real users. Specifically, the impact of conversation planning might be quite different with raters vs. real users. For example, raters might continue a conversation even after it reaches an awkward stage or they might not reflect a potential increase in user engagement after a successful focus change initiated by the bot. To gain an in-depth understanding of their impact on real users, we conducted a live experiment with our RL models. The Q-learning model that achieved the largest improvement in terms of user engagement was then fully deployed in the Google Assistant.

通过人类评估员进行评估有助于在受控环境中进行策略评估,并且在面向用户的商业产品部署之前是必要的。然而,专职的人类评估员通常与真实用户的行为不同。具体来说,对话规划的影响在评估员和真实用户之间可能会有很大差异。例如,评估员可能会在对话达到尴尬阶段后继续对话,或者他们可能不会反映出由机器人成功发起焦点转换后用户参与度的潜在提升。为了深入了解它们对真实用户的影响,我们使用强化学习模型进行了实时实验。在用户参与度方面取得最大改进的Q-learning模型随后被完全部署到Google Assistant中。

6.1 Experimental Setup

6.1 实验设置

To conduct a live experiment, we build on the dynamic composition bot from [38] (Sec. 3). This bot is integrated with the Google Assistant, dubbed the assistant below, and interacts with users in a real-time online setting.

为了进行实时实验,我们基于[38](第3节)中的动态组合机器人进行构建。该机器人与Google Assistant集成,下文简称为助手,并在实时在线环境中与用户互动。

The experiment was conducted using an A/B testing protocol, in which a small percentage of assistant users were randomly sampled to interact with the bot using an RL-based DM while other users (same percentage) interact with the vanilla bot using a supervised DM. More precisely, the experiment was conducted with one control arm, with the transformer-based supervised model, and eight experiment arms with the architectures listed in Sec. 5. We use the supervised transformer model as a baseline as it was shown to outperform the supervised RNN in a previous live experiment.

实验采用 A/B 测试协议进行,其中随机抽取一小部分助手用户与基于强化学习 (RL) 的对话管理 (DM) 机器人进行交互,而其他用户(相同比例)则与使用监督式 DM 的普通机器人进行交互。更准确地说,实验包括一个对照组,使用基于 Transformer 的监督模型,以及八个实验组,其架构如第 5 节所列。我们使用监督式 Transformer 模型作为基线,因为在之前的实时实验中,它已被证明优于监督式 RNN。

Our experiment spanned the months of December 2021 and January 2022, during which user assignment to control/experiments remained constant. The experiment was transparent to the users, who could not distinguish between the different DMs. A conversation starts when a user triggers the experience by asking an animal related query (e.g., “how does a lion sound?”). Once initiated, a conversation with a user could end if the bot predicted that its response is not of sufficient quality (i.e., the DM score is too low), if the user issued a query outside of the animal domain (e.g., about the weather), or if the user issued a standard stop command. The last two options were handled by the assistant.

我们的实验跨越了2021年12月和2022年1月,在此期间用户被分配到控制组/实验组的情况保持不变。实验对用户是透明的,他们无法区分不同的对话模型 (DM)。当用户通过提出与动物相关的问题(例如,“狮子是怎么叫的?”)触发体验时,对话开始。一旦开始,如果机器人预测其响应质量不足(即 DM 分数太低),或者用户提出了动物领域之外的问题(例如,关于天气),或者用户发出了标准的停止命令,与用户的对话可能会结束。后两种情况由助手处理。

6.2 Evaluation Metrics

6.2 评估指标

We measured daily user interaction with the assistant in the animal domain in both the experiment and control arms. To assess user engagement, we use several surrogate metrics that are directly measurable in the interaction logs. We define a conversation to be the succession of user and bot turns, starting with a triggering user turn. The conversation length is the number of turns (combined user and bot turns) in a conversation. We consider followup feedback after each bot response, where followup refers to the next query, if any, after the bot response. Specifically, we distinguish:

我们在实验组和对照组中测量了用户每天在动物领域与助手的互动情况。为了评估用户参与度,我们使用了几个可以直接从互动日志中测量的替代指标。我们将对话定义为用户和机器人轮次的连续,从触发用户轮次开始。对话长度是指对话中的轮次数量(用户和机器人的轮次总和)。我们考虑了每次机器人响应后的后续反馈,其中后续反馈指的是机器人响应后的下一个查询(如果有的话)。具体来说,我们区分了以下几种情况:

• Cooperative responses to bot questions, such as “yes” in response to a question proposing additional content (e.g., “do you want to hear more?”) or “Tell me about lions” in response to a list selection question (e.g., “which animal do you want to hear about next?”).
• Non-cooperative responses to bot questions, such as “no” in response to a question proposing additional content (e.g., “do you want to learn about cheetahs?”).
• Explicit positive feedback, which captures followup user queries with explicit gratitude, e.g., “thank you” or “wonderful”.

  • 对机器人问题的合作性回应,例如在提议额外内容的问题(如“你想听更多吗?”)中回答“是”,或在列表选择问题(如“接下来你想听哪种动物?”)中回答“告诉我关于狮子的事”。
  • 对机器人问题的非合作性回应,例如在提议额外内容的问题(如“你想了解猎豹吗?”)中回答“否”。
  • 明确的正面反馈,即用户在后续查询中表达明确的感激之情,例如“谢谢”或“太棒了”。

Table 3: Mean relative change in metrics for each experiment arm vs. the control. Here, T stands for transformer; green changes are desirable, red changes less so (to varying degrees).

表 3: 实验与对照指标的相对变化均值。其中,T 代表 Transformer;绿色变化是期望的,红色变化则不太理想(程度不一)。

| 指标 | SAQL-RNN | SAQL-T | CAQL-RNN | CAQL-T | SAQL-Reg-E2E |
| --- | --- | --- | --- | --- | --- |
| 对话长度 | +30% | +23% | +14% | +18% | -0.7% |
| 合作性回应 | +8% | -6.8% | -5.8% | -4% | -8% |
| 非合作性回应 | +112% | +178% | +54% | +120% | +41% |
| 明确正面反馈 | +32% | +9.7% | -20% | +6.8% | -9% |
| 明确负面反馈 | -18% | +8.6% | +1% | -14% | +27% |

• Explicit negative feedback, reflecting followup user queries that contain negative feedback, such as “stop” or “shut up”.

• 显式负面反馈,反映包含负面反馈的后续用户查询,例如“stop”或“shut up”。

For the last two metrics, we use predefined lists of positive and negative feedback phrases collected from user logs.

对于最后两个指标,我们使用从用户日志中收集的预定义正面和负面反馈短语列表。
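A minimal sketch of this followup classification; the phrase lists and rules below are illustrative fragments, not the production lists mined from user logs:

```python
# Illustrative phrase lists; the production lists are collected from user logs.
POSITIVE = {"thank you", "wonderful"}
NEGATIVE = {"stop", "shut up"}
COOPERATIVE = {"yes", "sure"}

def classify_followup(query):
    q = query.lower().strip()
    if q in POSITIVE:
        return "explicit_positive"
    if q in NEGATIVE:
        return "explicit_negative"
    if q in COOPERATIVE or q.startswith("tell me about"):
        return "cooperative"
    if q == "no":
        return "non_cooperative"
    return "other"
```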

6.3 Main Results

6.3 主要结果

The average relative change in metrics across all experiments w.r.t. the control is shown in Table 3 (for CQL variants, see Table 4 in the appendix). Interestingly, these results differ from the rater online evaluations in Table 2, demonstrating the substantial distinction between the behaviors of raters and real users. This discrepancy may be, in part, due to the high variance in rater evaluations. Surprisingly, SAQL-Reg-E2E performs worst and is slightly outperformed by the supervised baseline. The E2E model behaves quite conservatively, similar to a supervised model, avoiding pivoting to other animals and changing the type of offered content (e.g., sounds, facts, quizzes). Such conservative behavior may be caused by the lower discount factor $\gamma=0.8$ used, making its expected trajectory horizon shorter. This might be preferred by raters who tend to evaluate bot responses more myopically; but at the same time, it provides a more boring, less engaging experience for users. The different SAQL models outperform their CAQL counterparts. A similar conclusion was drawn in [45], where discrete latent actions were deemed to be more suitable than continuous actions for dialogue agents.

所有实验中指标相对于对照组的平均相对变化如表 3 所示(CQL 变体见附录中的表 4)。有趣的是,这些结果与表 2 中的评分者在线评估结果不同,表明评分者和真实用户行为之间存在显著差异。这种差异可能部分是由于评分者评估的高方差所致。令人惊讶的是,SAQL-Reg-E2E 表现最差,略逊于监督基线。E2E 模型表现得相当保守,类似于监督模型,避免转向其他动物并改变提供的内容类型(例如,声音、事实、测验)。这种保守行为可能是由于使用了较低的折扣因子 $\gamma=0.8$,使其预期的轨迹范围更短。这可能更受评分者的青睐,他们倾向于更短视地评估机器人的响应;但与此同时,它为用户提供了更无聊、吸引力较低的体验。不同的 SAQL 模型优于其对应的 CAQL 模型。[45] 中得出了类似的结论,其中离散的潜在动作被认为比连续动作更适合对话智能体。

Overall, we find that SAQL-RNN performs best w.r.t. our main metrics, conducting longer, more engaging conversations. It increases conversation length by 32%, while also increasing user engagement as captured by multiple metrics. We see an increase of 8% in cooperative responses to bot questions. While there is also a large increase in non-cooperative responses (112%), this is expected, as the SAQL-RNN agent takes more risks by asking pivoting questions, generating many more occasions for non-cooperative user reactions. While the user may not be interested in the conversational direction proposed by the bot (e.g., pivoting to another animal), the user often continues engaging in a dialogue about animals. For example, in Fig. 1a, the user provides a non-cooperative answer in the 3rd turn. As a result, the bot modifies its plan and asks the user to choose the next conversation focus, to which the user responds positively. In addition, some followup user queries contain explicit positive or negative feedback. While an order of magnitude fewer than other followups, they offer a direct measure of user (dis)satisfaction. SAQL-RNN increases explicit positive feedback by 32% and reduces negative feedback by 18%.

总体而言,我们发现 SAQL-RNN 在我们的主要指标上表现最佳,能够进行更长、更具吸引力的对话。它将对话长度增加了 32%,同时通过多个指标捕捉到的用户参与度也有所提高。我们看到用户对机器人问题的合作性回应增加了 8%。尽管非合作性回应也有大幅增加 (112%),但这是预期的,因为 SAQL-RNN 智能体通过提出转向问题承担了更多风险,从而产生了更多用户非合作性反应的机会。虽然用户可能对机器人提出的对话方向不感兴趣(例如转向另一个动物),但用户通常会继续参与关于动物的对话。例如,在图 1a 中,用户在第三轮提供了一个非合作性回答。结果,机器人修改了其计划,并让用户选择下一个对话焦点,用户对此做出了积极的回应。此外,一些后续用户查询包含明确的正面或负面反馈。虽然数量级上比其他后续查询少,但它们提供了用户(不)满意度的直接衡量标准。SAQL-RNN 将明确的正面反馈增加了 32%,并将负面反馈减少了 18%。

Our CQL variants of the different models indeed behave more “conservatively,” closer to supervised model behavior. This translates into smaller changes in conversation length and user feedback metrics (see appendix). Interestingly, using the transformer (vs. RNN) state encoding does not improve SAQL performance, unlike in the supervised setting, where transformer-based candidate selection is superior: the RNN state representation seems sufficient for RL. For this reason, we focus our analysis on SAQL-RNN below.

我们不同模型的 CQL 变体确实表现得更加“保守”,更接近监督模型的行为。这体现在对话长度和用户反馈指标的较小变化上(见附录)。有趣的是,使用 Transformer(相对于 RNN)状态编码并没有提高 SAQL 的性能,这与监督设置不同,在监督设置中,基于 Transformer 的候选选择表现更优:RNN 状态表示似乎足以满足 RL 的需求。因此,我们在下文中将分析重点放在 SAQL-RNN 上。

6.4 Qualitative Analysis of the RL DM

6.4 RL DM 的定性分析

To improve user engagement while conducting longer conversations, SAQL-RNN uses several planning strategies. First, it ends 20% more turns in questions relative to the control, prompting the user to choose additional content (e.g., learn more animal facts, hear another animal sound). While we observe an increase in cooperative responses, the cooperation rate to bot questions drops by 9.5%. Although this may seem problematic, this is actually a result of a favorable policy learned by our bot: by taking more risks in eliciting a user’s preference for the next steps, SAQL-RNN achieves an overall improved user experience, as measured via increased conversation length, combined with a noticeable increase in explicit positive feedback and a decrease in negative feedback.

为了提高用户在进行较长对话时的参与度,SAQL-RNN 采用了多种规划策略。首先,它比对照组多出 20% 的回合以问题结束,提示用户选择更多内容(例如,了解更多动物知识,听另一种动物的声音)。虽然我们观察到合作回应的增加,但对机器人问题的合作率下降了 9.5%。尽管这看似有问题,但这实际上是我们的机器人学习到的一种有利策略的结果:通过在引导用户对下一步的偏好时承担更多风险,SAQL-RNN 实现了整体用户体验的提升,这通过对话长度的增加、明显增加的明确正面反馈以及负面反馈的减少来衡量。

A second planning strategy is to better exploit content diversity, including facts, sounds, quizzes, yes/no questions, open questions, etc. On average, SAQL-RNN uses 26% more unique providers per conversation than the supervised transformer-based model.

第二种规划策略是更好地利用内容的多样性,包括事实、声音、测验、是/否问题、开放性问题等。平均而言,SAQL-RNN 在每次对话中使用的独特提供者比基于监督的 Transformer 模型多 26%。

Two additional planning strategies are related to the existence of two sub-dialogues with different characteristics. Dialogues around animal sounds are poorer in content and exhibit entity pivoting at every turn (after playing the sound of a given animal, we can either suggest the sound of a different animal or quiz the user about other animal sounds). In contrast, dialogues around animal facts typically contain richer content and a greater conversation depth. We observe that SAQL-RNN favors the richer experience of the latter, selecting 31% more fact-related content.

两种额外的规划策略与存在两种不同特征的子对话相关。围绕动物声音的对话内容较为贫乏,并且在每一轮对话中都会出现实体转换(在播放某种动物的声音后,我们可以建议播放另一种动物的声音,或者向用户提问其他动物的声音)。相比之下,围绕动物事实的对话通常包含更丰富的内容和更深的对话层次。我们观察到 SAQL-RNN 更倾向于后者的丰富体验,多选择了 31% 的事实相关内容。

Lastly, we observe that the average conversation breadth of dialogues conducted by SAQL-RNN is lower (it generates 13% fewer focus-pivoting turns). This is a consequence of fact dialogues having less breadth. However, when restricting analysis to fact dialogues, SAQL-RNN exhibits 60% more focus-pivoting turns.

最后,我们观察到 SAQL-RNN 所进行对话的平均广度较低(其焦点转移轮次减少了 13%)。这是事实类对话广度较小的结果。然而,当将分析限制在事实类对话时,SAQL-RNN 的焦点转移轮次多出 60%。
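Counting focus-pivoting turns reduces to comparing each turn's focus entity with the previous one. A minimal sketch, with a hypothetical per-turn focus annotation:

```python
def focus_pivot_count(focus_per_turn):
    """Number of focus-pivoting turns: turns whose focus entity differs
    from the previous turn's."""
    return sum(prev != cur for prev, cur in zip(focus_per_turn, focus_per_turn[1:]))

# Hypothetical focus entities per turn in one dialogue.
print(focus_pivot_count(["lion", "lion", "tiger", "tiger", "zebra"]))  # 2
```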

Some of these strategies are exemplified by the sample conversation in Fig. 1a, generated by the SAQL-RNN model, which we contrast with Fig. 1b, conducted by the supervised transformer. Both conversations start with the same two turns. In the third turn, after a non-cooperative user response, the transformer pivots back to sounds to maximize "immediate" user interest. By contrast, the RL model tries to pivot to facts for a richer conversational experience, suggesting that the user choose the next animal. We also observe that the RL conversation includes more types of content, such as sounds, facts, quizzes, yes/no and open questions.

这些策略中的一些在图 1a 的示例对话中得到了体现,该对话由 SAQL-RNN 模型生成,我们将其与图 1b 中由监督式 Transformer 生成的对话进行对比。两个对话都以相同的两轮对话开始。在第三轮对话中,当用户给出不合作的回应后,Transformer 转向声音内容以最大化“即时”用户兴趣。相比之下,RL 模型则尝试转向事实内容以提供更丰富的对话体验,建议用户选择下一个动物。我们还观察到,RL 对话包含了更多类型的内容,例如声音、事实、测验、是非问题和开放性问题。

7 CONCLUSION

7 结论

In this work we tackled the formidable task of building a rich, open-ended conversational bot that is deployed in the challenging setting of a real-time, global commercial assistant. Our approach relies on the framework of reinforcement learning, using a novel state representation based on the succinct embedding of a supervised language model and an RL algorithm that allows for a dynamic action space at each stage of the conversation. Ours is one of the few examples of RL-based conversational systems deployed in the wild at scale, and the substantial advantages demonstrated over the SOTA supervised model validate the decades-long premise that the dynamic planning ability of RL is a natural fit for the design of rich dialogue agents.

在本工作中,我们致力于构建一个丰富的、开放式的对话机器人,并将其部署在实时、全球商业助手的挑战性环境中。我们的方法依赖于强化学习框架,使用了一种基于监督语言模型简洁嵌入的新型状态表示,以及一种允许在对话的每个阶段动态调整动作空间的强化学习算法。我们的系统是少数几个大规模部署在真实环境中的基于强化学习的对话系统之一,其相对于最先进的监督模型所展现出的显著优势,验证了数十年来强化学习的动态规划能力是设计丰富对话智能体的自然选择这一前提。

An interesting insight from our live experiment highlights the power of RL to take counter-intuitive actions: an increase in noncooperative responses, a seemingly negative phenomenon, is simply a tool with which the agent may elicit a user’s preference for the next phase of the conversation. This leads to a positive conversational experience on average, with a measurable increase both in conversation length and positive feedback. We hope to discover other dialogue strategies that drive “great” conversations as we shift to learning models directly from rich user signals.

我们的实时实验揭示了一个有趣的见解,展示了强化学习(RL)采取反直觉行动的力量:非合作回应的增加,虽然看似是负面现象,但实际上只是智能体用来引出用户在对话下一阶段偏好的工具。这平均带来了积极的对话体验,对话长度和正面反馈均有可衡量的增加。随着我们转向直接从丰富的用户信号中学习模型,我们希望发现其他能够推动“优秀”对话的策略。

REFERENCES

参考文献

A APPENDIX

A 附录

A.1 DM Training Hyper-Parameters

A.1 DM 训练超参数

Our supervised and RL models were trained with the following hyper-parameters.

我们的监督学习和强化学习模型使用了以下超参数进行训练。

The supervised RNN model is trained with a learning rate of 0.0001, a batch size of 16, a dropout probability of 0.2, and 200K training steps. Its architecture includes a GRU layer with 200 units.

监督式RNN模型以0.0001的学习率、16的批量大小、0.2的dropout概率和20万次训练步数进行训练。其架构包括一个200个单元的GRU层。
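The single architectural detail given above, a GRU layer with 200 units, can be illustrated with a minimal NumPy forward pass. This is a sketch only: the input dimension, weight shapes, and initialization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 200   # GRU units, as stated in the paper
X_DIM = 32     # hypothetical input embedding size

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h @ Uz)                 # update gate
    r = sigmoid(x @ Wr + h @ Ur)                 # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)     # candidate hidden state
    return (1 - z) * h + z * h_tilde

# Illustrative random weights: (Wz, Uz, Wr, Ur, Wh, Uh).
params = [rng.normal(scale=0.1, size=s)
          for s in [(X_DIM, HIDDEN), (HIDDEN, HIDDEN)] * 3]
h = np.zeros(HIDDEN)
x = rng.normal(size=X_DIM)
h = gru_step(x, h, *params)
print(h.shape)  # (200,)
```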

For the supervised transformer model, we use the BERT-Medium checkpoint⁸ with an uncased vocabulary, hidden dimension $H=512$, $L=8$ transformer layers, and $A=8$ attention heads per layer. This model was trained for 20,000 steps with a global batch size of 768 divided among 8 TPUv3 chips, using the Adam optimizer with an initial learning rate of $\epsilon=5\cdot10^{-5}$ (decayed to zero), $\beta_1=0.9$, $\beta_2=0.999$, and $\epsilon=10^{-6}$.

对于监督式 Transformer 模型,我们使用了 BERT-Medium 检查点⁸,其词汇表为小写形式,隐藏维度为 $H=512$,包含 $L=8$ 个 Transformer 层,每层有 $A=8$ 个注意力头。该模型在 8 个 TPUv3 芯片上训练了 20000 步,全局批量大小为 768,使用 Adam 优化器,初始学习率为 $\epsilon=5\cdot10^{-5}$(衰减至零),$\beta_1=0.9$,$\beta_2=0.999$,以及 $\epsilon=10^{-6}$。
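The paper only says the learning rate is "decayed to zero"; assuming the common linear schedule used in BERT training, the rate at a given step could be computed as follows (the linear shape is our assumption, not stated in the paper):

```python
def linear_decay(step, total_steps=20_000, init_lr=5e-5):
    """Learning rate linearly decayed from init_lr at step 0 to zero
    at total_steps (assumed linear schedule)."""
    return init_lr * max(0.0, 1.0 - step / total_steps)

print(linear_decay(0))       # 5e-05
print(linear_decay(20_000))  # 0.0
```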

For the RL models, SAQL-RNN, CAQL-RNN, SAQL-Reg-RNN, and CAQL-Reg-RNN use a fully connected feed-forward network for $Q$-function approximation. These networks are composed of 3 layers, each with 1024 ReLU units. The SAQL-RNN network is trained using a learning rate $\epsilon=7\cdot10^{-5}$ and $k=2M$ steps. Warm-started with the SAQL-RNN weights, the SAQL-Reg-RNN network is trained with $\epsilon=5\cdot10^{-6}$ and $k=2.4M$. The CAQL-RNN network is trained with $\epsilon=5\cdot10^{-5}$ and $k=2M$, where the inner maximization problem uses a GA learning rate of $\epsilon_{\mathrm{GA}}=1\cdot10^{-6}$ and runs for a maximum of $k_{\mathrm{GA}}=25$ steps. Warm-started with the CAQL-RNN weights, the CAQL-Reg-RNN network is trained with $\epsilon=3\cdot10^{-6}$ and $k=2M$. SAQL-Transformer, SAQL-Reg-Transformer, CAQL-Transformer, and CAQL-Reg-Transformer follow almost the same settings as SAQL-RNN, SAQL-Reg-RNN, CAQL-RNN, and CAQL-Reg-RNN, but are trained with $\epsilon=3\cdot10^{-4}$ and $k=4M$, $\epsilon=1\cdot10^{-5}$ and $k=2.5M$, $\epsilon=3\cdot10^{-5}$ and $k=3M$, and $\epsilon=1\cdot10^{-5}$ and $k=3M$, respectively. The models are trained with batch size $|B|=32$, and in all two-step CQL-regularized models the regularization coefficient $\alpha$ is $0.1$.⁹ SAQL-Reg-E2E was trained with $k=320{,}000$, $\epsilon=5\cdot10^{-5}$, $|B|=48$, $\gamma=0.8$ and $\alpha=0.01$. All these hyper-parameters are chosen from the best settings of their corresponding grid-search optimization.

对于强化学习模型,SAQL-RNN、CAQL-RNN、SAQL-Reg-RNN 和 CAQL-Reg-RNN 使用全连接前馈网络进行 $Q$ 函数近似。这些网络由 3 层组成,每层包含 1024 个 ReLU 单元。SAQL-RNN 网络使用学习率 $\epsilon=7\cdot10^{-5}$ 和 $k=2M$ 步进行训练。以 SAQL-RNN 权重为初始值,SAQL-Reg-RNN 网络使用 $\epsilon=5\cdot10^{-6}$ 和 $k=2.4M$ 进行训练。CAQL-RNN 网络使用 $\epsilon=5\cdot10^{-5}$ 和 $k=2M$ 进行训练,其中内部最大化问题使用 GA 学习率 $\epsilon_{\mathrm{GA}}=1\cdot10^{-6}$,并最多运行 $k_{\mathrm{GA}}=25$ 步。以 CAQL-RNN 权重为初始值,CAQL-Reg-RNN 网络使用 $\epsilon=3\cdot10^{-6}$ 和 $k=2M$ 进行训练。SAQL-Transformer、SAQL-Reg-Transformer、CAQL-Transformer 和 CAQL-Reg-Transformer 几乎遵循与 SAQL-RNN、SAQL-Reg-RNN、CAQL-RNN 和 CAQL-Reg-RNN 相同的设置,但分别使用 $\epsilon=3\cdot10^{-4}$ 和 $k=4M$、$\epsilon=1\cdot10^{-5}$ 和 $k=2.5M$、$\epsilon=3\cdot10^{-5}$ 和 $k=3M$、$\epsilon=1\cdot10^{-5}$ 和 $k=3M$ 进行训练。模型使用批量大小 $|B|=32$ 进行训练,在所有两步 CQL 正则化模型中,正则化系数 $\alpha$ 为 $0.1$。⁹ SAQL-Reg-E2E 使用 $k=320{,}000$、$\epsilon=5\cdot10^{-5}$、$|B|=48$、$\gamma=0.8$ 和 $\alpha=0.01$ 进行训练。所有这些超参数都是从其对应的网格搜索优化中选择的最佳设置。
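The $Q$-function architecture described above (3 fully connected layers of 1024 ReLU units) can be sketched with a NumPy forward pass. The input dimension, initialization, and scalar output head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def q_network(state_action, weights, biases):
    """Feed-forward Q-function approximator: hidden layers use ReLU,
    the final layer outputs a scalar Q-value."""
    h = state_action
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]

IN_DIM = 256  # hypothetical state-action embedding size
dims = [IN_DIM, 1024, 1024, 1024, 1]  # 3 hidden layers of 1024 units
weights = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(dims, dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

q = q_network(rng.normal(size=IN_DIM), weights, biases)
print(q.shape)  # (1,)
```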

A.2 Full Live Experiment Results

A.2 完整实时实验结果

The average relative change in the live-experiment metrics of each experiment arm w.r.t. the control is shown in Table 4 for all models, including the CQL variants.

所有模型(包括CQL变体)的实验指标相对于对照组的平均相对变化如表4所示。

Table 4: Mean relative change of experiments vs. the control metrics. Here, T stands for transformer.

表 4: 实验与对照指标的相对变化均值。其中,T 代表 Transformer。

| 指标 | SAQL-RNN | SAQL-Reg-RNN | SAQL-T | SAQL-Reg-T | CAQL-RNN | CAQL-T | CAQL-Reg-T | SAQL-Reg-E2E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 对话长度 | +30% | +3.6% | +23% | +19% | +14% | +18% | +8.9% | -0.7% |
| 合作性回应 | +8% | -20% | -6.8% | -0.3% | -5.8% | -4% | -4.3% | -8% |
| 非合作性回应 | +112% | +75% | +178% | +125% | +54% | +120% | +130% | +41% |
| 明确正面反馈 | +32% | +42% | +9.7% | +0.4% | -20% | +6.8% | +5.8% | -9% |
| 明确负面反馈 | -18% | -7% | +8.6% | -7.7% | +1% | -14% | +15% | +27% |
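Each table cell is a mean relative change of an experiment metric against the control, which could be computed as in this small sketch (function name and sample values are illustrative):

```python
def relative_change(experiment, control):
    """Relative change of an experiment metric vs. the control, in percent."""
    return 100.0 * (experiment - control) / control

# E.g., a conversation-length statistic 30% above the control's:
print(relative_change(1.3, 1.0))
```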