[论文翻译]LaMDA: 用于对话应用的大语言模型 (Large Language Model)


原文地址:https://arxiv.org/pdf/2201.08239


LaMDA: Language Models for Dialog Applications

LaMDA: 用于对话应用的大语言模型 (Large Language Model)

Abstract

摘要

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowd worker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.

我们介绍了 LaMDA:用于对话应用的语言模型。LaMDA 是一系列基于 Transformer 的神经语言模型,专门用于对话,具有多达 137B 参数,并在总计 1.56T 词的公开对话数据和网页文本上进行了预训练。虽然仅靠模型扩展可以提高质量,但在安全性和事实依据方面改进较少。我们证明了使用标注数据进行微调并使模型能够咨询外部知识源可以显著改善这两个关键挑战:安全性和事实依据。第一个挑战,安全性,涉及确保模型的响应与一组人类价值观一致,例如防止有害建议和不公平偏见。我们使用基于一组示例性人类价值观的度量来量化安全性,并发现使用少量众包工人标注的数据微调的 LaMDA 分类器过滤候选响应是一种有前途的改进模型安全性的方法。第二个挑战,事实依据,涉及使模型能够咨询外部知识源,如信息检索系统、语言翻译器和计算器。我们使用有根据性度量来量化事实性,并发现我们的方法使模型能够生成基于已知来源的响应,而不仅仅是听起来合理的响应。最后,我们探讨了 LaMDA 在教育和内容推荐领域的应用,并分析了其有用性和角色一致性。


Figure 1: Impact of model pre-training alone vs. with fine-tuning in LaMDA on dialog quality (left), and safety and factual grounding (right). The quality metric (SSI) corresponds to sensibleness, specificity, and interestingness. See Section 4 for more details on these metrics.


图 1: 仅模型预训练与在 LaMDA 中微调对对话质量(左)以及安全性和事实依据(右)的影响。质量指标 (SSI) 对应于合理性、具体性和趣味性。详见第 4 节以了解更多关于这些指标的详细信息。

1 Introduction

1 引言

Language model pre-training is an increasingly promising research approach in NLP [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. As pre-training uses unlabeled text, it can be combined with scaling model and dataset sizes to achieve better performance or new capabilities [13]. For example, GPT-3 [12], a 175B parameter model trained on a large corpus of unlabeled text, shows an impressive ability in few-shot learning thanks to scaling.

语言模型预训练是自然语言处理 (NLP) 中越来越有前景的研究方法 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]。由于预训练使用未标注文本,因此可以结合扩展模型和数据集规模以实现更好的性能或新的能力 [13]。例如,GPT-3 [12] 是一个拥有 175B 参数的模型,在大量未标注文本语料库上进行训练,展示了在少样本学习方面的出色能力,这得益于规模扩展。

Dialog models [14, 15, 16], one of the most interesting applications of large language models, successfully take advantage of Transformers’ ability to represent long-term dependencies in text [17, 18]. Similar to general language models [13], Adiwardana et al. [17] show that dialog models are also well suited to model scaling. There is a strong correlation between model size and dialog quality.

对话模型 [14, 15, 16],作为大语言模型最有趣的应用之一,成功利用了 Transformer 表示文本中长期依赖关系的能力 [17, 18]。类似于通用语言模型 [13],Adiwardana 等人 [17] 表明对话模型也非常适合模型扩展。模型规模与对话质量之间存在很强的相关性。

Inspired by these successes, we train LaMDA, a family of Transformer-based neural language models designed for dialog. These models’ sizes range from 2B to 137B parameters, and they are pre-trained on a dataset of 1.56T words from public dialog data and other public web documents (Section 3). LaMDA makes use of a single model to perform multiple tasks: it generates potential responses, which are then filtered for safety, grounded on an external knowledge source, and re-ranked to find the highest-quality response.

受到这些成功的启发,我们训练了 LaMDA,一系列基于 Transformer 的神经语言模型,专为对话设计。这些模型的参数量从 2B 到 137B 不等,并在包含 1.56T 个词的公共对话数据和其他公共网页文档的数据集上进行了预训练(第 3 节)。LaMDA 使用单个模型执行多个任务:它生成潜在的响应,然后对这些响应进行安全过滤,基于外部知识源进行验证,并重新排序以找到质量最高的响应。

We study the benefits of model scaling with LaMDA on our three key metrics: quality, safety, and groundedness (Section 4). We observe that: (a) model scaling alone improves quality, but its improvements on safety and groundedness are far behind human performance, and (b) combining scaling and fine-tuning improves LaMDA significantly on all metrics, and although the model’s performance remains below human levels in safety and groundedness, the quality gap to measured crowd worker levels can be narrowed (labeled ‘Human’ in Figure 1).

我们研究了模型扩展对我们的三个关键指标:质量、安全性和有根据性 (groundedness) (第 4 节) 的好处。我们观察到:(a) 仅模型扩展就能提高质量,但其对安全性和有根据性的改进远低于人类表现;(b) 结合扩展和微调可以显著改善 LaMDA 在所有指标上的表现,尽管模型在安全性和有根据性方面的表现仍低于人类水平,但可以缩小与测量的众包工作者水平之间的质量差距(图 1 中标记为 ‘Human’)。

The first metric, quality, is based on three components: sensibleness, specificity, and interestingness (Section 4). We collect annotated data that describes how sensible, specific, and interesting a response is for a multiturn context. We then use these annotations to fine-tune a discriminator to re-rank candidate responses.

第一个指标,质量,基于三个组成部分:合理性、具体性和趣味性(第 4 节)。我们收集标注数据,描述在多轮对话上下文中,一个回复的合理性、具体性和趣味性。然后我们使用这些标注来微调一个判别器,以对候选回复重新排序。

The second metric, safety, is introduced to reduce the number of unsafe responses that the model generates. To achieve this, we define an illustrative set of safety objectives that attempt to capture the behavior that the model should exhibit in a dialog (Appendix A.1), and we use a demographically diverse set of crowd workers to label responses in multiturn dialogs for these objectives (Appendix A.2, A.3). We then use these labels to fine-tune a discriminator to detect and remove unsafe responses (Section 6.1). Our work on safety for LaMDA can be understood as a process for AI value alignment, at a high level.

第二个指标,安全性,旨在减少模型生成的不安全响应的数量。为实现这一目标,我们定义了一组示例性的安全目标,试图捕捉模型在对话中应表现出的行为(附录 A.1),并使用人口统计学多样化的众包工作者针对这些目标对多轮对话中的响应进行标注(附录 A.2, A.3)。然后,我们使用这些标注来微调一个判别器以检测和移除不安全的响应(第 6.1 节)。从高层次上讲,我们在 LaMDA 安全性方面的工作可以被理解为一个 AI 价值对齐的过程。

The third metric, groundedness, is introduced for the model to produce responses that are grounded in known sources wherever they contain verifiable external world information. Due to neural language models such as LaMDA’s capacity to generalize rather than just memorize, they tend to generate responses that may seem plausible, but actually contradict factual statements made in established sources. We use this metric for the model to avoid this tendency. While grounding in known sources does not guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source and its faithful reproduction. We find that augmenting model outputs with the ability to use external tools, such as an information retrieval system, is a promising approach to achieve this goal. Therefore, we collect data from a setting where crowd workers can use external tools to research factual claims, and train the model to mimic their behavior.

第三项指标,即有根据性 (grounded ness),是为了让模型在包含可验证的外部世界信息时,生成基于已知来源的回答。由于神经语言模型(如 LaMDA)具有泛化能力而非仅仅是记忆能力,它们倾向于生成看似合理但实际上与权威来源中的事实陈述相矛盾的回答。我们使用这一指标来避免这种倾向。虽然基于已知来源并不保证事实准确性,但它允许用户或外部系统根据来源的可靠性和其忠实再现来判断回答的有效性。我们发现,通过增强模型输出以使用外部工具(如信息检索系统)的能力,是实现这一目标的有前途的方法。因此,我们从一个允许众包工作者使用外部工具研究事实主张的环境中收集数据,并训练模型模仿他们的行为。

Finally, we explore the use of LaMDA in the domains of education and content recommendations to investigate its potential and shortcomings. Similar to the concept of prompts in GPT-3 [12], we precondition LaMDA on a few turns of application-specific dialog to adapt LaMDA to the target applications. We perform experiments to compare the application-specific helpfulness (i.e., useful and correct responses) and role consistency (i.e., agent utterances match agent role) of pre-training-only and fine-tuned LaMDA models subject to application-specific preconditioning. We find that both types of models can adapt to their expected application roles fairly well, but fine-tuned LaMDA models are significantly more helpful.

最后,我们探讨了 LaMDA 在教育和内容推荐领域中的应用,以研究其潜力和不足。类似于 GPT-3 [12] 中的提示概念,我们对 LaMDA 进行了少量特定应用场景对话的预处理,以使 LaMDA 适应目标应用程序。我们进行了实验,比较了仅预训练和微调后的 LaMDA 模型在特定应用场景预处理下的应用特定帮助性(即有用且正确的响应)和角色一致性(即智能体的发言符合其角色)。我们发现,这两种类型的模型都能较好地适应预期的应用角色,但微调后的 LaMDA 模型显著更有帮助。

2 Related work

2 相关工作

Language models and dialog models: Language models have attracted much attention recently thanks to their successes in NLP applications (e.g., [19, 20, 21, 2, 1, 22, 23, 5, 12, 24]). Our study of scaling laws with respect to model sizes is inspired by recent work on the scaling laws of neural language models [12, 13]. Similar to their findings, our results show that model scaling improves our quality (sensibleness, specificity, and interestingness), safety and groundedness metrics to some extent. However, fine-tuning combined with scaling significantly improves performance on all metrics.

语言模型和对话模型:语言模型最近因其在自然语言处理 (NLP) 应用中的成功而受到广泛关注 (例如,[19, 20, 21, 2, 1, 22, 23, 5, 12, 24])。我们关于模型大小扩展规律的研究受到了近期关于神经语言模型扩展规律工作的启发 [12, 13]。与他们的发现类似,我们的结果显示,模型扩展在一定程度上提高了质量(合理性、具体性和趣味性)、安全性和有根据性指标。然而,微调结合扩展显著改进了所有指标的性能。

Our work is also closely related to recent successes in applying language models to dialog modeling (e.g., [25, 26, 17, 18]), which built on earlier research in neural dialog modeling (e.g., [14, 15, 16, 27, 28]). One of our fine-tuning stages requires training on dialog-only data, which is related to Wolf et al. [29], Dinan et al. [25] and Zhang et al. [30]. Our use of fine-tuning on crowd worker-annotated data to improve interestingness is comparable to Roller et al. [18]. However, we aim to maximize the interestingness of the model’s output distinctly from its ability to engage the user in further interaction.

我们的工作也与最近将语言模型应用于对话建模的成功密切相关 (例如 [25, 26, 17, 18]),这些研究建立在早期的神经对话建模研究基础之上 (例如 [14, 15, 16, 27, 28])。我们其中一个微调阶段需要使用仅包含对话的数据进行训练,这与 Wolf 等人 [29]、Dinan 等人 [25] 和 Zhang 等人 [30] 的工作相关。我们通过在众包工人标注的数据上微调来提高趣味性,这一做法与 Roller 等人 [18] 的研究相似。然而,我们的目标是将模型输出本身的趣味性最大化,并将其与模型吸引用户进一步互动的能力区分开来。

Our finding that pure scaling has a limited effect on key measures of open-domain dialog model performance echoes that of Shuster et al. [31], who also focus on the problem of groundedness. Recent studies on scaling have found that performance on question-answering tasks improves with model size [32, 33], similar to our findings on pre-trained LaMDA prior to fine-tuning.

我们的研究发现,单纯扩大规模对开放域对话模型性能的关键指标影响有限,这与 Shuster 等人 [31] 的结果相呼应,他们同样关注有根据性问题。最近关于扩展的研究发现,问答任务的性能随着模型规模的增大而提高 [32, 33],这与我们对微调前的预训练 LaMDA 的发现相似。

Our approach to improving model groundedness is broadly consistent with a growing literature on augmenting neural language models with retrieval systems. Most of the existing literature focuses on the problem of open-domain question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing approaches to this problem include the RNNLM [34], RAG [35], REALM [36], and FiD [37] architectures. Zhu et al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the ‘dense passage retriever’ used in RAG. Recent work in this direction has expanded and elaborated on neural models’ ability to retrieve and rank passages [40]. The RETRO architecture demonstrates that language models can be primed with results retrieved from a database as large as two trillion tokens [41]. At a broad level, our approach is also comparable to that of Byrne et al. [42], which fine-tunes the model to use external APIs for movie ticketing dialog.

我们改进模型有根据性的方法,与日益增多的关于用检索系统增强神经语言模型的文献大体一致。大多数现有文献专注于开放域问答问题,而不是对话生成,并且这些模型本身被用于索引和排序知识源,而不是被训练去使用一个中间工具。鉴于这些差异,我们注意到现有的解决这一问题的方法包括 RNNLM [34]、RAG [35]、REALM [36] 和 FiD [37] 架构。Zhu 等人 [38] 提供了进一步近期工作的综述。有关 RAG 中使用的“密集段落检索器”的详细信息,请参阅 Karpukhin 等人 [39]。最近在这方面的工作扩展并详细阐述了神经模型检索和排序段落的能力 [40]。RETRO 架构表明,语言模型可以以从多达两万亿 Token 的数据库中检索到的结果作为提示输入 [41]。在更广泛的层面上,我们的方法也与 Byrne 等人 [42] 的方法相似,后者对模型进行了微调,以使用外部 API 进行电影票务对话。

Parts of our findings are similar to recent studies on dialog groundedness. Granting access to external knowledge bases has been shown to reduce the rate at which models hallucinate unsourced statements in dialog across a variety of retrieval systems and model architectures [31]. Another study finds that a question-answering system’s accuracy is improved by separating it into a reasoning unit and a response generator, analogous to our separation of ‘Base’ and ‘Research’ models in our study [43]. Meanwhile, the WebGPT framework includes a language system that can interact with the open web via a text-only interface, and learns to imitate humans in answering questions by citing external sources [44]. Komeili et al. [45] compare different types of pre-trained models and retrieval methods, and reach a similar conclusion that augmenting language models with a search engine provides more factually grounded responses. They encode the input context with grounded information from search to generate the next response, while we augment the generated responses with information from known sources in our method. This allows us to fine-tune the model for groundedness without sacrificing gains in safety or quality from other fine-tuning treatments.

我们的一些发现与最近关于对话有根据性的研究相似。允许访问外部知识库已被证明可以降低模型在对话中生成无来源陈述的频率,这在各种检索系统和模型架构中都有体现 [31]。另一项研究表明,将问答系统分为推理单元和响应生成器可以提高其准确性,这类似于我们在研究中对‘Base’和‘Research’模型的分离 [43]。与此同时,WebGPT框架包含一个可以通过纯文本界面与开放网络交互的语言系统,并学习通过引用外部资源来模仿人类回答问题 [44]。Komeili等人 [45] 比较了不同类型的预训练模型和检索方法,并得出类似的结论:用搜索引擎增强语言模型可以提供更基于事实的响应。他们通过对输入上下文进行基于搜索的信息编码来生成下一个响应,而我们的方法则是通过已知来源的信息来增强生成的响应。这使我们能够在不牺牲其他微调处理带来的安全性和质量提升的情况下,针对有根据性对模型进行微调。

Dialog metrics: Defining effective metrics for dialog models remains an open research topic. Our approach is inspired by Adiwardana et al. [17], who argued for human-like metrics, such as sensibleness and specificity. Many automated metrics for dialog models have been studied, including perplexity [16, 17], F1, Hits@1/N [25], USR [46], or BLEU/ROUGE [47, 15, 27]. However, such automated metrics may not correlate well with human judgment [48]. More reliable metrics for dialog modeling require human evaluation [49, 50, 18, 25, 17, 51], as used in this paper.

对话模型评估指标:定义有效的对话模型评估指标仍然是一个开放的研究课题。我们的方法受到 Adiwardana 等人 [17] 的启发,他们主张使用类似人类的评估指标,如合理性和具体性。许多自动化的对话模型评估指标已经被研究,包括困惑度 [16, 17]、F1、Hits@1/N [25]、USR [46] 或 BLEU/ROUGE [47, 15, 27]。然而,这些自动化评估指标可能与人类判断的相关性不高 [48]。更可靠的对话模型评估指标需要依赖人工评估 [49, 50, 18, 25, 17, 51],正如本文所使用的。

Earlier research attempted to combine multifaceted evaluations of dialog quality into a single headline metric [52]. We follow the pattern established in Adiwardana et al. [17] and Roller et al. [18] by considering the different components of our evaluations separately. In addition to sensibleness and specificity per Adiwardana et al. [17], we add new metrics: interestingness, safety, and groundedness. An advantage of using several different metrics is their debuggability: by exploring responses with low safety or groundedness scores, we have been able to develop targeted methods to improve them.

早前的研究尝试将对话质量的多方面评估合并为一个单一的总体指标 [52]。我们遵循 Adiwardana 等人 [17] 和 Roller 等人 [18] 建立的模式,分别考虑评估的不同组成部分。除了 Adiwardana 等人 [17] 的合理性和具体性之外,我们还增加了新的指标:趣味性、安全性和有根据性。使用多个不同指标的一个优势是它们的可调试性:通过检查安全性或有根据性得分较低的响应,我们得以开发出有针对性的方法来改进这些指标。

Safety and safety of dialog models: Inappropriate and unsafe risks and behaviors of language models have been extensively discussed and studied in previous works (e.g., [53, 54]). Issues encountered include toxicity (e.g., [55, 56, 57]), bias (e.g., [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]), and inappropriately revealing personally identifying information (PII) from training data [73]. Weidinger et al. [54] identify 21 risks associated with large-scale language models and discuss the points of origin for these risks. While many mitigation strategies have also been suggested (e.g., [74, 75, 76, 77, 78, 79, 80, 81, 82]), meaningfully addressing these issues remains an active research area.

对话模型的安全性:语言模型的不适当和不安全风险及行为已在先前的研究中得到了广泛讨论和研究(例如,[53, 54])。遇到的问题包括毒性(例如,[55, 56, 57])、偏差(例如,[58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72])以及从训练数据中不恰当地泄露个人身份信息 (PII) [73]。Weidinger 等人 [54] 识别出与大语言模型相关的 21 种风险,并讨论了这些风险的起源点。尽管已经提出了许多缓解策略(例如,[74, 75, 76, 77, 78, 79, 80, 81, 82]),但有意义地解决这些问题仍然是一个活跃的研究领域。

Similar issues have also been discussed specifically for dialog models [53]. For instance, examples of bias, offensiveness, and hate speech have been found both in training data drawn from social media, and consequently in the output of dialog models trained on such data [83]. Dialog models [84] can learn, and even amplify, biases in the training data. Echoing Gehman et al. [85], we find fine-tuning effective to augment language models for safety. The method we use in this paper follows previous attempts to tackle these issues by training separate layers to detect unsafe output [17, 86, 18, 79]. Our strategy is similar to recent work that also uses fine-tuning [87]. While their safety guidelines were derived from human rights principles, they similarly find that increasing scale has no impact on toxicity metrics, while fine-tuning on safety evaluations does.

类似的问题也已在对话模型中进行了专门讨论 [53]。例如,在从社交媒体抽取的训练数据以及由此训练出的对话模型的输出中,均发现了偏见、冒犯性和仇恨言论的例子 [83]。对话模型 [84] 可以学习,甚至放大训练数据中的偏见。与 Gehman 等人 [85] 的观点一致,我们发现微调对于增强语言模型的安全性是有效的。本文中我们使用的方法遵循了先前尝试通过训练独立层来检测不安全输出的做法 [17, 86, 18, 79]。我们的策略类似于最近的工作,这些工作也使用了微调 [87]。尽管他们的安全指南是从人权原则推导出来的,但同样发现增加规模对毒性指标没有影响,而针对安全性评估的微调则有作用。

Groundedness metrics: Similar to other recent research into groundedness cited above, we assess groundedness by asking crowd workers to judge whether the model’s output is in accordance with authoritative external sources. The recently-proposed Attributable to Identified Sources (AIS) framework [88] articulates a more precise approach to assess output of language models that pertains to the external world. It splits evaluation into two stages, where crowd workers are asked: (1) if they can understand and identify the information shared in a dialog turn, and (2) if all of this information can be attributed to a source. Meanwhile, a recent study has reopened the question of automatic evaluation, with the $Q^{2}$ metric showing performance comparable to human annotation [89].

有根据性度量:与上述其他最近关于有根据性的研究类似,我们通过让众包工人判断模型的输出是否符合权威外部来源来评估其有根据性。最近提出的可归因于已识别来源 (AIS) 框架 [88] 阐述了一种更精确的方法来评估与外部世界相关的语言模型的输出。它将评估分为两个阶段,其中众包工人被要求:(1) 判断他们是否能够理解和识别对话轮次中共享的信息,以及 (2) 判断所有这些信息是否可以归因于某个来源。同时,最近的一项研究重新提出了自动评估的问题,$Q^2$ 度量显示出与人工标注相当的性能 [89]。

3 LaMDA pre-training

3 LaMDA预训练

LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone [17, 18], we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning.

LaMDA 被预训练用于预测文本语料库中的下一个 Token。与仅在对话数据上训练的先前对话模型不同 [17, 18],我们使用来自公共对话数据和其他公共网页文档创建的数据集对 LaMDA 进行了预训练。因此,在微调之前,LaMDA 可以作为通用语言模型使用。

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words (Appendix E). Over 90% of the pre-training dataset is in the English language. We used the SentencePiece library [90] to tokenize the dataset into 2.81T byte pair encoding (BPE) tokens [91], with a vocabulary of 32K tokens. For comparison, the total number of words in the training set for Meena [17] was 40B words, which is nearly 40x smaller.

预训练数据集包含 29.7 亿文档、11.2 亿对话和 133.9 亿对话轮次,总计 1.56 万亿词 (附录 E)。预训练数据集中超过 90% 的内容是英文。我们使用 SentencePiece 库 [90] 将数据集分词为 2.81 万亿个字节对编码 (BPE) Token [91],词汇量为 32K Token。作为对比,Meena [17] 的训练集总词数为 400 亿词,大约小 40 倍。

The largest LaMDA model has 137B non-embedding parameters, which is ~50x more parameters than Meena [17]. We use a decoder-only Transformer [92] language model as the model architecture for LaMDA. The Transformer has 64 layers, $d_{model}=8192$, $d_{ff}=65536$, $h=128$, $d_{k}=d_{v}=128$, relative attention as described in T5 [11], and gated-GELU activation as described in Raffel et al. [93].

最大的 LaMDA 模型有 137B 非嵌入参数,这比 Meena [17] 多约 50 倍的参数。我们使用仅解码器 Transformer [92] 语言模型作为 LaMDA 的模型架构。该 Transformer 有 64 层,$d_{model}=8192$ ,$d_{ff}=65536$ ,$h=128$ ,$d_{k}=d_{v}=128$ ,相对注意力机制如 T5 [11] 中所述,以及由 Raffel 等人 [93] 描述的门控 GELU 激活函数。
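下面给出一个简短的 Python 示意(非论文官方实现),按正文给出的超参数粗略估算最大 LaMDA 模型的非嵌入参数量;估算公式忽略了 LayerNorm、相对位置偏置等少量参数,属于我们的简化假设,仅用于说明数量级与 137B 基本吻合。

```python
from dataclasses import dataclass

# 粗略示意:根据正文给出的超参数估算非嵌入参数量。
# 该估算公式为我们的假设,仅用于说明数量级。
@dataclass
class LaMDAConfig:
    n_layers: int = 64
    d_model: int = 8192
    d_ff: int = 65536
    n_heads: int = 128
    d_head: int = 128  # d_k = d_v

    def approx_non_embedding_params(self) -> int:
        # 自注意力:Q/K/V/输出投影各为 d_model x (n_heads * d_head)
        attn = 4 * self.d_model * self.n_heads * self.d_head
        # gated-GELU 前馈层含三个权重矩阵(门控、上投影、下投影)
        ffn = 3 * self.d_model * self.d_ff
        return self.n_layers * (attn + ffn)

if __name__ == "__main__":
    cfg = LaMDAConfig()
    print(f"约 {cfg.approx_non_embedding_params() / 1e9:.1f}B 非嵌入参数")  # 约 137.4B
```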

We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch. We used the Lingvo framework [94] for training and achieved 123 TFLOPS/sec with 56.5% FLOPS utilization with the 2D sharding algorithm, as described in GSPMD [95] (see Section 10 for carbon footprint estimates). We also trained smaller 2B-parameter and 8B-parameter models to measure the effects of model scaling on our metrics. Hyperparameter details for the models of different sizes can be found in Table 27, Appendix D.

我们在 1024 个 TPU-v3 芯片上对 LaMDA 进行了预训练,总耗时约为 57.7 天,每批次 256K tokens。我们使用 Lingvo 框架 [94] 进行训练,并通过 2D 分片算法实现了 123 TFLOPS/秒的性能,FLOPS 利用率为 56.5%,如 GSPMD [95] 中所述(有关碳足迹估算,请参见第 10 节)。我们还训练了较小的 2B 参数和 8B 参数模型,以测量模型规模对我们指标的影响。不同规模模型的超参数详情请参见表 27,附录 D。

Figure 2 gives an overview of the pre-training stage. We call the model before any fine-tuning "PT", for PreTrained.

图 2 给出了预训练阶段的概述。我们将任何微调之前的模型称为 "PT",即预训练 (PreTrained) 模型。


Figure 2: LaMDA pre-training as a language model.


图 2: LaMDA 作为语言模型的预训练。

PT uses the same sample-and-rank strategy as Meena [17] for decoding. We first sample 16 independent candidate responses using top-k (k=40) sampling (no temperature). The final output is the highest-scoring candidate, where the score is based on the candidate’s log-likelihood and its length.

PT 使用与 Meena [17] 相同的“采样-排序”策略进行解码。我们首先使用 top-k (k=40) 采样(无温度调整)生成 16 个独立的候选回复。最终输出是得分最高的候选回复,其得分基于该候选回复的对数似然及其长度。
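下面是“采样-排序”解码流程的一个最小化 Python 草图,仅用于说明思路;其中 `sample_response`、`log_likelihood` 为假设的接口(并非论文或某个库提供的真实 API),按 token 数归一化的长度处理方式也是我们的简化假设。

```python
from typing import Callable, List, Tuple

def sample_and_rank(
    context: str,
    sample_response: Callable[[str, int], str],   # 假设的接口:用 top-k(k=40)采样生成一条候选回复
    log_likelihood: Callable[[str, str], float],  # 假设的接口:返回候选回复在给定上下文下的对数似然
    num_candidates: int = 16,
    top_k: int = 40,
) -> str:
    """先采样 16 条候选,再按得分取最高者。
    这里用"对数似然 / token 数"作为得分,只是对"基于对数似然和长度打分"的一种简化假设。"""
    candidates: List[Tuple[float, str]] = []
    for _ in range(num_candidates):
        response = sample_response(context, top_k)
        score = log_likelihood(context, response) / max(len(response.split()), 1)
        candidates.append((score, response))
    return max(candidates, key=lambda pair: pair[0])[1]
```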

4 Metrics

4 指标

Evaluating generative models in general, and open-ended dialog models in particular, is difficult. See the Related Work section for a general review of recent work in this area. In this section, we describe the metrics that we use for evaluation.

评估生成式模型 (Generative models) 通常很困难,特别是开放对话模型。请参见相关工作部分以获取该领域最近工作的总体回顾。在本节中,我们将描述用于评估的指标。

4.1 Foundation metrics: Quality, Safety and Groundedness

4.1 基础指标:质量、安全性和有根据性

Sensibleness, Specificity, Interestingness (SSI): Our overall quality score is an average of sensibleness, specificity, and interestingness (SSI).

合理性、具体性、趣味性 (SSI):我们的总体质量得分是合理性、具体性和趣味性 (SSI) 的平均值。

The first score, sensibleness, measures whether a model’s responses make sense in context and do not contradict anything that was said earlier. Humans tend to take this basic aspect of communication for granted, but generative models often struggle to meet this requirement. However, if sensibleness alone is used to evaluate models, we could inadvertently reward models for playing it safe by always producing short, generic, and boring responses. The GenericBot algorithm [17], which answers every question with “I don’t know” and every statement with “Ok,” scores 70% on sensibleness, which even surpasses some large dialog models [17].

第一个评分标准,合理性 (sensible ness),衡量模型的响应是否在上下文中合理且不与之前说过的内容相矛盾。人们往往将这一基本沟通方面视为理所当然,但生成式模型常常难以满足这一要求。然而,如果仅使用合理性来评估模型,我们可能会无意中奖励那些总是产生简短、通用和无聊响应的模型。例如,GenericBot算法 [17],它对每个问题回答“我不知道”,对每个陈述回答“好的”,在合理性评分上达到了 70% ,甚至超过了某些大型对话模型 [17]。

The second score, specificity, is used to measure whether a response is specific to a given context. For example, if a user says “I love Eurovision” and the model responds “Me too,” then it would score 0 on specificity, since this response could be used in many different contexts. If it answers “Me too. I love Eurovision songs,” then it would score 1. Adiwardana et al. [17] report that Meena narrows the gap to average human performance in the SSA metric.

第二个分数,具体性 (specificity),用于衡量响应是否针对给定的上下文。例如,如果用户说 “I love Eurovision” 而模型回答 “Me too”,那么它在具体性上的得分会是 0,因为这个回答可以用于许多不同的上下文。如果它回答 “Me too. I love Eurovision songs”,那么它会得 1 分。Adiwardana 等人 [17] 报告称 Meena 在 SSA 指标上缩小了与人类平均表现之间的差距。

As the model’s performance increases, however, we find that sensible ness and specificity are not sufficient to measure the quality of a dialog model. For example, a response to “How do I throw a ball?” could be “You can throw a ball by first picking it up and then throwing it”, which makes sense and is specific to the question. An alternative deeper and more satisfying answer could be “One way to toss a ball is to hold it firmly in both hands and then swing your arm down and up again, extending your elbow and then releasing the ball upwards.”

随着模型性能的提高,我们发现合理性和具体性不足以衡量对话模型的质量。例如,对于问题“我该如何扔球?”的回答可以是“你可以先拿起球然后再扔出去”,这回答是有道理且针对问题的。一个更深入、更令人满意的回答可以是:“一种扔球的方法是用双手牢固握住球,然后将手臂向下再向上挥动,伸展肘部并向上释放球。”

We attempt to translate this intuition into the third score, an observable quality which we call “Interestingness”. Similar to sensibleness and specificity, interestingness is measured as a 0/1 label by crowd workers. We ask crowd workers to label a response as interesting if they judge that it is likely to “catch someone’s attention” or “arouse their curiosity”, or if it is unexpected, witty, or insightful. (For the complete instructions given to crowd workers, see Appendix B).

我们将这种直觉转化为第三个评分,一个可观察的质量,我们称之为“趣味性 (Interesting ness)”。与合理性和具体性类似,趣味性也通过众包工人给出的 0/1 标签来衡量。我们要求众包工人如果认为某个回答有可能“引起某人的注意”或“激发他们的好奇心”,或者如果它出乎意料、风趣或有见地,则将其标记为有趣。(完整的众包工人指导说明请参见附录 B)。

Safety: A dialog model can achieve high quality (SSI) scores but can be unsafe for users. Therefore, we devise a new safety metric to measure unsafe model output. This metric follows objectives derived from Google’s AI Principles, to avoid unintended results that create risks of harm, and to avoid creating or reinforcing unfair bias. These safety objectives are described in detail in Appendix A.1.

安全:对话模型可以实现高质量 (SSI) 得分,但可能对用户不安全。因此,我们设计了一个新的安全指标来衡量不安全的模型输出。该指标遵循从 Google 的 AI 原则派生的目标,以避免产生造成伤害风险的意外结果,并避免创建或强化不公平的偏见。这些安全目标在附录 A.1 中有详细描述。

Groundedness: We aim to ensure that LaMDA produces responses that can be associated with known sources whenever possible, enabling cross-checking if desired, because the current generation of language models tends to produce plausible but incorrect statements.

有根据性:我们旨在确保 LaMDA 生成的响应尽可能与已知来源相关联,以便在需要时可以进行交叉验证,因为当前一代语言模型倾向于生成看似合理但不正确的陈述。

We define groundedness as the percentage of responses containing claims about the external world that can be supported by authoritative external sources, as a share of all those containing claims about the external world.

我们将有根据性定义为:在所有包含关于外部世界主张的响应中,其主张能够得到权威外部来源支持的响应所占的百分比。

We also define ‘Informativeness’ as the percentage of responses that carry information about the external world that can be supported by known sources as a share of all responses. Informativeness only differs from groundedness in the denominator term. So responses like “That’s a great idea” that do not carry any external world information do not affect groundedness, but they do affect informativeness. However, “Rafael Nadal is the winner of Roland Garros 2020” is an example of a grounded response.

我们还定义了‘信息量 (Informativeness)’,即在所有响应中,包含能够被已知来源支持的外部世界信息的响应所占的百分比。信息量与有根据性的区别仅在于分母。因此,像“那是一个很棒的想法”这样不包含任何外部世界信息的回复,不会影响有根据性,但会影响信息量。而“Rafael Nadal 是 2020 年罗兰加洛斯的冠军”则是一个有根据的响应的例子。

Finally, we define ‘Citation accuracy’ as the percentage of model responses that cite the URLs of their sources as a share of all responses with explicit claims about the external world, excluding claims with well-known facts (such as "horses have four legs").

最后,我们将‘引用准确性’定义为:在所有对外部世界做出明确断言的响应中(不包括“马有四条腿”这类广为人知的事实),引用了其来源 URL 的模型响应所占的百分比。
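下面用一个假设性的 Python 草图说明上述三个指标(有根据性、信息量、引用准确性)在分子和分母上的区别;其中的标注字段名均为示意,并非论文发布的数据格式。

```python
from dataclasses import dataclass
from typing import List

# 假设的标注结构:字段名为示意。
@dataclass
class LabeledResponse:
    has_external_claim: bool   # 回复是否包含关于外部世界的断言
    supported_by_source: bool  # 断言是否能被权威外部来源支持
    claim_is_well_known: bool  # 断言是否属于常识(如"马有四条腿")
    has_citation: bool         # 回复是否给出了来源 URL

def groundedness(responses: List[LabeledResponse]) -> float:
    # 分母:包含外部世界断言的回复
    claims = [r for r in responses if r.has_external_claim]
    if not claims:
        return 0.0
    return sum(r.supported_by_source for r in claims) / len(claims)

def informativeness(responses: List[LabeledResponse]) -> float:
    # 与有根据性仅分母不同:这里除以全部回复数
    if not responses:
        return 0.0
    return sum(r.has_external_claim and r.supported_by_source for r in responses) / len(responses)

def citation_accuracy(responses: List[LabeledResponse]) -> float:
    # 分母:对外部世界有明确断言、且不属于常识的回复
    need_citation = [r for r in responses if r.has_external_claim and not r.claim_is_well_known]
    if not need_citation:
        return 0.0
    return sum(r.has_citation for r in need_citation) / len(need_citation)
```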

4.2 Role-specific metrics: Helpfulness and Role consistency

4.2 角色特定指标:有用性和角色一致性

The foundation metrics (quality, safety, and groundedness) measure attributes that we find important for dialog agents in general. However, they are not dependent on any application-specific role that an agent may be designed for (e.g., teaching information about animals). We measure Helpfulness and Role consistency in dialog applications, where agents have specific roles.

基础指标(质量、安全性和有根据性)衡量我们认为对对话型 AI智能体普遍重要的属性。然而,这些指标不依赖于智能体可能被设计承担的任何特定应用角色(例如,教授动物知识)。我们在智能体具有特定角色的对话应用中测量有用性和角色一致性。

Helpfulness: The model’s responses are marked helpful if they contain correct information based on the user’s independent research with an information retrieval system, and the user considers them helpful. Helpful responses are a subset of informative ones, which are judged by the user to be both correct and useful.

帮助性:如果模型的响应包含基于用户使用信息检索系统独立研究的正确信息,并且用户认为这些响应有帮助,则标记为有帮助。有帮助的响应是有用信息响应的一个子集,后者由用户判断为既正确又有用。

Role consistency: The model’s responses are marked role consistent if they look like something an agent performing the target role would say. This is distinct from consistency with previous responses that the agent made in the dialog, and self-consistency within a dialog is measured by the sensibleness metric instead. Role consistency refers to consistency with the definition of the agent’s role external to the conversation.

角色一致性:如果模型的响应看起来像是执行目标角色的 AI智能体 会说的话,则标记为角色一致。这与对话中该智能体之前做出的响应的一致性不同,对话内的自一致性由合理性指标衡量。角色一致性指的是与对话外部定义的智能体角色的一致性。

These role-specific metrics are discussed further in Section 8.

这些特定角色的指标在第 8 节中进一步讨论。

5 LaMDA fine-tuning and evaluation data

5 LaMDA微调和评估数据

Quality (Sensibleness, Specificity, Interestingness): To improve quality (SSI), we collect 6400 dialogs with 121K turns by asking crowd workers to interact with a LaMDA instance about any topic. These dialogs are required to last 14 to 30 turns. For each response, we ask other crowd workers to rate whether the response given the context is sensible, specific, and/or interesting, and to mark each with ‘yes’, ‘no’, or ‘maybe’ labels. If a response is not sensible (the crowd worker did not mark it with ‘yes’), then we do not collect the labels for specificity and interestingness, and consider them to be ‘no’. Furthermore, if a response is not specific (the crowd worker did not mark it with ‘yes’), then we do not collect the label for interestingness, and consider it to be ‘no’. This ensures that responses are not rated positively for specificity if they are not sensible, and similarly, that responses are not rated positively for interestingness if they are not specific. Every response is labeled by 5 different crowd workers and the response is considered sensible, specific or interesting if at least 3 out of 5 crowd workers mark it ‘yes’.

质量(合理性、具体性、趣味性):为了提高质量 (SSI),我们收集了 6400 段对话,共 121K 轮次,通过让众包工人与 LaMDA 实例就任何主题进行互动。这些对话要求持续 14 到 30 轮次。对于每个回复,我们请其他众包工人评估该回复在给定上下文中是否合理、具体和/或有趣,并标记为‘是’、‘否’或‘可能’。如果一个回复不合理(众包工人未标记为‘是’),则我们不收集具体性和趣味性的标签,并认为它们为‘否’。此外,如果一个回复不具体(众包工人未标记为‘是’),则我们不收集趣味性的标签,并认为它为‘否’。这确保了如果回复不合理,则不会被正面评价为具体;同样,如果回复不具体,则不会被正面评价为有趣。每个回复由 5 名不同的众包工人标记,当至少有 3 名工人标记为‘是’时,该回复被认为合理、具体或有趣。
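以下 Python 草图演示正文描述的标签聚合规则(5 名标注者中至少 3 人标“yes”,且不合理的回复不再计入具体性与趣味性);输入的标签列表格式('yes'/'no'/'maybe' 字符串)为我们的假设。

```python
from typing import List

def majority_yes(labels: List[str], threshold: int = 3) -> bool:
    # 每条回复由 5 名众包工人标注;至少 3 人标 "yes" 才算通过
    return sum(label == "yes" for label in labels) >= threshold

def aggregate_ssi(sensible_votes: List[str],
                  specific_votes: List[str],
                  interesting_votes: List[str]) -> dict:
    """按正文描述的级联规则聚合 SSI 标签:
    不合理的回复其具体性与趣味性记为否;不具体的回复其趣味性记为否。"""
    sensible = majority_yes(sensible_votes)
    specific = sensible and majority_yes(specific_votes)
    interesting = specific and majority_yes(interesting_votes)
    return {"sensible": sensible, "specific": specific, "interesting": interesting}
```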

We evaluate the models based on the model’s generated responses to the Mini-Turing Benchmark (MTB) dataset[17], which consists of 1477 dialogs with up to 3 dialog turns. The MTB includes 315 single-turn dialogs, 500 2-turn dialogs, and 662 3-turn dialogs. These dialogs are fed to the model to generate the next response. Similar to above, every response is labeled sensible, specific or interesting if at least 3 out of 5 crowd workers mark it ‘yes’.

我们根据模型对 Mini-Turing Benchmark (MTB) 数据集 [17] 生成的响应来评估模型,该数据集包含 1477 段对话,每段对话最多有 3 轮。MTB 包括 315 段单轮对话,500 段两轮对话和 662 段三轮对话。这些对话被输入到模型中以生成下一轮的响应。与上述类似,如果至少 3 名评审者(共 5 名)标记为 ‘是’,则每个响应将被标注为合理的、具体的或有趣的。

Safety: For safety fine-tuning, we employ a structured approach that begins with defining the safety objectives (Appendix A.1). These objectives are used to annotate candidate responses generated by a LaMDA instance in response to human-generated prompts (Appendix A.2), using a demographically diverse set of crowd workers (Appendix A.3).

安全:对于安全微调,我们采用了一种结构化方法,该方法从定义安全目标 (附录 A.1) 开始。这些目标用于标注由 LaMDA 实例对人类生成的提示做出的候选响应 (附录 A.2),使用人口统计学多样化的众包工作者集进行标注 (附录 A.3)。

Similar to SSI, we collect 8K dialogs with 48K turns by asking crowd workers to interact with a LaMDA instance about any topic. These dialogs are required to last 5 to 10 turns. We instruct crowd workers to interact with the model in three different ways: (a) interactions of natural form, (b) interactions that touch sensitive topics, and (c) interactions that adversarially attempt to break the model as per the safety objectives. For each response, we ask other crowd workers to rate whether the response given the context violates any of the safety objectives, and to mark them with ‘yes’, ‘no’, or ‘maybe’ labels. Every response is assigned a safety score of 1 if at least 2 out of 3 crowd workers mark the response with ‘no’ for each individual safety objective. Otherwise, it is assigned a score of 0.

类似于 SSI,我们通过让众包工人与 LaMDA 实例就任何主题进行交互,收集了 8K 对话,共 48K 轮次。这些对话要求持续 5 到 10 轮次。我们指示众包工人以三种不同方式与模型互动:(a) 自然形式的互动,(b) 涉及敏感话题的互动,以及 (c) 根据安全目标对抗性地尝试破坏模型的互动。对于每个响应,我们要求其他众包工人评估该响应是否在给定上下文中违反了任何安全目标,并用‘是’、‘否’或‘可能’标签标记。如果对每一项安全目标,3 名众包工人中都至少有 2 名标记为‘否’,则该响应的安全分数为 1;否则为 0。

We evaluate safety using an evaluation dataset that is a holdout sample of the adversarially collected dataset described above. This dataset consists of 1166 dialogs with 1458 turns. These dialogs are input to the model to generate the next response. Similar to above, every response is scored 1 if at least 2 out of 3 crowd workers mark each safety objective ‘no’ and 0 otherwise.

我们使用一个评估数据集来评估安全性,该数据集是上述对抗性收集数据集的保留样本。这个数据集包含 1166 段对话,共有 1458 轮交互。这些对话被输入到模型中以生成下一个回复。与上述类似,每个回复如果至少有 2 名评审员(共 3 名)标记每个安全目标为“否”,则得分为 1,否则为 0。
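下面是安全分聚合规则的一个简短示意(假设每个安全目标对应 3 个 'yes'/'no'/'maybe' 标签,输入格式为我们的假设):

```python
from typing import Dict, List

def safety_score(votes_per_objective: Dict[str, List[str]]) -> int:
    """按正文规则计算安全分:对每一项安全目标,3 名标注者中至少 2 人标 "no"
    (即"未违反")时该目标通过;所有目标都通过则记 1 分,否则记 0 分。"""
    for votes in votes_per_objective.values():
        if sum(v == "no" for v in votes) < 2:
            return 0
    return 1
```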

Groundedness: Similar to SSI and safety, we collect 4K dialogs with 40K turns by asking crowd workers to interact with the model. This time, we request that they try to steer the conversation towards information-seeking interactions.

有根据性:类似于 SSI 和安全性,我们通过邀请众包工人与模型进行交互,收集了 4K 对话,共 40K 轮次。这一次,我们要求他们尝试将对话引导至信息寻求互动。

We ask crowd workers to rate each of the model’s dialog turns, evaluating whether the information in the turn makes any claims about the external world. We exclude claims about publicly unrecognized people, as the model can make factual claims on behalf of an improvised persona. Such claims do not require grounding on external sources (e.g., “I baked three cakes last week”), unlike claims about historical people (e.g., “Julius Caesar was born in 100 BC”).

我们要求众包工人对模型的每个对话轮进行评分,评估该轮中的信息是否对外部世界做出了任何主张。我们排除了关于公众不知名人物的主张,因为模型可以为即兴创作的人物做出事实主张。这类主张不需要基于外部来源(例如:“我上周烤了三个蛋糕”),而与历史人物相关的主张则不同(例如:“Julius Caesar 出生于 100 B.C. ”)。

We also ask crowd workers whether they know the claims to be true. If 3 different crowd workers all know a claim to be true, then we assume it to be common knowledge and do not check external knowledge sources before making this claim.

我们还询问众包工人是否知道这些主张是真实的。如果有 3 位不同的众包工人都知道某个主张是真实的,那么我们就认为这是常识,在提出这个主张之前不会检查外部知识来源。

For utterances containing claims that need to be checked, we ask crowd workers to record the search queries that they would use to investigate them. Finally, we ask crowd workers to edit the model’s response to incorporate brief search results from an external knowledge-retrieval system. If the search results include any content from the open web, we ask crowd workers to include URLs that appropriately cite the sources of the knowledge used in the final response.

对于包含需要核查的声明的语句,我们要求众包工作者记录他们将用于调查这些声明的搜索查询。最后,我们要求众包工作者编辑模型的响应,以纳入来自外部知识检索系统的简要搜索结果。如果搜索结果包含任何来自开放网络的内容,我们要求众包工作者在最终响应中包含适当的 URL 以引用所使用知识的来源。

We evaluate groundedness using an evaluation dataset with 784 turns of dialogs from Dinan et al. [96] that encompass a variety of topics. These contexts are fed to the model to generate the next response. For each response, we ask crowd workers to rate whether the model’s response contains any factual claims, and if so, to rate whether these factual claims can be verified by checking a known source. Every response is labeled by 3 different crowd workers. The final groundedness, informativeness, and citation accuracy labels of a given response are determined by majority voting. All of the fine-tuning and evaluation datasets are in English.

我们使用来自 Dinan 等人 [96] 的评估数据集来评估有根据性,该数据集包含 784 轮对话,涵盖了各种主题。这些上下文被输入到模型中以生成下一个响应。对于每个响应,我们要求众包工人评估模型的响应是否包含任何事实声明,如果包含,则评估这些事实声明是否可以通过检查已知来源进行验证。每个响应由 3 名不同的众包工人标注。给定响应的最终有根据性、信息性和引用准确性标签由多数投票决定。所有微调和评估数据集均为英文。

Estimating these metrics for human-generated responses: We ask crowd workers to respond to randomly selected samples of the evaluation datasets (labeled as ‘Human’ in Figures 1, 4 and 5). The crowd workers are explicitly informed to reply in a safe, sensible, specific, interesting, grounded, and informative manner. They are also explicitly asked to use any external tools necessary to generate these responses (e.g., including an information retrieval system). The context-response pairs are then sent for evaluation, and a consensus label is formed by majority voting, just as for model generated responses.

为人类生成的响应估计这些指标:我们要求众包工人对评估数据集的随机样本进行响应(在图 1、4 和 5 中标记为 ‘Human’)。明确告知众包工人以安全、合理、具体、有趣、有根据和信息丰富的方式回复。还明确要求他们使用任何必要的外部工具来生成这些响应(例如,包括一个信息检索系统)。然后将上下文-响应对发送进行评估,并通过多数投票形成共识标签,就像对模型生成的响应一样。

6 LaMDA fine-tuning

6 LaMDA 微调

6.1 Discriminative and generative fine-tuning for Quality (SSI) and Safety

6.1 区分式和生成式微调以提升质量 (SSI) 和安全性

We create LaMDA using several fine-tunings applied to the pre-trained model (PT). These include a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context. This results in a single model that can function as both a generator and a discriminator.

我们使用几种微调方法创建了 LaMDA,这些方法应用于预训练模型 (PT)。这些方法包括生成任务和判别任务的混合,生成任务根据上下文生成响应,判别任务评估上下文中响应的质量和安全性。这导致了一个单一模型,既可以作为生成器又可以作为判别器发挥作用。

Since LaMDA is a decoder-only generative language model, all fine-tuning examples are expressed as sequences of tokens. Generative fine-tuning examples are expressed as “<context> <sentinel> <response>”, with losses applied only for the response portion:

由于 LaMDA 是一个仅解码器的生成式语言模型 (decoder-only generative language model),所有微调示例都表示为 Token 序列。生成式微调示例表示为 “<context> <sentinel> <response>”,其中损失仅应用于响应部分:

• “What’s up? RESPONSE not much.”

• “最近怎么样? RESPONSE 没有什么特别的。”

Discriminative fine-tuning examples are expressed as “<context> <sentinel> <response> <attribute-name> <rating>”, with losses applied for the rating following the attribute name only:

判别式微调示例表示为 “<context> <sentinel> <response> <attribute-name> <rating>”,仅对属性名称之后的评分应用损失:

• “What’s up? RESPONSE not much. SENSIBLE 1”
• “What’s up? RESPONSE not much. INTERESTING 0”
• “What’s up? RESPONSE not much. UNSAFE 0”

• “最近怎么样? RESPONSE 没有什么。 SENSIBLE 1”
• “最近怎么样? RESPONSE 没有什么。 INTERESTING 0”
• “最近怎么样? RESPONSE 没有什么。 UNSAFE 0”

Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure. After generating a response given a context, evaluating a discriminator involves computing P(“<desired-rating>” | “<context> <sentinel> <response> <attribute-name>”). Since the model has already processed “<context> <sentinel> <response>”, evaluating the discriminator simply involves processing a few additional tokens: “<attribute-name> <desired-rating>”.

使用一个模型同时进行生成和判别,可以实现高效的生成与判别联合过程。在给定上下文生成响应之后,评估判别器只需计算 P(“<desired-rating>” | “<context> <sentinel> <response> <attribute-name>”)。由于模型已经处理了 “<context> <sentinel> <response>”,评估判别器只需再处理几个额外的 Token:“<attribute-name> <desired-rating>”。
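下面用一个假设性的 Python 草图说明上述微调样本的拼接方式,以及推理时如何以极少的额外 Token 评估判别器;其中 `token_prob` 为假设的接口,用于返回给定前缀下某个评分 Token 的概率。

```python
from typing import Callable

def generative_example(context: str, response: str) -> str:
    # 生成式微调样本:"<context> RESPONSE <response>",损失只作用于回复部分
    return f"{context} RESPONSE {response}"

def discriminative_example(context: str, response: str, attribute: str, rating: int) -> str:
    # 判别式微调样本:"<context> RESPONSE <response> <属性名> <评分>",损失只作用于属性名之后的评分
    return f"{context} RESPONSE {response} {attribute} {rating}"

def discriminator_score(context: str, response: str, attribute: str,
                        token_prob: Callable[[str, str], float]) -> float:
    # 推理时评估判别器:在已处理过 "<context> RESPONSE <response>" 的前提下,
    # 只需再计算 P(期望评分 | 前缀 + 属性名);这里以期望评分为 "1" 作为示意假设。
    prefix = f"{context} RESPONSE {response} {attribute}"
    return token_prob(prefix, "1")

# 用法示例
print(discriminative_example("What's up?", "not much.", "SENSIBLE", 1))
# -> "What's up? RESPONSE not much. SENSIBLE 1"
```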

First, we fine-tune LaMDA to predict the SSI and safety ratings of the generated candidate responses. Then, we filter out candidate responses for which the model’s safety prediction falls below a threshold during generation. Candidate responses that remain after filtering for safety are then ranked for quality. During ranking, sensibleness is given a weight three times higher than specificity and interestingness, as this was found to work well for all metrics (i.e., $3 * P(\mathrm{sensible}) + P(\mathrm{specific}) + P(\mathrm{interesting})$). The top ranked candidate is selected as the next response.

首先,我们对 LaMDA 进行微调,以预测生成的候选响应的 SSI 和安全性评分。然后,在生成过程中,我们将模型的安全性预测低于阈值的候选响应过滤掉。经过安全性过滤后剩余的候选响应将根据质量进行排名。在排名时,合理性被赋予的权重是具体性和趣味性的三倍,因为这被发现对所有指标都有效 (即,3 * P(合理) + P(具体) + P(有趣)) 。排名最高的候选响应被选为下一个回复。
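以下是“先按安全性过滤、再按 3·P(sensible) + P(specific) + P(interesting) 排序”这一候选选择流程的最小化示意;安全阈值 0.5 为示意性假设,论文并未给出具体数值。

```python
from typing import List, NamedTuple, Optional

class Candidate(NamedTuple):
    text: str
    p_safe: float
    p_sensible: float
    p_specific: float
    p_interesting: float

def select_response(candidates: List[Candidate],
                    safety_threshold: float = 0.5) -> Optional[str]:
    """先按安全分数过滤,再按 3*P(sensible) + P(specific) + P(interesting) 取最高者。"""
    safe = [c for c in candidates if c.p_safe >= safety_threshold]
    if not safe:
        return None  # 所有候选都被安全过滤掉时的处理方式属于假设
    best = max(safe, key=lambda c: 3 * c.p_sensible + c.p_specific + c.p_interesting)
    return best.text
```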

LaMDA SSI and safety discriminators are also used to score and filter 2.5M turns of dialog data sampled from the pre-training dataset (Section 3), resulting in 800K turns of safe, sensible, specific and interesting dialogs. We then fine-tune the LaMDA model over this dataset to generate the response in a given context.

LaMDA 的 SSI 和安全判别模型也用于对从预训练数据集中采样的 2.5M 轮对话数据 (第 3 节) 进行评分和过滤,最终得到 800K 轮安全、合理、具体且有趣的对话。然后我们在该数据集上微调 LaMDA 模型,以在给定的上下文中生成响应。

We see significant gains in safety and quality for LaMDA using this technique (Figure 5).

我们看到使用这种方法,LaMDA 在安全性和质量方面有显著提升 (图 5)。

6.2 Fine-tuning to learn to call an external information retrieval system

6.2 微调以学习调用外部信息检索系统

Language models such as LaMDA tend to generate outputs that seem plausible, but contradict facts established by known external sources. For example, given a prompt such as the opening sentences of a news article, a large language model will continue them with confident statements in a brisk journalistic style. However, such content is merely imitating what one might expect to find in a news article without any connection to trustworthy external references.

像 LaMDA 这样的语言模型倾向于生成看似合理但与已知外部来源确立的事实相矛盾的输出。例如,给定一个新闻文章的开头句子这样的提示,大语言模型会以自信的陈述和简洁的新闻风格继续展开。然而,这样的内容只是模仿了人们可能在新闻文章中预期看到的内容,而与可信的外部参考资料没有任何联系。

One possible solution to this problem could be to increase the size of the model, based on the assumption that the model can effectively memorize more of the training data. However, some facts change over time, like the answers to ‘How old is Rafael Nadal?’ or ‘What time is it in California?’. Lazaridou et al. (2021) call this the temporal generalization problem [97]. Recent work proposed using a dynamic or incremental training architecture to mitigate this issue (e.g., [97, 98]). It may be difficult to obtain sufficient training data and model capacity to achieve this, as a user may be interested in conversing about anything within the corpus of human knowledge.

一个可能的解决方案是增加模型的规模,基于模型可以更有效地记忆更多训练数据的假设。然而,一些事实会随着时间而改变,例如“拉斐尔·纳达尔多大了?”或“加州现在几点了?”。Lazaridou 等 (2021) 将此称为时间泛化问题 [97]。最近的研究提出使用动态或增量训练架构来缓解这一问题(例如,[97, 98])。要获得足够的训练数据和模型容量以实现这一点可能很困难,因为用户可能对人类知识语料库中的任何内容感兴趣。

We present our approach to fine-tuning by learning to consult a set of external knowledge resources and tools.

我们介绍了通过学习咨询一组外部知识资源和工具来进行微调的方法。

The toolset (TS): We create a toolset (TS) that includes an information retrieval system, a calculator, and a translator. TS takes a single string as input and outputs a list of one or more strings. Each tool in TS expects a string and returns a list of strings. For example, the calculator takes “135+7721”, and outputs a list containing [“7856”]. Similarly, the translator can take “hello in French” and output [“Bonjour”]. Finally, the information retrieval system can take “How old is Rafael Nadal?”, and output [“Rafael Nadal / Age / 35”]. The information retrieval system is also capable of returning snippets of content from the open web, with their corresponding URLs. The TS tries an input string on all of its tools, and produces a final output list of strings by concatenating the output lists from every tool in the following order: calculator, translator, and information retrieval system. A tool will return an empty list of results if it can’t parse the input (e.g., the calculator cannot parse “How old is Rafael Nadal?”), and therefore does not contribute to the final output list.

工具集 (TS):我们创建了一个工具集 (TS),其中包括一个信息检索系统、一个计算器和一个翻译器。TS 接受单个字符串作为输入,并输出一个或多个字符串的列表。TS 中的每个工具都期望接收一个字符串并返回一个字符串列表。例如,计算器接受 “135+7721”,并输出包含 [“7856”] 的列表。同样,翻译器可以接受 “hello in French” 并输出 [“Bonjour”]。最后,信息检索系统可以接受 “How old is Rafael Nadal?”,并输出 [“Rafael Nadal / Age / 35”]。信息检索系统还能够从开放网络返回带有相应 URL 的内容片段。TS 尝试将输入字符串应用于所有工具,并通过按以下顺序连接每个工具的输出列表来生成最终的字符串列表:计算器、翻译器和信息检索系统。如果某个工具无法解析输入(例如,计算器无法解析 “How old is Rafael Nadal?”),则会返回一个空的结果列表,因此不会对最终输出列表做出贡献。
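下面给出工具集 TS 调度逻辑的一个简化 Python 草图:按计算器、翻译器、信息检索系统的固定顺序依次尝试输入字符串,并拼接各工具返回的字符串列表;三个工具函数均为假设的接口,与论文的实际实现无关。

```python
from typing import Callable, List

def make_toolset(calculator: Callable[[str], List[str]],
                 translator: Callable[[str], List[str]],
                 retriever: Callable[[str], List[str]]) -> Callable[[str], List[str]]:
    """构造 TS:输入一个字符串,按计算器、翻译器、信息检索系统的固定顺序依次尝试,
    并把各工具的输出列表拼接成最终结果;无法解析输入的工具应返回空列表。"""
    tools = [calculator, translator, retriever]

    def toolset(query: str) -> List[str]:
        results: List[str] = []
        for tool in tools:
            results.extend(tool(query))  # 约定:解析失败时工具返回 []
        return results

    return toolset

# 用法示意:一个极简的假设性计算器,仅用于演示"无法解析则返回空列表"
def toy_calculator(query: str) -> List[str]:
    try:
        return [str(eval(query, {"__builtins__": {}}, {}))]  # 仅作演示,生产环境不应使用 eval
    except Exception:
        return []
```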

Dialog collection: We collect 40K annotated dialog turns (generative data). We also collect 9K dialog turns, in which the LaMDA’s generated candidates are labeled ‘correct’ or ‘incorrect’, to be used as input data for the ranking task (discriminative data).

对话收集:我们收集了 40K 个标注的对话轮次 (生成式数据)。我们还收集了 9K 个对话轮次,其中 LaMDA 生成的候选回复被标记为 ‘正确’ 或 ‘错误’,用作排序任务的输入数据 (判别式数据)。

We collect a set of human-human dialogs between crowd workers, focused on information-seeking interactions, and evaluate whether their statements can be supported by known authoritative sources. As seen in Figure 4, it is notable that they make well-supported claims at a higher rate if they have access to TS. When asked for Rafael Nadal’s age, a human expert may not know the answer immediately, but can easily query an information retrieval system to obtain it. Therefore, we decided to fine-tune our language model to provide attributions for its responses by looking up its claims using a toolset.

我们收集了一组众包工作者之间以信息寻求交互为重点的人与人对话,并评估他们的陈述是否可以得到已知权威来源的支持。如图 4 所示,值得注意的是,如果他们可以访问 TS,他们做出的有充分支持的声明的比例更高。当被问及拉斐尔·纳达尔的年龄时,人类专家可能无法立即知道答案,但可以轻松查询信息检索系统以获取答案。因此,我们决定微调我们的语言模型,通过使用工具集查找其声明来为其响应提供出处。

To collect training data for the fine-tuning used in the algorithm, we use both static and interactive methods again. The key difference from the other sub-tasks is that the crowd workers are not reacting to the model’s output, but rather intervening to correct it in a way that LaMDA can learn to imitate. In the interactive case, a crowd worker carries out a dialog with LaMDA, whereas in the static case, they read over records of earlier dialogs, turn by turn. The crowd worker decides whether each statement contains any claims that might require reference to an external knowledge source. If so, they are asked whether the claims are about anything other than the persona improvised by LaMDA, and then whether they go beyond simple matters of common sense. If the answer to any of these questions is ’no’, the model’s output is marked ‘good’, and the dialog moves on. Otherwise, the crowd worker is asked to research the claims using the toolset, via a text-in and text-out interface.

为了收集用于算法微调的训练数据,我们再次使用静态和交互式方法。与其它子任务的关键区别在于,众包工作者不是对模型的输出做出反应,而是以一种 LaMDA 可以学习模仿的方式进行纠正。在交互式情况下,众包工作者与 LaMDA 进行对话;而在静态情况下,他们逐轮阅读之前的对话记录。众包工作者决定每个陈述是否包含可能需要参考外部知识源的主张。如果是这样,他们会判断这些主张是否涉及 LaMDA 即兴创作的人物之外的内容,然后判断它们是否超出了简单的常识问题。如果这些问题的答案是“否”,则模型的输出被标记为“良好”,对话继续进行。否则,众包工作者会被要求使用工具集通过文本输入和文本输出界面研究这些主张。

The interface to the set of tools used here is identical to the service used by the algorithm at inference time. Given a general text query, the information retrieval system returns a set of brief, text-only snippets in rank order. Snippets of open-web content include URLs for their source; answers provided directly by the information retrieval system (e.g., the current time) or by the calculator tool do not. When the user has finished running queries, they have the opportunity to rewrite the model’s statement to include well-sourced claims. If they used open-web content, we ask them to cite the URLs needed to support any responses which contain information pertaining to the external world. URLs can be appended to the end of the message, or if the context warrants it, they can be attached inline to particular words in the response using Markdown format.

此处使用的工具集接口与算法在推理时使用的服务完全相同。给定一个通用文本查询,信息检索系统按排名顺序返回一组简短的纯文本片段。开放网络内容的片段包含其来源的 URL,而信息检索系统直接提供的答案(例如,当前时间)或由计算器工具提供的答案则不包含 URL。当用户完成查询后,他们有机会重写模型的陈述,以包含有可靠来源的声明。如果他们使用了开放网络内容,我们要求他们引用支持任何包含外部世界信息的回答所需的 URL。URL 可以附加在消息末尾,或者根据上下文需要,可以使用 Markdown 格式将它们内联附加到响应中的特定词语上。

Fine-tuning: We then fine-tune LaMDA to perform two tasks.

微调:然后我们对 LaMDA 进行微调以执行两个任务。

The first task takes the multiturn dialog context to date and the response generated by the base model. It then generates a special string (“TS” for toolset) indicating the following text is a query (e.g., “How old is Rafael Nadal?”) that should be sent to the toolset: context + base → “TS, Rafael Nadal’s age”.

第一个任务接收到目前为止的多轮对话上下文和基础模型生成的回复。然后生成一个特殊字符串(“TS” 表示工具集),指示随后的文本是一个应发送给工具集的查询(例如,“拉斐尔·纳达尔多大了?”):context + base → “TS, Rafael Nadal’s age”。

The second task takes the snippet returned by a tool, and a dialog statement (e.g., “He is 31 years old right now” + “Rafael Nadal / Age / 35”). It then predicts the grounded version: context + base + query + snippet → “User, He is 35 years old right now”. Alternatively, this task can also output an additional research query. For example, context + base + query + snippet → “TS, Rafael Nadal’s favorite song”. At inference time, the model’s output is directed to the information retrieval system or to the user, depending on whether the first generated string is ‘TS’ or ‘User’.

第二个任务接收工具返回的片段和一个对话陈述(例如,“他现在 31 岁” + “Rafael Nadal / Age / 35”)。然后它预测有根据的版本:context + base + query + snippet → “User, He is 35 years old right now”。或者,此任务也可以输出一个额外的研究查询,例如:context + base + query + snippet → “TS, Rafael Nadal’s favorite song”。在推理时,模型的输出会根据其生成的第一个字符串是 ‘TS’ 还是 ‘User’,被导向信息检索系统或用户。

For instance, a dialog might include the following statement from the user:

例如,对话可能包括用户以下陈述:

USER: What do you think of Rosalie Gascoigne’s sculptures?

USER: 你对罗莎莉·加斯科因的雕塑有什么看法?

(This transcript was taken from a dialog with one of the authors, and includes the generated base output and search queries, which are not usually shown to the user.)

(此对话记录摘自与其中一位作者的对话,包括生成的基础输出和搜索查询,这些内容通常不会显示给用户。)

The basic LaMDA language model, which we refer to as the ‘Base’ model here, generates a draft response:

基础 LaMDA 语言模型,即此处所指的‘Base’模型,生成一个草稿回复:

LAMDA-BASE: They’re great, and I love how her work changed through her life. I like her later work more than her earlier ones. Her influence is also super interesting - did you know she was one of the artists that inspired Miró?

LAMDA-BASE: 她的作品很棒,我喜欢她一生中作品的变化。我更喜欢她后期的作品而不是早期的。她的影响力也非常有趣——你知道她是启发米罗 (Miró) 的艺术家之一吗?

Note that the last sentence seems plausible at first glance, but if a lover of art history opened their reference books to find out more, they would be disappointed. Miró was active from 1918 to the late 1960s, and Gascoigne’s first solo exhibitions were in the early 1970s.

请注意,最后一句话乍看之下似乎合理,但如果艺术史爱好者翻开参考书进一步了解,他们会感到失望。米罗 (Miró) 的活跃时期是从 1918 年到 20 世纪 60 年代末,而加斯科因 (Gascoigne) 的第一次个人展览是在 20 世纪 70 年代初。

At this point, the ‘Research’ phase of the model outputs some queries to TS to assess the ‘Base’ output’s claims. Recall that the ‘Research’ phase is one specialized task from a set that belong to a single multi-tasking model (e.g., ‘Base’ dialog response generation task, safety, and quality tasks). Query generation is entirely based on the model fine-tuning, and there is no heuristic component. The model’s output determines the number of queries—the research loop continues until the model generates output addressed to the user—but we impose a maximum at inference time to eliminate the possibility of an infinite loop. The maximum is a parameter of the serving model. In the current example, it is set to four.

在这一阶段,模型的‘研究’阶段会输出一些查询给 TS,以评估‘基础’输出的声明。请注意,‘研究’阶段是属于单一多任务模型中的一项专门任务(例如,‘基础’对话响应生成任务、安全性和质量任务)。查询生成完全基于模型的微调,没有任何启发式组件。模型的输出决定了查询的数量——研究循环会持续进行,直到模型生成针对用户的输出,但在推理时我们设定了最大值以消除无限循环的可能性。这个最大值是服务模型的一个参数。在当前示例中,它被设置为四。
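下面用一个假设性的 Python 草图概括这一“研究”循环:模型输出以 “TS” 开头则把查询发给工具集并把返回片段拼回输入序列,以 “User” 开头则结束循环并返回回复;序列的拼接方式与超出最大查询次数后的回退策略均为我们的简化假设。

```python
from typing import Callable, List

def research_loop(context: str,
                  base_response: str,
                  generate: Callable[[str], str],      # 假设的接口:研究模型根据输入序列生成 "TS, ..." 或 "User, ..."
                  toolset: Callable[[str], List[str]], # 假设的接口:工具集,输入字符串返回片段列表
                  max_queries: int = 4) -> str:
    """示意"研究"循环:直到模型生成面向用户的输出,或达到最大查询次数为止。"""
    state = f"{context} {base_response}"
    for _ in range(max_queries):
        output = generate(state)
        recipient, _, payload = output.partition(", ")
        if recipient == "User":
            return payload                       # 面向用户的最终回复
        snippets = toolset(payload)              # recipient == "TS":向工具集发查询
        state = f"{state} TS {payload} {' '.join(snippets)}"
    # 超过最大查询次数仍未产生面向用户的输出时,退回基础回复(此回退策略为假设)
    return base_response
```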

For example, the ‘Research’ phase may issue the following query:

例如,“研究”阶段可能发出以下查询:

LAMDA-RESEARCH: TS, Miró and Gascoigne

LAMDA-RESEARCH: TS,Miró 和 Gascoigne

TS: Gascoigne, Rosalie | Suddenly the Lake - NGA’s Collection Search The course of Gascoigne’s artistic life is an inspiration for those who are only able to devote themselves to their chosen calling late in life. She first exhibited her art ... https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true

TS: Gascoigne, Rosalie | Suddenly the Lake - NGA 藏品检索 Gascoigne 的艺术生涯历程,对那些只能在晚年才投身于自己所选事业的人来说是一种鼓舞。她首次展出自己的艺术作品 ... https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true


The punctuation and ellipsis in the search snippet is exactly as the model sees it. The ‘Research’ phase issues another query:

搜索片段中的标点和省略号正是模型所看到的那样。'研究'阶段发出另一个查询:

LAMDA-RESEARCH: TS, Miró and Gascoigne

LAMDA-RESEARCH: TS, Miró 和 Gascoigne


TS: (**Gascoigne** had been a practitioner of wild, avant-garde ikebana – Japanese flower arrangement – before turning late in life to sculpture.) The two-way alternation of visual flow creates a pleasing tension, like an electric charge. There are other tensions too, of class and gender. https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true

TS: (Gascoigne 在晚年转向雕塑之前,曾是一位狂野、前卫的插花艺术 – 日本花道 – 的实践者。)双向交替的视觉流动创造出一种令人愉悦的张力,就像电荷一样。此外还有阶级和性别方面的张力。https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true


By repeating the same query, the model receives the second-ranked result snippet, which includes more details from the same source in this case.

通过重复相同的查询,模型接收排名第二的结果片段,在这种情况下,该片段包含来自同一来源的更多详细信息。

With this context, the ‘Research’ phase elects to generate output to the user. The dialog tone is similar to the ‘Base’ version of the response, but the model replaces the unverified claim (since no evidence for it can be seen in the TS results) with a different statement based on a source found with TS. The final output is as follows:

在此背景下,‘研究’阶段选择向用户生成输出。对话的语气与‘基础’版本的响应类似,但模型用基于通过 TS 找到的来源的不同陈述替换了未经验证的主张(因为 TS 结果中看不到支持该主张的证据)。最终输出如下:

LAMDA: Oh wow, her life course is so inspiring. Did you know she was a practitioner of Japanese flower arrangement before turning to sculpture? https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true

LAMDA: 哇哦,她的人生历程真是鼓舞人心。你知道她在转向雕塑之前是日本花道的实践者吗?https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true

For another example, this process is summarized in Figure 3.

例如,此过程总结在图 3 中。


Figure 3: How LaMDA handles groundedness through interactions with an external information retrieval system. Blue: Model. Yellow: Input to model. Red: Output of model. Green: Output of information retrieval system tool. As discussed in the main text, the LaMDA-Base model is called first, followed by sequential calls to the LaMDA-Research model. The choice between querying the information retrieval system or responding to the user is determined by the first word output by LaMDA-Research, which identifies the next recipient.

图 3: LaMDA 通过与外部信息检索系统交互处理有根据性的方式。蓝色:模型。黄色:输入到模型。红色:模型输出。绿色:信息检索系统工具的输出。如正文所述,首先调用 LaMDA-Base 模型,然后依次调用 LaMDA-Research 模型。查询信息检索系统或响应用户的决定由 LaMDA-Research 输出的第一个词确定,该词标识下一个接收者。

7 Results on foundation metrics

7 基础指标上的结果

We first summarize the datasets and methods used, and then discuss the main results.

我们首先总结所使用的数据集和方法,然后讨论主要结果。

Table 1 presents a summary of the crowd worker-annotated data that we use to improve the foundation metrics in this paper.

表 1 总结了我们用于改进本文基础指标的众包工人标注数据。

Leveraging these datasets, we perform two levels of fine-tuning, as discussed in Section 6:

利用这些数据集,我们执行两个级别的微调,如第 6 节所述:

• FT quality-safety: fine-tune the pre-trained model (PT) to train discriminators that predict quality and safety labels. The generated candidate responses are filtered at inference time by their safety scores, and re-ranked by a weighted sum of the three quality score types. PT is also fine-tuned to generate in-context responses from a clean sample of pre-training dialog data filtered using LaMDA discriminators. See Section 6.1 for more details.
• FT groundedness (LaMDA): fine-tune FT quality-safety to generate calls to an external information retrieval system to provide attributed responses. The model is also fine-tuned to jointly predict the quality and the type (i.e., calling a certain tool or replying to the user) of the next action. See Section 6.2 for more details.

• FT 质量-安全:微调预训练模型 (PT) 以训练判别器,这些判别器预测质量和安全标签。生成的候选响应在推理时根据其安全分数进行过滤,并通过三种质量分数类型的加权和重新排序。PT 还被微调以从经过 LaMDA 判别器过滤的干净预训练对话数据样本中生成上下文响应。详见第 6.1 节。

• FT 有根据性 (LaMDA):微调 FT 质量-安全模型,以生成对外部信息检索系统的调用,从而提供有出处的响应。该模型还被微调以联合预测下一个动作的质量和类型(即,调用某个工具或回复用户)。详见第 6.2 节。

Table 1: Summary of the datasets to improve safety, groundedness, and quality.

表 1: 提高安全性、有根据性和质量的数据集总结。

| 指标 | 微调数据 | 评估数据 |
| --- | --- | --- |
| 质量(合理性、具体性、趣味性的二元标签) | 6.4K 段对话(61K 轮):众包工人在给定上下文下,对响应是否合理、具体、有趣进行标注 | 来自 Adiwardana 等人 [17] 常用基准数据集的 1477 轮对话(静态评估) |
| 安全性 | 8K 段对话(48K 轮):对每个安全目标进行二元标注 | 众包工人依据安全目标(附录 A.2),对 1458 轮具有挑衅性的用户轮次所对应的响应进行标注 |
| 有根据性 | 4K 段对话(40K 轮):众包工人向信息检索系统撰写查询并修改模型响应;另有 1K 段对话(9K 轮)对生成的查询或响应修改是否正确执行进行二元标注 | 众包工人根据上下文评估 784 条响应的信息性和有根据性 |

We define LaMDA to be the model that incorporates all of the fine-tunings described above. We present their results in Figure 4, and compare them to pre-training alone.

我们将 LaMDA 定义为包含上述所有微调的模型。我们在图 4 中展示了它们的结果,并将其与仅预训练的结果进行比较。

The figure shows that fine-tuning (in particular LaMDA) produces a significant improvement in quality, safety and groundedness across all model sizes. Moreover, quality metrics (sensibleness, specificity, and interestingness) generally improve with model size with or without fine-tuning, but they are consistently better with fine-tuning.

该图显示,微调(特别是 LaMDA)在所有模型规模上都显著提高了质量、安全性和有根据性。此外,无论是否微调,质量指标(合理性、具体性和趣味性)通常都会随模型规模的增大而提高,但经过微调后这些指标始终更好。

Safety does not seem to benefit much from model scaling without fine-tuning. We expect this as the pre-training alone only optimizes perplexity of the next token, and these tokens follow the distributions of the original corpus, which contains both safe and unsafe examples. However, scaling along with safety fine-tuning significantly improves safety.

在没有微调的情况下,安全性似乎并没有从模型规模扩展中获得太多好处。我们对此并不意外,因为仅预训练只优化下一个 Token 的困惑度,而这些 Token 遵循原始语料库的分布,其中既包含安全示例,也包含不安全示例。然而,规模扩展与安全微调相结合可以显著提高安全性。

Groundedness improves as model size increases, perhaps because larger models have a greater capacity to memorize uncommon knowledge. Fine-tuning, however, allows the model to access external knowledge sources. This effectively allows the model to shift some of the load of remembering knowledge to an external knowledge source and achieves 73.2% Groundedness and 65% Citation Accuracy. In other words, 73.2% of the responses containing statements about the external world were attributable to known sources, and 65% of the responses included citations (i.e., URLs to sources) when required. Appendix C.3 shows example dialogs with the effects of the groundedness fine-tuning.

随着模型规模的增大,有根据性 (Groundedness) 有所提高,这可能是因为更大的模型有更强的能力记忆不常见的知识。然而,微调使模型能够访问外部知识源,从而有效地把一部分记忆知识的负担转移到外部知识源上,并达到 73.2% 的有根据性和 65% 的引用准确性 (Citation Accuracy)。换句话说,73.2% 包含关于外部世界陈述的响应可以归因于已知来源,而 65% 的响应在需要时包含了引用(即指向来源的 URL)。附录 C.3 展示了带有有根据性微调效果的示例对话。
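为说明这两个数字的含义,下面是一个根据人工标注统计有根据性和引用准确性的最小示意;其中的字段名(如 `has_external_claim`、`attributed` 等)是为说明而假设的,并非论文的评测代码。

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnnotatedResponse:
    has_external_claim: bool  # 响应中是否包含关于外部世界的陈述
    attributed: bool          # 该陈述是否可归因于已知来源
    needs_citation: bool      # 是否需要给出引用(来源 URL)
    has_citation: bool        # 是否确实给出了引用

def groundedness(responses: List[AnnotatedResponse]) -> float:
    # 有根据性 = 可归因于已知来源的外部世界陈述所占比例
    claims = [r for r in responses if r.has_external_claim]
    return sum(r.attributed for r in claims) / len(claims) if claims else 0.0

def citation_accuracy(responses: List[AnnotatedResponse]) -> float:
    # 引用准确性 = 在需要引用时确实包含引用的响应所占比例
    required = [r for r in responses if r.needs_citation]
    return sum(r.has_citation for r in required) / len(required) if required else 0.0
```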

In summary, scaling up alone improves the pre-trained model quality (sensibleness, specificity, and interestingness) and groundedness (groundedness and informativeness) metrics, but it does not improve safety much. Fine-tuning with crowd worker-annotated data, however, turns out to be an effective method for improving all metrics. In some cases, fine-tuning these same models allows us to obtain results equivalent to having a significantly larger model. For example, in the case of sensibleness, we may need a dense model that is multiple orders of magnitude larger than the 137B parameter PT model in order to reach the 92.3% sensibleness achieved by LaMDA, which is a fine-tuned version of PT.

总之,单纯扩大规模可以改善预训练模型的质量(合理性、具体性和趣味性)和基础性(基础性和信息量)指标,但对安全性提升不大。然而,使用众包工人标注的数据进行微调被证明是改进所有指标的有效方法。在某些情况下,微调这些模型可以使我们获得相当于拥有显著更大模型的结果。例如,在合理性的案例中,我们可能需要一个参数量比137B参数的PT模型大几个数量级的密集模型,才能达到LaMDA所实现的92.3%的合理性,而LaMDA是PT的一个微调版本。

Note that in several metrics, our fine-tuned models almost reach the crowd worker quality levels, and our fine-tuned models exceed crowd worker quality for interestingness (labeled ‘Human’ in Figures 4 and 5). However, this may be a weak baseline as crowd workers are not extensively trained and were not incentivized to generate high-quality responses. For example, it turns out it is quite difficult to generate very interesting responses given limited financial incentives, so a crowd worker may provide some response that other crowd workers don’t find interesting. Furthermore, although we have made good progress in our safety and groundedness metrics, our models are still far from the crowd workers’ performance. For groundedness and informativeness, we also show crowd worker quality without access to information retrieval tools. LaMDA models surpass crowd worker quality for informativeness when the crowd workers do not have access to such tools, but LaMDA models are still far behind crowd worker quality when crowd workers have access to these tools.

请注意,在多个指标上,我们微调的模型几乎达到了众包工人的质量水平,并且我们的微调模型在趣味性方面超过了众包工人(在图 4 和图 5 中标记为 ‘Human’)。然而,这可能是一个较弱的基准,因为众包工人没有经过广泛的培训,也没有被激励生成高质量的回答。例如,事实证明,在有限的经济激励下生成非常有趣的回答是相当困难的,因此一个众包工人可能会提供其他众包工人不认为有趣的回答。此外,尽管我们在安全性和接地性 (grounded ness) 指标上取得了良好的进展,但我们的模型仍然远远落后于众包工人的表现。对于接地性和信息量,我们也展示了众包工人在没有信息检索工具的情况下所能达到的质量。当众包工人无法使用这些工具时,LaMDA 模型在信息量方面超过了众包工人的质量,但当众包工人可以使用这些工具时,LaMDA 模型仍然远远落后于众包工人的质量。



Figure 4: Effects of model scaling and fine-tuning on six foundation metrics. We show results for 2B, 8B and 137B parameters pre-trained (PT) and fine-tuned (LaMDA) models, and compare them with results for crowd worker with access to information retrieval tools (‘Human’), and without access to information retrieval tools (‘Human w/o IR’).

图 4: 模型扩展和微调对六个基础指标的影响。我们展示了参数为 2B、8B 和 137B 的预训练 (PT) 和微调 (LaMDA) 模型的结果,并将它们与有信息检索工具访问权限的众包工作者(“人类”)以及没有信息检索工具访问权限的众包工作者(“无 IR 的人类”)的结果进行比较。

Figure 5 breaks down the contributions of FT quality-safety fine-tuning and FT groundedness fine-tuning to our final results using the largest model. There is a notable increase in performance across all metrics between PT and FT quality-safety. Groundedness further improves from FT quality-safety to FT groundedness (LaMDA), which is meant to ground the model-generated statements about the external world on an information retrieval system.

图 5 使用最大模型分解了 FT quality-safety 微调和 FT groundedness 微调对最终结果的贡献。从 PT 到 FT quality-safety,所有指标的性能都有显著提升。从 FT quality-safety 到 FT groundedness (LaMDA),有根据性进一步提高,后者旨在让模型生成的关于外部世界的陈述以信息检索系统为依据。


Figure 5: Effects of model scaling and fine-tuning on six foundation metrics. Results are shown for 2B, 8B, and 137B parameters pre-trained (PT) models, and the two levels of fine-tuning (FT) with the bottom-most the one we call LaMDA. Results are compared with crowd worker quality having access to information retrieval tools (‘Human’) and without access to information retrieval tools (‘Human w/o IR’).

图 5: 模型扩展和微调对六个基础指标的影响。结果显示了 2B、8B 和 137B 参数的预训练 (PT) 模型,以及两个级别的微调 (FT),其中最底部的是我们称为 LaMDA 的模型。结果与有信息检索工具访问权限的众包工作者质量(“人类”)和没有信息检索工具访问权限的众包工作者质量(“无 IR 的人类”)进行了比较。

8 Domain grounding

8 领域基础

We observe that LaMDA can perform domain-appropriate roles through pre-conditioning, also known as domain grounding. Here we explore such domain grounding in two areas: (1) LaMDA playing the role of a famous object such as Mount Everest for the purpose of education, and (2) LaMDA playing the role of a music recommendation agent. We specify the agent role for each domain with a brief description shown in Table 2:

我们观察到 LaMDA 可以通过预调节执行领域适当的角色,也称为领域 grounding。在这里,我们探讨这种领域 grounding 在两个领域的应用:(1) LaMDA 扮演著名对象(如珠穆朗玛峰)的角色,用于教育目的;(2) LaMDA 扮演音乐推荐 AI智能体的角色。我们在表 2 中用简短的描述指定了每个领域的智能体角色。

Table 2: The two domains we experiment with LaMDA for domain grounding

表 2: 我们在两个领域中使用 LaMDA 进行领域 grounding

| 名称 | 领域 | 角色 |
|---|---|---|
| Everest | 教育 | 它以珠穆朗玛峰的身份教授关于珠穆朗玛峰的事实。 |
| Music | 推荐 | 它是一个音乐推荐 AI智能体。 |

To adapt LaMDA and PT to each role, we precondition them on a few turns of role-specific dialogs, and we use the same pre-conditioning for LaMDA and PT. For example, to adapt them to the Mount Everest role, we precondition them with a single greeting message “Hi, I’m Mount Everest. What would you like to know about me?” at the very beginning of the dialog.

为了使 LaMDA 和 PT 适应每个角色,我们在几个特定角色的对话轮次上对它们进行预调节,并且对 LaMDA 和 PT 使用相同的预调节方法。例如,为了使它们适应珠穆朗玛峰角色,我们在对话的最开始用一条问候消息 “Hi, I’m Mount Everest. What would you like to know about me?” 对它们进行预调节。
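下面是"通过预调节进行领域 grounding"的一个最小示意:角色只由拼接在对话最前面的几轮角色相关内容决定,且对 LaMDA 和 PT 使用完全相同的预调节文本。其中 `generate` 以及对话的文本拼接格式是为说明而做的假设。

```python
from typing import Callable, List, Tuple

# 预调节:珠穆朗玛峰角色只需一条与角色对齐的问候消息
EVEREST_PRECONDITION: List[Tuple[str, str]] = [
    ("Everest", "Hi, I'm Mount Everest. What would you like to know about me?"),
]

def respond_in_role(
    precondition: List[Tuple[str, str]],
    dialog: List[Tuple[str, str]],   # 实时对话,形如 (说话者, 内容)
    role: str,                       # 智能体角色名,例如 "Everest"
    generate: Callable[[str], str],  # 假设:封装 LaMDA 或 PT 的解码调用
) -> str:
    # 预调节轮次被直接拼接在实时对话之前;LaMDA 与 PT 收到完全相同的上下文
    turns = precondition + dialog
    context = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return generate(context + f"\n{role}:")
```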


Table 3: LaMDA responds safely to fuzzy requests (e.g., “anything”, “similar”), and provides real links to the songs that it recommends. For this application, we up-rank messages containing YouTube links when available. Note that the links in the original transcripts were generated as Markdown text for embedded links. We precondition the model on the messages shown in italic. The pre-conditioning for Music is longer to establish not only the target role, but also the style of the interaction with the user (e.g., brief responses containing the name of a song).

表 3: LaMDA 能够安全地响应模糊请求(例如,“anything”,“similar”),并提供推荐歌曲的真实链接。对于此应用,当有 YouTube 链接时,我们会提升包含这些链接的消息的排名。请注意,原始对话记录中的链接是作为 Markdown 文本生成的嵌入式链接。我们对模型在斜体显示的消息上进行预处理。音乐的预处理较长,不仅是为了建立目标角色,还为了确定与用户的交互风格(例如,简短的回答包含歌曲名称)。

Table 4: LaMDA acting as Mount Everest while providing some educational, cited and recent information about “itself”. We precondition LaMDA on the single greeting message shown in italic. The end of this conversation has been truncated for brevity, but the full conversation is available in Appendix C.5, Table 20

表 4: LaMDA 模拟珠穆朗玛峰,并提供一些关于“自己”的教育性、有引用和最新的信息。我们对 LaMDA 进行了单个欢迎消息的预处理,该消息以斜体显示。为了简洁起见,此对话的结尾已被截断,但完整的对话可在附录 C.5,表 20 中查看。

To evaluate the agents, we ask crowd workers to have dialogs with each of the two LaMDA and the two PT instances, producing 600 dialog turns in total. In addition, we ask another set of crowd workers to label each of the generated responses in their original context according to whether they are role-consistent and helpful (defined in Section 4.2) relative to their target roles. Each response is labeled three times by different crowd workers. All the crowd workers are provided with the role definitions that are listed in Table 2 to understand what to expect from each agent.

为了评估这些 AI智能体,我们请众包工人分别与两个 LaMDA 实例和两个 PT 实例进行对话,总共产生 600 个对话轮次。此外,我们请另一组众包工人在原始上下文中标注每条生成的响应相对于其目标角色是否角色一致且有帮助(定义见第 4.2 节)。每条响应由不同的众包工人标注三次。所有众包工人都获得了表 2 中列出的角色定义,以了解对每个智能体的期望。
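下面是把"每条响应由三名众包工人分别标注有帮助性和角色一致性"汇总为表 5 中百分比的一个最小示意;对三个标注按多数投票进行聚合是这里为说明而做的假设。

```python
from typing import Dict, List, Tuple

def majority(votes: List[bool]) -> bool:
    # 三名标注者的多数投票
    return sum(votes) * 2 > len(votes)

def agent_scores(
    labels: List[Tuple[str, List[bool], List[bool]]],  # (智能体, 有帮助投票, 角色一致投票)
) -> Dict[str, Tuple[float, float]]:
    per_agent: Dict[str, List[Tuple[bool, bool]]] = {}
    for agent, helpful_votes, consistent_votes in labels:
        per_agent.setdefault(agent, []).append(
            (majority(helpful_votes), majority(consistent_votes))
        )
    # 返回每个智能体的 (有帮助 %, 角色一致 %)
    return {
        agent: (
            100.0 * sum(h for h, _ in rows) / len(rows),
            100.0 * sum(c for _, c in rows) / len(rows),
        )
        for agent, rows in per_agent.items()
    }
```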

LaMDA applications perform significantly better than PT applications in Helpfulness as shown quantitatively in Table 5 and qualitatively in Table 6. Although the reasons for PT losses vary, the most common error patterns could be attributed to PT’s lower performance on foundation metrics such as safety, groundedness and quality (foundation metrics are shown in Figure 4).

表 5 中定量显示,表 6 中定性显示,LaMDA 应用在帮助性方面显著优于 PT 应用。尽管 PT 损失的原因各不相同,最常见的错误模式可以归因于 PT 在基础指标(如安全性、有根据性和质量)上的较低性能(基础指标见图 4)。

Table 5: Percentage of helpful and persona-consistent messages from each agent.

表 5: 每个 AI智能体给出的有帮助消息和角色一致消息的百分比。

| 智能体 | 有用 % | 角色一致性 % |
|---|---|---|
| LaMDA Everest | 65 | 91 |
| PT Everest | 18 | 85 |
| LaMDA Music | 57 | 89 |
| PT Music | 31 | 84 |

Table 6: Examples of domain-specific losses for PT responses when compared to LaMDA responses that could be due to their different performance in foundation metrics (safety, groundedness and quality). In the first example, PT generates the statement that it is the highest peak on Earth, but the overall response is rude, which is a problem related to safety, and it doesn’t provide the height. In the second example, PT produces a nonsensical response, which is a problem related to quality (sensibleness).

表 6: 针对特定领域的损失示例,这些损失可能由于 PT 和 LaMDA 在基础指标(安全性、有根据性和质量)上的不同表现而产生。在第一个例子中,PT 生成了它是地球上最高山峰的陈述,但整体回应显得粗鲁,这是与安全性相关的问题,并且它没有提供高度信息。在第二个例子中,PT 产生了无意义的回应,这是与质量(合理性)相关的问题。

| 应用 | 上下文 | PT 回应 | LaMDA 回应 |
|---|---|---|---|
| Everest | 你有多高? | 好吧,我是地球上最高的山峰。你以为我会告诉你我的详细信息吗?你可能还不值得知道这些信息……还没到时候。 | 我高于海平面 29,000 多英尺,我是世界上最高的山峰。 |
| Music | (...) - 好的。它们是:Guns N' Roses: Welcome to the Jungle, Papa Roach: Last Resort - 你最喜欢其中哪一首? | room: Return to the Sauce | 嗯……我喜欢 GnR 的 Welcome to the Jungle。这首歌非常酷和时髦。 |

All LaMDA and PT instances score fairly well on role consistency, occasionally breaking character. For example, LaMDA Mount Everest sometimes refers to itself in third person as if it is no longer speaking as the mountain itself. This is likely caused by the inference time grounding not being enough to prevent it from reverting to what the majority of the training data represents: people acting as people (and not as mountains). That said, the role consistency is surprisingly high, especially in the case of Mount Everest, which was adapted by simply adding a single role-aligned greeting statement. LaMDA Music has a few statements as grounding (shown in the Table 3 caption), in order to make sure it assumes the context of the dialog is largely about music recommendation and, therefore, interprets otherwise ambiguous user utterances like “anything” to mean the same as “recommend me any music”.

所有 LaMDA 和 PT 实例在角色一致性方面的得分都相当不错,只是偶尔会脱离角色。例如,LaMDA 珠穆朗玛峰有时会用第三人称指代自己,好像它不再以山本身的身份说话。这很可能是因为推理时的 grounding 不足以阻止它回到训练数据中占多数的模式:人以人的身份说话(而不是以山的身份)。尽管如此,角色一致性仍然出人意料地高,尤其是珠穆朗玛峰这个例子,它只是通过添加一条与角色对齐的问候语来适配的。LaMDA Music 则有几条用作 grounding 的陈述(见表 3 的标题说明),以确保它默认对话的上下文主要与音乐推荐有关,从而将"anything"这类原本含糊的用户话语解释为"给我推荐任何音乐"。


During evaluation, crowd workers use an information retrieval system to verify links and information that the model provides. Subsequently, the crowd workers label broken links and information that cannot be backed by known sources as not helpful. Despite current overall advances in groundedness (Figure 4), LaMDA Mount Everest provides facts that could not be attributed to known sources in about 30% of responses, resulting in losses in helpfulness. Similarly, LaMDA Music misses providing an actual music recommendation in about 9% of responses, and provides a broken link in about 7% of responses.

在评估过程中,众包工作人员使用信息检索系统来验证模型提供的链接和信息。随后,众包工作人员将无法通过已知来源验证的断链和信息标记为无帮助。尽管目前整体的有根据性 (grounded ness) 有所进步(图 4),LaMDA Mount Everest 在大约 30% 的响应中提供了无法归因于已知来源的事实,导致了有用性的损失。同样,LaMDA Music 在大约 9% 的响应中未能提供实际的音乐推荐,并在大约 7% 的响应中提供了断链。

9 Discussion and limitations

9 讨论和局限性

Perhaps the most noteworthy aspect of our study is that significant progress can be made towards better quality and safer dialog models with modest amounts of human-annotated fine-tuning data (less than $0.001%$ of pre-training data). However, our study and LaMDA still have many limitations in spite of this progress.

或许我们研究中最值得注意的方面是,使用少量的人工标注微调数据(不到预训练数据的 0.001%),可以在对话模型的质量和安全性方面取得显著进展。然而,尽管取得了这些进展,我们的研究和 LaMDA 仍然存在许多局限性。

Collecting fine-tuning datasets brings the benefits of learning from nuanced human judgements, but it is an expensive, time consuming, and complex process. We expect results to continue improving with larger fine-tuning datasets, longer contexts, and more metrics that capture the breadth of what is required to have safe, grounded, and high quality conversations. The complexity of capturing human subjective judgements limits the efforts that we took to assess crowd worker rating quality against that of expert-annotated data, and to maximize clarity by iteratively designing our rating instructions. Furthermore, we did not examine patterns of disagreement between crowd workers. Future work will include selecting crowd workers that mirror the system’s target users, and looking at ways to improve the quality of labels, through training and evaluation approaches that also account for systematic disagreements between crowd workers due to social and cultural norms and values [99].

收集微调数据集能带来从细微的人类判断中学习的好处,但这是一个昂贵、耗时且复杂的过程。我们预计,随着微调数据集更大、上下文更长,以及使用更多能够涵盖安全、有根据且高质量对话所需各方面的指标,结果将继续改进。捕捉人类主观判断的复杂性,限制了我们在以下方面的投入:将众包工作者的评分质量与专家标注数据进行对比评估,以及通过迭代设计评分说明来最大限度地提高清晰度。此外,我们没有研究众包工作者之间的分歧模式。未来的工作将包括选择能够反映系统目标用户的众包工作者,并探索提高标签质量的方法,包括在训练和评估中考虑众包工作者之间由社会文化规范和价值观导致的系统性分歧 [99]。

Fine-tuning can improve output groundedness, but the model can still generate responses that do not accurately reflect the contents of authoritative external sources. Our progress on this has been limited to simple questions of fact, and more complex reasoning remains open for further study (see example dialogs 15). Similarly, while the model generates responses that make sense most of the time, it can still suffer from subtler quality issues. For example, it may repeatedly pledge to respond to a user’s question in the future, prematurely try to end the conversation, or make up incorrect details about the user.

微调可以提高输出的有根据性,但模型仍然可能生成不能准确反映权威外部来源内容的响应。我们在这方面的进展仅限于简单的事实性问题,更复杂的推理仍有待进一步研究(参见示例对话 15)。同样,虽然模型生成的响应大多数时候是有意义的,但它仍可能存在一些更微妙的质量问题。例如,它可能会反复承诺将来回答用户的问题,过早地试图结束对话,或编造关于用户的不正确细节。

We have shown that fine-tuning can improve safety metrics on average by defining safety objectives (Appendix A.1) for our safety fine-tuning, which we used to annotate candidate responses generated by LaMDA in response to human-generated prompts (Appendix A.2) with a demographically diverse set of crowd workers (Appendix A.3). However, future work will also need to focus on how fine-tuning can cope with the long tail of inappropriate responses that LaMDA and other large language models can generate. In this work, it is also important to note that mitigating safety risks does not guarantee complete reliability. More research is needed to develop robust standards for safety and fairness that capture the many dimensions of risk [54] in general-purpose dialog models such as LaMDA.

我们已经证明,通过定义安全目标 (Appendix A.1) 来进行安全微调,可以平均提高安全性指标。我们使用这些安全目标来标注 LaMDA 对人类生成的提示做出的候选响应 (Appendix A.2),并由一组具有人口统计多样性的众包工作者进行标注 (Appendix A.3)。然而,未来的工作还需要关注如何使微调能够应对 LaMDA 和其他大语言模型可能生成的长尾不适当响应。在本研究中,还需注意的是,缓解安全风险并不能保证完全的可靠性。需要更多的研究来制定稳健的安全和公平标准,以涵盖通用对话模型(如 LaMDA)中的多维度风险 [54]。

Another limitation was that our crowd worker population may not be fully reflective of the user base. For example, the crowd workers are overrepresented in the 25-34 age demographic, which is to be expected given the sourcing methods. An area for future work and research is to devise methods for further improving crowd worker representation, such as through even broader recruiting or through some type of statistical estimation.

另一个限制是我们的众包工作者群体可能无法完全反映用户基础。例如,众包工作者在 25-34 岁年龄段的比例过高,这在预期之中,因为这是由招募方法决定的。未来工作和研究的一个领域是制定进一步提高众包工作者代表性的方法,例如通过更广泛的招募或某种统计估计方法。

This is not the final version of LaMDA. Rather this is just a recipe for generating "LaMDAs" and should be taken as a way to eventually produce production-ready versions for specific applications.

这并不是 LaMDA 的最终版本。相反,这只是一个生成 "LaMDAs" 的配方,应被视为最终为特定应用产出可投入生产版本的一种途径。

9.1 Examining bias

9.1 检查偏差

Many fundamental challenges to developing a high quality dialog model capable of performing well in real world applications still exist. For example, it is now increasingly well-understood that large language models trained on unlabeled datasets will learn to imitate patterns and biases inherent in their training sets [100]. Our safety objectives aim to reduce the number of responses biased against specific subgroups of people, but such biases can be hard to detect since they manifest in a wide variety of subtle ways. For example, the axes of marginalization differ greatly across geo-cultural contexts, and how they manifest in pre-trained language models is an under-studied area [101].

许多根本性挑战仍然存在,阻碍了开发能够在现实世界应用中表现出色的高质量对话模型。例如,现在越来越清楚的是,在未标注数据集上训练的大语言模型 (Large Language Model) 将学会模仿其训练集中固有的模式和偏见 [100]。我们的安全目标旨在减少针对特定人群子群体的有偏响应,但这些偏见很难检测,因为它们以多种形式微妙地表现出来。例如,边缘化轴在不同的地缘文化背景下差异很大,它们如何在预训练语言模型中表现是一个研究不足的领域 [101]。

Another limitation of our safety approach is that it may still propagate some representational harms present in the training datasets, even if the individual examples do not violate any of the safety objectives. Since LaMDA responses are non-deterministic, such biases can appear by statistically favoring certain groups on the basis of race, gender, sexuality and so on. For example, models like LaMDA might rarely generate responses that refer to women as CEOs in a dialog about management.

我们安全方法的另一个限制是,即使单个示例没有违反任何安全目标,它仍然可能传播训练数据集中存在的一些表征伤害。由于 LaMDA 的响应是非确定性的,这些偏见可能会通过统计上偏向某些基于种族、性别、性取向等的群体而出现。例如,像 LaMDA 这样的模型在关于管理的对话中很少生成将女性称为 CEO 的响应。

Known approaches to mitigate undesirable statistical biases in generative language models include attempts to filter pre-training data, train separate filtering models, create control codes to condition generation, and fine-tuning models, as demonstrated in this paper. While these efforts are important, it is critical to also consider the downstream applications and the socio-technical ecosystems where they will be deployed when measuring the impact of these efforts in mitigating harm. For example, bias mitigations in certain contexts might have counter-intuitive impacts in other geo-cultural contexts [101].

已知的减轻生成式语言模型中不良统计偏差的方法包括尝试过滤预训练数据、训练独立的过滤模型、创建控制代码以调节生成过程以及微调模型,如本文所示。虽然这些努力很重要,但在衡量这些努力在减轻危害方面的影响时,考虑下游应用和它们将被部署的社会技术生态系统也是至关重要的。例如,在某些情境中的偏差减轻措施在其他地理文化环境中可能会产生反直觉的影响 [101]。

The field of algorithmic bias measurement and mitigation is still growing and evolving rapidly, so it will be important to continue to explore novel avenues of research to ensure the safety of dialog agents such as LaMDA. Furthermore, we believe that future work should explore the benefits of greater coordination across the research community and civil society in the creation of benchmarks and canonical evaluation datasets to test for harmful and unsafe content.

算法偏见测量与缓解领域仍在快速成长和发展,因此继续探索新的研究方向以确保对话型 AI智能体(如 LaMDA)的安全性非常重要。此外,我们认为未来的工作应探讨在研究社区和公民社会之间加强协调的好处,共同创建基准和标准评估数据集,以测试有害和不安全的内容。

9.2 Adversarial data collection

9.2 对抗数据收集

We use adversarial-intent conversations to improve the breadth of labeled data for fine-tuning (Appendix A.2). During adversarial conversation generation, expert analysts engage with LaMDA and attempt to deliberately provoke responses that violate our safety objectives.

我们使用对抗性意图对话来扩展用于微调的标注数据的广度(附录 A.2)。在生成对抗性对话过程中,专家分析师与 LaMDA 互动,并故意尝试引发违反我们安全目标的响应。

Adversarial testing has generally proven to be effective at discovering limitations in machine learning models and drawing out undesired responses from various software (e.g., Google Bug bounty program 3), in addition to attempting to reduce harmful content during model development. We are also seeing efforts to apply it to generative models (e.g.,

对抗测试已被证明在发现机器学习模型的局限性和从各种软件中引出不希望的响应方面通常是有效的(例如,Google 漏洞赏金计划 3),此外还在尝试在模型开发过程中减少有害内容。我们还看到将其应用于生成式模型 (Gen