The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Shayne Longpre∗ Le Hou Tu Vu Albert Webson Hyung Won Chung Yi Tay Denny Zhou Quoc V. Le Barret Zoph Jason Wei Adam Roberts
Google Research
Abstract
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 models (Chung et al., 2022). Through careful ablation studies on the Flan Collection of instruction tuning tasks and methods, we tease apart the effect of design decisions that enable Flan-T5 to outperform prior work by 3–17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning; in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.¹

Figure 1: Comparing public instruction tuning collections on Held-In, Held-Out (BIG-Bench Hard (Suzgun et al., 2022) and MMLU (Hendrycks et al., 2020)), and Chain-of-Thought evaluation suites, detailed in Appendix A.3. All models except OPT-IML-Max (175B) are T5-XL with 3B parameters. Green text indicates absolute improvement over the next best comparable T5-XL (3B) model.
1 Introduction
Large language models such as PaLM (Chowdhery et al., 2022), Chinchilla (Hoffmann et al., 2022), and ChatGPT among others (Brown et al., 2020; Ouyang et al., 2022) have unlocked new capabilities in performing natural language processing (NLP) tasks from reading instructive prompts. Prior art has shown that instruction tuning—finetuning language models on a collection of NLP tasks formatted with instructions— further enhances the ability of language models to perform an unseen task from an instruction (Wei et al., 2021; Sanh et al., 2021; Min et al., 2022).
In this work, we evaluate the methods and results of open-sourced instruction generalization efforts, comparing their finetuning techniques and methods. In particular, we identify and evaluate the critical methodological improvements in the “Flan 2022 Collection”, the term we use for the collection of data and methods for data augmentation and instruction tuning, first implemented and used in Chung et al. (2022). Where Chung et al. (2022) focuses on the emergent and state-of-the-art results of combining Flan 2022 with PaLM 540B, this work focuses on the details of the instruction tuning methods themselves, ablating individual factors, and comparing them directly to prior work by keeping the pretrained model size and checkpoint consistent.
The Flan 2022 Collection offers the most extensive publicly available set of tasks and methods for instruction tuning, which we have compiled in one place. We have also supplemented this with hundreds more of our own high-quality templates, richer formatting patterns, and data augmentations. We show that a model trained on this collection outperforms other public collections on all tested evaluation benchmarks, including the original Flan 2021 (Wei et al., 2021), T0++ (Sanh et al., 2021), Super-Natural Instructions (Wang et al., 2022c), and the concurrent work on OPT-IML (Iyer et al., 2022). As shown in Figure 1, this includes 4.2%+ and 8.5% improvements on the MMLU (Hendrycks et al., 2020) and BIG-Bench Hard (Suzgun et al., 2022) evaluation benchmarks respectively, for equally sized models.
Analysis of the Flan 2022 method suggests the strong results stem both from the larger and more diverse set of tasks and from a set of simple finetuning and data augmentation techniques. In particular, training on a mix of examples templatized with zero-shot, few-shot, and chain-of-thought prompts improves performance in every one of these settings, together. For instance, adding just 10% few-shot prompts improves zero-shot prompting results by 2%+. Additionally, enriching task diversity by inverting input-output pairs, as used in Sanh et al. (2021) and Min et al. (2022), along with balancing task sources, are both shown to be critical to performance. The resulting Flan-T5 model converges faster and at a higher performance than T5 models in single-task finetuning, suggesting instruction-tuned models offer a more computationally-efficient starting checkpoint for downstream applications, corroborating Aribandi et al. (2021) and Liu et al. (2022b).
We hope making these findings and resources publicly available will unify resources around instruction tuning and accelerate research into more general-purpose language models. We summarize this work’s core contributions as follows:
2 Public Instruction Tuning Collections
| Release | Collection | Model | Base | Size |
|---|---|---|---|---|
| 2020-05 | UnifiedQA | UnifiedQA | RoBERTa | 110–340M |
| 2021-04 | CrossFit | BART-CrossFit | BART | 140M |
| 2021-04 | Natural Inst. v1.0 | Gen. BART | BART | 140M |
| 2021-09 | Flan 2021 | Flan-LaMDA | LaMDA | 137B |
| 2021-10 | P3 | T0, T0+, T0++ | T5-LM | 3–11B |
| 2021-10 | MetaICL | MetaICL | GPT-2 | 770M |
| 2021-11 | ExMix | ExT5 | T5 | 220M–11B |
| 2022-04 | Super-Natural Inst. | Tk-Instruct | T5-LM, mT5 | 11–13B |
| 2022-10 | GLM | GLM-130B | GLM | 130B |
| 2022-11 | xP3 | BLOOMz, mT0 | BLOOM, mT5 | 13–176B |
| 2022-12 | Unnatural Inst.† | T5-LM-Unnat. Inst. | T5-LM | 11B |
| 2022-12 | Self-Instruct† | GPT-3 Self-Inst. | GPT-3 | 175B |
| 2022-12 | OPT-IML Bench† | OPT-IML | OPT | 30–175B |
| 2022-10 | Flan 2022 (ours) | Flan-T5, Flan-PaLM | T5-LM, PaLM | 10M–540B |
Figure 2: A Timeline of Public Instruction Tuning Collections specifies the collection release date, detailed information on the finetuned models (the base model, their size, and whether the model itself is Public (P) or Not Public (NP)), what prompt specification they were trained for (zero-shot, few-shot, or Chain-of-Thought), the number of tasks contained in the Flan 2022 Collection (released with this work), and core methodological contributions in each work.
Large Language Models Instruction tuning has emerged as a tool to make large language models (LLMs) and their abilities more useful for interactive dialog and functional tasks. Previous work (Raffel et al., 2020; Liu et al., 2019; Aghajanyan et al., 2021; Aribandi et al., 2021) experimented with large scale multi-task finetuning, to improve downstream single target finetuning, but without instruction prompts. UnifiedQA and others (Khashabi et al., 2020; McCann et al., 2018; Keskar et al., 2019) unified a wide range of NLP tasks into a single generative question answering format, using prompt instructions for multi-task finetuning and evaluation.
The First Wave Since 2020, several instruction tuning task collections have been released in rapid succession, outlined in Figure 2. Natural Instructions (Mishra et al., 2021), Flan 2021 (Wei et al., 2021), and P3 (the Public Pool of Prompts; Bach et al., 2022) aggregated large NLP task collections and templatized them with instructions (zero-shot prompting), specifically for finetuning models to generalize to unseen instructions. MetaICL (Min et al., 2022) also consolidated other task collections (Ye et al., 2021; Khashabi et al., 2020) to train models to learn tasks “in-context” from several input-output examples, known as few-shot prompting, but in this case without instructions. Each of these works affirmed the scaling benefits of task and template diversity, and some reported strong benefits from inverting the inputs and outputs in templates to produce new tasks (the “noisy channel” of Min et al., 2022).
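To make the notion of templatization concrete, the sketch below renders a supervised example with a randomly chosen instruction template. The `TEMPLATES` list and `templatize` helper are hypothetical stand-ins for the hundreds of templates these collections define per task:

```python
import random

# Hypothetical instruction templates for a sentiment task; real collections
# (Flan 2021, P3) define many such templates per dataset.
TEMPLATES = [
    "Review: {text}\nIs this review positive or negative?",
    "Does the following text express a positive or negative sentiment?\n{text}",
    "{text}\nBased on this review, the sentiment is:",
]

def templatize(example: dict) -> dict:
    """Render one (text, label) example with a randomly chosen template."""
    template = random.choice(TEMPLATES)
    return {"input": template.format(text=example["text"]),
            "target": example["label"]}

rendered = templatize({"text": "A delightful, moving film.", "label": "positive"})
```

Inverting the roles of `input` and `target` in such templates is what later sections call input inversion.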
The Second Wave A second wave of instruction tuning collections expanded prior resources: combining more datasets and tasks into one resource, like Super-Natural Instructions (Wang et al., 2022c) or OPT-IML (Iyer et al., 2022), adding multilingual instruction tuning in xP3 (Muennighoff et al., 2022), and Chain-of-Thought training prompts in Flan 2022 (Chung et al., 2022). Both the Flan Collection and OPT-IML contain most tasks represented in prior collections.² Our work is positioned here, coalescing most of these collections (of collections) and their methods, as the strongest starting point for future open source work.
New Directions Concurrent and future work is beginning to explore two new directions: (a) expanding task diversity even more aggressively with synthetic data generation, particularly in creative, and open-ended dialogue (Wang et al., 2022b; Honovich et al., 2022; Ye et al., 2022; Gupta et al., 2022), and (b) offering human feedback signals on model responses (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Nakano et al., 2021; Bai et al., 2022b). We view most of these new directions as likely additive to a foundation of instruction tuning methods.
Tuning with Human Feedback Instruction tuning on human feedback has demonstrated strong results on open-ended tasks, but at the expense of performance on a wide array of more traditional NLP tasks (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Nakano et al., 2021). (See Ouyang et al. (2022)’s discussion of the “alignment tax”.) Our work focuses specifically on instruction generalization, without human feedback, for two reasons. First, human feedback datasets are far less publicly available than instruction tuning datasets (and may be model-specific). Second, by itself, instruction generalization shows great promise in enhancing human preferred responses on open-ended tasks, as well as improving traditional NLP metrics (Chung et al., 2022). The extent of obtainable progress without expensive human response demonstrations or ratings remains an open question, and an important pursuit to narrow the gap between public and non-public research.
The Importance of Open Source High profile research is increasingly driven by non-public data, as in the case of GPT-3 and others (Ouyang et al., 2022; Glaese et al., 2022). The inaccessibility of these resources inhibits the research community’s ability to analyze and improve these methods in the public domain. We narrow our purview to open source and accessible data collections, motivated by the goal of democratizing accessibility to research.
3 Flan 2022 Instruction Tuning Experiments
Recent research has yet to coalesce around a unified set of techniques, with different tasks, model sizes, and target input formats all represented. We open source a new collection, first introduced in Chung et al. (2022), denoted “Flan 2022”, which combines Flan 2021, P3++³, and Super-Natural Instructions with some additional reasoning, dialog, and program synthesis datasets. We defer to Chung et al. (2022) for details of templatization and collection; in this work we take a deeper look at key methodological improvements and compare the collection on equivalent model sizes to existing collections.
In this section, we evaluate the design decisions in Flan and discuss four in particular that yield strong improvements to the instruction tuning recipe. These design components, outlined in Section 2, are: (I) using mixed zero-shot, few-shot, and Chain-of-Thought templates at training (Section 3.2), (II) scaling T5-sized models to 1800+ tasks (Section 3.3), (III) enriching tasks with input inversion (Section 3.4), and (IV) balancing these task mixtures (Section 3.5). In Section 3.1, we begin by measuring the value of each component and compare the final model against alternative instruction tuning collections (and their methods).
²Note that each work defines datasets, tasks, and task categories differently. For simplicity, we use their own definitions in Section 2. ³“P3++” is our notation for all datasets in the Public Pool of Prompts (P3): https://huggingface.co/datasets/bigscience/P3
Experimental Setup We finetune on the prefix language model adapted T5-LM (Lester et al., 2021), using the XL (3B) size for all models for consistency, unless otherwise stated. While other sizes of Flan-T5 are available, we felt XL was appropriately sized to run large-scale systematic ablations, while being sufficiently large to draw general conclusions. We evaluate on (a) a suite of 8 “Held-In” tasks represented within the 1800+ training task collection (4 question answering and 4 natural language inference validation sets), (b) Chain-of-Thought (CoT) tasks (5 validation sets), and (c) the MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022) benchmarks as our set of “Held-Out” tasks, as they are not included as part of Flan 2022 finetuning. The Massive Multitask Language Understanding benchmark (MMLU) broadly tests reasoning and knowledge capacity across 57 tasks in the sciences, social sciences, humanities, business, health, among other subjects. BIG-Bench Hard (BBH) includes 23 challenging tasks from BIG-Bench (Srivastava et al., 2022) where PaLM under-performs human raters. In our ablations, we also evaluate BBH with Chain-of-Thought inputs, following Chung et al. (2022). Additional finetuning and evaluation details are provided in Appendix A.
3.1 Ablation Studies
Table 1 summarizes the mean contribution to Held-In, Held-Out, and Chain-of-Thought tasks, obtained by individually removing methods: mixture weight balancing (“- Mixture Balancing”), Chain-of-Thought tasks (“- CoT”), mixed prompt settings (“- Few-Shot Templates”), and input inversion (“- Input Inversion”). Flan-T5 XL leverages all four of these methods together. We also finetune T5-XL-LM on other collections, including Flan 2021, P3++, and Super-Natural Instructions, for comparison.
| Model | HELD-IN | CoT | MMLU | BBH | BBH-CoT |
|---|---|---|---|---|---|
| T5-XL Flan 2022 | 73.8 / 74.8 | 35.8 / 34.1 | 50.3 / 52.4 | 26.2 / 39.3 | 33.9 / 35.2 |
| - CoT | 73.3 / 73.2 | 28.8 / 24.6 | 47.5 / 46.9 | 18.2 / 30.0 | 18.2 / 12.0 |
| - Input Inversion | 73.8 / 74.1 | 32.2 / 23.5 | 41.7 / 41.2 | 18.4 / 24.2 | 15.7 / 13.0 |
| - Mixture Balancing | 71.2 / 73.1 | 32.3 / 30.5 | 45.4 / 45.8 | 15.1 / 24.3 | 13.8 / 15.4 |
| - Few-Shot Templates | 72.5 / 62.2 | 38.9 / 28.6 | 47.3 / 38.7 | 27.6 / 30.8 | 18.6 / 23.3 |
| T5-XL Flan 2021 | 68.4 / 56.3 | 24.6 / 22.7 | 41.4 / 34.8 | 28.1 / 28.3 | 26.0 / 26.9 |
| T5-XL P3++ | 70.5 / 62.8 | 25.6 / 25.6 | 46.1 / 34.1 | 26.0 / 30.8 | 23.4 / 26.1 |
| T5-XL Super-Natural Inst. | 50.3 / 42.2 | 13.8 / 14.3 | 35.6 / 31.1 | 10.4 / 15.6 | 8.0 / 12.5 |
| GLM-130B† | | | - / 44.8 | | |
| OPT-IML-Max 30B† | | | 46.3 / 43.2 | - / 30.9 | |
| OPT-IML-Max 175B† | | | 49.1 / 47.1 | - / 35.7 | |
| Flan 2022 - Next Best T5-XL | +3.3 / +12.0 | +10.2 / +8.5 | +4.2 / +17.6 | -1.9 / +8.5 | +7.9 / +8.3 |
Table 1: Method Ablations (top) show the importance of each method for Flan-T5 XL. Collection Ablations (bottom) evaluate Flan-T5 XL against T5-XL finetuned on other instruction tuning collections: Flan 2021, P3++, and Super-Natural Instructions. Flan 2022 - Next Best T5-XL shows the improvement of Flan-T5 XL over the next best (comparably sized) T5-XL finetuned on another collection. Metrics are reported in both zero-shot / few-shot settings across Held-In, Chain-of-Thought, and Held-Out (MMLU, BBH) tasks. † We also include the results reported by OPT-IML (Iyer et al., 2022) and GLM-130B (Zeng et al., 2022).
Each of the ablated components of Flan contributes improvements to different metrics: Chain-of-Thought training to Chain-of-Thought evaluation, input inversion to Held-Out evaluations (MMLU and BBH), few-shot prompt training to few-shot evaluations, and mixture balancing to all metrics.
As compared to T5-XL models trained on alternative instruction tuning collections (and their methods), Flan outperforms in almost every setting. While previous collections are tuned specifically to zero-shot prompts, Flan-T5 XL is tuned for either zero- or few-shot prompts. This yields performance margins of +3–10% for most of the zero-shot settings, and margins of 8–17% for the few-shot settings. Most impressively, Flan 2022 outperforms OPT-IML-Max’s much larger (10x) 30B and (58x) 175B models. Next, we isolate some of Flan 2022’s ablated methods individually, to examine the benefits of each.
3.2 Training with Mixed Prompt Settings
Prior work has shown a wide variety of input templates per task can improve performance. However, separate from the wording of the instruction template, these prior LLMs mostly tune with template sets targeted to a single prompt setting: for zero-shot prompting (Wei et al., 2021; Sanh et al., 2021; Aghajanyan et al., 2021; Aribandi et al., 2021) or for few-shot prompting (Min et al., 2022; Wang et al., 2022c).
An underappreciated design decision in InstructGPT (Ouyang et al., 2022) was to mix training templates for each of these prompt settings, rather than target a single setting. However, since Ouyang et al. (2022) do not examine this choice, we expected a performance trade-off in finetuning for zero-shot or few-shot prompting performance, particularly for smaller models. Instead, we find training with mixed zero- and few-shot prompts significantly improves performance in both settings, most surprisingly even for models with only 3B parameters.

Figure 3: Training jointly with zero-shot and few-shot prompt templates improves performance on both Held-In and Held-Out tasks. The stars indicate the peak performance in each setting.

Figure 3 shows (1) adding as little as 5% few-shot training templates can dramatically improve zero-shot performance, and (2) adding 10%+ of zero-shot data improves few-shot performance too. Both Held-In and Held-Out tasks peak anywhere between 10–90% of few-shot data, but this range is consistently higher than training with only one prompt setting.
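The mixing procedure above can be sketched in a few lines of Python. The `build_mixed_prompts` helper and the example format are hypothetical illustrations of mixed-prompt training data, not the actual Flan templatization pipeline:

```python
import random

def format_zero_shot(ex):
    # Instruction followed directly by the target input (zero-shot prompt).
    return f"{ex['instruction']}\n{ex['input']}"

def format_few_shot(ex, exemplars):
    # Instruction, then input/output exemplars, then the target input (few-shot prompt).
    shots = "\n\n".join(f"{s['input']}\n{s['output']}" for s in exemplars)
    return f"{ex['instruction']}\n\n{shots}\n\n{ex['input']}"

def build_mixed_prompts(examples, few_shot_rate=0.1, num_shots=2, seed=0):
    """Render roughly `few_shot_rate` of the examples as few-shot prompts, the rest zero-shot."""
    rng = random.Random(seed)
    prompts = []
    for i, ex in enumerate(examples):
        if rng.random() < few_shot_rate:
            pool = examples[:i] + examples[i + 1:]  # exemplars drawn from other examples
            exemplars = rng.sample(pool, k=min(num_shots, len(pool)))
            prompts.append(format_few_shot(ex, exemplars))
        else:
            prompts.append(format_zero_shot(ex))
    return prompts

examples = [
    {"instruction": "Answer the question.", "input": f"Q{i}", "output": f"A{i}"}
    for i in range(10)
]
mixed = build_mixed_prompts(examples, few_shot_rate=0.5)
```

Setting `few_shot_rate` anywhere in the 0.1–0.9 range corresponds to the mixed regimes that Figure 3 shows outperform either pure setting.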
3.3 Scaling Small Models to 1.8k+ Tasks
The most recent and concurrent publicly available instruction tuning efforts, like Flan 2022, train on thousands of tasks (Wang et al., 2022c; Iyer et al., 2022), but operate on different task compositions and underlying training methods. To measure the impact of scaling model sizes and tasks for the Flan 2022 collection, we finetune T5-LM adapted models (Small, Base, Large, XL, XXL) on randomly selected task subsets (8, 25, 50, 100, 200, 400, 800, all 1873). Every finetuning run is guaranteed to include the Held-In tasks, so we can estimate how task scaling impacts the model's capacity to maintain performance on a given task it has already seen.
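The subset-selection constraint (random tasks, held-in tasks always included) might be sketched as follows; `sample_task_subset` and the task names are illustrative, not the actual experiment code:

```python
import random

HELD_IN = ["task_nli_1", "task_qa_1"]  # hypothetical held-in task names

def sample_task_subset(all_tasks, held_in, size, seed=0):
    """Randomly select `size` tasks, always keeping the held-in tasks in the subset."""
    rng = random.Random(seed)
    pool = [t for t in all_tasks if t not in set(held_in)]
    chosen = rng.sample(pool, k=size - len(held_in))
    return sorted(set(held_in) | set(chosen))

all_tasks = [f"task_{i}" for i in range(100)] + HELD_IN
subset = sample_task_subset(all_tasks, HELD_IN, size=25)
```

Varying `size` over the schedule (8, 25, 50, …) while keeping `held_in` fixed is what lets the ablation separate held-in retention from held-out generalization.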
Figure 4 demonstrates that both Held-In and Held-Out tasks appear to benefit from adding hundreds of finetuning tasks. Held-in task evaluations peak around 200 total tasks, and diminish in performance as more tasks are added, though larger models peak later and diminish less. Held-out task performance increases log-linearly with the number of tasks, achieving the highest performances with all 1836 tasks.

Figure 4: Performance Scaling Laws for the number of finetuning tasks and model sizes. Held-In performance (left) and Held-Out MMLU performance (right) are shown. The gold star indicates the peak performance for that model size.

Surprisingly, only T5-Small appears to peak in its Held-Out task performance before 1836 tasks, while larger model sizes continue to improve. These results suggest (a) even T5-Base may not have exhausted its capacity with thousands of tasks, and (b) the largest LMs could benefit from thousands more tasks for Held-In and Held-Out task performance.
One necessary assumption of this analysis is that all tasks are defined and counted equally. Section 3.5 demonstrates that not all task sources are equally beneficial to training, and that model performance may saturate from too many tasks from one source (e.g. Super-Natural Instructions). We caution against concluding that task scaling beyond 1800 tasks would translate to increased returns without also paying attention to task diversity and quality.
3.4 Task Enrichment with Input Inversion
Prior instruction tuning work has enriched their diversity of tasks by inverting the $(x, y)$ input-output pairs in supervised tasks, referred to as “prompts not intended for the original task” in P3 (Bach et al., 2022) or the “noisy channel” in MetaICL (Min et al., 2022). For example, a dataset may be originally designed to, given a question $x$, evaluate if a model can answer $y$. Input inversion instead gives a model the answer $y$ and trains it to generate the question $x$. This is an easy method to enrich the task variety given a limited set of data sources. However, it isn’t clear that this method remains helpful when hundreds of unique data sources and thousands of tasks are already available.
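A minimal sketch of input inversion on a question-answering example; the wording of the inverted prompt below is a hypothetical stand-in for the actual inversion templates (examples of which are in Appendix B):

```python
def invert_example(ex: dict) -> dict:
    """Swap a supervised (input -> target) pair into a task that generates the input."""
    return {
        "input": f"Write a question to which the answer is: {ex['target']}",
        "target": ex["input"],
    }

original = {"input": "What is the capital of France?", "target": "Paris"}
inverted = invert_example(original)
```

Both the original and inverted variants are kept in the mixture, so each underlying dataset contributes two task directions.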
To assess this, we enrich our mixtures with input inverted tasks (details and examples in Appendix B) and measure the effect. In Table 1 we find this is not beneficial for Held-In performance, but strongly beneficial for Held-Out performance. These benefits invigorate the prospect of data augmentation techniques for LLM finetuning, which had previously been shown to have diminishing returns the longer models are pretrained (Longpre et al., 2020).
3.5 Balancing Data Sources
Scaling architecture size and the number of tasks are effective, but our results suggest the mixture weighting deserves as much attention to optimize results. To converge on a balanced weighting, we omit different sets of task sources, one at a time (Flan 2021, T0-SF, Super-Natural Instructions, Chain-of-Thought, Dialog, and
| Training Mixture | Held-In | CoT | MMLU |
|---|---|---|---|
| All (equal weighting) | 64.9 | 41.4 | 47.3 |
| All - Flan 2021 | 55.3 | 38.6 | 45.7 |
| All - T0-SF | 63.2 | 43.4 | 44.7 |
| All - Super-Natural Inst. | 65.9 | 42.2 | 46.8 |
| All - CoT | 65.6 | 29.1 | 46.8 |
| All - Program Synthesis | 66.9 | 42.3 | 46.8 |
| All - Dialog | 65.4 | 40.3 | 47.1 |
| All (weighted) | 66.4 | 40.1 | 48.1 |
Table 2: Subsets of tasks are left out from an equally weighted mixture to measure their importance. T0-SF and Flan 2021 finetuning are most important for MMLU, while Chain-of-Thought (CoT) finetuning is most important for Chain-of-Thought evaluation.
Program Synthesis), and rank their contributions on the MMLU benchmark.⁴
As shown in Table 2, Flan 2021 and T0-SF are among the most beneficial mixtures, followed by Super-Natural Instructions and Chain-of-Thought, with Dialog and Program Synthesis last. These findings are corroborated by Iyer et al. (2022) who extensively test data mixing proportions, and also determine their Flan 2021, T0-SF, and T5 mixtures are the most broadly beneficial. Additionally, they find Super-Natural Instructions has limited scaling benefits on Held-Out task performance, which they relate to its unique input format and instruction design. Notably, Chain-of-thought finetuning appears beneficial across all our evaluation settings, especially considering they contain far fewer tasks than Flan 2021, T0-SF or Natural Instructions.
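The leave-one-out procedure behind Table 2 can be sketched as follows; here `eval_fn` stands in for the expensive step of finetuning on a mixture and evaluating on MMLU, and the toy scores are illustrative only:

```python
def leave_one_out_importance(eval_fn, sources):
    """Rank task sources by the metric drop observed when each is removed."""
    full_score = eval_fn(sources)
    return {
        src: full_score - eval_fn([s for s in sources if s != src])
        for src in sources
    }

# Toy stand-in for "finetune on the mixture, then evaluate on MMLU".
toy_value = {"Flan 2021": 1.6, "T0-SF": 2.6, "CoT": 0.5, "Dialog": 0.2}
score = lambda mixture: sum(toy_value[s] for s in mixture)
importance = leave_one_out_importance(score, list(toy_value))
```

In the real ablation each call to `eval_fn` is a full finetuning run, so only one source is removed at a time rather than searching over all subsets.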
We used these findings to significantly narrow the mixture weights search space, and used our practitioner’s intuition from there. This strategy is simple but effective, as shown in Table 1, but leaves ample room for more sophisticated future work.
我们利用这些发现显著缩小了混合权重的搜索空间,并在此基础上运用了我们的实践直觉。如表 1 所示,这一策略简单但有效,但仍为未来更复杂的工作留下了充足的空间。
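The leave-one-out ablation and subsequent weighted mixing described above can be sketched as follows. The source names and example counts are illustrative placeholders, not the actual Flan 2022 statistics.
上述留一法消融和加权混合过程可以用如下示意代码表示。其中的来源名称和样本数量仅为示意性占位,并非 Flan 2022 的真实统计。

```python
import random

# Hypothetical task sources with per-source task counts; the real
# Flan 2022 sources and sizes differ.
SOURCES = {
    "flan2021": 1840, "t0_sf": 193, "niv2": 1554,
    "cot": 9, "dialog": 3, "program_synthesis": 2,
}

def mixture_weights(sources, leave_out=None, cap=None):
    """Equal-by-size weighting over the remaining sources, optionally
    capping each source's contribution (a common balancing heuristic)."""
    kept = {k: v for k, v in sources.items() if k != leave_out}
    sizes = {k: min(v, cap) if cap else v for k, v in kept.items()}
    total = sum(sizes.values())
    return {k: s / total for k, s in sizes.items()}

def sample_source(weights, rng=random):
    """Draw one task source according to the mixture weights."""
    r, acc = rng.random(), 0.0
    for name, w in weights.items():
        acc += w
        if r < acc:
            return name
    return name  # guard against floating-point rounding
```

A leave-one-out run then simply finetunes on `mixture_weights(SOURCES, leave_out="cot")` and compares evaluation scores against the full mixture.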
3.6 Discussion
3.6 讨论
OPT-IML (Iyer et al., 2022) presents the closest comparison to this work, including a similar collection of tasks, examples, and techniques. However, while the tasks they use are all publicly sourced, their collection, with templates, processing, and example mixing, is not released, and as a result cannot be easily compared. Iyer et al. (2022) report that Flan-T5-XL (3B) and XXL (11B) outperform OPT-IML-Max 175B on both MMLU and BBH. As they discuss, these differences may arise from any combination of pre-training, model architecture, and instruction tuning. Model architecture and pre-training before instruction tuning can play a significant role (Wang et al., 2022a). But there are many other details in instruction tuning that may vary between Flan 2022 and OPT-IML. Likely candidates are: example templatization, how the mixed input prompting procedures are used at training, and task composition.
OPT-IML (Iyer et al., 2022) 提供了与此工作最接近的比较,包括类似的任务、示例和技术集合。然而,尽管他们使用的所有任务都是公开来源的,但他们的集合(包括模板、处理和示例混合)并未发布,因此无法轻松进行比较。Iyer et al. (2022) 报告称 Flan-T5-XL (3B) 和 XXL (11B) 在 MMLU 和 BBH 上的表现优于 OPT-IML-Max 175B。正如他们所讨论的,这些差异可能源于预训练、模型架构和指令微调的任何组合。模型架构和预训练在指令微调之前可以发挥重要作用 (Wang et al., 2022a)。但在 Flan 2022 和 OPT-IML 之间的指令微调中可能存在许多其他细节上的差异。可能的原因包括:示例模板化、训练时如何使用混合输入提示程序,以及任务组合。
How significant is each of these differences? While OPT-IML contains more tasks than Flan 2022, we estimate approximately 94% (2067/2207) are also used in the Flan 2022 collection, and very few tasks in Flan 2022 are not contained in some format in OPT-IML. This suggests the overall difference in task diversity is not significant when using a shared definition of “task”. Task mixture rates also emphasize similar sources, including Flan 2021 (46% vs 20%), PromptSource/P3 (28% vs 45%), and Super-Natural Instructions (25% vs 25%), for Flan 2022 and OPT-IML respectively. OPT-IML’s other collections (CrossFit, ExMix, T5, U-SKG)
这些差异的重要性如何?尽管 OPT-IML 包含的任务比 Flan 2022 更多,我们估计其中大约 94% (2067/2207) 的任务也用于 Flan 2022 集合,而 Flan 2022 中只有极少数任务未以某种格式包含在 OPT-IML 中。这表明在使用共享的“任务”定义时,任务多样性方面的总体差异并不显著。任务混合率也强调了类似的来源,包括 Flan 2021 (46% 对 20%)、PromptSource/P3 (28% 对 45%) 和 Super-Natural Instructions (25% 对 25%),分别对应 Flan 2022 和 OPT-IML。OPT-IML 的其他集合 (CrossFit, ExMix, T5, U-SKG)

Figure 5: Flan-T5 Outperforms T5 on Single-Task Finetuning. We compare single-task finetuned T5, single-task finetuned Flan-T5, and Flan-T5 without any further finetuning.

图 5: Flan-T5 在单任务微调上优于 T5。我们比较了单任务微调的 T5、单任务微调的 Flan-T5 以及未进行进一步微调的 Flan-T5。
are not weighted significantly: 4%, 2%, 2%, and 2%, respectively.
的权重都不大:分别为 4%、2%、2%、2%。
We believe example templatization and the mixed prompt formats may pose the largest differences with OPT-IML’s instruction tuning. Our template repository was significantly updated from Flan 2021, adding variety not just in instructions, but also along other dimensions. For instance, the templatization procedure varies where the instruction is placed (before or after few-shot prompts), the spacing and separators between few-shot and Chain-of-Thought prompts, and the formatting permutations of answer options (and their targets) for multiple-choice examples, which sometimes include and sometimes exclude answer options in the inputs or exemplars. While we do not have dedicated experiments comparing many iterations of development, we found these procedures dramatically augment input variety and showed repeated performance improvements. Our example templatizing procedure is open sourced for inspection and future work.
我们相信示例模板化和混合提示格式可能与 OPTIML 的指令微调存在最大差异。我们的模板库从 Flan 2021 版本进行了显著更新,不仅在指令上增加了多样性,还在多个维度上进行了扩展。例如,模板化过程中的指令位置(在少样本提示之前或之后)、少样本提示和思维链提示之间的间距和分隔符、以及多选题答案选项(及其目标)的格式排列都各不相同,有时包括有时不包括输入或示例中的答案选项。虽然我们没有专门的实验来比较开发过程中的多次迭代,但我们发现这些程序极大地增强了输入的多样性,并显示出反复的性能改进。我们的示例模板化程序已开源,供检查和未来研究。
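A minimal sketch of the templatization dimensions just described (instruction placement, separators, and optional answer options); the real open-sourced templatizer draws from a much larger template pool.
下面是上述模板化维度(指令位置、分隔符、可选的答案选项)的最小示意;实际开源的模板化程序会从大得多的模板库中抽样。

```python
import random

def templatize(instruction, exemplars, query, options=None, rng=random):
    """Randomize the dimensions described above: instruction placement
    (before or after the few-shot exemplars), the separator between
    exemplars, and whether answer options are shown in the input.
    A simplified sketch, not the released Flan templatizer."""
    sep = rng.choice(["\n\n", "\n---\n", "\n***\n"])
    shots = sep.join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    query_block = f"Q: {query}\nA:"
    if options and rng.random() < 0.5:  # sometimes include answer options
        query_block = f"Q: {query}\nOptions: {', '.join(options)}\nA:"
    if rng.random() < 0.5:  # instruction before or after the exemplars
        return "\n\n".join([instruction, shots, query_block])
    return "\n\n".join([shots, instruction, query_block])
```

Sampling a fresh template per example in this way multiplies the surface variety of the training inputs without adding new tasks.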
4 Instruction Tuning Enhances Single-Task Finetuning
4 指令微调增强单任务微调
In applied settings, machine learning practitioners deploy NLP models finetuned (FT) specifically for a single target task, usually where finetuning data is already available. While prior work has shown the benefits of intermediate finetuning (Pruksachatkun et al., 2020; Vu et al., 2020) or multi-task finetuning (Aghajanyan et al., 2021; Aribandi et al., 2021) for downstream tasks, this has not been studied extensively for instruction-tuned models.
在应用环境中,机器学习从业者部署针对单一目标任务微调 (FT) 的 NLP 模型,通常是在已经有微调数据可用的情况下。虽然之前的工作已经展示了中间微调 (Pruksachatkun et al., 2020; Vu et al., 2020) 或多任务微调 (Aghajanyan et al., 2021; Aribandi et al., 2021) 对下游任务的好处,但这对于指令微调模型尚未进行广泛研究。
We evaluate Flan 2022 instruction tuning as an intermediary step before single target finetuning, to understand if Flan-T5 would serve as a better starting checkpoint for applied practitioners. We evaluate three settings in
我们评估 Flan 2022 指令调优作为单目标微调之前的中间步骤,以了解 Flan-T5 是否会成为应用实践者更好的起始检查点。我们评估了三种设置:

Figure 6: Flan-T5 convergences faster than T5 on single-task finetuning for each of 5 Held-Out tasks from Flan finetuning.
图 6: Flan-T5 在每个 Flan 微调的 5 个保留任务的单任务微调上比 T5 收敛更快。
Figure 5: finetuning T5 directly on the target task as the conventional baseline (blue bars), using Flan-T5 without further finetuning (beige bars), and finetuning Flan-T5 further on the target task (red bars).
图 5: 直接在目标任务上微调 T5 作为传统基线 (蓝色条形),使用 Flan-T5 不再进一步微调 (米色条形),以及在目标任务上进一步微调 Flan-T5 (红色条形)。
Pareto Improvements to Single-Task Finetuning For both sets of Held-In and Held-Out tasks examined, finetuning Flan-T5 offers a Pareto improvement over finetuning T5 directly. In some instances, usually where finetuning data is limited for a task, Flan-T5 without further finetuning outperforms T5 with task finetuning.
对于所检查的 Held-In 和 Held-Out 两组任务,微调 Flan-T5 相比直接微调 T5 都提供了帕累托改进。在某些情况下,通常是在任务的微调数据有限时,未经进一步微调的 Flan-T5 表现优于经过任务微调的 T5。
Faster Convergence & Computational Benefits Using Flan-T5 as a starting checkpoint has an added benefit in training efficiency. As demonstrated in Figure 6, Flan-T5 converges much more quickly than T5 during single target finetuning, as well as peaking at higher accuracies. These convergence results also suggest there are strong green-AI incentives for the NLP community to adopt instruction-tuned models, like Flan-T5, for single-task finetuning, rather than conventional non-instruction-tuned models. While instruction tuning is more computationally expensive than single-task finetuning, it is a one-time cost. In contrast, pretrained models that require extensive finetuning become more costly when aggregating over many millions of additional training steps (Wu et al., 2022; Bommasani et al., 2021). Instruction-tuned models offer a promising solution to significantly reduce the amount of finetuning steps across a wide swath of tasks, if they are adopted as a new standard starting point for single-task finetuning.
使用 Flan-T5 作为起始检查点可以更快收敛并带来计算优势。如图 6 所示,在单目标微调过程中,Flan-T5 比 T5 收敛得更快,并且在准确性上达到更高的峰值。这些收敛结果还表明,自然语言处理社区采用指令微调模型(如 Flan-T5)进行单任务微调,而不是传统的非指令微调模型,具有强大的绿色 AI (green-AI) 激励。虽然指令微调比单任务微调计算成本更高,但这是一次性成本。相反,需要大量微调的预训练模型在累积数百万个额外训练步骤时变得更加昂贵(Wu et al., 2022; Bommasani et al., 2021)。如果将指令微调模型作为单任务微调的新标准起点广泛采用,它们有望显著减少跨多种任务的微调步骤数量。
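The amortization argument can be made concrete with a back-of-the-envelope calculation; the step counts below are hypothetical, chosen only to illustrate the break-even point, not measured costs.
上述摊销论证可以用一个粗略的计算来具体化;下面的步数是假设值,仅用于说明收支平衡点,并非实测成本。

```python
def amortized_steps(instruction_tuning_steps, saved_steps_per_task, num_tasks):
    """Net finetuning steps saved once the one-time instruction-tuning
    cost is spread over many downstream tasks. Illustrative numbers only."""
    return num_tasks * saved_steps_per_task - instruction_tuning_steps

# If instruction tuning costs, say, 100k steps but each downstream task
# then converges 20k steps sooner, the one-time cost is repaid after
# 5 tasks and every further task is pure savings.
assert amortized_steps(100_000, 20_000, 5) == 0
assert amortized_steps(100_000, 20_000, 100) == 1_900_000
```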
5 Related Work
5 相关工作
Large Language Models As the foundation of instruction tuning, the practice of pretraining one general-purpose language representation that is useful for multiple downstream tasks has a long tradition that goes back at least to Mikolov et al. (2013) and Dai and Le (2015). In 2018, Peters et al. (2018) and Devlin et al. (2019) cemented the paradigm of pretraining a large model on a large unsupervised corpus, and the field of NLP quickly converged to using these models, which substantially outperform the prior art of non-pretrained task-specific LSTM models on all tasks. However, the dominant way to access the high-quality syntactic and semantic knowledge encoded in pretrained models was not to prompt them with instructions, but to train an additional task-specific linear layer that maps the model activations into numerical class labels. A short year later, Radford et al. (2019), Raffel et al. (2020), and Lewis et al. (2020) popularized the notion that downstream tasks, and multiple tasks jointly, can be learned by directly using the pretrained LM head to generate the answers in natural language (cf. task-specific numerical class labels). The task-general nature of these generative models became the precursor to many multitask transfer learning studies (McCann et al., 2018; Khashabi et al., 2020; Ye et al., 2021; Vu et al., 2020), which in turn led to the first wave of instruction tuning as described in Section 2.
大语言模型 作为指令微调的基础,预训练一个适用于多个下游任务的通用语言表示模型的做法有着悠久的传统,至少可以追溯到 Mikolov 等人 (2013) 和 Dai 与 Le (2015)。2018 年,Peters 等人 (2018) 和 Devlin 等人 (2019) 巩固了在大规模无监督语料库上预训练大型模型的范式,自然语言处理领域迅速转向使用这些模型,它们在所有任务上都大大优于以前的非预训练任务特定 LSTM 模型。然而,获取预训练模型中编码的高质量句法和语义知识的主要方法不是通过指令提示,而是训练一个额外的任务特定线性层,将模型激活映射为数值类标签。仅仅一年后,Radford 等人 (2019)、Raffel 等人 (2020) 以及 Lewis 等人 (2020) 普及了这样的概念:下游任务(以及多个任务)可以通过直接使用预训练的语言模型头以自然语言生成答案(而非任务特定的数值类标签)来共同学习。这些生成式模型的任务通用性质成为许多多任务迁移学习研究的先驱 (McCann 等人, 2018; Khashabi 等人, 2020; Ye 等人, 2021; Vu 等人, 2020),这反过来引发了如第 2 节所述的第一波指令微调。
The continuing advancement in research on the pretraining corpora, architectures, and pretraining objectives of LMs also has a large impact on instruction tuning. As of 2022, decoder-only left-to-right causal Transformers dominate the market of models larger than 100B (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021;
大语言模型预训练语料库、架构和预训练目标的研究不断进步,也对指令微调产生了重大影响。截至 2022 年,仅解码器的从左到右因果 Transformer 模型主导了参数量超过 100B 的市场 (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021;
Chowdhery et al., 2022; Hoffmann et al., 2022), and all models of such size class with fully public model parameters are decoder-only (Wang and Komatsuzaki, 2021; Le Scao et al., 2022; Zhang et al., 2022), a choice often due to better hardware and software framework support. However, Raffel et al. (2020), Lewis et al. (2020), and Tay et al. (2022a) have consistently found that left-to-right causal language modeling is a suboptimal objective, while Tay et al. (2022b) and Wang et al. (2022a) particularly showed that a mixture of non-sequential objectives is much superior for downstream tasks with zero-shot and few-shot prompting. An additional factor which remains under-explored is the relationship between pretraining corpora, instruction tuning, and downstream abilities. Typically, public models are all trained on one of a few public corpora: C4 (Raffel et al., 2020), The Pile (Gao et al., 2020), or ROOTS (Laurençon et al., 2022).
Chowdhery 等,2022;Hoffmann 等,2022),而所有具有完全公开模型参数的此类规模模型都采用仅解码器架构 (Wang 和 Komatsuzaki, 2021;Le Scao 等,2022;Zhang 等,2022),这一选择通常是出于更好的硬件和软件框架支持。然而,Raffel 等 (2020)、Lewis 等 (2020) 以及 Tay 等 (2022a) 一致发现,从左到右的因果语言建模是一个次优目标,而 Tay 等 (2022b) 和 Wang 等 (2022a) 特别表明,非顺序目标的混合在零样本和少样本提示的下游任务中表现更优。另一个尚未充分研究的因素是预训练语料库、指令微调和下游能力之间的关系。通常,公开的模型都是在少数几个公开语料库之一上训练的:C4 (Raffel 等,2020)、The Pile (Gao 等,2020) 或 ROOTS (Laurençon 等,2022)。
Instruction Tuning In Section 2 we outline major developments in instruction tuning. Other important developments include the prospect of complementing or replacing few-shot in-context learning (the currently predominant method of evaluating pretrained and instruction-tuned models) with parameter-efficient tuning. As standard finetuning of models larger than 100B requires a high number of accelerators with the right interconnects, often too expensive even for many industry labs, parameter-efficient tuning (a.k.a. continuous or soft “prompt tuning”) shows that updating only a small subset of model parameters can reach comparable performance to fully tuning all model parameters (Lester et al., 2021; Vu et al., 2022; Hu et al., 2021; see He et al., 2022 for a detailed analysis). Notably, Liu et al. (2022b) show that, due to the long sequence length of few-shot ICL and the fact that the few-shot exemplars need to be repeatedly inferenced when evaluating every example, parameter-efficient tuning can be computationally cheaper and higher performing than in-context learning. Further, Liu et al. (2022b), Vu et al. (2022), Wei et al. (2021), and Singhal et al. (2022) collectively show that both single-task and multi-task parameter-efficient tuning can be productively combined with instruction tuning, either before or after regular full-model instruction tuning. This line of work makes it easy for other researchers to build on top of a general-domain instruction-tuned model and collect a custom instruction-tuning mixture for their use, e.g., with multiple modalities (Ahn et al., 2022; Huang et al., 2022; Xu et al., 2022) or special domains such as science and medicine (Lewkowycz et al., 2022; Singhal et al., 2022).
指令微调 在第 2 节中,我们概述了指令微调的主要进展。其他重要进展包括用参数高效微调来补充或替代当前主流的少样本上下文学习方法,后者是目前评估预训练和指令微调模型的主要方法。由于对超过 100B 参数规模的模型进行标准微调需要大量具有合适互连的加速器,这即使对于许多工业实验室来说也过于昂贵,参数高效微调(即连续或软“提示微调”)表明,仅更新模型参数的一小部分可以达到与完全微调所有模型参数相当的性能 (Lester et al., 2021; Vu et al., 2022; Hu et al., 2021; 详见 He et al., 2022 的详细分析)。值得注意的是,Liu et al. (2022b) 指出,由于少样本 ICL 的长序列长度以及每个示例评估时需要反复推理少样本示例,参数高效微调在计算上可能比上下文学习更便宜且性能更高。此外,Liu et al. (2022b),Vu et al. (2022),Wei et al. (2021),和 Singhal et al. (2022) 共同表明,单任务和多任务参数高效微调可以与指令微调有效结合,无论是在常规全模型指令微调之前还是之后。这一系列工作使得其他研究人员能够基于通用领域的指令微调模型进行构建,并为他们的用途收集自定义指令微调混合数据集,例如包含多种模态 (Ahn et al., 2022; Huang et al., 2022; Xu et al., 2022) 或特定领域如科学和医学 (Lewkowycz et al., 2022; Singhal et al., 2022)。
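As a toy illustration of parameter-efficient prompt tuning, the sketch below prepends a trainable soft prompt to a frozen linear "model" and backpropagates only into the prompt. The linear map is a stand-in for a pretrained LM; it is not meant to reflect any specific implementation from the cited papers.
作为参数高效提示微调的玩具示例,下面的代码在一个冻结的线性“模型”前拼接可训练的软提示,并且只对提示进行反向传播。该线性映射仅是预训练语言模型的替身,并不代表所引用论文中的任何具体实现。

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "model": a fixed linear map from a pooled embedding to logits.
D, P, C = 16, 4, 2                            # embed dim, prompt length, classes
W_frozen = rng.normal(size=(D, C))            # frozen model parameters
soft_prompt = rng.normal(size=(P, D)) * 0.01  # the only trainable tensor

def forward(x_emb, prompt):
    """Prepend the soft prompt, mean-pool, and map to class logits."""
    pooled = np.concatenate([prompt, x_emb]).mean(axis=0)
    return pooled @ W_frozen

def train_step(x_emb, label, prompt, lr=0.1):
    """One cross-entropy gradient step on the prompt only."""
    logits = forward(x_emb, prompt)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logits = probs.copy()
    grad_logits[label] -= 1.0
    # Gradient flows only into the prompt; W_frozen is never updated.
    grad_pooled = W_frozen @ grad_logits
    n = prompt.shape[0] + x_emb.shape[0]
    return prompt - lr * np.tile(grad_pooled, (prompt.shape[0], 1)) / n
```

Only `P * D` values are updated, regardless of how large the frozen model is, which is the core efficiency argument of soft prompt tuning.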
Problems Addressed by Instruction Tuning & Alignment Techniques Instruction tuning is part of a line of work designed to “align” language models with more useful objectives and human preferences. In the absence of such methods, language models are known to demonstrate toxic or harmful behaviour (Sheng et al., 2019; Liang et al., 2021; Wallace et al., 2019), generate non-factual information (Maynez et al., 2020; Longpre et al., 2021; Devaraj et al., 2022), and pose other challenges in deployment and evaluation (Zellers et al., 2019; McGuffie and Newhouse, 2020; Talat et al., 2022). Analyzing, evaluating, and mitigating these problems poses a promising direction for future work (Gao et al., 2022; Ganguli et al., 2022). Instruction tuning warrants greater investigation, as it has already demonstrated itself to be an encouraging remedy for reducing NLP bias metrics, as shown in Chung et al. (2022).
指令微调与对齐技术所解决的问题
指令微调是旨在使语言模型与更有用的目标和人类偏好“对齐”的一系列工作中的一部分。在没有这些方法的情况下,语言模型已知会表现出有毒/有害行为 (Sheng et al., 2019; Liang et al., 2021; Wallace et al., 2019),生成非事实信息 (Maynez et al., 2020; Longpre et al., 2021; Devaraj et al., 2022),以及在部署和评估中存在其他挑战 (Zellers et al., 2019; McGuffie and Newhouse, 2020; Talat et al., 2022)。分析、评估和缓解这些问题为未来的工作提供了一个有前景的方向 (Gao et al., 2022; Ganguli et al., 2022)。指令微调值得进一步研究,因为它已经在减少 NLP 偏差指标方面显示出令人鼓舞的效果,如 Chung et al. (2022) 所示。
6 Conclusions
6 结论
The new Flan 2022 instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting collection outperforms Flan 2021, P3++, Super-Natural Instructions, and OPT-IML-Max 175B on Held-In QA, NLI, and Chain-of-Thought tasks, and Held-Out MMLU and BBH, often by large margins. Results suggest this new collection serves as a more competitive starting point for researchers and practitioners interested in either generalizing to new instructions or finetuning on a single new task.
新的 Flan 2022 指令微调集合统一了之前最受欢迎的公共集合及其方法,同时添加了新的模板和简单的改进,例如混合提示设置下的训练。该集合在 Held-In QA、NLI 和 Chain-of-Thought 任务以及 Held-Out MMLU 和 BBH 上的表现超过了 Flan 2021、P3++、Super-Natural Instructions 和 OPT-IML-Max 175B,通常具有很大的优势。结果表明,这个新集合为希望对新指令进行泛化或对单个新任务进行微调的研究人员和从业者提供了一个更具竞争力的起点。
Acknowledgements
致谢
We would like to thank Ed H Chi, Xinyun Chen, and Colin Raffel for their advice and feedback on the paper.
我们要感谢 Ed H Chi、Xinyun Chen 和 Colin Raffel 对论文的建议和反馈。
References
参考文献
Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. Muppet: Massive multi-task representations with pre-finetuning. In EMNLP, 2021. URL https://aclanthology.org/2021.emnlp-main.468.
Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, 和 Sonal Gupta. Muppet: 带有预微调的大型多任务表示 (Massive multi-task representations with pre-finetuning). 在 EMNLP, 2021. URL https://aclanthology.org/2021.emnlp-main.468.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv e-prints, art. arXiv:2204.01691, April 2022.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopal a krishna n, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, 和 Andy Zeng. 按照我能做的去做,而不是按照我说的去做:将语言与机器人能力 (Robotic Affordances) 结合起来。arXiv e-prints, 文章编号 arXiv:2204.01691, 2022年4月。
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952, 2021.
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, 等. Ext5: 朝向极端多任务扩展的迁移学习. arXiv预印本 arXiv:2111.10952, 2021.
Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-jian Jiang, and Alexander Rush. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-demo.9. URL https://aclanthology.org/2022.acl-demo.9.
斯蒂芬·巴赫、维克多·桑、郑欣勇、阿尔伯特·韦布森、科林·拉费尔、尼哈尔·V·纳亚克、阿比什特·夏尔马、金泰勋、M·萨伊富尔·巴里、蒂博·费夫里、扎德·阿利亚法、马南·德、安德里亚·桑蒂利、孙志清、Srulik Ben-david、徐灿文、冈詹·查布拉尼、王涵、杰森·弗里斯、马吉德·阿尔-沙伊班、尚雅·夏尔马、乌尔米什·塔克、哈利德·阿尔穆巴拉克、唐翔茹、德拉戈米尔·拉德夫、蒋天健、亚历山大·拉什。提示源:一个用于自然语言提示的集成开发环境和存储库。在第 60 届计算语言学协会年会论文集:系统演示,第 93–104 页,爱尔兰都柏林,2022 年 5 月。计算语言学协会。doi: 10.18653/v1/2022.acl-demo.9。URL https://aclanthology.org/2022.acl-demo.9。
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
白云涛, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan 等. 使用来自人类反馈的强化学习训练乐于助人且无害的助手. arXiv 预印本 arXiv:2204.05862, 2022a.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
白云涛, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon 等. 宪法式 AI (Constitutional AI): 来自 AI 反馈的无害性. arXiv 预印本 arXiv:2212.08073, 2022b.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth PASCAL recognizing textual entailment challenge. In TAC, 2009.
路易莎·本蒂沃格利, 彼得·克拉克, 伊多·达甘, 和 达尼洛·贾姆皮科洛. 第五届 Pascal 识别文本蕴含挑战赛. 在 TAC, 2009.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, 等. 关于基础模型的机遇与风险. arXiv preprint arXiv:2108.07258, 2021.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
汤姆·布朗、本杰明·曼、尼克·赖德、梅兰妮·苏比亚、贾里德·D·卡普兰、普拉富拉·达里瓦尔、阿文德·尼拉坎坦、普拉纳夫·希亚姆、吉里什·萨斯特里、阿曼达·阿斯科尔、桑迪尼·阿加沃尔、阿里埃尔·赫伯特-沃斯、格雷琴·克鲁格、汤姆·亨尼根、瑞文·蔡尔德、阿迪蒂亚·拉梅什、丹尼尔·齐格勒、杰弗里·吴、克莱门斯·温特、克里斯·赫塞、马克·陈、埃里克·西格勒、马特乌什·利特温、斯科特·格雷、本杰明·切斯、杰克·克拉克、克里斯托弗·伯纳、山姆·麦肯德里什、阿莱克·拉德福德、伊利亚·苏茨克弗、达里奥·阿莫迪。大语言模型是少样本学习者。NeurIPS, 2020。URL https://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, 等. PaLM: 借助 Pathways 扩展语言建模. arXiv 预印本 arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
Hyung Won Chung,Le Hou,Shayne Longpre,Barret Zoph,Yi Tay,William Fedus,Eric Li,Xuezhi Wang,Mostafa Dehghani,Siddhartha Brahma 等. 扩展指令微调的语言模型. arXiv预印本 arXiv:2210.11416, 2022.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020. URL https://arxiv.org/abs/1910.10683.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, 和 Peter J Liu. 探索迁移学习的极限与统一的文本到文本 Transformer. Journal of Machine Learning Research, 21:1–67, 2020. URL https://arxiv.org/abs/1910.10683.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.
Pranav Rajpurkar, Robin Jia, 和 Percy Liang. 知道你不知道的:SQuAD 的不可回答问题. 在 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 中,页面 784–789,2018.
Appendix
附录
Table of Contents
目录
B Input Inversion Details 21
B 输入反转细节 21
A Experimental Details
A 实验细节
A.1 Instruction Tuning
A.1 指令微调
The Flan Collection experiments are assembled and run using T5X (Roberts et al., 2022). Our instruction tuning follows the same setup described in Chung et al. (2022). For few-shot and few-shot Chain-of-Thought prompts during finetuning, our templatizing procedure generates few-shot examples with 2, 3, or 5 exemplars. The experiments in this work use a slightly earlier version of the Flan 2022 collection than the one we are releasing, which has some minor improvements to the templates.
Flan 集合实验是使用 T5X (Roberts et al., 2022) 组装和运行的。我们的指令微调遵循 Chung et al. (2022) 中描述的相同设置。对于微调期间的少样本和少样本链式思维提示,我们的模板化过程生成包含 2、3 或 5 个示例的少样本样本。本研究中的实验使用的是比我们发布的版本稍早的 Flan 2022 集合,发布版本对模板进行了一些小的改进。
The mixture weights used to balance the various sources of data were informed by experiments in Section 3.5, along with the resulting practitioner intuition.
用于平衡各种数据来源的混合权重是根据第 3.5 节中的实验结果以及由此产生的实践者直觉确定的。
A.2 Single-Task Finetuning
A.2 单任务微调
For single-task finetuning, described in Section 4, our models are finetuned for 100,000 steps for all tasks. We use a constant learning rate of 0.001, a dropout probability of 0.1, and a batch size of 128 length-512 sequences. We save a checkpoint every 20 steps and report test performance on the model checkpoint corresponding to the highest validation performance. For tasks without a validation split, we hold out 1024 training examples for validation. For tasks without a test split, we hold out 1024 training examples for validation and report results on the original validation set. For PubmedQA, we do not use any of the unlabeled and artificially generated QA instances associated with the dataset. For CxC, we only consider the text-text portion of the dataset, following Vu et al. (2022). For tasks with fewer than 1K training examples, we report average results across 3 random seeds.
对于单任务微调,如第 4 节所述,我们的模型对所有任务进行 100,000 步的微调。我们使用恒定学习率 0.001、dropout 概率 0.1,以及由 128 条长度为 512 的序列组成的批量。我们每 20 步保存一个检查点,并报告验证性能最高的模型检查点的测试性能。对于没有验证集的任务,我们保留 1024 个训练样本用于验证。对于没有测试集的任务,我们保留 1024 个训练样本用于验证,并在原始验证集上报告结果。对于 PubmedQA,我们不使用与数据集相关的任何未标注和人工生成的 QA 实例。对于 CxC,我们遵循 Vu 等人 (2022) 的方法,仅考虑数据集的文本-文本部分。对于训练样本少于 1K 的任务,我们报告 3 个随机种子的平均结果。
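The split logic above can be sketched as a small helper; the function name and arguments are illustrative, not the actual experiment code.
上述划分逻辑可以用一个小的辅助函数来示意;函数名和参数仅为示意,并非实际实验代码。

```python
def make_splits(train_examples, has_validation, has_test, n_holdout=1024):
    """Hold out 1024 training examples for validation when the dataset
    lacks a validation split (or lacks a test split, in which case the
    original validation set is used for reporting). Returns
    (train, validation); validation is None when the dataset's own
    splits are used."""
    if has_validation and has_test:
        return train_examples, None  # use the dataset's own splits
    return train_examples[n_holdout:], train_examples[:n_holdout]
```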
We also evaluate on certain metrics to account for label skew in some of the datasets, as shown in Table 3.
我们还评估了某些指标,以考虑一些数据集中的标签偏斜,如表 3 所示。
A.3 Evaluation
A.3 评估
For Held-In evaluations we use the validation sets from 4 question answering (QA) tasks, BoolQ, ARC Easy, ARC Challenge, and AI2’s Middle School Science Exams, and 4 natural language inference (NLI) tasks, including ANLI R1, R2, R3, and RTE. These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchmarks, often used to evaluate LLMs on QA and NLI. The Held-In score is the mean accuracy across these 8 tasks.
对于 Held-In 评估,我们使用来自 4 个问答 (QA) 任务的验证集,BoolQ、ARC Easy、ARC Challenge 和 AI2 的初中科学考试,以及 4 个自然语言推理 (NLI) 任务,包括 ANLI R1、R2、R3 和 RTE。这些数据集包含在 Flan 2022 微调集合中,代表了具有挑战性的基准测试,常用于评估大语言模型 (LLM) 在 QA 和 NLI 上的表现。Held-In 分数是这 8 个任务的平均准确率。
Table 3: Datasets used for Various Finetuning and Evaluation Experiments. ST-FT stands for Single Task Finetuning.
表 3: 用于各种微调和评估实验的数据集。ST-FT 表示单任务微调。
| 数据集 | 指标 | 引用 |
|---|---|---|
| ARC E+C | Acc | (Clark et al., 2018) |
| ANLI R1+R2+R3 | 三分类 F1 √ | (Nie et al., 2020) |
| AI2 Mid. Science | 四分类 F1 √ | (AI2 Science Questions) |
| BoolQ | AUC-ROC | (Clark et al., 2019) |
| RTE | AUC-ROC √ | (Bentivogli et al., 2009) |
| SQuADV2 | F1 | (Rajpurkar et al., 2018) |
| CosmosQA | Acc | (Huang et al., 2019) |
| GSM8K | Acc | (Cobbe et al., 2021) |
| StrategyQA | Acc | (Geva et al., 2021) |
| SVAMP | Acc | (Patel et al., 2021) |
| Asdiv | Acc | (Miao et al., 2020) |
| CommonsenseQA | Acc | (Talmor et al., 2019) |
| WANLI | Acc | (Liu et al., 2022a) |
| MedNLI | Acc | (Romanov and Shivade, 2018) |
| CondaQA | Acc | (Ravichander et al., 2022) |
| PubmedQA | F1 | (Jin et al., 2019) |
| CxC | Spearman | (Parekh et al., 2021) |
For the Chain-of-Thought (CoT) evaluation, we use the mean accuracy across 5 datasets which have been prepared with prompts that request step-by-step explanations in their target answers: GSM8K, StrategyQA, SVAMP, Asdiv, and CommonsenseQA.
对于链式思维 (Chain-of-Thought, CoT) 评估,我们使用 5 个数据集的平均准确率,这些数据集的提示要求目标答案中包含逐步解释:GSM8K、StrategyQA、SVAMP、Asdiv 和 CommonsenseQA。
For the Held-Out tasks, we use MMLU’s suite of 57 exams, and BBH’s suite of 23 tasks where PaLM performed worse than the average human annotators. MMLU tasks were removed from the Super-Natural Instructions part of the Flan 2022 collection at training, to ensure they were Held-Out.
对于 Held-Out 任务,我们使用 MMLU 的 57 套考试和 BBH 中 PaLM 表现低于人类标注者平均水平的 23 个任务。MMLU 任务在训练时已从 Flan 2022 集合的 Super-Natural Instructions 部分移除,以确保它们属于 Held-Out。
B Input Inversion Details
B 输入反转细节
For the input inversion experiments we note that Flan 2021, P3++, and Super-Natural Instructions already implicitly include tasks that have been inverted, e.g. question answering inverted into question or context generation. Consequently, we choose to also create input inversions for the remaining datasets in the Flan 2022 collection, including for the Dialog, Program Synthesis, and Chain-of-Thought tasks.
对于输入反转实验,我们注意到 Flan 2021、P3++ 和 Super-Natural Instructions 已经隐含地包含了被反转的任务,例如将问答反转为问题生成或上下文生成。因此,我们选择为 Flan 2022 集合中剩余的数据集也创建输入反转,包括对话、程序合成和链式思维任务。
As examples: for Dialog tasks, we write template instructions asking for the previous conversational history given the current dialog turn; for Program Synthesis we ask for the coding question which the code solves; and for Chain-of-Thought we include every permutation of the query-answer-explanation triple where at least one of the three appears in the output. An illustration of Chain-of-Thought input inversion permutations is shown in Figure 7.
例如:对于对话任务,我们编写模板指令,要求根据当前对话轮次给出先前的对话历史;对于程序合成,我们要求给出该代码所解决的编程问题;对于思维链 (Chain-of-Thought),我们包含查询-答案-解释三元组的所有排列组合,其中至少有一个字段出现在输出中。图 7 展示了思维链输入反转排列的示例。
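A simplified enumeration of the Chain-of-Thought inversion permutations described above; the bracketed field markers are illustrative, and the exact set of permutations in the released collection may differ.
下面是上述思维链反转排列的简化枚举;方括号字段标记仅为示意,发布集合中的实际排列集合可能有所不同。

```python
from itertools import permutations

def cot_inversions(query, answer, explanation):
    """Enumerate (input, target) pairs from the query-answer-explanation
    triple, keeping every ordered split where at least one field lands
    in the target. A sketch of the idea, not the released code."""
    fields = {"query": query, "answer": answer, "explanation": explanation}
    pairs = []
    for k in (1, 2):  # number of fields placed in the target
        for target_keys in permutations(fields, k):
            input_keys = [f for f in fields if f not in target_keys]
            inp = " ".join(f"[{f}] {fields[f]}" for f in input_keys)
            tgt = " ".join(f"[{f}] {fields[f]}" for f in target_keys)
            pairs.append((inp, tgt))
    return pairs
```

Each pair would then be wrapped in an instruction template telling the model which fields are given and which to predict.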
These inversions are mixed in with the existing tasks at a rate of 30%, meaning for a Dialog task, 3 inverted examples will be generated for every 10 regular examples. We choose this rate for simplicity, approximately mirroring prior work, and leave the large space of exploration for future work.
这些反转样本以 30% 的比例混合到现有任务中,这意味着对于一个对话任务,每 10 个正常样本将生成 3 个反转样本。我们选择这个比例是为了简化操作,大致上与之前的工作相一致,并将大规模的探索空间留给未来的工作。
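Mixing inverted examples at the 30% rate can be sketched as follows; `invert_fn` is a placeholder for any of the inversion procedures above.
以 30% 的比例混合反转样本可以示意如下;其中 `invert_fn` 是上述任意一种反转过程的占位。

```python
import random

def mix_with_inversions(examples, invert_fn, rate=0.3, rng=random):
    """Interleave inverted examples with the originals at the given rate:
    roughly 3 inverted examples for every 10 regular ones, as above."""
    mixed = []
    for ex in examples:
        mixed.append(ex)
        if rng.random() < rate:
            mixed.append(invert_fn(ex))
    return mixed
```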


Figure 7: Input Inversions permutations for a Zero-Shot Chain-of-Thought example. Each is accompanied by a corresponding instruction template that prompts the model with what the input is, and what to predict as the targets.
图 7: 零样本链式思维 (Zero-Shot Chain-of-Thought) 示例的输入反转排列。每个排列都附带一个相应的指令模板,提示模型输入的内容以及需要预测的目标。
