The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Shayne Longpre∗ Le Hou Tu Vu Albert Webson Hyung Won Chung Yi Tay Denny Zhou Quoc V. Le Barret Zoph Jason Wei Adam Roberts
Google Research
Abstract
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 models (Chung et al., 2022). Through careful ablation studies on the Flan Collection of instruction tuning tasks and methods, we tease apart the effect of design decisions that enable Flan-T5 to outperform prior work by 3–17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks—motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.¹
Figure 1: Comparing public instruction tuning collections on Held-In, Held-Out (BIG-Bench Hard (Suzgun et al., 2022) and MMLU (Hendrycks et al., 2020)), and Chain-of-Thought evaluation suites, detailed in Appendix A.3. All models except OPT-IML-Max (175B) are T5-XL with 3B parameters. Green text indicates absolute improvement over the next best comparable T5-XL (3B) model.
1 Introduction
Large language models such as PaLM (Chowdhery et al., 2022), Chinchilla (Hoffmann et al., 2022), and ChatGPT among others (Brown et al., 2020; Ouyang et al., 2022) have unlocked new capabilities in performing natural language processing (NLP) tasks from reading instructive prompts. Prior art has shown that instruction tuning—finetuning language models on a collection of NLP tasks formatted with instructions— further enhances the ability of language models to perform an unseen task from an instruction (Wei et al., 2021; Sanh et al., 2021; Min et al., 2022).
In this work, we evaluate the methods and results of open sourced instruction generalization efforts, comparing their finetuning techniques and methods. In particular, we identify and evaluate the critical methodological improvements in the “Flan 2022 Collection”, the term we use for the collection of data and methods for data augmentation and instruction tuning, first implemented and used in Chung et al. (2022). Where Chung et al. (2022) focuses on the emergent and state-of-the-art results of combining Flan 2022 with PaLM 540B, this work focuses on the details of the instruction tuning methods themselves, ablating individual factors, and comparing them directly to prior work by keeping the pretrained model size and checkpoint consistent.
The Flan 2022 Collection offers the most extensive publicly available set of tasks and methods for instruction tuning, which we have compiled in one place. We have also supplemented it with hundreds more of our own high-quality templates, richer formatting patterns, and data augmentations. We show that a model trained on this collection outperforms other public collections on all tested evaluation benchmarks, including the original Flan 2021 (Wei et al., 2021), T0++ (Sanh et al., 2021), Super-Natural Instructions (Wang et al., 2022c), and the concurrent work on OPT-IML (Iyer et al., 2022). As shown in Figure 1, this includes 4.2%+ and 8.5% improvements on the MMLU (Hendrycks et al., 2020) and BIG-Bench Hard (Suzgun et al., 2022) evaluation benchmarks respectively, for equally sized models.
Analysis of the Flan 2022 method suggests the strong results stem both from the larger and more diverse set of tasks and from a set of simple finetuning and data augmentation techniques. In particular, training on a mix of examples templatized with zero-shot, few-shot, and chain-of-thought prompts improves performance in every one of these settings, together. For instance, adding just 10% few-shot prompts improves zero-shot prompting results by 2%+. Additionally, enriching task diversity by inverting input-output pairs, as used in Sanh et al. (2021) and Min et al. (2022), along with balancing task sources, are both shown to be critical to performance. The resulting Flan-T5 model converges faster and at a higher performance than T5 models in single-task finetuning—suggesting instruction-tuned models offer a more computationally-efficient starting checkpoint for downstream applications, corroborating Aribandi et al. (2021) and Liu et al. (2022b).
We hope making these findings and resources publicly available will unify resources around instruction tuning and accelerate research into more general-purpose language models. We summarize this work’s core contributions as follows:
2 Public Instruction Tuning Collections
| Release | Collection | Model | Base | Size |
|---|---|---|---|---|
| 2020-05 | UnifiedQA | UnifiedQA | RoBERTa | 110-340M |
| 2021-04 | CrossFit | BART-CrossFit | BART | 140M |
| 2021-04 | Natural Inst. v1.0 | Gen. BART | BART | 140M |
| 2021-09 | Flan 2021 | Flan-LaMDA | LaMDA | 137B |
| 2021-10 | P3 | T0, T0+, T0++ | T5-LM | 3-11B |
| 2021-10 | MetaICL | MetaICL | GPT-2 | 770M |
| 2021-11 | ExMix | ExT5 | T5 | 220M-11B |
| 2022-04 | Super-Natural Inst. | Tk-Instruct | T5-LM, mT5 | 11-13B |
| 2022-10 | GLM | GLM-130B | GLM | 130B |
| 2022-11 | xP3 | BLOOMz, mT0 | BLOOM, mT5 | 13-176B |
| 2022-12 | Unnatural Inst.† | T5-LM Unnat. Inst. | T5-LM | 11B |
| 2022-12 | Self-Instruct† | GPT-3 Self-Inst. | GPT-3 | 175B |
| 2022-12 | OPT-IML Bench† | OPT-IML | OPT | 30-175B |
| 2022-10 | Flan 2022 (ours) | Flan-T5, Flan-PaLM | T5-LM, PaLM | 10M-540B |
Figure 2: A Timeline of Public Instruction Tuning Collections specifies the collection release date, detailed information on the finetuned models (the base model, their size, and whether the model itself is Public (P) or Not Public (NP)), what prompt specification they were trained for (zero-shot, few-shot, or Chain-of-Thought), the number of tasks contained in the Flan 2022 Collection (released with this work), and core methodological contributions in each work.
Large Language Models Instruction tuning has emerged as a tool to make large language models (LLMs) and their abilities more useful for interactive dialog and functional tasks. Previous work (Raffel et al., 2020; Liu et al., 2019; Aghajanyan et al., 2021; Aribandi et al., 2021) experimented with large scale multi-task finetuning, to improve downstream single target finetuning, but without instruction prompts. UnifiedQA and others (Khashabi et al., 2020; McCann et al., 2018; Keskar et al., 2019) unified a wide range of NLP tasks into a single generative question answering format, using prompt instructions for multi-task finetuning and evaluation.
The First Wave Since 2020, several instruction tuning task collections have been released in rapid succession, outlined in Figure 2. Natural Instructions (Mishra et al., 2021), Flan 2021 (Wei et al., 2021), and P3 (the Public Pool of Prompts, Bach et al., 2022) aggregated large NLP task collections and templatized them with instructions (zero-shot prompting), specifically for finetuning models to generalize to unseen instructions. MetaICL (Min et al., 2022) also consolidated other task collections (Ye et al., 2021; Khashabi et al., 2020) to train models to learn tasks “in-context” – from several input-output examples, known as few-shot prompting, but in this case without instructions. Each of these works affirmed the scaling benefits of task and template diversity, and some reported strong benefits from inverting the inputs and outputs in templates to produce new tasks (the “noisy channel” in Min et al., 2022).
The Second Wave A second wave of instruction tuning collections expanded prior resources: combining more datasets and tasks into one resource, like Super-Natural Instructions (Wang et al., 2022c) or OPT-IML (Iyer et al., 2022), adding multilingual instruction tuning in xP3 (Muennighoff et al., 2022), and Chain-of-Thought training prompts in Flan 2022 (Chung et al., 2022). Both the Flan Collection and OPT-IML contain most tasks represented in prior collections.² Our work is positioned here, coalescing most of these collections (of collections) and their methods, as the strongest starting point for future open source work.
New Directions Concurrent and future work is beginning to explore two new directions: (a) expanding task diversity even more aggressively with synthetic data generation, particularly in creative, and open-ended dialogue (Wang et al., 2022b; Honovich et al., 2022; Ye et al., 2022; Gupta et al., 2022), and (b) offering human feedback signals on model responses (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Nakano et al., 2021; Bai et al., 2022b). We view most of these new directions as likely additive to a foundation of instruction tuning methods.
Tuning with Human Feedback Instruction tuning on human feedback has demonstrated strong results on open-ended tasks, but at the expense of performance on a wide array of more traditional NLP tasks (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Nakano et al., 2021). (See Ouyang et al. (2022)’s discussion of the “alignment tax”.) Our work focuses specifically on instruction generalization, without human feedback, for two reasons. First, human feedback datasets are far less publicly available than instruction tuning datasets (and may be model-specific). Second, by itself, instruction generalization shows great promise in enhancing human preferred responses on open-ended tasks, as well as improving traditional NLP metrics (Chung et al., 2022). The extent of obtainable progress without expensive human response demonstrations or ratings remains an open question, and an important pursuit to narrow the gap between public and non-public research.
The Importance of Open Source High profile research is increasingly driven by non-public data, as in the case of GPT-3 and others (Ouyang et al., 2022; Glaese et al., 2022). The inaccessibility of these resources inhibits the research community’s ability to analyze and improve these methods in the public domain. We narrow our purview to open source and accessible data collections, motivated by the goal of democratizing accessibility to research.
3 Flan 2022 Instruction Tuning Experiments
Recent research has yet to coalesce around a unified set of techniques, with different tasks, model sizes, and target input formats all represented. We open source a new collection, first introduced in Chung et al. (2022), denoted “Flan 2022”, which combines Flan 2021, P3++³, Super-Natural Instructions, and some additional reasoning, dialog, and program synthesis datasets. We defer to Chung et al. (2022) for details of templatization and collection; in this work we take a deeper look at key methodological improvements and compare the collection on equivalent model sizes to existing collections.
In this section, we evaluate the design decisions in Flan and discuss four in particular that yield strong improvements to the instruction tuning recipe. These design components, outlined in Section 2, are: (I) using mixed zero-shot, few-shot, and Chain-of-Thought templates at training (Section 3.2), (II) scaling T5-sized models to 1800+ tasks (Section 3.3), (III) enriching tasks with input inversion (Section 3.4), and (IV) balancing these task mixtures (Section 3.5). In Section 3.1, we begin by measuring the value of each component and compare the final model against alternative instruction tuning collections (and their methods).
² Note that each work defines datasets, tasks, and task categories differently. For simplicity, we use their own definitions in Section 2.
³ “P3++” is our notation for all datasets in the Public Pool of Prompts (P3): https://huggingface.co/datasets/bigscience/P3
Experimental Setup We finetune on the prefix language model adapted T5-LM (Lester et al., 2021), using the XL (3B) size for all models for consistency, unless otherwise stated. While other sizes of Flan-T5 are available, we felt XL was appropriately sized to run large-scale systematic ablations, while being sufficiently large to draw general conclusions. We evaluate on (a) a suite of 8 “Held-In” tasks represented within the 1800+ training task collection (4 question answering and 4 natural language inference validation sets), (b) Chain-of-Thought (CoT) tasks (5 validation sets), and (c) the MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022) benchmarks as our set of “Held-Out” tasks, as they are not included as part of Flan 2022 finetuning. The Massive Multitask Language Understanding benchmark (MMLU) broadly tests reasoning and knowledge capacity across 57 tasks in the sciences, social sciences, humanities, business, health, among other subjects. BIG-Bench Hard (BBH) includes 23 challenging tasks from BIG-Bench (Srivastava et al., 2022) where PaLM under-performs human raters. In our ablations, we also evaluate BBH with Chain-of-Thought inputs, following Chung et al. (2022). Additional finetuning and evaluation details are provided in Appendix A.
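For readers who prefer a compact view, the evaluation suites above can be summarized as a small configuration; the structure and field names below are illustrative only and are not drawn from any released evaluation code.

```python
# Illustrative summary of the evaluation suites described in the setup;
# field names and structure are assumptions, not the released configuration.
EVAL_SUITES = {
    "held_in": {"num_tasks": 8, "composition": {"question_answering": 4, "nli": 4}},
    "chain_of_thought": {"num_tasks": 5},
    "held_out": {"MMLU": {"num_tasks": 57}, "BBH": {"num_tasks": 23, "also_evaluated_with_cot_inputs": True}},
}

for suite, config in EVAL_SUITES.items():
    print(f"{suite}: {config}")
```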
3.1 Ablation Studies
Table 1 summarizes the mean contribution to Held-In, Held-Out, and Chain-of-Thought tasks when methods are individually removed: mixture weight balancing (“- Mixture Balancing”), Chain-of-Thought tasks (“- CoT”), mixed prompt settings (“- Few-Shot Templates”), and input inversion (“- Input Inversion”). Flan-T5 XL leverages all four of these methods together. We also finetune T5-XL-LM on other collections, including Flan 2021, P3++, and Super-Natural Instructions, for comparison.
| Model | Held-In | CoT | MMLU | BBH | BBH-CoT |
|---|---|---|---|---|---|
| T5-XL Flan 2022 | 73.8 / 74.8 | 35.8 / 34.1 | 50.3 / 52.4 | 26.2 / 39.3 | 33.9 / 35.2 |
| - CoT | 73.3 / 73.2 | 28.8 / 24.6 | 47.5 / 46.9 | 18.2 / 30.0 | 18.2 / 12.0 |
| - Input Inversion | 73.8 / 74.1 | 32.2 / 23.5 | 41.7 / 41.2 | 18.4 / 24.2 | 15.7 / 13.0 |
| - Mixture Balancing | 71.2 / 73.1 | 32.3 / 30.5 | 45.4 / 45.8 | 15.1 / 24.3 | 13.8 / 15.4 |
| - Few-Shot Templates | 72.5 / 62.2 | 38.9 / 28.6 | 47.3 / 38.7 | 27.6 / 30.8 | 18.6 / 23.3 |
| T5-XL Flan 2021 | 68.4 / 56.3 | 24.6 / 22.7 | 41.4 / 34.8 | 28.1 / 28.3 | 26.0 / 26.9 |
| T5-XL P3++ | 70.5 / 62.8 | 25.6 / 25.6 | 46.1 / 34.1 | 26.0 / 30.8 | 23.4 / 26.1 |
| T5-XL Super-Natural Inst. | 50.3 / 42.2 | 13.8 / 14.3 | 35.6 / 31.1 | 10.4 / 15.6 | 8.0 / 12.5 |
| GLM-130B† | - | - | - / 44.8 | - | - |
| OPT-IML-Max 30B† | - | - | 46.3 / 43.2 | - / 30.9 | - |
| OPT-IML-Max 175B† | - | - | 49.1 / 47.1 | - / 35.7 | - |
| Flan 2022 - Next Best T5-XL | +3.3 / +12 | +10.2 / +8.5 | +4.2 / +17.6 | -1.9 / +8.5 | +7.9 / +8.3 |
Table 1: Method Ablations (top) show the importance of each method for Flan-T5 XL. Collection Ablations (bottom) evaluate Flan-T5 XL against T5-XL finetuned on other instruction tuning collections: Flan 2021, P3++, and Super-Natural Instructions. Flan 2022 - Next Best T5-XL shows the improvement of Flan-T5 XL over the next best T5-XL (comparably sized) finetuned on another collection. Metrics are reported in both zero-shot / few-shot settings across Held-In, Chain-of-Thought, and Held-Out (MMLU, BBH) tasks. † We also include the results reported by OPT-IML (Iyer et al., 2022) and GLM-130B (Zeng et al., 2022).
Each of the ablated components of Flan contributes improvements to different metrics: Chain-of-Thought training to Chain-of-Thought evaluation, input inversion to Held-Out evaluations (MMLU and BBH), few-shot prompt training to few-shot evaluations, and mixture balancing to all metrics.
As compared to T5-XL models trained on alternative instruction tuning collections (and their methods), Flan outperforms in almost every setting. While previous collections are tuned specifically to zero-shot prompts, Flan-T5 XL is tuned for either zero- or few-shot prompts. This yields performance margins of +3–10% for most of the zero-shot settings, and margins of 8–17% for the few-shot settings. Most impressively, Flan 2022 outperforms OPT-IML-Max’s much larger (10x) 30B and (58x) 175B models. Next, we isolate some of Flan 2022’s ablated methods individually, to examine the benefits of each.
3.2 Training with Mixed Prompt Settings
Prior work has shown a wide variety of input templates per task can improve performance. However, separate from the wording of the instruction template, these prior LLMs mostly tune with template sets targeted to a single prompt setting: for zero-shot prompting (Wei et al., 2021; Sanh et al., 2021; Aghajanyan et al., 2021; Aribandi et al., 2021) or for few-shot prompting (Min et al., 2022; Wang et al., 2022c).
An underappreciated design decision in InstructGPT (Ouyang et al., 2022) was to mix training templates for each of these prompt settings, rather than target a single setting. However, since Ouyang et al. (2022) do not examine this choice, we expected a performance trade-off in finetuning for zero-shot or few-shot prompting performance – particularly for smaller models. Instead, we find training with mixed zero- and few-shot prompts significantly improves performance in both settings – most surprisingly, even for models with only 3B parameters.
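To make mixed-prompt training concrete, the sketch below templatizes a training example as either a zero-shot or a few-shot prompt according to a target mixing ratio. The templates, function names, and the few_shot_rate value are illustrative assumptions, not the released Flan 2022 templatization code.

```python
import random

# Illustrative templates; the actual collection uses many variants per task.
ZERO_SHOT_TEMPLATE = "{instruction}\n\n{input}\nAnswer:"
FEW_SHOT_TEMPLATE = "{instruction}\n\n{exemplars}\n\n{input}\nAnswer:"


def format_example(instruction, x, y, exemplars, few_shot_rate=0.1, rng=random):
    """Return an (input, target) pair, templatized zero-shot or few-shot.

    With probability `few_shot_rate`, a few in-context exemplars are prepended;
    otherwise a plain zero-shot prompt is emitted. Mixing both settings during
    finetuning improves performance in both settings (Figure 3).
    """
    if exemplars and rng.random() < few_shot_rate:
        shots = "\n\n".join(f"{ex_x}\nAnswer: {ex_y}" for ex_x, ex_y in exemplars)
        prompt = FEW_SHOT_TEMPLATE.format(instruction=instruction, exemplars=shots, input=x)
    else:
        prompt = ZERO_SHOT_TEMPLATE.format(instruction=instruction, input=x)
    return prompt, y


if __name__ == "__main__":
    exemplars = [("Is the sky green?", "no"), ("Is water wet?", "yes")]
    prompt, target = format_example(
        "Answer the yes/no question.", "Is fire cold?", "no", exemplars, few_shot_rate=0.5
    )
    print(prompt, "->", target)
```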
Figure 3: Training jointly with zero-shot and few-shot prompt templates improves performance on both Held-In and Held-Out tasks. The stars indicate the peak performance in each setting.
Figure 3 shows (1) adding as little as 5% few-shot training templates can dramatically improve zero-shot performance, and (2) adding 10%+ of zero-shot data improves few-shot performance too. Both Held-In and Held-Out tasks peak anywhere between 10–90% of few-shot data, but performance across this range is consistently higher than training with only one prompt setting.
3.3 Scaling Small Models to 1.8K+ Tasks
The most recent and concurrent publicly available instruction tuning efforts, like Flan 2022, train on thousands of tasks (Wang et al., 2022c; Iyer et al., 2022), but operate on different task compositions and underlying training methods. To measure the impact of scaling model sizes and tasks for the Flan 2022 collection, we finetune T5-LM adapted models (Small, Base, Large, XL, XXL) on randomly selected task subsets (8, 25, 50, 100, 200, 400, 800, all 1873). Every finetuning run is guaranteed to include the Held-In tasks, so we can estimate how task scaling impacts the model’s capacity to maintain performance on a given task it has already seen.
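The kind of subset sampling this ablation implies can be sketched as follows: random task subsets of increasing size are drawn while the Held-In tasks are always retained. The task identifiers and helper name below are hypothetical, intended only to illustrate the sampling procedure.

```python
import random


def sample_task_subset(all_tasks, held_in_tasks, subset_size, seed=0):
    """Sample `subset_size` tasks, always retaining the Held-In tasks."""
    rng = random.Random(seed)
    held_in_set = set(held_in_tasks)
    remainder = [t for t in all_tasks if t not in held_in_set]
    num_extra = max(0, subset_size - len(held_in_set))
    return list(held_in_set) + rng.sample(remainder, num_extra)


if __name__ == "__main__":
    all_tasks = [f"task_{i:04d}" for i in range(1873)]   # 1873 tasks, as in the ablation
    held_in = all_tasks[:8]                              # stand-ins for the 8 Held-In tasks
    for size in (8, 25, 50, 100, 200, 400, 800, 1873):
        print(size, len(sample_task_subset(all_tasks, held_in, size)))
```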
Figure 4 demonstrates that both Held-In and Held-Out tasks appear to benefit from adding hundreds of finetuning tasks. Held-in task evaluations peak around 200 total tasks, and diminish in performance as more tasks are added, though larger models peak later and diminish less. Held-out task performance increases log-linearly with the number of tasks, achieving the highest performances with all 1836 tasks.
Figure 4: Performance Scaling Laws for the number of finetuning tasks and model sizes. Held-In performance (left) and Held-Out MMLU performance (right) are shown. The gold star indicates the peak performance for that model size.
Surprisingly, only T5-Small appears to peak in its Held-Out task performance before 1836 tasks, while larger model sizes continue to improve. These results suggest (a) even T5-Base may not have exhausted its capacity with thousands of tasks, and (b) the largest LMs could benefit from thousands more tasks for Held-In and Held-Out task performance.
One necessary assumption of this analysis is that all tasks are defined and counted equally. Section 3.5 demonstrates that not all task sources are equally beneficial to training, and model performance may saturate from too many tasks from one source (e.g., Super-Natural Instructions). We would caution against concluding that task scaling beyond 1800 tasks will translate to increased returns without also paying attention to task diversity and quality.
3.4 Task Enrichment with Input Inversion
Prior instruction tuning work has enriched their diversity of tasks by inverting the $(x, y)$ input-output pairs in supervised tasks—referred to as “prompts not intended for the original task” in P3 (Bach et al., 2022) or the “noisy channel” in MetaICL (Min et al., 2022). For example, a dataset may originally be designed to evaluate whether, given a question $x$, a model can answer $y$. Input inversion instead gives a model the answer $y$ and trains it to generate the question $x$. This is an easy method to enrich the task variety given a limited set of data sources. However, it isn’t clear that this method remains helpful when hundreds of unique data sources and thousands of tasks are already available.
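As a concrete illustration of input inversion, the sketch below turns a (question, answer) pair into an inverted task in which the model is asked to generate the question from the answer. The wording of the templates and the function name are illustrative assumptions, not the exact inversion templates in the collection.

```python
def invert_example(question: str, answer: str):
    """Create both the original and the input-inverted training example."""
    original = (
        f"Question: {question}\nWhat is the answer?",
        answer,
    )
    inverted = (
        f"The answer is: {answer}\nWrite a question to which this could be the answer.",
        question,
    )
    return original, inverted


if __name__ == "__main__":
    orig, inv = invert_example("What is the capital of France?", "Paris")
    for prompt, target in (orig, inv):
        print(prompt, "->", target)
```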
To assess this, we enrich our mixtures with input inverted tasks (details and examples in Appendix B) and measure the effect. In Table 1 we find this is not beneficial for Held-In performance, but strongly beneficial for Held-Out performance. These benefits invigorate the prospect of data augmentation techniques for LLM finetuning, which had previously been shown to have diminishing returns the longer models are pretrained (Longpre et al., 2020).
3.5 Balancing Data Sources
Scaling architecture size and the number of tasks are effective, but our results suggest the mixture weighting deserves as much attention to optimize results. To converge on a balanced weighting, we omit different sets of task sources one at a time (Flan 2021, T0-SF, Super-Natural Instructions, Chain-of-Thought, Dialog, and Program Synthesis), and rank their contributions on the MMLU benchmark.⁴
| Training Mixture | Held-In | CoT | MMLU |
|---|---|---|---|
| All (equal weights) | 64.9 | 41.4 | 47.3 |
| All - Flan 2021 | 55.3 | 38.6 | 45.7 |
| All - T0-SF | 63.2 | 43.4 | 44.7 |
| All - Super-Natural Instructions | 65.9 | 42.2 | 46.8 |
| All - CoT | 65.6 | 29.1 | 46.8 |
| All - Program Synthesis | 66.9 | 42.3 | 46.8 |
| All - Dialog | 65.4 | 40.3 | 47.1 |
| All (weighted) | 66.4 | 40.1 | 48.1 |
Table 2: Subsets of tasks are left out from an equally weighted mixture to measure their importance. T0-SF and Flan 2021 finetuning are most important for MMLU, while Chain-of-Thought (CoT) finetuning is most important for Chain-of-Thought evaluation.
As shown in Table 2, Flan 2021 and T0-SF are among the most beneficial mixtures, followed by Super-Natural Instructions and Chain-of-Thought, with Dialog and Program Synthesis last. These findings are corroborated by Iyer et al. (2022) who extensively test data mixing proportions, and also determine their Flan 2021, T0-SF, and T5 mixtures are the most broadly beneficial. Additionally, they find Super-Natural Instructions has limited scaling benefits on Held-Out task performance, which they relate to its unique input format and instruction design. Notably, Chain-of-thought finetuning appears beneficial across all our evaluation settings, especially considering they contain far fewer tasks than Flan 2021, T0-SF or Natural Instructions.
We used these findings to significantly narrow the mixture weights search space, and used our practitioner’s intuition from there. This strategy is simple but effective, as shown in Table 1, but leaves ample room for more sophisticated future work.
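To illustrate what source-level mixture weighting looks like in practice, the snippet below samples finetuning examples in proportion to per-source weights. The source names follow the text, but the weight values are placeholders chosen only to show the mechanism; they are not the Flan 2022 mixture rates.

```python
import random

# Hypothetical source-level weights (placeholders, not the Flan 2022 rates).
MIXTURE_WEIGHTS = {
    "flan_2021": 0.40,
    "t0_sf": 0.30,
    "super_natural_instructions": 0.20,
    "cot": 0.05,
    "dialog": 0.025,
    "program_synthesis": 0.025,
}


def sample_source(weights, rng=random):
    """Pick a task source in proportion to its mixture weight."""
    sources, probs = zip(*weights.items())
    return rng.choices(sources, weights=probs, k=1)[0]


if __name__ == "__main__":
    counts = {}
    for _ in range(10_000):
        source = sample_source(MIXTURE_WEIGHTS)
        counts[source] = counts.get(source, 0) + 1
    print(counts)  # empirical draw frequencies roughly track the weights
```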
3.6 Discussion
OPT-IML (Iyer et al., 2022) presents the closest comparison to this work, including a similar collection of tasks, examples, and techniques. However, while their tasks are all publicly sourced, their collection, with templates, processing, and example mixing, is not released, and as a result cannot be easily compared. Iyer et al. (2022) report that Flan-T5-XL (3B) and XXL (11B) outperform OPT-IML-Max 175B on both MMLU and BBH. As they discuss, these differences may arise from any combination of pre-training, model architecture, and instruction tuning. Model architecture and pre-training before instruction tuning can play a significant role (Wang et al., 2022a). But there are many other details in instruction tuning that may vary between Flan 2022 and OPT-IML. Likely candidates are: example templatization, how the mixed input prompting procedures are used at training, and task composition.
How significant is each of these differences? While OPT-IML contains more tasks than Flan 2022, we estimate approximately 94% (2067/2207) are also used in the Flan 2022 collection⁵, and very few tasks in Flan 2022 are not contained in some format in OPT-IML. This suggests the overall difference in task diversity is not significant when using a shared definition of “task”. Task mixture rates also emphasize similar sources, including Flan 2021 (46% vs 20%), PromptSource/P3 (28% vs 45%), and Super-Natural Instructions (25% vs 25%), for Flan 2022 and OPT-IML respectively. OPT-IML’s other collections (CrossFit, ExMix, T5, U-SKG) are not weighted significantly: 4%, 2%, 2%, and 2% respectively.
Figure 5: Flan-T5 Outperforms T5 on Single-Task Finetuning. We compare single-task finetuned T5, single-task finetuned Flan-T5, and Flan-T5 without any further finetuning.
We believe example templatization and the mixed prompt formats may pose the largest differences with OPT-IML’s instruction tuning. Our template repository was significantly updated from Flan 2021, adding variety not just in instructions, but also along other dimensions. For instance, the templatization procedure varies where the instruction is placed (before or after few-shot prompts), the spacing and separators between few-shot and Chain-of-Thought prompts, and the formatting permutations of answer options (and their targets) for multiple-choice examples, which sometimes includes and sometimes excludes answer options in the inputs or exemplars. While we do not have dedicated experiments comparing many iterations of development, we found these procedures dramatically augment input variety and showed repeated performance improvements. Our example templatizing procedure is open sourced for inspection and future work.
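To illustrate the kind of formatting permutations described above, the sketch below randomizes where the instruction is placed, the separator between exemplars, and whether answer options are listed for a multiple-choice input. The specific separators, option letters, and function name are illustrative assumptions rather than the released templatization code.

```python
import random


def templatize_multiple_choice(instruction, question, options, exemplars=(), rng=random):
    """Format a multiple-choice example with randomized layout choices."""
    separator = rng.choice(["\n\n", "\n###\n", "\n---\n"])   # exemplar separator
    instruction_first = rng.choice([True, False])            # instruction before or after exemplars
    include_options = rng.choice([True, False])              # list answer options or omit them

    shots = separator.join(f"{q}\nAnswer: {a}" for q, a in exemplars)
    body = question
    if include_options:
        letters = "ABCDEFGH"
        listed = "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
        body = f"{question}\nOptions:\n{listed}"

    parts = [instruction, shots, body] if instruction_first else [shots, instruction, body]
    return separator.join(p for p in parts if p) + "\nAnswer:"


if __name__ == "__main__":
    print(templatize_multiple_choice(
        "Pick the best answer.",
        "Which planet is known as the Red Planet?",
        ["Venus", "Mars", "Jupiter"],
        exemplars=[("2 + 2 = ?", "4")],
    ))
```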
4 Instruction Tuning Enhances Single-Task Finetuning
In applied settings, machine learning practitioners deploy NLP models finetuned (FT) specifically for a single target task, usually where finetuning data is already available. While prior work has shown the benefits of intermediate finetuning (Pruksachatkun et al., 2020; Vu et al., 2020) or multi-task finetuning (Aghajanyan et al., 2021; Aribandi et al., 2021) for downstream tasks, this has not been studied extensively for instruction-tuned models.
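As a concrete illustration of using an instruction-tuned checkpoint as the starting point for single-task finetuning, below is a minimal sketch using the Hugging Face transformers library. The google/flan-t5-base checkpoint is a real public release (the paper's experiments use the XL size), but the toy dataset, hyperparameters, and training loop are illustrative assumptions, not the paper's finetuning setup.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # smaller size for illustration; the paper uses XL (3B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy single-task data (hypothetical); in practice, the target task's train set.
pairs = [
    ("premise: A man is sleeping. hypothesis: A man is awake. entailment?", "no"),
    ("premise: A dog runs outside. hypothesis: An animal is outdoors. entailment?", "yes"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for epoch in range(3):
    for source, target in pairs:
        inputs = tokenizer(source, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```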
We evaluate Flan 2022 instruction tuning as an intermediary step before single target finetuning, to understand if Flan-T5 would serve as a better starting checkpoint for applied practitioners. We evaluate three settings, compared in Figure 5: single-task finetuned T5, single-task finetuned Flan-T5, and Flan-T5 without further finetuning.