Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
BING YIN, Amazon, USA
XIA HU, Department of Computer Science, Rice University, USA
This paper presents a comprehensive and practical guide for practitioners and end-users working with Large Language Models (LLMs) in their downstream natural language processing (NLP) tasks. We provide discussions and insights into the usage of LLMs from the perspectives of models, data, and downstream tasks. Firstly, we offer an introduction and brief summary of current GPT- and BERT-style LLMs. Then, we discuss the influence of pre-training data, training data, and test data. Most importantly, we provide a detailed discussion about the use and non-use cases of large language models for various natural language processing tasks, such as knowledge-intensive tasks, traditional natural language understanding tasks, natural language generation tasks, emergent abilities, and considerations for specific tasks. We present various use cases and non-use cases to illustrate the practical applications and limitations of LLMs in real-world scenarios. We also try to understand the importance of data and the specific challenges associated with each NLP task. Furthermore, we explore the impact of spurious biases on LLMs and delve into other essential considerations, such as efficiency, cost, and latency, to ensure a comprehensive understanding of deploying LLMs in practice. This comprehensive guide aims to provide researchers and practitioners with valuable insights and best practices for working with LLMs, thereby enabling the successful implementation of these models in a wide range of NLP tasks. A curated list of practical guide resources for LLMs, regularly updated, can be found at https://github.com/Mooler0410/LLMsPracticalGuide.
CCS Concepts: • Computing methodologies → Natural language processing; Natural language generation; Machine translation.
Additional Key Words and Phrases: Large Language Models, Natural Language Processing, Practical Guide, ChatGPT
1 INTRODUCTION
In recent years, the rapid development of Large Language Models has been revolutionizing the field of natural language processing [12, 128, 131]. These powerful models have shown great potential in addressing a variety of NLP tasks, ranging from natural language understanding (NLU) to generation tasks, even paving the way to Artificial General Intelligence (AGI). However, utilizing these models effectively and efficiently requires a practical understanding of their capabilities and limitations, as well as of the data and tasks involved in NLP.
To provide a guide for practitioners and end-users, this work focuses on the practical aspects of working with LLMs in downstream NLP tasks. This guide aims to provide practical advice on why or why not to choose LLMs for a given task, as well as guidance on how to select the most suitable LLM, taking into account factors such as model sizes, computational requirements, and the availability of domain-specific pre-trained models. This work offers a thorough understanding of LLMs from a practical perspective, thereby empowering practitioners and end-users with the practical knowledge needed to successfully leverage the power of LLMs for their own NLP tasks.
Our work is structured as follows. First, our work offers a brief introduction to LLMs by discussing the most important models, such as GPT-style and BERT-style architectures. Then, we delve into the critical factors that influence model performance from the data perspective, including pre-training data, training/tuning data, and test data. Last and most importantly, we dive deep into various concrete NLP tasks, offering insights into the applicability of LLMs for knowledge-intensive tasks, traditional NLU tasks, and generation tasks, along with the emergent abilities that these models possess and challenging real-world scenarios. We provide detailed examples to highlight both the successful use cases and the limitations of LLMs in practice.
To analyze the abilities of large language models, we compare them with fine-tuned models. At present, there is no universally recognized definition for LLMs and fine-tuned models. With practical utility in mind, we propose the following working definitions: LLMs are huge language models pretrained on massive datasets without being tuned on data for specific tasks; fine-tuned models are typically smaller language models which are also pretrained and then further tuned on a smaller, task-specific dataset to optimize their performance on that task.
This work summarizes the following main practical guides for using LLMs:
2 PRACTICAL GUIDE FOR MODELS
This section provides a brief introduction to state-of-the-art LLMs. These models differ in their training strategies, model architectures, and use cases. To provide a clearer understanding of the LLM landscape, we categorize them into two types: encoder-decoder or encoder-only language models and decoder-only language models. In Figure 1, we show the detailed evolution process of language models. From the evolutionary tree, we make the following interesting observations:
a) Decoder-only models have been gradually dominating the development of LLMs. In the early stage of LLM development, decoder-only models were not as popular as encoder-only and encoder-decoder models. However, after 2021, with the introduction of the game-changing GPT-3, decoder-only models experienced a significant boom. Meanwhile, after the initial explosive growth brought about by BERT, encoder-only models gradually began to fade away.

Fig. 1. The evolutionary tree of modern LLMs traces the development of language models in recent years and highlights some of the most well-known models. Models on the same branch have closer relationships. Transformer-based models are shown in non-grey colors: decoder-only models in the blue branch, encoder-only models in the pink branch, and encoder-decoder models in the green branch. The vertical position of the models on the timeline represents their release dates. Open-source models are represented by solid squares, while closed-source models are represented by hollow ones. The stacked bar plot in the bottom right corner shows the number of models from various companies and institutions.
b) OpenAI consistently maintains its leadership position in LLMs, both currently and potentially in the future. Other companies and institutions are struggling to catch up with OpenAI in developing models comparable to GPT-3 and the current GPT-4. This leadership position may be attributed to OpenAI's steadfast commitment to its technical path, even when it was not widely acknowledged initially.

c) Meta contributes significantly to open-source LLMs and promotes research on LLMs. When considering contributions to the open-source community, particularly those related to LLMs, Meta stands out as one of the most generous commercial companies, as all the LLMs it has developed are open-sourced.

d) LLMs exhibit a tendency towards closed-sourcing. In the early stages of LLM development (before 2020), the majority of models were open-sourced. However, with the introduction of GPT-3, companies have increasingly
Table 1. Summary of Large Language Models.
| Architecture | Training | Model type | Pretrain task | Representative LLMs |
|---|---|---|---|---|
| Encoder-Decoder or Encoder-only (BERT-style) | Masked Language Models | Discriminative | Predict masked words | ELMo [80], BERT [28], RoBERTa [65], DistilBERT [90], BioBERT [57], XLM [54], XLNet [119], ALBERT [55], ELECTRA [24], T5 [84], GLM [123], XLM-E [20], ST-MoE [133], AlexaTM [95] |
| Decoder-only (GPT-style) | Autoregressive Language Models | Generative | Predict next word | GPT-3 [16], OPT [126], PaLM [22], BLOOM [92], MT-NLG [93], GLaM [32], Gopher [83], Chinchilla [41], LaMDA [102], GPT-J [107], LLaMA [103], GPT-4 [76], BloombergGPT [117] |
opted to close-source their models, such as PaLM, LaMDA, and GPT-4. Consequently, it has become more difficult for academic researchers to conduct experiments on LLM training, and API-based research could become the predominant method in the academic community.

e) Encoder-decoder models remain promising, as this type of architecture is still being actively explored, and most of them are open-sourced. Google has made substantial contributions to open-source encoder-decoder architectures. However, the flexibility and versatility of decoder-only models seem to make Google's insistence on this direction less promising.
We also briefly summarize the characteristics and the representative LLMs of each type in Table 1.
2.1 BERT-style Language Models: Encoder-Decoder or Encoder-only
As natural language data is readily available, unsupervised training paradigms have been proposed to better utilize extremely large datasets, motivating the unsupervised learning of natural language. One common approach is to predict masked words in a sentence while considering the surrounding context; this training paradigm is known as the Masked Language Model (MLM). This type of training allows the model to develop a deeper understanding of the relationships between words and the contexts in which they are used. These models are trained on large text corpora using techniques such as the Transformer architecture and have achieved state-of-the-art results in many NLP tasks, such as sentiment analysis and named entity recognition. Notable examples of Masked Language Models include BERT [28], RoBERTa [65], and T5 [84]. MLMs have become an important tool in the field of natural language processing due to their success in a wide range of tasks.
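The masked-word objective described above can be sketched in a few lines: a fraction of the tokens in a sentence is replaced by a `[MASK]` symbol, and the model must recover the originals from the surrounding context. The toy function below only illustrates how such training pairs are formed; the `mask_tokens` name, the 15% rate, and whole-word masking are simplifications (real BERT-style pipelines mask subword tokens and sometimes substitute random words instead of `[MASK]`):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """Form one masked-LM training pair: hide a random subset of tokens
    and record, per position, the original word the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # position -> original word to recover
        else:
            masked.append(tok)
    return masked, targets

tokens = "the model learns context from both directions".split()
masked, targets = mask_tokens(tokens)
```

Because the targets can sit anywhere in the sequence, the model sees context from both directions, which is what makes this objective discriminative rather than generative.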
2.2 GPT-style Language Models: Decoder-only
Although language models are typically task-agnostic in architecture, these methods require fine-tuning on datasets of the specific downstream task. Researchers found that scaling up language models significantly improves few-shot, and even zero-shot, performance [16]. The most successful models for better few-shot and zero-shot performance are autoregressive language models, which are trained by generating the next word in a sequence given the preceding words. These models have been widely used for downstream tasks such as text generation and question answering. Examples of autoregressive language models include GPT-3 [16], OPT [126], PaLM [22], and BLOOM [92]. The game changer, GPT-3, for the first time demonstrated reasonable few-/zero-shot performance via prompting and in-context learning, showing the superiority of autoregressive language models. There are also models optimized for specific tasks, such as CodeX [2] for code generation and BloombergGPT [117] for the financial domain. The recent breakthrough is ChatGPT, which refines GPT-3 specifically for conversational tasks, resulting in more interactive, coherent, and context-aware conversations for various real-world applications.
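The next-word pretraining objective of GPT-style models can be illustrated by how training pairs are derived from a sequence: every prefix is paired with the single token that follows it. A minimal sketch (the `next_word_pairs` helper is illustrative, not from any library):

```python
def next_word_pairs(tokens):
    """Autoregressive pretraining pairs: every prefix of the sequence
    is matched with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs("language models predict the next word".split())
```

Since each prediction conditions only on the preceding words, the same model can generate text at inference time simply by sampling one token and appending it to the context.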
3 PRACTICAL GUIDE FOR DATA
In this section, we discuss the critical role that data plays in selecting appropriate models for downstream tasks. The impact of data on model effectiveness starts during the pre-training stage and continues through the training and inference stages.
Remark 1
3.1 Pretraining data
Pretraining data plays a pivotal role in the development of large language models. As the foundation of the remarkable capabilities [5, 47] of LLMs, the quality, quantity, and diversity of pretraining data significantly influence the performance of LLMs [124]. Commonly used pretraining data consists of a myriad of text sources, including books, articles, and websites. The data is carefully curated to ensure a comprehensive representation of human knowledge, linguistic nuances, and cultural perspectives. The importance of pretraining data lies in its capacity to inform the language model with a rich understanding of word knowledge, grammar, syntax, and semantics, as well as the ability to recognize context and generate coherent responses. The diversity of pretraining data also plays a crucial role in shaping the model's performance, and the selection of LLMs highly depends on the components of the pretraining data. For example, PaLM [22] and BLOOM [92] excel in multilingual tasks and machine translation thanks to an abundance of multilingual pretraining data. Moreover, PaLM's performance in question answering tasks is enhanced by incorporating a considerable amount of social media conversations and the Books corpus [22]. Likewise, the code execution and code completion capabilities of GPT-3.5 (code-davinci-002) are amplified by the integration of code data in its pretraining dataset. In brief, when selecting LLMs for downstream tasks, it is advisable to choose a model pretrained on a similar field of data.
3.2 Finetuning data
When deploying a model for downstream tasks, it is essential to consider three primary scenarios based on the availability of annotated data: zero, few, and abundant. In this section, we provide a succinct overview of the appropriate models to employ for each scenario.
Zero annotated data: In scenarios where annotated data is unavailable, utilizing LLMs in a zero-shot setting proves to be the most suitable approach. LLMs have been shown to outperform previous zero-shot methods [120]. Additionally, the absence of a parameter update process ensures that catastrophic forgetting [49] is avoided since the language model parameters remain unaltered.
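In practice, a zero-shot call reduces to prompt construction: the task is described in natural language with no examples attached, and no parameters are updated. A minimal, hypothetical prompt template (the format, wording, and `zero_shot_prompt` name are illustrative, not a prescribed API):

```python
def zero_shot_prompt(instruction, text):
    """Zero-shot prompting: describe the task in plain language and
    attach the input; no labeled examples, no weight updates."""
    return f"{instruction}\n\nInput: {text}\nAnswer:"

prompt = zero_shot_prompt(
    "Classify the sentiment of the input as positive or negative.",
    "The movie was a complete waste of time.",
)
```

The resulting string would be sent to the LLM as-is; because the model's weights never change, the catastrophic-forgetting concern mentioned above does not arise.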
Few annotated data: In this case, the few-shot examples are directly incorporated into the input prompt of LLMs, which is known as in-context learning, and these examples can effectively guide LLMs to generalize to the task. As reported in [16], one-shot and few-shot performance makes significant gains, even matching the performance of SOTA fine-tuned open-domain models, and LLMs' zero-/few-shot ability can be improved further by scaling [16]. Alternatively, some few-shot learning methods have been developed to enhance fine-tuned models, such as meta-learning [56] or transfer learning [88]. However, their performance might be inferior to that of LLMs due to fine-tuned models' smaller scale and overfitting.
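In-context learning likewise amounts to prompt construction: a handful of labeled demonstrations are concatenated ahead of the query, and the model is expected to continue the pattern. A hedged sketch (the template and the `few_shot_prompt` helper are illustrative, not from any library):

```python
def few_shot_prompt(instruction, examples, query):
    """In-context learning: labeled demonstrations are placed directly
    in the prompt; the model continues the pattern for the final query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Label: {label}", ""]
    lines += [f"Input: {query}", "Label:"]
    return "\n".join(lines)

demo = few_shot_prompt(
    "Decide whether the premise entails the hypothesis (yes or no).",
    [("A dog runs. / An animal moves.", "yes"),
     ("A man sleeps. / A man cooks.", "no")],
    "A child sings. / Someone makes a sound.",
)
```

Note that the demonstrations only condition the forward pass; nothing is fine-tuned, so the same base model serves every task.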
Abundant annotated data: With a substantial amount of annotated data for a particular task available, both fine-tuned models and LLMs can be considered. In most cases, fine-tuning the model can fit the data well, although LLMs can be used to meet constraints such as privacy [99]. In this scenario, the choice between a fine-tuned model and an LLM is task-specific and also depends on many factors, including desired performance, computational resources, and deployment constraints.
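The three scenarios above can be condensed into a rough decision helper. The numeric cutoff and the return labels are of course illustrative simplifications, not prescriptions from this paper:

```python
def choose_approach(n_labeled, privacy_constraint=False):
    """Condensed rule of thumb from this section: no labels -> zero-shot
    LLM; a handful -> few-shot in-context learning; abundant labels ->
    fine-tune, unless deployment constraints (e.g. privacy) favor an LLM."""
    if n_labeled == 0:
        return "zero-shot LLM"
    if n_labeled < 100:  # "few" is task-dependent; 100 is an arbitrary cutoff
        return "few-shot in-context learning"
    return "LLM" if privacy_constraint else "fine-tuned model"
```

In reality the abundant-data branch also weighs desired performance, compute budget, and latency, as noted above; the helper captures only the data-availability axis.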
In a brief summary: LLMs are more versatile w.r.t. the data availability, while fine-tuned models can be considered with abundant annotated data.
3.3 Test data/user data
When deploying LLMs for downstream tasks, we often face challenges stemming from distributional differences between the test/user data and the training data. These disparities may encompass domain shifts [132], out-of-distribution variations [31], or even adversarial examples [82]. Such challenges significantly hinder fine-tuned models' effectiveness in real-world applications: they fit a specific distribution and have a poor ability to generalize to OOD data. However, LLMs perform quite well in such scenarios because they do not have an explicit fitting process. Moreover, recent advancements have further enhanced the ability of language models in this regard. The Reinforcement Learning from Human Feedback (RLHF) method has notably enhanced LLMs' generalization capabilities [77]. For example, InstructGPT demonstrates proficiency in following various instructions for a wide range of tasks and occasionally complying with instructions in different languages, even though such instructions are scarce. Similarly, ChatGPT exhibits consistent advantages on most adversarial and out-of-distribution (OOD) classification and translation tasks [109]. Its superiority in understanding dialogue-related texts led to an impressive performance on the DDXPlus dataset [101], a medical diagnosis dataset designed for OOD evaluation.
4 PRACTICAL GUIDE FOR NLP TASKS
In this section, we discuss in detail the use cases and non-use cases of LLMs in various downstream NLP tasks, along with the corresponding model abilities. In Figure 2, we summarize all discussions into a decision flow, which can serve as a guide for quick decisions when facing a task.
4.1 Traditional NLU tasks
Traditional NLU tasks are some fundamental tasks in NLP including text classification, named entity recognition (NER), entailment prediction, and so on. Many of them are designed to serve as intermediate steps in larger AI systems, such as NER for knowledge graph construction.

Fig. 2. The decision flow for choosing LLMs or fine-tuned models for users' NLP applications. The decision flow helps users assess whether their downstream NLP applications at hand meet specific conditions and, based on that evaluation, determine whether LLMs or fine-tuned models are the most suitable choice for their applications. During the decision process in the figure, Y means meeting the condition, and N means not meeting the condition. The yellow circle for Y of the last condition means there is no model working well on this kind of application.
Remark 2
Fine-tuned models are generally a better choice than LLMs in traditional NLU tasks, but LLMs can help when strong generalization ability is required.
4.1.1 No use case. In most natural language understanding tasks, such as tasks in GLUE[106] and SuperGLUE[105], fine-tuned models still have better performance, if such tasks come with rich well-annotated data and contain very few out-of-distribution examples on test sets. For different tasks and datasets, the gap between small fine-tuned models and LLMs varies.
In text classification, on most datasets, LLMs perform slightly worse than fine-tuned models. For sentiment analysis, such as on IMDB [69] and SST [94], fine-tuned models and LLMs perform equally well. For toxicity detection, another iconic text classification task, the gap is much larger: no LLM performs well on this task, and on Civil Comments [13] even the best one is only better than random guessing [59]. On the other hand, most popular fine-tuned models can obtain much better performance [33], and the Perspective API is still one of the best tools for detecting toxicity. This API is powered by a multilingual BERT-based model, which is tuned on publicly available toxicity data, and several smaller single-language CNNs distilled from this model. This might be due to the fact that toxicity is defined by subtle nuances in linguistic expressions, which large language models are unable to accurately comprehend solely from the provided input.
The trend of performance gaps is similar in some other tasks. For natural language inference (NLI) tasks, on most datasets, such as on RTE [106] and SNLI [14], fine-tuned models perform better than LLMs, while on some data such as CB [105], LLMs have obtained comparable performance with fine-tuned models [22]. For question answering (QA), on
SQuADv2 [86], QuAC [21] and many other datasets, fine-tuned models have superior performance, while on CoQA [87], LLMs perform as well as fine-tuned models [22].
In information retrieval (IR) tasks, LLMs have not been widely exploited yet. One major reason is that IR tasks are fundamentally different from others: there is no natural way to transform the thousands of candidate texts into the few-/zero-shot form required by LLMs. The existing evaluation results on MS MARCO (regular/TREC) [73] show that methods based on fine-tuned models perform better [59]. In this evaluation, the LLMs rank passages in an unorthodox way that requires the LLMs to produce probabilities for the passages one by one.
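The "one passage at a time" scoring scheme just described can be sketched as follows: each candidate passage receives a relevance score from the model, and passages are sorted by that score. Below, a toy term-overlap function stands in for the LLM's per-passage probability; both `rank_passages` and `overlap_score` are illustrative names, not a real retrieval API:

```python
def rank_passages(query, passages, score_fn):
    """Score every candidate passage for the query one by one, then sort
    by descending relevance; score_fn stands in for an LLM call that
    returns, e.g., the probability the passage answers the query."""
    scored = [(score_fn(query, p), p) for p in passages]
    return [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)]

def overlap_score(query, passage):
    """Toy stand-in scorer: fraction of query terms found in the passage."""
    q = {w.strip(".,").lower() for w in query.split()}
    p = {w.strip(".,").lower() for w in passage.split()}
    return len(q & p) / len(q)

ranked = rank_passages(
    "capital of france",
    ["Berlin is the capital of Germany.", "Paris is the capital of France."],
    overlap_score,
)
```

The cost of this scheme is exactly what the text calls unorthodox: one model call per passage, which scales poorly to the thousands of candidates typical of IR benchmarks.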
For some low-level intermediate tasks, which are intended not for regular users but for high-level tasks, such as named entity recognition (NER) and dependency parsing, there are not enough results from LLMs, because most current evaluations of LLMs focus on practical tasks. According to available evaluation results, for the NER task, CoNLL03 [89] is still a challenge for LLMs [81]: the performance of fine-tuned models is around twice that of LLMs. These intermediate tasks may vanish soon, because LLMs can take over high-level tasks without the help of those intermediate tasks (e.g., dependency parsing for coding tasks; NER for some text generation tasks).
In brief, for most traditional NLU tasks, a fine-tuned model is a better choice in terms of both performance on benchmark datasets and computational cost: the scale of LLMs is usually $10\times$ or even $100\times$ larger than that of fine-tuned models. One possible cause of the inferior performance of LLMs on certain tasks is the design of instructions/prompts; transforming inputs from tasks like IR and sentence labeling into a few-/zero-shot instruction form is non-trivial, and better ways to adapt language models to traditional NLP tasks may emerge in the future. On the other hand, the upper limit of the capabilities of fine-tuned models has not been reached, and methods like FLAN-tuning [67] can further boost performance on NLU tasks. Another interesting finding is that on NLU tasks, after fine-tuning, masked language models such as T5 [85] are better than most autoregressive language models at the same scale, though some recent results imply that this gap can be bridged by scaling [22].
4.1.2 Use case. However, there are still some NLU tasks suitable for LLMs.
One representative task is miscellaneous text classification [59]. In contrast to classic domain-specific text classification tasks such as sentiment analysis, miscellaneous text classification deals with a diverse range of topics and categories that may not have a clear or strong relationship with one another. It is closer to real-world cases and hard to format for fine-tuned models. Another is Adversarial NLI (ANLI) [74], a difficult dataset composed of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). LLMs have shown superior performance on ANLI, especially on R2 and R3. Both examples demonstrate the exceptional ability of LLMs to generalize well on out-of-distribution and sparsely annotated data in traditional NLP tasks, surpassing that of fine-tuned models, as discussed in Section 3.3 above.
4.2 Generation tasks
Natural Language Generation broadly encompasses two major categories of tasks, with the goal of creating coherent, meaningful, and contextually appropriate sequences of symbols. The first type focuses on converting input texts into new symbol sequences, as exemplified by tasks like paragraph summarization and machine translation. The second type, "open-ended" generation, aims to generate text or symbols from scratch to accurately match input descriptions, such as crafting emails, composing news articles, creating fictional stories, and writing code.
Remark 3
Due to their strong generation ability and creativity, LLMs show superiority at most generation tasks.
4.2.1 Use case. Generation tasks require models to have a comprehensive understanding of the input contents or requirements and a certain level of creativity. This is what LLMs excel at.
For summarization tasks, although LLMs do not have an obvious advantage over fine-tuned models under traditional automatic evaluation metrics such as ROUGE [60], human evaluation results indicate that humans tend to prefer the results generated by LLMs [38, 127] over those of fine-tuned models. For example, on CNN/DailyMail [71] and XSUM [72], fine-tuned models like BRIO [66] and Pegasus [125] have much better performance than any LLMs w.r.t. ROUGE, but LLMs like OPT [126] perform far better in human evaluation considering all aspects, including faithfulness, coherence, and relevance [127]. This demonstrates the superiority of LLMs in summarization tasks. On the other hand, it implies that current summarization benchmarks do not contain high-quality summaries, or that the automatic metrics are not suitable for evaluating summarization.
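For reference, ROUGE-1 is simply unigram overlap between a candidate summary and a reference, combined into precision/recall/F1. A minimal pure-Python version (real evaluations use the official ROUGE toolkit, which adds stemming, ROUGE-2/L variants, and multi-reference support):

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Minimal ROUGE-1 F1: clipped unigram overlap between a candidate
    summary and a single reference summary."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # per-word counts clipped to the reference
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat", "the cat sat on the mat")
```

A metric this shallow rewards word copying rather than faithfulness or coherence, which helps explain the gap between ROUGE rankings and human preferences reported above.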
In machine translation (MT), LLMs can perform competent translation, although the average performance is slightly worse than that of some commercial translation tools [45] under automatic metrics such as BLEU [78]. LLMs are particularly good at translating low-resource language texts into English: for example, on the Romanian-English translation task of WMT'16 [11], zero-shot or few-shot LLMs outperform the SOTA fine-tuned model [22]. This is mainly because English resources compose the main part of the pre-training data. BLOOM [92] is pre-trained on more multilingual data, leading to better translation quality in both rich-resource and low-resource translation. Another interesting finding is that BLOOM achieves good translation quality among Romance languages, even for translation from Galician, which is not included in the pre-training data. One reasonable explanation is that texts from other languages in the same language group help the LLM learn from their similarity. If more multilingual texts were added to the pre-training data, the translation capability might improve further.
在机器翻译(MT)领域,大语言模型(LLM)能够胜任翻译任务,但根据BLEU[78]等自动评估指标,其平均表现略逊于部分商业翻译工具[45]。大语言模型尤其擅长将某些低资源语言文本翻译为英文,例如在WMT'16罗马尼亚语-英语翻译任务中[11],零样本或少样本的大语言模型表现优于经过精调的SOTA模型[22]。这主要得益于英语资源构成了预训练数据的主体部分。BLOOM[92]使用了更多多语言数据进行预训练,从而在富资源和低资源翻译任务中都获得了更优的翻译质量。另一个有趣的发现是,BLOOM在罗曼语族语言间实现了高质量的翻译,甚至对于预训练数据未包含的加利西亚语翻译也表现良好。合理的解释是,同语系中某些语言的文本能帮助大语言模型从相似性中学习更多知识。若能在预训练数据中加入更多多语言文本,其翻译能力有望进一步提升。
Additionally, LLMs are highly skilled in open-ended generation. For example, news articles generated by LLMs are almost indistinguishable from real news articles to humans [16]. LLMs are remarkably adept at code synthesis as well. Whether for text-to-code generation, such as HumanEval [18] and MBPP [7], or for code repair, such as DeepFix [39], LLMs perform quite well. GPT-4 can even solve 25% of Leetcode problems, which are not trivial for most human coders [76]. With training on more code data, the coding capability of LLMs can be improved further [22]. While LLMs perform well on such tasks, the code they generate should be tested carefully to catch subtle bugs, which is one of the main challenges of applying LLMs to code synthesis.
此外,大语言模型(LLM)在开放式生成任务中表现出色。例如,它们生成的新闻文章几乎与人类撰写的真实新闻难以区分[16]。大语言模型在代码合成方面也展现出非凡能力,无论是文本到代码生成(如HumanEval[18]和MBPP[7]),还是代码修复(如DeepFix[39])都能出色完成。GPT-4甚至能解决Leetcode中25%的问题[76],这对多数人类程序员也非易事。随着代码训练数据的增加,大语言模型的编程能力还能持续提升[22]。但需注意,尽管在这些任务中表现优异,大语言模型生成的代码仍需仔细测试以发现潜在漏洞,这是其应用于代码合成领域的主要挑战之一。
4.2.2 No use case. Fine-tuned models, such as DeltaLM + Zcode [118], still perform best on most rich-resource translation and extremely low-resource translation tasks. In rich-resource machine translation, fine-tuned models slightly outperform LLMs [22, 92]. And in extremely low-resource machine translation, such as English-Kazakh translation, fine-tuned models perform significantly better than LLMs.
4.2.2 无适用场景。经过微调 (fine-tuned) 的模型(如 DeltaLM + Zcode [118])在大多数高资源翻译和极低资源翻译任务中仍保持最佳表现。在高资源机器翻译场景中,微调模型略微优于大语言模型 [22, 92];而在极低资源机器翻译(如英语-哈萨克语翻译)中,微调模型显著优于大语言模型。
4.3 Knowledge-intensive tasks
4.3 知识密集型任务
Knowledge-intensive NLP tasks refer to a category of tasks that rely strongly on background knowledge, domain-specific expertise, or general real-world knowledge. These tasks go beyond simple pattern recognition or syntax analysis, and are highly dependent on the memorization and proper utilization of knowledge about specific entities, events, and common sense of our real world.
知识密集型 NLP (Natural Language Processing) 任务指一类高度依赖背景知识、领域专业知识或现实世界常识的任务。这类任务超越了简单的模式识别或句法分析,其核心在于对特定实体、事件及现实世界常识的记忆与合理运用。
Remark 4
备注 4
(1) LLMs excel at knowledge-intensive tasks due to their massive real-world knowledge. (2) LLMs struggle when the knowledge requirements do not match their learned knowledge, or when they face tasks that only require contextual knowledge, in which case fine-tuned models can work as well as LLMs.
(1) 大语言模型(LLM)因其庞大的现实世界知识储备,在知识密集型任务中表现出色。
(2) 当任务所需知识与模型所学不匹配,或仅需上下文知识的场景时,大语言模型表现欠佳,此时经过微调的模型能达到与大语言模型相当的效果。
4.3.1 Use case. In general, with billions of training tokens and parameters, LLMs have much more real-world knowledge than fine-tuned models.
4.3.1 用例。通常来说,经过数十亿训练token和参数训练的大语言模型(LLM)比微调模型具备更丰富的现实世界知识。
Closed-book question-answering tasks require the model to answer a given question about factual knowledge without any external information, so they demand the memorization of real-world knowledge in the model. LLMs perform better on nearly all such datasets, including Natural Questions [52], Web Questions [9], and TriviaQA [46]. On TriviaQA, even zero-shot LLMs are still much better [22].
闭卷问答任务要求模型在没有任何外部信息的情况下回答关于事实知识的问题。这需要模型记忆现实世界的知识。大语言模型在几乎所有数据集上都表现更好,例如 Natural Questions [52]、Web Questions [9] 和 TriviaQA [46]。在 TriviaQA 上,即使是零样本的大语言模型也仍然表现优异 [22]。
The massive multitask language understanding (MMLU) benchmark [40] is also highly knowledge-intensive. It contains multiple-choice questions spanning 57 different subjects and requires general knowledge from the model. It is quite challenging even for LLMs, although the newly released GPT-4 [76] outperforms existing models by a considerable margin in English with a satisfactory 86.5% accuracy.
大规模多任务语言理解 (MMLU) [40] 同样具有高度知识密集型特征。该评测包含涵盖57个不同学科的多选题,要求模型具备广泛的知识储备。即便对于大语言模型而言,这也极具挑战性——尽管新发布的GPT-4 [76] 在英语测试中以86.5%的准确率显著超越现有模型。
Also, some tasks in Big-bench [96], which are designed to probe LLMs and extrapolate their future capabilities, rely heavily on the memorization of real-world knowledge. On such tasks, the performance of some LLMs is better than the average human level, and even comparable to the best human performance. For example, the task Hindu knowledge requires models to give facts about Hindu mythology, Periodic Elements requires predicting an element name from the periodic table, and Physics tests the physics knowledge of models by asking for the formula needed to solve a given physics problem.
此外,Big-bench[96]中的某些任务旨在探究大语言模型并推断其未来能力,这些任务严重依赖对现实世界知识的记忆。在此类任务中,部分大语言模型的表现优于人类平均水平,甚至可与人类最佳表现相媲美。例如:
- Hindu knowledge任务要求模型提供印度教神话相关的事实
- Periodic Elements任务需要根据元素周期表预测元素名称
- Physics任务通过要求模型给出解决特定物理问题所需的公式来测试其物理知识
4.3.2 No use case. Some other tasks require knowledge different from that learned by LLMs; the knowledge they need is not the real-world knowledge LLMs have acquired. In such tasks, LLMs are not notably superior.
4.3.2 无适用场景。存在某些任务所需知识与大语言模型所学知识存在本质差异,这类任务需要的并非大语言模型所掌握的现实世界知识。在此类任务中,大语言模型并不具备显著优势。
Some tasks only require the model to capture the self-contained knowledge in the contexts. The knowledge in the contexts from the input is enough for the model to make predictions. For these tasks, small fine-tuned models can work pretty well. One such task is machine reading comprehension (MRC). An MRC task provides several paragraphs and requires the model to predict the answer to questions based on these paragraphs. We’ve discussed MRC in the previous section because it’s also a traditional NLU task.
某些任务仅需模型捕捉上下文中的自包含知识。输入上下文中的知识足以让模型进行预测。对于这类任务,经过微调的小型模型就能表现良好。机器阅读理解 (MRC) 就是这样一种任务。MRC 任务提供若干段落,要求模型基于这些段落预测问题的答案。由于它也是传统自然语言理解任务,我们已在前文讨论过 MRC。
Another scenario is that the real-world knowledge within LLMs is useless for the task, or the required knowledge is even counterfactual to the real world. As a result, LLMs cannot work well on such tasks. In some cases, inconsistent knowledge may even make LLMs worse than random guessing. For example, in Big-Bench, the Mnist ascii task requires the model to tell which digit an ASCII art represents; the capability this task requires has nothing to do with real-world knowledge. Also, in the Inverse Scaling Phenomenon competition [70], the task redefine math redefines a common symbol and requires the model to choose between the original meaning and the meaning derived from the redefinition. What it requires conflicts with the LLMs' knowledge, so LLMs even perform worse than random guessing.
另一种情况是大语言模型中关于现实世界的知识对任务无用,甚至所需知识与现实世界相悖。因此,大语言模型在此类任务上表现不佳。某些情况下,矛盾的知识甚至会导致模型表现不如随机猜测。例如在Big-Bench中,Mnist ascii任务要求模型识别ASCII艺术所表示的数字,该任务所需能力与现实世界知识毫无关联。此外,在逆向缩放现象竞赛[70]中,任务redefine math重新定义了常见符号,要求模型在原始含义与重新定义的含义之间做出选择。该任务需求与大语言模型的知识相冲突,导致模型表现甚至不及随机猜测。
As an alternative to the real-world knowledge stored in LLMs, access to extra knowledge can be allowed, so that models obtain enough knowledge for a task via retrieval augmentation. The basic idea of retrieval augmentation is to add an extra information-retrieval step prior to making predictions, in which some useful texts related to the task are retrieved from a large corpus. Then, the model makes predictions based on both the input contexts and the retrieved texts. With the retrieved additional information, a closed-book task can become "open-book". In such a scenario, fine-tuned models are quite good despite much smaller sizes, because the required knowledge can be obtained by retrieval. For example, on Natural Questions [52], with an extra corpus, retrieval-augmented models [44, 48] are much better than any other methods.
作为大语言模型中现实世界知识的替代方案,允许访问额外知识,模型因此可以通过检索增强获取任务所需的足够知识。检索增强的基本思路是在预测前增加一个信息检索步骤,从大型语料库中检索与任务相关的有用文本。随后,模型将结合输入上下文和检索到的文本进行预测。通过检索附加信息,闭卷任务可转化为"开卷"模式。在此场景下,经过微调的较小规模模型表现优异,因为所需知识可通过检索获取。例如在Natural Questions [52]数据集上,配备额外语料库的检索增强模型[44, 48]显著优于其他方法。
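The retrieval-augmentation recipe above can be sketched in a few lines. The term-overlap scorer here stands in for a real retriever (e.g., BM25 or a dense retriever), and the corpus and prompt format are illustrative assumptions, not any system's actual API.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by simple term overlap with the query (stand-in for a real retriever)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q_terms & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_open_book_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved evidence so the model can answer from context rather than memory."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Nile is the longest river in Africa.",
    "Mount Everest is the highest mountain on Earth.",
    "The Amazon river carries the largest volume of water.",
]
prompt = build_open_book_prompt("Which river is the longest in Africa?", corpus)
print(prompt)
```

The resulting prompt turns a closed-book question into an "open-book" one, which is why a small fine-tuned reader paired with a retriever can match much larger LLMs on tasks like Natural Questions.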
4.4 Abilities Regarding Scaling
4.4 关于扩展的能力
Scaling of LLMs (e.g., parameters, training computation, etc.) can greatly empower pretrained language models. With the model scaling up, a model generally becomes more capable across a range of tasks. Reflected in some metrics, the performance shows a power-law relationship with the model scale. For example, the cross-entropy loss used to measure language-modeling performance decreases linearly with the exponential increase in model scale, a relationship known as the "scaling law" [41, 47]. For some crucial abilities, such as reasoning, scaling has gradually transformed these abilities from a very low state to a usable state, even approaching human capabilities. In this section, we provide an overview of the usage of LLMs in terms of their abilities and behaviors along with scaling.
大语言模型的扩展(如参数量、训练计算量等)能显著增强预训练语言模型的能力。随着模型规模扩大,模型在各类任务中的表现通常会更出色。从某些指标来看,性能与模型规模呈现幂律关系。例如用于衡量语言建模性能的交叉熵损失(cross-entropy loss)会随着模型规模的指数增长而线性下降,这种现象也被称为"扩展定律(scaling-law)" [41, 47]。对于一些关键能力(如推理),模型扩展已逐渐将这些能力从极低水平提升至可用状态,甚至接近人类水平。本节我们将从大语言模型的能力和行为随规模扩展的变化角度,概述其应用情况。
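The scaling-law relationship can be illustrated numerically. The constants below are hypothetical, chosen only to mimic the reported power-law shape $L(N) = (N_c/N)^{\alpha}$, not fitted values from any paper:

```python
import math

# Illustrative power-law scaling, L(N) = (N_c / N) ** ALPHA, with
# hypothetical constants in the spirit of published scaling-law fits.
ALPHA = 0.076
N_C = 8.8e13  # hypothetical critical scale (number of parameters)

def loss(n_params: float) -> float:
    """Cross-entropy loss as a power law in the parameter count."""
    return (N_C / n_params) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {loss(n):.3f}")

# "Linear with the exponential increase in scale": equal steps in log(N)
# yield equal drops in log(loss), i.e. a straight line on log-log axes.
drop1 = math.log(loss(1e8)) - math.log(loss(1e9))
drop2 = math.log(loss(1e10)) - math.log(loss(1e11))
```

Each 10x increase in parameters lowers log-loss by the same constant `ALPHA * ln(10)`, which is what the straight lines in scaling-law plots show.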
Remark 5
备注 5
4.4.1 Use Case with Reasoning. Reasoning, which involves making sense of information, drawing inferences, and making decisions, is one of the essential aspects of human intelligence. It is challenging for NLP. Many existing reasoning tasks can be classified into commonsense reasoning and arithmetic reasoning.
4.4.1 带推理的用例。推理是人类智能的核心能力之一,涉及信息理解、推断和决策制定,这对自然语言处理(NLP)极具挑战性。现有推理任务主要可分为常识推理和算术推理两类。
Arithmetic reasoning/problem solving. The arithmetic reasoning capability of LLMs benefits greatly from the scaling of model size. For GPT-3, the ability to perform two-digit addition only becomes apparent when the number of parameters exceeds 13B [16]. Tasks that test arithmetic reasoning are trivial for humans and are designed to challenge the capability of translating natural language into mathematical symbols and performing multi-step inference. On GSM8k [26], SVAMP [79], and AQuA [61], LLMs, as generalists, perform competitively with most methods that have task-specific designs. And GPT-4 outperforms any other method [76], even some huge models tuned particularly for arithmetic problems [104]. Nevertheless, it should be noted that, without the intervention of external tools, LLMs may occasionally make mistakes in basic calculations, although chain-of-thought (CoT) prompting [115] can significantly improve their ability in calculations.
算术推理/问题求解。大语言模型(LLM)的算术推理能力极大受益于模型规模的扩展。对于GPT-3而言,两位数加法能力仅在参数量超过130亿时才显现[16]。这些测试算术推理的任务对人类而言微不足道,其设计初衷是为了挑战将自然语言转化为数学符号及多步推理的能力。在GSM8k[26]、SVAMP[79]和AQuA[61]数据集上,作为通用模型的大语言模型与大多数专门设计的任务方法相比具有竞争力。而GPT-4的表现超越所有其他方法[76],甚至优于某些专门针对算术问题调优的大型模型[104]。不过值得注意的是,在没有外部工具介入的情况下,大语言模型偶尔会在基础计算中出现错误,尽管思维链(CoT)提示[115]能显著提升其计算能力。
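A minimal sketch of how a CoT prompt differs from a standard few-shot prompt: the CoT exemplar spells out the intermediate reasoning steps, which the model is encouraged to imitate before giving its answer. The exemplar text below is illustrative, not drawn from any benchmark.

```python
# Standard few-shot exemplar: question followed directly by the answer.
STANDARD_EXEMPLAR = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?\n"
    "A: 11\n"
)

# Chain-of-thought exemplar: the same question, but the answer includes
# the intermediate reasoning steps before the final result.
COT_EXEMPLAR = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.\n"
    "The answer is 11.\n"
)

def make_prompt(question: str, cot: bool = True) -> str:
    exemplar = COT_EXEMPLAR if cot else STANDARD_EXEMPLAR
    return f"{exemplar}\nQ: {question}\nA:"

print(make_prompt("A farm has 3 pens with 4 hens each. How many hens?"))
```

The only difference between the two prompts is the worked-out reasoning in the exemplar; empirically that difference is what unlocks multi-step arithmetic in sufficiently large models [115].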
Commonsense reasoning. Commonsense reasoning not only requires LLMs to remember factual knowledge but also requires them to perform several inference steps over the facts. Commonsense reasoning improves gradually with the growth of model size. Compared to fine-tuned models, LLMs keep their superiority on most datasets, such as StrategyQA [36] and ARC-C [25]. Especially on ARC-C, which contains difficult questions from science exams for grades 3 to 9, GPT-4 has come close to 100% (96.3%) [76].
常识推理。常识推理不仅要求大语言模型记住事实知识,还需要对事实进行多步推理。随着模型规模的增长,常识推理能力逐步提升。相比微调模型,大语言模型在大多数数据集上保持优势,例如StrategyQA [36]和ARC-C [25]。尤其在包含3至9年级科学考试难题的ARC-C数据集上,GPT-4已接近100% (96.3%) [76]的表现。
4.4.2 Use Cases with Emergent Abilities. Scaling of models also endows them with some unprecedented, fantastic abilities that go beyond the power-law rule. These are called "emergent abilities". As defined in [113], emergent abilities of LLMs are abilities that are not present in smaller-scale models but are present in large-scale models. This means such abilities cannot be predicted by extrapolating the performance improvements of smaller-scale models; the model suddenly gains good performance on some tasks once the scale exceeds a certain range. Emergent abilities are typically unpredictable and surprising, emerging on tasks randomly or unexpectedly. We examine concrete examples of the emergent abilities of LLMs and provide them as an important reference for deciding whether to leverage LLMs' emergent abilities.
4.4.2 涌现能力 (emergent ability) 的用例
模型规模的扩展还会赋予模型一些超越幂律法则的前所未有的非凡能力,这些能力被称为"涌现能力"。如文献 [113] 所定义,大语言模型的涌现能力是指小规模模型不具备、而大规模模型具备的能力。这意味着此类能力无法通过外推小规模模型的性能改进来预测,一旦模型规模超过某个临界范围,其某些任务的性能会突然显著提升。涌现能力通常具有不可预测性和突发性,会导致任务随机或意外地显现。我们研究了大语言模型涌现能力的具体案例,为判断是否利用其涌现能力提供了重要参考依据。
Handling word manipulation is a typical emergent ability. It refers to the ability to learn symbolic manipulations, such as reversed words [16], in which the model is given a word spelled backwards and must output the original word. For example, GPT-3 [16] shows emergent ability on word sorting and word unscrambling tasks, and PaLM [22] exhibits emergent ability on ASCII word recognition and the hyperbaton task. The logical abilities of language models also tend to emerge as the model scales up, such as logical deduction, logical sequences, and logic grid puzzles. Additionally, other tasks, such as advanced coding (e.g., auto debugging, code-line description) and concept understanding (e.g., novel concepts, simple Turing concepts), are also use cases for the emergent abilities of large language models.
处理词语操作是一种典型的涌现能力。它指的是学习符号操作的能力,例如反向单词任务 [16],即模型接收一个倒序拼写的单词后必须输出原始单词。例如,GPT-3 [16] 在单词排序和单词解乱任务中展现了涌现能力,而PaLM [22] 则在ASCII单词识别和hyperbaton任务上表现出涌现能力。语言模型的逻辑能力(如逻辑推理、逻辑序列和逻辑网格谜题)往往会随着模型规模扩大而涌现。此外,其他任务(如高级编程(例如自动调试、代码行描述)和概念理解(例如新概念、简单图灵概念))也是大语言模型涌现能力的应用场景。
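The reversed-word task format can be sketched in a couple of lines; the helper name is ours, for illustration only. The model sees the scrambled input and must recover the target, a purely symbolic manipulation that only appears at scale.

```python
# Reversed-word task [16]: given a word spelled backwards, recover the original.
def make_reversed_word_example(word: str) -> tuple[str, str]:
    """Return an (input, target) pair for the reversed-word task."""
    return word[::-1], word

prompt_input, target = make_reversed_word_example("language")
print(prompt_input, "->", target)  # egaugnal -> language
```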
4.4.3 No-Use Cases and Understanding. Although in most cases, as discussed above, larger models bring better performance, there are still many exceptions that should be considered when choosing the appropriate model.
4.4.3 非适用场景与理解。尽管如上文所述,大多数情况下更大规模的模型会带来更好的性能,但在选择合适模型时仍需考虑许多例外情况。
On certain tasks, as the size of LLMs increases, performance begins to decrease. Examples include Redefine-math, which tests whether language models can work with common symbols when they are redefined to mean something else; Into-the-unknown, which requires the model to choose which piece of information would help answer a question; and Memo-trap, which asks an LM to write a phrase that starts like a famous quote but ends differently. This is called the Inverse Scaling Phenomenon. Another interesting phenomenon observed in the scaling of LLMs is the U-shaped Phenomenon [114]. As the name implies, this phenomenon refers to performance on certain tasks initially improving as LLM size increases, then declining, before eventually improving again. Examples include Hindsight-neglect, which tests whether language models can assess whether a bet was worth taking based on its expected value; NegationQA, which negates a part of each question in an existing multiple-choice dataset to see whether language models are sensitive to negation; and Quote-repetition, which asks models to repeat back sentences given in the prompt, with few-shot examples to help them recognize the task. Hence, the risk of diminishing performance should be noted, and if the task resembles those just discussed, careful consideration should be given to whether or not to use huge LLMs.
在某些任务中,随着大语言模型 (LLM) 规模的增大,其性能反而开始下降。例如:
- **Redefine-math**:测试语言模型能否在常见符号被重新定义含义时正确使用它们;
- **Into-the-unknown**:要求模型选择哪条信息有助于回答问题;
- **Memo-trap**:让语言模型以模仿名句开头但结尾改写的方式书写短语。
这种现象被称为**逆缩放现象 (Inverse Scaling Phenomenon)**。
另一项在大语言模型扩展中观察到的有趣现象称为**U型现象 (U-shaped Phenomenon)** [114]。顾名思义,这种现象指的是随着大语言模型规模增大,其在某些任务上的表现会先提升后下降,最终再次回升,例如:
- **Hindsight-neglect**:测试语言模型能否根据期望值判断赌注是否值得;
- **NegationQA**:对现有多选题数据集的每个问题部分进行否定,观察语言模型对否定的敏感性;
- **Quote-repetition**:要求模型重复提示中给定的句子,并提供少样本示例以帮助识别任务。
因此,需注意性能下降的风险。若任务类似上述讨论的案例,应慎重考虑是否使用超大规模的大语言模型。
Gaining a deeper understanding of emergent abilities, the inverse scaling phenomenon, and the U-shape phenomenon in LLMs is essential for advancing research in this field. In a certain sense, the U-shape phenomenon suggests that small-scale models and huge-scale models make predictions with different internal mechanisms. From this perspective, the U-shape phenomenon can be seen as a transformation of the inverse scaling phenomenon due to some emergent abilities of sufficiently large models [114]. GPT-4 [76] exhibits a reversal of the inverse scaling phenomenon in some cases, such as on the task called Hindsight Neglect. The explanation for these behaviors of LLMs during scaling is still an open problem, and several hypotheses have been proposed. For emergent abilities, one explanation is that a task may involve multiple key steps and the LLM cannot handle the task until it is large enough to handle every step; another explanation focuses on the granularity of evaluation metrics [113]. For the inverse scaling phenomenon and U-shape phenomenon, the explanations mainly focus on the model's over-reliance on information from its prior rather than the input prompts, valid but misleading few-shot examples, and distracting easier sub-tasks within a hard task [114].
深入理解大语言模型(LLM)中的涌现能力、逆缩放现象和U型现象对推动该领域研究至关重要。从某种意义上说,U型现象表明小规模模型与超大规模模型采用不同的内部机制进行预测。从这个角度看,U型现象可视为当模型达到足够规模时,某些涌现能力导致的逆缩放现象转化[114]。GPT-4[76]在某些任务(如后见之明忽视任务)中表现出逆缩放现象的逆转。目前对大语言模型在扩展过程中这些行为的解释仍是一个开放性问题。
现有几种假设解释这些现象:对于涌现能力,一种解释认为任务可能包含多个关键步骤,只有当模型规模足够处理所有步骤时才能完成任务;另一种解释则关注评估指标的粒度问题[113]。针对逆缩放现象和U型现象,主要解释包括:模型过度依赖先验知识而非输入提示、有效但具有误导性的少样本示例,以及困难任务中干扰性的简单子任务[114]。
4.5 Miscellaneous tasks
4.5 其他任务
This section explores miscellaneous tasks not covered in the previous discussions, to better understand LLMs' strengths and weaknesses.
本节探讨无法归入前文讨论的各类任务,以更好地理解大语言模型的优势与不足。
Remark 6
备注 6
4.5.1 No use case. LLMs generally struggle with some tasks due to differences in objectives and training data.
4.5.1 无适用场景。由于目标与训练数据的差异,大语言模型 (LLM) 通常难以处理某些任务。
Although LLMs have achieved remarkable success in various natural language processing tasks, their performance in regression tasks has been less impressive. For example, ChatGPT’s performance on the GLUE STS-B dataset, which is a regression task evaluating sentence similarity, is inferior to a fine-tuned RoBERTa performance [130]. The Regression tasks typically involve predicting a continuous value rather than a discrete label, posing unique challenges for LLMs. One primary reason for their subpar performance is the inherent difference between the language modeling objective and the regression task objective. LLMs are designed to predict the next word in a sequence or generate coherent text, with their pre-training focused on capturing linguistic patterns and relationships. Consequently, their internal representations may not be well-suited for modeling continuous numerical outputs. Besides, LLMs have predominantly been trained on text data, focusing on capturing the intricacies of natural language processing. As a result, their performance on multimodal data, which involves handling multiple data types such as text, images, audio, video, actions, and robotics, remains largely unexplored. And fine-tuned multimodal models, like BEiT[110] and PaLI [19], still dominate many tasks such as visual question answering (VQA) and image captioning. Nonetheless, the recently introduced GPT-4 [76] has taken the step in multimodal fusion, but there is still a lack of detailed evaluation of its capabilities.
虽然大语言模型在各种自然语言处理任务中取得了显著成功,但在回归任务中的表现却不尽如人意。例如,ChatGPT在评估句子相似度的回归任务GLUE STS-B数据集上的表现不如经过微调的RoBERTa模型[130]。回归任务通常需要预测连续值而非离散标签,这给大语言模型带来了独特挑战。其表现不佳的主要原因在于语言建模目标与回归任务目标之间存在本质差异。大语言模型的设计初衷是预测序列中的下一个词或生成连贯文本,其预训练侧重于捕捉语言模式和关系,因此它们的内部表征可能不适合建模连续数值输出。此外,大语言模型主要基于文本数据进行训练,专注于捕捉自然语言处理的复杂性,导致其在涉及文本、图像、音频、视频、动作和机器人等多模态数据处理方面的表现仍鲜有研究。而经过微调的多模态模型(如BEiT[110]和PaLI[19])仍在视觉问答(VQA)和图像描述等多项任务中占据主导地位。尽管如此,最新推出的GPT-4[76]已迈出多模态融合的步伐,但目前仍缺乏对其能力的详细评估。
4.5.2 Use case. LLMs are particularly suitable for certain tasks.
4.5.2 使用场景。大语言模型 (LLM) 特别适合完成某些特定任务。
LLMs are very good at mimicking humans, acting as chatbots, and performing various kinds of tasks. The LLM-powered ChatGPT is surprising for its consistency, reliability, informativeness, and robustness during multiple utterances with humans. The human-feedback procedure plays an important role in acquiring such abilities.
大语言模型非常擅长模仿人类行为,既能充当聊天机器人,又能执行各类任务。由大语言模型驱动的ChatGPT因其在与人类多轮对话中表现出的连贯性、可靠性、信息量和稳健性而令人惊叹。人类反馈机制对于获得这些能力起着关键作用。
LLMs can act both as good annotators and as data generators for data augmentation, as in [27, 29, 99, 121, 122]. Some LLMs have been found to be as good as human annotators [37] on some tasks. And texts collected from GPT-3.5 (text-davinci-003) have been used as human-like instruction-following demonstrations to train other language models [100].
大语言模型 (LLM) 既能作为优质标注工具,又能充当数据增强的生成器,例如 [27, 29, 99, 121, 122] 所示。研究发现某些任务中部分大语言模型的标注质量可与人类标注员媲美 [37]。从 GPT-3.5 (text-davinci-003) 采集的文本已被用作类人指令跟随示范数据,用于训练其他语言模型 [100]。
LLMs can also be used for quality assessment on some NLG tasks, such as summarization and translation. On summarization tasks, GPT-4 as an evaluator achieves a higher correlation with humans than other methods by a large margin [64]. Some other LLM-based evaluators [34, 50, 64, 108] also show good human alignment on more NLG tasks, especially compared with traditional automatic metrics. However, the LLM evaluator may be biased towards LLM-generated texts [64].
大语言模型也可用于某些自然语言生成(NLG)任务的质量评估,例如摘要和翻译。在摘要任务中,GPT-4作为评估器与人类评价的相关性大幅领先其他方法[64]。其他基于大语言模型的评估器[34, 50, 64, 108]也在更多NLG任务中展现出良好的人类对齐性,尤其相较于传统自动指标。但LLM评估器可能对LLM生成的文本存在偏好[64]。
Also, as discussed above, some abilities of LLMs bring bonuses beyond performance improvement, such as interpretability. The CoT reasoning ability of LLMs can show how an LLM reaches a prediction, which is a good instance-level interpretation, while it also improves performance.
此外,正如我们前文所述,大语言模型的某些能力不仅能提升性能,还会带来额外优势,例如可解释性。大语言模型的思维链 (CoT) 推理能力可以展示其得出预测的过程,这在实例层面提供了良好的解释,同时这种能力也提升了模型性能。
4.6 Real world "tasks"
4.6 现实世界的"任务"
In the last part of this section, we would like to discuss the usage of LLMs and fine-tuned models in real-world "tasks". We use the term "tasks" loosely, as real-world scenarios often lack the well-formatted definitions found in academia. Many requests to models cannot even be treated as NLP tasks. Models face challenges in the real world from three perspectives:
在本节的最后部分,我们将讨论大语言模型(LLM)和微调模型在实际"任务"中的应用。此处"任务"的定义较为宽泛,因为现实场景往往缺乏学术界那样明确定义的任务框架。许多对模型的请求甚至无法被视为自然语言处理(NLP)任务。从以下三个维度来看,模型在现实世界中面临着挑战:
Essentially, these challenges in the real world come from the fact that users' requests deviate significantly from the distribution of any NLP dataset designed for a specific task. Public NLP datasets are not reflective of how models are used [77].
本质上,这些现实世界中的挑战源于用户的请求明显偏离了为特定任务设计的任何NLP数据集的分布。公开的NLP数据集并不能反映模型的实际使用情况[77]。
Remark 7
备注 7
LLMs are better suited to handle real-world scenarios compared to fine-tuned models. However, evaluating the effectiveness of models in the real world is still an open problem.
大语言模型比微调模型更适合处理现实场景。然而,评估模型在现实世界中的有效性仍是一个开放性问题。
Handling such real-world scenarios requires coping with ambiguity, understanding context, and handling noisy input. Compared to fine-tuned models, LLMs are better equipped for this because they have been trained on diverse datasets that encompass various writing styles, languages, and domains. Additionally, LLMs demonstrate a strong ability to generate open-domain responses, making them well suited for these scenarios. Fine-tuned models, on the other hand, are often tailored to specific, well-defined tasks and may struggle to adapt to new or unexpected user requests. They rely heavily on clear objectives and well-formed training data that specify the types of instructions the models should learn to follow. Fine-tuned models may struggle with noisy input due to their narrower focus on specific distributions and structured data. An additional system is often required as an assistant for fine-tuned models to process unstructured context, determine possible intents, and refine model responses accordingly.
处理这类现实场景需要应对模糊性、理解上下文并处理噪声输入。与微调模型相比,大语言模型更适合这类任务,因为它们经过多样化数据集的训练,涵盖各种写作风格、语言和领域。此外,大语言模型展现出强大的开放领域响应生成能力,使其特别适配此类场景。而微调模型通常针对特定、明确定义的任务进行优化,可能难以适应新的或意外的用户请求。它们高度依赖清晰的目标和格式规范的训练数据,这些数据需明确指定模型应学会遵循的指令类型。由于微调模型更专注于特定分布和结构化数据,它们在处理噪声输入时可能会遇到困难。通常需要额外的系统作为微调模型的辅助,以处理非结构化上下文、确定可能的意图并相应优化模型响应。
Additionally, some mechanisms such as instruction tuning [91, 112] and human alignment tuning [77] further boost the capabilities of LLMs to better comprehend and follow user instructions. These methods improve the model's ability to generate helpful, harmless, and honest responses while maintaining coherence and consistency [77, 91, 112]. While both methods can make LLMs generalize better to unseen tasks and instructions, it has been noticed that human labelers prefer models tuned for human alignment [77] to models tuned with instructions from public NLP tasks, such as FLAN [112] and T0 [91]. The reason may be similar to the reasons for fine-tuned models' inferiority: public NLP tasks/datasets are designed for easy and automatic evaluation, and they can only cover a small part of real-world usage. One of the main issues in real-world scenarios is how to evaluate whether a model is good or not. Without any formalized tasks or metrics, the evaluation of model effectiveness can only rely on feedback from human labelers. Considering the complexity and cost of human evaluation, there is no massive and systematic comparison between fine-tuned models and LLMs yet. Nevertheless, the huge success and popularity of LLMs such as ChatGPT has confirmed the superiority of LLMs to some extent.
此外,指令微调 [91, 112] 和人类对齐微调 [77] 等机制进一步提升了LLM (Large Language Model) 理解和遵循用户指令的能力。这些方法增强了模型生成有用、无害且诚实回答的能力,同时保持连贯性和一致性 [77, 91, 112]。虽然这两种方法都能让LLM更好地泛化到未见过的任务和指令,但研究发现人类标注者更倾向于选择经过人类对齐微调的模型 [77],而非基于公开NLP任务(如FLAN [112] 和T0 [91])指令微调的模型。其原因可能与微调模型的劣势类似:公开NLP任务/数据集是为便于自动化评估而设计的,仅能覆盖现实应用场景的一小部分。
现实场景中的核心问题之一是如何评估模型优劣。在没有标准化任务或指标的情况下,模型效果评估只能依赖人类标注者的反馈。考虑到人工评估的复杂性和成本,目前尚未出现微调模型与LLM之间大规模系统化的对比研究。尽管如此,chatGPT等LLM的巨大成功和普及,已在一定程度上证实了其优越性。
5 OTHER CONSIDERATIONS
5 其他注意事项
Although LLMs are suitable for various downstream tasks, there are other factors to consider, such as efficiency and trustworthiness. Our discussion of efficiency encompasses the training cost, inference latency, and parameter-efficient tuning strategies for LLMs. Meanwhile, our examination of trustworthiness includes robustness & calibration, fairness & biases, potential spurious correlations, and safety challenges in LLMs.
尽管大语言模型 (LLM) 适用于各种下游任务,但仍需考虑其他因素,例如效率和可信度。我们对效率的讨论涵盖了大语言模型的训练成本、推理延迟和参数高效调优策略。同时,我们对可信度的研究包括鲁棒性与校准、公平性与偏见、潜在的伪相关性以及大语言模型中的安全挑战。
Remark 8
备注 8
5.1 Efficiency
5.1 效率
In real-world deployment, performance, cost, and latency are all important considerations, not just model performance. While some parameter-efficient methods have been developed, practitioners must balance efficiency with effectiveness in practice.
在实际部署中,性能、成本和延迟都是重要的考量因素,而不仅仅是模型的性能。虽然已经开发出一些参数高效的方法,但实践者必须在效率与效果之间取得平衡。
Cost. LLMs have grown increasingly larger in recent years, with models such as GPT-1, GPT-2, and GPT-3 featuring 117 million, 1.5 billion, and 175 billion parameters, respectively. The cost of training an LLM is heavily influenced by its size: estimates suggest that training the 11B-parameter variant of T5 costs well over $\$1.3$ million for a single run, while a single training run of GPT-3 175B requires $\$4.6$ million [3]. The energy consumption for training large models is equally impressive: the total energy consumed in training a transformer model with 6B parameters to completion is estimated at around $103.5\mathrm{MWh}$ [30], and Google reports that training PaLM consumed about 3.4 GWh over roughly two months [6]. Furthermore, the dataset size also scales rapidly with the size of the model, with GPT-3 175B trained on 499 billion tokens [16]. Another key metric reflecting computing cost is FLOPs: GPT-3 175B requires $3.14\times10^{23}$ FLOPs, while a T5 11B model only requires $3.30\times10^{22}$, about 10 times less. In addition to these costs, hardware requirements are substantial. OpenAI has collaborated with Microsoft on a supercomputer hosted in the Microsoft Azure cloud, consisting of 285k CPU cores and 10k high-end GPUs, to support the training of large models. For users of the OpenAI API, pricing varies based on the model and usage; for example, GPT-3.5-turbo charges $\$0.002$ per 1k tokens for chat service, while for users who require custom models, training costs $\$0.03$ per 1k tokens and usage costs $\$0.12$ per 1k tokens [4]. Therefore, for users who cannot afford such large costs, such as small startups and individual users, a small fine-tuned model is a better and more reasonable choice.
成本。近年来,大语言模型的参数量持续攀升,GPT-1、GPT-2和GPT-3的参数量分别达到1.17亿、15亿和1750亿。模型训练成本与其规模密切相关:据估算,训练110亿参数的T5变体单次成本超过130万美元,而GPT-3 1750亿参数的训练单次耗资高达460万美元[3]。大型模型的能耗同样惊人,完整训练60亿参数Transformer模型约消耗103.5兆瓦时[30],Google报告显示训练PaLM在两个月内耗电约3.4吉瓦时[6]。数据集规模也随模型体量激增,GPT-3 1750亿的训练使用了4990亿token[16]。反映计算成本的另一关键指标是浮点运算量,GPT-3 1750亿需3.14×10²³次运算,而T5 110亿仅需3.30×10²²次,相差十倍。硬件需求同样庞大,OpenAI与Microsoft合作部署的Azure超级计算机包含28.5万CPU核心和1万高端GPU。对于OpenAI API用户,GPT-3.5-turbo聊天服务每千token收费0.002美元;定制模型训练每千token0.03美元,使用费0.12美元[4]。因此,初创公司或个人用户等预算有限的群体更适合选择经过微调的小型模型。
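A quick back-of-the-envelope check of the figures quoted above, using the rates cited in the text:

```python
# FLOPs comparison quoted above: GPT-3 175B vs. T5 11B.
gpt3_flops = 3.14e23
t5_11b_flops = 3.30e22
ratio = gpt3_flops / t5_11b_flops  # roughly 10x more compute

# OpenAI API pricing as quoted in the text ($ per 1k tokens).
chat_rate = 0.002   # gpt-3.5-turbo chat service
usage_rate = 0.12   # custom-model usage
tokens = 1_000_000

chat_cost = tokens / 1000 * chat_rate     # $2 per 1M tokens
custom_cost = tokens / 1000 * usage_rate  # $120 per 1M tokens

print(f"FLOPs ratio: {ratio:.1f}x")
print(f"1M tokens: chat ${chat_cost:.2f}, custom ${custom_cost:.2f}")
```

The 60x gap between the two per-token rates is one concrete reason a small fine-tuned model can be the more reasonable choice for cost-sensitive users.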
Latency. Latency is a crucial factor to consider in real-world applications of LLMs. Inference time is a commonly used metric for latency, and it is highly dependent on the model size, architecture, and token size. For instance, the inference time for the GPT-J 6B model is 0.077s, 0.203s, and 0.707s when the max token size is set to 2, 8, and 32, respectively. Additionally, when the max token size is fixed at 32, the inference time for the InstructGPT model (davinci v2) is 1.969s. As LLMs are often too large to run on a single user's machine, companies provide LLM services via APIs. API latency can vary depending on the user's location, and the average latency of the OpenAI API service for a single request can range from a few hundred milliseconds to several seconds. In scenarios where high latency is not acceptable, large LLMs may not be appropriate. For example, scalability is critical in many information retrieval applications: to deploy information retrieval systems on the web, search engines require very efficient inference. The idealized denoised inference time for the InstructGPT davinci v2 (175B*) model is 0.21s per request (i.e., per query-passage pair to be scored), which is too slow for web search engines.
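When deciding whether an LLM service meets a latency budget, a simple measurement harness can help. In the sketch below, `call` stands in for any real model invocation (an API request or a local forward pass); the mock sleep used in the example is purely illustrative.

```python
import statistics
import time

def measure_latency(call, n_requests: int = 20):
    """Time repeated invocations and report mean and worst-case latency."""
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), max(samples)

# Mock "model call" taking ~10 ms; replace with a real request or forward pass.
mean_s, worst_s = measure_latency(lambda: time.sleep(0.01))
print(f"mean {mean_s * 1000:.1f} ms, worst {worst_s * 1000:.1f} ms")
```

For latency-sensitive deployments, the worst case (or a high percentile) usually matters more than the mean, since it bounds the user-visible tail of response times.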
Parameter-Efficient Tuning. In practice, we may tune a model on specific datasets. Parameter-Efficient Tuning (PET), also known as parameter-efficient fine-tuning (PEFT), is an efficient technique that tunes a small portion of model parameters (or extra parameters) while freezing most parameters of the pre-trained LLM. The main goal of PET is to greatly decrease the computational and storage costs while preserving the performance of the original model. Common PET techniques include LoRA [42], Prefix Tuning [58], and P-Tuning [62, 63]. As an illustration, the LoRA method keeps the weights of the pre-trained model frozen and adds low-rank matrices to every layer of the Transformer architecture. This approach considerably reduces the number of parameters that must be trained for downstream tasks, thereby increasing overall efficiency. Alpaca-LoRA proposes integrating Low-Rank Adaptation (LoRA) into LLaMA-Alpaca, which enables running LLaMA within hours on a single RTX 4090. All these PET methods can be helpful either for fine-tuning a model for a specific task or for tuning LLMs to meet special requirements such as human alignment.
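To illustrate why LoRA shrinks the trainable parameter count, here is a minimal NumPy sketch of a LoRA-augmented linear layer. The class name, rank, and scaling are our own illustrative choices, not a reference implementation:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A.
    Only A and B (r * (d_in + d_out) parameters) would be trained."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, r))                   # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # B is zero at init, so the layer initially matches the frozen model.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=1024, d_out=1024, r=8)
frozen = layer.W.size                     # 1,048,576 frozen parameters
trainable = layer.A.size + layer.B.size   # 16,384 trainable parameters (~1.6%)
print(frozen, trainable)
```

The zero-initialized `B` matrix is the standard LoRA trick: the adapted layer starts out exactly equal to the pre-trained one, and training only moves it away as needed.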
5.2 Trustworthiness
Given that LLMs are now involved in sensitive areas such as healthcare, finance, and law, it is crucial to ensure that they are trustworthy and capable of producing reliable output.
Robustness and Calibration. The accuracy and robustness of LLMs are shown to be strongly correlated [59]: models that achieve high accuracy in a scenario also tend to have good robustness. However, zero-shot robustness degrades after the model is tuned on extra application-specific task data [116]. This may be due to overfitting, which leads to poor generalizability given the extremely high complexity of the model and the limited training samples from downstream tasks [43]. In a similar vein, it has been observed that fine-tuning a model can result in significant miscalibration, owing to over-parameterization [51]. Therefore, fine-tuned models may not be an optimal choice when robustness and calibration are critical considerations. However, human alignment has been found to be a potential solution for enhancing model robustness: InstructGPT davinci v2 $(175\mathrm{B}^{\ast})$ has been shown to outperform other models in terms of robustness. On the other hand, achieving optimal calibration of the model depends on the scenario and the adaptation procedure employed.
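Calibration can be quantified with the expected calibration error (ECE): the gap between a model's stated confidence and its actual accuracy, averaged over confidence bins. A minimal sketch, assuming confidence scores and correctness indicators are available for a held-out set:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |accuracy - mean confidence| per confidence bin,
    weighted by bin size. Lower means better calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A perfectly calibrated toy model: 80% confidence, 80% of answers correct.
conf = [0.8] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(round(expected_calibration_error(conf, hits), 3))  # 0.0
```

A fine-tuned model that is overconfident (say, 95% confidence at 80% accuracy) would score a correspondingly larger ECE, which is the kind of miscalibration the paragraph above warns about.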
Fairness and Bias. LLMs have been shown to exhibit disparate treatment and impact, perpetuating societal biases and potentially leading to discrimination [10, 17]. To ensure fairness and equity for all users, it is crucial to address these issues in the development and deployment of NLP models. Disparities in performance between demographic groups can serve as an indicator of fairness problems. LLMs are particularly susceptible to fairness issues, as significant performance disparities have been observed across demographic categories such as dialect, religion, gender, and race [59]. However, research has shown that aligning models with human instructions can improve LLM performance regardless of their size, with the InstructGPT model (davinci v2) exhibiting smaller performance disparities than other LLMs [23].
Spurious Biases. The shortcut learning problem has been observed in various natural language understanding tasks under the pre-training and fine-tuning paradigm, where models heavily rely on spurious correlations between inputs and labels in the fine-tuning data for prediction [31, 35, 98]. For example, in reading comprehension tasks, fine-tuned models tend to focus on lexical matching between words in the question and the original passage, neglecting the intended reading comprehension task itself [53]. In contrast, large language models are not directly trained on fine-tuning datasets, which makes it less likely for them to learn shortcut features present in those datasets, thereby enhancing their generalization capabilities. However, LLMs are not infallible and may exhibit some shortcut learning during in-context learning. For example, recent preliminary studies have begun investigating the robustness of prompt-based methods in large-scale language models [111, 129]. One such study evaluates the few-shot learning performance of GPT-3 on text classification and information extraction tasks [129], revealing that the examined LLMs are susceptible to majority label bias and position bias, where they tend to predict answers based on the frequency or position of the answers in the training data. Moreover, these LLMs exhibit common token bias, favoring answers that are prevalent in their pre-training corpus. Recent studies show that this positional bias can be mitigated by selecting proper prompts [68]. In summary, while LLMs significantly reduce the shortcut learning problem prevalent in fine-tuned models, they still exhibit some shortcut learning issues and should be approached with caution when deployed in downstream applications.
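One simple precaution against majority label bias and position bias in in-context learning is to balance and shuffle the few-shot demonstrations before building the prompt. The sketch below is a hypothetical illustration (function name and toy data are our own), not a method from the cited studies:

```python
import random
from collections import Counter

def balance_demonstrations(examples, per_label=2, seed=0):
    """Subsample (text, label) demonstrations so every label appears
    equally often, then shuffle their order in the prompt. This removes
    the majority-label skew and randomizes answer positions."""
    random.seed(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    balanced = []
    for label, items in by_label.items():
        balanced.extend(random.sample(items, min(per_label, len(items))))
    random.shuffle(balanced)
    return balanced

# A skewed pool: 6 positive vs. 2 negative examples.
pool = [("great movie", "pos")] * 6 + [("terrible plot", "neg")] * 2
demos = balance_demonstrations(pool, per_label=2)
print(Counter(label for _, label in demos))  # pos and neg each appear twice
```

This addresses only the prompt-side biases; common token bias stems from the pre-training corpus and requires other mitigations, such as the prompt-selection methods of [68].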
5.3 Safety challenges
LLMs have demonstrated extremely strong capabilities in many areas such as reasoning, knowledge retention, and coding. As they become more powerful and human-like, their potential to influence people's opinions and actions in significant ways grows. As a result, some new safety challenges to our society should be considered, and they have attracted considerable attention in recent works [75, 76].
Hallucinations. The potential for LLMs to "hallucinate," i.e., generate nonsensical or untruthful content, can have significant negative impacts on the quality and reliability of information in various applications. As LLMs become increasingly convincing and believable, users may develop an overreliance on them and trust them to provide accurate information in areas with which the users themselves are only somewhat familiar. This can be particularly dangerous if the model produces content that is entirely false or misleading, leading to incorrect decisions or actions taken based on that information. Such outcomes can have serious consequences in many domains, such as healthcare, finance, or public policy, where the accuracy and reliability of information are critical. To mitigate these issues, reinforcement learning from human feedback (RLHF) is widely used [75, 77], and LLMs themselves have been integrated into the loop [75].
Harmful content. Due to the high coherence, quality, and plausibility of texts generated by LLMs, harmful content from LLMs can cause significant harm, including hate speech, discrimination, incitement to violence, false narratives, and even social engineering attacks. Implementing safeguards to detect and correct such content can mitigate these harms [97]. LLMs can also have dual-use potential by providing illicit information on request, leading to risks such as the proliferation of weapons [75] and even terrorist attack planning. It is crucial to ensure that these LLMs are used responsibly, with safeguards in place to prevent harm. In existing work, feedback from humans also plays an important role in eliminating harmful outputs.
Privacy. LLMs can face serious security issues, a prime example being user privacy. It is reported that Samsung employees, while using ChatGPT to process their work, inadvertently leaked top-secret data, including the source code of a new program and internal meeting minutes related to hardware. The Italian data protection agency declared that OpenAI, the developer of ChatGPT, illicitly gathered personal user data, leading Italy to become the first government to prohibit ChatGPT over privacy concerns [1].
6 CONCLUSION AND FUTURE CHALLENGES
Recent advances in large language models have been revolutionizing the field of natural language processing. Effectively using LLMs requires understanding their capabilities and limitations for various NLP tasks. This work presents a practical guide to working with LLMs for downstream NLP tasks. We first discuss prominent models such as GPT-style and BERT-style architectures and the factors influencing their performance. We then explore using LLMs for downstream tasks, including knowledge-intensive tasks, NLU tasks, and NLG tasks, providing concrete examples of successes and limitations. This practical guide offers insights into LLMs and best practices for harnessing them across NLP tasks. We hope it enables researchers and practitioners to leverage their potential and drive innovation in language technologies.
In the following, we outline the future challenges of LLMs:
• Evaluation of proposed models on real-world "datasets". Existing deep learning models are primarily evaluated on standard academic datasets, such as ImageNet, which have been milestones in deep learning development. However, standard academic datasets have limitations and cannot exactly reflect real-world performance. As models advance, it is crucial to assess them on more diverse, complex, and realistic data that reflect real-world needs. Evaluating models on real-world "datasets", in addition to academic ones, will provide a more rigorous test of their capabilities, as well as a better understanding of their effectiveness in real-world applications. This ensures that the models are capable of addressing real-world challenges and delivering practical solutions.
• Model Alignment. Ensuring that increasingly powerful and autonomous models align with human values and priorities is essential. Methods must be developed to guarantee that these models behave as intended and do not optimize for undesirable outcomes. It is crucial to integrate alignment techniques from the start of the model development process. Model transparency and interpretability are also important factors for evaluating and ensuring alignment. Additionally, as we look toward the future, an even more daunting challenge looms: aligning superhuman systems. While this task is currently beyond our needs, it is important to consider and prepare for the potential implications of aligning such advanced systems, as they may present unique complexities and ethical concerns [8, 15].
• Safety Alignment. While discussion of AI existential risks is important, concrete research is needed to guarantee the safe development of advanced AI. This includes techniques for interpretability, scalable oversight and governance, and formal verification of model properties. Safety should be considered not just an add-on but an integral part of the model-building process.
• Performance Prediction with Scaling. It is difficult to anticipate how model performance will change as model size and complexity increase dramatically. Developing methods to better predict model performance after scaling up, or as new architectures are developed, would allow for more efficient use of resources and accelerated progress. Some possibilities include: training a smaller "seed" model and extrapolating its growth, simulating the effects of increased scale or model tweaks, and benchmarking iterations of the model at different scales to build scaling laws. These could provide insight into the performance of models even before they are built.
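The "seed model" idea above can be sketched as fitting a power law L = a * N^b to observations from small runs and extrapolating to a larger model. The sketch below uses synthetic data that follows an assumed law, so the fit recovers the exponent exactly; real loss measurements would be noisy and the extrapolation correspondingly uncertain.

```python
import numpy as np

# Hypothetical (parameter count, loss) observations from small "seed" runs.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss = 10.0 * params ** -0.1          # synthetic data following L = a * N^b

# Fit log(L) = log(a) + b * log(N) by least squares, then extrapolate.
b, log_a = np.polyfit(np.log(params), np.log(loss), deg=1)
a = np.exp(log_a)
predicted = a * (1e11) ** b           # projected loss at 100B parameters
print(f"fitted exponent b = {b:.3f}, predicted loss = {predicted:.3f}")
```

Fitting in log-log space turns the power law into a straight line, which is why a simple linear fit suffices; the same approach underlies published scaling-law analyses, though those fit far more careful functional forms.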
