Scaling Laws for Neural Language Models
Jared Kaplan∗ | Sam McCandlish∗ |
---|---|
Johns Hopkins University, OpenAI | OpenAI |
jaredk@jhu.edu | sam@openai.com |
Tom Henighan | Tom B. Brown | Benjamin Chess | Rewon Child |
---|---|---|---|
OpenAI | OpenAI | OpenAI | OpenAI |
henighan@openai.com | tom@openai.com | bchess@openai.com | rewon@openai.com |
Scott Gray | Alec Radford | Jeffrey Wu | Dario Amodei |
OpenAI | OpenAI | OpenAI | OpenAI |
scott@openai.com | alec@openai.com | jeffwu@openai.com | damodei@openai.com |
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
1 Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world’s text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models [RNSS18, DCLT18, $\mathrm{YDY}^{+}19$, $\mathrm{LOG}^{+}19$, $\mathrm{RSR}^{+}19$] approaching human-level performance on many specific tasks $[\mathrm{WPN}^{+}19]$, including the composition of coherent multi-paragraph prompted text samples $[\mathrm{RWC}^{+}19]$.
One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [$\mathrm{VSP}^{+}17$, $\mathrm{LSP}^{+}18$]. The high ceiling and low floor for performance on language tasks allow us to study trends over more than seven orders of magnitude in scale.
Throughout we will observe precise power-law scalings for performance as a function of training time, context length, dataset size, model size, and compute budget.
1.1 Summary
Our key findings for Transformer language models are as follows:
Figure 1 Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters $N$ (excluding embeddings), the size of the dataset $D$, and the amount of compute $C$ used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors $N$, $D$, $C$ when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)
Universality of overfitting: Performance improves predictably as long as we scale up $N$ and $D$ in tandem, but enters a regime of diminishing returns if either $N$ or $D$ is held fixed while the other increases. The performance penalty depends predictably on the ratio $N^{0.74}/D$, meaning that every time we increase the model size $8\times$, we only need to increase the data by roughly $5\times$ to avoid a penalty. (Section 4)
Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)
Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated with those on the training validation set, with a roughly constant offset in the loss. In other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)
Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).
Convergence is inefficient: When working within a fixed compute budget $C$ but without any other restrictions on the model size $N$ or available data $D$, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as $D\sim C^{0.27}$ with training compute. (Section 6)
Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.
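As a quick arithmetic check of the overfitting rule in the findings above, the sketch below evaluates the data multiplier implied by holding $N^{0.74}/D$ constant. The function name `data_multiplier` is illustrative and does not come from the paper.

```python
def data_multiplier(model_multiplier: float, exponent: float = 0.74) -> float:
    """Factor by which D must grow to keep N**exponent / D fixed when N grows."""
    return model_multiplier ** exponent


# An 8x larger model needs about 8**0.74 times more data.
print(f"{data_multiplier(8.0):.2f}")
```

The result is a little under 5, consistent with the "roughly $5\times$" quoted in the text.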
Figure 2 We show a series of language model training runs, with models ranging in size from $10^{3}$ to $10^{9}$ parameters (excluding embeddings).
Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger models, using larger batches, and training for more steps. We illustrate this for a billion-fold increase in compute. For optimally compute-efficient training, most of the increase should go towards increased model size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to increase parallelism through larger batch sizes, with only a very small increase in serial training time required.
1.2 Summary of Scaling Laws
The test loss of a Transformer trained to autoregressively model language can be predicted using a power law when performance is limited by only one of the number of non-embedding parameters $N$, the dataset size $D$, or the optimally allocated compute budget $C_{\mathrm{min}}$ (see Figure 1):
- For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
$$
L(N)=\left(N_{\mathrm{c}}/N\right)^{\alpha_{N}};\quad\alpha_{N}\sim0.076,\quad N_{\mathrm{c}}\sim8.8\times10^{13}~\text{(non-embedding parameters)}\tag{1.1}
$$
- For large models trained with a limited dataset with early stopping:
$$
L(D)=\left(D_{\mathrm{c}}/D\right)^{\alpha_{D}};\quad\alpha_{D}\sim0.095,\quad D_{\mathrm{c}}\sim5.4\times10^{13}~\text{(tokens)}\tag{1.2}
$$
- When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal$^{3}$ use of compute):
$$
L(C_{\mathrm{min}})=\left(C_{\mathrm{c}}^{\mathrm{min}}/C_{\mathrm{min}}\right)^{\alpha_{C}^{\mathrm{min}}};\quad\alpha_{C}^{\mathrm{min}}\sim0.050,\quad C_{\mathrm{c}}^{\mathrm{min}}\sim3.1\times10^{8}~\text{(PF-days)}\tag{1.3}
$$
$^{3}$We also observe an empirical power-law trend with the training compute $C$ (Figure 1) while training at fixed batch size, but it is the trend with $C_{\mathrm{min}}$ that should be used to make predictions. They are related by equation (5.5).
Figure 4 Left: The early-stopped test loss $L(N,D)$ varies predictably with the dataset size $D$ and model size $N$ according to Equation (1.5). Right: After an initial transient period, learning curves for all model sizes $N$ can be fit with Equation (1.6), which is parameterized in terms of $S_{\mathrm{min}}$ , the number of steps when training at large batch size (details in Section 5.1).
These relations hold across eight orders of magnitude in $C_{\mathrm{min}}$, six orders of magnitude in $N$, and over two orders of magnitude in $D$. They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads), with specific numerical values associated with the WebText2 training set $[\mathrm{RWC}^{+}19]$. The power-law exponents $\alpha_{N},\alpha_{D},\alpha_{C}^{\mathrm{min}}$ specify the degree of performance improvement expected as we scale up $N$, $D$, or $C_{\mathrm{min}}$; for example, doubling the number of parameters yields a loss that is smaller by a factor $2^{-\alpha_{N}}\approx0.95$. The precise numerical values of $N_{\mathrm{c}}$, $C_{\mathrm{c}}^{\mathrm{min}}$, and $D_{\mathrm{c}}$ depend on the vocabulary size and tokenization and hence do not have a fundamental meaning.
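The three single-variable laws above are simple enough to evaluate directly. The following is a minimal sketch using the constants quoted in this section (which are tied to WebText2 and its tokenizer); the function and constant names are illustrative, not from the paper.

```python
# Constants as quoted in the summary of scaling laws (WebText2-specific).
ALPHA_N, N_C = 0.076, 8.8e13          # N_C in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13          # D_C in tokens
ALPHA_C_MIN, C_C_MIN = 0.050, 3.1e8   # C_C_MIN in PF-days


def loss_from_params(n: float) -> float:
    """L(N): loss when only the parameter count limits performance."""
    return (N_C / n) ** ALPHA_N


def loss_from_data(d: float) -> float:
    """L(D): loss when only the dataset size limits performance."""
    return (D_C / d) ** ALPHA_D


def loss_from_compute(c_min: float) -> float:
    """L(C_min): loss when only optimally allocated compute limits performance."""
    return (C_C_MIN / c_min) ** ALPHA_C_MIN


def doubling_factor(alpha: float) -> float:
    """Multiplicative change in loss when the scale variable is doubled."""
    return 2.0 ** (-alpha)


print(round(doubling_factor(ALPHA_N), 3))
```

Doubling $N$ multiplies the predicted loss by about 0.95, matching the example in the text; the same helper applied to $\alpha_{D}$ or $\alpha_{C}^{\mathrm{min}}$ gives the corresponding factors for data and compute.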
The critical batch size, which determines the speed/efficiency tradeoff for data parallelism ([MKAT18]), also roughly obeys a power law in $L$ :
$$
B_{\mathrm{crit}}(L)=\frac{B_{*}}{L^{1/\alpha_{B}}},\qquad B_{*}\sim2\cdot10^{8}~\text{tokens},\quad\alpha_{B}\sim0.21\tag{1.4}
$$
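To make the critical batch size relation concrete, here is a small sketch that evaluates it with the constants quoted above. The names `B_STAR`, `ALPHA_B`, and `critical_batch_size` are illustrative choices.

```python
B_STAR = 2e8     # tokens, as quoted above
ALPHA_B = 0.21


def critical_batch_size(loss: float) -> float:
    """B_crit(L) = B_* / L**(1 / alpha_B), measured in tokens."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


# The critical batch size grows quickly as the loss falls.
for loss in (4.0, 3.0, 2.0):
    print(loss, f"{critical_batch_size(loss):.3g}")
```

Because $1/\alpha_{B}$ is close to 5, modest reductions in loss correspond to large increases in the batch size that can be used efficiently.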
Equations (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to $D\propto N^{\frac{\alpha_{N}}{\alpha_{D}}}\sim N^{0.74}$. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on $N$ and $D$ and the degree of overfitting:
$$
L(N,D)=\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}}\tag{1.5}
$$
with fits pictured on the left in Figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.
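The combined law for $L(N,D)$ can be sketched in a few lines. As an assumption, the defaults below reuse the single-variable constants quoted for $L(N)$ and $L(D)$; constants fitted jointly to the combined form may differ somewhat, so treat the outputs as illustrative. The function name `loss_nd` is hypothetical.

```python
def loss_nd(n: float, d: float,
            alpha_n: float = 0.076, alpha_d: float = 0.095,
            n_c: float = 8.8e13, d_c: float = 5.4e13) -> float:
    """L(N, D) = [(N_c / N)**(alpha_N / alpha_D) + D_c / D]**alpha_D."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d


# With effectively unlimited data the form reduces to L(N);
# with an effectively unlimited model it reduces to L(D).
n, d = 1e8, 1e9
print(loss_nd(n, 1e30), (8.8e13 / n) ** 0.076)
print(loss_nd(1e40, d), (5.4e13 / d) ** 0.095)
```

Checking both limits is a useful sanity test when re-fitting the constants on another dataset.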
When training a given model for a finite number of parameter update steps $S$ in the infinite data limit, after an initial transient period, the learning curves can be accurately fit by (see the right of Figure 4)
$$
L(N,S)=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S_{\mathrm{min}}(S)}\right)^{\alpha_{S}}\tag{1.6}
$$
where $S_{c}\approx2.1\times10^{3}$ and $\alpha_{S}\approx0.76$ , and $S_{\mathrm{min}}(S)$ is the minimum possible number of optimization steps (parameter updates) estimated using Equation (5.4).
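The learning-curve form above can be evaluated once $S_{\mathrm{min}}$ is known. The sketch below takes $S_{\mathrm{min}}$ as a direct input rather than deriving it from $S$, since Equation (5.4) is not reproduced in this section. It also assumes, for illustration only, the $N_{c}$ and $\alpha_{N}$ values quoted for $L(N)$; `loss_ns` is a hypothetical name.

```python
def loss_ns(n: float, s_min: float,
            alpha_n: float = 0.076, n_c: float = 8.8e13,
            alpha_s: float = 0.76, s_c: float = 2.1e3) -> float:
    """L(N, S) = (N_c / N)**alpha_N + (S_c / S_min)**alpha_S."""
    return (n_c / n) ** alpha_n + (s_c / s_min) ** alpha_s


# The second term shrinks as training proceeds, leaving the L(N) floor.
for s_min in (1e3, 1e4, 1e5):
    print(int(s_min), round(loss_ns(1e8, s_min), 3))
```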
When training within a fixed compute budget $C$ , but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size $N$ , optimal batch size $B$ , optimal number of steps $S$ , and dataset size $D$ should grow as
$$
N\propto C^{\alpha_{C}^{\mathrm{min}}/\alpha_{N}},\quad B\propto C^{\alpha_{C}^{\mathrm{min}}/\alpha_{B}},\quad S\propto C^{\alpha_{C}^{\mathrm{min}}/\alpha_{S}},\quad D=B\cdot S\tag{1.7}
$$
with
$$
\alpha_{C}^{\mathrm{min}}=1/\left(1/\alpha_{S}+1/\alpha_{B}+1/\alpha_{N}\right)\tag{1.8}
$$
which closely matches the empirically optimal results $N\propto C_{\mathrm{min}}^{0.73}$, $B\propto C_{\mathrm{min}}^{0.24}$, and $S\propto C_{\mathrm{min}}^{0.03}$. As the computational budget $C$ increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size (see Figure 3). This also implies that as models grow larger, they become increasingly sample efficient. In practice, researchers typically train smaller models for longer than would be maximally compute-efficient because of hardware constraints. Optimal performance depends on total compute as a power law (see Equation (1.3)).
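The predicted allocation exponents follow from simple arithmetic on $\alpha_{S}$, $\alpha_{B}$, and $\alpha_{N}$. The sketch below plugs in the rounded values quoted in this section, so its outputs are only approximate and land near, not exactly on, the empirical exponents quoted above. Variable names are illustrative.

```python
ALPHA_S, ALPHA_B, ALPHA_N = 0.76, 0.21, 0.076

# alpha_C^min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)
alpha_c_min = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)

# Exponents of C in N ∝ C^x, B ∝ C^x, S ∝ C^x.
exponents = {
    "N": alpha_c_min / ALPHA_N,
    "B": alpha_c_min / ALPHA_B,
    "S": alpha_c_min / ALPHA_S,
}
# D = B * S, so its exponent is the sum of the B and S exponents.
exponents["D"] = exponents["B"] + exponents["S"]

print(round(alpha_c_min, 3))
for name, value in exponents.items():
    print(name, round(value, 2))
```

By construction the $N$, $B$, and $S$ exponents sum to one, which reflects that compute scales as the product of model size, batch size, and number of steps.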
We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers $[\mathrm{DGV}^{+}18]$ .
1.3 Notation
We use the following notation:
2 Background and Methods
We train language models on WebText2, an extended version of the WebText $[\mathrm{RWC}^{+}19]$ dataset, tokenized using byte-pair encoding [SHB15] with a vocabulary size $n_{\mathrm{vocab}}=50257$. We optimize the autoregressive log-likelihood (i.e. cross-entropy loss) averaged over a 1024-token context, which is also our principal performance metric. We record the loss on the WebText2 test distribution and on a selection of other text distributions. We primarily train decoder-only [$\mathrm{LSP}^{+}18$, RNSS18] Transformer $[\mathrm{VSP}^{+}17]$ models, though we also train LSTM models and Universal Transformers $[\mathrm{DGV}^{+}18]$ for comparison.
2.1 Parameter and Compute Scaling of Transformers
We parameterize the Transformer architecture using hyperparameters $n_{\mathrm{layer}}$ (number of layers), $d_{\mathrm{model}}$ (dimension of the residual stream), $d_{\mathrm{ff}}$ (dimension of the intermediate feed-forward layer), $d_{\mathrm{attn}}$ (dimension of the attention output), and $n_{\mathrm{heads}}$ (number of attention heads per layer). We include $n_{\mathrm{ctx}}$ tokens in the input context, with $n_{\mathrm{ctx}}=1024$ except where otherwise noted.
We use $N$ to denote the model size, which we define as the number of non-embedding parameters
$$
\begin{array}{r l}N&\approx2d_{\mathrm{model}}n_{\mathrm{layer}}\left(2d_{\mathrm{attn}}+d_{\mathrm{ff}}\right)\\&=12n_{\mathrm{layer}}d_{\mathrm{model}}^{2}\quad\text{with the standard}\quad d_{\mathrm{attn}}=d_{\mathrm{ff}}/4=d_{\mathrm{model}}\end{array}
$$
where we have excluded biases and other sub-leading terms. Our models also have $n_{\mathrm{vocab}}d_{\mathrm{model}}$ parameters in an embedding matrix, and use $n_{\mathrm{ctx}}d_{\mathrm{model}}$ parameters for positional embeddings, but we do not include these when discussing the ‘model size’ $N$ ; we will see that this produces significantly cleaner scaling laws.
Evaluating a forward pass of the Transformer involves roughly
$$
C_{\mathrm{forward}}\approx2N+2n_{\mathrm{layer}}n_{\mathrm{ctx}}d_{\mathrm{model}}
$$
add-multiply operations, where the factor of two comes from the multiply-accumulate operation used in matrix multiplication. A more detailed per-operation parameter and compute count is included in Table 1.
Table 1 Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading terms such as nonlinearities, biases, and layer normalization are omitted.
Operation | Parameters | FLOPs per Token |
---|---|---|
Embed | $(n_{\mathrm{vocab}}+n_{\mathrm{ctx}})d_{\mathrm{model}}$ | $4d_{\mathrm{model}}$ |
Attention: QKV | $n_{\mathrm{layer}}d_{\mathrm{model}}3d_{\mathrm{attn}}$ | $2n_{\mathrm{layer}}d_{\mathrm{model}}3d_{\mathrm{attn}}$ |
Attention: Mask | - | $2n_{\mathrm{layer}}n_{\mathrm{ctx}}d_{\mathrm{attn}}$ |
Attention: Project | $n_{\mathrm{layer}}d_{\mathrm{attn}}d_{\mathrm{model}}$ | $2n_{\mathrm{layer}}d_{\mathrm{attn}}d_{\mathrm{model}}$ |
Feedforward | $n_{\mathrm{layer}}2d_{\mathrm{model}}d_{\mathrm{ff}}$ | $2n_{\mathrm{layer}}2d_{\mathrm{model}}d_{\mathrm{ff}}$ |
De-embed | - | $2d_{\mathrm{model}}n_{\mathrm{vocab}}$ |
Total (Non-Embedding) | $N=2d_{\mathrm{model}}n_{\mathrm{layer}}(2d_{\mathrm{attn}}+d_{\mathrm{ff}})$ | $C_{\mathrm{forward}}=2N+2n_{\mathrm{layer}}n_{\mathrm{ctx}}d_{\mathrm{attn}}$ |
For contexts and models with $d_{\mathrm{model}}>n_{\mathrm{ctx}}/12$, the context-dependent computational cost per token is a relatively small fraction of the total compute. Since we primarily study models where $d_{\mathrm{model}}\gg n_{\mathrm{ctx}}/12$, we do not include context-dependent terms in our training compute estimate. Accounting for the backwards pass (approximately twice the compute of the forwards pass), we then define the estimated non-embedding compute as $C\approx6N$ floating point operations per training token.
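The parameter and compute estimates in this subsection translate directly into code. The sketch below assumes the standard shape $d_{\mathrm{attn}}=d_{\mathrm{ff}}/4=d_{\mathrm{model}}$ whenever `d_attn` or `d_ff` is not given; the function names are illustrative, not from the paper.

```python
from typing import Optional


def non_embedding_params(n_layer: int, d_model: int,
                         d_attn: Optional[int] = None,
                         d_ff: Optional[int] = None) -> int:
    """N ≈ 2 * d_model * n_layer * (2 * d_attn + d_ff), ignoring biases."""
    d_attn = d_model if d_attn is None else d_attn
    d_ff = 4 * d_model if d_ff is None else d_ff
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


def forward_flops_per_token(n_layer: int, d_model: int, n_ctx: int = 1024) -> int:
    """C_forward ≈ 2N + 2 * n_layer * n_ctx * d_model (standard shape assumed)."""
    n = non_embedding_params(n_layer, d_model)
    return 2 * n + 2 * n_layer * n_ctx * d_model


def training_flops_per_token(n: int) -> int:
    """C ≈ 6N: forward pass (2N) plus a backward pass of about twice that."""
    return 6 * n


n = non_embedding_params(48, 1600)
print(n)                                   # equals 12 * 48 * 1600**2
print(forward_flops_per_token(48, 1600) / (2 * n))
print(training_flops_per_token(n))
```

For the (48, 1600) shape the ratio printed on the second line is only slightly above 1, which illustrates why the context-dependent term is dropped from the training compute estimate.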
2.2 Training Procedures
Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed $2.5\times10^{5}$ steps with a batch size of 512 sequences of 1024 tokens. Due to memory constraints, our largest models (more than 1B parameters) were trained with Adafactor [SS18]. We experimented with a variety of learning rates and schedules, as discussed in Appendix D.6. We found that results at convergence were largely independent of learning rate schedule. Unless otherwise noted, all training runs included in our data used a learning rate schedule with a 3000 step linear warmup followed by a cosine decay to zero.
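The default schedule described above (3000-step linear warmup, then cosine decay to zero) might be sketched as follows. The peak learning rate is not stated in this section, so `peak_lr` is a free parameter here, and the choice to run the cosine over the steps remaining after warmup is an assumption.

```python
import math

TOKENS_PER_BATCH = 512 * 1024   # 512 sequences of 1024 tokens
TOTAL_STEPS = 250_000           # 2.5e5 steps
WARMUP_STEPS = 3000


def learning_rate(step: int, peak_lr: float,
                  warmup_steps: int = WARMUP_STEPS,
                  total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# peak_lr = 1.0 makes the output a fraction of the (unspecified) peak value.
for step in (0, 1500, 3000, 126_500, 250_000):
    print(step, round(learning_rate(step, peak_lr=1.0), 4))
print(TOKENS_PER_BATCH * TOTAL_STEPS)   # tokens processed in a default run
```

Note that the default batch of 512 sequences of 1024 tokens equals $2^{19}$ tokens.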
2.3 Datasets
We train our models on an extended version of the WebText dataset described in $[\mathrm{RWC}^{+}19]$. The original WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at least 3 karma. In the second version, WebText2, we added outbound Reddit links from the period of January to October 2018, also with a minimum of 3 karma. The karma threshold served as a heuristic for whether people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k Python library. In total, the dataset consists of $20.3\mathrm{M}$ documents containing 96 GB of text and $1.62\times10^{10}$ words (as defined by wc). We then apply the reversible tokenizer described in $[\mathrm{RWC}^{+}19]$, which yields $2.29\times10^{10}$ tokens. We reserve $6.6\times10^{8}$ of these tokens for use as a test set, and we also test on similarly-prepared samples of Books Corpus $[\mathrm{ZKZ}^{+}15]$, Common Crawl [Fou], English Wikipedia, and a collection of publicly-available Internet Books.
3 Empirical Results and Basic Power Laws
To characterize language model scaling we train a wide variety of models, varying a number of factors including:
• Model size (ranging in size from 768 to 1.5 billion non-embedding parameters)
• Dataset size (ranging from 22 million to 23 billion tokens)
• Shape (including depth, width, attention heads, and feed-forward dimension)
• Context length (1024 for most runs, though we also experiment with shorter contexts)
• Batch size ($2^{19}$ for most runs, but we also vary it to measure the critical batch size)
Figure 5 Performance depends very mildly on model shape when the total number of non-embedding parameters $N$ is held fixed. The loss varies only a few percent over a wide range of shapes. Small differences in parameter counts are compensated for by using the fit to $L(N)$ as a baseline. Aspect ratio in particular can vary by a factor of 40 while only slightly impacting performance; an $(n_{\mathrm{layer}},d_{\mathrm{model}})=(6,4288)$ model reaches a loss within $3\%$ of the (48, 1600) model used in $[\mathrm{RWC}^{+}19]$.
Figure 6 Left: When we include embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When we exclude embedding parameters, the performance of models with different depths converges to a single trend. Only models with fewer than 2 layers or with extreme depth-to-width ratios deviate significantly from the trend.
In this section we will display data along with empirically-motivated fits, deferring theoretical analysis to later sections.
3.1 Approximate Transformer Shape and Hyperparameter Independence
Transformer performance depends very weakly on the shape parameters $n_{\mathrm{layer}}$, $n_{\mathrm{heads}}$, and $d_{\mathrm{ff}}$ when we hold the total non-embedding parameter count $N$ fixed. To establish these results we trained models with fixed size while varying a single hyperparameter. This was simplest for the case of $n_{\mathrm{heads}}$. When varying $n_{\mathrm{layer}}$, we simultaneously varied $d_{\mathrm{model}}$ while keeping $N\approx12n_{\mathrm{layer}}d_{\mathrm{model}}^{2}$ fixed. Similarly, to vary $d_{\mathrm{ff}}$ at fixed model size we also simultaneously varied the $d_{\mathrm{model}}$ parameter, as required by the parameter counts in Table 1. Independence of $n_{\mathrm{layer}}$ would follow if deeper Transformers effectively behave as ensembles of shallower models, as has been suggested for ResNets [VWB16]. The results are shown in Figure 5.
3.2 Performance with Non-Embedding Parameter Count $N$
In Figure 6 we display the performance of a wide variety of models, ranging from small models with shape $(n_{\mathrm{layer}},d_{\mathrm{model}})=(2,128)$ through billion-parameter models, ranging in shape from (6, 4288) through (207, 768). Here we have trained to near convergence on the full WebText2 dataset and observe no overfitting (except possibly for the very largest models).
As shown in Figure 1, we find a steady trend with non-embedding parameter count $N$ , which can be fit to the first term of Equation (1.5), so that
$$
L(N)\approx\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}
$$
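A power law of this form is a straight line in log-log coordinates, so $\alpha_{N}$ and $N_{c}$ can be estimated with ordinary least squares on $(\log N,\log L)$. The sketch below is one simple way to do this and is not necessarily the fitting procedure used for the reported values. The data points are synthetic and noise-free, generated from the law itself, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit log L = alpha * (log N_c - log N) by least squares; return (alpha, n_c)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    alpha = -slope                        # slope of the line is -alpha
    n_c = math.exp(intercept / alpha)     # intercept is alpha * log(N_c)
    return alpha, n_c


# Synthetic points generated from the law (not measurements from the paper).
sizes = [1e5, 1e6, 1e7, 1e8, 1e9]
losses = [(8.8e13 / n) ** 0.076 for n in sizes]
alpha_fit, n_c_fit = fit_power_law(sizes, losses)
print(alpha_fit, n_c_fit)
```

On real measurements the recovered $N_{c}$ is very sensitive to small changes in the slope, which is consistent with the remark that its precise value has no fundamental meaning.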
Figure 7
To observe these trends it is crucial to study performance as a function of $N$ ; if we instead use the total parameter count (including the embedding parameters) the trend is somewhat obscured (see Figure 6). This suggests that the embedding matrix can be made smaller without impacting performance, as has been seen in recent work $[\mathrm{LCG^{+}19}]$ .
Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in $N$ with nearly identical power, as shown in Figure 8.
虽然这些模型是在 WebText2 数据集上训练的,但它们在其他各种数据集上的测试损失也与 $N$ 呈幂律关系,且幂几乎相同,如图 8 所示。
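A power law of this form is a straight line in log-log space, so it can be fit by ordinary least squares on $(\log N,\log L)$. The following is only a minimal sketch of that idea, not the fitting procedure used for the figures: the helper name `fit_power_law` is hypothetical, and the data points are synthetic, generated from illustrative constants rather than from measured losses.

```python
import math

def fit_power_law(ns, losses):
    """Least-squares fit of L(N) = (N_c / N) ** alpha in log-log space.

    log L = -alpha * log N + alpha * log N_c, so the slope gives -alpha
    and the intercept gives alpha * log N_c. Returns (alpha, n_c).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return alpha, n_c

# Synthetic, noise-free points from illustrative constants (assumption:
# these are NOT measured losses, only a self-consistency check).
TRUE_ALPHA, TRUE_NC = 0.076, 6.4e13
sizes = [10 ** k for k in range(5, 10)]
synthetic = [(TRUE_NC / n) ** TRUE_ALPHA for n in sizes]
alpha_hat, nc_hat = fit_power_law(sizes, synthetic)
```

On real data the points would carry noise, and a fit restricted to the non-embedding parameter count $N$ is what the text above argues gives the cleanest trend.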
3.2.1 Comparing to LSTMs and Universal Transformers
3.2.1 与 LSTMs 和 Universal Transformers 比较
In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count $N$. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns.
图 7: 我们比较了 LSTM 和 Transformer 在非嵌入参数数量 $N$ 下的性能。LSTM 使用相同的 数据集和上下文长度进行训练。从这些图表中可以看出,对于出现在上下文早期的 Token,LSTM 的表现与 Transformer 相当,但对于较晚出现的 Token,LSTM 无法匹敌 Transformer 的性能。我们在附录 D.5 中展示了性能与上下文位置之间的幂律关系,其中较大模型的幂次逐渐增大,表明其快速识别模式的能力有所提高。
We also compare the performance of standard Transformers to recurrent Transformers $[\mathrm{DGV}^{+}18]$ in Figure 17 in the appendix. These models re-use parameters, and so perform slightly better as a function of $N$ , at the cost of additional compute per-parameter.
我们还比较了标准 Transformer 和循环 Transformer $[\mathrm{DGV}^{+}18]$ 在附录中的图 17 中的性能。这些模型重用了参数,因此在 $N$ 的函数下表现略好,但每个参数的计算成本更高。
3.2.2 Generalization Among Data Distributions
3.2.2 数据分布间的泛化能力
We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8).
我们还在一组额外的文本数据分布上测试了我们的模型。这些数据集上的测试损失与模型大小的关系如图 8 所示;在所有情况下,模型仅在 WebText2 数据集上进行了训练。我们发现,在这些其他数据分布上的损失随着模型大小平滑改善,与 WebText2 上的改善直接平行。我们发现泛化几乎完全取决于分布内验证损失,而与训练时长或接近收敛的程度无关。我们也没有观察到对模型深度的依赖(见附录 D.8)。
图 8: 模型大小与测试损失的关系
3.3 Performance with Dataset Size and Compute
3.3 数据集规模和计算资源下的性能表现
We display empirical trends for the test loss as a function of dataset size $D$ (in tokens) and training compute $C$ in Figure 1.
我们在图 1 中展示了测试损失随数据集大小 $D$ (以 Token 计)和训练计算量 $C$ 的经验趋势。
For the trend with $D$ we trained a model with $(n_{\mathrm{layer}},n_{\mathrm{embd}})=(36,1280)$ on fixed subsets of the WebText2 dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be fit with simple power-law
对于趋势 $D$ ,我们训练了一个模型,其参数为 $(n_{\mathrm{layer}}, n_{\mathrm{embd}}) = (36, 1280)$,基于 WebText2 数据集的固定子集。我们在测试损失停止下降时停止训练。我们发现,最终的测试损失可以用简单的幂律拟合。
$$
L(D)\approx\left(\frac{D_{c}}{D}\right)^{\alpha_{D}}
$$
$$
L(D) \approx \left( \frac{D_c}{D} \right)^{\alpha_D}
$$
in the dataset size. The data and fit appear in Figure 1.
在数据集大小方面。数据和拟合结果如图 1 所示。
The total amount of non-embedding compute used during training can be estimated as $C=6N B S$ , where $B$ is the batch size, $S$ is the number of parameter updates, and the factor of 6 accounts for the forward and backward passes. Thus for a given value of $C$ we can scan over all models with various $N$ to find the model
训练期间使用的非嵌入计算总量可以估计为 $C=6N B S$ ,其中 $B$ 是批量大小,$S$ 是参数更新次数,因子 6 考虑了前向和后向传递。因此,对于给定的 $C$ 值,我们可以扫描所有具有不同 $N$ 的模型以找到最优模型
Figure 8 Left: Generalization performance to other data distributions improves smoothly with model size, with only a small and very slowly growing offset from the WebText2 training distribution. Right: Generalization performance depends only on training distribution performance, and not on the phase of training. We compare generalization of converged models (points) to that of a single large model (dashed curves) as it trains.
图 8 左:模型大小平滑地提高了对其他数据分布的泛化性能,与 WebText2 训练分布相比,仅存在很小且增长非常缓慢的偏差。右:泛化性能仅取决于训练分布的性能,而不受训练阶段的影响。我们将收敛模型(点)的泛化性能与单个大模型(虚线曲线)在训练过程中的泛化性能进行比较。
with the best performance on step $S=\frac{C}{6NB}$. Note that in these results the batch size $B$ remains fixed for all models, which means that these empirical results are not truly optimal. We will account for this in later sections using an adjusted $C_{\mathrm{min}}$ to produce cleaner trends.
在步骤 $S=\frac{C}{6NB}$ 上表现最佳。请注意,在这些结果中,批处理大小 $B$ 对所有模型保持固定,这意味着这些实验结果并非真正最优。我们将在后续章节中使用调整后的 $C_{\mathrm{min}}$ 来产生更清晰的趋势。
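The compute estimate $C=6NBS$ and its inversion for the step count can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the example model size and step count are made up, and the conversion to PF-days assumes one PF-day means $10^{15}$ floating-point operations per second sustained for 24 hours.

```python
def training_compute(n_params, batch_tokens, steps):
    """Non-embedding training compute in FLOPs, C = 6 * N * B * S."""
    return 6 * n_params * batch_tokens * steps

def steps_for_budget(compute, n_params, batch_tokens):
    """Invert C = 6 * N * B * S for the step count S at fixed N and B."""
    return compute / (6 * n_params * batch_tokens)

# Assumption: 1 PF-day = 1e15 FLOP/s sustained for one day.
PF_DAY = 1e15 * 24 * 3600

# Made-up example: a 1e8-parameter model, batch of 2**19 tokens, 1e5 steps.
example_compute = training_compute(10 ** 8, 2 ** 19, 10 ** 5)
example_pf_days = example_compute / PF_DAY
recovered_steps = steps_for_budget(example_compute, 10 ** 8, 2 ** 19)
```

Scanning over models at a fixed budget then amounts to evaluating each model's learning curve at its own `steps_for_budget(C, N, B)` and keeping the lowest loss.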
The result appears as the heavy black line on the left-hand plot in Figure 1. It can be fit with
结果如图 1 左侧图中的粗黑线所示。它可以拟合
$$
L(C)\approx\left(\frac{C_{c}}{C}\right)^{\alpha_{C}}
$$
$$
L(C) \approx \left( \frac{C_c}{C} \right)^{\alpha_C}
$$
The figure also includes images of individual learning curves to clarify when individual models are optimal. We will study the optimal allocation of compute more closely later on. The data strongly suggests that sample efficiency improves with model size, and we also illustrate this directly in Figure 19 in the appendix.
该图还包含了各个学习曲线的图像,以澄清各个模型在何时达到最优。我们将在后面更详细地研究计算资源的最佳分配。数据强烈表明样本效率随着模型规模的增大而提高,我们也在附录中的图 19 中直接展示了这一点。
4 Charting the Infinite Data Limit and Overfitting
4 绘制无限数据极限与过拟合
In Section 3 we found a number of basic scaling laws for language modeling performance. Here we will study the performance of a model of size $N$ trained on a dataset with $D$ tokens while varying $N$ and $D$ simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control.
在第 3 节中,我们发现了一些语言建模性能的基本缩放定律。在这里,我们将研究在包含 $D$ Token 的数据集上训练的大小为 $N$ 的模型的性能,同时变化 $N$ 和 $D$ 。我们将通过实证证明,最优训练的测试损失符合公式 (1.5) 的缩放定律。这为我们提供了指导,说明我们需要多少数据来训练不断增大的模型,同时保持过拟合在可控范围内。
4.1 Proposed $L(N,D)$ Equation
4.1 提出的 $L(N,D)$ 方程
We have chosen the parameterization (1.5) (repeated here for convenience):
我们选择了参数化 (1.5) (为方便起见,在此重复):
$$
L(N,D)=\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}}
$$
$$
L(N,D)=\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}}
$$
using three principles:
使用三个原则:
1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The parameterization of $L(N,D)$ (and all models of the loss) must naturally allow for such a rescaling.
2. Fixing $D$ and sending $N\rightarrow\infty$, the overall loss should approach $L(D)$. Conversely, fixing $N$ and sending $D\rightarrow\infty$, the loss must approach $L(N)$.
3. $L(N,D)$ should be analytic at $D=\infty$, so that it has a series expansion in $1/D$ with integer powers. Theoretical support for this principle is significantly weaker than for the first two.
- 词汇量或 Token 化的变化预计将按比例调整损失的整体因子。$L(N,D)$ 的参数化(以及所有损失模型)必须自然允许这种重新缩放。
- 固定 $D$ 并使 $N\rightarrow\infty$,整体损失应接近 $L(D)$。相反,固定 $N$ 并使 $D\rightarrow\infty$,损失必须接近 $L(N)$。
- $L(N,D)$ 在 $D=\infty$ 处应该是解析的,因此它具有以 $1/D$ 展开的级数,并且幂为整数。对这一原则的理论支持明显弱于前两个。
Our choice of $L(N,D)$ satisfies the first requirement because we can rescale $N_{c},D_{c}$ with changes in the vocabulary. This also implies that the values of $N_{c},D_{c}$ have no fundamental meaning.
我们的选择 $L(N,D)$ 满足第一个要求,因为我们可以通过调整词汇的变化来重新缩放 $N_{c},D_{c}$ 。这也意味着 $N_{c},D_{c}$ 的值没有根本性的意义。
Figure 9 The early-stopped test loss $L(N,D)$ depends predictably on the dataset size $D$ and model size $N$ according to Equation (1.5). Left: For large $D$, performance is a straight power law in $N$. For a smaller fixed $D$, performance stops improving as $N$ increases and the model begins to overfit. (The reverse is also true, see Figure 4.) Right: The extent of overfitting depends predominantly on the ratio $N^{\frac{\alpha_{N}}{\alpha_{D}}}/D$, as predicted in Equation (4.3). The line is our fit to that equation.
图 9: 早期停止的测试损失 $L(N,D)$ 可预测地依赖于数据集大小 $D$ 和模型大小 $N$,如公式 (1.5) 所示。左:对于较大的 $D$,性能是 $N$ 的幂律直线。对于较小的固定 $D$,随着 $N$ 增加,性能停止改善并且模型开始过拟合。(反之亦然,见图 4。)右:过拟合的程度主要取决于比率 $N^{\frac{\alpha_{N}}{\alpha_{D}}}/D$,如公式 (4.3) 预测的那样。线是我们对该公式的拟合。
Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we expect that larger models should always perform better than smaller models. But with fixed finite $D$, we also do not expect any model to be capable of approaching the best possible loss (i.e., the entropy of text). Similarly, a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note that knowledge of $L(N)$ at infinite $D$ and $L(D)$ at infinite $N$ fully determines all the parameters in $L(N,D)$.
由于我们在测试损失停止改善时提前停止训练,并以相同的方式优化所有模型,我们期望较大的模型应该总是比小的模型表现更好。但当 $D$ 固定时,我们也并不期望任何模型能够达到最佳可能的损失(即文本的熵)。同样,固定大小的模型将受到容量限制。这些考虑促使我们提出第二个原则。请注意,在无限 $D$ 下的 $L(N)$ 和在无限 $N$ 下的 $L(D)$ 完全决定了 $L(N,D)$ 中的所有参数。
The third principle is more speculative. There is a simple and general reason one might expect overfitting to scale $\propto1/D$ at very large $D$. Overfitting should be related to the variance or the signal-to-noise ratio of the dataset [AS17], and this scales as $1/D$. This expectation should hold for any smooth loss function, since we expect to be able to expand the loss about the $D\rightarrow\infty$ limit. However, this argument assumes that $1/D$ corrections dominate over other sources of variance, such as the finite batch size and other limits on the efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability.
第三原则更具推测性。有一个简单而普遍的原因,可以预期过拟合在非常大的 $D$ 下按 $\propto1/D$ 缩放。过拟合应与数据集的方差或信噪比相关 [AS17],这按 $1/D$ 缩放。这种预期应对任何平滑损失函数都成立,因为我们期望能够在 $D\rightarrow\infty$ 极限附近展开损失函数。然而,这一论点假设 $1/D$ 修正主导其他方差来源,例如有限批量大小和其他优化效能限制。没有实证确认,我们对其适用性不会有太大信心。
Our third principle explains the asymmetry between the roles of $N$ and $D$ in Equation (1.5). Very similar symmetric expressions are possible, but they would not have a $1/D$ expansion with integer powers, and would require the introduction of an additional parameter.
我们的第三原则解释了方程 (1.5) 中 $N$ 和 $D$ 角色之间的不对称性。非常相似的对称表达式是可能的,但它们不会有 $1/D$ 的整数幂展开,并且需要引入一个额外的参数。
In any case, we will see that our equation for $L(N,D)$ fits the data well, which is the most important justification for our $L(N,D)$ ansatz.
在任何情况下,我们将看到我们的公式 $L(N,D)$ 与数据非常吻合,这是对我们 $L(N,D)$ 假设的最重要证明。
4.2 Results
4.2 结果
We regularize all our models with $10\%$ dropout, and by tracking test loss and stopping once it is no longer decreasing. The results are displayed in Figure 9, including a fit to the four parameters $\alpha_{N},\alpha_{D},N_{c},D_{c}$ in Equation (1.5):
我们对所有模型使用 10% dropout 进行正则化,并通过跟踪测试损失并在损失不再减少时停止训练。结果如图 9 所示,包括对公式 (1.5) 中的四个参数 $\alpha_{N},\alpha_{D},N_{c},D_{c}$ 的拟合:
Table 2 Fits to $L(N,D)$
表 2: $L(N,D)$ 的拟合结果
Parameter | $\alpha_{N}$ | $\alpha_{D}$ | $N_{c}$ | $D_{c}$ |
---|---|---|---|---|
Value | 0.076 | 0.103 | $6.4\times10^{13}$ | $1.83\times10^{13}$ |
We obtain an excellent fit, with the exception of the runs where the dataset has been reduced by a factor of 1024, to about $2\times10^{7}$ tokens. With such a small dataset, an epoch consists of only 40 parameter updates. Perhaps such a tiny dataset represents a different regime for language modeling, as overfitting happens very early in training (see Figure 16). Also note that the parameters differ very slightly from those obtained in Section 3, as here we are fitting the full $L(N,D)$ rather than just $L(N,\infty)$ or $L(\infty,D)$.
我们获得了非常好的拟合效果,例外情况是数据集被缩减了 1024 倍,大约为 $2\times10^{7}$ 个 Token。在如此小的数据集上,一个 epoch 仅包含 40 次参数更新。也许这种极小的数据集代表了语言建模的不同阶段,因为过拟合在训练初期就发生了(见图 16)。另外请注意,这里的参数与第 3 节中获得的参数略有不同,因为我们在这里拟合的是完整的 $L(N,D)$ 而不仅仅是 $L(N,\infty)$ 或 $L(\infty,D)$。
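The fitted form of Equation (1.5) can be evaluated directly. The sketch below plugs in the Table 2 values as printed above and checks the two limits required by the second principle; the function names are hypothetical, and `math.inf` is used here simply as a convenient stand-in for the infinite-data and infinite-model limits.

```python
import math

# Table 2 values as printed above.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.83e13

def loss_nd(n_params, d_tokens):
    """Equation (1.5): early-stopped test loss for model size N and dataset size D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / d_tokens) ** ALPHA_D

def loss_n(n_params):
    """Infinite-data limit, L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n_params) ** ALPHA_N

def loss_d(d_tokens):
    """Infinite-model limit, L(D) = (D_c / D) ** alpha_D."""
    return (D_C / d_tokens) ** ALPHA_D

# Second principle: each limit of L(N, D) recovers the single-variable law.
limit_in_d = loss_nd(1e8, math.inf)
limit_in_n = loss_nd(math.inf, 1e9)
finite_case = loss_nd(1e8, 1e9)
```

With finite $N$ and $D$ the loss exceeds both single-variable limits, which is the overfitting and capacity penalty the equation is meant to capture.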
To chart the borderlands of the infinite data limit, we can directly study the extent of overfitting. For all but the largest models, we see no sign of overfitting when training with the full 22B token WebText2 dataset, so we can take it as representative of $D=\infty$. Thus we can compare finite $D$ to the infinite data limit by
为了描绘无限数据极限的边缘地带,我们可以直接研究过拟合的程度。对于除最大模型之外的所有模型,在使用完整的 22B Token WebText2 数据集进行训练时,我们没有发现过拟合的迹象,因此可以将其视为代表 $D=\infty$ 。因此,我们可以将有限的 $D$ 与无限数据极限进行比较。
Figure 10 The critical batch size $B_{\mathrm{crit}}$ follows a power law in the loss as performance increases, and does not depend directly on the model size. We find that the critical batch size approximately doubles for every $13\%$ decrease in loss. $B_{\mathrm{crit}}$ is measured empirically from the data shown in Figure 18, but it is also roughly predicted by the gradient noise scale, as in [MKAT18].
图 10: 关键批量大小 $B_{\mathrm{crit}}$ 随着性能的提高在损失中遵循幂律,并不直接依赖于模型大小。我们发现关键批量大小大约每减少 $13\%$ 的损失就会翻倍。$B_{\mathrm{crit}}$ 是从图 18 所示的数据中经验测量得出的,但它也大致由梯度噪声尺度预测,如 [MKAT18] 所示。
defining
定义
$$
\delta L(N,D)\equiv\frac{L(N,D)}{L(N,\infty)}-1
$$
$$
\delta L(N,D) \equiv \frac{L(N,D)}{L(N,\infty)} - 1
$$
and studying it as a function of $N$ and $D$. In fact, we see empirically that $\delta L$ depends only on a specific combination of $N$ and $D$, as shown in Figure 16. This follows from the scaling law of Equation (1.5), which implies
并将其作为 $N$ 和 $D$ 的函数进行研究。实际上,我们从经验上看到 $\delta L$ 仅依赖于 $N$ 和 $D$ 的特定组合,如图 16 所示。这遵循公式 (1.5) 的缩放定律,该定律表明
$$
\delta L\approx\left(1+\left(\frac{N}{N_{c}}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}\frac{D_{c}}{D}\right)^{\alpha_{D}}-1
$$
$$
\delta L \approx \left( 1 + \left( \frac{N}{N_c} \right)^{\frac{\alpha_N}{\alpha_D}} \frac{D_c}{D} \right)^{\alpha_D} - 1
$$
Note that at large $D$ this formula also has a series expansion in powers of $1/D$ .
请注意,在较大的 $D$ 下,此公式也有 $1/D$ 的幂级数展开。
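As a worked check using only the definitions above, dividing Equation (1.5) by its $D\rightarrow\infty$ limit $L(N,\infty)=(N_{c}/N)^{\alpha_{N}}$ reproduces the expression for $\delta L$, and a binomial expansion makes the leading $1/D$ behavior explicit. The shorthand $x$ is introduced here only for illustration, and the expansion assumes $x<1$:

$$
\frac{L(N,D)}{L(N,\infty)}=\frac{\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}}}{\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}}=\left(1+x\right)^{\alpha_{D}},\qquad x\equiv\left(\frac{N}{N_{c}}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}\frac{D_{c}}{D}
$$

$$
\delta L=\left(1+x\right)^{\alpha_{D}}-1=\alpha_{D}\,x+\frac{\alpha_{D}(\alpha_{D}-1)}{2}\,x^{2}+O(x^{3})
$$

Since $x\propto1/D$ at fixed $N$, every term is an integer power of $1/D$, consistent with the third principle.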
We estimate that the variation in the loss with different random seeds is roughly 0.02, which means that to avoid overfitting when training to within that threshold of convergence we require
我们估计不同随机种子下的损失变化大约为 0.02,这意味着为了避免过拟合,我们需要在训练时达到该阈值范围内的收敛
$$
D\gtrsim\left(5\times10^{3}\right)N^{0.74}
$$
$$
D \gtrsim \left( 5 \times 10^{3} \right) N^{0.74}
$$
With this relation, models smaller than $10^{9}$ parameters can be trained with minimal overfitting on the 22B token WebText2 dataset, but our largest models will encounter some mild overfitting. More generally, this relation shows that dataset size may grow sub-linearly in model size while avoiding overfitting. Note however that this does not typically represent maximally compute-efficient training. We should also emphasize that we have not optimized regularization (e.g., the dropout probability) while varying dataset and model size.
通过这种关系,参数少于 $10^{9}$ 的模型可以在 22B Token 的 WebText2 数据集上进行训练,并且过拟合最小,但我们的最大模型会遇到一些轻微的过拟合。更一般地说,这种关系表明,在避免过拟合的情况下,数据集大小可以以亚线性的速度随着模型大小增长。然而,需要注意的是,这通常并不代表计算效率最高的训练方式。我们还应强调,在调整数据集和模型大小时,我们没有优化正则化(例如 dropout 概率)。
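The data requirement above is easy to evaluate numerically. The following sketch (with a hypothetical function name) applies the stated relation $D\gtrsim(5\times10^{3})N^{0.74}$ and shows that a $10^{9}$-parameter model lands close to the 22B-token size of WebText2, and that a tenfold larger model needs well under ten times the data.

```python
def min_tokens_to_avoid_overfitting(n_params):
    """Rule-of-thumb dataset size from D >~ 5e3 * N ** 0.74 (in tokens)."""
    return 5e3 * n_params ** 0.74

# Tokens suggested for a 1e9-parameter model, for comparison with the
# 22B-token WebText2 dataset mentioned in the text.
tokens_for_1e9 = min_tokens_to_avoid_overfitting(1e9)

# Sub-linear growth: a 10x larger model needs 10 ** 0.74 times the data.
data_growth_per_10x = (
    min_tokens_to_avoid_overfitting(1e10) / min_tokens_to_avoid_overfitting(1e9)
)
```

The rule is only as good as the 0.02 noise threshold and the fixed dropout setting it was derived under, so it is best read as an order-of-magnitude guide.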
5 Scaling Laws with Model Size and Training Time
5 模型规模和训练时间的扩展定律
In this section we will demonstrate that a simple scaling law provides a good description for the loss as a function of model size $N$ and training time. First we will explain how to use the results of [MKAT18] to define a universal training step $S_{\mathrm{min}}$ , which accounts for the fact that most of our models have not been trained at an optimal batch size. Then we will demonstrate that we can fit the model size and training time dependence of the loss using Equation (1.6). Later we will use these results to predict the optimal allocation of training compute between model size and training time, and then confirm that prediction.
在本节中,我们将展示一个简单的缩放定律可以很好地描述损失与模型大小 $N$ 和训练时间的关系。首先,我们将解释如何使用 [MKAT18] 的结果来定义一个通用的训练步数 $S_{\mathrm{min}}$ ,这考虑了我们的大多数模型没有在最优批量大小下进行训练的事实。然后我们将证明我们可以使用公式 (1.6) 来拟合损失对模型大小和训练时间的依赖关系。之后我们将使用这些结果来预测训练计算资源在模型大小和训练时间之间的最优分配,并验证该预测。
5.1 Adjustment for Training at $B_{\mathrm{crit}}(L)$
5.1 在 $B_{\mathrm{crit}}(L)$ 处训练的调整
A simple empirical theory for the batch size dependence of training was developed in [MKAT18] (see also $[\mathrm{SLA}^{+}18, \mathrm{ZLN}^{+}19]$). It was argued that there is a critical batch size $B_{\mathrm{crit}}$ for training; for $B$ up to $B_{\mathrm{crit}}$ the batch size can be increased with very minimal degradation in compute-efficiency, whereas for $B>B_{\mathrm{crit}}$ increases in $B$ result in diminishing returns. It was also argued that the gradient noise scale provides a simple prediction for $B_{\mathrm{crit}}$, and that neither depends directly on model size except through the value of the loss that has been attained. These results can be used to predict how training time and compute will vary with the batch size. To utilize both training time and compute as effectively as possible, it is best to train with a batch size $B\approx B_{\mathrm{crit}}$. Training at $B\gg B_{\mathrm{crit}}$ minimizes the number of training steps, while $B\ll B_{\mathrm{crit}}$ minimizes the use of compute.
在 [MKAT18] 中发展了一个简单的经验理论来解释训练过程中批量大小的依赖性(另见 $[\mathrm{SLA}^{+}18, \mathrm{ZLN}^{+}19]$)。该理论认为,在训练中存在一个临界批量大小 $B_{\mathrm{crit}}$;对于 $B$ 小于或等于 $B_{\mathrm{crit}}$ 的情况,增加批量大小对计算效率的影响非常小;而对于 $B>B_{\mathrm{crit}}$ 的情况,继续增加 $B$ 会导致收益递减。还指出,梯度噪声尺度可以简单预测 $B_{\mathrm{crit}}$,并且两者都不直接依赖于模型大小,除非通过已达到的损失值。这些结果可用于预测训练时间和计算资源如何随批量大小变化。为了尽可能有效地利用训练时间和计算资源,最好使用接近临界批量大小 $B\approx B_{\mathrm{crit}}$ 进行训练。在 $B\gg B_{\mathrm{crit}}$ 下训练可最小化训练步骤的数量,而在 $B\ll B_{\mathrm{crit}}$ 下训练则可最小化计算资源的使用。
More specifically, it was demonstrated that for a wide variety of neural network tasks, the number of training steps $S$ and the number of data examples processed $E=B S$ satisfy the simple relation
更具体地说,已经证明对于各种各样的神经网络任务,训练步骤的数量 $S$ 和处理的数据示例数量 $E=B S$ 满足简单的关系
$$
\left({\frac{S}{S_{\mathrm{min}}}}-1\right)\left({\frac{E}{E_{\mathrm{min}}}}-1\right)=1
$$
$$
\left( \frac{S}{S_{\mathrm{min}}} - 1 \right) \left( \frac{E}{E_{\mathrm{min}}} - 1 \right) = 1
$$
when training to any fixed value of the loss $L$ . Here $S_{\mathrm{min}}$ is the minimum number of steps necessary to reach $L$ , while $E_{\mathrm{min}}$ is the minimum number of data examples that must be processed.
在训练到任何固定的损失值 $L$ 时。这里 $S_{\mathrm{min}}$ 是达到 $L$ 所需的最小步数,而 $E_{\mathrm{min}}$ 是必须处理的最小数据样本数量。
We demonstrate the relation (5.1) for Transformers in Figure 18 in the appendix. This relation defines the critical batch size
我们在附录的图 18 中展示了 Transformer 的关系 (5.1) 。此关系定义了关键批次大小
$$
B_{\mathrm{crit}}(L)\equiv\frac{E_{\mathrm{min}}}{S_{\mathrm{min}}}
$$
$$
B_ {\mathrm{crit}} (L) \equiv \frac {E_ {\mathrm{min}}} {S_ {\mathrm{min}}}
$$
which is a function of the target value of the loss. Training at the critical batch size makes a roughly optimal time/compute tradeoff, requiring $2S_{\mathrm{min}}$ training steps and processing $E=2E_{\mathrm{min}}$ data examples.
这是损失目标值的函数。在临界批量大小进行训练可以大致实现时间和计算资源的最优权衡,需要 $2S_{\mathrm{min}}$ 训练步骤并处理 $E=2E_{\mathrm{min}}$ 数据样本。
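The factors of two follow from the relation (5.1) together with $E=BS$; the short derivation below is a sketch of that step, with the shorthand $b\equiv B/B_{\mathrm{crit}}$ introduced here only for illustration. Substituting $E=BS$ and $E_{\mathrm{min}}=B_{\mathrm{crit}}S_{\mathrm{min}}$ gives $E/E_{\mathrm{min}}=b\,S/S_{\mathrm{min}}$, and solving (5.1) for $S/S_{\mathrm{min}}$ yields

$$
\frac{S}{S_{\mathrm{min}}}=1+\frac{B_{\mathrm{crit}}}{B}=1+\frac{1}{b},\qquad\frac{E}{E_{\mathrm{min}}}=1+\frac{B}{B_{\mathrm{crit}}}=1+b
$$

$$
b=1\;\Longrightarrow\;S=2S_{\mathrm{min}},\qquad E=2E_{\mathrm{min}}
$$

The same expressions show the two extremes: $b\gg1$ drives $S\rightarrow S_{\mathrm{min}}$ at the cost of many examples, while $b\ll1$ drives $E\rightarrow E_{\mathrm{min}}$ at the cost of many steps.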
In Figure 10 we have plotted the critical batch size and gradient noise scale as a function of training loss for two different models. We see that $B_{\mathrm{crit}}(L)$ is independent of model size, and only depends on the loss $L$. So the predictions of [MKAT18] continue to hold for Transformer language models. The critical batch size can be fit with a power-law in the loss
图 10: 我们绘制了两个不同模型的临界批量大小和梯度噪声尺度与训练损失的关系。我们发现 $B_{\mathrm{crit}}(L)$ 与模型大小无关,仅取决于损失 $L$ 。因此,[MKAT18] 的预测对于 Transformer 语言模型仍然成立。临界批量大小可以用损失的幂律进行拟合。
$$
B_{\mathrm{crit}}(L)\approx\frac{B_{*}}{L^{1/\alpha_{B}}}
$$
$$
B_{\mathrm{crit}}(L) \approx \frac{B_{*}}{L^{1/\alpha_{B}}}
$$
where $B_{*}\approx2\times10^{8}$ and $\alpha_{B}\approx0.21$ .
其中 $B_{*} \approx 2 \times 10^{8}$ 和 $\alpha_{B} \approx 0.21$ 。
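The fitted power law can be checked against the statement in the Figure 10 caption that the critical batch size roughly doubles for every 13% drop in loss. This is a small illustrative sketch: the function name is hypothetical, the reference loss of 3.0 is an arbitrary example value, and $B_{*}$ is assumed to be measured in tokens.

```python
B_STAR = 2e8      # fitted constant (assumed to be in tokens)
ALPHA_B = 0.21    # fitted exponent

def critical_batch_size(loss):
    """B_crit(L) = B_* / L ** (1 / alpha_B)."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

# Ratio of critical batch sizes after a 13% decrease in loss. The ratio
# equals 0.87 ** (-1 / alpha_B) and does not depend on the starting loss.
reference_loss = 3.0  # arbitrary example value
growth = critical_batch_size(0.87 * reference_loss) / critical_batch_size(reference_loss)
```

The ratio comes out at roughly 1.94, in line with the "approximately doubles" description.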
We have chosen this parameterization for $B_{\mathrm{crit}}(L)$ because as the loss approaches its minimum value $L_{\mathrm{min}}$, the gradient noise scale is expected to diverge, and we expect $B_{\mathrm{crit}}$ to track this noise scale. We do not know $L_{\mathrm{min}}$, as we see no sign that our models are approaching it, but $L_{\mathrm{min}}>0$ since the entropy of natural language is non-zero. Since apparently $L_{\mathrm{min}}$ is much smaller than the values of $L$ we have achieved, we used a parameterization where $B_{\mathrm{crit}}$ diverges as $L\rightarrow0$.
我们选择了这种参数化方式来表示 $B_{\mathrm{crit}}(L)$,因为当损失接近其最小值 $L_{\mathrm{min}}$ 时,梯度噪声尺度预计会发散,我们希望 $B_{\mathrm{crit}}$ 能够跟踪这一噪声尺度。我们不知道 $L_{\mathrm{min}}$,因为我们没有看到模型接近它的迹象,但 $L_{\mathrm{min}}>0$,因为自然语言的熵是非零的。由于显然 $L_{\mathrm{min}}$ 远小于我们已经实现的 $L$ 值,我们使用了在 $L\rightarrow0$ 时 $B_{\mathrm{crit}}$ 发散的参数化方式。
We will use $B_{\mathrm{crit}}(L)$ to estimate the relation between the number of training steps $S$ while training at batch size $B=2^{19}$ tokens and the number of training steps while training at $B\gg B_{\mathrm{crit}}$ . This is simply
我们将使用 $B_{\mathrm{crit}}(L)$ 来估计在批量大小为 $B=2^{19}$ 个 Token 时的训练步数 $S$ 与在 $B \gg B_{\mathrm{crit}}$ 时的训练步数之间的关系。这很简单。
$$
S_{\mathrm{min}}(S)\equiv\frac{S}{1+B_{\mathrm{crit}}(L)/B}\qquad(\text{minimum steps, at }B\gg B_{\mathrm{crit}})
$$
$$
S_{\mathrm{min}}(S) \equiv \frac{S}{1 + B_{\mathrm{crit}}(L) / B} \qquad (\text{最小步数,当 } B \gg B_{\mathrm{crit}})
$$
for any given target value $L$ for the loss. This also defines a critical value of the compute needed to train to $L$ with a model of size $N$ if we were to train at $B\ll B_{\mathrm{crit}}(L)$ . This is
对于任何给定的损失目标值 $L$ 。这也定义了训练到损失值 $L$ 所需计算量的关键值,对于大小为 $N$ 的模型,如果我们以 $B \ll B_{\mathrm{crit}}(L)$ 进行训练。这是
$$
C_{\mathrm{min}}(C)\equiv\frac{C}{1+B/B_{\mathrm{crit}}(L)}\qquad(\text{minimum compute, at }B\ll B_{\mathrm{crit}})
$$
$$
C_{\mathrm{min}}(C) \equiv \frac{C}{1 + B / B_{\mathrm{crit}}(L)} \qquad (\text{最小计算量,当 } B \ll B_{\mathrm{crit}})
$$
where $C=6N B S$ estimates the (non-embedding) compute used at batch size $B$ .
其中 $C=6N B S$ 估计了在批处理大小为 $B$ 时使用的(非嵌入)计算资源 。
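The two adjustments can be written as one-line helpers. The sketch below uses hypothetical function names and made-up example values for $N$, $S$, and $B_{\mathrm{crit}}$; it also checks numerically that the adjusted quantities satisfy $C_{\mathrm{min}}=6NB_{\mathrm{crit}}S_{\mathrm{min}}$, which follows algebraically from the two definitions and $C=6NBS$.

```python
def s_min(steps, batch, b_crit):
    """Minimum steps to reach the same loss, as if training at B >> B_crit."""
    return steps / (1.0 + b_crit / batch)

def c_min(compute, batch, b_crit):
    """Minimum compute to reach the same loss, as if training at B << B_crit."""
    return compute / (1.0 + batch / b_crit)

# Made-up example values (assumptions, not measurements).
n_params = 1e8
batch = 2 ** 19          # tokens per batch, as used in the text
b_crit = 1e6             # hypothetical critical batch size at the target loss
steps = 1e5

compute = 6 * n_params * batch * steps
adjusted_steps = s_min(steps, batch, b_crit)
adjusted_compute = c_min(compute, batch, b_crit)
```

When `batch == b_crit` both helpers return exactly half of their input, matching the factor-of-two overhead of training at the critical batch size.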
5.2 Results for $L(N,S_{\mathrm{min}})$ and Performance with Model Size and Compute
5.2 $L(N,S_{\mathrm{min}})$ 的结果及模型规模和计算性能的表现
Now we will use $S_{\mathrm{min}}$ defined in Equation (5.4) to obtain a simple and universal fit for the dependence of the loss on model size and training time in the infinite data limit. We will fit the stable, Adam-optimized training runs using Equation (1.6), repeated here for convenience:
现在我们将使用在公式 (5.4) 中定义的 $S_{\mathrm{min}}$ 来获得一个简单且通用的模型大小和训练时间对损失影响的拟合,在无限数据限制下。我们将使用公式 (1.6) 对稳定的、Adam优化的训练运行进行拟合,这里为了方便再次列出该公式:
$$
L(N,S_{\mathrm{min}})=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S_{\mathrm{min}}}\right)^{\alpha_{S}}
$$
$$
L(N,S_{\mathrm{min}})=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S_{\mathrm{min}}}\right)^{\alpha_{S}}
$$
for the loss. We include all training steps after the warmup period of the learning rate schedule, and find a fit to the data with the parameters:
对于损失。我们包括学习率调度的热身期之后的所有训练步骤,并使用以下参数对数据进行拟合:
Figure 11 When we hold either total compute or number of training steps fixed, performance follows $L(N,S)$ from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance. Mediocre fits at small $S$ are unsurprising, as the power-law equation for the learning curves breaks down very early in training.
图 11: 当我们固定总计算量或训练步数时,性能遵循公式 (5.6) 中的 $L(N,S)$ 。每个计算预算值都有一个与之相关的最优模型大小,可以最大化性能。在小 $S$ 时拟合效果较差并不令人意外,因为学习曲线的幂律方程在训练初期就会失效。
Table 3 Fits to $L(N,S)$
表 3: $L(N,S)$ 的拟合结果
Parameter | $\alpha_{N}$ | $\alpha_{S}$ | $N_{c}$ | $S_{c}$ |
---|---|---|---|---|
Value | 0.077 | 0.76 | $6.5\times10^{13}$ | $2.1\times10^{3}$ |
With these parameters, we obtain the learning curve fits in Figure 4. Though the fits are imperfect, we believe they are quite compelling given the simplicity of Equation (5.6).
使用这些参数,我们在图 4 中获得了学习曲线的拟合。尽管拟合并不完美,但我们认为鉴于公式 (5.6) 的简单性,它们是非常有说服力的。
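For reference, the fitted learning-curve form can be evaluated as below. This is a minimal sketch with a hypothetical function name, using the Table 3 values; `math.inf` stands in for the fully converged limit, and the example model size is arbitrary.

```python
import math

# Table 3 values.
ALPHA_N, ALPHA_S = 0.077, 0.76
N_C, S_C = 6.5e13, 2.1e3

def loss_ns(n_params, s_min_steps):
    """L(N, S_min) = (N_c / N) ** alpha_N + (S_c / S_min) ** alpha_S."""
    return (N_C / n_params) ** ALPHA_N + (S_C / s_min_steps) ** ALPHA_S

# A learning curve for one (arbitrary) model size, sampled at a few step counts.
curve = [loss_ns(1e8, s) for s in (1e3, 1e4, 1e5)]
converged = loss_ns(1e8, math.inf)
```

The second term vanishes as $S_{\mathrm{min}}\rightarrow\infty$, so the curve decays toward the model-size-limited loss $(N_{c}/N)^{\alpha_{N}}$.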
The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we study the test loss as a function of model size while fixing either the total non-embedding compute $C$ used in training, or the number of steps $S$. For the fits we use Equations (5.4) and (5.5) along with the parameters above and Equation (5.6).
数据和拟合结果可以以不同且更有趣的方式可视化,如图 11 所示。在那里,我们研究了在训练过程中固定总非嵌入计算量 $C$ 或步数 $S$ 时,测试损失随模型大小的变化情况。对于拟合,我们使用公式 (5.5) 和 (5.4) 以及上述参数和公式 (5.6)。
The power-law dependence of the loss on $S_{\mathrm{min}}$ reflects the interplay of optimizer dynamics and the loss landscape. Since the fits are best late in training, when the loss may be approximately quadratic, the power law should provide information about the spectrum of the Hessian of the loss. Its universality suggests that the Hessian eigenvalue density is roughly independent of model size.
损失对 $S_{\mathrm{min}}$ 的幂律依赖反映了优化器动态和损失景观之间的相互作用。由于拟合在训练后期最佳,此时损失可能是近似二次的,因此幂律应提供关于损失的 Hessian 矩阵谱的信息。其普遍性表明 Hessian 特征值密度大致与模型大小无关。
5.3 Lower Bound on Early Stopping Step
5.3 提前停止步数的下界
The results for $L(N,S_{\mathrm{min}})$ can be used to derive a lower bound (and rough estimate) of the step at which early stopping should occur when training is data limited. It is motivated by the idea that finite and infinite $D$ learning curves for a given model will be very similar until we reach $S_{\mathrm{min}}\approx S_{\mathrm{stop}}$. Thus overfitting should be proportional to the correction from simply ending training at $S_{\mathrm{stop}}$. This will underestimate $S_{\mathrm{stop}}$, because in reality the test loss will decrease more slowly when we have a finite $D$, and therefore we will require more training steps to reach the optimal test loss at finite $D$. This line of reasoning leads to the inequality
$L(N,S_{\mathrm{min}})$ 的结果可以用于推导在数据有限的训练中应提前停止的步数的下界(和粗略估计)。其动机是基于这样一个想法:对于给定模型,有限和无限 $D$ 的学习曲线将非常相似,直到我们达到 $S_{\mathrm{min}} \approx S_{\mathrm{stop}}$ 。因此,过拟合应该与简单地在 $S_{\mathrm{stop}}$ 结束训练的修正成正比。这将低估 $S_{\mathrm{stop}}$ ,因为在现实中,当我们有一个有限的 $D$ 时,测试损失将下降得更慢,因此我们需要更多的训练步骤来达到有限 $D$ 下的最优测试损失。这一推理导致了以下不等式
$$
S_{\mathrm{stop}}(N,D)\gtrsim\frac{S_{c}}{[L(N,D)-L(N,\infty)]^{1/\alpha_{S}}}
$$
$$
S_{\mathrm{stop}}(N,D) \gtrsim \frac{S_{c}}{[L(N,D) - L(N, \infty)]^{1 / \alpha_{S}}}
$$
where $L(N,\infty)$ is the converged loss, evaluated with infinite available data. This inequality and its comparison to the empirical data is displayed in Figure 16 in the appendix. In that figure, the values of $S_{\mathrm{stop}}$ and $L(N,D)$ are empirical (though $S_{\mathrm{stop}}$ is adjusted to mimic training at $B\gg B_{\mathrm{crit}}$), while $L(N,\infty)$ is computed from the fit to $L(N,D)$ evaluated at $D=\infty$.
其中 $L(N,\infty)$ 是收敛损失,使用无限数据评估。此不等式及其与经验数据的比较在附录中的图 16 中显示。在该图中,$S_{\mathrm{stop}}$ 和 $L(N,D)$ 的值是经验性的(尽管 $S_{\mathrm{stop}}$ 调整为模拟在 $B \gg B_{\mathrm{crit}}$ 下的训练),而 $L(N,\infty)$ 是从对 $L(N,D)$ 的拟合计算得出,并在 $D=\infty$ 处评估。
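The bound can be evaluated once the two losses are known. The sketch below uses a hypothetical function name and the Table 3 values for $S_{c}$ and $\alpha_{S}$; the loss values passed in are made-up inputs, and returning `math.inf` when there is no overfitting gap is a convention chosen here, not something stated in the text.

```python
import math

# Table 3 values for the step-dependent term.
ALPHA_S = 0.76
S_C = 2.1e3

def s_stop_lower_bound(loss_finite_d, loss_infinite_d):
    """Lower bound on the early-stopping step from the overfitting gap.

    S_stop >~ S_c / (L(N, D) - L(N, inf)) ** (1 / alpha_S).
    Returns math.inf when the gap is not positive (no overfitting signal),
    which is an assumption made for this sketch.
    """
    gap = loss_finite_d - loss_infinite_d
    if gap <= 0:
        return math.inf
    return S_C / gap ** (1.0 / ALPHA_S)

# Made-up example: overfitting gaps of 0.1 and 0.01 in the loss.
bound_large_gap = s_stop_lower_bound(3.1, 3.0)
bound_small_gap = s_stop_lower_bound(3.01, 3.0)
```

A smaller overfitting gap pushes the bound later, reflecting that a nearly data-unlimited run can usefully train for longer before stopping.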
6 Optimal Allocation of the Compute Budget
6 计算预算的最优分配
We displayed the empirical trend of performance as a function of the computation used during training in the top-right of Figure 1. However, this result involved training at a fixed batch size $B$, whereas we know that in fact we could train more efficiently by training at the batch size $B_{\mathrm{crit}}$ discussed in Section 5.1. Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively, and correcting for this inefficiency by standardizing to the critical batch size results in cleaner and more predictable trends.
我们在图 1 的右上角展示了性能的经验趋势,该趋势是训练期间使用的计算量的函数。然而,这一结果是在固定的批量大小 $B$ 下进行训练得出的,而我们知道实际上可以通过在第 5.1 节讨论的关键批量大小 $B_{\mathrm{crit}}$ 下进行训练来更高效地训练。较大和较小的损失值本可以分别用更少的样本或更少的步骤实现,通过标准化到关键批量大小来纠正这种低效性,可以得到更清晰和更可预测的趋势。
Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than the compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.
图 12 左:给定固定的计算预算,特定的模型大小是最优的,尽管稍大或稍小的模型也可以用最小的额外计算资源进行训练。右:大于计算效率最优大小的模型需要更少的训练步骤,如果可以提供足够的额外并行性,则可能实现更快的训练。注意,此公式不应应用于非常大的模型,因为它仅在学习曲线的幂律区域有效,在初始瞬态效应之后 [20]。
Figure 13 When adjusting performance to simulate training far below the critical batch size, we find a somewhat altered power law for $L(C_{\mathrm{min}})$ when compared with the fully empirical results. The conspicuous lump at $10^{-5}$ PF-days marks the transition from 1-layer to 2-layer networks; we exclude 1-layer networks in the power-law fits. It is the $L(C_{\mathrm{min}})$ trend that we expect to provide a reliable extrapolation for larger compute.
图 13: 在调整性能以模拟远低于临界批量大小的训练时,我们发现与完全经验结果相比,$L(C_{\mathrm{min}})$ 的幂律有所改变。明显的 $10^{-5}$ PF天的突起标志着从单层网络到双层网络的过渡;我们在幂律拟合中排除了单层网络。我们期望 $L(C_{\mathrm{min}})$ 的趋势能够为更大的计算量提供可靠的外推。
In this section we will adjust for this oversight. More importantly, we will use the results of Section 5 to determine the optimal allocation of compute between model size $N$ and the quantity of data processed during training, namely $2B_{\mathrm{crit}}S_{\mathrm{min}}$ . We will determine this allocation both empirically and theoretically, by using the equation for $L(N,S_{\mathrm{min}})$ , and we will demonstrate that these methods agree.
在本节中,我们将对此疏忽进行调整。更重要的是,我们将使用第 5 节的结果来确定模型大小 $N$ 和训练期间处理的数据量 $2B_{\mathrm{crit}}S_{\mathrm{min}}$ 之间的计算资源最优分配。我们将通过经验方法和理论方法来确定这种分配,即使用 $L(N,S_{\mathrm{min}})$ 的公式,并将证明这些方法是一致的。
6.1 Optimal Performance and Allocations
6.1 最优性能和分配
Let us first study the loss as a function of the optimally allocated compute from Equation (5.5). The result is plotted in Figure 13, along with a power-law fit. We see that as compared to the compute plot of Figure 1, the new fit with $C_{\mathrm{min}}$ is somewhat improved.
让我们首先研究损失作为最优分配计算资源的函数,从公式 (5.5) 出发。结果如图 13 所示,并附有幂律拟合。我们发现,与图 1 的计算资源图相比,新的拟合使用 $C_{\mathrm{min}}$ 有所改进。
Given $L(C_{\mathrm{min}})$ , it is natural to ask for the optimal model size $N(C_{\mathrm{min}})$ that provides the minimal loss with a given quantity of training compute. The optimal model size is shown in Figure 14. We observe that $N(C_{\mathrm{min}})$
给定 $L(C_{\mathrm{min}})$ ,很自然地会问在给定的训练计算量下,提供最小损失的最优模型大小 $N(C_{\mathrm{min}})$ 是多少。最优模型大小如图 14 所示。我们观察到 $N(C_{\mathrm{min}})$
Figure 14 Left: Each value of the compute budget $C_{\mathrm{min}}$ has an associated optimal model size $N$. Optimal model size grows very rapidly with $C_{\mathrm{min}}$, increasing by $5\times$ for each $10\times$ increase in compute. The number of data examples processed makes up the remainder of the increase, growing relatively modestly by only $2\times$. Right: The batch-adjusted number of optimization steps also grows very slowly, if at all, meaning that most of the growth in data examples processed can be used for increased batch sizes.
图 14 左:每个计算预算值 $C_{\mathrm{min}}$ 都有一个相应的最优模型大小 $N$ 。最优模型大小随着 $C_{\mathrm{min}}$ 的增加而迅速增长,每次计算量增加 $10\times$ 时,模型大小增加 5 倍。处理的数据样本数量构成了剩余的增长部分,增长相对温和,仅增加了 $2\times$ 。右:调整后的优化步骤数量增长非常缓慢,甚至几乎没有增长,这意味着大部分用于处理的数据样本的增长可以用于增加批次大小。
can be fit very well with a power-law
可以用幂律非常很好地拟合
$$
N(C_{\mathrm{min}})\propto(C_{\mathrm{min}})^{0.73}.
$$
$$
N(C_{\mathrm{min}}) \propto (C_{\mathrm{min}})^{0.73} .
$$
In Figure 12, we show the effect of training models of sub-optimal sizes (see Appendix B.4).
图 12: 我们展示了训练次优规模模型的效果 (见附录 B.4)。
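The fitted exponent translates directly into the "5x model, 2x data" rule quoted in the Figure 14 caption. The sketch below (hypothetical function name) assumes that compute is proportional to model size times the number of data examples processed, as in $C=6NBS$ with $E=BS$, so the data exponent is taken to be $1-0.73$.

```python
def growth_factors(compute_multiplier, n_exponent=0.73):
    """Split a compute increase between model size and data processed.

    Assumes C is proportional to N * E, so that if N grows as C ** n_exponent
    then the data processed E grows as C ** (1 - n_exponent).
    """
    model_growth = compute_multiplier ** n_exponent
    data_growth = compute_multiplier ** (1.0 - n_exponent)
    return model_growth, data_growth

# A tenfold increase in compute.
model_x, data_x = growth_factors(10.0)
```

For a tenfold compute increase this gives a model roughly 5.4 times larger trained on roughly 1.9 times more data, matching the rounded figures in the caption.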
By definition $C_{\mathrm{min}}\equiv6N B_{\mathrm{crit}}S$ , and so we can use $N(C_{\mathrm{min}}