Scaling Laws for Neural Language Models
Jared Kaplan∗ (Johns Hopkins University, OpenAI), jaredk@jhu.edu
Sam McCandlish∗ (OpenAI), sam@openai.com
| Tom Henighan | Tom B. Brown | Benjamin Chess | Rewon Child |
|---|---|---|---|
| OpenAI | OpenAI | OpenAI | OpenAI |
| henighan@openai.com | tom@openai.com | bchess@openai.com | rewon@openai.com |
| Scott Gray | Alec Radford | Jeffrey Wu | Dario Amodei |
| OpenAI | OpenAI | OpenAI | OpenAI |
| scott@openai.com | alec@openai.com | jeffwu@openai.com | damodei@openai.com |
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
1 Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world’s text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models [RNSS18, DCLT18, YDY+19, LOG+19, RSR+19] approaching human-level performance on many specific tasks [WPN+19], including the composition of coherent multi-paragraph prompted text samples [RWC+19].
One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [VSP+17, LSP+18]. The high ceiling and low floor for performance on language tasks allow us to study trends over more than seven orders of magnitude in scale.
Throughout we will observe precise power-law scalings for performance as a function of training time, context length, dataset size, model size, and compute budget.
1.1 Summary
Our key findings for Transformer language models are as follows:

Figure 1 Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters $N$ (excluding embeddings), the size of the dataset $D$, and the amount of compute $C$ used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors $N$, $D$, $C$ when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)
Universality of overfitting: Performance improves predictably as long as we scale up $N$ and $D$ in tandem, but enters a regime of diminishing returns if either $N$ or $D$ is held fixed while the other increases. The performance penalty depends predictably on the ratio $N^{0.74}/D$, meaning that every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty. (Section 4)
Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)
Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss – in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)
Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).
Convergence is inefficient: When working within a fixed compute budget $C$ but without any other restrictions on the model size $N$ or available data $D$, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as $D\sim C^{0.27}$ with training compute. (Section 6)
Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.

Figure 2 We show a series of language model training runs, with models ranging in size from $10^{3}$ to $10^{9}$ parameters (excluding embeddings).

Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger models, using larger batches, and training for more steps. We illustrate this for a billion-fold increase in compute. For optimally compute-efficient training, most of the increase should go towards increased model size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to increase parallelism through larger batch sizes, with only a very small increase in serial training time required.

1.2 Summary of Scaling Laws
The test loss of a Transformer trained to autoregressively model language can be predicted using a power-law when performance is limited by only one of the number of non-embedding parameters $N$, the dataset size $D$, or the optimally allocated compute budget $C_{\mathrm{min}}$ (see Figure 1):
- For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
$$
L(N)=\left(N_{\mathrm{c}}/N\right)^{\alpha_{N}};\quad \alpha_{N}\sim0.076,\quad N_{\mathrm{c}}\sim8.8\times10^{13}~(\text{non-embedding parameters}) \tag{1.1}
$$
- For large models trained with a limited dataset with early stopping:
$$
L(D)=\left(D_{\mathrm{c}}/D\right)^{\alpha_{D}};\quad \alpha_{D}\sim0.095,\quad D_{\mathrm{c}}\sim5.4\times10^{13}~(\text{tokens}) \tag{1.2}
$$
- When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal$^{3}$ use of compute):
$$
L(C_{\mathrm{min}})=\left(C_{\mathrm{c}}^{\mathrm{min}}/C_{\mathrm{min}}\right)^{\alpha_{C}^{\mathrm{min}}};\quad \alpha_{C}^{\mathrm{min}}\sim0.050,\quad C_{\mathrm{c}}^{\mathrm{min}}\sim3.1\times10^{8}~(\text{PF-days}) \tag{1.3}
$$
$^{3}$We also observe an empirical power-law trend with the training compute $C$ (Figure 1) while training at fixed batch size, but it is the trend with $C_{\mathrm{min}}$ that should be used to make predictions. They are related by Equation (5.5).

Figure 4 Left: The early-stopped test loss $L(N,D)$ varies predictably with the dataset size $D$ and model size $N$ according to Equation (1.5). Right: After an initial transient period, learning curves for all model sizes $N$ can be fit with Equation (1.6), which is parameterized in terms of $S_{\mathrm{min}}$ , the number of steps when training at large batch size (details in Section 5.1).
These relations hold across eight orders of magnitude in $C_{\mathrm{min}}$, six orders of magnitude in $N$, and over two orders of magnitude in $D$. They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads), with specific numerical values associated with the WebText2 training set [RWC+19]. The power-law exponents $\alpha_{N},\alpha_{D},\alpha_{C}^{\mathrm{min}}$ specify the degree of performance improvement expected as we scale up $N$, $D$, or $C_{\mathrm{min}}$; for example, doubling the number of parameters yields a loss that is smaller by a factor $2^{-\alpha_{N}}\approx0.95$. The precise numerical values of $N_{\mathrm{c}}$, $C_{\mathrm{c}}^{\mathrm{min}}$, and $D_{\mathrm{c}}$ depend on the vocabulary size and tokenization and hence do not have a fundamental meaning.
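To make the "doubling" arithmetic concrete, here is a minimal Python sketch of the three single-variable laws. The constants are the approximate fits quoted above; the helper names (`power_law_loss`, `improvement_factor`) are illustrative and not taken from the paper.

```python
# Approximate WebText2 fits quoted in the text; treat them as rough values.
ALPHA_N, N_C = 0.076, 8.8e13          # exponent, scale (non-embedding parameters)
ALPHA_D, D_C = 0.095, 5.4e13          # exponent, scale (tokens)
ALPHA_C_MIN, C_C_MIN = 0.050, 3.1e8   # exponent, scale (PF-days)


def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Single-variable scaling law L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


def improvement_factor(scale_up: float, alpha: float) -> float:
    """Multiplicative change in loss when the limiting resource grows by scale_up."""
    return scale_up ** (-alpha)


# Doubling each resource in turn, assuming it is the only bottleneck.
doubling_n = improvement_factor(2.0, ALPHA_N)
doubling_d = improvement_factor(2.0, ALPHA_D)
doubling_c = improvement_factor(2.0, ALPHA_C_MIN)

# The ratio of two predicted losses does not depend on the scale constant N_c.
ratio = power_law_loss(2e9, N_C, ALPHA_N) / power_law_loss(1e9, N_C, ALPHA_N)
print(f"N x2: {doubling_n:.3f}  D x2: {doubling_d:.3f}  C_min x2: {doubling_c:.3f}")
```

Because only ratios of losses enter `improvement_factor`, the scale constants drop out, which is one way to see why their precise values carry no fundamental meaning.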
The critical batch size, which determines the speed/efficiency tradeoff for data parallelism ([MKAT18]), also roughly obeys a power law in $L$ :
$$
B_{\mathrm{crit}}(L)=\frac{B_{*}}{L^{1/\alpha_{B}}},\qquad B_{*}\sim2\cdot10^{8}~\text{tokens},\quad \alpha_{B}\sim0.21 \tag{1.4}
$$
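As a rough illustration of how steeply the critical batch size grows as the loss falls, the relation can be evaluated directly. This is a sketch only: `B_STAR` stands for the prefactor in the equation above, and the loss values in the loop are hypothetical inputs chosen for illustration.

```python
B_STAR = 2e8     # tokens; prefactor of the critical-batch-size fit quoted above
ALPHA_B = 0.21   # exponent of the same fit


def critical_batch_size(loss: float) -> float:
    """B_crit(L) = B_STAR / L ** (1 / ALPHA_B), in tokens."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


# Hypothetical loss values, used only to show the trend.
for loss in (4.0, 3.0, 2.5):
    print(f"L = {loss:.1f} -> B_crit ~ {critical_batch_size(loss):.2e} tokens")
```

Since $1/\alpha_{B}$ is close to 5, a modest drop in loss translates into a several-fold increase in the critical batch size.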
Equations (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to $D\propto N^{\frac{\alpha_{N}}{\alpha_{D}}}\sim N^{0.74}$. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on $N$ and $D$ and the degree of overfitting:
$$
L(N,D)=\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}} \tag{1.5}
$$
with fits pictured on the left in Figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.
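The combined law can be checked numerically against its two limits. The sketch below is illustrative only: it reuses the rounded single-variable constants quoted earlier, and the function names are assumptions. Note that those rounded exponents give $0.076/0.095 = 0.8$ rather than the quoted 0.74, so the 0.74 default in `data_multiplier` is simply the value stated in the text, not something derived here.

```python
ALPHA_N, N_C = 0.076, 8.8e13   # rounded fits quoted earlier in the text
ALPHA_D, D_C = 0.095, 5.4e13


def loss_nd(n: float, d: float) -> float:
    """Combined law L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D


def data_multiplier(model_multiplier: float, exponent: float = 0.74) -> float:
    """Data growth that keeps N**exponent / D fixed when N grows.

    The default exponent is the 0.74 quoted in the text (an assumption here).
    """
    return model_multiplier ** exponent


INF = float("inf")
n, d = 1e9, 1e10  # hypothetical model and dataset sizes

# With unlimited data the law reduces to L(N); with an unlimited model, to L(D).
print(loss_nd(n, INF), (N_C / n) ** ALPHA_N)
print(loss_nd(INF, d), (D_C / d) ** ALPHA_D)

# An 8x larger model needs roughly 5x more data to avoid an overfitting penalty.
print(f"8x model -> {data_multiplier(8.0):.2f}x data")
```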
When training a given model for a finite number of parameter update steps $S$ in the infinite data limit, after an initial transient period, the learning curves can be accurately fit by (see the right of Figure 4)
$$
L(N,S)=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S_{\mathrm{min}}(S)}\right)^{\alpha_{S}} \tag{1.6}
$$
where $S_{c}\approx2.1\times10^{3}$ and $\alpha_{S}\approx0.76$ , and $S_{\mathrm{min}}(S)$ is the minimum possible number of optimization steps (parameter updates) estimated using Equation (5.4).
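One way to read this learning-curve fit is to ask how many steps are needed before the step-dependent term falls below a chosen gap. The sketch below inverts that term. It assumes the fit holds past the initial transient and takes $S_{\mathrm{min}}$ as given rather than computing it from Equation (5.4); the function names and the model size are illustrative.

```python
ALPHA_N, N_C = 0.076, 8.8e13   # rounded fit quoted earlier in the text
ALPHA_S, S_C = 0.76, 2.1e3     # values quoted just above


def loss_ns(n: float, s_min: float) -> float:
    """Learning-curve fit L(N, S) = (N_c/N)^alpha_N + (S_c/S_min)^alpha_S."""
    return (N_C / n) ** ALPHA_N + (S_C / s_min) ** ALPHA_S


def steps_for_gap(gap: float) -> float:
    """S_min at which the step-dependent term (S_c/S_min)^alpha_S equals gap."""
    return S_C * gap ** (-1.0 / ALPHA_S)


n = 1e8                              # hypothetical model size
converged = (N_C / n) ** ALPHA_N     # loss in the limit of unlimited steps
for gap in (0.2, 0.1, 0.05):
    s = steps_for_gap(gap)
    print(f"gap {gap:.2f}: S_min ~ {s:,.0f}, predicted loss {loss_ns(n, s):.3f}")
```

In this form the number of steps needed for a given gap does not depend on $N$, which mirrors the claim that the curve parameters are roughly independent of model size.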
When training within a fixed compute budget $C$ , but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size $N$ , optimal batch size $B$ , optimal number of steps $S$ , and dataset size $D$ should grow as
$$
N\propto C^{\alpha_{C}^{\mathrm{min}}/\alpha_{N}},\quad B\propto C^{\alpha_{C}^{\mathrm{min}}/\alpha_{B}},\quad S\propto C^{\alpha_{C}^{\mathrm{min}}/\alpha_{S}},\quad D=B\cdot S \tag{1.7}
$$
with
$$
\alpha_{C}^{\mathrm{min}}=1/\left(1/\alpha_{S}+1/\alpha_{B}+1/\alpha_{N}\right) \tag{1.8}
$$
which closely matches the empirically optimal results $N\propto C_{\mathrm{min}}^{0.73}$, $B\propto C_{\mathrm{min}}^{0.24}$, and $S\propto C_{\mathrm{min}}^{0.03}$. As the computational budget $C$ increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size (see Figure 3). This also implies that as models grow larger, they become increasingly sample efficient. In practice, researchers typically train smaller models for longer than would be maximally compute-efficient because of hardware constraints. Optimal performance depends on total compute as a power law (see Equation (1.3)).
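The predicted allocation exponents follow from simple arithmetic on the three fitted exponents. The sketch below uses the rounded values quoted in this section, so its output should be read as approximate; the variable names are illustrative.

```python
ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76   # rounded values quoted above


def compute_exponent(alpha_s: float, alpha_b: float, alpha_n: float) -> float:
    """alpha_C^min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)."""
    return 1.0 / (1.0 / alpha_s + 1.0 / alpha_b + 1.0 / alpha_n)


alpha_c_min = compute_exponent(ALPHA_S, ALPHA_B, ALPHA_N)

# Growth exponents of each quantity with compute: N ~ C^p_N, B ~ C^p_B, S ~ C^p_S.
growth = {
    "N": alpha_c_min / ALPHA_N,
    "B": alpha_c_min / ALPHA_B,
    "S": alpha_c_min / ALPHA_S,
}
growth["D"] = growth["B"] + growth["S"]   # because D = B * S

for name, p in growth.items():
    print(f"{name} grows like C^{p:.2f}")
```

With these rounded inputs the exponents come out near 0.68, 0.25, and 0.07 for $N$, $B$, and $S$, in the same range as the empirical 0.73, 0.24, and 0.03. The three sum to one by construction, as expected if compute is proportional to $N\cdot B\cdot S$.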
We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers $[\mathrm{DGV}^{+}18]$ .
1.3 Notation
We use the following notation:
2 Background and Methods
We train language models on WebText2, an extended version of the WebText [RWC+19] dataset, tokenized using byte-pair encoding [SHB15] with a vocabulary size $n_{\mathrm{vocab}}=50257$. We optimize the autoregressive log-likelihood (i.e. cross-entropy loss) averaged over a 1024-token context, which is also our principal performance metric. We record the loss on the WebText2 test distribution and on a selection of other text distributions. We primarily train decoder-only [LSP+18, RNSS18] Transformer [VSP+17] models, though we also train LSTM models and Universal Transformers [DGV+18] for comparison.
2.1 Parameter and Compute Scaling of Transformers
We parameterize the Transformer architecture using hyperparameters $n_{\mathrm{layer}}$ (number of layers), $d_{\mathrm{model}}$ (dimension of the residual stream), $d_{\mathrm{ff}}$ (dimension of the intermediate feed-forward layer), $d_{\mathrm{attn}}$ (dimension of the attention output), and $n_{\mathrm{heads}}$ (number of attention heads per layer). We include $n_{\mathrm{ctx}}$ tokens in the input context, with $n_{\mathrm{ctx}}=1024$ except where otherwise noted.
We use $N$ to denote the model size, which we define as the number of non-embedding parameters
$$
\begin{aligned}
N &\approx 2 d_{\mathrm{model}} n_{\mathrm{layer}} \left(2 d_{\mathrm{attn}} + d_{\mathrm{ff}}\right) \\
&= 12 n_{\mathrm{layer}} d_{\mathrm{model}}^{2} \quad \text{with the standard} \quad d_{\mathrm{attn}} = d_{\mathrm{ff}}/4 = d_{\mathrm{model}}
\end{aligned}
$$
where we have excluded biases and other sub-leading terms. Our models also have $n_{\mathrm{vocab}}d_{\mathrm{model}}$ parameters in an embedding matrix, and use $n_{\mathrm{ctx}}d_{\mathrm{model}}$ parameters for positional embeddings, but we do not include these when discussing the ‘model size’ $N$ ; we will see that this produces significantly cleaner scaling laws.
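The parameter count is easy to reproduce. The following sketch implements the approximation above, ignoring biases and other sub-leading terms as the text does, together with the embedding count that is excluded from $N$. The function names are illustrative, and the (48, 1600) shape is used only as an example input.

```python
from typing import Optional


def non_embedding_params(n_layer: int, d_model: int,
                         d_attn: Optional[int] = None,
                         d_ff: Optional[int] = None) -> int:
    """N ~ 2 * d_model * n_layer * (2 * d_attn + d_ff), without biases."""
    d_attn = d_model if d_attn is None else d_attn   # standard: d_attn = d_model
    d_ff = 4 * d_model if d_ff is None else d_ff     # standard: d_ff = 4 * d_model
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


def embedding_params(d_model: int, n_vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Token plus positional embedding parameters, excluded from N."""
    return (n_vocab + n_ctx) * d_model


n = non_embedding_params(n_layer=48, d_model=1600)
print(f"N = {n:,} non-embedding parameters")
print(f"embeddings = {embedding_params(1600):,} parameters")
```

With the standard choices the general expression reduces to $12\,n_{\mathrm{layer}}d_{\mathrm{model}}^{2}$, which for this shape is about 1.5 billion parameters.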
Evaluating a forward pass of the Transformer involves roughly
$$
C_{\mathrm{forward}}\approx2N+2n_{\mathrm{layer}}n_{\mathrm{ctx}}d_{\mathrm{model}}
$$
add-multiply operations, where the factor of two comes from the multiply-accumulate operation used in matrix multiplication. A more detailed per-operation parameter and compute count is included in Table 1.
Table 1 Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading terms such as nonlinearities, biases, and layer normalization are omitted.
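| Operation | Parameters | FLOPs per Token |
|---|---|---|
| Embed | $(n_{\mathrm{vocab}}+n_{\mathrm{ctx}})\,d_{\mathrm{model}}$ | $4d_{\mathrm{model}}$ |
| Attention: QKV | $n_{\mathrm{layer}}\,d_{\mathrm{model}}\,3d_{\mathrm{attn}}$ | $2n_{\mathrm{layer}}\,d_{\mathrm{model}}\,3d_{\mathrm{attn}}$ |
| Attention: Mask | - | $2n_{\mathrm{layer}}\,n_{\mathrm{ctx}}\,d_{\mathrm{attn}}$ |
| Attention: Project | $n_{\mathrm{layer}}\,d_{\mathrm{attn}}\,d_{\mathrm{model}}$ | $2n_{\mathrm{layer}}\,d_{\mathrm{attn}}\,d_{\mathrm{embd}}$ |
| Feedforward | $n_{\mathrm{layer}}\,2d_{\mathrm{model}}\,d_{\mathrm{ff}}$ | $2n_{\mathrm{layer}}\,2d_{\mathrm{model}}\,d_{\mathrm{ff}}$ |
| De-embed | - | $2d_{\mathrm{model}}\,n_{\mathrm{vocab}}$ |
| Total (Non-Embedding) | $N=2d_{\mathrm{model}}\,n_{\mathrm{layer}}\,(2d_{\mathrm{attn}}+d_{\mathrm{ff}})$ | $C_{\mathrm{forward}}=2N+2n_{\mathrm{layer}}\,n_{\mathrm{ctx}}\,d_{\mathrm{attn}}$ |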
For contexts and models with $d_{\mathrm{model}}>n_{\mathrm{ctx}}/12$, the context-dependent computational cost per token is a relatively small fraction of the total compute. Since we primarily study models where $d_{\mathrm{model}}\gg n_{\mathrm{ctx}}/12$, we do not include context-dependent terms in our training compute estimate. Accounting for the backwards pass (approximately twice the compute of the forwards pass), we then define the estimated non-embedding compute as $C\approx6N$ floating point operations per training token.
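The $n_{\mathrm{ctx}}/12$ threshold can be seen by comparing the two terms of the forward-pass estimate: with $N=12\,n_{\mathrm{layer}}d_{\mathrm{model}}^{2}$, the context term divided by $2N$ is exactly $n_{\mathrm{ctx}}/(12\,d_{\mathrm{model}})$. The sketch below checks this and converts a training run to PF-days. It assumes one PF-day means $10^{15}$ operations per second sustained for 24 hours; the token count is a hypothetical input and the function names are illustrative.

```python
PF_DAY_FLOPS = 1e15 * 24 * 3600   # assumed definition: 8.64e19 operations


def context_to_param_ratio(n_layer: int, d_model: int, n_ctx: int) -> float:
    """Ratio of the context term 2*n_layer*n_ctx*d_model to the 2N term."""
    n = 12 * n_layer * d_model ** 2            # standard-shape parameter count
    return (2 * n_layer * n_ctx * d_model) / (2 * n)


def training_pf_days(n_params: float, n_tokens: float) -> float:
    """Training compute C ~ 6 * N * tokens, expressed in PF-days."""
    return 6.0 * n_params * n_tokens / PF_DAY_FLOPS


ratio = context_to_param_ratio(n_layer=48, d_model=1600, n_ctx=1024)
print(f"context term / 2N = {ratio:.4f}")

# Hypothetical run: 1e9 non-embedding parameters trained on 1e10 tokens.
print(f"{training_pf_days(1e9, 1e10):.3f} PF-days")
```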
2.2 Training Procedures
Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed $2.5\times10^{5}$ steps with a batch size of 512 sequences of 1024 tokens. Due to memory constraints, our largest models (more than 1B parameters) were trained with Adafactor [SS18]. We experimented with a variety of learning rates and schedules, as discussed in Appendix D.6. We found that results at convergence were largely independent of learning rate schedule. Unless otherwise noted, all training runs included in our data used a learning rate schedule with a 3000 step linear warmup followed by a cosine decay to zero.
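The default schedule described above can be sketched as a function of the step index. This is a minimal illustration, not the authors' implementation: the peak learning rate is a hypothetical parameter, and the cosine decay is assumed to run from the end of warmup to the final step.

```python
import math


def learning_rate(step: int, peak_lr: float,
                  warmup_steps: int = 3000, total_steps: int = 250_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


PEAK = 1e-3  # hypothetical value, chosen only for illustration
for step in (0, 1500, 3000, 126_500, 250_000):
    print(step, f"{learning_rate(step, PEAK):.2e}")
```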
2.3 Datasets
We train our models on an extended version of the WebText dataset described in [RWC+19]. The original WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at least 3 karma. In the second version, WebText2, we added outbound Reddit links from the period of January to October 2018, also with a minimum of 3 karma. The karma threshold served as a heuristic for whether people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k Python library. In total, the dataset consists of 20.3M documents containing 96 GB of text and $1.62\times10^{10}$ words (as defined by wc). We then apply the reversible tokenizer described in [RWC+19], which yields $2.29\times10^{10}$ tokens. We reserve $6.6\times10^{8}$ of these tokens for use as a test set, and we also test on similarly-prepared samples of Books Corpus [ZKZ+15], Common Crawl [Fou], English Wikipedia, and a collection of publicly-available Internet Books.
3 Empirical Results and Basic Power Laws
To characterize language model scaling we train a wide variety of models, varying a number of factors including:
- Model size (ranging in size from 768 to 1.5 billion non-embedding parameters)
- Dataset size (ranging from 22 million to 23 billion tokens)
- Shape (including depth, width, attention heads, and feed-forward dimension)
- Context length (1024 for most runs, though we also experiment with shorter contexts)
- Batch size ($2^{19}$ for most runs, but we also vary it to measure the critical batch size)

Figure 5 Performance depends very mildly on model shape when the total number of non-embedding parameters $N$ is held fixed. The loss varies only a few percent over a wide range of shapes. Small differences in parameter counts are compensated for by using the fit to $L(N)$ as a baseline. Aspect ratio in particular can vary by a factor of 40 while only slightly impacting performance; an $(n_{\mathrm{layer}},d_{\mathrm{model}})=(6,4288)$ model reaches a loss within 3% of the (48, 1600) model used in [RWC+19].

Figure 6 Left: When we include embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When we exclude embedding parameters, the performance of models with different depths converges to a single trend. Only models with fewer than 2 layers or with extreme depth-to-width ratios deviate significantly from the trend.
In this section we will display data along with empirically-motivated fits, deferring theoretical analysis to later sections.
3.1 Approximate Transformer Shape and Hyperparameter Independence
Transformer performance depends very weakly on the shape parameters $n_{\mathrm{layer}}$, $n_{\mathrm{heads}}$, and $d_{\mathrm{ff}}$ when we hold the total non-embedding parameter count $N$ fixed. To establish these results we trained models with fixed size while varying a single hyperparameter. This was simplest for the case of $n_{\mathrm{heads}}$. When varying $n_{\mathrm{layer}}$, we simultaneously varied $d_{\mathrm{model}}$ while keeping $N\approx12n_{\mathrm{layer}}d_{\mathrm{model}}^{2}$ fixed. Similarly, to vary $d_{\mathrm{ff}}$ at fixed model size we also simultaneously varied the $d_{\mathrm{model}}$ parameter, as required by the parameter counts in Table 1. Independence of $n_{\mathrm{layer}}$ would follow if deeper Transformers effectively behave as ensembles of shallower models, as has been suggested for ResNets [VWB16]. The results are shown in Figure 5.
3.2 Performance with Non-Embedding Parameter Count $N$
In Figure 6 we display the performance of a wide variety of models, ranging from small models with shape $(n_{\mathrm{layer}},d_{\mathrm{model}})=(2,128)$ through billion-parameter models, ranging in shape from (6, 4288) through (207, 768). Here we have trained to near convergence on the full WebText2 dataset and observe no overfitting (except possibly for the very largest models).
As shown in Figure 1, we find a steady trend with non-embedding parameter count $N$ , which can be fit to the first term of Equation (1.5), so that
$$
L(N)\approx\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}
$$
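A power law of this form is a straight line in log-log coordinates, so its parameters can be estimated by ordinary least squares on $(\log N,\log L)$. The sketch below shows that procedure on synthetic points generated from the quoted constants. It illustrates the general fitting technique under that assumption and makes no claim about how the fits in this paper were actually produced.

```python
import math


def fit_power_law(ns, losses):
    """Least-squares fit of log L = alpha * (log N_c - log N); returns (alpha, N_c)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    alpha = -slope                        # slope of the line is -alpha
    n_c = math.exp(intercept / alpha)     # intercept equals alpha * log N_c
    return alpha, n_c


# Synthetic, noise-free points generated from the constants quoted in the text.
ns = [10 ** k for k in range(3, 10)]
losses = [(8.8e13 / n) ** 0.076 for n in ns]

alpha_hat, n_c_hat = fit_power_law(ns, losses)
print(f"alpha_N ~ {alpha_hat:.3f}, N_c ~ {n_c_hat:.2e}")
```

On real measurements the recovered $N_{c}$ is sensitive to small changes in the slope, since it is obtained by extrapolating the line far beyond the measured range.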

Figure 7

To observe these trends it is crucial to study performance as a function of $N$ ; if we instead use the total parameter count (including the embedding parameters) the trend is somewhat obscured (see Figure 6). This suggests that the embedding matrix can be made smaller without impacting performance, as has been seen in recent work $[\mathrm{LCG^{+}19}]$ .
要观察这些趋势,研究性能与 $N$ 的函数关系至关重要;如果我们使用总参数量(包括嵌入参数),趋势会变得不那么明显(见图 6)。这表明嵌入矩阵可以变得更小而不影响性能,正如最近的研究 $[\mathrm{LCG^{+}19}]$ 所显示的那样。
Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in $N$ with nearly identical power, as shown in Figure 8.
虽然这些模型是在 WebText2 数据集上训练的,但它们在其他各种数据集上的测试损失也与 $N$ 呈幂律关系,且幂几乎相同,如图 8 所示。
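A fit of the form $L(N)\approx(N_c/N)^{\alpha_N}$ is a straight line in log-log space, so the two constants can be recovered with ordinary least squares. The sketch below shows one way to do this; it is not the authors' fitting code, and the data are synthetic, generated from made-up constants purely to show that the procedure recovers them.

```python
import math

def fit_power_law(sizes, losses):
    """Fit L = (N_c / N)**alpha by least squares on (log N, log L).

    In log space the model is linear: log L = alpha*log N_c - alpha*log N.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return alpha, n_c

# Synthetic, noise-free data from made-up constants (NOT the paper's fit).
TRUE_ALPHA, TRUE_N_C = 0.08, 1.0e14
sizes = [1e5, 1e6, 1e7, 1e8, 1e9]
losses = [(TRUE_N_C / n) ** TRUE_ALPHA for n in sizes]

alpha, n_c = fit_power_law(sizes, losses)
print(f"alpha = {alpha:.4f}, N_c = {n_c:.3e}")
```

Because $N_c$ enters only through $\alpha\log N_c$, small errors in the fitted slope translate into large relative errors in $N_c$ on real, noisy data.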
3.2.1 Comparing to LSTMs and Universal Transformers
3.2.1 与 LSTMs 和 Universal Transformers 比较
In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count $N$. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns.
在图 7 中,我们比较了 LSTM 和 Transformer 的性能随非嵌入参数数量 $N$ 的变化。LSTM 使用相同的数据集和上下文长度进行训练。从这些图表中可以看出,对于出现在上下文早期的 Token,LSTM 的表现与 Transformer 相当,但对于较晚出现的 Token,LSTM 无法匹敌 Transformer 的性能。我们在附录 D.5 中展示了性能与上下文位置之间的幂律关系,其中较大模型的幂次逐渐增大,表明其快速识别模式的能力有所提高。
We also compare the performance of standard Transformers to recurrent Transformers $[\mathrm{DGV}^{+}18]$ in Figure 17 in the appendix. These models re-use parameters, and so perform slightly better as a function of $N$ , at the cost of additional compute per-parameter.
我们还比较了标准 Transformer 和循环 Transformer $[\mathrm{DGV}^{+}18]$ 在附录中的图 17 中的性能。这些模型重用了参数,因此在 $N$ 的函数下表现略好,但每个参数的计算成本更高。
3.2.2 Generalization Among Data Distributions
3.2.2 数据分布间的泛化能力
We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8).
我们还在一组额外的文本数据分布上测试了我们的模型。这些数据集上的测试损失与模型大小的关系如图 8 所示;在所有情况下,模型仅在 WebText2 数据集上进行了训练。我们发现,在这些其他数据分布上的损失随着模型大小平滑改善,与 WebText2 上的改善直接平行。我们发现泛化几乎完全取决于分布内验证损失,而与训练时长或接近收敛的程度无关。我们也没有观察到对模型深度的依赖(见附录 D.8)。
3.3 Performance with Dataset Size and Compute
3.3 数据集规模和计算资源下的性能表现
We display empirical trends for the test loss as a function of dataset size $D$ (in tokens) and training compute $C$ in Figure 1.
我们在图 1 中展示了测试损失随数据集大小 $D$ (以 Token 计)和训练计算量 $C$ 的经验趋势。
For the trend with $D$ we trained a model with $(n_{\mathrm{layer}},n_{\mathrm{embd}})=(36,1280)$ on fixed subsets of the WebText2 dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be fit with a simple power law
对于随 $D$ 变化的趋势,我们在 WebText2 数据集的固定子集上训练了一个结构为 $(n_{\mathrm{layer}}, n_{\mathrm{embd}}) = (36, 1280)$ 的模型。我们在测试损失停止下降时停止训练。我们发现,最终的测试损失可以用简单的幂律拟合
$$
L(D)\approx\left(\frac{D_{c}}{D}\right)^{\alpha_{D}}
$$
in the dataset size. The data and fit appear in Figure 1.
在数据集大小方面。数据和拟合结果如图 1 所示。

Figure 8 Left: Generalization performance to other data distributions improves smoothly with model size, with only a small and very slowly growing offset from the WebText2 training distribution. Right: Generalization performance depends only on training distribution performance, and not on the phase of training. We compare generalization of converged models (points) to that of a single large model (dashed curves) as it trains.

图 8 左:模型大小平滑地提高了对其他数据分布的泛化性能,与 WebText2 训练分布相比,仅存在很小且增长非常缓慢的偏差。右:泛化性能仅取决于训练分布的性能,而不受训练阶段的影响。我们将收敛模型(点)的泛化性能与单个大模型(虚线曲线)在训练过程中的泛化性能进行比较。

The total amount of non-embedding compute used during training can be estimated as $C=6NBS$, where $B$ is the batch size, $S$ is the number of parameter updates, and the factor of 6 accounts for the forward and backward passes. Thus for a given value of $C$ we can scan over all models with various $N$ to find the model with the best performance at step $S=\frac{C}{6NB}$. Note that in these results the batch size $B$ remains fixed for all models, which means that these empirical results are not truly optimal. We will account for this in later sections using an adjusted $C_{\mathrm{min}}$ to produce cleaner trends.
训练期间使用的非嵌入计算总量可以估计为 $C=6NBS$,其中 $B$ 是批量大小,$S$ 是参数更新次数,因子 6 考虑了前向和后向传递。因此,对于给定的 $C$ 值,我们可以扫描所有具有不同 $N$ 的模型,以找到在步数 $S=\frac{C}{6NB}$ 处性能最佳的模型。请注意,在这些结果中,批量大小 $B$ 对所有模型保持固定,这意味着这些实验结果并非真正最优。我们将在后续章节中使用调整后的 $C_{\mathrm{min}}$ 来产生更清晰的趋势。
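The compute estimate $C=6NBS$ inverts to $S=C/(6NB)$, which is what makes the scan over model sizes at fixed compute possible. A minimal sketch follows, assuming $B$ is measured in tokens and using the batch size of $2^{19}$ tokens that the paper mentions elsewhere; the function names and the example budget are ours.

```python
def train_compute(n_params: float, batch_tokens: float, steps: float) -> float:
    """Non-embedding training compute in FLOPs, C = 6 * N * B * S."""
    return 6.0 * n_params * batch_tokens * steps

def steps_for_budget(compute: float, n_params: float, batch_tokens: float) -> float:
    """Number of updates S a model of size N can take within compute C."""
    return compute / (6.0 * n_params * batch_tokens)

FLOPS_PER_PF_DAY = 1e15 * 24 * 3600  # one petaflop/s sustained for a day

BATCH = 2 ** 19  # tokens per batch
budget = train_compute(1e8, BATCH, 1e5)  # hypothetical reference run
for n in (1e7, 1e8, 1e9):
    print(f"N={n:.0e}: S={steps_for_budget(budget, n, BATCH):.3g}")
print(f"budget = {budget / FLOPS_PER_PF_DAY:.3g} PF-days")
```

At a fixed budget, a tenfold larger model gets a tenfold smaller step count, so the scan trades model size against training time one for one.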
The result appears as the heavy black line on the left-hand plot in Figure 1. It can be fit with
结果如图 1 左侧图中的粗黑线所示。它可以拟合
$$
L(C)\approx\left(\frac{C_{c}}{C}\right)^{\alpha_{C}}
$$
The figure also includes images of individual learning curves to clarify when individual models are optimal. We will study the optimal allocation of compute more closely later on. The data strongly suggests that sample efficiency improves with model size, and we also illustrate this directly in Figure 19 in the appendix.
该图还包含了各个学习曲线的图像,以澄清各个模型在何时达到最优。我们将在后面更详细地研究计算资源的最佳分配。数据强烈表明样本效率随着模型规模的增大而提高,我们也在附录中的图 19 中直接展示了这一点。
4 Charting the Infinite Data Limit and Overfitting
4 绘制无限数据极限与过拟合
In Section 3 we found a number of basic scaling laws for language modeling performance. Here we will study the performance of a model of size $N$ trained on a dataset with $D$ tokens while varying $N$ and $D$ simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control.
在第 3 节中,我们发现了一些语言建模性能的基本缩放定律。在这里,我们将研究在包含 $D$ Token 的数据集上训练的大小为 $N$ 的模型的性能,同时变化 $N$ 和 $D$ 。我们将通过实证证明,最优训练的测试损失符合公式 (1.5) 的缩放定律。这为我们提供了指导,说明我们需要多少数据来训练不断增大的模型,同时保持过拟合在可控范围内。
4.1 Proposed $L(N,D)$ Equation
4.1 提出的 $L(N,D)$ 方程
We have chosen the parameterization (1.5) (repeated here for convenience):
我们选择了参数化 (1.5) (为方便起见,在此重复):
$$
L(N,D)=\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}}
$$
using three principles:
使用三个原则:
1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The parameterization of $L(N,D)$ (and all models of the loss) must naturally allow for such a rescaling.
2. Fixing $D$ and sending $N\rightarrow\infty$, the overall loss should approach $L(D)$. Conversely, fixing $N$ and sending $D\rightarrow\infty$ the loss must approach $L(N)$.
3. $L(N,D)$ should be analytic at $D=\infty$, so that it has a series expansion in $1/D$ with integer powers. Theoretical support for this principle is significantly weaker than for the first two.
1. 词汇量或 Token 化的变化预计会使损失按一个整体因子重新缩放。$L(N,D)$ 的参数化(以及所有损失模型)必须自然允许这种重新缩放。
2. 固定 $D$ 并使 $N\rightarrow\infty$,整体损失应接近 $L(D)$。相反,固定 $N$ 并使 $D\rightarrow\infty$,损失必须接近 $L(N)$。
3. $L(N,D)$ 在 $D=\infty$ 处应该是解析的,因此它具有以 $1/D$ 展开的级数,并且幂为整数。对这一原则的理论支持明显弱于前两个。
Our choice of $L(N,D)$ satisfies the first requirement because we can rescale $N_{c},D_{c}$ with changes in the vocabulary. This also implies that the values of $N_{c},D_{c}$ have no fundamental meaning.
我们的选择 $L(N,D)$ 满足第一个要求,因为我们可以通过调整词汇的变化来重新缩放 $N_{c},D_{c}$ 。这也意味着 $N_{c},D_{c}$ 的值没有根本性的意义。

Figure 9 The early-stopped test loss $L(N,D)$ depends predictably on the dataset size $D$ and model size $N$ according to Equation (1.5). Left: For large $D$, performance is a straight power law in $N$. For a smaller fixed $D$, performance stops improving as $N$ increases and the model begins to overfit. (The reverse is also true, see Figure 4.) Right: The extent of overfitting depends predominantly on the ratio $N^{\frac{\alpha_{N}}{\alpha_{D}}}/D$, as predicted in Equation (4.3). The line is our fit to that equation.
图 9: 早期停止的测试损失 $L(N,D)$ 可预测地依赖于数据集大小 $D$ 和模型大小 $N$,如公式 (1.5) 所示。左:对于较大的 $D$,性能是 $N$ 的幂律直线。对于较小的固定 $D$,随着 $N$ 增加,性能停止改善并且模型开始过拟合。(反之亦然,见图 4。)右:过拟合的程度主要取决于比率 $N^{\frac{\alpha_{N}}{\alpha_{D}}}/D$,如公式 (4.3) 预测的那样。线是我们对该公式的拟合。
Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we expect that larger models should always perform better than smaller models. But with fixed finite $D$, we also do not expect any model to be capable of approaching the best possible loss (i.e., the entropy of text). Similarly, a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note that knowledge of $L(N)$ at infinite $D$ and $L(D)$ at infinite $N$ fully determines all the parameters in $L(N,D)$.
由于我们在测试损失停止改善时提前停止训练,并以相同的方式优化所有模型,我们期望较大的模型应该总是比小的模型表现更好。但当 $D$ 固定时,我们也并不期望任何模型能够达到最佳可能的损失(即文本的熵)。同样,固定大小的模型将受到容量限制。这些考虑促使我们提出第二个原则。请注意,在无限 $D$ 下的 $L(N)$ 和在无限 $N$ 下的 $L(D)$ 完全决定了 $L(N,D)$ 中的所有参数。
The third principle is more speculative. There is a simple and general reason one might expect overfitting to scale $\propto1/D$ at very large $D$. Overfitting should be related to the variance or the signal-to-noise ratio of the dataset [AS17], and this scales as $1/D$. This expectation should hold for any smooth loss function, since we expect to be able to expand the loss about the $D\rightarrow\infty$ limit. However, this argument assumes that $1/D$ corrections dominate over other sources of variance, such as the finite batch size and other limits on the efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability.
第三原则更具推测性。有一个简单而普遍的原因,可以预期过拟合在非常大的 $D$ 下按 $\propto1/D$ 缩放。过拟合应与数据集的方差或信噪比相关 [AS17],而这按 $1/D$ 缩放。这种预期应对任何平滑损失函数都成立,因为我们期望能够在 $D\rightarrow\infty$ 极限附近展开损失函数。然而,这一论点假设 $1/D$ 修正主导其他方差来源,例如有限批量大小和其他优化效能限制。没有实证确认,我们对其适用性不会有太大信心。
Our third principle explains the asymmetry between the roles of $N$ and $D$ in Equation (1.5). Very similar symmetric expressions are possible, but they would not have a $1/D$ expansion with integer powers, and would require the introduction of an additional parameter.
我们的第三原则解释了方程 (1.5) 中 $N$ 和 $D$ 角色之间的不对称性。非常相似的对称表达式是可能的,但它们不会有 $1/D$ 的整数幂展开,并且需要引入一个额外的参数。
In any case, we will see that our equation for $L(N,D)$ fits the data well, which is the most important justification for our $L(N,D)$ ansatz.
在任何情况下,我们将看到我们的公式 $L(N,D)$ 与数据非常吻合,这是对我们 $L(N,D)$ 假设的最重要证明。
4.2 Results
4.2 结果
We regularize all our models with $10\%$ dropout, and by tracking test loss and stopping once it is no longer decreasing. The results are displayed in Figure 9, including a fit to the four parameters $\alpha_{N},\alpha_{D},N_{c},D_{c}$ in Equation (1.5):
我们对所有模型使用 10% dropout 进行正则化,并通过跟踪测试损失并在损失不再减少时停止训练。结果如图 9 所示,包括对公式 (1.5) 中的四个参数 $\alpha_{N},\alpha_{D},N_{c},D_{c}$ 的拟合:
Table 2 Fits to $L(N,D)$
表 2: $L(N,D)$ 的拟合结果
| Parameter | $\alpha_N$ | $\alpha_D$ | $N_c$ | $D_c$ |
|---|---|---|---|---|
| Value | 0.076 | 0.103 | $6.4\times10^{13}$ | $1.83\times10^{13}$ |
We obtain an excellent fit, with the exception of the runs where the dataset has been reduced by a factor of 1024, to about $2\times10^{7}$ tokens. With such a small dataset, an epoch consists of only 40 parameter updates. Perhaps such a tiny dataset represents a different regime for language modeling, as overfitting happens very early in training (see Figure 16). Also note that the parameters differ very slightly from those obtained in Section 3, as here we are fitting the full $L(N,D)$ rather than just $L(N,\infty)$ or $L(\infty,D)$.
我们获得了非常好的拟合效果,例外情况是数据集被缩减了 1024 倍,大约为 $2\times10^{7}$ 个 Token。在如此小的数据集上,一个 epoch 仅包含 40 次参数更新。也许这种极小的数据集代表了语言建模的不同阶段,因为过拟合在训练初期就发生了(见图 16)。另外请注意,这里的参数与第 3 节中获得的参数略有不同,因为我们在这里拟合的是完整的 $L(N,D)$ 而不仅仅是 $L(N,\infty)$ 或 $L(\infty,D)$。
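To make the second principle concrete, the sketch below evaluates Equation (1.5) with the values listed in Table 2 and checks numerically that it reduces to the single-variable power laws in the two limits. The function names are ours, and the example point $(N,D)=(10^{8},10^{9})$ is arbitrary.

```python
import math

# Fitted values as listed in Table 2.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.83e13

def loss_nd(n_params: float, n_tokens: float) -> float:
    """Early-stopped test loss L(N, D) from Equation (1.5)."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D

def loss_n(n_params: float) -> float:
    """Infinite-data limit, L(N) = (N_c / N)**alpha_N."""
    return (N_C / n_params) ** ALPHA_N

def loss_d(n_tokens: float) -> float:
    """Infinite-model limit, L(D) = (D_c / D)**alpha_D."""
    return (D_C / n_tokens) ** ALPHA_D

n, d = 1e8, 1e9
print(loss_nd(n, d), loss_nd(n, math.inf), loss_nd(math.inf, d))
```

Passing `math.inf` for either argument zeroes the corresponding term, so the same function serves for the finite case and for both limits.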

Figure 10 The critical batch size $B_{\mathrm{crit}}$ follows a power law in the loss as performance increases, and does not depend directly on the model size. We find that the critical batch size approximately doubles for every $13\%$ decrease in loss. $B_{\mathrm{crit}}$ is measured empirically from the data shown in Figure 18, but it is also roughly predicted by the gradient noise scale, as in [MKAT18].
图 10: 关键批量大小 $B_{\mathrm{crit}}$ 随着性能的提高在损失中遵循幂律,并不直接依赖于模型大小。我们发现损失每减少 $13\%$,关键批量大小大约翻倍。$B_{\mathrm{crit}}$ 是从图 18 所示的数据中经验测量得出的,但它也大致由梯度噪声尺度预测,如 [MKAT18] 所示。

To chart the borderlands of the infinite data limit, we can directly study the extent of overfitting. For all but the largest models, we see no sign of overfitting when training with the full 22B token WebText2 dataset, so we can take it as representative of $D=\infty$. Thus we can compare finite $D$ to the infinite data limit by defining
为了描绘无限数据极限的边缘地带,我们可以直接研究过拟合的程度。对于除最大模型之外的所有模型,在使用完整的 22B Token WebText2 数据集进行训练时,我们没有发现过拟合的迹象,因此可以将其视为代表 $D=\infty$。因此,我们可以通过如下定义,将有限的 $D$ 与无限数据极限进行比较
$$
\delta L(N,D)\equiv\frac{L(N,D)}{L(N,\infty)}-1
$$
and studying it as a function of $N,D$. In fact, we see empirically that $\delta L$ depends only on a specific combination of $N$ and $D$, as shown in Figure 16. This follows from the scaling law of Equation (1.5), which implies
并将其作为 $N$ 和 $D$ 的函数进行研究。实际上,我们从经验上看到 $\delta L$ 仅依赖于 $N$ 和 $D$ 的特定组合,如图 16 所示。这遵循公式 (1.5) 的缩放定律,该定律表明
$$
\delta L\approx\left(1+\left(\frac{N}{N_{c}}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}\frac{D_{c}}{D}\right)^{\alpha_{D}}-1
$$
Note that at large $D$ this formula also has a series expansion in powers of $1/D$ .
请注意,在较大的 $D$ 下,此公式也有关于 $1/D$ 的幂级数展开。
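For readers who want the intermediate step, the form of $\delta L$ follows by dividing Equation (1.5) by its $D\rightarrow\infty$ limit, $L(N,\infty)=(N_{c}/N)^{\alpha_{N}}$, and factoring the model-size term out of the bracket:

$$
\frac{L(N,D)}{L(N,\infty)}=\frac{\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}}}{\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}\right]^{\alpha_{D}}}=\left[1+\left(\frac{N}{N_{c}}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}\frac{D_{c}}{D}\right]^{\alpha_{D}}
$$

Subtracting 1 gives $\delta L$. Expanding for small $D_{c}/D$, the leading term is $\alpha_{D}\left(N/N_{c}\right)^{\alpha_{N}/\alpha_{D}}D_{c}/D$, which is the $1/D$ behavior required by the third principle.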
We estimate that the variation in the loss with different random seeds is roughly 0.02, which means that to avoid overfitting when training to within that threshold of convergence we require
我们估计不同随机种子下的损失变化大约为 0.02,这意味着为了避免过拟合,我们需要在训练时达到该阈值范围内的收敛
$$
D\gtrsim\left(5\times10^{3}\right)N^{0.74}
$$
With this relation, models smaller than $10^{9}$ parameters can be trained with minimal overfitting on the 22B token WebText2 dataset, but our largest models will encounter some mild overfitting. More generally, this relation shows that dataset size may grow sub-linearly in model size while avoiding overfitting. Note however that this does not typically represent maximally compute-efficient training. We should also emphasize that we have not optimized regularization (e.g., the dropout probability) while varying dataset and model size.
通过这种关系,参数少于 $10^{9}$ 的模型可以在 22B Token 的 WebText2 数据集上进行训练,并且过拟合最小,但我们的最大模型会遇到一些轻微的过拟合。更一般地说,这种关系表明,在避免过拟合的情况下,数据集大小可以以亚线性的速度随着模型大小增长。然而,需要注意的是,这通常并不代表计算效率最高的训练方式。我们还应强调,在调整数据集和模型大小时,我们没有优化正则化(例如 dropout 概率)。
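The dataset-size requirement can be reproduced approximately from the $\delta L$ formula and the Table 2 values. The sketch below assumes that the quoted seed-to-seed variation of 0.02 is used as the tolerance on $\delta L$; that reading is ours. Because the table entries are rounded, the prefactor it prints is of the same order as the quoted $5\times10^{3}$ rather than identical to it.

```python
import math

# Fitted values as listed in Table 2.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.83e13
THRESHOLD = 0.02  # assumed tolerance on delta L (the quoted seed-to-seed variation)

def min_tokens(n_params: float, threshold: float = THRESHOLD) -> float:
    """Smallest D for which delta L(N, D) stays below `threshold`.

    Solves (1 + x)**alpha_D - 1 = threshold for
    x = (N / N_c)**(alpha_N / alpha_D) * D_c / D.
    """
    x = (1.0 + threshold) ** (1.0 / ALPHA_D) - 1.0
    return D_C * (n_params / N_C) ** (ALPHA_N / ALPHA_D) / x

exponent = ALPHA_N / ALPHA_D
prefactor = min_tokens(1.0)  # coefficient multiplying N**exponent
print(f"D >~ {prefactor:.2e} * N^{exponent:.2f}")
for n in (1e7, 1e8, 1e9):
    print(f"N={n:.0e}: D >~ {min_tokens(n):.2e} tokens")
```

The exponent $\alpha_{N}/\alpha_{D}\approx0.74$ is what makes the required dataset size grow sub-linearly with model size.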
5 Scaling Laws with Model Size and Training Time
5 模型规模和训练时间的扩展定律
In this section we will demonstrate that a simple scaling law provides a good description for the loss as a function of model size $N$ and training time. First we will explain how to use the results of [MKAT18] to define a universal training step $S_{\mathrm{min}}$ , which accounts for the fact that most of our models have not been trained at an optimal batch size. Then we will demonstrate that we can fit the model size and training time dependence of the loss using Equation (1.6). Later we will use these results to predict the optimal allocation of training compute between model size and training time, and then confirm that prediction.
在本节中,我们将展示一个简单的缩放定律可以很好地描述损失与模型大小 $N$ 和训练时间的关系。首先,我们将解释如何使用 [MKAT18] 的结果来定义一个通用的训练步数 $S_{\mathrm{min}}$ ,这考虑了我们的大多数模型没有在最优批量大小下进行训练的事实。然后我们将证明我们可以使用公式 (1.6) 来拟合损失对模型大小和训练时间的依赖关系。之后我们将使用这些结果来预测训练计算资源在模型大小和训练时间之间的最优分配,并验证该预测。
5.1 Adjustment for Training at $B_{\mathrm{crit}}(L)$
5.1 在 $B_{\mathrm{crit}}(L)$ 处训练的调整
A simple empirical theory for the batch size dependence of training was developed in [MKAT18] (see also $[\mathrm{SLA}^{+}18, \mathrm{ZLN}^{+}19]$). It was argued that there is a critical batch size $B_{\mathrm{crit}}$ for training; for $B$ up to $B_{\mathrm{crit}}$ the batch size can be increased with very minimal degradation in compute-efficiency, whereas for $B>B_{\mathrm{crit}}$ increases in $B$ result in diminishing returns. It was also argued that the gradient noise scale provides a simple prediction for $B_{\mathrm{crit}}$, and that neither depends directly on model size except through the value of the loss that has been attained. These results can be used to predict how training time and compute will vary with the batch size. To utilize both training time and compute as effectively as possible, it is best to train with a batch size $B\approx B_{\mathrm{crit}}$. Training at $B\gg B_{\mathrm{crit}}$ minimizes the number of training steps, while $B\ll B_{\mathrm{crit}}$ minimizes the use of compute.
在 [MKAT18] 中发展了一个简单的经验理论来解释训练过程中批量大小的依赖性(另见 $[\mathrm{SLA}^{+}18, \mathrm{ZLN}^{+}19]$)。该理论认为,在训练中存在一个临界批量大小 $B_{\mathrm{crit}}$;对于 $B$ 小于或等于 $B_{\mathrm{crit}}$ 的情况,增加批量大小对计算效率的影响非常小;而对于 $B>B_{\mathrm{crit}}$ 的情况,继续增加 $B$ 会导致收益递减。还指出,梯度噪声尺度可以简单预测 $B_{\mathrm{crit}}$,并且两者都不直接依赖于模型大小,除非通过已达到的损失值。这些结果可用于预测训练时间和计算资源如何随批量大小变化。为了尽可能有效地利用训练时间和计算资源,最好使用接近临界批量大小 $B\approx B_{\mathrm{crit}}$ 进行训练。在 $B\gg B_{\mathrm{crit}}$ 下训练可最小化训练步骤的数量,而在 $B\ll B_{\mathrm{crit}}$ 下训练则可最小化计算资源的使用。
More specifically, it was demonstrated that for a wide variety of neural network tasks, the number of training steps $S$ and the number of data examples processed $E=B S$ satisfy the simple relation
更具体地说,已经证明对于各种各样的神经网络任务,训练步骤的数量 $S$ 和处理的数据示例数量 $E=B S$ 满足简单的关系
$$
\left({\frac{S}{S_{\mathrm{min}}}}-1\right)\left({\frac{E}{E_{\mathrm{min}}}}-1\right)=1
$$
when training to any fixed value of the loss $L$ . Here $S_{\mathrm{min}}$ is the minimum number of steps necessary to reach $L$ , while $E_{\mathrm{min}}$ is the minimum number of data examples that must be processed.
在训练到任何固定的损失值 $L$ 时。这里 $S_{\mathrm{min}}$ 是达到 $L$ 所需的最小步数,而 $E_{\mathrm{min}}$ 是必须处理的最小数据样本数量。
We demonstrate the relation (5.1) for Transformers in Figure 18 in the appendix. This relation defines the critical batch size
我们在附录的图 18 中展示了 Transformer 的关系 (5.1) 。此关系定义了关键批次大小
$$
B_{\mathrm{crit}}(L)\equiv\frac{E_{\mathrm{min}}}{S_{\mathrm{min}}}
$$
which is a function of the target value of the loss. Training at the critical batch size makes a roughly optimal time/compute tradeoff, requiring $2S_{\mathrm{min}}$ training steps and processing $E=2E_{\mathrm{min}}$ data examples.
这是损失目标值的函数。在临界批量大小进行训练可以大致实现时间和计算资源的最优权衡,需要 $2S_{\mathrm{min}}$ 训练步骤并处理 $E=2E_{\mathrm{min}}$ 数据样本。
In Figure 10 we have plotted the critical batch size and gradient noise scale as a function of training loss for two different models. We see that $B_{\mathrm{crit}}(L)$ is independent of model size, and only depends on the loss $L$. So the predictions of [MKAT18] continue to hold for Transformer language models. The critical batch size can be fit with a power law in the loss
在图 10 中,我们绘制了两个不同模型的临界批量大小和梯度噪声尺度与训练损失的关系。我们发现 $B_{\mathrm{crit}}(L)$ 与模型大小无关,仅取决于损失 $L$。因此,[MKAT18] 的预测对于 Transformer 语言模型仍然成立。临界批量大小可以用损失的幂律进行拟合
$$
B_{\mathrm{crit}}(L)\approx\frac{B_{*}}{L^{1/\alpha_{B}}}
$$
where $B_{*}\approx2\times10^{8}$ and $\alpha_{B}\approx0.21$ .
其中 $B_{*} \approx 2 \times 10^{8}$ 和 $\alpha_{B} \approx 0.21$ 。
We have chosen this parameterization for $B_{\mathrm{crit}}(L)$ because as the loss approaches its minimum value $L_{\mathrm{min}}$, the gradient noise scale is expected to diverge, and we expect $B_{\mathrm{crit}}$ to track this noise scale. We do not know $L_{\mathrm{min}}$, as we see no sign that our models are approaching it, but $L_{\mathrm{min}}>0$ since the entropy of natural language is non-zero. Since apparently $L_{\mathrm{min}}$ is much smaller than the values of $L$ we have achieved, we used a parameterization where $B_{\mathrm{crit}}$ diverges as $L\rightarrow0$.
我们选择了这种参数化方式来表示 $B_{\mathrm{crit}}(L)$,因为当损失接近其最小值 $L_{\mathrm{min}}$ 时,梯度噪声尺度预计会发散,我们希望 $B_{\mathrm{crit}}$ 能够跟踪这一噪声尺度。我们不知道 $L_{\mathrm{min}}$,因为我们没有看到模型接近它的迹象,但 $L_{\mathrm{min}}>0$,因为自然语言的熵是非零的。由于显然 $L_{\mathrm{min}}$ 远小于我们已经实现的 $L$ 值,我们使用了在 $L\rightarrow0$ 时 $B_{\mathrm{crit}}$ 发散的参数化方式。
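The fitted power law for $B_{\mathrm{crit}}(L)$ can be evaluated directly, and it also accounts for the statement in the Figure 10 caption that the critical batch size roughly doubles for every 13% decrease in loss. In the sketch below, $B_{*}$ is assumed to be measured in tokens (consistent with the batch sizes quoted in tokens elsewhere), and the function name is ours.

```python
B_STAR = 2e8     # fitted value quoted in the text; assumed to be in tokens
ALPHA_B = 0.21   # fitted exponent quoted in the text

def critical_batch_size(loss: float) -> float:
    """Critical batch size, B_crit(L) = B_* / L**(1/alpha_B)."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

# Fractional drop in loss that doubles B_crit: solve (L2/L1)**(-1/alpha_B) = 2.
loss_drop_per_doubling = 1.0 - 2.0 ** (-ALPHA_B)

for loss in (4.0, 3.0, 2.5):
    print(f"L={loss}: B_crit ~ {critical_batch_size(loss):.3g} tokens")
print(f"B_crit doubles for every {loss_drop_per_doubling:.1%} decrease in loss")
```

Because the exponent $1/\alpha_{B}\approx4.8$ is large, modest improvements in loss translate into large increases in the batch size that can be used efficiently.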
We will use $B_{\mathrm{crit}}(L)$ to estimate the relation between the number of training steps $S$ while training at batch size $B=2^{19}$ tokens and the number of training steps while training at $B\gg B_{\mathrm{crit}}$ . This is simply
我们将使用 $B_{\mathrm{crit}}(L)$ 来估计在批量大小为 $B=2^{19}$ 个 Token 时的训练步数 $S$ 与在 $B \gg B_{\mathrm{crit}}$ 时的训练步数之间的关系。这很简单。
$$
S_{\mathrm{min}}(S)\equiv\frac{S}{1+B_{\mathrm{crit}}(L)/B}\qquad(\text{minimum steps, at } B\gg B_{\mathrm{crit}})
$$
for any given target value $L$ for the loss. This also defines a critical value of the compute needed to train to $L$ with a model of size $N$ if we were to train at $B\ll B_{\mathrm{crit}}(L)$ . This is
对于任何给定的损失目标值 $L$ 。这也定义了训练到损失值 $L$ 所需计算量的关键值,对于大小为 $N$ 的模型,如果我们以 $B \ll B_{\mathrm{crit}}(L)$ 进行训练。这是
$$
C_{\mathrm{min}}(C)\equiv\frac{C}{1+B/B_{\mathrm{crit}}(L)}\qquad(\text{minimum compute, at } B\ll B_{\mathrm{crit}})
$$
where $C=6N B S$ estimates the (non-embedding) compute used at batch size $B$ .
其中 $C=6N B S$ 估计了在批处理大小为 $B$ 时使用的(非嵌入)计算资源 。
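The two adjustments above are the relation (5.1) solved for a run at batch size $B$: substituting $E=BS$ and $B_{\mathrm{crit}}=E_{\mathrm{min}}/S_{\mathrm{min}}$ gives $S=S_{\mathrm{min}}(1+B_{\mathrm{crit}}/B)$ and $E=E_{\mathrm{min}}(1+B/B_{\mathrm{crit}})$. The sketch below applies them to a hypothetical run and checks that relation (5.1) is recovered; all numerical values are made up for illustration.

```python
def min_steps(steps: float, batch: float, b_crit: float) -> float:
    """Steps needed at B >> B_crit: S_min = S / (1 + B_crit / B)."""
    return steps / (1.0 + b_crit / batch)

def min_compute(compute: float, batch: float, b_crit: float) -> float:
    """Compute needed at B << B_crit: C_min = C / (1 + B / B_crit)."""
    return compute / (1.0 + batch / b_crit)

# Hypothetical run: 1e8 parameters, 50k steps at 2**19 tokens per batch,
# with an assumed critical batch size of 2**20 tokens at the target loss.
n_params, batch, steps, b_crit = 1e8, 2.0 ** 19, 5e4, 2.0 ** 20
compute = 6.0 * n_params * batch * steps

s_min = min_steps(steps, batch, b_crit)
c_min = min_compute(compute, batch, b_crit)
e_min = c_min / (6.0 * n_params)  # minimum number of tokens processed

# The adjusted values satisfy (S/S_min - 1) * (E/E_min - 1) = 1.
product = (steps / s_min - 1.0) * (batch * steps / e_min - 1.0)
print(s_min, c_min, product)
```

In this example the run sits below the critical batch size, so it is close to compute-efficient ($C/C_{\mathrm{min}}=1.5$) but uses three times the minimum number of steps.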
5.2 Results for $L(N,S_{\mathrm{min}})$ and Performance with Model Size and Compute
5.2 $L(N,S_{\mathrm{min}})$ 的结果及模型规模和计算性能的表现
Now we will use $S_{\mathrm{min}}$ defined in Equation (5.4) to obtain a simple and universal fit for the dependence of the loss on model size and training time in the infinite data limit. We will fit the stable, Adam-optimized training runs using Equation (1.6), repeated here for convenience:
现在我们将使用在公式 (5.4) 中定义的 $S_{\mathrm{min}}$ 来获得一个简单且通用的模型大小和训练时间对损失影响的拟合,在无限数据限制下。我们将使用公式 (1.6) 对稳定的、Adam优化的训练运行进行拟合,这里为了方便再次列出该公式:
$$
L(N,S_{\mathrm{min}})=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S_{\mathrm{min}}}\right)^{\alpha_{S}}
$$
for the loss. We include all training steps after the warmup period of the learning rate schedule, and find a fit to the data with the parameters:
对于损失。我们包括学习率调度的热身期之后的所有训练步骤,并使用以下参数对数据进行拟合:

Figure 11 When we hold either total compute or number of training steps fixed, performance follows $L(N,S)$ from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance. Mediocre fits at small $S$ are unsurprising, as the power-law equation for the learning curves breaks down very early in training.
图 11: 当我们固定总计算量或训练步数时,性能遵循公式 (5.6) 中的 $L(N,S)$ 。每个计算预算值都有一个与之相关的最优模型大小,可以最大化性能。在小 $S$ 时拟合效果较差并不令人意外,因为学习曲线的幂律方程在训练初期就会失效。
Table 3 Fits to $L(N,S)$
表 3: $L(N,S)$ 的拟合结果
| Parameter | $\alpha_N$ | $\alpha_S$ | $N_c$ | $S_c$ |
|---|---|---|---|---|
| Value | 0.077 | 0.76 | $6.5\times10^{13}$ | $2.1\times10^{3}$ |
With these parameters, we obtain the learning curve fits in Figure 4. Though the fits are imperfect, we believe they are quite compelling given the simplicity of Equation (5.6).
使用这些参数,我们在图 4 中获得了学习曲线的拟合。尽管拟合并不完美,但我们认为鉴于公式 (5.6) 的简单性,它们是非常有说服力的。
The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we study the test loss as a function of model size while fixing either the total non-embedding compute $C$ used in training, or the number of steps $S$. For the fits we use Equations (5.4) and (5.5) along with the parameters above and Equation (5.6).
数据和拟合结果可以以不同且更有趣的方式可视化,如图 11 所示。在那里,我们研究了在训练过程中固定总非嵌入计算量 $C$ 或步数 $S$ 时,测试损失随模型大小的变化情况。对于拟合,我们使用公式 (5.5) 和 (5.4) 以及上述参数和公式 (5.6)。
The power-law dependence of the loss on $S_{\mathrm{min}}$ reflects the interplay of optimizer dynamics and the loss landscape. Since the fits are best late in training, when the loss may be approximately quadratic, the power law should provide information about the spectrum of the Hessian of the loss. Its universality suggests that the Hessian eigenvalue density is roughly independent of model size.
损失对 $S_{\mathrm{min}}$ 的幂律依赖反映了优化器动态和损失景观之间的相互作用。由于拟合在训练后期最佳,此时损失可能是近似二次的,因此幂律应提供关于损失的 Hessian 矩阵谱的信息。其普遍性表明 Hessian 特征值密度大致与模型大小无关。
5.3 Lower Bound on Early Stopping Step
5.3 提前停止步数的下界
The results for $L(N,S_{\mathrm{min}})$ can be used to derive a lower-bound (and rough estimate) of the step at which early stopping should occur when training is data limited. It is motivated by the idea that finite and infinite $D$ learning curves for a given model will be very similar until we reach $S_{\mathrm{min}}\approx S_{\mathrm{stop}}$. Thus overfitting should be proportional to the correction from simply ending training at $S_{\mathrm{stop}}$. This will underestimate $S_{\mathrm{stop}}$, because in reality the test loss will decrease more slowly when we have a finite $D$, and therefore we will require more training steps to reach the optimal test loss at finite $D$. This line of reasoning leads to the inequality
$L(N,S_{\mathrm{min}})$ 的结果可以用于推导在数据有限的训练中应提前停止的步数的下界(和粗略估计)。其动机是基于这样一个想法:对于给定模型,有限和无限 $D$ 的学习曲线将非常相似,直到我们达到 $S_{\mathrm{min}} \approx S_{\mathrm{stop}}$ 。因此,过拟合应该与简单地在 $S_{\mathrm{stop}}$ 结束训练的修正成正比。这将低估 $S_{\mathrm{stop}}$ ,因为在现实中,当我们有一个有限的 $D$ 时,测试损失将下降得更慢,因此我们需要更多的训练步骤来达到有限 $D$ 下的最优测试损失。这一推理导致了以下不等式
$$
S_{\mathrm{stop}}(N,D)\gtrsim\frac{S_{c}}{[L(N,D)-L(N,\infty)]^{1/\alpha_{S}}}
$$
where $L(N,\infty)$ is the converged loss, evaluated with infinite available data. This inequality and its comparison to the empirical data are displayed in Figure 16 in the appendix. In that figure, the values of $S_{\mathrm{stop}}$ and $L(N,D)$ are empirical (though $S_{\mathrm{stop}}$ is adjusted to mimic training at $B\gg B_{\mathrm{crit}}$), while $L(N,\infty)$ is computed from the fit to $L(N,D)$ evaluated at $D=\infty$.
其中 $L(N,\infty)$ 是收敛损失,使用无限数据评估。此不等式及其与经验数据的比较在附录中的图 16 中显示。在该图中,$S_{\mathrm{stop}}$ 和 $L(N,D)$ 的值是经验性的(尽管 $S_{\mathrm{stop}}$ 调整为模拟在 $B \gg B_{\mathrm{crit}}$ 下的训练),而 $L(N,\infty)$ 是从对 $L(N,D)$ 的拟合计算得出,并在 $D=\infty$ 处评估。
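The lower bound can be evaluated from the fitted forms alone. The sketch below combines the Table 2 fit for $L(N,D)$ with $\alpha_{S}$ and $S_{c}$ from Table 3; mixing the two fits in this way is our simplification (the figure in the appendix uses empirical values of $L(N,D)$), and the function names and example sizes are hypothetical.

```python
import math

# Table 2 fit for L(N, D) and Table 3 fit for the step dependence.
ALPHA_N, ALPHA_D, N_C, D_C = 0.076, 0.103, 6.4e13, 1.83e13
ALPHA_S, S_C = 0.76, 2.1e3

def loss_nd(n_params: float, n_tokens: float) -> float:
    """L(N, D) from Equation (1.5); n_tokens may be math.inf."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def stop_step_lower_bound(n_params: float, n_tokens: float) -> float:
    """Lower bound on the early-stopping step, in units of S_min."""
    gap = loss_nd(n_params, n_tokens) - loss_nd(n_params, math.inf)
    return S_C / gap ** (1.0 / ALPHA_S)

for d in (1e8, 1e9, 1e10):
    print(f"D={d:.0e}: S_stop >~ {stop_step_lower_bound(1e8, d):.3g}")
```

As the dataset grows, the gap to the infinite-data loss shrinks and the bound on the stopping step rises, consistent with overfitting setting in later.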
6 Optimal Allocation of the Compute Budget
6 计算预算的最优分配
We displayed the empirical trend of performance as a function of the computation used during training in the top-right of Figure 1. However, this result involved training at a fixed batch size $B$, whereas we know that in fact we could train more efficiently by training at the batch size $B_{\mathrm{crit}}$ discussed in Section 5.1. Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively, and correcting for this inefficiency by standardizing to the critical batch size results in cleaner and more predictable trends.
我们在图 1 的右上角展示了性能的经验趋势,该趋势是训练期间使用的计算量的函数。然而,这一结果是在固定的批量大小 $B$ 下进行训练得出的,而我们知道实际上可以通过在第 5.1 节讨论的关键批量大小 $B_{\mathrm{crit}}$ 下进行训练来更高效地训练。较大和较小的损失值本可以分别用更少的样本或更少的步骤实现,通过标准化到关键批量大小来纠正这种低效性,可以得到更清晰和更可预测的趋势。

Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than the compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.
图 12 左:给定固定的计算预算,特定的模型大小是最优的,尽管稍大或稍小的模型也可以用最小的额外计算资源进行训练。右:大于计算效率最优大小的模型需要更少的训练步骤,如果可以提供足够的额外并行性,则可能实现更快的训练。注意,此公式不应应用于非常大的模型,因为它仅在初始瞬态效应之后的学习曲线幂律区域内有效。



Figure 13 When adjusting performance to simulate training far below the critical batch size, we find a somewhat altered power law for $L(C_{\mathrm{min}})$ when compared with the fully empirical results. The conspicuous lump at $10^{-5}$ PF-days marks the transition from 1-layer to 2-layer networks; we exclude 1-layer networks in the power-law fits. It is the $L(C_{\mathrm{min}})$ trend that we expect to provide a reliable extrapolation for larger compute.
图 13: 在调整性能以模拟远低于临界批量大小的训练时,我们发现与完全经验结果相比,$L(C_{\mathrm{min}})$ 的幂律有所改变。明显的 $10^{-5}$ PF天的突起标志着从单层网络到双层网络的过渡;我们在幂律拟合中排除了单层网络。我们期望 $L(C_{\mathrm{min}})$ 的趋势能够为更大的计算量提供可靠的外推。
In this section we will adjust for this oversight. More importantly, we will use the results of Section 5 to determine the optimal allocation of compute between model size $N$ and the quantity of data processed during training, namely $2B_{\mathrm{crit}}S_{\mathrm{min}}$ . We will determine this allocation both empirically and theoretically, by using the equation for $L(N,S_{\mathrm{min}})$ , and we will demonstrate that these methods agree.
在本节中,我们将对此疏忽进行调整。更重要的是,我们将使用第 5 节的结果来确定模型大小 $N$ 和训练期间处理的数据量 $2B_{\mathrm{crit}}S_{\mathrm{min}}$ 之间的计算资源最优分配。我们将通过经验方法和理论方法来确定这种分配,即使用 $L(N,S_{\mathrm{min}})$ 的公式,并将证明这些方法是一致的。
6.1 Optimal Performance and Allocations
6.1 最优性能和分配
Let us first study the loss as a function of the optimally allocated compute from Equation (5.5). The result is plotted in Figure 13, along with a power-law fit. We see that as compared to the compute plot of Figure 1, the new fit with $C_{\mathrm{min}}$ is somewhat improved.
让我们首先研究损失作为最优分配计算资源的函数,从公式 (5.5) 出发。结果如图 13 所示,并附有幂律拟合。我们发现,与图 1 的计算资源图相比,新的拟合使用 $C_{\mathrm{min}}$ 有所改进。

Figure 14 Left: Each value of the compute budget $C_{\mathrm{min}}$ has an associated optimal model size $N$. Optimal model size grows very rapidly with $C_{\mathrm{min}}$, increasing by 5x for each 10x increase in compute. The number of data examples processed makes up the remainder of the increase, growing relatively modestly by only 2x. Right: The batch-adjusted number of optimization steps also grows very slowly, if at all, meaning that most of the growth in data examples processed can be used for increased batch sizes.
图 14 左:每个计算预算值 $C_{\mathrm{min}}$ 都有一个相应的最优模型大小 $N$。最优模型大小随着 $C_{\mathrm{min}}$ 的增加而迅速增长,计算量每增加 10 倍,模型大小增加 5 倍。处理的数据样本数量构成了其余的增长,增长相对温和,仅为 2 倍。右:经批量调整后的优化步骤数量增长非常缓慢,甚至几乎没有增长,这意味着处理的数据样本的大部分增长可以用于增加批量大小。

Given $L(C_{\mathrm{min}})$, it is natural to ask for the optimal model size $N(C_{\mathrm{min}})$ that provides the minimal loss with a given quantity of training compute. The optimal model size is shown in Figure 14. We observe that $N(C_{\mathrm{min}})$ can be fit very well with a power law
给定 $L(C_{\mathrm{min}})$,很自然地会问在给定的训练计算量下,提供最小损失的最优模型大小 $N(C_{\mathrm{min}})$ 是多少。最优模型大小如图 14 所示。我们观察到 $N(C_{\mathrm{min}})$ 可以用幂律很好地拟合
$$
N(C_{\mathrm{min}})\propto(C_{\mathrm{min}})^{0.73}.
$$
In Figure 12, we show the effect of training models of sub-optimal sizes (see Appendix B.4).
在图 12 中,我们展示了训练次优规模模型的效果(见附录 B.4)。
By definition $C_{\mathrm{min}}\equiv6NB_{\mathrm{crit}}S$, and so we can use $N(C_{\mathrm{min}})$ to extract further results. In particular, since prior fits show $B\propto L^{-4.8}$ and $L\propto C_{\mathrm{min}}^{-0.05}$, we can conclude that $B_{\mathrm{crit}}\propto C_{\mathrm{min}}^{0.24}$. This leads us to conclude that the optimal number of steps will only grow very slowly with compute, as
根据定义 $C_{\mathrm{min}}\equiv6NB_{\mathrm{crit}}S$,因此我们可以使用 $N(C_{\mathrm{min}})$ 来提取进一步的结果。特别是,由于之前的拟合显示 $B\propto L^{-4.8}$ 和 $L\propto C_{\mathrm{min}}^{-0.05}$,我们可以得出结论 $B_{\mathrm{crit}}\propto C_{\mathrm{min}}^{0.24}$。这使我们得出结论,最优的步数将仅随计算量非常缓慢地增长,即
$$
S_{\mathrm{min}}\propto(C_{\mathrm{min}})^{0.03},
$$
$$
S_{\mathrm{min}} \propto (C_{\mathrm{min}})^{0.03} ,
$$
matching the empirical results in Figure 14. In fact the measured exponent is sufficiently small that our results may even be consistent with an exponent of zero.
与图 14 的实证结果一致。实际上,测得的指数足够小,以至于我们的结果可能甚至与指数为零的情况一致。
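The exponent bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the inputs are the rounded exponents quoted above.

```python
# Rounded exponents quoted in the text (variable names are illustrative).
p_N = 0.73       # N(C_min) ∝ C_min^0.73
b_of_l = -4.8    # B_crit ∝ L^-4.8
l_of_c = -0.05   # L ∝ C_min^-0.05

# Composing the two power laws: B_crit ∝ (C_min^-0.05)^-4.8.
p_B = b_of_l * l_of_c

# C_min = 6 * N * B_crit * S, so the exponents of N, B_crit and S sum to 1.
p_S = 1.0 - p_N - p_B

print(f"B_crit exponent: {p_B:.2f}")
print(f"S_min exponent:  {p_S:.2f}")
```

Because the step exponent is the small remainder of two much larger numbers, a shift of a few hundredths in either fitted exponent is enough to move it to zero.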
Thus we conclude that as we scale up language modeling with an optimal allocation of computation, we should predominantly increase the model size $N$, while simultaneously scaling up the batch size via $B\propto B_{\mathrm{crit}}$ with negligible increase in the number of serial steps. Since compute-efficient training uses relatively few optimization steps, additional work on speeding up early training dynamics may be warranted.
因此,我们得出结论:在以最优计算资源分配扩大语言模型规模时,应主要增加模型大小 $N$,同时通过 $B\propto B_{\mathrm{crit}}$ 扩大批次大小,而串行步骤的数量几乎不变。由于计算高效的训练使用相对较少的优化步骤,加速早期训练动态的额外工作可能是必要的。
6.2 Predictions from $L(N,S_{\mathrm{min}})$
6.2 来自 $L(N,S_{\mathrm{min}})$ 的预测
The results for $L(C_{\mathrm{min}})$ and the allocations can be predicted from the $L(N,S_{\mathrm{min}})$ equation obtained in Section 5. Given our equation for $L(N,S_{\mathrm{min}})$, we can substitute $S_{\mathrm{min}}=\frac{C_{\mathrm{min}}}{6NB}$ and then find the minimum of the loss as a function of $N$, while fixing the training compute. We carry out this procedure in detail in Appendix B, where we also provide some additional predictions.
可以从第 5 节获得的 $L(N,S_{\mathrm{min}})$ 方程预测 $L(C_{\mathrm{min}})$ 和分配的结果。给定我们的 $L(N,S_{\mathrm{min}})$ 方程,我们可以代入 $S_{\mathrm{min}}=\frac{C_{\mathrm{min}}}{6N B}$ ,然后在固定训练计算量的情况下,找到损失关于 $N$ 的最小值。我们在附录 B 中详细进行了这一过程,并提供了一些额外的预测。
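The substitute-and-minimize procedure can also be carried out numerically. The sketch below is not the authors' code. It assumes the rounded fit constants from Appendix A, takes $B$ to be the critical batch size $B_*/L^{1/\alpha_B}$, and converts PF-days to FLOPs as $10^{15}\times86400$. All function and variable names are hypothetical. Because $B$ depends on $L$, the loss is defined implicitly and is solved by bisection before a grid search over $N$.

```python
import math

# Rounded fit values from Appendix A (assumed here; names are illustrative).
ALPHA_N, ALPHA_S, ALPHA_B = 0.076, 0.76, 0.21
N_C, S_C, B_STAR = 8.8e13, 2.1e3, 2.1e8
FLOPS_PER_PF_DAY = 1e15 * 86400  # assumed unit conversion


def loss_at(n, c_flops):
    """Solve L = (N_c/N)^a_N + (6 B_* S_c N / (L^(1/a_B) C))^a_S for L by bisection."""
    floor = (N_C / n) ** ALPHA_N                   # converged loss L(N, infinity)
    t = (6 * B_STAR * S_C * n / c_flops) ** ALPHA_S
    k = ALPHA_S / ALPHA_B

    def g(l):                                      # monotonically increasing in l
        return l - floor - t * l ** (-k)

    lo, hi = floor, floor + t * floor ** (-k)      # g(lo) < 0 <= g(hi)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_size(c_pf_days, points=2000):
    """Grid search over log10(N) in [5, 13] for the loss-minimizing model size."""
    c = c_pf_days * FLOPS_PER_PF_DAY
    grid = [10 ** (5 + 8 * i / (points - 1)) for i in range(points)]
    best_n = min(grid, key=lambda n: loss_at(n, c))
    return best_n, loss_at(best_n, c)


n1, l1 = optimal_size(1.0)
n2, l2 = optimal_size(100.0)
slope = math.log(n2 / n1) / math.log(100.0)
print(f"N_opt at 1 PF-day ~ {n1:.2e} params, loss ~ {l1:.3f}")
print(f"empirical exponent of N(C_min) ~ {slope:.2f}")
```

With these rounded inputs the recovered exponent should come out at roughly 0.7, in line with the prediction quoted in this section. The exact digits depend on the rounding of the fit constants.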
For the loss as a function of training compute, we predict that
对于训练计算量作为函数的损失,我们预测
$$
L(C_{\mathrm{min}})=\left(\frac{C_{c}^{\mathrm{min}}}{C_{\mathrm{min}}}\right)^{\alpha_{C}^{\mathrm{min}}}
$$
$$
L(C_{\mathrm{min}})=\left(\frac{C_{c}^{\mathrm{min}}}{C_{\mathrm{min}}}\right)^{\alpha_{C}^{\mathrm{min}}}
$$
where
其中
$$
\alpha_{C}^{\mathrm{min}}\equiv\frac{1}{1/\alpha_{S}+1/\alpha_{B}+1/\alpha_{N}}\approx0.054
$$
$$
\alpha_{C}^{\mathrm{min}} \equiv \frac{1}{1 / \alpha_{S} + 1 / \alpha_{B} + 1 / \alpha_{N}} \approx 0.054
$$
in excellent agreement with the exponent of Figure 13. We also predict that
与图 13 的指数非常吻合。我们还预测
$$
N(C_{\mathrm{min}})\propto(C_{\mathrm{min}})^{\alpha_{C}^{\mathrm{min}}/\alpha_{N}}\approx(C_{\mathrm{min}})^{0.71}
$$
$$
N(C_{\mathrm{min}}) \propto (C_{\mathrm{min}})^{\alpha_{C}^{\mathrm{min}} / \alpha_{N}} \approx (C_{\mathrm{min}})^{0.71}
$$
which also matches the scaling of Figure 14 to within a few percent. Our scaling laws provide a predictive framework for the performance of language modeling.
这也与图 14 的缩放相符,误差在几个百分点之内。我们的缩放定律为语言模型的性能提供了一个预测框架。

Figure 15 Far beyond the model sizes we study empirically, we find a contradiction between our equations for $L(C_{\mathrm{min}})$ and $L(D)$ due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
图 15: 远超我们实证研究的模型规模,我们发现关于 $L(C_{\mathrm{min}})$ 和 $L(D)$ 的方程之间存在矛盾,这是由于计算高效训练所需数据的缓慢增长所致。交点标志着我们预期预测失效的转折点。该点的位置对幂律拟合得出的精确指数非常敏感。
6.3 Contradictions and a Conjecture
6.3 矛盾与猜想
We observe no signs of deviation from straight power-law trends at large values of compute, data, or model size. Our trends must eventually level off, though, since natural language has non-zero entropy.
我们在大规模计算、数据或模型尺寸的情况下,没有观察到偏离幂律趋势的迹象。然而,由于自然语言具有非零熵,我们的趋势最终必须趋于平稳。
Indeed, the trends for compute-efficient training described in this section already contain an apparent contradiction. At scales several orders of magnitude above those documented here, the performance predicted by the $L(C_{\mathrm{min}})$ scaling law decreases below what should be possible given the slow growth in training data with compute. This implies that our scaling laws must break down before this point, but we conjecture that the intersection point has a deeper meaning: it provides an estimate of the point at which Transformer language models reach maximal performance.
确实,本节描述的计算高效的训练趋势已经包含了一个明显的矛盾。在比本文记录的大几个数量级的规模上,根据 $L(C_{\mathrm{min}})$ 缩放定律预测的性能下降到低于给定计算量增长缓慢的训练数据所应达到的水平。这表明我们的缩放定律必须在此点之前失效,但我们推测交点有更深层的意义:它提供了 Transformer 语言模型达到最大性能点的估计。
Since the amount of data used by compute-efficient training grows slowly with the compute budget, the performance predicted by $L(C_{\mathrm{min}})$ eventually hits a lower bound set by the $L(D)$ power law (see Figure 15). Let us work this out in more detail.
由于计算高效的训练所使用的数据量随着计算预算的增长而缓慢增加,由 $L(C_{\mathrm{min}})$ 预测的性能最终会达到由 $L(D)$ 幂律设定的下限(见图 15)。让我们更详细地分析这一点。
To keep overfitting under control, the results of Section 4 imply that we should scale the dataset size as
为控制过拟合,第 4 节的结果表明我们应该按比例扩大数据集大小
$$
D\propto N^{0.74}\propto C_{\mathrm{min}}^{0.54}
$$
$$
D \propto N^{0.74} \propto C_{\mathrm{min}}^{0.54}
$$
where we have used the compute-efficient $N(C_{\mathrm{min}})$ from Figure 14.
我们使用了计算高效的 $N(C_{\mathrm{min}})$ (图 14)。
Let us compare this to the data requirements of compute-efficient training. If we train at the critical batch size (i.e. $C=2C_{\mathrm{min}}$) and never re-use data during training, we find that data usage grows with compute as
让我们将此与计算高效的训练的数据需求进行比较。如果我们以临界批量大小(即 $C=2C_{\mathrm{min}}$ )进行训练,并且在训练过程中从不重复使用数据,我们发现数据使用量随着计算量的增长而增长
$$
D(C_{\mathrm{min}})=\frac{2C_{\mathrm{min}}}{6N(C_{\mathrm{min}})}\approx\left(4\times10^{10}\ \text{tokens}\right)\left(C_{\mathrm{min}}/\text{PF-Day}\right)^{0.26}
$$
$$
D(C_{\mathrm{min}})=\frac{2C_{\mathrm{min}}}{6N(C_{\mathrm{min}})}\approx\left(4\times10^{10}\ \text{tokens}\right)\left(C_{\mathrm{min}}/\text{PF-Day}\right)^{0.26}
$$
This is the maximum rate at which the dataset size can productively grow with compute, since it means that we are only training for a single epoch. But it grows the dataset much more slowly than in Equation (6.6). It appears to imply that compute-efficient training will eventually run into a problem with overfitting, even if the training process never re-uses any data!
这是数据集大小可以在计算资源下有效增长的最大速率,因为我们只训练一个 epoch。但这比公式 (6.6) 中的数据集增长速度慢得多。这似乎意味着,即使训练过程从未重复使用任何数据,计算资源高效的训练最终也会遇到过拟合的问题。
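The gap between the two growth rates compounds quickly. As a rough illustration (the function name is hypothetical, and only the two exponents quoted above are used), the factor by which the data needed to control overfitting outpaces single-epoch data usage can be tabulated per decade of compute:

```python
P_REQUIRED = 0.54    # D ∝ C_min^0.54, needed to keep overfitting under control
P_AVAILABLE = 0.26   # D(C_min) ∝ C_min^0.26, single-epoch compute-efficient training


def growth(decades):
    """Growth factors after scaling compute by 10**decades."""
    required = 10 ** (P_REQUIRED * decades)
    available = 10 ** (P_AVAILABLE * decades)
    return required, available, required / available


for decades in (1, 2, 4):
    required, available, shortfall = growth(decades)
    print(f"compute x10^{decades}: required data x{required:.1f}, "
          f"available data x{available:.1f}, shortfall x{shortfall:.1f}")
```

The shortfall grows as $C_{\mathrm{min}}^{0.28}$, so it roughly doubles with every 10x increase in compute.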
According to Figure 1, we expect that when we are bottlenecked by the dataset size (i.e. by overfitting), the loss should scale as $L(D)\propto D^{-0.095}$. This implies that the loss would scale with compute as $L(D(C_{\mathrm{min}}))\propto C_{\mathrm{min}}^{-0.03}$ once we are data-limited. Once again, we have a contradiction, as this will eventually intersect with our prediction for $L(C_{\mathrm{min}})$ from Figure 13, where we found a scaling $L(C_{\mathrm{min}})\propto C_{\mathrm{min}}^{-0.050}$.
根据图 1,我们预计当数据集大小成为瓶颈(即过拟合)时,损失应按 $L(D)\propto D^{-0.095}$ 缩放。这表明一旦我们受到数据限制,损失将随计算量按 $L(D(C_{\mathrm{min}}))\propto C_{\mathrm{min}}^{-0.03}$ 缩放。我们再次遇到了矛盾,因为这最终会与我们在图 13 中对 $L(C_{\mathrm{min}})$ 的预测相交,其中我们发现缩放为 $L(C_{\mathrm{min}})\propto C_{\mathrm{min}}^{-0.050}$。
The intersection point of $L(D(C_{\mathrm{min}}))$ and $L(C_{\mathrm{min}})$ occurs at
$L(D(C_{\mathrm{min}}))$ 和 $L(C_{\mathrm{min}})$ 的交点发生在
$$
C^{*}\sim10^{4}\ \text{PF-Days},\quad N^{*}\sim10^{12}\ \text{parameters},\quad D^{*}\sim10^{12}\ \text{tokens},\quad L^{*}\sim1.7\ \text{nats/token}
$$
$$
C^{*}\sim10^{4}\ \text{PF-Days},\quad N^{*}\sim10^{12}\ \text{参数},\quad D^{*}\sim10^{12}\ \text{Token},\quad L^{*}\sim1.7\ \text{nats/Token}
$$
though the numerical values are highly uncertain, varying by an order of magnitude in either direction depending on the precise values of the exponents from the power-law fits. The most obvious interpretation is that our scaling laws break down at or before we reach this point, which is still many orders of magnitude away in both compute and model size.
尽管数值存在高度不确定性,可能根据幂律拟合的指数精确值在任一方向上变化一个数量级。最明显的解释是,我们的扩展定律在达到这一点之前就已经失效了,而这在计算量和模型规模上仍然相距多个数量级。
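The intersection can be located by equating the two power laws and solving the resulting linear equation in $\ln C_{\mathrm{min}}$. The sketch below is a rough check rather than a reproduction of the authors' fit: it plugs in the rounded constants quoted in this section and in Appendix A, and the function and variable names are hypothetical.

```python
import math

# Rounded constants quoted in the text and appendix (assumed; names are illustrative).
ALPHA_C_MIN, C_C_MIN = 0.050, 3.1e8   # L(C_min) = (C_c^min / C_min)^0.050, PF-days
ALPHA_D, D_C = 0.095, 5.4e13          # L(D) = (D_c / D)^0.095, tokens
D_0, P_D = 4e10, 0.26                 # D(C_min) ~ 4e10 tokens * C_min^0.26
N_E, P_N = 1.3e9, 0.73                # N(C_min) ~ 1.3e9 params * C_min^0.73


def intersection():
    """Solve (C_c^min / C)^a_C = (D_c / (D_0 * C^p_D))^a_D for C in PF-days."""
    # Taking logarithms gives a linear equation in ln C.
    lhs_const = ALPHA_C_MIN * math.log(C_C_MIN)
    rhs_const = ALPHA_D * math.log(D_C / D_0)
    ln_c = (lhs_const - rhs_const) / (ALPHA_C_MIN - ALPHA_D * P_D)
    c_star = math.exp(ln_c)
    n_star = N_E * c_star ** P_N
    d_star = D_0 * c_star ** P_D
    l_star = (C_C_MIN / c_star) ** ALPHA_C_MIN
    return c_star, n_star, d_star, l_star


c_star, n_star, d_star, l_star = intersection()
print(f"C* ~ {c_star:.1e} PF-days, N* ~ {n_star:.1e} params, "
      f"D* ~ {d_star:.1e} tokens, L* ~ {l_star:.2f} nats/token")
```

With these rounded inputs the crossing lands on the order of $10^{5}$ PF-days, within the stated order-of-magnitude uncertainty around the quoted $C^{*}\sim10^{4}$. The denominator `ALPHA_C_MIN - ALPHA_D * P_D` is a small difference of fitted exponents, which is why the location is so sensitive to their precise values.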
One might also conjecture that this intersection point has a deeper meaning. If we cannot increase the model size beyond $N^{*}$ without qualitatively different data requirements, perhaps this means that once we reach $C_{\mathrm{min}}^{*}$ and $N^{*}$, we have extracted all of the reliable information available in natural language data. In this interpretation, $L^{*}$ would provide a rough estimate for the entropy-per-token$^{7}$ of natural language. In this scenario, we would expect the loss trend to level off at or before $L^{*}$.
也可以推测这个交点具有更深层次的意义。如果在数据需求没有质变的情况下我们无法将模型规模扩大到超过 $N^{*}$,这可能意味着一旦我们达到 $C_{\mathrm{min}}^{*}$ 和 $N^{*}$,我们就已经提取了自然语言数据中所有可靠的信息。在这种解释下,$L^{*}$ 可以提供自然语言每 Token 熵 (entropy-per-token)$^{7}$ 的粗略估计。在这种情况下,我们预期损失趋势将在 $L^{*}$ 处或之前趋于平稳。
We can guess at the functional form of $L(C_{\mathrm{min}})$ as it levels off by considering a version of our training dataset with added noise. For example, we could append a random string of tokens to each context shown to the model to artificially boost the loss by a constant additive factor. Then, the distance from the noise floor $L-L_{\mathrm{noise}}$ would be a more meaningful performance metric, with even a small decrease in this distance potentially representing a significant boost in qualitative performance. Since the artificial noise would affect all of our trends equally, the critical point of 6.8 would not change (aside from the absolute value of $L^{*}$ ), and may be meaningful even if it occurs after the leveling off.
我们可以猜测 $L(C_{\mathrm{min}})$ 的函数形式在其趋于平稳时的表现,通过考虑一个添加了噪声的训练数据集版本。例如,我们可以在展示给模型的每个上下文中附加一串随机的 Token,以人为地增加损失值,使其增加一个常数加性因子。然后,距离噪声底 $L-L_{\mathrm{noise}}$ 将成为一个更有意义的性能指标,即使这一距离的小幅减少也可能代表性能的显著提升。由于人工噪声会对所有趋势产生相同的影响,临界点 6.8 不会改变(除了 $L^{*}$ 的绝对值),并且即使在趋于平稳之后,这个临界点仍然可能具有意义。
7 Related Work
7 相关工作
Power laws can arise from a wide variety of sources [THK18]. Power-law scalings with model and dataset size in density estimation $[\mathrm{Was}06]$ and in random forest models [Bia12] may be connected with our results. These models suggest that power-law exponents may have a very rough interpretation as the inverse of the number of relevant features in the data.
幂律可以从多种来源产生 [THK18]。在密度估计 $[\mathrm{Was}06]$ 和随机森林模型 [Bia12] 中,与模型和数据集大小相关的幂律缩放可能与我们的结果有关。这些模型表明,幂律指数可以粗略地解释为数据中相关特征数量的倒数。
Some early work [BB01, Goo01] found power-law scalings between performance and dataset size. More recent work [$\mathrm{HNA^{+}17}$, HAD19] also investigated scaling between model size and data size; their work is perhaps the closest to ours in the literature$^{8}$. Note, however, that $[\mathrm{HNA^{+}17}]$ found super-linear scaling of dataset size with model size, whereas we find a sub-linear scaling. There are some parallels between our findings on optimal allocation of compute and [Kom19], including power-law learning curves. EfficientNets [TL19] also appear to obey an approximate power-law relation between accuracy and model size. Very recent work [RRBS19b] studies scaling with both dataset size and model size for a variety of datasets, and fits an ansatz similar to ours.
一些早期的工作 [BB01, Goo01] 发现性能和数据集大小之间存在幂律缩放关系。更近期的工作 [$\mathrm{HNA^{+}17}$, HAD19] 也研究了模型大小和数据大小之间的缩放关系;他们的工作可能是文献中最接近我们的$^{8}$。然而,需要注意的是,$[\mathrm{HNA^{+}17}]$ 发现数据集大小与模型大小之间存在超线性缩放关系,而我们发现的是次线性缩放关系。我们关于计算资源最优分配的发现与 [Kom19] 有一些相似之处,包括幂律学习曲线。EfficientNets [TL19] 似乎也遵循准确率与模型大小之间的近似幂律关系。非常近期的工作 [RRBS19b] 研究了多种数据集在数据集大小和模型大小上的缩放关系,并拟合了一个与我们类似的拟设 (ansatz)。
EfficientNet [TL19] advocates scaling depth and width exponentially (with different coefficients) for optimal performance of image models, resulting in a power-law scaling of width as a function of depth. We find that for language models this power should be roughly one when scaling up (as width/depth should remain fixed). But more importantly, we find that the precise architectural hyperparameters are unimportant compared to the overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles of shallower models, which could potentially explain this finding. Earlier work [ZK16] has compared width and depth, and found that wide ResNets can outperform deep ResNets on image classification. Some studies fix computation per data example, which tends to scale in proportion to the number of model parameters, whereas we investigate scaling with both model size and the quantity of training computation.
Efficient Net [TL19] 倡导通过指数级扩展深度和宽度(使用不同的系数)来优化图像模型的性能,从而导致宽度作为深度的函数呈现幂律扩展。我们发现对于语言模型,在扩展时该幂应大致为一(因为宽度/深度应保持固定)。但更重要的是,我们发现相对于整体规模而言,具体的架构超参数对语言模型的重要性较低。在 [VWB16] 中,有人认为深度模型可以作为较浅模型的集成,这可能解释了这一发现。早期的工作 [ZK16] 比较了宽度和深度,并发现宽 ResNets 在图像分类上可以胜过深 ResNets。一些研究固定了每个数据示例的计算量,这往往与模型参数的数量成正比,而我们则研究了模型大小和训练计算量的扩展。
Various works [AS17, BHMM18] have investigated generalization in highly overparameterized models, finding a “jamming transition” $[\mathrm{GJS^{+}19}]$ when the model size reaches the dataset size (this may require training many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do not observe such a transition, and find that the necessary training data scales sublinearly in the model size. Expansions in the model size, particularly at large width [JGH18, $\mathrm{LXS^{+}19}$], may provide a useful framework for thinking about some of our scaling relations. Our results on optimization, such as the shape of learning curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions $[\mathrm{ZLN^{+}19}]$ in realistic settings. Making this connection quantitative will require a characterization of the Hessian spectrum [Pap18, GKX19, GARD18].
许多研究工作 [AS17, BHMM18] 已经调查了高度过参数化模型中的泛化问题,发现当模型大小达到数据集大小时存在一个“阻塞转变” $[\mathrm{GJS^{+}19}]$(这可能需要训练到典型实践的多个数量级之外,并且特别地不使用提前停止)。我们没有观察到这样的转变,并发现所需的训练数据量与模型大小呈次线性增长。在模型大小上的展开,特别是在大宽度时 [JGH18, $\mathrm{LXS^{+}19}$],可能为思考我们的一些缩放关系提供了一个有用的框架。我们关于优化的结果,例如学习曲线的形状,很可能可以用一个带噪声的二次模型来解释,该模型可以在现实设置中提供相当准确的预测 $[\mathrm{ZLN^{+}19}]$。将这种联系量化将需要对 Hessian 谱进行表征 [Pap18, GKX19, GARD18]。
8 Discussion
8 讨论
We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count $N$, dataset size $D$, and optimized training computation $C_{\mathrm{min}}$, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with $N,D,C_{\mathrm{min}}$ are power-laws, there are diminishing returns with increasing scale.
我们观察到语言模型的对数似然损失与非嵌入参数数量 $N$、数据集大小 $D$ 和优化训练计算量 $C_{\mathrm{min}}$ 存在一致的缩放关系,如公式 (1.5) 和 (1.6) 所示。相反,我们发现许多架构和优化超参数的影响非常微弱。由于与 $N,D,C_{\mathrm{min}}$ 的缩放关系是幂律关系,因此随着规模的增加,收益递减。
We were able to precisely model the dependence of the loss on $N$ and $D$, and alternatively on $N$ and $S$, when these parameters are varied simultaneously. We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework. One might interpret these relations as analogues of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.
我们能够精确地建模损失对 $N$ 和 $D$ 的依赖关系,或者替代地对 $N$ 和 $S$ 的依赖关系,当这些参数同时变化时。我们利用这些关系推导出训练大语言模型时的计算扩展、过拟合程度、提前停止步骤和数据需求。因此,我们的扩展关系超越了单纯的观察,提供了一个预测框架。可以将这些关系解释为理想气体定律的类比,该定律以一种普遍的方式关联气体的宏观属性,独立于其微观成分的大部分细节。
It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a ‘statistical mechanics’ underlying the ‘thermodynamics’ we have observed. Such a theory might make it possible to derive other more precise predictions, and provide a systematic understanding of the limitations of the scaling laws.
很自然地推测,这些缩放关系将适用于其他使用最大似然损失的生成式建模任务,也许在其他场景中也同样适用。为此,测试这些关系在其他领域中的表现将非常有趣,例如图像、音频和视频模型,也许还包括随机网络蒸馏。目前,我们还不知道哪些结果依赖于自然语言数据的结构,哪些是普遍适用的。找到一个可以从其中推导出这些缩放关系的理论框架也将令人兴奋:一种支撑我们所观察到的“热力学”的“统计力学”。这样的理论可能会使我们能够推导出更精确的预测,并提供对缩放定律局限性的系统理解。
In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: “more is different”. For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.
在自然语言领域,重要的是调查在损失函数上的持续改进是否能转化为对相关语言任务的改进。平滑的定量变化可能会掩盖重大的定性改进:“更多是不同的”。例如,经济的平稳总体增长并不能反映支撑它的具体技术发展。同样,语言模型损失的平稳改进可能隐藏了能力上看似定性的变化。
Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining $[\mathrm{HCC}^{+}18]$, which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used. Wide networks on the other hand are more amenable to parallelization $[\mathrm{SCP^{+}18}]$, since large layers can be split between multiple workers with less serial dependency. Sparsity [CGRS19, GRK17] or branching (e.g. [KSH12]) may allow for even faster training of large networks through increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, it might be possible to remain on the compute-efficient frontier for an entire training run.
我们的结果强烈表明,更大的模型将继续表现得更好,并且在样本效率方面将比以前所认识到的要高得多。大模型可能比大数据更为重要。在这种情况下,有必要进一步研究模型并行性。深度模型可以使用管道技术 $[\mathrm{HCC}^{+}18]$ 进行训练,该方法在设备之间按深度分割参数,但最终随着更多设备的使用需要增加批量大小。另一方面,宽网络更适合并行化 $[\mathrm{SCP^{+}18}]$ ,因为大的层可以在多个工作节点之间分割,依赖性较小。稀疏性 [CGRS19, GRK17] 或分支(例如 [KSH12])可能通过增加模型并行性实现更大网络的更快训练。并且使用类似 [WRH17, WYL19] 的方法,在训练过程中扩展网络,有可能在整个训练过程中保持计算效率前沿。
Acknowledgements
致谢
We would like to thank Shan Carter, Paul Christiano, Jack Clark, Ajeya Cotra, Ethan Dyer, Jason Eisner, Danny Hernandez, Jacob Hilton, Brice Menard, Chris Olah, and Ilya Sutskever for discussions and for feedback on drafts of this work.
我们要感谢 Shan Carter、Paul Christiano、Jack Clark、Ajeya Cotra、Ethan Dyer、Jason Eisner、Danny Hernandez、Jacob Hilton、Brice Menard、Chris Olah 和 Ilya Sutskever 在讨论中提供的帮助以及对本文草稿的反馈。
Appendices
附录
A Summary of Power Laws
幂律总结
For easier reference, we provide a summary below of the key trends described throughout the paper.
为了便于参考,我们在下文提供了本文所述关键趋势的总结。
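| Parameters | Data | Compute | Batch Size | Equation |
|---|---|---|---|---|
| $N$ | $\infty$ | $\infty$ | Fixed | $L(N)=\left(N_{c}/N\right)^{\alpha_{N}}$ |
| $\infty$ | $D$ | Early Stop | Fixed | $L(D)=\left(D_{c}/D\right)^{\alpha_{D}}$ |
| Optimal | $\infty$ | $C$ | Fixed | $L(C)=\left(C_{c}/C\right)^{\alpha_{C}}$ (naive) |
| $N_{\mathrm{opt}}$ | $D_{\mathrm{opt}}$ | $C_{\mathrm{min}}$ | $B\ll B_{\mathrm{crit}}$ | $L(C_{\mathrm{min}})=\left(C_{c}^{\mathrm{min}}/C_{\mathrm{min}}\right)^{\alpha_{C}^{\mathrm{min}}}$ |
| $N$ | $D$ | Early Stop | Fixed | $L(N,D)=\left[\left(N_{c}/N\right)^{\alpha_{N}/\alpha_{D}}+D_{c}/D\right]^{\alpha_{D}}$ |
| $N$ | $\infty$ | $S$ steps | $B$ | $L(N,S)=\left(N_{c}/N\right)^{\alpha_{N}}+\left(S_{c}/S_{\mathrm{min}}(S,B)\right)^{\alpha_{S}}$ |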
The empirical fitted values for these trends are:
这些趋势的经验拟合值为:
Table 4
表 4:
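| Power Law | Scale (tokenization-dependent) |
|---|---|
| $\alpha_{N}=0.076$ | $N_{c}=8.8\times10^{13}$ params (non-embed) |
| $\alpha_{D}=0.095$ | $D_{c}=5.4\times10^{13}$ tokens |
| $\alpha_{C}=0.057$ | $C_{c}=1.6\times10^{7}$ PF-days |
| $\alpha_{C}^{\mathrm{min}}=0.050$ | $C_{c}^{\mathrm{min}}=3.1\times10^{8}$ PF-days |
| $\alpha_{B}=0.21$ | $B_{*}=2.1\times10^{8}$ tokens |
| $\alpha_{S}=0.76$ | $S_{c}=2.1\times10^{3}$ steps |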
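For quick experimentation, the single-variable trends can be wrapped in a small helper. This is a minimal sketch with hypothetical names. The constants are the fitted values as read from the table above, with $C_{c}^{\mathrm{min}}$ taken to be $3.1\times10^{8}$ PF-days, and they depend on the tokenization used.

```python
# Fitted constants as read from the table above (names are illustrative).
FITS = {
    "N": (0.076, 8.8e13),      # alpha_N, N_c in non-embedding parameters
    "D": (0.095, 5.4e13),      # alpha_D, D_c in tokens
    "C_min": (0.050, 3.1e8),   # alpha_C^min, C_c^min in PF-days (assumed reading)
}


def power_law_loss(kind, x):
    """Return L(x) = (x_c / x)^alpha in nats/token for the chosen trend."""
    alpha, x_c = FITS[kind]
    return (x_c / x) ** alpha


for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}: L(N) ~ {power_law_loss('N', n):.2f} nats/token")
```

Each trend applies only when the other resources are not the bottleneck. For example, `power_law_loss("N", n)` describes models trained to convergence on effectively unlimited data.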
The optimal parameters for compute efficient training are given by:
计算高效的训练所需的最优参数由以下给出:
Table 5 Table 6
表 5 表 6
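| Compute-Efficient Value | Power Law | Scale |
|---|---|---|
| $N_{\mathrm{opt}}=N_{e}\cdot C_{\mathrm{min}}^{p_{N}}$ | $p_{N}=0.73$ | $N_{e}=1.3\times10^{9}$ params |
| $B_{\mathrm{crit}}=B_{e}\cdot C_{\mathrm{min}}^{p_{B}}$ | | |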
B Empirical Model of Compute-Efficient Frontier
B 计算高效前沿的经验模型
B.1 Defining Equations
B.1 定义方程
The power-law fit to the learning curves implies a simple prescription for compute-efficient training. In this appendix, we will derive the optimal performance, model size, and number of training steps as a function of the compute budget. We start with Equation (1.6), repeated here for convenience:
对学习曲线的幂律拟合给出了一种简单的计算高效训练方法。在本附录中,我们将推导最优性能、模型大小和训练步数作为计算预算的函数。我们从公式 (1.6) 开始,为方便起见在此重复列出:
$$
L\left(N,S\right)=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S}\right)^{\alpha_{S}}.
$$
$$
L\left(N,S\right)=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{S_{c}}{S}\right)^{\alpha_{S}}.
$$
Here, $S$ represents the number of parameter updates when training at the critical batch size [MKAT18], which was defined in Equation (5.2)$^{9}$:
这里,$S$ 表示以临界批量大小 [MKAT18] 进行训练时的参数更新次数,临界批量大小在公式 (5.2)$^{9}$ 中定义:
$$
B\left(L\right)=\frac{B_{*}}{L^{1/\alpha_{B}}}.
$$
$$
B\left(L\right)=\frac{B_{*}}{L^{1/\alpha_{B}}} .
$$
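As a quick illustration of how steeply the critical batch size rises as the loss falls, the relation above can be evaluated directly. This is a sketch with hypothetical names, using the rounded values of $B_{*}$ and $\alpha_{B}$ from Appendix A.

```python
ALPHA_B = 0.21   # fitted exponent (Appendix A)
B_STAR = 2.1e8   # tokens (Appendix A)


def critical_batch_size(loss):
    """B(L) = B_* / L^(1/alpha_B), in tokens."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


for loss in (4.0, 3.0, 2.0):
    print(f"L = {loss:.1f} nats/token -> B_crit ~ {critical_batch_size(loss):.2e} tokens")
```

Since $1/\alpha_{B}\approx4.8$, halving the loss raises the critical batch size by a factor of about $2^{4.8}\approx28$.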
We would like to determine optimal training parameters for a fixed compute budget, so we replace $S=C/\left(6NB\left(L\right)\right)$, where $C$ is the number of FLOPs used in the training run:
我们希望在固定的计算预算内确定最优的训练参数,所以我们代入 $S=C/\left(6NB\left(L\right)\right)$,其中 $C$ 是训练过程中使用的 FLOPs 数量:
$$
L\left(N,C\right)=\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}.
$$
$$
L(N,C)=\left(\frac{N_c}{N}\right)^{\alpha_N}+\left(6B_* S_c \frac{N}{L^{1/\alpha_B} C}\right)^{\alpha_S}.
$$
Now, we set $\partial_{N}L\vert_{C}=0$ to find the condition for optimality:
现在,我们设 $\partial_{N}L\vert_{C}=0$ 来寻找最优性的条件:
$$
\begin{array}{c}
{\displaystyle 0=\left.\frac{\partial L}{\partial N}\right|_{C}} \\
{\displaystyle =-\frac{\alpha_{N}}{N}\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\frac{\alpha_{S}}{N}\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}\left(1-5\frac{N}{L}\left.\frac{\partial L}{\partial N}\right|_{C}\right)} \\
{\displaystyle \implies\frac{\alpha_{N}}{\alpha_{S}}\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}=\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}}
\end{array}
$$
$$
\begin{array}{c}
{\displaystyle 0=\left.\frac{\partial L}{\partial N}\right|_{C}} \\
{\displaystyle =-\frac{\alpha_{N}}{N}\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\frac{\alpha_{S}}{N}\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}\left(1-5\frac{N}{L}\left.\frac{\partial L}{\partial N}\right|_{C}\right)} \\
{\displaystyle \implies\frac{\alpha_{N}}{\alpha_{S}}\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}=\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}}
\end{array}
$$
Equations (B.3) and (B.4) together determine the compute-efficient frontier.
公式 (B.3) 和 (B.4) 共同决定了计算高效的前沿。
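The step from the second to the third line of (B.4) can be spelled out. The block below is a short reconstruction of that step, not part of the original derivation. The factor written as 5 in (B.4) is read here as the rounded value of $1/\alpha_{B}\approx4.8$, which arises because $L$ itself appears inside the second term of (B.3). Differentiating that term at fixed $C$ and then imposing the optimality condition gives

$$
\begin{array}{l}
{\displaystyle \left.\frac{\partial}{\partial N}\left(\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}\right|_{C}=\frac{\alpha_{S}}{N}\left(\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}\left(1-\frac{1}{\alpha_{B}}\frac{N}{L}\left.\frac{\partial L}{\partial N}\right|_{C}\right),} \\
{\displaystyle \left.\frac{\partial L}{\partial N}\right|_{C}=0\ \implies\ \frac{\alpha_{N}}{\alpha_{S}}\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}=\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}C}\right)^{\alpha_{S}}.}
\end{array}
$$

The term in parentheses reduces to one exactly at the optimum, so the rounding of $1/\alpha_{B}$ to 5 does not affect the final condition.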
B.2 Efficient Training
B.2 高效训练
Now we assemble the implications of (B.3) and (B.4). First, note that inserting (B.4) into (B.3) yields
现在我们整合 (B.3) 和 (B.4) 的含义。首先,注意将 (B.4) 代入 (B.3) 可得
$$
L\left(N_{\mathrm{eff}}\left(C\right),C\right)=\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)L\left(N_{\mathrm{eff}},\infty\right),
$$
$$
L\left(N_{\mathrm{eff}}\left(C\right),C\right)=\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)L\left(N_{\mathrm{eff}},\infty\right),
$$
which implies that for compute-efficient training, we should train to a fixed percentage $\frac{\alpha_{N}}{\alpha_{S}}\approx10\%$ above the converged loss. Next, let’s determine how the optimal loss depends on the compute budget. Eliminating $N$ yields a power-law dependence of performance on compute:
这意味着对于计算高效的训练,我们应该训练到比收敛损失高出固定百分比 $\frac{\alpha_{N}}{\alpha_{S}}\approx10\%$ 的位置。接下来,让我们确定最优损失如何依赖于计算预算。消去 $N$ 后,性能对计算的依赖呈现幂律关系:
$$
L\left(C\right)=\left(\frac{C_{c}}{C}\right)^{\alpha_{C}}
$$
$$
L(C) = \left( \frac{C_c}{C} \right)^{\alpha_C}
$$
where we defined
我们定义了
$$
\begin{array}{l}
{\displaystyle \alpha_{C}=1/\left(1/\alpha_{S}+1/\alpha_{B}+1/\alpha_{N}\right)\approx0.052} \\
{\displaystyle C_{c}=6N_{c}B_{*}S_{c}\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)^{1/\alpha_{S}+1/\alpha_{N}}\left(\frac{\alpha_{S}}{\alpha_{N}}\right)^{1/\alpha_{S}}.}
\end{array}
$$
$$
\begin{array}{l}
{\displaystyle \alpha_{C}=1/\left(1/\alpha_{S}+1/\alpha_{B}+1/\alpha_{N}\right)\approx0.052} \\
{\displaystyle C_{c}=6N_{c}B_{*}S_{c}\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)^{1/\alpha_{S}+1/\alpha_{N}}\left(\frac{\alpha_{S}}{\alpha_{N}}\right)^{1/\alpha_{S}}.}
\end{array}
$$
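These two definitions are easy to evaluate. The sketch below uses the rounded fit values from Appendix A and assumes 1 PF-day $=10^{15}\times86400$ FLOPs for the unit conversion. The variable names are hypothetical.

```python
# Rounded fit values from Appendix A (assumed; names are illustrative).
ALPHA_N, ALPHA_S, ALPHA_B = 0.076, 0.76, 0.21
N_C, S_C, B_STAR = 8.8e13, 2.1e3, 2.1e8
FLOPS_PER_PF_DAY = 1e15 * 86400  # assumed unit conversion

# alpha_C is the harmonic combination of the three exponents.
alpha_c = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)

# C_c in FLOPs, following the definition above term by term.
ratio = ALPHA_N / ALPHA_S
c_c_flops = (6 * N_C * B_STAR * S_C
             * (1 + ratio) ** (1 / ALPHA_S + 1 / ALPHA_N)
             * (1 / ratio) ** (1 / ALPHA_S))

print(f"alpha_C ~ {alpha_c:.3f}")
print(f"C_c ~ {c_c_flops / FLOPS_PER_PF_DAY:.1e} PF-days")
```

With these inputs $\alpha_{C}$ evaluates to about 0.052, and $C_{c}$ comes out on the order of $10^{8}$ PF-days, the same order of magnitude as the fitted $C_{c}^{\mathrm{min}}$ listed in Appendix A.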
Similarly, we can eliminate $L$ to find $N\left(C\right)$ :
同样地,我们可以消除 $L$ 来求解 $N(C)$ :
$$
\frac{N\left(C\right)}{N_{c}}=\left(\frac{C}{C_{c}}\right)^{\alpha_{C}/\alpha_{N}}\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)^{1/\alpha_{N}}
$$
$$
\frac{N(C)}{N_c} = \left( \frac{C}{C_c} \right)^{\alpha_C / \alpha_N} \left( 1 + \frac{\alpha_N}{\alpha_S} \right)^{1 / \alpha_N}
$$
and
并且
$$
S\left(C\right)=\frac{C_{c}}{6N_{c}B_{*}}\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)^{-1/\alpha_{N}}\left(\frac{C}{C_{c}}\right)^{\alpha_{C}/\alpha_{S}}
$$
$$
S\left(C\right)=\frac{C_{c}}{6N_{c}B_{*}}\left(1+\frac{\alpha_{N}}{\alpha_{S}}\right)^{-1/\alpha_{N}}\left(\frac{C}{C_{c}}\right)^{\alpha_{C}/\alpha_{S}}
$$
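The closed forms for $L(C)$, $N(C)$ and $S(C)$ can be collected into one function and checked for self-consistency. This is a sketch under the same assumptions as the rest of the appendix (rounded Appendix A constants, budget given in FLOPs), with hypothetical names. It verifies that the returned quantities multiply back to the budget through $C=6NB(L)S$.

```python
# Rounded fit values from Appendix A (assumed; names are illustrative).
ALPHA_N, ALPHA_S, ALPHA_B = 0.076, 0.76, 0.21
N_C, S_C, B_STAR = 8.8e13, 2.1e3, 2.1e8
ALPHA_C = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)
R = 1.0 + ALPHA_N / ALPHA_S
C_C = (6 * N_C * B_STAR * S_C
       * R ** (1 / ALPHA_S + 1 / ALPHA_N)
       * (ALPHA_S / ALPHA_N) ** (1 / ALPHA_S))


def frontier(c_flops):
    """Return (L, N, S, B) on the compute-efficient frontier for a budget in FLOPs."""
    x = c_flops / C_C
    loss = x ** (-ALPHA_C)
    n = N_C * x ** (ALPHA_C / ALPHA_N) * R ** (1 / ALPHA_N)
    s = C_C / (6 * N_C * B_STAR) * R ** (-1 / ALPHA_N) * x ** (ALPHA_C / ALPHA_S)
    b = B_STAR / loss ** (1 / ALPHA_B)   # critical batch size at the final loss
    return loss, n, s, b


budget = 8.64e19  # one PF-day expressed in FLOPs
loss, n, s, b = frontier(budget)
print(f"L ~ {loss:.2f}, N ~ {n:.2e}, S ~ {s:.2e}, B ~ {b:.2e}")
print(f"6*N*B*S / C = {6 * n * b * s / budget:.6f}")
```

The final ratio equals one because the exponents satisfy $\alpha_{C}/\alpha_{N}+\alpha_{C}/\alpha_{B}+\alpha_{C}/\alpha_{S}=1$.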
$^{9}$ There is a slight ambiguity here: we can imagine training either at a constant batch size $B\left(L_{\mathrm{target}}\right)$, or we could instead train at a variable batch size $\tilde{B}\left(L\right)$, where $\tilde{B}$ is the instantaneous critical batch size (as opposed to $B$, which is the averaged version). These two prescriptions result in the same number of steps, so we can ignore this subtlety (see [MKAT18]).
这里存在一些细微的模糊性:我们可以想象以恒定批量大小 $\boldsymbol{B}\left(\boldsymbol{L}_{\mathrm{target}}\right)$ 进行训练,或者我们也可以选择以可变批量大小 $\tilde{B}\left(L\right)$ 进行训练,其中 $\tilde{B}$ 是瞬时临界批量大小(相对于 $B$,它是平均版本)。这两种方法会导致相同的步数,因此我们可以忽略这种细微差别(见 [MKAT18])。
B.3 Comparison to Inefficient
B.3 与低效方法的比较
Typically, researchers train models until they appear to be close to convergence. In this section, we compare the efficient training procedure described above to this more typical setup. We define the convergence factor $f$ as the percent deviation from the converged loss:
通常,研究人员会训练模型直到它们看起来接近收敛。在本节中,我们将上述高效的训练过程与这种更典型的设置进行比较。我们定义一个收敛因子 $f$ 作为与收敛损失的百分比偏差:
$$
L\left(N,C\right)=\left(1+f\right)L\left(N,\infty\right).
$$
$$
L(N,C)=(1+f) L(N,\infty).
$$
For compute-efficient training we have $f=\alpha_{N}/\alpha_{S}\approx10\%$ from the previous section, but researchers typically use a much smaller value. Here, we choose $f^{\prime}=2\%$ as an estimate. For a fixed value of the loss, we predict:
对于计算高效的训练,由上一节可得 $f=\alpha_{N}/\alpha_{S}\approx10\%$,但研究人员通常使用小得多的值。在这里,我们选择 $f^{\prime}=2\%$ 作为估计。对于固定的损失值,我们预测:
$$
\begin{array}{l}
{\displaystyle \frac{N_{f}}{N_{f^{\prime}}}=\left(\frac{1+f}{1+f^{\prime}}\right)^{1/\alpha_{N}}\approx2.7,} \\
{\displaystyle \frac{S_{f}}{S_{f^{\prime}}}=\left(\frac{1+\frac{1}{f}}{1+\frac{1}{f^{\prime}}}\right)^{1/\alpha_{S}}\approx0.13,} \\
{\displaystyle \frac{C_{f}}{C_{f^{\prime}}}=\frac{N_{f}}{N_{f^{\prime}}}\frac{S_{f}}{S_{f^{\prime}}}\approx0.35}
\end{array}
$$
$$
\begin{array}{l}
{\displaystyle \frac{N_{f}}{N_{f^{\prime}}}=\left(\frac{1+f}{1+f^{\prime}}\right)^{1/\alpha_{N}}\approx2.7,} \\
{\displaystyle \frac{S_{f}}{S_{f^{\prime}}}=\left(\frac{1+\frac{1}{f}}{1+\frac{1}{f^{\prime}}}\right)^{1/\alpha_{S}}\approx0.13,} \\
{\displaystyle \frac{C_{f}}{C_{f^{\prime}}}=\frac{N_{f}}{N_{f^{\prime}}}\frac{S_{f}}{S_{f^{\prime}}}\approx0.35}
\end{array}
$$
So compute-efficient training uses 7.7x fewer parameter updates, 2.7x more parameters, and 65% less compute to reach the same loss.
因此,要达到相同的损失,计算高效的训练使用的参数更新次数减少为原来的 1/7.7,参数量增加到 2.7 倍,计算量减少 65%。
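These ratios follow from the definition of the convergence factor. At a fixed loss $L=(1+f)(N_{c}/N)^{\alpha_{N}}$, the model size scales as $(1+f)^{1/\alpha_{N}}$. The step-dependent term of $L(N,S)$ equals $fL/(1+f)$, so the step count scales as $(1+1/f)^{1/\alpha_{S}}$. The sketch below evaluates them with the rounded Appendix A exponents; the function name is hypothetical.

```python
ALPHA_N, ALPHA_S = 0.076, 0.76   # rounded fit values from Appendix A


def ratios(f, f_prime):
    """Ratios of model size, steps, and compute at fixed loss for factors f vs f'."""
    n_ratio = ((1 + f) / (1 + f_prime)) ** (1 / ALPHA_N)
    s_ratio = ((1 + 1 / f) / (1 + 1 / f_prime)) ** (1 / ALPHA_S)
    return n_ratio, s_ratio, n_ratio * s_ratio


n_ratio, s_ratio, c_ratio = ratios(f=ALPHA_N / ALPHA_S, f_prime=0.02)
print(f"N ratio ~ {n_ratio:.2f}, S ratio ~ {s_ratio:.2f}, C ratio ~ {c_ratio:.2f}")
```

The compute ratio is the product of the other two because the batch size is fixed by the target loss and cancels.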
B.4 Suboptimal Model Sizes
B.4 次优模型尺寸
We can solve A.1 to find an expression for the amount of compute needed to reach a given value of the loss $L$ with a model of size $N$ :
我们可以求解 A.1 以找到达到给定损失值 $L$ 所需的计算量的表达式,对于规模为 $N$ 的模型:
$$
C\left(N,L\right)=\left(6B_{*}S_{c}\frac{N}{L^{1/\alpha_{B}}}\right)\left(L-\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}\right)^{-1/\alpha_{S}}.
$$
$$
C(N,L)=\left(6 B_{*} S_{c} \frac{N}{L^{1/\alpha_{B}}} \right) \left(L-\left(\frac{N_{c}}{N}\right)^{\alpha_{N}} \right)^{-1/\alpha_{S}} .
$$
Using A.6 and A.9, we can eliminate $L$ in favor of $N_{\mathrm{eff}}\left(L\right)$ , the model size which reaches $L$ most efficiently. From there, we find an expression for the excess compute needed as a consequence of using a suboptimal model size:
使用 A.6 和 A.9,我们可以用 $N_{\mathrm{eff}}\left(L\right)$ 替换 $L$,即达到 $L$ 最有效的模型大小。从那里,我们找到了由于使用次优模型大小所需的额外计算量的表达式:
$$
\frac{C\left(N,N_{\mathrm{eff}}\right)}{C\left(N_{\mathrm{eff}},N_{\mathrm{eff}}\right)}=\frac{N}{N_{\mathrm{eff}}}\left[1+\frac{\alpha_{S}}{\alpha_{N}}\left(1-\left(\frac{N_{\mathrm{eff}}}{N}\right)^{\alpha_{N}}\right)\right]^{-1/\alpha_{S}}.
$$
$$
\frac{C(N, N_{\mathrm{eff}})}{C(N_{\mathrm{eff}}, N_{\mathrm{eff}})} = \frac{N}{N_{\mathrm{eff}}} \left[ 1 + \frac{\alpha_{S}}{\alpha_{N}} \left( 1 - \left( \frac{N_{\mathrm{eff}}}{N} \right)^{\alpha_{N}} \right) \right]^{-1 / \alpha_{S}}.
$$
The result is shown in Figure X. Models between 0.6x and 2.2x the optimal size can be used with only a 20% increase in compute budget. Using a smaller model is useful when accounting for the cost of inference. A larger model can be trained to the same level of performance in fewer steps, allowing for more parallelism and faster training if sufficient hardware is available (see Figure Y):
结果如图 X 所示。可以在仅增加 20% 的计算预算的情况下使用 0.6x 到 2.2x 之间的最优模型大小。在考虑推理成本时,使用较小的模型是有用的。如果硬件条件充足,较大的模型可以用更少的步骤训练到相同的性能水平,从而实现更多的并行性和更快的训练(见图 Y):
$$
\frac{S\left(N,N_{\mathrm{eff}}\right)}{S\left(N_{\mathrm{eff}},N_{\mathrm{eff}}\right)}=\left[1+\frac{\alpha_{S}}{\alpha_{N}}\left(1-\left(\frac{N_{\mathrm{eff}}}{N}\right)^{\alpha_{N}}\right)\right]^{-1/\alpha_{S}}.
$$
$$
\frac{S(N, N_{\mathrm{eff}})}{S(N_{\mathrm{eff}}, N_{\mathrm{eff}})} = \left[ 1 + \frac{\alpha_S}{\alpha_N} \left( 1 - \left( \frac{N_{\mathrm{eff}}}{N} \right)^{\alpha_N} \right) \right]^{-1 / \alpha_S}.
$$
A 2.2x larger model requires 45% fewer steps at a cost of 20% more training compute. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve after initial transient effects.
一个 2.2 倍大的模型所需步数减少 45%,代价是训练计算量增加 20%。请注意,这个公式不应被用于非常大的模型,因为它仅在初始瞬态效应之后的学习曲线幂律区域中有效。
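Both ratios in this section are simple to evaluate. The sketch below uses hypothetical function names and the rounded Appendix A exponents. It checks the quoted trade-offs for models 0.6x and 2.2x the optimal size.

```python
ALPHA_N, ALPHA_S = 0.076, 0.76   # rounded fit values from Appendix A


def steps_ratio(size_ratio):
    """S(N, N_eff) / S(N_eff, N_eff) for size_ratio = N / N_eff."""
    bracket = 1 + (ALPHA_S / ALPHA_N) * (1 - size_ratio ** (-ALPHA_N))
    if bracket <= 0:
        # The converged loss of such a small model is above the target loss.
        raise ValueError("model too small to reach the target loss")
    return bracket ** (-1 / ALPHA_S)


def compute_ratio(size_ratio):
    """C(N, N_eff) / C(N_eff, N_eff): excess compute from a suboptimal size."""
    return size_ratio * steps_ratio(size_ratio)


for r in (0.6, 1.0, 2.2):
    print(f"N/N_eff = {r}: compute x{compute_ratio(r):.2f}, steps x{steps_ratio(r):.2f}")
```

The asymmetry is worth noting: the bracket becomes non-positive once $(N_{\mathrm{eff}}/N)^{\alpha_{N}}\geq1+\alpha_{N}/\alpha_{S}$, so a sufficiently small model can never reach the target loss, whereas oversized models only pay a gradually increasing compute penalty.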
C Caveats
C 注意事项
In this section we list some potential caveats to our analysis.
在本节中,我们列出了一些可能的分析注意事项。
• At present we do not have a solid theoretical understanding for any of our proposed scaling laws. The scaling relations with model size and compute are especially mysterious. It may be possible to understand scaling at very large $D$ holding model size fixed [AS17], and also the shape of learning curves late in training, by modeling the loss with a noisy quadratic. But the scaling with $D$ at very large model size still remains mysterious. Without a theory or a systematic understanding of the corrections to our scaling laws, it’s difficult to determine in what circumstances they can be trusted.
• 目前我们对所提出的任何缩放定律都没有坚实的理论理解。模型大小和计算资源的缩放关系尤其神秘。通过使用带噪声的二次模型来建模损失,可能可以理解在非常大的 $D$ 下保持模型大小固定 [AS17] 的缩放情况,以及训练后期学习曲线的形状。但是,在非常大的模型大小下 $D$ 的缩放仍然令人费解。没有理论或系统地理解对我们缩放定律的修正,很难确定在什么情况下可以信任这些定律。

Figure 16 Left: We characterize the step on which early stopping occurs, as a function of the extent of overfitting. The red line indicates a lower bound for early stopping that is derived in Section 5.3. Right: We display train and test loss for a series of 300M parameter models trained on different sized dataset subsamples. The test loss typically follows that of a run done with unrestricted data until diverging. Note that the degree of overfitting (as compared to the infinite data limit) is significantly overestimated by $L_{\mathrm{test}}-L_{\mathrm{train}}$ (denoted by a black bar for each run).
图 16 左:我们表征了早停发生的步数与过拟合程度的关系。红色线条表示在第 5.3 节推导出的早停下界。右:我们展示了在不同大小的数据子集上训练的 300M 参数模型的训练损失和测试损失。测试损失通常会跟随使用不受限数据运行的损失,直到出现分歧。请注意,过拟合的程度(与无限数据极限相比)通过 $L_{\mathrm{test}}-L_{\mathrm{train}}$ (每个运行用黑色条表示)显著高估。
D Supplemental Figures
D 补充图
D.1 Early Stopping and Test vs Train
D.1 提前停止和测试与训练对比
In section 5.3 we described the result shown in Figure 16, which provides a prediction for a lower bound on the early stopping step. We also show the train and test loss for a given model size when training on different sized datasets.
在第 5.3 节中,我们描述了图 16 所示的结果,该图提供了一个关于提前停止步骤下限的预测。我们还展示了在不同规模的数据集上训练时,给定模型规模的训练损失和测试损失。
D.2 Universal Transformers
D.2 通用 Transformer
We compare the performance of standard Transformers to recurrent Transformers $[\mathrm{DGV}^{+}18]$ in Figure 17. These models re-use parameters, and so perform slightly better as a function of $N$ , but slightly worse as a function of compute $C$ . We include several different possibilities for parameter re-use.
我们在图 17 中比较了标准 Transformer 和循环 Transformer $[\mathrm{DGV}^{+}18]$ 的性能。这些模型重用了参数,因此按参数量 $N$ 比较时表现略好,但按计算量 $C$ 比较时表现略差。我们包含了多种不同的参数重用可能性。
D.3 Batch Size
D.3 批量大小 (Batch Size)
We measure the critical batch size using the data displayed in figure 18. This made it possible to estimate $B_{\mathrm{crit}}(L)$ in figure 10.
我们使用图 18 中显示的数据来测量关键批次大小。这使得可以在图 10 中估算 $B_{\mathrm{crit}}(L)$ 。

Figure 17 We compare recurrent Transformers $[\mathrm{DGV}^{+}18]$ , which re-use parameters, to standard Transformers. Recurrent Transformers perform slightly better when comparing models with equal parameter count, but slightly worse when accounting for reuse and comparing per FLOP.
图 17: 我们比较了循环 Transformer (Recurrent Transformers) $[\mathrm{DGV}^{+}18]$ ,其重用参数,与标准的 Transformer。当比较参数量相等的模型时,循环 Transformer 表现略好;但在考虑参数重用并按每次浮点运算 (FLOP) 比较时,表现略差。

Figure 18 These figures demonstrate fits to Equation (5.1) for a large number of values of the loss $L$ , and for two different Transformer model sizes. These fits were used to measure $B_{\mathrm{crit}}(L)$ for Figure 10.
图 18: 这些图表展示了对公式 (5.1) 的拟合,针对大量损失值 $L$ ,以及两种不同大小的 Transformer 模型。这些拟合用于测量图 10 中的 $B_{\mathrm{crit}}(L)$ 。
D.4 Sample Efficiency vs Model Size
D.4 样本效率 vs 模型大小
It is easy to see from figure 2 that larger models train faster, and are therefore more sample efficient. We provide another way of looking at this phenomenon in figure 19, which shows when different models reach various fixed values of the loss.
从图 2 可以看出,较大的模型训练速度更快,因此样本效率更高。我们在图 19 中提供了另一种观察这一现象的方式,该图显示了不同模型达到各种固定损失值的时间。

Figure 19 The minimum number of serial steps needed to reach any fixed value of the test loss decreases precipitously with model size. Sample efficiency (shown here for training far below the critical batch size) improves greatly as well, improving by a factor of almost 100 when comparing the smallest possible model to a very large one.

图 19: 达到任何固定测试损失值所需的最小串行步骤数量随着模型大小的增加而急剧减少。样本效率(此处显示的是远低于临界批量大小的训练)也大大提高,在比较最小可能模型与非常大的模型时,样本效率提高了近 100 倍。

Figure 20 This figure provides information about the performance per token as a function of model size and training time. Left: Loss per token as a function of its position $T$ in the 1024-token context. Loss scales predictably as a power-law in $T$ . Right: Test loss per token as a function of training step.

图 20: 该图提供了关于模型大小和训练时间对每个 Token 性能的影响信息。左:每个 Token 的损失作为其在 1024-Token 上下文中的位置 $T$ 的函数。损失随 $T$ 按幂律可预测地变化。右:每个 Token 的测试损失作为训练步骤的函数。

Figure 21 In addition to the averaged loss, individual tokens within the 1024-token context also improve smoothly as model size increases. Training runs with shorter context $n_{\mathrm{ctx}}=8$ (dashed lines) perform better on early tokens, since they can allocate all of their capacity to them.
图 21: 除了平均损失外,1024-token 上下文中的各个 token 随着模型大小的增加也平稳改进。较短上下文 $n_{\mathrm{ctx}}=8$ (虚线)的训练运行在早期 token 上表现更好,因为它们可以将所有容量分配给这些 token。
D.5 Context Dependence
D.5 上下文依赖性
The trends for loss as a function of model size are displayed for different tokens in the context in Figure 21. We see that models trained on $n_{\mathrm{ctx}} = 1024$ show steady improvement with model size on all but the first token.
图 21 展示了上下文中不同 Token 的损失随模型大小的变化趋势。我们看到,在 $n_{\mathrm{ctx}} = 1024$ 上训练的模型,除第一个 Token 外,在所有 Token 上都随模型大小的增加而稳定改进。
Fixing model size, it appears that the loss scales as a power-law as a function of position $T$ in the context, see Figure 20. This may be a consequence of underlying power-law correlations in language [EP94, ACDE12, LT16], or a more general feature of the model architecture and optimization. It provides some suggestion for the potential benefits (or lack thereof) from training on larger contexts. Not only do larger models converge to better performance at $T=1024$ , but they also improve more quickly at early tokens, suggesting that larger models are more efficient at detecting patterns with less contextual information. In the right-hand plot we show how per-token performance varies for a fixed model as a function of the training step. The model begins by learning short-range information, and only learns longer-range correlations later in training.
固定模型大小后,损失似乎随上下文中的位置 $T$ 呈幂律变化,见图 20。这可能是语言中潜在的幂律相关性 [EP94, ACDE12, LT16] 的结果,或者是模型架构和优化的一个更普遍的特征。它为在更大上下文中进行训练的潜在好处(或缺乏)提供了一些线索。不仅更大的模型在 $T=1024$ 时收敛到更好的性能,而且它们在早期 Token 上也更快地改进,表明更大的模型在检测模式时效率更高,即使上下文信息较少。在右侧图表中,我们展示了对于固定模型,每个 Token 的性能如何随训练步骤而变化。模型首先学习短距离信息,并且只在训练后期才学习更长距离的相关性。
We have also included models trained with a tiny context $n_{\mathrm{ctx}} = 8$ in order to compare with our longer context models. Even modestly sized models trained on $n_{\mathrm{ctx}} = 8$ can dominate our largest $n_{\mathrm{ctx}} = 1024$ models on very early tokens. This also suggests that further improvements should be possible with much larger models trained on large contexts.
我们还包含了使用极小上下文 $n_{\mathrm{ctx}} = 8$ 训练的模型,以便与我们的长上下文模型进行比较。即使是在 $n_{\mathrm{ctx}} = 8$ 上训练的中等规模模型,在非常早期的 Token 上也可以超过我们最大的 $n_{\mathrm{ctx}} = 1024$ 模型。这也表明,通过在大上下文中训练更大规模的模型,进一步的改进是可能的。
D.6 Learning Rate Schedules and Error Analysis
D.6 学习率调度和误差分析
We experimented with a variety of learning rates and schedules. A host of schedules and resulting test performances for a small language model are plotted in Figure 22. We conclude that the choice of learning rate schedule is mostly irrelevant, as long as the total summed learning rate is sufficiently large, and the schedule includes a warmup period and a final decay to near-vanishing learning rate. Variations among schedules appear to be statistical noise, and provide a rough gauge for the scale of variation between different training runs. Experiments on larger models suggest that the variation in the final test loss between different random seeds is roughly constant in magnitude for different model sizes.
我们实验了各种学习率和调度方案。图 22 展示了一个小语言模型的多种调度方案及其相应的测试性能。我们得出结论,只要总累积学习率足够大,并且调度方案包含预热期和最终衰减到接近零的学习率,那么学习率调度方案的选择基本上是无关紧要的。不同调度方案之间的差异似乎只是统计噪声,并为不同训练运行之间的变化规模提供了一个大致的衡量标准。在更大模型上的实验表明,不同随机种子之间的最终测试损失的变化幅度大致恒定,与模型大小无关。
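One schedule satisfying the conditions described above (a warmup period followed by a decay to a small final value) can be sketched as follows. This is an illustration only, not the schedule used in the experiments: the function name, the linear warmup shape, the cosine decay, and every numeric value are assumptions chosen for the example. The summed learning rate is computed at the end because the text identifies it as the quantity that needs to be sufficiently large.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, final_fraction=0.01):
    """Hypothetical schedule: linear warmup to peak_lr, then cosine decay.

    The decay ends at peak_lr * final_fraction on the last step.
    """
    if step < warmup_steps:
        # Linear warmup; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    floor = peak_lr * final_fraction
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 500
PEAK_LR = 1e-3

schedule = [
    warmup_cosine_lr(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR) for s in range(TOTAL_STEPS)
]
summed_lr = sum(schedule)  # the "total summed learning rate" of this schedule
print(f"peak={max(schedule):.2e} final={schedule[-1]:.2e} summed={summed_lr:.3f}")
```

Comparing `summed_lr` across candidate schedules gives a quick way to check whether two schedules are in the regime where, according to the text, their final performance should differ only by noise.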

Figure 22 We test a variety of learning rate schedules including cosine decay, linear decay, as well as other faster/slower decay schedules on a 3 million parameter model, shown on the left. For these experiments we do not decay to zero, since we find that this tends to give a fixed improvement close to the end of training. We find that, as long as the learning rate is not too small and does not decay too quickly, performance does not depend strongly on learning rate. Run-to-run variation is at the level of 0.05 in the loss, so averaging multiple runs is necessary to validate performance changes smaller than this level.
图 22: 我们测试了多种学习率调度策略,包括余弦衰减、线性衰减以及其他更快/更慢的衰减策略,在一个包含 3 百万参数的模型上进行实验,结果如左图所示。在这些实验中,我们不会将学习率衰减到零,因为我们发现这往往会在训练接近尾声时提供一个固定的改进。我们发现,只要学习率不是太小且不衰减得太快,性能对学习率的依赖性并不强。运行间的差异在损失值上约为 0.05,因此需要平均多次运行的结果来验证小于这一水平的性能变化。

Figure 23 The trend for performance as a function of parameter count, $L(N)$ , is fit better by a power law than by other functions such as a logarithm at a qualitative level.

图 23: 在定性层面上,性能随参数数量变化的趋势 $L(N)$ 更符合幂律函数,而不是其他函数如对数函数。
We found that larger models require a smaller learning rate to prevent divergence, while smaller models can tolerate a larger learning rate. To implement this, the following rule of thumb was used for most runs:
我们发现,较大的模型需要较小的学习率以防止发散,而较小的模型可以容忍较大的学习率。为了实现这一点,大多数运行中使用了以下经验法则:
$$
\mathrm{LR}(N) \approx 0.003239 - 0.0001395 \log(N)
$$
We expect that this formula could be improved. There may be a dependence on network width, likely set by the initialization scale. The formula also breaks down for $N > 10^{10}$ parameters. Nevertheless, we found that it works sufficiently well for the models we considered.
我们期望这个公式能够得到改进。它可能依赖于网络宽度,这很可能由初始化规模决定。当 $N > 10^{10}$ 参数时,该公式也会失效。尽管如此,我们发现它对于我们考虑的模型来说已经足够好。
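The rule of thumb and its breakdown point can be made concrete with a short sketch. It assumes that $\log$ denotes the natural logarithm. The text does not state the base, but the natural logarithm is the reading under which the predicted learning rate reaches zero just above $10^{10}$ parameters, matching the stated breakdown. The function name `lr_rule` is illustrative.

```python
import math

LR_INTERCEPT = 0.003239
LR_SLOPE = 0.0001395


def lr_rule(n_params):
    """Rule-of-thumb learning rate; log is taken as natural log (an assumption)."""
    return LR_INTERCEPT - LR_SLOPE * math.log(n_params)


# Parameter count at which the rule predicts a learning rate of zero.
zero_crossing = math.exp(LR_INTERCEPT / LR_SLOPE)

for n in (1e6, 1e8, 1e10):
    print(f"N = {n:.0e}: LR ~ {lr_rule(n):.6f}")
print(f"rule reaches zero near N ~ {zero_crossing:.2e}")
```

Beyond the zero crossing the rule returns a negative learning rate, which is one way to see why it cannot be extrapolated to larger models.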
D.7 Fit Details and Power Law Quality
D.7 拟合细节和幂律质量
We experimented with a number of functional forms for the fits to $L(N),L(C)$ , and $L(D)$ ; the power-law fits were qualitatively much more accurate than other functions such as logarithms (see Figure 23).
我们尝试了多种函数形式来拟合 $L(N), L(C)$ 和 $L(D)$;幂律拟合在定性上比其他函数(如对数)准确得多(见图 23)。
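The kind of comparison described here can be sketched with ordinary least squares: a power law is a straight line in $(\log N, \log L)$ , while a logarithmic form is a straight line in $(\log N, L)$ . The example below uses synthetic, noise-free data generated from a power law, so it only demonstrates the procedure and does not reproduce the paper's measurements. The constants, helper names, and the choice of RMS residual as the quality measure are all assumptions for illustration.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rms(residuals):
    """Root-mean-square of a list of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Synthetic data (illustrative constants, not measurements): L = (N_c / N)^alpha.
ALPHA = 0.076
N_C = 8.8e13
sizes = [10.0 ** k for k in range(5, 10)]
losses = [(N_C / n) ** ALPHA for n in sizes]
log_n = [math.log(n) for n in sizes]

# Power-law candidate: log L is linear in log N.
pl_slope, pl_icpt = linear_fit(log_n, [math.log(l) for l in losses])
pl_rms = rms([math.exp(pl_icpt + pl_slope * x) - l for x, l in zip(log_n, losses)])

# Logarithmic candidate: L itself is linear in log N.
lg_slope, lg_icpt = linear_fit(log_n, losses)
lg_rms = rms([lg_icpt + lg_slope * x - l for x, l in zip(log_n, losses)])

print(f"power-law exponent: {-pl_slope:.4f}, RMS residual: {pl_rms:.2e}")
print(f"logarithmic fit RMS residual: {lg_rms:.2e}")
```

With real measurements both fits would have nonzero residuals, and the comparison would rest on which functional form leaves the smaller and less systematic residual across the range of $N$ .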
For $L(C)$ , we do not include small models with only 1 layer in the fit, as the transition from 1 to 2 layers causes a noticeable lump in the data. For $L(N)$ we also do not include very small models with only 1 layer in the fit, and we exclude the largest models that have not trained fully to convergence. Fit parameters change marginally if we do include them, and the trend extrapolates well in both directions regardless.
对于 $L(C)$ ,我们不包括只有 1 层的小模型,因为从 1 层到 2 层的过渡会在数据中产生明显的变化。对于 $L(N)$ ,我们同样不包括只有 1 层的非常小的模型,并且排除那些尚未完全收敛的最大模型。如果包含这些模型,拟合参数只会发生微小变化,无论是否包含它们,趋势在这两个方向上都能很好地外推。
D.8 Generalization and Architecture
D.8 泛化和架构
In Figure 24 we show that generalization to other data distributions does not depend on network depth when we hold the total parameter count fixed. It seems to depend only on the performance on the training distribution.
在图 24 中,我们展示了当固定总参数数量时,泛化到其他数据分布不依赖于网络深度。它似乎仅依赖于训练分布上的性能。

Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 billion parameters. We observe no effect of depth on generalization; generalization performance depends primarily on training distribution performance. The 12-layer model overfit the Internet Books dataset and we show the early-stopped performance; we have not seen this surprising result in other experiments.
图 24: 我们在一系列数据集上对参数量约为 15 亿的模型进行了评估。我们观察到深度对泛化没有影响;泛化性能主要取决于训练分布性能。12 层模型在 Internet Books 数据集上过拟合,我们展示了提前停止的性能;我们在其他实验中未见过这一令人惊讶的结果。
