Large-Batch Training of Large Language Models: Exploration and Practice

2024-01-24 11:01

© Author | 牛信尧

Research area | Large language models

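Foreword

While training language models recently, I ran into a question. When many GPUs are available, the most effective way to improve training efficiency is to increase data parallelism (whether by raising the batch size itself or through gradient accumulation). Some publicly documented training recipes contain relevant information, quoted below: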

GPT-3

larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size.

PaLM

For all models, we increase the batch size during training. For the largest model, we use batch size 512 (1M tokens) until step 50k, then double it to 1024 (2M tokens) until step 115k, and finally double again it to 2048 (4M tokens) until training is complete at step 255k. The smaller models followed similar schedules. The reason for using such batch size schedule is twofold: (1) smaller batch sizes are more sample efficient (i.e., better loss as a function of tokens seen) earlier in training, while larger batch sizes are beneficial later in training due to better gradient estimates (Smith et al., 2018; McCandlish et al., 2018), and (2) larger batch sizes result in larger matrix multiplication dimensions, which increases TPU efficiency.

If the smaller model were trained using fewer TPU chips than the larger model, this would proportionally increase the wall-clock time of training, since the total training FLOP count is the same. If it were trained using the same number of TPU chips, it would be very difficult to maintain TPU compute efficiency without a drastic increase in batch size. The batch size of PaLM 540B is already 4M tokens, and it is unclear if even larger batch sizes would maintain sample efficiency.
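The staged schedule in the PaLM quote is easy to write down explicitly. The following is a minimal sketch, not PaLM's actual code: the names `PALM_STAGES`, `batch_size_at`, and `total_tokens` are hypothetical, the sequence length of 2048 is inferred from "batch size 512 (1M tokens)", and switching exactly at the quoted step numbers is an assumption.

```python
# Staged batch-size schedule taken from the PaLM quote (largest model).
# Each entry: (step at which the stage ends, sequences per batch).
PALM_STAGES = [(50_000, 512), (115_000, 1024), (255_000, 2048)]

# Inferred from "batch size 512 (1M tokens)": 2**20 / 512 = 2048 tokens per sequence.
SEQ_LEN = 2048


def batch_size_at(step: int) -> int:
    """Return the batch size (in sequences) used at a given training step."""
    for end_step, batch_size in PALM_STAGES:
        if step < end_step:
            return batch_size
    raise ValueError(f"step {step} is past the end of training")


def total_tokens() -> int:
    """Total number of tokens consumed over the whole schedule."""
    tokens, start = 0, 0
    for end_step, batch_size in PALM_STAGES:
        tokens += (end_step - start) * batch_size * SEQ_LEN
        start = end_step
    return tokens


print(batch_size_at(0), batch_size_at(50_000), batch_size_at(254_999))
print(f"{total_tokens() / 1e9:.1f}B tokens")
```

Under these assumptions the schedule consumes about 776B tokens, and roughly three quarters of them are processed at the largest batch size.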

MT-NLG

A large batch size can be an effective way of increasing compute efficiency, because it increases the arithmetic intensity of a kernel and helps amortize the time spent stalled on communication and synchronization. However, the batch size that a model can be trained with has an upper bound; using too large of a batch size can have negative effects on the model quality. Over the first 12 billion tokens, we started at a batch size of 32 and gradually increased the batch size in increments of 32, until we reach the final batch size of 1920.

GLM 130B

We warm-up the batch size from 192 to 4224 over the first 2.5% samples. The memory per processor is too small => Require too many pipeline stages => Batch size is too large (up to 12,000) => Harm the model’s convergency.

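Judging from this public information, batch sizes are set fairly empirically. Our experiments, however, show that substantially increasing the batch size can cause problems. Since hyperparameter tuning for large models is currently extremely expensive, one key question is how the learning rate (LR) should change together with the batch size.

A common heuristic is that the LR should be proportional to the square root of the factor by which the batch size grows, so that the variance grows in proportion to the gradient. In our experiments, however, this did not fully hold. This article therefore draws on the analysis and conclusions of two earlier works to discuss whether increasing data parallelism can always improve training efficiency, and how batch size and LR relate.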
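For concreteness, the square-root heuristic can be written as a small helper. This is a sketch only: `scale_lr` and the base values are hypothetical, and the linear rule is included purely as a commonly cited alternative for comparison, not something this article recommends.

```python
import math


def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "sqrt") -> float:
    """Scale a tuned learning rate to a new batch size with a common heuristic."""
    k = new_batch / base_batch  # batch-size growth factor
    if rule == "sqrt":
        return base_lr * math.sqrt(k)  # LR grows with the square root of k
    if rule == "linear":
        return base_lr * k  # LR grows in proportion to k
    raise ValueError(f"unknown rule: {rule}")


base_lr, base_batch = 3e-4, 256  # hypothetical tuned starting point
for new_batch in (512, 1024, 4096):
    sqrt_lr = scale_lr(base_lr, base_batch, new_batch, "sqrt")
    linear_lr = scale_lr(base_lr, base_batch, new_batch, "linear")
    print(f"batch {new_batch}: sqrt rule {sqrt_lr:.2e}, linear rule {linear_lr:.2e}")
```

The two rules diverge quickly as the growth factor increases, which is one reason a heuristic that was tuned at one batch size can mislead at another.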

TL;DR

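OpenAI found a function that can guide the choice of batch size: it gives the optimal batch size at a given loss. Here, optimal means balancing training speed against total compute. Below this value, raising the batch size speeds up training; above it, raising the batch size further brings little improvement in the training time needed to reach a given performance.

Note that this conclusion depends only weakly on model size and task type (CV/NLP/RL), but depends strongly on the learning rate schedule.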

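Does increasing data parallelism always speed up training?

When the batch size is small, the update direction (that is, the approximation of the true gradient) has high variance, so each gradient update is mostly noise. Over several updates the variance cancels out and the model is pushed in the right direction overall, but individual updates may not be very useful, and they could instead be applied in one go (as a single update with a larger batch size).

Conversely, when the batch size is very large, any two batches sampled from the training data give very similar gradients (because both almost exactly match the true gradient). In this regime, increasing the batch size barely improves anything, because the estimate of the true gradient cannot get better. In other words, each step processes more data without reducing the number of steps needed over the whole run, so total training time barely improves. Worse, the total FLOPs increase.

The line plots in these studies show that a larger batch size generally needs fewer training steps, but it correspondingly increases the amount of data that must be processed. When the batch size is doubled from 2048, the number of steps needed to reach the same performance barely improves, yet twice the compute is spent. Google's empirical study made a similar observation: under a fixed epoch budget, once the batch size passes a critical value, model performance degrades as the batch size increases.

Taken together, these results suggest that there is a critical point in the degree of data parallelism. Finding it lets us balance training efficiency against the final quality of the model.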
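The trade-off around this critical point can be sketched with an idealized curve of the kind used in the OpenAI work, where the number of steps S and the number of processed examples E satisfy (S/S_min - 1)(E/E_min - 1) = 1 and the critical batch size is E_min/S_min. The function name and the values of `b_crit` and `s_min` below are hypothetical and chosen only for illustration.

```python
def steps_and_examples(batch_size: float, b_crit: float, s_min: float):
    """Idealized cost of reaching a fixed loss at a given batch size.

    s_min is the number of steps needed with an arbitrarily large batch;
    b_crit is the critical batch size, so E_min = s_min * b_crit.
    """
    steps = s_min * (1 + b_crit / batch_size)
    examples = steps * batch_size  # equals E_min * (1 + batch_size / b_crit)
    return steps, examples


B_CRIT, S_MIN = 2048, 10_000  # hypothetical values
for batch_size in (256, 1024, 2048, 4096, 16384):
    steps, examples = steps_and_examples(batch_size, B_CRIT, S_MIN)
    print(f"B={batch_size:>6}: steps={steps:>9.0f}, examples={examples / 1e6:>7.1f}M")
```

In this toy setting, batch sizes well below `b_crit` buy large reductions in step count at little extra data cost, while batch sizes well above it mostly add compute. Real curves, such as the one behind the 2048 example above, can flatten more abruptly than this idealized form.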

Gradients, Batches, and the Gradient Noise Scale

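OpenAI found that the optimal step size is closely tied to the batch size and the noise scale, and takes the following form:

$$\epsilon_{\text{opt}}(B) = \frac{\epsilon_{\max}}{1 + B_{\text{noise}}/B}$$

Here $B$ is the batch size, $B_{\text{noise}}$ is the noise scale, and $\epsilon_{\max}$ is the step size that would be optimal with the true gradient. With the optimal step size, the optimal improvement in loss obtainable from the noisy gradient becomes:

$$\Delta L_{\text{opt}}(B) = \frac{\Delta L_{\max}}{1 + B_{\text{noise}}/B}$$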

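Two conclusions follow from these formulas:

No matter how accurately we estimate the true gradient, there is always a maximum step size
The larger the batch size, the larger the step size with which we can optimize the model (up to an upper bound)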
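Both conclusions can be checked numerically. The sketch below assumes the step-size relation has the form eps_opt(B) = eps_max / (1 + B_noise / B), with the loss improvement following the same shape. The function names and the noise scale of 1000 are hypothetical.

```python
def optimal_step(batch_size: float, noise_scale: float, eps_max: float) -> float:
    """Optimal step size for a noisy gradient estimated from `batch_size` examples."""
    return eps_max / (1 + noise_scale / batch_size)


def optimal_improvement(batch_size: float, noise_scale: float, delta_l_max: float) -> float:
    """Best achievable loss improvement per step at the optimal step size."""
    return delta_l_max / (1 + noise_scale / batch_size)


NOISE_SCALE = 1000.0  # hypothetical noise scale
for batch_size in (10, 100, 1_000, 10_000, 100_000):
    fraction = optimal_step(batch_size, NOISE_SCALE, eps_max=1.0)
    print(f"B={batch_size:>7}: {fraction:6.1%} of the maximum step size")
```

When the batch size equals the noise scale, the step reaches exactly half of its maximum. Growing the batch by another factor of 100 only closes most of the remaining gap, which is the diminishing-returns regime.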

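The left-hand plot illustrates why a larger batch lets the model make more progress per step. When the batch size becomes too large, however, we hit diminishing returns (because the 1 in the denominator starts to dominate). Note that this only holds when the learning rate is well tuned. OpenAI therefore advises that tuning the learning rate to a value reasonably close to optimal is a prerequisite for the theory to apply.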

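After some further math, OpenAI found that the noise scale can be estimated as:

$$B_{\text{noise}} = \frac{\operatorname{tr}(H\Sigma)}{G^{T} H G}$$

Here $\Sigma$ is the per-example covariance matrix of the gradient, $G$ is the true gradient, and $H$ is the Hessian. To simplify this equation further, OpenAI makes the (unrealistic) assumption that the optimization is perfectly well-conditioned. In that case the Hessian is just a multiple of the identity matrix, and the noise scale simplifies to:

$$B_{\text{simple}} = \frac{\operatorname{tr}(\Sigma)}{|G|^{2}}$$

Empirically, they found the two results to be fairly close. The simplified equation says that the noise scale equals the sum of the variances of the individual gradient components, divided by the squared norm of the gradient. OpenAI used this result in its later scaling-law work to predict the optimal batch size.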
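As a rough illustration of how the simplified noise scale tr(Sigma) / |G|^2 might be measured, the sketch below compares gradient norms at two batch sizes, using the fact that the expected squared norm of a batch gradient is |G|^2 + tr(Sigma) / B. Everything here is a toy assumption: the gradients are synthetic Gaussians with a known answer, the two batches are drawn independently, and the function names are hypothetical.

```python
import random


def batch_grad_sq_norm(rng, true_grad, sigma, batch_size):
    """Squared norm of a mini-batch gradient built from noisy per-example gradients."""
    dim = len(true_grad)
    total = [0.0] * dim
    for _ in range(batch_size):
        for j in range(dim):
            total[j] += rng.gauss(true_grad[j], sigma)
    return sum((t / batch_size) ** 2 for t in total)


def estimate_simple_noise_scale(rng, true_grad, sigma, b_small=8, b_big=64, trials=2000):
    """Estimate tr(Sigma) / |G|^2 from gradient norms at two batch sizes."""
    g2_sum, trace_sum = 0.0, 0.0
    for _ in range(trials):
        small = batch_grad_sq_norm(rng, true_grad, sigma, b_small)
        big = batch_grad_sq_norm(rng, true_grad, sigma, b_big)
        # Solve E[|G_B|^2] = |G|^2 + tr(Sigma) / B at the two batch sizes.
        g2_sum += (b_big * big - b_small * small) / (b_big - b_small)
        trace_sum += (small - big) / (1 / b_small - 1 / b_big)
    # Ratio of averages: far less noisy than averaging per-trial ratios.
    return trace_sum / g2_sum


rng = random.Random(0)
true_grad, sigma = [1.0, 1.0, 1.0, 1.0], 2.0  # synthetic toy problem
exact = len(true_grad) * sigma**2 / sum(g * g for g in true_grad)
estimate = estimate_simple_noise_scale(rng, true_grad, sigma)
print(f"exact noise scale = {exact:.2f}, estimate = {estimate:.2f}")
```

In a real training run the two norms would come from gradients that are already being computed, for example a per-device gradient and the all-reduced gradient, so the measurement adds little overhead.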

Learning rate as temperature

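The conclusions above rest on one premise: the LR is reasonably well tuned. This is because OpenAI found that the noise scale roughly follows this relationship:

$$B_{\text{noise}} \propto B_{\text{simple}} \propto \frac{1}{T}$$

Here $T$ is the "temperature" of training. When updating with SGD and small batches, the temperature is approximately $T \approx \epsilon / B$. Combined with the relation above, this suggests $B_{\text{noise}} \propto B / \epsilon$.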

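From the above we learn:

A high temperature leads to a smaller noise scale. The intuition is that at high temperature the gradient magnitude is large relative to the variance.
When the learning rate is decayed by a constant factor, the noise scale grows by roughly the same factor. So if the learning rate is too small, the noise scale is inflated.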
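These two points can be turned into a back-of-the-envelope calculation. The sketch assumes the small-batch SGD approximation that the temperature scales as learning rate divided by batch size, and that the noise scale is inversely proportional to the temperature. The function names and numbers are hypothetical.

```python
import math


def temperature(lr: float, batch_size: int) -> float:
    """Assumed small-batch SGD approximation: T ~ lr / batch_size."""
    return lr / batch_size


def relative_noise_scale(lr: float, batch_size: int, ref_lr: float, ref_batch_size: int) -> float:
    """Noise scale relative to a reference setting, assuming it is proportional to 1 / T."""
    return temperature(ref_lr, ref_batch_size) / temperature(lr, batch_size)


# Decaying the LR by 10x at a fixed batch size raises the noise scale by about 10x.
growth = relative_noise_scale(lr=3e-5, batch_size=256, ref_lr=3e-4, ref_batch_size=256)
print(f"noise scale grows by a factor of {growth:.1f}")
assert math.isclose(growth, 10.0)
```

Under these assumptions, a decaying learning rate schedule pushes the noise scale up over the course of training, which is consistent with the batch-size ramps described in the quoted training recipes.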

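Batch size vs. LR: experimental findings

In its experiments, Google found that almost any heuristic relating batch size and LR holds only within a certain range. Moreover, any study that tunes the LR for a single batch size and then uses a heuristic to pick the LR for other batch sizes gives a systematic advantage to the batch size that was tuned (and to nearby batch sizes).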

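Closing remarks

This article is only a very brief summary of the two works discussed above, written as a recap of some of my own recent work. Readers who want the details are strongly encouraged to read the papers themselves. While I was working on this, a very interesting article called Chinchilla's Death appeared, which compares the equivalent GPU-hours of different models and offers a very interesting perspective on scaling up.

However, in light of the present article, and given that small models can only scale up through data parallelism, many of that article's conclusions no longer hold. Moreover, as the cost of compute keeps falling while the high-quality data available for training is gradually exhausted, we may see a new paradigm shift.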