Revisiting critical batch size for large-batch OLMo pretraining
June 3, 2025
Will Merrill - Ai2
Training LLMs like OLMo on massive amounts of text relies heavily on data parallelism: the ability to learn from many text documents at once. But data parallelism is bottlenecked by batch size: the number of documents used for each gradient step during training. This means it’s important to train LLMs with a large batch size if you want training to finish in a reasonable amount of time. On the other hand, training with too large a batch size can degrade the performance achievable from a fixed amount of data, since larger batches show diminishing returns in their ability to reduce loss. Thus, in practice, it is important to pick a batch size that balances these two concerns, allowing a high degree of data parallelism without degrading performance.
Such a batch size that balances data parallelism and token efficiency is known as the critical batch size (CBS) in the literature. In practice, when training LLMs, it is useful to have a method to measure the CBS in advance before launching a pretraining run. There are existing ways to do this (most notably, measuring the CBS via a noise scale proxy), but, while applying these ideas to OLMo, we found fundamental problems that make them hard to trust.
Therefore, in our recent preprint Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training, we introduced a new method to measure the CBS directly and reliably. We use our method to carefully evaluate CBS over previous OLMo training runs, and use these measurements, combined with a novel batch size warmup method, to increase data parallelism for future OLMo runs. In fact, using these insights, we train OLMo 1B to the same loss as the original run with 43% fewer gradient steps. This shows that CBS measurements and batch size warmup are promising tools for increasing data parallelism during pretraining runs without sacrificing performance, which we plan to continue to use as we train newer generations of OLMo models.
Background
Critical Batch Size (CBS)
The CBS was introduced by McCandlish et al. (2018) as a batch size representing a reasonable tradeoff between data parallelism and token efficiency: i.e., a batch size that is large enough to speed up training, but not so large that it degrades final performance. Following this idea, we state the CBS hypothesis as follows in our paper:
There is some critical batch size B* up to which increasing the batch size (and appropriately modifying the learning rate) approximately preserves the loss trajectory as a function of tokens trained, but, above which, the loss trajectory degrades.
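Written slightly more formally (our paraphrase of the statement above, with $L(n; B)$ denoting the loss reached after training on $n$ tokens with batch size $B$):

$$L(n; B) \approx L(n; B') \;\; \text{for all } B, B' \le B^*, \qquad \text{while } L(n; B) \text{ is worse than } L(n; B^*) \text{ for } B > B^*.$$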
As illustrated in the figure below, we broadly find that, indeed, there is a B* below which models achieve roughly the same loss, but, above which, loss degrades. Thus, in order to increase data parallelism, we can increase our batch size to B* (and scale the learning rate). But this is only practical if we have some way to measure the CBS B* before launching a training run.
Estimating CBS via the gradient noise scale
McCandlish et al. (2018) propose that the CBS can be measured before launching a large training run by using a quantity called the gradient noise scale as a proxy. The gradient noise scale measures how much the gradient varies across different examples in a batch. Where $G$ is the true gradient and $\Sigma$ is the covariance matrix of the per-example gradients, the gradient noise scale $\mathcal{B}_{\mathrm{noise}}$ is defined as

$$\mathcal{B}_{\mathrm{noise}} = \frac{\mathrm{tr}(\Sigma)}{\|G\|^2}.$$
A larger gradient noise scale means there is more variation in the gradient between different examples in the batch. As argued by McCandlish et al. (2018), more noise means the batch size can be increased further without a major effect on the optimization trajectory, since small-batch steps would be noisy anyway, and aggregating many examples together is required to take a step in the right direction. Thus, according to McCandlish et al., we can measure the gradient noise scale and use it as a proxy for the CBS.
Practically, this idea is appealing for several reasons. First, McCandlish et al. show how gradient noise scale can be estimated efficiently using gradient norms, which we would likely compute anyway when training an LLM. Second, McCandlish et al. present a formal argument for how, under certain assumptions, gradient noise scale should be a proxy for the CBS. This inspired some practical adoption of the gradient noise scale, e.g., in pretraining GPT-3.
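For reference, McCandlish et al.’s appendix gives a simple estimator of $\mathcal{B}_{\mathrm{noise}}$ from gradient norms computed at two different batch sizes (for example, a single data-parallel shard versus the full batch). Below is a minimal Python sketch of that estimator; the function name is ours, and in practice the two squared-norm measurements are noisy and should be smoothed (e.g., with exponential moving averages) before taking the final ratio.

```python
def estimate_noise_scale(g_small_sq, g_big_sq, b_small, b_big):
    """Two-batch-size estimator of the gradient noise scale.

    g_small_sq: squared norm of a gradient estimate computed with batch size b_small
    g_big_sq:   squared norm of a gradient estimate computed with batch size b_big
    (e.g., one data-parallel shard vs. the full batch). Both should be averaged
    or smoothed over many steps, since individual measurements are very noisy.
    """
    # Unbiased estimate of |G|^2, the squared norm of the true gradient.
    g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    # Unbiased estimate of tr(Sigma), the total per-example gradient variance.
    trace_sigma = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    # Gradient noise scale: tr(Sigma) / |G|^2.
    return trace_sigma / g_sq
```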
However, justifying that gradient noise scale should serve as a proxy for CBS relies on two strong assumptions, which makes it unclear whether this connection should be trusted in practice:
- McCandlish et al. assumed stochastic gradient descent (SGD) as the optimizer, which was appropriate at the time but is not standard today for training LLMs (adaptive methods like Adam are now the norm). Further, Malladi et al. (2022) showed that the scaling rule between batch size and learning rate, which is central to McCandlish et al.’s analysis, works differently for Adam than for SGD (see the scaling rules sketched after this list). Thus, it is unclear whether we should trust the gradient noise scale for models optimized with Adam.
- McCandlish et al. also assume that the optimization is well-conditioned (i.e., the Hessian is a multiple of the identity matrix), which is a strong assumption. Without it, the gradient noise scale does not serve as a proxy for the CBS: instead, an intractable formula involving the Hessian replaces it! Still, McCandlish et al. suggest that the gradient noise scale might be roughly correlated with the CBS in practice, just off by a constant factor. It’s not clear why this should be true, but even if it is, it poses a practical problem: ideally we want an absolute measurement of the CBS, rather than a correlated proxy.
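For concreteness, the scaling rules at issue can be written as follows (a simplification, with $B_0$ a base batch size and $\eta_0$ its tuned learning rate): the linear rule traditionally paired with SGD, and the square-root rule that Malladi et al. derive for Adam.

$$\text{SGD (linear rule): } \eta(B) = \eta_0 \cdot \frac{B}{B_0} \qquad\qquad \text{Adam (square-root rule): } \eta(B) = \eta_0 \cdot \sqrt{\frac{B}{B_0}}$$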
Since these assumptions may not be met in practice, it is unclear whether we should trust the gradient noise scale as a proxy for the CBS. This motivated our exploration of more direct, but still cheap, methods for measuring the CBS.
Our method: Direct CBS measurement via branched training
Along these lines, we introduce an empirical method to directly measure the CBS with a small amount of additional training. Given a checkpoint, our method trains multiple local branches with different batch sizes (and appropriately scaled learning rates) for $\Delta$ steps. We record the loss achieved by each local branch and define the CBS as the largest batch size that matches or outperforms all smaller batch sizes in loss.
To make this local branching method efficient, we set $\Delta$ much smaller than the full training budget. This relies on the mild assumption that, if a branch’s loss curve recovers to the original loss within $\Delta$ tokens, the two loss curves will remain comparable thereafter.
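To make the selection rule concrete, here is a minimal Python sketch (with hypothetical names and made-up numbers) of how the CBS is read off once the branch losses are in hand; producing `branch_losses` itself requires the short branched training runs described above, one per candidate batch size, each with an appropriately scaled learning rate.

```python
def critical_batch_size(branch_losses: dict[int, float], tol: float = 0.0) -> int:
    """Select the CBS from branched-training results.

    `branch_losses` maps each branch's batch size to the loss it reaches after
    training for Delta tokens from the same checkpoint. The CBS is the largest
    batch size whose loss matches or outperforms the loss of every smaller
    batch size (within a small tolerance `tol`).
    """
    cbs = None
    for bs in sorted(branch_losses):
        smaller_losses = [branch_losses[b] for b in branch_losses if b < bs]
        if all(branch_losses[bs] <= loss + tol for loss in smaller_losses):
            cbs = bs
    return cbs

# Hypothetical example: losses after Delta tokens for branches at several batch sizes.
branch_losses = {256: 3.02, 512: 3.01, 1024: 3.01, 2048: 3.00, 4096: 3.05}
print(critical_batch_size(branch_losses, tol=0.005))  # -> 2048
```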
Using our method, we measure how the CBS changes across training checkpoints for OLMo 1B and 7B.
The results show a similar qualitative pattern across both model scales: the CBS starts near 0, increases rapidly early in training with diminishing growth later on, and plateaus around a batch size of 4096. We draw three main takeaways from these results:
- The fact that the pattern looks similar across model sizes agrees with prior work suggesting that the CBS depends primarily on data (volume, and possibly also quality) rather than model size. Practically, this suggests that measurements of the CBS on small-scale training runs could inform larger-scale training runs.
- We also compare our CBS measurements with the gradient noise scale estimated via McCandlish et al.’s methodology. Broadly speaking, the two do not match. At the 1B scale, the qualitative pattern looks somewhat similar, but the absolute numbers are off. At the 7B scale, the qualitative pattern of the gradient noise scale is messy. Since our method measures the CBS more directly (and we validate it via large-scale training), we take this to suggest that the gradient noise scale cannot be reliably trusted as a proxy for the CBS: measuring the CBS via the gradient noise scale could have led us to underestimate it.
- Finally, the qualitative pattern of CBS growth we observe naturally motivates batch size warmup, where the batch size starts small and then dynamically increases over the course of a training run as the CBS grows. In principle, this should allow us to train with a larger batch size for most of training without ever training with a batch size that is too large (near the beginning). Thus, the final part of the paper moves on to formalizing and validating this idea.
Using CBS to train with fewer gradient steps
The crucial takeaway from our CBS measurements for OLMo was that the CBS starts small at the beginning of training, increases rapidly initially, and then plateaus. While most work on CBS scaling laws assumes a fixed batch size for all of training, our findings suggest we could train with a much larger batch size for most of training compared to the batch size used at initialization.
More generally, we propose a batch size warmup method as follows. We start with a small batch size and double it whenever we determine that the CBS has increased enough. We also scale the learning rate by a factor of $\sqrt{2}$ each time we double the batch size, following the square-root scaling rule.
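A minimal sketch of this warmup logic (hypothetical names; the exact trigger for each doubling is illustrative rather than the paper’s precise recipe):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    batch_size: int  # current global batch size
    lr: float        # current learning rate

def apply_batch_size_warmup(cfg: TrainConfig, measured_cbs: int) -> TrainConfig:
    """Double the batch size, and scale the learning rate by sqrt(2), whenever
    the measured CBS has grown to at least twice the current batch size."""
    while 2 * cfg.batch_size <= measured_cbs:
        cfg.batch_size *= 2
        cfg.lr *= 2 ** 0.5  # square-root learning rate scaling rule
    return cfg

# Hypothetical usage: re-check periodically during training against the latest CBS estimate.
cfg = TrainConfig(batch_size=512, lr=3e-4)
cfg = apply_batch_size_warmup(cfg, measured_cbs=2048)
print(cfg.batch_size)  # 2048 (two doublings; lr was scaled by sqrt(2) each time)
```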
We use batch size warmup to train OLMo 1B to the same (in fact, slightly better) loss with 43% fewer gradient steps than the original run, comparing against both the original run and a baseline that simply trains with a fixed larger batch size from the start. In contrast, the fixed larger batch size degrades loss at the beginning of training (when we predict it to be too large relative to the CBS), and its final loss is slightly degraded relative to the original loss after a mid-training phase involving learning rate annealing. Drawing strong conclusions about the relative strength of different methods from small loss differences between individual training runs is difficult. Still, the fact that batch size warmup outperforms both the original run and the large-batch control suggests it is a simple way to safely increase data parallelism via larger batch sizes in practice. Thus, we plan to incorporate batch size warmup into future pretraining runs (e.g., OLMo 3).
Takeaways
Our work revisited the methodology for measuring the CBS, introducing an empirical branched training method in place of the gradient noise scale proxy, whose validity relies on strong assumptions. We used our method to measure the CBS over OLMo pretraining, which allowed us to train OLMo to comparable loss with fewer gradient steps using batch size warmup. Overall, we plan to leverage these insights in future OLMo pretraining runs and hope they can also inform other open pretraining efforts.