
Investigating pretraining dynamics and stability with OLMo checkpoints

Will Merrill / October 2, 2024

A central goal of the OLMo project is to use our experience to contribute to an open science of LM pretraining that can provide a foundation for open-source pretraining efforts. For this reason, we are trying out something new: a short blog post where we use data from our open pretraining runs to evaluate how hypotheses about pretraining dynamics and stability manifest in our own OLMo checkpoints. We hope to both contribute to the open science of pretraining and better understand the root causes of instability in our own pretraining runs.

This blog covers a simple, but hopefully illustrative, investigation into the evolution of parameter and activation magnitudes throughout pretraining checkpoints from OLMo 7B 0724. Parameter and activation magnitudes have been previously studied with the aim of understanding the inductive biases of pretraining (Merrill et al., 2021) as well as for improving quantization methods (Dettmers et al., 2022). In contrast, we are interested in analyzing the evolution of parameter and activation magnitudes in order to better understand and circumvent pretraining instability issues with large and low-precision models. Indeed, in combination with recent work on pretraining stability, the analysis here suggests that shrinking embedding norms and the emergence of outlier features in early layers might both be destabilizing forces in the OLMo 7B 0724 pretraining run. Modeling tweaks that increase the activation norm in early layers and prevent the emergence of outlier features could improve pretraining stability for future OLMo runs.

Background - Parameter norm growth

Classical machine learning (and perhaps conventional wisdom about deep learning) would suggest that, during training, parameters converge to a specific magnitude as they approach a local minimum. The findings of Merrill et al. (2021), as well as work in deep learning theory (Ji & Telgarsky, 2020), suggest this is not the case: in fact, during pretraining, parameters can continue to grow (i.e., actually diverge) in magnitude, even as the loss converges. Under this view, training is not a process where the model descends a hill to the bottom of a valley, but rather where the model is swept through a fjord flowing out to an ocean at infinity.

Outlier parameters and features

Relatedly, Dettmers et al. (2022) claim outlier features emerge during pretraining. That is, after a model has been pretrained, there will be dimensions in the activations with significantly greater magnitude than typical activations: i.e., the magnitude of these features has grown relative to the overall norm. This can be understood as a refinement of Merrill et al.'s claim of norm growth during pretraining, where the growth is targeted to specific parameters (and activations). Dettmers et al. (2022) were motivated to think about outlier features from the perspective of quantization (where it is important to represent outliers with higher precision), but our motivations lie more in connection to pretraining stability.

Connection to pretraining stability

Pretraining large models often runs into stability issues where the loss rapidly or slowly increases rather than monotonically decreasing. The causes of instability can be complex, involving data properties, modeling choices, as well as numerical issues. However, recent work has suggested general principles for understanding and combatting instability, and the evolution of parameter and activation norms plays a central role. In particular, Takase et al. (2024) find that small embedding and activation norms can lead to large gradient updates that destabilize training. In addition, maintaining unit scale of the parameters (Blake et al., 2023; Blake et al., 2024) has been proposed as a way to prevent numerical precision issues leading to instability, especially in low-precision regimes.

Questions about OLMo 7B 0724

The role of parameter and activation norms in pretraining stability motivates us to ask the following questions about parameter and activations magnitudes using our OLMo 7B 0724 checkpoints:

Evolution of the parameters

  1. How does the parameter norm evolve during pretraining?
  2. Do specific "outlier" parameters emerge during pretraining?

Evolution of the activations

  1. How does the activation norm evolve during pretraining, across layers?
  2. Do specific "outlier" features emerge during pretraining, across layers?

Evolution of the parameters

Aggregate parameter norm decreases during pretraining

To evaluate whether Merrill et al.'s observations with T5 transfer to OLMo, we measure the norm of the parameters of OLMo 7B 0724 across different pretraining checkpoints, excluding embedding parameters. Because OLMo 7B 0724 has no bias terms and uses non-parametric layer-norm, all parameters included in this measurement belong to linear projection matrices. We then compute both the 1-norm and 2-norm. To make the norms comparable and remove the effect of model size, we normalize the 1-norm by 1/n and the 2-norm by 1/sqrt(n), where n is the total number of included parameters.
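As a rough sketch, this measurement can be computed as follows; the name-based filter for embedding parameters and the generic PyTorch model interface are illustrative rather than our exact pipeline:

```python
import torch

@torch.no_grad()
def normalized_param_norms(model: torch.nn.Module) -> tuple[float, float]:
    """Size-normalized 1-norm and 2-norm over non-embedding parameters.

    Sketch only: the "embed" name filter may need adjusting for a
    particular checkpoint format.
    """
    abs_sum, sq_sum, n = 0.0, 0.0, 0
    for name, param in model.named_parameters():
        if "embed" in name:  # exclude embedding parameters
            continue
        p = param.detach().float()
        abs_sum += p.abs().sum().item()
        sq_sum += p.pow(2).sum().item()
        n += p.numel()
    norm1 = abs_sum / n               # 1-norm scaled by 1/n
    norm2 = sq_sum ** 0.5 / n ** 0.5  # 2-norm scaled by 1/sqrt(n)
    return norm1, norm2
```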

Across layers, the 1-norm and 2-norm consistently decrease over pretraining for OLMo 7B 0724. In contrast, Merrill et al. (2021) found that, for T5, the 2-norm increased over pretraining.

Contrary to Merrill et al.'s findings with T5, we see that both the parameter 1-norm and 2-norm consistently decrease over pretraining for OLMo 7B 0724. There are many differences between the two models that might explain this difference in training dynamics, such as the strength of weight decay, the learning rate schedule, the optimizer, and model size.

Emergence of outlier parameters

Next, we aim to evaluate whether there is some specific set of outlier parameters that grows or separates in norm from the other parameters as training progresses. To do this, we introduce two different metrics for parameter "outlieriness" or sparsity and evaluate them across the same OLMo 7B 0724 checkpoints. Let θ be a vector of parameters.

  • Max-sparsity: One way to evaluate the degree to which there are outlier parameters is to compare the magnitude of the most extreme (e.g., max or min) parameters against the typical parameter magnitude. We formalize this by comparing the maximum and median parameter magnitudes: max-sparsity(θ) = max_i |θ_i| / median_i |θ_i|.

  • Norm-sparsity: Another way to define sparsity is in terms of norms. With the p-norm, smaller values of p better reflect typical magnitudes in a vector, whereas larger values of p better reflect outliers (in the limit, the ∞-norm is equivalent to the max absolute value). Inspired by this, we define the norm-sparsity metric by comparing the 2-norm against the 1-norm: norm-sparsity(θ) = sqrt(n) · ||θ||_2 / ||θ||_1, where n is the number of parameters.

Intuitively, the more outliers there are in the parameters, the larger the 2-norm will be relative to the 1-norm, and the larger this metric will be. The factor sqrt(n) is chosen so that a norm-sparsity of 1 is the minimum obtainable value (attained when all parameter magnitudes are equal), and a larger value indicates more outliers.
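For reference, here is a minimal sketch of the two metrics as defined above, in PyTorch, operating on a flattened parameter tensor:

```python
import torch

def max_sparsity(theta: torch.Tensor) -> float:
    """max_i |theta_i| / median_i |theta_i|: how far the most extreme
    parameter sits from a typical one."""
    mags = theta.detach().float().abs().flatten()
    return (mags.max() / mags.median()).item()

def norm_sparsity(theta: torch.Tensor) -> float:
    """sqrt(n) * ||theta||_2 / ||theta||_1: equals 1 when all parameter
    magnitudes are identical and grows as outliers dominate the 2-norm."""
    flat = theta.detach().float().flatten()
    n = flat.numel()
    return (n ** 0.5 * flat.norm(p=2) / flat.abs().sum()).item()
```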

The max-sparsity metric, which is designed to detect extreme outliers, increases roughly monotonically through pretraining across layers. The norm-sparsity metric suggests sparsity decreases initially and then outliers become more prominent after about 40k steps.

The max-sparsity metric suggests that the maximum-magnitude parameters drift away from the median as pretraining progresses. This is evidence for outliers with very extreme values. The story with norm-sparsity is more complicated: after an early phase where norm-sparsity decreases from initialization, norm-sparsity begins to increase after 40k steps. Thus, later in pretraining, this metric also detects some emergence of parameter outliers. We believe the early phase where norm-sparsity decreases may be explained by many parameters that were initialized with large values being pruned early in pretraining. Notably, in the embedding layer (represented by the bottommost light blue line), norm-sparsity increases more or less monotonically from initialization.

Taken together with the previous section, these results suggest that, while parameter magnitudes shrink in aggregate during pretraining, specific outlier parameters separate in magnitude from the other parameters. Because of their larger magnitude, the network's computation is likely to be more sensitive to the values of this sparse set of parameters. From the perspective of pretraining stability (especially in the low-precision regime), this is important, as more bits may be needed to represent the values of large parameters, both because of their potentially greater significance and because large values can lead to inexact floating-point arithmetic (Blake et al., 2023). We now turn to investigating whether these trends in the norms and sparsity of the parameters also manifest in the activations computed by the network, which has further implications for pretraining stability.

Evolution of the activations

To understand how the activation magnitudes change over training, we take 500 documents from the Pile validation set (each truncated to 2048 tokens) and pass them through each OLMo 7B 0724 checkpoint, recording the activations at different layers of the model. We can then make similar measurements on these activation vectors as we did for the model parameters.
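A rough sketch of this measurement, assuming a Hugging Face-style causal LM that supports `output_hidden_states=True` (the exact loading code for OLMo checkpoints may differ):

```python
import torch

@torch.no_grad()
def layer_activation_norms(model, input_ids: torch.Tensor) -> list[tuple[float, float]]:
    """Mean per-token 1-norm and 2-norm of the residual stream at each layer.

    Sketch only: normalizing by the hidden size d (and sqrt(d)) is an
    illustrative choice to make layers comparable, not necessarily the exact
    normalization used in our plots.
    """
    hidden_states = model(input_ids, output_hidden_states=True).hidden_states
    stats = []
    for h in hidden_states:  # (embedding output, layer 1, ..., layer L)
        h = h.detach().float()  # shape: (batch, seq, d)
        d = h.shape[-1]
        norm1 = h.abs().sum(dim=-1).mean() / d
        norm2 = h.norm(p=2, dim=-1).mean() / d ** 0.5
        stats.append((norm1.item(), norm2.item()))
    return stats
```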

Activation norm decreases similarly to parameter norm

As with the parameters, we plot the norm of the activations across different pretraining checkpoints. This time, we break down the results by layer to understand the relative magnitudes of activations in different parts of the network.

Across layers, the 1-norm decreases over time, with a greater effect in earlier layers. The trend for the 2-norm is qualitatively similar, though the trend by layer is less clean, with some later layers decaying faster than early ones. The final layer (dark blue) shows a different pattern than the other layers, presumably because a certain temperature is required to maintain calibration of the logits.

Similarly to the parameter norm, the activation norm tends to decrease over pretraining steps. The final layer's norm flattens off rather than decaying like the other layers', which can likely be explained by the fact that the norm of the final-layer activations influences the calibration or confidence of the LM's predictions. Ignoring the final layer, the activation norm is smaller in earlier layers than in later ones, reflecting the additive nature of the residual stream. This is notable from the perspective of instability because it supports the hypothesis that layer-norm gradients coming from earlier layers are particularly susceptible to instability (Takase et al., 2024). Takase et al. (2024) propose changing the architecture to increase the norm of the activated embeddings (either with larger initialization or layer-norm), which could be interesting to explore in future pretraining runs.
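To make this kind of tweak concrete, here is an illustrative sketch (not the OLMo implementation) of an embedding layer that keeps the activations entering the first layer from being too small, either by applying layer-norm or by scaling the embeddings up:

```python
import math
import torch
import torch.nn as nn

class StabilizedEmbedding(nn.Module):
    """Illustrative sketch in the spirit of Takase et al. (2024): keep the
    embedding output from shrinking by layer-norming it or scaling it up."""

    def __init__(self, vocab_size: int, d_model: int, use_layer_norm: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Alternative to layer-norm: scale embeddings by sqrt(d_model).
        self.scale = math.sqrt(d_model)
        self.norm = nn.LayerNorm(d_model) if use_layer_norm else None

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(input_ids)
        return self.norm(x) if self.norm is not None else x * self.scale
```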

Emergence of activation outliers, especially in early layers

Following the methodology we used for the parameters, we can measure both the norm-sparsity and max-sparsity of the activations across different layers over pretraining.

The two sparsity metrics, despite their different definitions, look remarkably similar! Both suggest that, early in training, outlier features emerge in the activations. This is most pronounced in earlier layers of the network, and especially the embedding layer (layer 0). We can visualize a histogram of the activation magnitudes in the embedding layer to confirm this layer's level of sparsity:

While most of the activations after pretraining overlap with the blue distribution representing initialization, there are 10 outlier activations to the right of the initialization distribution. A similar trend holds in other early layers, though it becomes less prominent in deeper layers. In contrast, from layer 12 onward, each layer has at most 1 (and usually 0) outlier activations to the right of the initialization distribution. For example, in layer 12:

This visualization confirms the interpretation of the sparsity metrics above: outlier features similar to those reported by Dettmers et al. (2022) emerge during pretraining, especially in early layers of the network. Thus, while the magnitude of typical features is shrinking, the magnitude of these outlier features actually grows, which has the potential to introduce floating-point numerical issues that could in turn induce instability.
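For readers who want to reproduce this kind of comparison, here is a rough sketch; the binning and log-scaling choices are illustrative, not exactly what we used:

```python
import matplotlib.pyplot as plt
import torch

def plot_activation_histogram(acts_init: torch.Tensor, acts_final: torch.Tensor, layer: int):
    """Overlay histograms of per-dimension mean activation magnitudes for one
    layer at initialization vs. the final checkpoint.

    acts_init / acts_final: tensors of shape (tokens, d_model), e.g. collected
    with the activation-recording sketch above.
    """
    for acts, label in [(acts_init, "initialization"), (acts_final, "final checkpoint")]:
        mags = acts.detach().float().abs().mean(dim=0)  # mean |activation| per feature
        plt.hist(torch.log10(mags + 1e-8).cpu().numpy(), bins=100, alpha=0.5, label=label)
    plt.xlabel("log10 mean |activation|")
    plt.ylabel("number of feature dimensions")
    plt.title(f"Layer {layer} activation magnitudes")
    plt.legend()
    plt.show()
```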

Discussion: implications for pretraining stability

This simple investigation of OLMo 7B 0724 checkpoints produced two high-level insights for understanding pretraining dynamics and stability issues. First, in aggregate, parameter and activation magnitudes seem to decay to near 0 over time, with a stronger effect in early layers of the network. As noted in prior work, this can lead to instability issues because small activations, when passed through pre-norm, produce large gradients. Second, despite the aggregate decrease in parameter and activation norms, specific "outlier" parameters and activations emerge with magnitudes that substantially deviate from typical magnitudes. Along with Dettmers et al. (2022), we conjecture that outlier parameters play an outsized role in network computation, so it is important to represent them precisely. Moreover, large outlier parameters and activations (even if they are rare) can lead to errors in floating point computation, especially under low-precision training. Thus, the parameter and activation dynamics documented here may explain the instability issues we faced while training OLMo 7B 0724.

Moving forward, these observations motivate modifying the training dynamics to prevent the development of small embeddings and large outlier features in order to improve stability. Methods like z-loss (Chowdhery et al., 2022) and unit-scaled initialization (Blake et al., 2023) have been developed for this. In future blog posts, we plan to investigate further whether these methods can improve the pretraining stability of OLMo.
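As one example, z-loss is a small auxiliary penalty on the log of the softmax normalizer that discourages the logits (and hence the final-layer activations) from drifting in scale. A minimal sketch follows; the 1e-4 coefficient follows the PaLM report and is only a default here:

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(
    logits: torch.Tensor,    # (num_tokens, vocab_size)
    targets: torch.Tensor,   # (num_tokens,)
    z_loss_coef: float = 1e-4,
) -> torch.Tensor:
    """Cross-entropy plus the auxiliary z-loss of Chowdhery et al. (2022):
    penalize log(Z)^2, where Z is the softmax normalizer."""
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)  # log of the softmax normalizer Z
    z_loss = z_loss_coef * (log_z ** 2).mean()
    return ce + z_loss
```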

Thanks to Dirk Groeneveld for advice on this investigation.
