Signal and Noise: Reducing uncertainty in language model evaluation
August 19, 2025
David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, and Jesse Dodge - Ai2
At Ai2, we employ smaller model training runs to validate the quality of a dataset, architecture, or training decision. Then, we scale up the best-performing options for our large model training. Evaluation benchmarks are the key decision-making tool at each step of language model development, yet it's sometimes challenging to know when – and which – benchmark scores are appropriate to consider, since the properties of benchmarks aren't always clear. But can we look at how small models train, and how the performance of our models scales, to improve our procedure for evaluating LLMs?
We find that two simple metrics reveal differences in the utility of current benchmarks: signal, a benchmark’s ability to separate better models from worse models, and noise, a benchmark’s sensitivity to random variability between training steps. By measuring the ratio between signal and noise across a large number of benchmarks and models, we find a clear trend that benchmarks with a better signal-to-noise ratio are more reliable for making decisions at a small scale.
Our measure of the signal-to-noise ratio also suggests “interventions” we can apply to our benchmarks to improve it. To encourage further research on evaluation methodology, we release a dataset of 900K evaluation results on 465 open-weight language models, including evaluations across intermediate checkpoints of our OLMo models, the DataDecide model suite, and Ai2’s ladder scaling law models.
Below, we present our core findings and interventions.
How does noise impact decisions when training language models?
Our work studied the impact of signal and noise in two scenarios:
- Decision accuracy: Training two small language models on different datasets and checking whether their ranking matches that of two larger models trained on the same pair of datasets (a minimal sketch follows this list).
- Scaling law prediction error: Training a set of small models, and fitting a “scaling law” to predict the performance of a large model.
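To make the decision accuracy setting concrete, here is a minimal sketch of the comparison, assuming we have one benchmark score per (dataset, model size) pair; the dataset names and scores are hypothetical placeholders, not results from our experiments.

```python
# Minimal sketch of decision accuracy: the fraction of dataset pairs where the
# ranking of two small models matches the ranking of two larger models trained
# on the same pair of datasets. All scores below are hypothetical.
from itertools import combinations

def decision_accuracy(small_scores: dict, large_scores: dict) -> float:
    """small_scores / large_scores map dataset name -> benchmark score."""
    pairs = list(combinations(small_scores.keys(), 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Example with made-up scores for three candidate pretraining datasets
small = {"data_A": 0.42, "data_B": 0.45, "data_C": 0.40}
large = {"data_A": 0.61, "data_B": 0.66, "data_C": 0.63}
print(f"decision accuracy: {decision_accuracy(small, large):.2f}")
```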
Comparing two small models
Consider the training curves for these 25 1-billion-parameter models (see below), trained on 100 billion tokens. This represents a typical small-scale experiment, such as one we would run during OLMo development. At this scale, if the scores are too close (like HellaSwag), or if the training curve is too noisy (like ARC Challenge), then we clearly cannot claim that one model performs better or worse than another!
Making scaling law predictions
In this setting, we train many small language models (from 150M to 1B parameters) and predict the performance at a much larger scale (in this case, 13 billion parameters). However, we noticed that the noise around the model we are predicting – the curve in the inset axis – can be far larger than the error of our scaling law fit.
For example, HellaSwag, a commonsense sentence-completion task, is reasonable to predict because its noise is very small (the fit exhibits a < 0.1% error), but MBPP, a code generation task, has much more noise and is therefore far more difficult to predict (our scaling law exhibits a 15.7% error). In this case, the noise tells us how well we can expect to predict performance!
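Here is a minimal sketch of the scaling law setting, assuming a simple saturating power law in parameter count (the ladder fits described in our paper are more involved); all model sizes and scores below are hypothetical.

```python
# A minimal sketch of scaling-law prediction error: fit a simple saturating
# power law to small-model scores and extrapolate to a larger model.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_billion, a, b, c):
    # score approaches c as the parameter count grows; a, b control the approach
    return c - a * n_billion ** (-b)

# Hypothetical benchmark accuracies for small models (150M to 1B parameters)
n_billion = np.array([0.15, 0.30, 0.53, 0.75, 1.0])   # parameters, in billions
scores    = np.array([0.31, 0.36, 0.40, 0.42, 0.45])

(a, b, c), _ = curve_fit(power_law, n_billion, scores, p0=[0.1, 0.5, 0.6])

predicted = power_law(13.0, a, b, c)   # extrapolate to a 13B-parameter model
observed  = 0.58                       # hypothetical measured 13B score
rel_error = abs(predicted - observed) / observed
print(f"predicted={predicted:.3f}  relative error={rel_error:.1%}")
```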
These are only two of many places where decisions are made with small-scale experiments. Designing these experiments is relevant to hyperparameter transfer methods (Yang et al., 2022) and to evaluations that rely on language models displaying specific capabilities, like answering multiple-choice questions (Wiegreffe et al., 2024) or writing math and executable code (Snell et al., 2024).
Measuring the signal-to-noise ratio
Using the two setups above, our work proposes a cheap and reliable metric to estimate the signal-to-noise ratio of a benchmark, given a few model checkpoints to evaluate.
Noise
Noise is a well-studied phenomenon in language model benchmarking (Berg-Kirkpatrick et al., 2012, Card et al., 2020, and recently Miller, 2024). However, these works study the intrinsic noise of the dataset (e.g., the standard error due to the sample size), rather than noise as a result of differences in the model during training.
To measure noise, we simply calculate the standard deviation of the benchmark score over the final checkpoints of training for a single model at a particular compute scale. For example, to measure the noise for a benchmark at the 7B scale, we use the final 30 checkpoints of the OLMo 2 7B training.
In our paper, we considered other sources of modeling noise, such as noise as a result of changing the random seed when initializing the model weights or measuring the full variation of the training curve. However, we found that these sources of modeling noise were highly correlated. Given that the noise of the final training checkpoints is a cheap measure, we use it in our analysis.
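As a minimal sketch, this is what the noise estimate looks like in code, assuming we have an array of benchmark scores for the final checkpoints of a single run; the scores below are toy values, not OLMo 2 results.

```python
# A minimal sketch of our noise estimate: the standard deviation of a benchmark's
# score over the final checkpoints of a single training run.
import numpy as np

def noise(checkpoint_scores: np.ndarray, last_n: int = 30) -> float:
    """Standard deviation of the benchmark score over the final `last_n` checkpoints."""
    return float(np.std(checkpoint_scores[-last_n:]))

checkpoint_scores = np.array([0.612, 0.618, 0.609, 0.615, 0.611, 0.620])  # toy values
print(f"noise = {noise(checkpoint_scores, last_n=5):.4f}")
```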
Signal
Conceptually, signal captures the spread of model scores at a specific scale. In particular, an ideal benchmark exhibits a wide and evenly distributed range of scores across a population of models, which we quantify with dispersion: the maximum difference in scores between any pair of models at that scale.
In our work, we experimented with many measures of spread and observed they resulted in similar findings, so we chose dispersion for simplicity.
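A matching sketch of the signal measure, assuming we have one final benchmark score per model in the population; the model names and scores are hypothetical.

```python
# A minimal sketch of dispersion as our signal measure: the maximum score
# difference between any pair of models in a population at the same compute scale.
import numpy as np

def signal(model_scores: dict) -> float:
    """Dispersion: max minus min benchmark score across the model population."""
    scores = np.array(list(model_scores.values()))
    return float(scores.max() - scores.min())

population = {"model_a": 0.41, "model_b": 0.48, "model_c": 0.39, "model_d": 0.52}
print(f"signal (dispersion) = {signal(population):.3f}")
```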
Signal-to-noise ratio
Combining these phenomena, we measure the ratio of signal, how well a benchmark separates models, and noise, the variability of that benchmark during training. Here is what this measure looks like on a single benchmark, MMLU, at the 1-billion-parameter scale:
We formally define the signal-to-noise ratio as the dispersion of scores across the model population divided by the standard deviation over the final training checkpoints:

$$\mathrm{SNR} = \frac{\text{signal}}{\text{noise}} = \frac{\max_{i,j}\, |s_i - s_j|}{\sigma_{\text{final checkpoints}}}$$

where $s_i$ is the benchmark score of model $i$ in the population and $\sigma_{\text{final checkpoints}}$ is the standard deviation of the benchmark score over the final checkpoints of a single training run.
Measuring signal-to-noise ratio (SNR) across scales
In our work, we study models trained up to 32 billion parameters. For the larger compute scales, we rely on a population of open-weight models to represent the signal. You can see below the model scores we use to calculate signal for larger compute scales. Importantly, the tasks that are most useful for small scales (such as ARC Easy) are different from those at large scales (like HumanEval and MATH 500).
Combining these together, we use the open-weight models to calculate signal, and the variability of the model during training (which we estimate using the OLMo 2 intermediate checkpoints) to calculate noise.
SNR indicates useful benchmarks
To validate that the signal-to-noise ratio is meaningful, we collected the aforementioned dataset of 900K evaluation results on 465 open-weight language models, as also described in our paper. In particular, we evaluated models in both the decision accuracy and scaling law scenarios described above.
In the decision accuracy setup (left figure), we calculated signal using the final checkpoint of the 25 DataDecide models ranging from 60M to 750M parameters, and calculated noise using the standard deviation of the final 5 checkpoints.
In the scaling law setting, we predicted the performance of a 13B-parameter model using models trained up to 1B parameters on the OLMo 2 data mix. To measure noise, we used the final 50 checkpoints of the 13B model training.
In both settings, we find that the signal-to-noise ratio is highly predictive of the benchmark quality ($R^2=0.626$ and $R^2=0.471$, respectively). For scaling laws in particular, we find that some individual tasks, like HellaSwag and Jeopardy, are more predictable than their aggregate multi-task averages.
This evidence suggests that practitioners should aim for benchmarks with high signal and low noise, and that this metric can indicate whether a benchmark will be useful.
Better language model evaluation by improving SNR
Equipped with our measure of the signal-to-noise ratio, below we demonstrate two interventions that improve the quality of our evaluation methodology, illustrated on MMLU, Minerva MATH, and AutoBencher (a synthetically generated benchmark). We include more details and additional successful interventions in our paper.
Filtering noisy sub-tasks using SNR
Many benchmarks are an average of multiple subtasks (for example, MMLU is an average of 57 subtasks). Can SNR detect high-quality subtasks? We test this by calculating the SNR at the 1-billion-parameter scale for each subtask, then including them in our benchmark from highest to lowest SNR.
The curves below show the signal-to-noise ratio when we include subtasks randomly vs. in order of their SNR. We find that using only the 16 highest-SNR subtasks for MMLU, and 6 for an automatically generated dataset (AutoBencher), yields a higher SNR than a random set of subtasks, or even than the entire evaluation set. This improvement in SNR also carries over to decision accuracy and scaling law prediction error, notably a 32% error reduction for MMLU. Our evidence suggests that higher-quality evaluation sets, rather than larger sample sizes, determine whether a benchmark will be useful for decision making.
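A minimal sketch of this filtering procedure, assuming per-subtask score arrays over a model population (for signal) and over the final checkpoints of one run (for noise); the subtask names and scores are synthetic placeholders, not MMLU data.

```python
# A minimal sketch of filtering subtasks by SNR: rank subtasks by their individual
# signal-to-noise ratio and keep only the top k, rather than averaging all of them.
import numpy as np

def subtask_snr(population_scores: np.ndarray, checkpoint_scores: np.ndarray) -> float:
    dispersion = population_scores.max() - population_scores.min()  # signal
    std = checkpoint_scores.std()                                   # noise
    return float(dispersion / std)

def filter_subtasks(population: dict, checkpoints: dict, top_k: int) -> list:
    """population / checkpoints map subtask name -> np.ndarray of scores."""
    ranked = sorted(
        population,
        key=lambda t: subtask_snr(population[t], checkpoints[t]),
        reverse=True,
    )
    return ranked[:top_k]

# e.g., keep the 16 highest-SNR subtasks and average only those
rng = np.random.default_rng(0)
population = {f"subtask_{i}": rng.uniform(0.2, 0.8, size=25) for i in range(57)}
checkpoints = {f"subtask_{i}": rng.normal(0.5, 0.01 * (i + 1), size=30) for i in range(57)}
print(filter_subtasks(population, checkpoints, top_k=16))
```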
Using metrics with higher SNR
The signal-to-noise ratio also lets us test alternative evaluation setups that might increase our ability to detect differences between models. For example, one alternative is to measure language modeling performance on a human-written answer by calculating the bits per byte (BPB) over that gold completion, rather than scoring a model-generated answer. The chart below illustrates this change in the signal-to-noise ratio on the Minerva MATH benchmark.
We find that BPB is particularly effective in improving SNR for generative math and code benchmarks like GSM8K (1.2 to 7.0) and MBPP (2.0 to 41.8). In these cases, the decision accuracy at 150M (the proportion of models ranked the same at 150M parameters compared to the same ranking at 1B parameters) increased from 46% to 77% on GSM8K, 68% to 93% on MBPP, and 51% to 90% on Minerva MATH. We observe an improvement in decision accuracy at the small scale for 90.0% of all benchmarks and a lower scaling law prediction error for 73.3% of all benchmarks.
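As a minimal sketch, here is a common way to compute BPB from per-token log-probabilities over the gold completion; the function and the log-probability values are illustrative, not our exact evaluation code.

```python
# A minimal sketch of bits per byte (BPB) over a gold (human-written) completion:
# total negative log-likelihood in nats, converted to bits and normalized by the
# UTF-8 byte length of the completion text. The log-probabilities are hypothetical.
import math

def bits_per_byte(token_logprobs: list[float], completion_text: str) -> float:
    """token_logprobs: natural-log probabilities the model assigns to each gold token."""
    total_nll_nats = -sum(token_logprobs)
    n_bytes = len(completion_text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# Toy example: log-probs for the tokens of a short gold answer
gold_answer = "The answer is 42."
token_logprobs = [-2.1, -0.8, -1.5, -3.0, -0.4]  # hypothetical model outputs
print(f"BPB = {bits_per_byte(token_logprobs, gold_answer):.3f}")
```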
Takeaways
Our work demonstrates how evaluating the statistical properties of a benchmark can help us understand when evaluation is useful, and how to improve the science of building language models. We plan to continue using these metrics to bolster our evaluation infrastructure—and we believe a focus on the "evaluation of evaluation" can improve the way we build models moving forward.
For the full results, please read our paper! Download the Signal and Noise evaluation suite here, and the code here.