
DataDecide: How to predict best pretraining data with small experiments

Ian Magnusson / April 15, 2025

Behind the scenes at every AI lab, many small models and pretraining datasets are created and experimented with while developing the lab's language models. If made public, these models and datasets could be rich sources of insight into important questions, such as how developers decide which dataset to use for pretraining their models, or which benchmarks to hill-climb on.

As part of Ai2’s commitment to openness, and to empower open exploration of these questions, today we release DataDecide, a suite of models pretrained on 25 corpora that differ in source, deduplication, and filtering, trained on up to 100B tokens across 14 model sizes ranging from 4M to 1B parameters (more than 30k model checkpoints in total). We evaluate all models on a suite of 10 downstream tasks and measure how accurately small models can predict that one pretraining corpus will lead to better performance than another for our largest models. Our conclusions provide recommendations about the best and most cost-effective benchmarks, prediction methods, and metrics to use when making these decisions.

Which pretraining data should you use? Ideally, large experiments with fixed models over random seeds would be used to compare options (left). In practice, cheaper, smaller-scale experiments are used instead (center). There is a natural tradeoff (right) between the cost of development experiments and how accurately they predict which data will be best at the full target compute (1B parameters, 100B tokens). By measuring this tradeoff, DataDecide finds the most cost-effective ways to make these decisions.
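To make the cost side of that tradeoff concrete, here is a back-of-the-envelope comparison using the common ~6·N·D approximation for training FLOPs (N parameters, D tokens). The 150M-parameter, 15B-token development run below is an illustrative assumption, not the exact DataDecide configuration:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs (~6 * parameters * tokens)."""
    return 6.0 * params * tokens

target = train_flops(1e9, 100e9)    # 1B params, 100B tokens -> ~6.0e20 FLOPs
small = train_flops(150e6, 15e9)    # hypothetical 150M-param dev run on 15B tokens

print(f"target run: {target:.1e} FLOPs")
print(f"dev run:    {small:.1e} FLOPs ({target / small:.0f}x cheaper)")
```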

What did we learn?

Interestingly, we find that the simple approach of ranking a set of models of a single size (e.g., 150M parameters) and using that ranking to predict which pretraining corpora will be best at our larger target scale (1B parameters) works quite well (~80% decision accuracy). Our experiments with scaling laws did not outperform this strong ranking baseline. We also see that single-scale decisions made from intermediate checkpoints are as good as those made from compute-equivalent final checkpoints. Among the 10 multiple-choice benchmarks, MMLU and ARC Easy are highly predictable with up to four orders of magnitude less compute, and using the right evaluation metric (the character-normalized likelihood of answers) makes code evaluations on MBPP and HumanEval predictable as well.
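As a rough illustration of what "decision accuracy" means here, the sketch below scores a single-scale ranking baseline by checking, for every pair of corpora, whether the small-scale comparison agrees with the ordering observed at the target scale. The score dictionaries are made-up placeholders, not DataDecide results or its actual API:

```python
from itertools import combinations

# Toy benchmark scores for three pretraining corpora at two scales.
small_scale = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.45}   # e.g. 150M models
target_scale = {"corpus_a": 0.52, "corpus_b": 0.55, "corpus_c": 0.58}  # e.g. 1B models

def decision_accuracy(proxy: dict, target: dict) -> float:
    """Fraction of corpus pairs whose proxy ordering matches the target ordering."""
    pairs = list(combinations(proxy, 2))
    agree = sum(
        (proxy[a] > proxy[b]) == (target[a] > target[b])
        for a, b in pairs
    )
    return agree / len(pairs)

print(decision_accuracy(small_scale, target_scale))  # 0.67 for these toy scores
```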

Check out our paper for our full findings, but next we’ll give you a taste of our detailed conclusions about different benchmarks, scaling laws, and metrics.

Recommendations for model developers

Which benchmarks should we be evaluating on?

In our first figure, we showed the accuracy of predicting the ranking of the 25 datasets at 1B based on downstream performance aggregated over the 10 multiple-choice tasks in OLMES (averaged over 3 random seeds). Here we break that tradeoff between compute and good decisions down for each of the 10 tasks separately. The amount of compute needed to make good predictions varies between tasks: MMLU and ARC are much cheaper to predict than HellaSwag, while the rest of the OLMES tasks give markedly less reliable predictions across the scales we examine.

Are scaling laws better?

Here we show decision accuracy for 8 baseline scaling-law variants. At best, these approaches only reach the same compute-to-decision-accuracy frontier as ranking single-scale experiments. In principle, scaling laws should be able to better capture cases where the steepness of scaling trends causes the performance of one dataset to overtake another at large scale. Future scaling-law work can use DataDecide to check whether better leveraging this advantage can beat the strong baseline set by single-scale experiments.
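For intuition, here is a minimal sketch of one such baseline, assuming a simple saturating power law in compute fit per corpus and extrapolated to the target budget. The functional form and all numbers are illustrative and do not reproduce the 8 variants evaluated in the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

C_REF = 3e16  # normalize compute so the fit is well conditioned

def power_law(compute, a, alpha, e):
    """Saturating power law in training compute: a * (C / C_REF)^(-alpha) + e."""
    return a * (compute / C_REF) ** (-alpha) + e

compute = np.array([3e16, 1e17, 1e18, 1e19])   # small-scale training FLOPs (toy values)
losses = {                                      # per-corpus losses at those scales (toy values)
    "corpus_a": np.array([4.2, 3.9, 3.4, 3.0]),
    "corpus_b": np.array([4.4, 4.0, 3.3, 2.9]),
}
target_compute = 6e20                           # ~1B params x 100B tokens

predictions = {}
for corpus, y in losses.items():
    params, _ = curve_fit(power_law, compute, y, p0=(2.0, 0.3, 2.0), maxfev=10_000)
    predictions[corpus] = power_law(target_compute, *params)

# Lower extrapolated loss -> predicted to be the better corpus at the target scale.
print(sorted(predictions, key=predictions.get))
```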

Can we use a better metric than accuracy?

Here we show decision accuracy when character-normalized proxy metrics are used to predict accuracy targets. Five tasks benefit at smaller scales from metrics based on the raw likelihood of answers (correct prob and total prob), as opposed to discrete accuracy or continuous metrics that penalize probability placed on incorrect answers (norm correct prob, margin). See the paper for even more metrics.
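To make the metric names concrete, the sketch below computes them from hypothetical per-choice likelihoods for a single multiple-choice question. These definitions are a plausible reading of the names above, not the paper's exact implementation, and the numbers are made up:

```python
import numpy as np

# Character-normalized log-likelihood of each answer choice (log-likelihood
# divided by the choice's character length); choice 0 is the gold answer.
char_norm_loglik = np.array([-1.2, -1.9, -2.3, -2.1])
correct_idx = 0

probs = np.exp(char_norm_loglik)

accuracy = float(np.argmax(probs) == correct_idx)                    # discrete accuracy
correct_prob = probs[correct_idx]                                    # raw likelihood of the gold answer
total_prob = probs.sum()                                             # likelihood mass on all choices
norm_correct_prob = probs[correct_idx] / probs.sum()                 # penalizes mass on wrong answers
margin = probs[correct_idx] - np.max(np.delete(probs, correct_idx))  # also penalizes wrong answers

print(accuracy, correct_prob, total_prob, norm_correct_prob, margin)
```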

Our choice of metric has a large impact on decision accuracy, even for code benchmarks. Choosing a continuous metric leads to much better decision accuracy on the code tasks MBPP and HumanEval, where predictions based on accuracy are no better than random chance. See the paper for more details.

Dig in and try it out yourself!

Our suite and accompanying code are easily extended; for example, our checkpoints can simply be evaluated on new benchmarks and metrics, and already released evaluation results can be used to assess new prediction methods or scaling laws. To the best of our knowledge, DataDecide is the most extensive openly available sweep of data decisions over scales and random seeds to date. Releasing this suite will enable future work to significantly reduce the cost of model development by using metrics and benchmarks that give better decisions at smaller scales.
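For example, loading a checkpoint for evaluation on a new benchmark could look roughly like the sketch below, assuming the standard Hugging Face transformers API. The repository id and revision are placeholders, so check the DataDecide release for the actual model names and the branches that hold intermediate checkpoints:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id, not a real repo; see the release for actual names.
repo_id = "allenai/DataDecide-<corpus>-<size>"

model = AutoModelForCausalLM.from_pretrained(repo_id, revision="main")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# From here, the checkpoint can be evaluated on a new benchmark or metric and
# compared against the released evaluation results for the existing ones.
```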
