
Fluid language model benchmarking

September 16, 2025

Valentin Hofmann - Ai2


When we benchmark language models, we typically administer the same evaluation items (e.g., multiple-choice questions) to every model. However, models vary widely in capability, so using a static item set can be suboptimal—much like giving the same exam to both elementary school and college students.

Could benchmarking be improved by tailoring evaluation items to each model?

In our upcoming COLM paper, we introduce Fluid Benchmarking, a new evaluation approach that dynamically selects items matched to a model’s capability level. Empirically, we find that Fluid Benchmarking enhances evaluation along several dimensions. For example, on MMLU, it achieves higher validity and lower variance than standard methods while using fifty times fewer items, thus simultaneously reducing evaluation costs and improving evaluation quality.

Learning item characteristics from language model crowds

Fluid Benchmarking is based on item response theory (IRT), a statistical framework from psychometrics that leverages the joint response patterns of many test takers to estimate latent item characteristics. In simple terms, items answered correctly by fewer test takers tend to be more difficult, and items with steep accuracy differences between stronger and weaker test takers tend to be more discriminative.

For Fluid Benchmarking, we adapt IRT to language models by drawing upon publicly available evaluation results from the Open LLM Leaderboard. We learn item characteristics for six benchmarks: ARC Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande.

The specific IRT model we use yields two parameters for each benchmark item:

  • Difficulty indicates the capability level at which a model has a 50% probability of answering the item correctly.
  • Discrimination indicates how sharply the item distinguishes between models of differing capabilities.
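
This two-parameter description corresponds to the standard two-parameter logistic (2PL) model from IRT. As a rough illustration of how such item characteristics can be learned from a crowd of models, the sketch below fits a 2PL model to a binary response matrix (one row per model, one column per item, as could be assembled from Open LLM Leaderboard results) by gradient ascent on the log-likelihood, jointly estimating a latent ability for every model in the crowd (the quantity discussed in the next section). The function names, initialization, and optimizer are illustrative choices, not the estimation procedure used in the paper.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability that a model with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly.
    At theta == b the probability is exactly 0.5, matching the definition of difficulty above."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fit_2pl(responses, n_steps=2000, lr=0.05, seed=0):
    """Fit model abilities and item parameters to a binary response matrix
    (rows = models, columns = items) by gradient ascent on the Bernoulli
    log-likelihood. Illustrative sketch only, not the paper's estimator."""
    rng = np.random.default_rng(seed)
    n_models, n_items = responses.shape
    theta = rng.normal(0.0, 0.1, n_models)   # model abilities
    log_a = np.zeros(n_items)                # log-discriminations (keeps a > 0)
    b = rng.normal(0.0, 0.1, n_items)        # item difficulties

    for _ in range(n_steps):
        a = np.exp(log_a)
        p = p_correct(theta[:, None], a[None, :], b[None, :])
        err = responses - p                  # gradient of the log-likelihood w.r.t. the logits

        g_theta = (err * a[None, :]).mean(axis=1)
        g_b = (-err * a[None, :]).mean(axis=0)
        g_log_a = (err * a[None, :] * (theta[:, None] - b[None, :])).mean(axis=0)

        theta += lr * g_theta
        b += lr * g_b
        log_a += lr * g_log_a
        theta -= theta.mean()                # pin the latent scale for identifiability

    return theta, np.exp(log_a), b
```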

Measuring model performance in ability space

In IRT, test takers – in our case, language models – are represented by a parameter called ability. Models with higher ability are on average more likely to answer items correctly.

Given a benchmark whose items have learned difficulty and discrimination parameters, along with a model’s responses to those items, we can use statistical methods such as maximum likelihood estimation to estimate the model’s ability on that benchmark. The estimated ability serves as an alternative to standard accuracy, offering a performance measure that incorporates item characteristics. For example, items with low discrimination contribute less to the estimated ability, whereas standard accuracy treats all items equally.
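
As a concrete sketch of this step, the function below computes a maximum likelihood ability estimate for a single model by a simple grid search over candidate ability values, given the administered items’ parameters and the model’s binary responses. Grid search is just one convenient estimator here; the names and details are illustrative rather than the paper’s implementation.

```python
import numpy as np

def estimate_ability(responses, a, b, grid=np.linspace(-4.0, 4.0, 801)):
    """Maximum likelihood ability estimate via grid search (illustrative sketch).

    responses: binary vector; responses[i] == 1 if the model answered item i correctly.
    a, b:      discrimination and difficulty parameters of the same items (NumPy arrays).
    """
    # Probability of a correct answer for every (candidate ability, item) pair.
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    p = np.clip(p, 1e-9, 1.0 - 1e-9)  # avoid log(0) at the edges of the grid
    # Bernoulli log-likelihood of the observed responses for each candidate ability.
    log_lik = (responses[None, :] * np.log(p) +
               (1 - responses[None, :]) * np.log(1.0 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]
```

Because low-discrimination items flatten the likelihood, they move the estimate around less than highly discriminative ones, which is the sense in which ability weights items by their characteristics while plain accuracy does not.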

Adapting evaluation items to language model capability

IRT provides a systematic way to adapt evaluation items to a language model’s capability.

In Fluid Benchmarking, when we evaluate a model, we begin with an item of average difficulty and estimate the model’s ability based on its response. For the next step, we select the item that is most suitable for the current ability estimate. After the model responds, we update the estimate and select the next item. This process repeats until the number of administered items reaches the allotted budget, at which point we compute the final ability estimate using the model’s responses to all administered items.
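
Putting the pieces together, a minimal version of this loop (reusing `p_correct` and `estimate_ability` from the sketches above) might look as follows. The `administer` callback, which runs the model on one item and returns 1 for a correct answer, and the choice of the median-difficulty item as the starting point are hypothetical interface details; the Fisher-information selection rule anticipates the next paragraph.

```python
import numpy as np

def fluid_benchmark(administer, a, b, budget=50):
    """Adaptive evaluation loop in the spirit of Fluid Benchmarking (illustrative sketch).

    administer: callable taking an item index and returning 1 if the evaluated
                model answers that item correctly, else 0 (hypothetical interface).
    a, b:       discrimination and difficulty parameters of all benchmark items (NumPy arrays).
    budget:     number of items to administer.
    """
    remaining = set(range(len(a)))
    asked, responses = [], []

    # Start with an item of roughly average difficulty.
    item = int(np.argmin(np.abs(b - np.median(b))))
    theta = 0.0
    for _ in range(budget):
        remaining.discard(item)
        asked.append(item)
        responses.append(administer(item))

        # Re-estimate ability from all responses collected so far.
        theta = estimate_ability(np.array(responses), a[asked], b[asked])
        if not remaining:
            break

        # Pick the unadministered item that is most informative at the current estimate.
        idx = np.array(sorted(remaining))
        p = p_correct(theta, a[idx], b[idx])
        item = int(idx[np.argmax(a[idx] ** 2 * p * (1.0 - p))])

    return theta, asked
```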

To determine the most suitable item at each step, we use Fisher information to quantify how informative each item is, given the current ability estimate. This approach can be shown to minimize the standard error of the ability estimate, yielding the most precise measurement of the model’s performance.
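
Assuming the 2PL item response function used in the sketches above, the Fisher information of item i at ability θ has a standard closed form:

```latex
I_i(\theta) = a_i^2 \, p_i(\theta) \bigl(1 - p_i(\theta)\bigr),
\qquad
p_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}} .
```

Information is largest when an item’s difficulty b_i sits close to the current ability estimate and grows with the square of its discrimination a_i, which is exactly the quantity the selection step in the loop above maximizes.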

Experimental results

In our paper, we apply Fluid Benchmarking to language model evaluation during pretraining, a setting where model capabilities evolve rapidly. We find that Fluid Benchmarking dynamically adapts to these changes, administering easier items early in training and progressively more difficult items later.

To assess whether this adaptivity improves evaluation quality, we compare Fluid Benchmarking against several baselines, including prior benchmarking methods based on IRT. We find that across all examined dimensions, Fluid Benchmarking consistently enhances evaluation quality.

For example, Fluid Benchmarking increases external validity: evaluation results generalize better to other benchmarks targeting the same capability.

One reason for the improved validity is that Fluid Benchmarking automatically avoids mislabeled items, yielding a 99% relative reduction in the number of mislabeled items administered.

Fluid Benchmarking also reduces the step-to-step variance in benchmark performance curves during pretraining, yielding a clearer learning signal.

Furthermore, it produces more monotonic training curves, delaying the saturation point at which a benchmark stops being informative for a given pretraining run.

All these benefits come with substantial gains in evaluation efficiency. We require far fewer items to achieve the same – or better – evaluation quality. On MMLU, for instance, Fluid Benchmarking attains higher validity and lower variance than standard methods while using 50 times fewer items, and often even outperforms evaluation on the full benchmark.

Conclusion and resources

Our work on Fluid Benchmarking shows that language model evaluations can be substantially improved by moving beyond the currently universal practice of static benchmarking, which assumes that a single set of evaluation items is optimal for all language models. Overall, we see our study as an important contribution to the never-ending challenge of improving AI evaluation methodology.
