Ai2

PreScience: Forecasting the future of science end-to-end

February 25, 2026



Every scientific paper starts with a series of choices: who to work with, which prior work to build on, what scientific contribution results from combining them, and how to communicate it. Then the community decides how much attention those results deserve. These choices unfold over months and years, shaped by an enormous and constantly evolving body of research—and they're what ultimately determine the direction of scientific advances in a field.

Understanding this process well enough to forecast it would be a meaningful test of how deeply AI systems grasp the dynamics of science. Can they, given the scientific record up to a fixed point in time, predict what comes next—not just one piece of the puzzle, but the whole picture, from team formation through to eventual impact?

That's the question behind PreScience, a new benchmark we've built with the University of Chicago, supported by the U.S. National Science Foundation (NSF) through the NSF Global Observatory and Virtual Laboratory for Science and Technology Advance, to evaluate scientific forecasting across the entire research workflow. PreScience breaks a scientific advance into four composable stages – team formation, literature selection, contribution generation, and impact prediction – that can be chained into a full simulation of how a field evolves month by month. When we ran a 12-month simulation of AI research, the results revealed something striking: the simulated corpus was systematically less diverse and less novel than what human researchers actually produced, and the bottleneck wasn't in selecting teams or prior work—it was in the generation step itself.

Our hope is that PreScience accelerates progress toward AI that can anticipate where a field is heading and eventually help researchers get there faster. The dataset, evaluation suite, and tech report are all available now.

Why we built PreScience and how it works

Most existing evaluations for automating aspects of scientific research focus on narrow or synthetic tasks—can a model write a plausible abstract, or predict a citation count? Researchers have studied more substantive questions too, like predicting future collaborations, anticipating novel idea combinations, forecasting follow-up work, and estimating publication impact—but always in isolation. In practice, these are interdependent stages in the lifecycle of a single scientific contribution. Teams form around shared interests and complementary expertise, draw on prior research, contribute something new, and the community responds over time. Each stage feeds into the next, and studying them separately limits what you can learn about the process as a whole.

PreScience is the first large-scale benchmark to treat them jointly. It's grounded in real papers, real authors, and real citation histories from arXiv, spanning seven AI subcategories including computational linguistics, machine learning, computer vision, and information retrieval. For each paper, PreScience includes the title and abstract as the prediction target, along with the paper's key references, the publication histories of its authors, and other relevant metadata.

The dataset covers ~100K target papers published between October 2023 and October 2025, drawn from a broader corpus of over 500K papers and nearly 183K unique authors. Models can use papers from October 2023 through October 2024 and are evaluated on the following year, so they're always forecasting into the future rather than interpolating within a known period. And because much of the data post-dates the training cutoff of frontier models, there's no risk of contamination leaking into the results.

Several additional design choices ensure the dataset is clean enough to reflect genuine forecasting ability rather than noise or shortcuts. Author identities are disambiguated via a method that improves clustering quality. Target papers are filtered to those with one to ten key references, removing outliers that would be trivially easy or nearly impossible to predict. And all metadata (e.g., citation counts, h-indices, and publication histories) is temporally aligned to each paper's publication date, so models never accidentally see information from the future. Together, these choices prevent information leakage into the benchmark test set.

With that foundation in place, PreScience breaks down a scientific advance into four interdependent tasks, each reflecting a key decision point in how research unfolds:

  1. Collaborator prediction. Given an author and the current state of the field, who will they work with next? This reflects the social and topical dynamics of how research teams come together.
  2. Prior work selection. Given a team, which papers from the existing literature will they build on? This tests whether models can identify the most relevant foundations for a new contribution.
  3. Contribution generation. Once a team and the prior work are established, what will the paper actually say? Models must produce a title and abstract that represent a plausible new contribution.
  4. Impact prediction. Once a paper exists, how much attention will it receive? Models forecast how many citations it’ll accumulate in its first year.

The tasks are composable. You can study each one in isolation, or chain them together into a multi-step "science simulator"—predicting teams, generating their papers, folding those papers back into the literature, and repeating month by month.
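To make the chaining concrete, here is a minimal sketch of that loop. All function names and data shapes below (`predict_team`, `select_references`, `generate_paper`, `predict_impact`) are illustrative stand-ins, not the benchmark's actual API; each stand-in body is a trivial placeholder for what would be a learned model or an LLM call.

```python
# Sketch of composing PreScience's four stages into a month-by-month
# "science simulator". Every function here is a toy placeholder.

def predict_team(authors, corpus):
    # Stand-in: just pick two authors alphabetically; a real system would
    # model collaboration dynamics and the current state of the field.
    return sorted(authors)[:2]

def select_references(team, corpus, k=3):
    # Stand-in: take the k most recent papers in the corpus.
    return corpus[-k:]

def generate_paper(team, references, month):
    # Stand-in for an LLM call that produces a title and abstract.
    return {"title": f"Synthetic paper (month {month})",
            "team": team, "references": references}

def predict_impact(paper, corpus):
    # Stand-in: more references -> higher predicted first-year citations.
    return len(paper["references"])

def simulate(authors, seed_corpus, months=12, papers_per_month=1):
    corpus = list(seed_corpus)  # the literature grows as the simulation runs
    generated = []
    for month in range(months):
        for _ in range(papers_per_month):
            team = predict_team(authors, corpus)
            refs = select_references(team, corpus)
            paper = generate_paper(team, refs, month)
            paper["predicted_citations"] = predict_impact(paper, corpus)
            corpus.append(paper)  # fold the new paper back into the literature
            generated.append(paper)
    return generated

seeds = [{"title": f"seed {i}", "references": []} for i in range(5)]
papers = simulate(["ana", "bo", "cy"], seeds, months=12)
print(len(papers))  # 12
```

The key structural point is the `corpus.append(paper)` line: each month's generated papers become candidate prior work for subsequent months, which is what lets the simulation drift away from (or toward) what the real field did.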

A new way to evaluate generated contributions: LACERScore

Measuring the quality of a generated scientific contribution is its own challenge. Standard text-similarity metrics like ROUGE or BERTScore can tell you whether two abstracts share surface-level characteristics, but that alone doesn't tell you much about whether they describe the same scientific finding. Two abstracts might use very different language for closely related results, or share substantial vocabulary while describing fundamentally different work.

To address this, PreScience introduces LACERScore, a calibrated evaluation metric that uses a language model as a judge. Given a generated abstract and the real one, the model rates their alignment on a 1-to-10 scale, guided by automatically constructed reference examples that anchor what different score levels mean. In practice, LACERScore tracks expert human judgments closely—approaching the level of agreement between human annotators themselves and substantially outperforming prior automatic metrics.
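A minimal sketch of what a reference-anchored judge prompt can look like is below. The anchor descriptions and the prompt wording are hypothetical, written only to illustrate the idea of anchoring score levels with examples; the actual LACERScore prompt and calibration procedure are in the tech report.

```python
# Illustrative sketch of a reference-anchored LLM-as-judge prompt, in the
# spirit of LACERScore. The anchors below are made-up placeholders.

ANCHORS = [
    (2, "Abstracts share vocabulary but describe different problems."),
    (5, "Abstracts address the same problem with different methods."),
    (9, "Abstracts describe the same method and the same findings."),
]

def build_judge_prompt(generated_abstract, real_abstract):
    # Anchors turn an otherwise fuzzy 1-10 scale into calibrated levels.
    anchor_text = "\n".join(f"Score {s}: {desc}" for s, desc in ANCHORS)
    return (
        "Rate how well the generated abstract aligns with the real one "
        "on a 1-10 scale, using these reference levels:\n"
        f"{anchor_text}\n\n"
        f"Real abstract: {real_abstract}\n"
        f"Generated abstract: {generated_abstract}\n"
        "Answer with a single integer."
    )

prompt = build_judge_prompt("We propose ...", "We introduce ...")
print("Score 5" in prompt)  # True
```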

What we found

Even with strong baselines and frontier models, there's a lot of room for improvement across all four tasks in PreScience:

In collaborator prediction, a simple heuristic based on how often two authors have co-published in the past outperforms all of the more complex ML baselines. And when it comes to predicting first-time collaborations, where two researchers have never worked together before, none of the baselines can do it reliably. 
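The co-publication heuristic is simple enough to sketch in a few lines. The data shapes here are illustrative, not the benchmark's actual format; the point is that a pure frequency count over past co-authorships is a strong baseline, and also why it necessarily scores zero for first-time collaborations.

```python
# Minimal sketch of the co-publication-count heuristic: rank candidate
# collaborators by how often they have co-authored with the query author.
from collections import Counter

def rank_collaborators(query_author, past_papers):
    counts = Counter()
    for authors in past_papers:
        if query_author in authors:
            for coauthor in authors:
                if coauthor != query_author:
                    counts[coauthor] += 1
    # Never-before-seen pairs get no score at all, which is exactly why
    # this heuristic cannot predict first-time collaborations.
    return [author for author, _ in counts.most_common()]

history = [["ana", "bo"], ["ana", "bo"], ["ana", "cy"], ["bo", "dee"]]
print(rank_collaborators("ana", history))  # ['bo', 'cy']
```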

Prior work selection is also quite difficult for models. The best baseline achieves an nDCG (a standard ranking metric) of only about 0.13, meaning models struggle to identify which specific papers from the literature a team will cite even when given access to the authors' full publication histories.

In contribution generation, frontier LLMs produce abstracts that are only moderately similar to the real contributions. GPT-5, the strongest model we tested, averages roughly 5.6 out of 10 on LACERScore. Larger and more recent models generally improve over smaller or earlier ones, but the gains are incremental rather than dramatic. To put that in context, a simple paraphrase of the abstract scores much higher, so there's a meaningful gap between what models generate and what researchers actually wrote.

And in impact prediction, even the best combinations of features leave substantial prediction error. Highly cited papers – arguably the most important ones to forecast – are systematically the hardest to get right.
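As a hypothetical illustration of the feature-based setup, a common baseline shape is a regression on log-transformed citation counts (the log tames citation data's heavy tail). The feature choice and all numbers below are made up for illustration.

```python
# Toy sketch of a feature-based impact predictor: ordinary least squares
# on log(1 + first-year citations), using a single made-up feature.
import math

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical feature: first author's h-index at publication time.
h_index = [2, 5, 10, 20]
citations = [1, 4, 12, 40]
slope, intercept = fit_line(h_index, [math.log1p(c) for c in citations])

def predict(h):
    # Invert the log1p transform to get back to citation counts.
    return math.expm1(slope * h + intercept)

print(predict(15) > predict(3))  # higher h-index -> more predicted citations
```

The log transform also hints at why highly cited papers are the hardest cases: errors that look small in log space translate into very large absolute errors in the heavy tail.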

The bigger question: A full year of simulated science

The individual task results are revealing on their own, but PreScience is ultimately designed for a more ambitious question: what happens when you compose the four stages into a full simulation?

We ran 12-month simulations, predicting research teams each month, selecting their prior work, generating papers, and adding those papers back into the evolving literature for subsequent months. The result is a synthetic corpus of AI research that we can compare against what human researchers actually produced over the same period.

The headline finding is that the simulated corpus is systematically less diverse and less novel than real research. Individual generated papers aren't wildly off-base—each one is roughly as different from prior work as a real paper would be. But they tend to cluster together. As the simulation progresses, newly generated papers become increasingly similar to each other, converging on a narrower range of ideas than what the real research community explored.

Figure: Simulated (synthetic) papers are (A) less diverse and (B) trend toward being less novel than ground-truth (natural) papers from the same time period. When novelty is measured relative to the fixed pre-simulation corpus (C), the trend disappears. H<t: the corpus prior to generating a new paper (includes synthetic generations). H<t0: the corpus prior to the test period (real papers only).

The diagnostic value of PreScience shows up clearly here. The sets of authors and prior work surfaced during the simulations are actually more diverse than their real-world counterparts. The upstream stages (team formation and literature selection) aren't the bottleneck—it's the contribution generation step where diversity collapses. The language model, given a diverse set of inputs, still tends to produce outputs that are more homogeneous than what real researchers would write. 
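One simple way to make "diversity collapse" measurable: compute the average pairwise cosine distance between paper embeddings, and compare the synthetic corpus against the real one. The toy vectors below stand in for real paper embeddings; the report's actual diversity and novelty metrics may differ.

```python
# Average pairwise cosine distance as a corpus-diversity score: a lower
# score means the papers cluster around a narrower range of ideas.
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return 1.0 - dot / norm

def mean_pairwise_distance(embeddings):
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

spread = [(1, 0), (0, 1), (1, 1)]           # papers pointing different ways
clustered = [(1, 0.9), (1, 1.0), (1, 1.1)]  # near-duplicates of one idea
print(mean_pairwise_distance(spread) > mean_pairwise_distance(clustered))  # True
```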

What's next and how to get involved

PreScience highlights several open challenges in scientific forecasting: predicting first-time collaborations, surfacing relevant prior work, generating contributions that match real-world novelty, and anticipating which papers will have outsized impact. Each of these connects directly to the kinds of AI tools that could meaningfully help researchers, from recommending co-authors and navigating the literature to proposing promising directions and anticipating a paper's influence. If we want AI systems that support real discovery, we need evaluations grounded in how science actually happens.

We see PreScience as a living benchmark. As the scientific record grows, so should our ability to test forecasting methods against it. Looking ahead, we're excited to explore richer context signals like institutional affiliations, venues, and funding sources, as well as multimodal scientific artifacts like figures and tables.

PreScience ships with training and test corpora, author mappings, baseline implementations, and evaluation scripts. You can find everything via our GitHub and Hugging Face repos—read our tech report here.


PreScience is developed at Ai2 in collaboration with the University of Chicago. This material is based upon work supported by the U.S. National Science Foundation under Award No. TF-2404109. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
