Evaluating agents for scientific discovery

April 13, 2026

Ai2


Everyone's building AI science agents. But how do you know if they actually work?

Open any social media feed and you'll find teams announcing agents that design experiments, write code, and produce entire research papers. The claims are extraordinary. The evidence behind them, usually, is not. That's why we've spent years building benchmarks that test whether AI agents can actually do science. Two developed at Ai2 – ScienceWorld, released in 2022, and DiscoveryWorld, released in 2024 – have taken on new significance, as the capabilities of today's models have caught up to the challenges we designed the benchmarks to measure.

In 2022, the best AI models scored highly on multiple-choice grade-school science exams. But when those same models were asked to demonstrate that knowledge by performing experiments in a simple virtual environment, they scored below 10%, highlighting the difference between "book smarts" and "street smarts."

That virtual environment was ScienceWorld, our benchmark that tests whether agents can carry out elementary-school science experiments. Three years later, top frontier models (as of early 2025) score in the low 80s – real progress, but still short of fully solving a 4th-grade science curriculum. And on DiscoveryWorld, our harder benchmark that asks agents to design and execute their own scientific investigations, some of the best systems complete only ~20% of tasks at the higher difficulty levels – problems that human scientists with advanced degrees solve about 70% of the time.

"So many folks are jumping on the science agent bandwagon and releasing agents," says Ai2   Researcher Peter Jansen, who led development of ScienceWorld and DiscoveryWorld and has built much of the modern infrastructure enabling language models to be evaluated on text-based games. "But if the best systems a year ago couldn't even solve most of the easy problems in DiscoveryWorld, how likely is it that they're much better today?"

End-to-end scientific discovery, in simulation

DiscoveryWorld is the first benchmark built to test whether an agent can design and execute end-to-end scientific investigations from scratch. It takes place on Planet X, a hypothetical space colony in the not-so-distant future, where the player takes the role of one of the colony's scientists.

DiscoveryWorld contains 120 challenge tasks spanning eight topics – from proteomics and rocket science to radioisotope dating and epidemiology – across three difficulty levels, with parametric variations that change the data, solution, and environment layout each run. Tasks are set in fictional scientific contexts so agents can't fall back on prior knowledge: in one, an agent has to determine the cause of an illness outbreak; in another, it has to uncover the mathematical relationship governing a quantum reactor. Each requires forming hypotheses, designing experiments, running them, and analyzing results — often over hundreds of in-game actions. DiscoveryWorld scores not just whether the agent solved the task, but whether it followed a scientific process and whether it actually understood the discovery it made, distinguishing genuine insight from lucky guessing.
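
To make that three-part scoring concrete, here's a minimal sketch of how an evaluation harness might aggregate those signals. The names here (`TaskResult`, `completion`, `process`, `knowledge`) are illustrative stand-ins, not DiscoveryWorld's actual API:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Hypothetical container for the three signals DiscoveryWorld reports:
    task completion, procedural fidelity, and explanatory knowledge."""
    completion: float  # did the agent reach the correct answer? (0.0-1.0)
    process: float     # did it hypothesize, experiment, and analyze on the way?
    knowledge: float   # can it explain *why* its answer is correct?

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Average each signal separately, so a lucky guess (high completion,
    low process and knowledge) stays visible rather than being hidden
    inside a single aggregate number."""
    n = len(results)
    return {
        "completion": sum(r.completion for r in results) / n,
        "process": sum(r.process for r in results) / n,
        "knowledge": sum(r.knowledge for r in results) / n,
    }

# An agent that guesses the right answer without doing any science:
print(summarize([TaskResult(completion=1.0, process=0.2, knowledge=0.0)]))
# -> {'completion': 1.0, 'process': 0.2, 'knowledge': 0.0}
```

Reporting the signals separately is the point: an agent that stumbles onto the right answer scores high on completion but low on process and knowledge – exactly the lucky guessing the benchmark is designed to expose.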

As Jansen describes it, these evaluations measure the difference between "book smarts" (answering exam questions) and "street smarts" (using the scientific method to make new discoveries). While practicing human scientists can make every discovery in DiscoveryWorld, recent leading agents fail roughly 80% of its tasks at normal and challenge difficulty – knowing what a concept is and being able to apply it are entirely different things.

Despite this difficulty – or perhaps because of it – DiscoveryWorld has drawn wide interest. The paper has been cited nearly 80 times and covered by New Scientist. 

"We tend to release benchmarks that start out being very challenging, but they become much more popular a year or two later as models and methods catch up," Jansen says. "ScienceWorld was very much like that, and DiscoveryWorld seems like it's getting like that now. In fact, with models at their current price-to-performance ratio, I’d argue there’s never been a better time to test whether your agent can solve long-horizon scientific discovery tasks with DiscoveryWorld.”

Executing experiments at the elementary level

ScienceWorld is a more foundational benchmark. Where DiscoveryWorld tests open-ended discovery at a college or PhD level – designing novel investigations and interpreting ambiguous results – ScienceWorld asks whether an agent can "re-make" classic scientific discoveries at roughly an elementary-school level: the kinds of experiments found in today's science textbooks.

ScienceWorld places agents inside a text-based simulated world spanning ten interconnected locations – a kitchen, a workshop, a greenhouse, and others – populated with around 200 types of objects that behave as they would in a real lab: ice melts when heated, circuits conduct based on the materials used, and plants grow under the right conditions. Instead of picking the boiling point of water from a list of multiple-choice answers, an agent might be given an unknown substance, a thermometer, and a stove, and asked to figure out the boiling point itself. Agents issue text commands and receive descriptions of what happens next, working through 30 task types across categories like changing states of matter, mixing chemicals, and running Mendelian genetics crosses. Each of the 30 tasks has hundreds of randomized configurations, so an agent can't succeed by memorizing solutions—it has to generalize.
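
The interaction loop is text in, text out. Here's a minimal random-agent sketch using the `scienceworld` Python package (`pip install scienceworld`), modeled on the examples in the ScienceWorld repository; exact task names and signatures may differ across versions:

```python
# Minimal ScienceWorld interaction loop, modeled on the examples in the
# allenai/ScienceWorld repository; exact signatures may vary by version.
import random
from scienceworld import ScienceWorldEnv

env = ScienceWorldEnv("", envStepLimit=100)
print(env.getTaskNames())        # the 30 task types, e.g. "boil"

env.load("boil", 0, "")          # task name, variation index, simplifications
obs, info = env.reset()
print(env.getTaskDescription())

for _ in range(100):
    # A real agent would condition on `obs`; this one acts randomly,
    # sampling from the environment's list of currently valid actions.
    action = random.choice(info["valid"])
    obs, reward, done, info = env.step(action)
    if done:
        break

print("Final score:", info["score"])
```

Because each episode can draw a different variation index, an agent that memorized one walkthrough fails on the next draw; to score, it has to actually find a stove, heat the substance, and read the thermometer.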

At this level, the gap between knowing and doing is still wide. When ScienceWorld launched, the same models that received an "A" grade on the ARC science exam – a standard benchmark for scientific knowledge – failed more than 90% of ScienceWorld's tasks, despite both covering the same conceptual material. Knowing what a melting point is turns out to be a far cry from figuring out how to measure one.

Scores have climbed since then. TALES, a 2025 benchmark suite from Microsoft Research that includes ScienceWorld, found that leading models scored in the low 80s – a dramatic improvement from sub-10% three years earlier, but still short of fully solving the tasks.

"We hope that in the near future, science agents will help treat diseases, create new materials, and generate other important discoveries," Jansen says. "DiscoveryWorld and ScienceWorld help measure whether agents can begin that process by testing their end-to-end scientific capabilities in simplified virtual worlds. If an agent flunks basic science, what hope does it have of curing cancer?"

Benchmarks like DiscoveryWorld and ScienceWorld help test what science agents are actually capable of – and we're building them alongside systems that push the boundaries of what's possible, because making progress and measuring it are two sides of the same effort. DiscoveryWorld and ScienceWorld are open and freely available, with the goal of helping turn promising ideas into proven results.
