
AstaBench: Rigorous benchmarking of AI agents with a holistic scientific research suite

August 26, 2025

Ai2


AI agents are increasingly being applied to complex real-world problems like science. They hold the promise to revolutionize scientific productivity by automating reviews of the literature, replicating experiments, analyzing data, and even proposing new directions of inquiry. Indeed, there are now many agents, ranging from OpenAI and Google’s general-purpose deep research systems to specialized science-specific agents such as AI Scientist and AIGS.

But with so many different agents—many behind paywalls and all evaluated in their own bespoke ways—how are AI developers and the public to know which perform best? Our investigation found that existing evaluation methods could not answer this question, leading us to formulate a more rigorous approach to benchmarking agents and, guided by it, to build a new benchmark suite, AstaBench, that provides a better gauge of the state of the art.

AstaBench enables comparison of both general-purpose and specialized agents on a diverse set of over 2,400 problems across four areas where AI can help human scientists: literature understanding, code and execution, data analysis, and end-to-end discovery. Since simple mechanisms, such as directing an AI system to attempt a task many times and taking a majority vote over the results, are known to increase accuracy while running up inference costs, AstaBench evaluates the cost of running an agent as well as the quality of its answers.
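
To make that cost-accuracy tension concrete, here is a minimal sketch of repeated sampling with majority voting. It is not AstaBench code; the `run_agent` helper is a hypothetical stand-in for any agent invocation that returns an answer and its dollar cost.

```python
from collections import Counter

def run_agent(problem: str) -> tuple[str, float]:
    """Hypothetical stand-in: run an agent once and return (answer, dollar_cost)."""
    raise NotImplementedError

def majority_vote(problem: str, n_samples: int = 5) -> tuple[str, float]:
    """Answer by running the agent n_samples times and keeping the most common answer.

    Accuracy often rises with n_samples, but cost scales linearly with it,
    which is why AstaBench reports cost alongside score.
    """
    answers, total_cost = [], 0.0
    for _ in range(n_samples):
        answer, cost = run_agent(problem)
        answers.append(answer)
        total_cost += cost
    best_answer, _ = Counter(answers).most_common(1)[0]
    return best_answer, total_cost
```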

Also, to ensure a level playing field, we're releasing a series of baseline agents together with the first set of standard tools for controlled, reproducible evaluation, including realistic, production-grade search tools that operate over a comprehensive collection of scientific papers. This allows our measurements to isolate AI reasoning capabilities from mere access to knowledge.

Later in this post, we go into more detail about AstaBench and also explain:

  • Our new criteria for rigorous benchmarking of agents that guided AstaBench's design
  • A comparison to existing benchmarks for AI agents highlighting AstaBench’s unique value
  • The new tooling and resources we built to support benchmarking AI agents

AstaBench leaderboard early results and analysis

But first, let’s summarize some findings from initial tests of 57 agents across 22 classes (or types) of agent architecture—including both general-purpose and task-specific agents, open-source and closed agents, and systems powered by open- and closed-weight models (full details are available in the technical report and leaderboard). Not every agent could handle the full range of problem areas, so the overall (cross-category) leaderboard shows the 18 agent instances that could.

AI-powered scientific research assistance is still far from solved, as evidenced by the generally low overall scores on AstaBench, with the best (highest-scoring) agent – our own Asta v0 (which uses a mixture of LLMs depending on the task*) – scoring 53.0%. The fact that Asta v0 scores about 10 points higher than the next best agent, ReAct with gpt-5 (43.3%), also shows the power that comes from designing a special-purpose agent for scientific tasks. However, this higher score comes with the tradeoff of significantly higher development (engineering) cost and, for certain tasks – specifically those in the area of end-to-end discovery – higher runtime cost.

The most economical agent is ReAct powered by claude-3-5-haiku (scoring 20% at a minimal cost of $0.03 per problem). At a marginally higher cost ($0.04), ReAct using gpt-5-mini scores a surprisingly high 31%, within close reach of much costlier systems.

Data analysis was one of the hardest categories, with no agent scoring above 34%. This suggests that much more work is needed to develop agents that can reliably analyze structured datasets and generate meaningful scientific hypotheses from them.

Literature understanding is the most mature area for scientific research agents, with 44 agent instances able to solve problems in one of the benchmarks in this domain. Interestingly, none of the external/commercial scientific research agents were able to perform the full range of AstaBench research tasks, though many did well in literature understanding. 

For the task of scientific question answering, our agent Asta Scholar QA, Elicit, and SciSpace Deep Review are the best tools on these tests (all score about 85% or higher on ScholarQA-CS2). The other external/commercial agents are not far behind, but they also don't beat a simple ReAct agent powered by a good language model with access to a comprehensive textual index.

Among literature search agents, our Asta Paper Finder stands out as a remarkably impressive system, scoring more than double its closest rival (ReAct) on PaperFindingBench and 15% above it on the LitQA2-FullText-Search benchmark.

Language model performance can be unintuitive: powering a general agent with a more expensive model can lower the overall cost. As just one example, even though the per-token cost of o3 is 6-10x higher than that of gemini-flash, a ReAct agent powered by gemini often takes more steps or gets stuck in loops, causing it to cost 4x more per problem while achieving half the score.

Unfortunately, even the best open-weight LLMs lag far behind their closed competitors for scientific agent control. Smolagents Coder with llama-4-scout was the best open-weight system, but its 12.4% score is far behind Asta v0 (53.0%) or even ReAct with gpt-5 (43.3%). While gpt-5 did notably well when powering a general ReAct agent (~3% gain over o3 and claude-sonnet-4), it didn’t provide comparable gains when guiding our more specialized science agents: Asta Scholar QA, Asta DataVoyager, and Asta Code all perform worse with gpt-5 than with previous models, while ReAct performs much better. This leads us to speculate that gpt-5 may have been specially post-trained to control the increasingly common ReAct-style workflows.

These are just a few conclusions from our initial analysis. Please see the technical report for details or check out the AstaBench leaderboard.

*The current version of Asta v0 routes each problem to one of five task-specific helper agents, overall using five language models: claude-sonnet-4, gemini-2.0-flash, o3, gpt-4.1, and gpt-4o.

**In AstaBench, for general solvers, model/agent reasoning effort is always set to "medium"—the default. For task-specific solvers, the reasoning effort varies. Work is underway to add reasoning effort to the leaderboard.

Why create a whole new benchmark suite for agents?

As we started developing our own science agents, we found that existing agent evaluations all had deficiencies (Table 1), so we formulated several criteria for rigorous benchmarking of agents in general:

  1. The task suite must reflect the complexity of real-world usage. This requires a diverse set of challenging problems that are informed by real-world usage data, which is typically guarded by product companies.
  2. Agent tools must support reproducible experiments, even as knowledge changes. Available information, and hence the correct answers to some questions, changes day-to-day (e.g., as new papers are published). Fair assessment must control for these factors (e.g., by specifying a date cutoff for information being retrieved with answers), yet the community today lacks large-scale reproducible search tools for agents.
  3. Reporting must account for confounding variables, especially computational cost and tool usage. It’s essential to account for cost, since even simplistic strategies, such as taking a majority vote over repeated invocations, can boost accuracy by burning cash. And by controlling for tool usage, we can separate gains due to agent architecture from benefits attributable to privileged access to specialized information sources. Current leaderboards fail to properly account for these variables.
  4. Task interfaces must be standardized to facilitate integration of general agents. General agents that can perform many different tasks are likely to better meet diverse real-world needs. Unfortunately, most previous benchmark suites require general agent developers to adapt agents for individual tasks, introducing developer bias and hindering development.
  5. Baselines must be strong and diverse to support state-of-the-art claims and future development. Current agent suites lack comprehensive baselines, making it hard to know if reported performance is truly state-of-the-art and, if so, what aspect of the agent design caused the gains.

The remainder of this post details how AstaBench addresses these criteria.

Benchmark suite with broad coverage of realistic scientific research tasks

AstaBench satisfies these criteria in part through its holistic set of over 2,400 problems, many of them based on real user requests to Asta agents. These problems are arranged into 11 benchmarks across four high-level categories: literature understanding, code and execution, data analysis, and end-to-end discovery.

Each AstaBench problem has an associated scoring rubric that describes what constitutes a good answer to that specific question. We use the “LLM-as-a-judge” paradigm to grade each submission using the appropriate rubric. 
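
As an illustration of that grading flow, here is a minimal sketch; it is not the actual AstaBench grader, and the judge prompt wording and the `call_llm` helper are hypothetical placeholders.

```python
import json

# Hypothetical judge prompt; the real rubrics and prompts ship with AstaBench.
JUDGE_PROMPT = """You are grading a scientific answer against a rubric.

Rubric:
{rubric}

Answer to grade:
{answer}

Return JSON: {{"score": <number between 0 and 1>, "justification": "<one sentence>"}}"""

def call_llm(prompt: str) -> str:
    """Hypothetical helper that sends the prompt to a judge model and returns its reply."""
    raise NotImplementedError

def grade_with_rubric(answer: str, rubric: str) -> float:
    """Score a submission against its problem-specific rubric with an LLM judge."""
    reply = call_llm(JUDGE_PROMPT.format(rubric=rubric, answer=answer))
    return float(json.loads(reply)["score"])
```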

Our benchmarks are ready for use by new agents. They set a new compatibility standard with interfaces decoupled from agents, self-sufficient instructions, and standard task tools and submission procedures.

A standard scientific research environment for agents

Scientific research builds upon past discoveries, so understanding the literature is essential. Unfortunately, previous benchmarks lack standardized tools for accessing the literature and don't control for the fact that new discoveries may change the correct answers to questions over time, making it hard to attribute performance differences to agentic reasoning as opposed to differing retrieval corpora.

AstaBench provides the first high-quality, standard scientific research execution environment for agents. The Asta Scientific Corpus tool, part of Asta resources and integrated into AstaBench, lets agents search and traverse a production-grade (large-scale) literature corpus, with date-restricted access for enhanced reproducibility. Agents can also perform reproducible experiments with the Computational Notebook tool in our sandboxed execution environment. Like our benchmarks, our tools are ready for use by new agents: they are cleanly decoupled from agents (unlike other suites), provide easy integration via the Model Context Protocol (MCP), and are callable from both host processes and sandboxed code execution.
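
To illustrate what MCP-based integration looks like, here is a minimal sketch using the MCP Python SDK; the server command (`asta-corpus-mcp`), the tool name (`search_papers`), and its argument names are hypothetical placeholders rather than the actual Asta interface.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical server command; consult the Asta resources docs for the real one.
SERVER = StdioServerParameters(command="asta-corpus-mcp", args=[])

async def search_with_cutoff(query: str, inserted_before: str):
    """Call a (hypothetical) date-restricted paper-search tool over MCP,
    so retrieved results stay reproducible as the literature grows."""
    async with stdio_client(SERVER) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search_papers",
                arguments={"query": query, "inserted_before": inserted_before},
            )
            return result.content

if __name__ == "__main__":
    papers = asyncio.run(search_with_cutoff("agent benchmarking", "2025-01-01"))
    print(papers)
```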

Leaderboards that level the playing field for agents

We created the agent-eval Python package to power agent benchmark suites and leaderboards—including the AstaBench leaderboard—that provide more level comparisons and enhanced reproducibility. Under the hood for individual benchmarks, our package leverages the U.K. AI Security Institute's open-source Inspect evaluation framework, which provides detailed logging, debugging interfaces, broad model and eval compatibility, and more.
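
For readers unfamiliar with Inspect, the following is a toy example of defining an Inspect task in the usual way; the question, target, solver, and scorer are placeholders for illustration only and are not part of agent-eval or AstaBench.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate

@task
def toy_lit_qa() -> Task:
    """A toy Inspect task; Inspect handles model calls, logging, and scoring."""
    return Task(
        dataset=[
            Sample(
                input="Which paper introduced the transformer architecture?",
                target="Attention Is All You Need (Vaswani et al., 2017)",
            )
        ],
        solver=generate(),         # placeholder solver; real agents plug in here
        scorer=model_graded_qa(),  # model-graded scoring, akin to LLM-as-a-judge
    )

# Run from the command line, e.g.:
#   inspect eval toy_lit_qa.py --model openai/gpt-4o
```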

Time-invariant cost reporting

Importantly, the AstaBench leaderboard shows the Pareto frontier across reasoning accuracy and computational cost, since there are common ways to trade one for the other. Our displayed costs are time-invariant; model usage details logged (locally) by Inspect enable our package to calculate monetary costs across all submissions using consistent pricing even as provider costs change over time.
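
As a rough illustration of both ideas, the sketch below prices logged token usage with a fixed table and then computes a cost-score Pareto frontier. The model names and prices are made up, and the real agent-eval package derives usage from Inspect's logs rather than from a hand-built list.

```python
# Made-up prices (USD per million input/output tokens), for illustration only.
FIXED_PRICES_PER_MTOK = {
    "model-a": (2.50, 10.00),
    "model-b": (0.15, 0.60),
}

def time_invariant_cost(model_calls: list[dict]) -> float:
    """Price every submission with the same fixed table, so reported costs stay
    comparable even if providers change their prices later."""
    total = 0.0
    for call in model_calls:
        in_price, out_price = FIXED_PRICES_PER_MTOK[call["model"]]
        total += call["input_tokens"] / 1e6 * in_price
        total += call["output_tokens"] / 1e6 * out_price
    return total

def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep only (cost, score) points not beaten by a cheaper, higher-scoring point."""
    frontier, best_score = [], float("-inf")
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier
```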

Traceable logs and source code

Our package collects submission source code and logs for display on the leaderboard, warning users about reproducibility issues like uncommitted or changed source code during their experiments.
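
A reproducibility check of this kind can be as simple as asking git whether the working tree is clean; the following is a rough sketch of the idea, not the actual agent-eval implementation.

```python
import subprocess

def warn_if_uncommitted(repo_path: str = ".") -> bool:
    """Return True (and warn) if the repo has uncommitted changes, meaning the
    logged source code may not match the code that actually ran."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    dirty = bool(status.stdout.strip())
    if dirty:
        print("Warning: uncommitted changes detected; results may not be reproducible.")
    return dirty
```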

Accounting for submissions with less controlled measurement or lower transparency

The AstaBench leaderboard highlights controlled and transparent submissions – which enhance understanding and trust – and warns about results that may not be comparable due to the use of external tools, closed source code or model weights, or missing logs or cost information.

A comprehensive collection of standardized AI agents, with new scientific research agents

As part of Asta resources, we are releasing the agent-baselines suite of 22 classes of AI agents: a new collection of nine open-source Asta scientific research agents (including Asta v0) that demonstrate leading performance on AstaBench tasks, plus many baseline agents ranging from well-known agents from the scientific literature to wrappers around closed products. Unlike other agent suites, our agents have broad benchmark compatibility and local model cost reporting when run in Inspect.

We open-source these agents both to illuminate their strengths and weaknesses and to provide easy starting points for developers to extend into novel approaches. Our suite enables measuring the capabilities of general-purpose agents, such as those based on a ReAct or deep research architecture, alongside agents optimized for a specific research task (e.g., Perplexity and Elicit for literature synthesis). As we show through our experiments, these cross-architecture comparisons are especially important for understanding tradeoffs between generality and expertise (and for uncovering new methods of agentic control).

Join us

AstaBench represents a major step towards better agent evaluation, but it is just the beginning of a new era for scientific AI. By providing a transparent, rigorous, and extensible framework, we hope to accelerate research on both AI agency and scientific reasoning, empowering both researchers and developers to push the boundaries of what science agents can do.

We have much more in the works: 

  • We are actively pushing the performance-cost frontiers in AstaBench and closing the gap for truly open agents by developing new agent techniques, tools, and open models specialized for scientific research. 
  • We are continuing to refine our LLM-as-a-judge grading procedures, especially for challenging scientific discovery tasks. 
  • We plan to release fresh benchmark problems that use the latest scientific knowledge—which combats contamination by virtue of being past the model training cut-off dates.
  • We also plan to release new challenging benchmarks that test more aspects of collaboration with humans, and deepen coverage of problems in impactful fields such as biomedicine. 
  • Finally, we are committed to measuring the latest advances—both by testing the newest models and expanding our agent-baselines suite of agents.

Join us as we chart the path toward smarter, more reliable agents that are able to make new scientific discoveries.