AstaBench update: New results, plus adoption from industry
April 30, 2026
Ai2
AstaBench, our open benchmark for measuring the scientific research capabilities of AI agents, has a new round of results.
We've tested the strongest frontier models, including GPT-5.5, on more than 2.4K research problems and updated the leaderboard. AI has gotten rapidly better at coding, reasoning, and language tasks since we published AstaBench last August, and we wanted to know how much of that transfers to the harder, messier work of doing scientific research.
We're also pleased to share that AstaBench is gaining adoption beyond Ai2, with pickup from the UK AI Security Institute (UK AISI) and General Reasoning, and agent submissions from organizations including Elicit, SciSpace, Distyl AI, and EvoScientist.
We built AstaBench to give the field a shared, transparent way to track whether AI can perform grounded scientific research. Everything is open, so anyone can run the benchmark, submit to the leaderboard, or build on the tools. With new results and growing partnerships, AstaBench is gaining traction as an authoritative benchmark for scientific AI.
What is AstaBench?
With dozens of AI agents and models now available for scientific work – many accessible only through proprietary APIs and all tested differently – it's hard to know which ones perform well on challenging research tasks. That's why we created AstaBench, which we released alongside Asta, our open ecosystem for capable scientific AI agents.
The benchmark tests agents on thousands of problems in four categories: finding and understanding the scientific literature, writing and executing code, analyzing datasets, and running end-to-end discovery workflows. The evaluation framework, tools, and a large collection of baseline agents – both general-purpose and science-optimized, providing starting points for developers to extend and compare against – are all open-source. Learn more in our AstaBench paper, which appeared as an oral presentation at the International Conference on Learning Representations (ICLR) 2026.
When we first published AstaBench results in August, the top-scoring agent, our Asta v0, which routes tasks to specialized sub-agents, achieved an overall score of ~53% across all problem categories. But performance was uneven. While agents did reasonably well on focused tasks like literature search and code execution, end-to-end discovery was a different story. On E2E-Bench-Hard, a subtask in AstaBench that asks an agent to take a research idea all the way to working code and a written report with no simplification or scaffolding, our best agent completed only 3% of tasks perfectly end-to-end. In practice, it often completed roughly 60-70% of the required steps successfully, but still failed to finish the full task.
AI can help with individual steps of the scientific process, but stringing them together into a complete workflow remains a much harder problem.
New models we tested
Several major models have shipped since we released AstaBench, many with stronger reasoning. We wanted to know whether they tackle scientific tasks more effectively, or whether AstaBench exposes challenges that better models alone don't solve.
We ran the following models using the ReAct agent framework:
- Claude Opus 4.7, Claude Opus 4.6, and Claude Sonnet 4.6 with extended thinking (max effort, adaptive thinking)
- GPT-5.5 and GPT-5.4 (xhigh reasoning)
- Gemini 3.1 Pro Preview with high thinking
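For a sense of what one of these runs looks like mechanically, here is a minimal, hypothetical sketch using the Python API of Inspect, the UK AISI evaluation framework that AstaBench builds on. The task and model identifiers are placeholders, not the names behind the leaderboard runs; the real task registry and ReAct baseline live in the AstaBench and agent-baselines repositories.

```python
# Hypothetical sketch: looping one agent configuration over several models.
# The task name and model IDs below are placeholders, not the identifiers
# used for the leaderboard runs.
from inspect_ai import eval

MODELS = [
    "anthropic/claude-example",  # substitute real provider/model IDs
    "openai/gpt-example",
    "google/gemini-example",
]

for model in MODELS:
    eval(
        "astabench/example_task",  # placeholder: see the AstaBench repo for real task names
        model=model,
        limit=20,                  # score a small subset as a smoke test before a full run
    )
```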
On the aggregate leaderboard, Claude Opus 4.7 ranks first at 58.0% overall with an average cost of $3.54 per problem, followed by Claude Opus 4.6 at 55.3% and Claude Sonnet 4.6 at 54.5%. GPT-5.5 reaches 52.9% at $1.61 per problem, placing it just behind Asta v0 (53.0%) and making it the strongest non-Claude frontier run in this round. Gemini 3.1 Pro Preview reaches 49.6%, and GPT-5.4 comes in at 46.5%. Interestingly, all these results are on the quality-cost Pareto frontier relative to one another, and any may be preferable depending on the desired quality-cost tradeoff.
Compared to last year's initial frontier model results, the new runs show four clear shifts:
- Top scores improved substantially overall, though the benchmark remains far from solved.
- Gains are uneven across categories: top scores increased substantially in Code & Execution and End-to-End Discovery, but only moderately in Data Analysis and Literature Understanding.
- Costs are up sharply across providers, with the strongest-performing Claude configurations being the most expensive in absolute terms.
- GPT-5.5 raises the ceiling for non-Claude frontier models, especially on component tasks, while still struggling with the hardest end-to-end workflows.
The category results reveal a split field. Among the current frontier runs, GPT-5.5 now leads Code & Execution and Data Analysis and narrowly leads the top Claude run on Literature Understanding. Claude Opus 4.7 still leads End-to-End Discovery, with the caveat that End-to-End Discovery is also judged by a Claude model.
Across the frontier runs, better performance usually comes with higher average cost—and that pattern is strongest in the Claude family, which also produces the top overall results. Within the Claude runs, Opus 4.7 improves on Opus 4.6 by 2.7 points overall, but at a steep cost: about 62% more per problem. Most of the cost and score increases come from End-to-End Discovery, where Opus 4.7 wins by 10.2 points (17%) but takes 54% more steps and costs 65% more. Some of the cost increase likely reflects Opus 4.7's new tokenizer, which is known to scale token counts by 1.0–1.35× for the same text. Notably, Opus 4.7 loses slightly to 4.6 on Code & Execution despite costing more, suggesting it isn't purely an improvement.
GPT-5.5 changes the cost-performance picture. It comes within 5.1 points of Opus 4.7 overall while costing less than half as much per problem, and it leads several category-level evaluations at lower cost than the top Claude run. But its weaker End-to-End Discovery result shows that strong performance on coding, literature understanding, and data analysis does not automatically translate into robust end-to-end scientific work.
GPT-5.4 and Gemini 3.1 Pro Preview now sit below GPT-5.5 overall, though both remain competitive in some component categories at lower cost. Data Analysis remains relatively inexpensive across the new frontier runs, with top results between $0.18 and $0.44 per problem, while the highest-scoring End-to-End Discovery runs remain much more expensive. Recent gains are largest in the hardest workflows, and so are the costs.
Taken together, the metrics suggest frontier models are improving quickly on scientific tasks, but not evenly—and there is still far to go. GPT-5.5 raises the ceiling on several component skills, especially Code & Execution and Data Analysis. But the hardest benchmark category still separates models that can solve individual scientific subtasks from agents that can carry out a full research workflow end to end.
For a full breakdown of the results, view the updated AstaBench leaderboard.
Scoring model update: The models we previously used to score ScholarQA-CS2 and End-to-End Discovery were deprecated, so we swapped in newer versions from the same model families, following each provider's recommended upgrade path. The new End-to-End Discovery scorer is notably stricter, penalizing fabricated results and placeholder code more reliably than its predecessor. ScholarQA-CS2 scores are comparable to the originals. The public leaderboard has been updated with the re-scored numbers for those tasks to keep comparisons between agents fair.
Cost figures reflect benchmark-measured average LLM cost per problem under each evaluated agent setup, including differences in harness, tool use, and number of model calls. They should not be read as a direct apples-to-apples comparison of underlying API pricing.
Adoption from industry
AstaBench was designed to be an industry standard, and we're pleased to see more agent submissions to the leaderboard and broadening adoption.
UK AISI. Inspect Evals is an open collection of LLM evaluations built using the UK AISI's Inspect, an industry-standard evaluation framework. Arcadia Impact, which co-created Inspect Evals with the UK AISI, has been working to add AstaBench to this collection, making AstaBench more accessible to a wide range of safety researchers and AI developers. Arcadia has also used AstaBench. "AstaBench is an excellent addition to the AI evaluation ecosystem," says Justin Olive, Arcadia Impact Head of AI Safety. "There is an urgent need for standardization and secondary analysis, two areas in which this initiative has made important contributions. Building this work on the UK AISI's state-of-the-art Inspect framework demonstrates strategic foresight, and reflects Ai2's genuine commitment to open science and research impact."
General Reasoning. General Reasoning, an AI R&D company building infrastructure for reinforcement learning (RL), has implemented an AstaBench task (SUPER-Expert) as an environment on OpenReward, their platform for hosting RL environments at scale. "AstaBench provides an impressive suite of scientific environments for testing and training sophisticated agents, which we've worked to integrate into our OpenReward platform," says Ross Taylor, General Reasoning Co-founder & CEO. "We're very thankful for Ai2's open research in this area."
Try it yourself
If you want to test your own agent on AstaBench, everything you need is in the AstaBench and agent-baselines repositories. We accept external submissions to the leaderboard and are working to make the process easier.
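As a rough starting point, a custom agent can be written as a solver for the Inspect framework that AstaBench builds on and run against a benchmark task. The sketch below assumes that setup; the task and model names are placeholders, and the agents in agent-baselines are the authoritative examples to follow.

```python
# Hedged sketch: a trivial custom agent expressed as an Inspect solver.
# Task and model names are placeholders; see the AstaBench and agent-baselines
# repositories for the real tasks and baseline agents.
from inspect_ai import eval
from inspect_ai.solver import solver


@solver
def my_agent():
    async def solve(state, generate):
        # Replace with your own planning / tool-use loop;
        # here we simply call the model once on the task input.
        return await generate(state)

    return solve


eval(
    "astabench/example_task",    # placeholder task name
    solver=my_agent(),
    model="openai/gpt-example",  # placeholder model ID
    limit=5,                     # evaluate a small subset while iterating
)
```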
We built AstaBench because we think the question of whether AI can do real science needs open, rigorous measurement that anyone can verify and build on. The new results and the growing community around the suite bring us closer to that vision.
See for yourself on the leaderboard.