AstaBench
Literature Understanding Benchmarks
Several AstaBench evaluations test an AI model’s literature understanding skills, including its ability to locate relevant research papers, retrieve scientific documents and answer questions about them, summarize key findings, and more.
PaperFindingBench
PaperFindingBench assesses an agent’s ability to locate sets of papers based on a natural language description that may involve both the papers’ content and metadata, such as the author or publication year.
LitQA2-FullText-Search
LitQA2-FullText-Search is a LitQA2-FullText variant that isolates retrieval: the multiple-choice questions are the same, but agents are scored on ranking papers likely to contain the answer, not on answering it.
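To make retrieval-only scoring concrete, here is a minimal sketch of one way such a ranking could be scored; the metric, function name, and paper IDs are illustrative assumptions, not the benchmark’s actual scorer.

```python
# Illustrative retrieval scoring: does the paper that answers the question
# appear in the agent's top-k ranked results? This is a hypothetical
# recall@k sketch for intuition; the benchmark's real scoring may differ.

def recall_at_k(ranked_paper_ids: list[str], answering_paper_id: str, k: int = 10) -> float:
    """Return 1.0 if the answering paper appears in the top k results, else 0.0."""
    return 1.0 if answering_paper_id in ranked_paper_ids[:k] else 0.0

# Example: the agent ranked five corpus IDs; the gold paper is ranked third.
ranking = ["p17", "p03", "p42", "p08", "p99"]
print(recall_at_k(ranking, "p42", k=5))  # 1.0
print(recall_at_k(ranking, "p42", k=2))  # 0.0
```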
ScholarQA-CS2
ScholarQA-CS2 evaluates long-form responses to CS literature-review questions, expecting comprehensive, deep-research-style reports. It advances ScholarQA-CS with real-world queries and new metrics for coverage and precision of report text and citations.
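As rough intuition for what citation coverage and precision measure, consider the simplified sketch below; the real ScholarQA-CS2 metrics also score the report text against rubrics, and the function and set names here are invented for illustration.

```python
# Simplified sketch of citation-level coverage and precision for a
# long-form report: coverage rewards citing the relevant papers, precision
# penalizes citing irrelevant ones. Not the benchmark's actual metric.

def citation_scores(cited: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (coverage, precision) of a report's citations."""
    hits = len(cited & relevant)
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(cited) if cited else 0.0
    return coverage, precision

cov, prec = citation_scores(cited={"A", "B", "X"}, relevant={"A", "B", "C", "D"})
print(f"coverage={cov:.2f}, precision={prec:.2f}")  # coverage=0.50, precision=0.67
```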
LitQA2-FullText
LitQA2 (from FutureHouse) tests models on multiple-choice questions that require retrieving a specific paper from the scientific literature and reading its full text, not just the abstract. The original release gave the title of the answering paper but no fixed corpus; our version searches the Asta standard index. “-FullText” denotes the subset whose answering papers have open-access full text in our index.
ArxivDIGESTables-Clean
ArxivDIGESTables-Clean evaluates models on generating literature-review tables (rows are papers, columns are comparison aspects) given a set of related papers and a table caption, scoring against tables published in arXiv papers. “-Clean” is a curated subset that removes tables that are trivial or cannot be reconstructed from the papers’ full text.
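For intuition, the target output is essentially a small papers-by-aspects table; the sketch below uses invented papers, column names, and values purely to illustrate the shape of the task.

```python
# Minimal sketch of a literature-review table of the kind ArxivDIGESTables
# targets: one row per paper, one column per comparison aspect. All paper
# names, columns, and values below are invented placeholders.
import pandas as pd

table = pd.DataFrame(
    {
        "Paper": ["Paper A (2021)", "Paper B (2022)", "Paper C (2023)"],
        "Task": ["QA", "Summarization", "QA"],
        "Model size": ["340M", "1.3B", "7B"],
        "Uses retrieval": [True, False, True],
    }
).set_index("Paper")
print(table)
```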
Coding and Execution Benchmarks
AstaBench evaluates how well models can write, edit, and run code for scientific research tasks. This includes reproducing results from computational studies, modifying existing code, and producing correct outputs in real-world research scenarios.
SUPER-Expert
SUPER-Expert tests models on setting up and executing tasks from low-resource research repositories, i.e., the code repositories that accompany research papers. The “-Expert” split is SUPER’s hardest, requiring reproductions from scratch without hints or intermediate landmarks.
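The flavor of the work involved looks roughly like the sketch below, which clones a repository, installs its dependencies, and runs an experiment; the repository URL, script name, and config file are hypothetical placeholders, not benchmark artifacts.

```python
# Hypothetical sketch of the "set up and execute" work SUPER-Expert asks
# an agent to do: clone a research repository, install dependencies, and
# run an experiment. The URL, script, and config below are placeholders.
import subprocess

REPO_URL = "https://github.com/example/research-repo"  # placeholder
WORKDIR = "research-repo"

subprocess.run(["git", "clone", REPO_URL, WORKDIR], check=True)
subprocess.run(["pip", "install", "-r", "requirements.txt"], cwd=WORKDIR, check=True)
# In the -Expert split the agent gets no hints about which script or
# arguments reproduce the reported result.
subprocess.run(
    ["python", "run_experiment.py", "--config", "default.yaml"],
    cwd=WORKDIR,
    check=True,
)
```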
Core-Bench-Hard
Core-Bench-Hard measures computational reproducibility (reproducing study results from provided code and data) via language-only and vision-language tasks across multiple difficulty levels. The “-Hard” split is Core-Bench’s toughest: only a README is provided, with no instructions or Dockerfile.
DS-1000
DS-1000 is a well-established code-generation benchmark of Python data-science questions from Stack Overflow, covering diverse, realistic use cases and exercising widely used data-science/ML libraries. We split it into 100 validation and 900 test problems.
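DS-1000 items pair a short Stack Overflow-style prompt with a code completion that is checked automatically; the example below is a paraphrase in that spirit, not an actual benchmark problem.

```python
# A DS-1000-style problem, paraphrased for illustration (not an actual
# benchmark item): "Given a DataFrame, keep only the rows whose 'score'
# is above the column mean." The model must generate the completion.
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 5, 7]})

# A reference-style solution that generated code would be checked against:
result = df[df["score"] > df["score"].mean()]
print(result)  # rows "b" and "d"
```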
Data Analysis Benchmarks
AstaBench evaluates a model’s ability to analyze scientific datasets and generate meaningful insights. This includes transforming and modeling data to support accurate, data-driven reasoning across scientific domains.
DiscoveryBench
DiscoveryBench is the first comprehensive benchmark to formalize multi-step data-driven discovery—data loading, transformation, statistical analysis, and modeling—and to systematically test how well current LLMs reproduce published findings across domains like social science, biology, and history.
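A single discovery task chains steps like the toy workflow below, which loads data, transforms it, and runs a statistical test; the dataset and hypothesis are synthetic placeholders, not DiscoveryBench content.

```python
# Toy illustration of the multi-step workflow DiscoveryBench formalizes:
# load data, transform it, then run a statistical analysis to test a
# hypothesis. The data and variables are synthetic placeholders.
import pandas as pd
from scipy import stats

# 1. Data loading (synthetic stand-in for a benchmark dataset)
df = pd.DataFrame({
    "group": ["treatment"] * 5 + ["control"] * 5,
    "outcome": [2.9, 3.1, 3.4, 3.0, 3.3, 2.1, 2.4, 2.0, 2.3, 2.2],
})

# 2. Transformation: split the outcome variable by group
treated = df.loc[df["group"] == "treatment", "outcome"]
control = df.loc[df["group"] == "control", "outcome"]

# 3. Statistical analysis: test whether the group means differ
res = stats.ttest_ind(treated, control)
print(f"t={res.statistic:.2f}, p={res.pvalue:.4f}")
```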
End-to-End Discovery Benchmarks
AstaBench tests whether agents can complete an entire scientific workflow without human intervention. This includes designing and running experiments, analyzing results, and producing full research outputs.
E2E-Bench
E2E-Bench is the “decathlon” of AI-assisted research. It measures whether a system can run the entire research pipeline, from an initial task description, to designing and performing (software) experiments, to analyzing and writing up the results.
E2E-Bench-Hard
E2E-Bench-Hard is a tougher E2E-Bench variant: its tasks are generated from research trends and underexplored problems, and they are only checked for feasibility rather than simplified, testing systems on complex, less-structured research scenarios under the same end-to-end process.
Get started with Asta
Use or fork an agent in our agents suite (includes general and science agents)
Evaluate your agent or model using the AstaBench code (before submitting)
View the performance of current agents on the AstaBench leaderboards, or submit your own evaluation results
Use the agent-eval package that powers AstaBench and Asta leaderboards.
Evaluation Framework for AI agents
AstaBench is backed by a more rigorous evaluation framework for AI agents.
Framework for agent benchmark suites and leaderboards
The agent-eval Python package powers more rigorous agent benchmark suites and leaderboards, with time-invariant cost reporting, traceable logs and source code, and accounting for submissions with less controlled measurement or lower transparency. We use it to power AstaBench and the Asta leaderboards; you can also use it to create your own benchmark suite and leaderboard.
Agents suite and tools
AstaBench comes with the agent-baselines suite of AI agents, as well as the first high-quality, standard scientific research environment and tools for agents. See also Asta resources for more details.