Literature Understanding Benchmarks

Several AstaBench evaluations test an AI model's literature understanding skills. This includes locating relevant research papers, retrieving and answering questions from scientific documents, summarizing key findings, and more.

View the leaderboard

PaperFindingBench

PaperFindingBench assesses an agent’s ability to locate sets of papers based on a natural language description that may involve both the papers’ content and metadata, such as the author or publication year.

LitQA2-FullText-Search

LitQA2-FullText-Search is a LitQA2-FullText variant that isolates retrieval: it uses the same multiple-choice questions, but agents are scored on ranking papers likely to contain the answer rather than on answering it.
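The exact scoring is defined by the benchmark harness; as a rough, hypothetical illustration of retrieval-only evaluation, the sketch below computes a simple recall@k over a ranked list of paper IDs (the function and the metric choice are assumptions, not the benchmark's actual implementation).

```python
from typing import Sequence, Set

def recall_at_k(ranked_paper_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of relevant papers appearing in the top-k of the ranking.

    Illustrative only; the real LitQA2-FullText-Search metric may differ.
    """
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_paper_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Example: the agent ranks candidate papers; one paper contains the answer.
ranking = ["paperA", "paperB", "paperC", "paperD"]
print(recall_at_k(ranking, {"paperC"}, k=3))  # 1.0 -- the answering paper is in the top 3
```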

ScholarQA-CS2

ScholarQA-CS2 evaluates long-form responses to CS literature-review questions, expecting comprehensive, deep-research-style reports. It advances ScholarQA-CS with real-world queries and new metrics for coverage and precision of report text and citations.

LitQA2-FullText

LitQA2 (FutureHouse) tests models on multiple-choice questions that require retrieving a specific paper from the scientific literature and reading its full text—not just the abstract. The original release gave the answering paper’s title but no fixed corpus; our version searches the Asta standard index. “-FullText” denotes the subset whose answering papers have open-source full text in our index.

ArxivDIGESTables-Clean

ArxivDIGESTables-Clean evaluates models on generating literature-review tables—rows as papers, columns as comparison aspects—given related papers and a caption, scoring against tables published in arXiv. “-Clean” is a curated subset that removes tables that are trivial or cannot be reconstructed from full text.
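For intuition, the target output has papers as rows and comparison aspects as columns. A minimal sketch of that structure, using invented placeholder papers, aspects, and values rather than benchmark data:

```python
# Hypothetical literature-review table: rows are papers, columns are comparison aspects.
# Every entry below is an invented placeholder for illustration only.
literature_review_table = {
    "caption": "Comparison of retrieval-augmented QA systems",
    "columns": ["Retrieval corpus", "Reader model", "Evaluation dataset"],
    "rows": {
        "Paper A": ["Wikipedia dump", "Encoder-decoder", "Open-domain QA"],
        "Paper B": ["Web crawl", "Decoder-only LLM", "Multi-hop QA"],
    },
}
```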


Coding and Execution Benchmarks

AstaBench evaluates how well models can write, edit, and run code for scientific research tasks. This includes reproducing results from computational studies, modifying existing code, and producing correct outputs in real-world research scenarios.

View the leaderboard

SUPER-Expert

SUPER-Expert tests models on setting up and executing tasks from low-resource research code repositories. The “-Expert” split is SUPER’s hardest, requiring full reproductions from scratch without hints or intermediate landmarks.

Core-Bench-Hard

Core-Bench-Hard measures computational reproducibility—reproducing study results from provided code and data—via language-only and vision-language tasks across multiple difficulty levels. The “-Hard” split is Core-Bench’s toughest: only a README is provided, with no instructions or Dockerfile.

DS-1000

DS-1000 is a well-established code-generation benchmark of Python data-science questions from Stack Overflow, covering diverse, realistic use cases and exercising widely used data-science/ML libraries. We split it into 100 validation and 900 test problems.
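As a flavor of the task format (this example is hypothetical, not drawn from the benchmark), a DS-1000-style problem pairs a short natural-language question with a reference Python solution:

```python
# Hypothetical DS-1000-style task: "Given a DataFrame with columns 'group' and 'value',
# compute each group's mean and return a DataFrame sorted by that mean."
import pandas as pd

df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1.0, 4.0, 3.0, 2.0]})

result = (
    df.groupby("group", as_index=False)["value"]
    .mean()
    .sort_values("value")
    .reset_index(drop=True)
)
print(result)
#   group  value
# 0     a    2.0
# 1     b    3.0
```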


Data Analysis Benchmarks

AstaBench evaluates a model’s ability to analyze scientific datasets and generate meaningful insights. This includes transforming and modeling data to support accurate, data-driven reasoning across scientific domains.

View the leaderboard

DiscoveryBench

DiscoveryBench is the first comprehensive benchmark to formalize multi-step data-driven discovery—data loading, transformation, statistical analysis, and modeling—and to systematically test how well current LLMs reproduce published findings across domains like social science, biology, and history.
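To make those stages concrete, here is a minimal sketch of the kind of pipeline such a task exercises; the data, column names, and hypothesis are invented placeholders, and actual DiscoveryBench tasks supply their own datasets and questions:

```python
# Illustrative multi-step discovery workflow: load -> transform -> analyze -> report.
# All data and the hypothesis below are hypothetical placeholders, not benchmark content.
import pandas as pd
from scipy import stats

# 1. Data loading (a real task loads the dataset shipped with the benchmark).
df = pd.DataFrame({
    "education_years": [10, 12, 12, 14, 16, 16, 18, None],
    "income_k": [28, 33, 31, 40, 52, 49, 60, 45],
})

# 2. Transformation: drop incomplete records.
df = df.dropna(subset=["education_years", "income_k"])

# 3. Statistical analysis: test the (hypothetical) claim that education
#    correlates with income.
r, p = stats.pearsonr(df["education_years"], df["income_k"])

# 4. Report the finding.
print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```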


End-to-End Discovery Benchmarks

AstaBench tests whether agents can complete an entire scientific workflow without human intervention. This includes designing and running experiments, analyzing results, and producing full research outputs.

View the leaderboard

E2E-Bench

E2E-Bench is the “decathlon” of AI-assisted research. It measures whether a system can run the entire research pipeline, from an initial task description, to designing and performing (software) experiments, to analyzing and writing up the results.

E2E-Bench-Hard

E2E-Bench-Hard is a tougher E2E-Bench variant. It generates tasks from research trends and underexplored problems. Tasks are feasibility-checked only—no simplification—testing systems on complex, less-structured research scenarios under the same end-to-end process.

Get started with Asta

Use or fork an agent in our agents suite (includes general and science agents)

Get started building agents

Evaluate your agent or model using the AstaBench code (before submitting)

Get started building agents

View the performance of current agents on the AstaBench leaderboards, or submit your own evaluation results

View the leaderboard

Use the agent-eval package that powers AstaBench and Asta leaderboards.

Get started building your own benchmarks

Evaluation Framework for AI agents

AstaBench is backed by a more rigorous evaluation framework for AI agents.

Framework for agent benchmark suites and leaderboards

The agent-eval Python package powers more rigorous agent benchmark suites and leaderboards, with time-invariant cost reporting, traceable logs and source code, and explicit accounting for submissions with less controlled measurement or lower transparency. We use it to power AstaBench and the Asta leaderboards; you can also use it to create your own benchmark suite and leaderboard.

Agents suite and tools

AstaBench comes with the agent-baselines suite of AI agents, as well as the first high-quality, standard scientific research environment and tools for agents. See also Asta resources for more details.