NeuroDiscoveryBench: Benchmarking AI for neuroscience data analysis
December 12, 2025
Kris Ganjam (Allen Institute); Harshit Surana, Bodhisattwa Prasad Majumder, and Peter Clark (Ai2)
The field of neuroscience is expanding rapidly, with new experimental and analytical techniques deepening our understanding of the brain and raising hopes for better treatments for diseases such as Alzheimer’s. At the same time, the volume of data being generated is staggering, ranging from single-cell atlases profiling millions of cells to connectivity maps that chart entire brain regions. These openly available resources have become indispensable for researchers worldwide, underscoring the central role of community datasets in modern neuroscience. But as datasets grow in scale and complexity, they are outstripping traditional analysis methods, creating a pressing need for AI systems that can keep up.
While AI-powered data analysis tools have advanced significantly, benchmarks to measure progress have so far been concentrated in other domains such as chemistry (ChemBench), bioinformatics (BixBench), and data science (DiscoveryBench). Until now, no comparable data analysis benchmark existed specifically for neuroscience (although there are neuroscience datasets testing other skills, e.g., BrainBench for predicting experimental outcomes). To address this gap, we created and are releasing NeuroDiscoveryBench, the first benchmark to assess data analysis question-answering in neuroscience, together with initial baseline results.
NeuroDiscoveryBench
We developed NeuroDiscoveryBench in collaboration with our sibling organization, the Allen Institute. Both institutes were founded by Paul G. Allen: the Allen Institute has spent over two decades creating foundational open datasets like the Brain Atlas, while we at Ai2 focus on AI tools that accelerate scientific discovery.
Our goal with NeuroDiscoveryBench was to build a benchmark that tests how well AI systems can answer questions grounded in real-world neuroscience data. The resulting dataset contains about 70 question–answer pairs, each requiring direct analysis of the associated data. These are not simple “factoid” questions; answers take the form of scientific hypotheses or quantitative observations that demand substantial data analysis.
For example, NeuroDiscoveryBench includes questions such as:
- How does the frequency of the APOE4 allele relate with increasing ADNC scores?
- What is the distribution (with cell numbers) of donor genotypes for the 'Glut' neurotransmitter class?
- Which subclass of glutamatergic neurons is most frequent in the RHP region?
Because the answers depend directly on the provided data, they cannot simply be retrieved from memory or a web search.
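To give a concrete flavor of the analysis involved, the first question above could in principle be answered with a few lines of pandas over a donor-level metadata table. The sketch below is purely illustrative: the file and column names (donor_metadata.csv, adnc, apoe4_carrier) are hypothetical placeholders, not the benchmark’s actual schema, and it shows the kind of workflow a system must generate rather than the reference solution.

```python
# Illustrative sketch only: the file and column names are hypothetical
# placeholders, not the benchmark's actual schema.
import pandas as pd

# Load a (hypothetical) donor-level metadata table.
donors = pd.read_csv("donor_metadata.csv")

# Treat ADNC as an ordered severity scale so the trend reads left to right.
adnc_order = ["Not AD", "Low", "Intermediate", "High"]
donors["adnc"] = pd.Categorical(donors["adnc"], categories=adnc_order, ordered=True)

# Fraction of donors carrying at least one APOE4 allele at each ADNC level.
apoe4_freq = (
    donors.groupby("adnc", observed=True)["apoe4_carrier"]
          .mean()
          .rename("apoe4_frequency")
)
print(apoe4_freq)
```

A rising or falling trend across the ordered ADNC categories is exactly the kind of quantitative observation the gold answers capture.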
To generate such questions, we drew on three major recent Allen Institute neuroscience publications (Figure 1), all of which made their datasets openly available. These publications – like many others from the Institute – have become widely used resources for the field.
We followed a procedure similar to the one used successfully in an earlier data science benchmark (the aforementioned DiscoveryBench). First, we identified important questions posed in, or inspired by, these publications and reverse-engineered the data analysis workflows required to answer them, ensuring that they were indeed solvable through data analysis. Second, using an interactive data analysis tool – our own Asta DataVoyager – we verified these workflows by executing them directly on the datasets. Finally, for queries that required complex preprocessing, we created both “raw” and “processed” versions: the former operating over the original datasets, the latter using preprocessed data to simplify analysis. We also included a small set of harder “no-traces” questions that require deeper biological understanding in addition to data analysis.
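For concreteness, each benchmark item can be thought of as bundling a question, pointers to the associated data files, the variant it belongs to, and a structured gold answer. The sketch below shows one plausible shape for such an item; the field names and structure are our illustrative assumptions, not the released schema.

```python
# One plausible shape for a benchmark item; field names and structure are
# illustrative assumptions, not the released NeuroDiscoveryBench schema.
example_item = {
    "question": (
        "Which subclass of glutamatergic neurons is most frequent "
        "in the RHP region?"
    ),
    "dataset_files": ["cell_metadata.csv"],  # hypothetical file name
    "variant": "processed",                  # e.g., "raw", "processed", or "no-traces"
    "gold_answer": {
        "context": "glutamatergic neurons in the RHP region",
        "variables": ["neuron subclass", "cell count"],
        "relation": "the most frequent subclass",
    },
}
```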
Questions were selected to be challenging but feasible, while excluding those that were highly complex, subjective, or went beyond the available data (e.g., determining the “best” clustering of all mouse brain cell types, which would require iterative scientific studies outside the scope of this benchmark).
We also included questions requiring visualization (a minimal plotting sketch for the first of these follows the list), such as:
- Display the distribution of Thal phases across increasing ADNC categories (Not AD, Low, Intermediate, High) as a heatmap.
- Generate a hierarchical plot of subclass and neurotransmitter, grouped by cluster, filtered by class '24 MY Glut'.
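As a minimal sketch, the first of these could be produced with pandas and seaborn along the lines below. Again, the file and column names (donor_metadata.csv, thal_phase, adnc) are hypothetical placeholders rather than the dataset’s actual schema.

```python
# Hypothetical sketch of the requested heatmap; file and column names are
# placeholders, not the dataset's actual schema.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

donors = pd.read_csv("donor_metadata.csv")

# Count donors in each (Thal phase, ADNC category) cell, with ADNC ordered.
adnc_order = ["Not AD", "Low", "Intermediate", "High"]
counts = pd.crosstab(donors["thal_phase"], donors["adnc"])
counts = counts.reindex(columns=adnc_order, fill_value=0)

sns.heatmap(counts, annot=True, fmt="d", cmap="viridis")
plt.xlabel("ADNC category")
plt.ylabel("Thal phase")
plt.title("Thal phases across ADNC categories")
plt.tight_layout()
plt.show()
```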
Finally, neuroscientists and data experts reviewed the wording of questions and gold answers, ensuring that they were clear, unambiguous, and faithful to the data, avoiding speculation or claims beyond what the datasets supported.
Evaluations
To evaluate a system, we provide it with a question and data from NeuroDiscoveryBench, and require it to produce an answer—either text, or a figure if requested. For natural language answers, the scoring function checks whether the context, variables, and relations match the gold answers. For figures, a vision–language model is used to assess correctness.
This is a formidable AI task, requiring natural language understanding, data manipulation, code generation, and both neuroscience and commonsense reasoning. Such tasks would have been completely beyond automated systems two years ago.
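Concretely, an evaluation run can be pictured as the loop sketched below. None of this is a released NeuroDiscoveryBench API: the item format, the system under test, and the two scorers are all supplied as placeholders, standing in for the text scorer (context/variable/relation matching) and the vision–language-model judge described above.

```python
# Hypothetical evaluation loop: the benchmark items, system under test, and
# scorers are supplied as callables because none of them are a released API.
from statistics import mean
from typing import Callable, Iterable

def evaluate(
    items: Iterable[dict],                           # benchmark items (question, data, gold answer)
    run_system: Callable[[str, list], object],       # system under test: (question, data files) -> answer
    score_text: Callable[[object, dict], float],     # checks context, variables, and relations
    score_figure: Callable[[object, dict], float],   # vision-language-model judge for figures
) -> float:
    scores = []
    for item in items:
        answer = run_system(item["question"], item["dataset_files"])
        scorer = score_figure if item.get("expects_figure") else score_text
        scores.append(scorer(answer, item["gold_answer"]))
    return mean(scores)  # average score across the benchmark
```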
We tested two simple baselines plus DataVoyager, all running autonomously (without human guidance); a minimal sketch of the first baseline appears after the list:
- No data: We simply give the query to a language model (LM) without the dataset. This tests whether the LM might have memorized the answer to the question during its pretraining, independent of the supplied data.
- No data, with search: Similar, but the LM can also perform web searches to answer the question. This tests if the relevant answer is available or can be derived from information on the Web.
- DataVoyager: DataVoyager interprets the query, generates dataset transformation and analysis code, runs that code on the dataset to produce an answer, and re-expresses the answer in natural language.
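As an illustration, the “no data” setting reduces to prompting a language model with the question alone. In the sketch below, llm is a placeholder for whatever chat-completion client is used, and the prompt wording is illustrative rather than the one from our experiments.

```python
# Sketch of the "no data" baseline: the model sees only the question, never the
# dataset. "llm" is a placeholder for any chat-completion client; the prompt
# wording is illustrative, not the one used in our experiments.
from typing import Callable

NO_DATA_PROMPT = (
    "You are a neuroscience data analyst. No dataset is provided. "
    "Answer the following question as a concise quantitative statement.\n\n"
    "Question: {question}"
)

def no_data_baseline(question: str, llm: Callable[[str], str]) -> str:
    return llm(NO_DATA_PROMPT.format(question=question))
```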
Results & insights
The no-data baselines (run on GPT-5.1 with medium reasoning) had low scores (6% and 8%), confirming that models largely can’t “cheat” their way to an answer without data analysis (i.e., by already knowing it or finding it on the web)*. In contrast, DataVoyager (using GPT-5.1, medium reasoning, no web search) scored substantially higher, at 35%. This illustrates that AI agents are becoming capable of generating data-driven insights, while also showing that this dataset is still hard, and that automated neuroscience analysis is not yet a solved problem.
*The no-data scores are slightly above 0% because, very occasionally, an answer, e.g., the relationship between APOE4 status and Alzheimer’s pathology (ADNC), is already well known and reported in the literature, and hence can be recalled or found by the model. Also, interestingly, adding web search (“no data, with search”) slightly hurt performance, as search sometimes pulled in off-topic papers and confused the models.
Additionally, the “raw” (un-preprocessed) datasets were, as expected, much harder for the agents: they struggled with the complex data transformations required for the final hypothesis analysis. This mirrors what we saw in earlier work on DiscoveryBench, and again suggests that more effort is needed to help agents handle heavy-duty data wrangling for biological datasets, and that data preprocessing can be as important as the final analysis.
The future
NeuroDiscoveryBench is one of the first benchmarks specifically focused on AI analysis of neuroscience data, and provides a shared testbed for developing, comparing, and improving tools that aim to help neuroscientists. By building on openly available datasets from the Allen Institute and others, the benchmark also draws attention to this rapidly advancing field and the high-impact breakthroughs in brain science that researchers are striving to achieve. In addition, NeuroDiscoveryBench will shortly become part of AstaBench, Ai2’s growing suite of benchmarks for testing AI agents on scientific tasks.
Of course, neuroscience involves far more than answering structured data questions: it requires experimental design, integration with the literature, iteration at the lab bench, and ultimately, the formation of new insights about the brain. Establishing rigorous benchmarks is a crucial step toward ensuring AI tools are measured against real scientific challenges.
The field of AI is moving rapidly to help in all stages of research, with new assistants such as our Asta project supporting scientists in data analysis, experiment planning, and discovery. NeuroDiscoveryBench complements these efforts by providing a testbed for measuring progress and highlighting the kinds of challenges AI systems must overcome to be useful partners in neuroscience.
Together with the Allen Institute, we’re committed to expanding these efforts. We encourage researchers and tool builders to test their systems on NeuroDiscoveryBench and to build on these open datasets, just as generations of neuroscientists have built upon the Allen Institute’s atlases and cell type resources. The future of AI-assisted discovery depends on it.
Try NeuroDiscoveryBench today.