SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks
July 1, 2025
Ai2
Scientific literature is expanding at an unprecedented rate, making it challenging for researchers to stay current and synthesize new knowledge. Foundation models are increasingly being used to help with this, but evaluating their capabilities on open-ended scientific tasks remains a significant challenge. Traditional benchmarks are often unsuited to nuanced evaluation of scientific tasks: they are static, limited in scale, and quickly become outdated.
To address these limitations, we present SciArena, an open and collaborative platform that directly engages the scientific research community in evaluating foundation models for scientific literature tasks. This crowdsourced, head-to-head evaluation approach for LLMs has been successfully pioneered in the general domain by platforms such as Chatbot Arena.
Results
As of June 30, 2025, SciArena hosts 23 frontier foundation models, selected to represent current state-of-the-art capabilities. Among them, the o3 model consistently delivers top performance across all scientific domains. We also find that o3 elaborates on cited scientific papers in more detail than other models, and that its output tends to be more technical in Engineering disciplines. Performance among the remaining models varies by discipline: for instance, Claude-4-Opus excels in Healthcare, while DeepSeek-R1-0528 performs well in Natural Science.
SciArena-Eval, the meta-evaluation benchmark, has also highlighted significant challenges for model-based evaluators. Even the top-performing model in our experiments, o3, achieves only 65.1% accuracy in predicting human preferences. This marks a notable gap compared to general-purpose benchmarks like AlpacaEval and WildChat, where pairwise evaluation protocols surpass 70% accuracy. These findings highlight the need for more robust and reliable automated evaluation methods in scientific reasoning tasks.
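For intuition, the SciArena-Eval accuracy number boils down to how often a model-based judge picks the same response that the human annotator preferred. Below is a minimal sketch assuming a hypothetical record format and judge function, not the actual SciArena-Eval data schema or evaluation harness.

```python
# Minimal sketch of the pairwise accuracy metric behind SciArena-Eval.
# The record fields and the judge callable are hypothetical placeholders,
# not the actual SciArena-Eval data schema or evaluation harness.

def evaluator_accuracy(examples, judge):
    """examples: dicts with 'question', 'response_a', 'response_b', and the
    human 'preference' ('A' or 'B'; ties are assumed to be filtered out).
    judge: callable returning 'A' or 'B' for the same question and responses."""
    correct = sum(
        judge(ex["question"], ex["response_a"], ex["response_b"]) == ex["preference"]
        for ex in examples
    )
    return correct / len(examples)
```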
We will be continuously adding new models to the SciArena platform to ensure ongoing evaluation of the latest advancements.
What is SciArena?
SciArena is an open evaluation platform where researchers can compare and vote on the performance of different foundation models in tasks related to scientific literature. It's built on a community voting approach, similar to Chatbot Arena, but specifically tailored for the complex and open-ended nature of scientific inquiry.
The platform has three main components:
- SciArena Platform: This is where human researchers submit questions, view side-by-side responses from different foundation models, and cast their votes for the preferred output.
- Leaderboard: Based on community votes, an Elo rating system ranks the models, providing a dynamic and up-to-date assessment of their performance (see the Elo sketch after this list).
- SciArena-Eval: This is a meta-evaluation benchmark built on the collected human preference data, designed to assess the accuracy of model-based evaluation systems.
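For intuition, here is a minimal sketch of an Elo-style update from a single head-to-head vote. The starting rating and K-factor are illustrative assumptions, and the production leaderboard may compute ratings differently (for example, by fitting over all votes at once).

```python
# Minimal sketch of an Elo update from one head-to-head vote.
# The starting ratings (1000) and K-factor (32) are illustrative
# assumptions, not necessarily the constants SciArena's leaderboard uses.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    return rating_a + k * (outcome - e_a), rating_b + k * (e_a - outcome)

# Example: two models start equal; A wins one vote and gains 16 points.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```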
How SciArena Works: A Glimpse Under the Hood
Unlike general-domain queries, scientific tasks often need to be grounded in the scientific literature. When a user submits a question on SciArena, the platform uses a multi-stage retrieval pipeline, adapted from Ai2's Scholar QA system, to gather relevant scientific paper contexts. This pipeline includes query decomposition, passage retrieval, and re-ranking to ensure high-quality, relevant information. The retrieved contexts, along with the user's question, are then fed to two randomly selected foundation models. Note that SciArena focuses on evaluating standard, directly comparable foundation models rather than customized agentic or proprietary deep research systems such as Perplexity or OpenAI's Deep Research, so neither of those systems was included in our comparisons. The models generate long-form, literature-grounded responses, complete with citations. To mitigate potential biases from stylistic elements, responses are post-processed into a standardized plain-text format with consistent citation styles. Users then evaluate the two outputs and vote for the one that best satisfies their information needs.
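A highly simplified sketch of the flow described above is shown below. The pipeline stages are passed in as callables because their real implementations (adapted from Scholar QA) are not shown here; every stage name is an illustrative placeholder, not the actual SciArena API.

```python
import random
from typing import Callable, Sequence

# Simplified sketch of SciArena's per-question flow. All stage names are
# hypothetical placeholders for the steps described in the text, not the
# actual Scholar QA / SciArena implementation.

def answer_pair(
    question: str,
    model_pool: Sequence[str],
    decompose_query: Callable[[str], list[str]],
    retrieve_passages: Callable[[str], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, str, list[str]], str],
    standardize: Callable[[str], str],
):
    # Multi-stage retrieval: decompose the query, retrieve passages per
    # sub-query, then re-rank the pooled passages for relevance.
    sub_queries = decompose_query(question)
    passages = [p for q in sub_queries for p in retrieve_passages(q)]
    context = rerank(question, passages)

    # Two foundation models are sampled at random for the head-to-head vote.
    model_a, model_b = random.sample(list(model_pool), 2)

    # Each model writes a long-form, citation-grounded response, which is
    # normalized to plain text with a consistent citation style so that
    # formatting differences don't bias voters.
    response_a = standardize(generate(model_a, question, context))
    response_b = standardize(generate(model_b, question, context))
    return (model_a, response_a), (model_b, response_b)
```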
The Impact and Quality of SciArena Data
The quality of an evaluation hinges on the quality of its data, and SciArena places a strong emphasis on this. Over the first four months of internal operation, SciArena collected more than 13,000 votes from 102 trusted researchers across a range of scientific fields.
To ensure the reliability and integrity of this human preference data, we applied rigorous quality control measures, including a dedicated annotation pipeline for pairwise judgments:
- Expert Annotators: The initial data collection involved 102 researchers with at least two peer-reviewed publications and prior experience with AI-assisted literature tools.
- Comprehensive Training: All annotators underwent a one-hour training session to ensure consistency and accuracy in their evaluations.
- Blind Rating: In SciArena’s interface, the models that generated each answer are not revealed until after the user submits their vote.
- Inter-Annotator Agreement (IAA) and Self-Consistency: SciArena assesses both IAA and self-consistency to quantify the reliability of the collected data. Results show strong self-consistency (weighted Cohen’s κ of 0.91), meaning individual annotators' judgments remain stable over time, and high IAA (weighted Cohen’s κ of 0.76), indicating that experts tend to reach similar judgments despite the subjective nature of some questions; a sketch of this computation follows this list.
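For illustration, agreement statistics like these can be computed with scikit-learn's weighted Cohen's kappa. The toy votes and the linear weighting below are assumptions for the sketch, not SciArena's actual settings or data.

```python
from sklearn.metrics import cohen_kappa_score

# Minimal sketch of computing weighted Cohen's kappa over pairwise votes.
# The toy labels and the linear weighting scheme are illustrative
# assumptions; the post reports the resulting statistics (IAA 0.76,
# self-consistency 0.91) but not the exact computation settings.

# Votes are treated as ordinal: 0 = Model A better, 1 = Tie, 2 = Model B better.
annotator_1 = [0, 2, 1, 0, 2, 2, 0, 1]
annotator_2 = [0, 2, 2, 0, 2, 1, 0, 1]

iaa = cohen_kappa_score(annotator_1, annotator_2, weights="linear")
print(f"Weighted Cohen's kappa: {iaa:.2f}")
```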
This commitment to data quality ensures that SciArena offers a robust and trustworthy assessment of model performance.
Future Challenges
For future work, SciArena welcomes partnerships with model developers so that we can evaluate new models and post them to our leaderboard. Further, SciArena currently holds other important elements of the retrieval-augmented generation (RAG) pipeline fixed across models, such as the retrieval index and prompting workflow. These pipeline elements can significantly influence answer quality, and in future work we would like to evaluate alternative choices for indexing papers and prompting the models.
Join the Arena!
- Visit SciArena today to compare models and vote.
- Bookmark the leaderboard page, which is updated often, to see how the rankings change over time.
- Watch for new model releases; we'll add new models continuously as they drop.
- Download and use the SciArena-Eval benchmark, code, and data for your own projects.
- Read our paper to learn more about SciArena's design, along with more detailed results and analysis.
- We are continuously improving the platform; feedback and suggestions are welcome.