
Ai2 Newsletter

July 2025

Top story - An AI benchmark for scientific discovery

The volume of scientific papers continues to grow, particularly in domains such as computer science. Researchers submitted a record 25,000 studies to this year’s upcoming NeurIPS conference alone, up from 17,000 in 2024.

AI can help sift through the noise, but it’s not always clear which models are best suited to the task. To illuminate this, we just launched SciArena, a platform designed to measure how well models can obtain and synthesize information from scientific literature across healthcare, engineering, and other fields.

SciArena allows users to compare how dozens of different models answer scientific questions such as “What are the recent developments in quantum computing's impact on cryptography?”

"We launched SciArena as an open, evolving, and collaborative platform that invites the scientific community to directly evaluate LLMs on real, literature-based tasks," Arman Cohan, research scientist at Ai2 and assistant professor of computer science at Yale University, says. "Our motivation has been to create a rigorous, dynamic, and community-driven evaluation platform that reflects the needs and standards of scientists and helps measure progress, while making all data and code openly available for others to build on."

SciArena queries models, directing them to retrieve relevant papers from Semantic Scholar, our AI-powered search and discovery tool for scientific literature. Anyone can use SciArena and vote on their preferred model. Rankings feed into a leaderboard showing which models perform best overall. 
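
As a rough illustration, here is a minimal Python sketch of what a retrieval step like this could look like using the public Semantic Scholar Graph API; SciArena’s actual retrieval and prompting pipeline is not reproduced here, and the field selection is an assumption.

```python
import requests

# Minimal sketch: fetch candidate papers from the public Semantic Scholar
# Graph API for a user question. This is illustrative only and does not
# reproduce SciArena's own retrieval or prompting pipeline.
S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def retrieve_papers(question: str, limit: int = 10) -> list[dict]:
    """Return a list of {title, abstract, year, url} records for the query."""
    resp = requests.get(
        S2_SEARCH_URL,
        params={
            "query": question,
            "limit": limit,
            "fields": "title,abstract,year,url",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

papers = retrieve_papers(
    "recent developments in quantum computing's impact on cryptography"
)
for paper in papers:
    print(paper.get("year"), "-", paper.get("title"))
```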

"The leaderboard uses the pairwise votes to rank the models using an Elo rating system, similar to what's used in chess rankings," Cohan says. "The models that consistently win head-to-head comparisons move up the rankings. So, the leaderboard aggregates real-world human judgments."

Alongside the platform, we open-sourced SciArena-Eval, a meta-evaluation benchmark, as well as the code and data used to develop SciArena. The goal is to continuously add new models to the platform to ensure that the leaderboard remains a trustworthy source for evaluating models’ ability to parse the vast and growing scientific literature.

According to Cohan, the SciArena-Eval results in particular highlight the challenges facing evaluation techniques for scientific tasks. The top-performing model on SciArena-Eval, o3, achieved only 65.1% accuracy, a significant gap compared with results on general-purpose benchmarks.
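
In a meta-evaluation like this, accuracy comes down to how often an LLM judge’s predicted preference matches the human vote. The sketch below illustrates that pairwise-agreement metric with assumed field names and judge interface; it is not the benchmark’s actual harness.

```python
from typing import Callable

# Illustrative pairwise-agreement metric for an LLM-as-judge meta-evaluation.
# The data layout and judge signature are assumptions, not SciArena-Eval's code.
def meta_eval_accuracy(
    examples: list[dict],                  # each: question, response_a, response_b, human_choice
    judge: Callable[[str, str, str], str]  # returns "a" or "b"
) -> float:
    hits = 0
    for ex in examples:
        pred = judge(ex["question"], ex["response_a"], ex["response_b"])
        hits += int(pred == ex["human_choice"])
    return hits / len(examples)
```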

"Open benchmarks such as SciArena's leaderboard put models through identical, expert-vetted tests and publicly show the results in real-time, so strengths and failure modes are visible to researchers, developers, and end-users alike," Cohan says. "Our hope is that these resources help drive both model improvement and more reliable, human-aligned AI evaluation in science."

An autonomous LLM discovery system

We recently released a prototype of Genesys, an autonomous multi-agent LLM discovery system that searches for new types of language model architectures. We found that Genesys can discover novel architectures competitive with the industry-standard transformer.

Benchmark for OCR systems

We developed a method to benchmark the performance of different OCR engines and APIs. Called olmOCR Bench, the test evaluates how well OCR systems preserve reading order, formulas, tables, and more, and it charts the systems’ performance-to-cost ratio. olmOCR Bench is available in olmOCR, Ai2’s fully open toolkit for transforming documents into clean markdown.
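
As one illustration of the kind of check such a benchmark can apply, the sketch below tests whether expected text snippets appear in an OCR output in the right order; it is a simplified stand-in, not olmOCR Bench’s actual test harness.

```python
# Illustrative reading-order check: each expected snippet must appear in the
# OCR output, and in the given sequence. A simplified stand-in for the style
# of unit test an OCR benchmark can run; not olmOCR Bench's own code.
def reading_order_ok(ocr_text: str, expected_snippets: list[str]) -> bool:
    position = 0
    for snippet in expected_snippets:
        idx = ocr_text.find(snippet, position)
        if idx == -1:          # snippet missing or out of order
            return False
        position = idx + len(snippet)
    return True

page_text = "Abstract ... 1 Introduction ... 2 Methods ... References"
print(reading_order_ok(page_text, ["Abstract", "1 Introduction", "2 Methods"]))  # True
print(reading_order_ok(page_text, ["2 Methods", "Abstract"]))                    # False
```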

Evaluating model instruction-following

IFBench, which we announced this week, is a benchmark designed to measure how well AI models follow new, challenging, and diverse verifiable instructions. Alongside it, we released IF-RLVR, a recipe for improving and generalizing a model’s ability to follow constraints.
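
“Verifiable” here means constraints that can be checked programmatically. The hypothetical checks below illustrate the idea; the specific constraints and function names are not drawn from IFBench itself.

```python
import re

# Hypothetical verifiable-instruction checks: each constraint has a
# deterministic pass/fail test over the model's response. Illustrative only.
def has_exact_bullet_count(response: str, n: int) -> bool:
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

def within_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def avoids_word(response: str, banned: str) -> bool:
    return banned.lower() not in response.lower()

reply = "- option one\n- option two\n- option three"
print(has_exact_bullet_count(reply, 3), within_word_limit(reply, 50), avoids_word(reply, "maybe"))
```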

A measure for reasoning

Our new OMEGA math benchmark systematically evaluates models along three axes of reasoning, each designed to probe a distinct type of cognitive leap. The goal is to see whether models have the ability to extend learned skills to unfamiliar problems, combine ideas in new ways, or discover strategies not explicitly seen during training.

Topping the heatmap

Ai2 is #1 on Hugging Face’s heatmap. We released more models, datasets, and other research artifacts than any other organization on Hugging Face in the past year.

Looking ahead

Several Ai2 researchers will be presenting their work at ICML, the International Conference on Machine Learning, in Vancouver in mid-July. They’ll be giving talks on methods for developing steerable moderation for AI, mechanistic interpretability, and identifying diverging preferences between data annotators, among other subjects.

Following ICML, stay tuned for new tools and resources from Ai2 to help scientists apply AI in their work.

Ai2 Newsletter Archive