Latest research
September 16, 2025
Fluid language model benchmarking
We explore how Fluid Benchmarking can adapt evaluation items to a language model’s capability level.
August 28, 2025
OLMoASR: A series of open speech recognition models
We release OLMoASR, a family of open automatic speech recognition (ASR) models trained from scratch on a curated, large-scale dataset.
August 26, 2025
Asta: Accelerating science through trustworthy agentic AI
We announce Asta, our bold initiative to accelerate science through trustworthy, truly open agentic AI.
August 26, 2025
AstaBench: Rigorous benchmarking of AI agents with a holistic scientific research suite
Introducing AstaBench, a novel AI agents evaluation framework and scientific research benchmark suite.
August 19, 2025
Signal and Noise: Reducing uncertainty in language model evaluation
We find that two simple metrics, signal and noise, reveal key differences in the utility of current LLM benchmarks.
August 18, 2025
MoNaCo: More natural questions for reasoning across dozens of documents
Introducing MoNaCo, a benchmark of highly challenging questions spanning dozens of documents for evaluating large language models.
August 12, 2025
MolmoAct: An Action Reasoning Model that reasons in 3D space
MolmoAct is the first model able to “think” in three dimensions, trained efficiently and delivering benchmark-topping performance.
July 22, 2025
Contextualized Evaluations: Judging language model responses to underspecified queries
How do we evaluate LLMs on underspecified queries? We show that adding clarifying context flips model rankings and uncovers model biases.
July 18, 2025