Papers

  • SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

    Bill Yuchen Lin, Yicheng Fu, Karina Yang, Prithviraj Ammanabrolu, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Yejin Choi, Xiang Ren. NeurIPS 2023. We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting large…
  • SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

    Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman. EMNLP 2023. Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In…
  • A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

    Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo. EMNLP 2023. Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context…
  • Crystal: Introspective Reasoners Reinforced with Self-Feedback

    Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, Asli Celikyilmaz. EMNLP 2023. Extensive work has shown that the performance and interpretability of commonsense reasoning can be improved via knowledge-augmented reasoning methods, where the knowledge that underpins the reasoning process is explicitly verbalized and utilized. However…
  • Demystifying Prompts in Language Models via Perplexity Estimation

    Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke Zettlemoyer. EMNLP Findings 2023. Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best prompts. In this work…
  • Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

    Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov. EMNLP 2023. Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more…
  • Editing Common Sense in Transformers

    Anshita Gupta*, Debanjan Mondal*, Akshay Krishna Sheshadri*, Wenlong Zhao, Xiang Lorraine Li*, Sarah Wiegreffe*, Niket Tandon*. EMNLP 2023. Editing model parameters directly in Transformers makes updating open-source transformer-based models possible without re-training. However, these editing methods have only been evaluated on statements about encyclopedic knowledge with a single correct answer…
  • FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi. EMNLP 2023. Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2…
  • FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

    Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, Maarten Sap. EMNLP 2023. Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question…
  • Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy

    Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal. EMNLP 2023. When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms…