Ai2

Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.


Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

William Merrill, Shane Arora, Dirk Groeneveld, Hanna Hajishirzi
2025
arXiv.org

The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To… 

RewardBench: Evaluating Reward Models for Language Modeling

Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Hanna Hajishirzi
2025
NAACL Findings

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet relatively little work has focused on evaluating those models.… 

Superlatives in Context: Modeling the Implicit Semantics of Superlatives

Valentina Pyatkin, Bonnie Webber, Ido Dagan, Reut Tsarfaty
2025
NAACL

Superlatives are used to single out elements with a maximal/minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the min/max property out of a set.… 

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Ashish Sabharwal
2025
ICLR

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have… 

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Antonis Antoniades, Xinyi Wang, Yanai Elazar, W. Wang
2025
ICLR

The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of… 

Holistically Evaluating the Environmental Impact of Creating Language Models

Jacob Morrison, Clara Na, Jared Fernandez, Jesse Dodge
2025
ICLR

As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the… 

On Linear Representations and Pretraining Data Frequency in Language Models

Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
2025
ICLR

Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on… 

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi
2025
ICLR

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on… 

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Bill Yuchen Lin, Yuntian Deng, K. Chandu, Yejin Choi
2025
ICLR

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully…