An abstract illustration of swirling shapes, meant to denote a futuristic feeling.

Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael J.Q. ZhangZhilin WangJena D. HwangValentina Pyatkin

2025

ICML

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes -- task underspecification,…

MIB: A Mechanistic Interpretability Benchmark

Aaron MuellerAtticus GeigerSarah WiegreffeYonatan Belinkov

2025

ICML

How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks…

SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

Jing-Jing LiValentina PyatkinMax Kleiman-WeinerSydney Levine

2025

ICML

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's…

Language Modeling by Language Models

Junyan ChengPeter ClarkKyle Richardson

2025

arxiv

Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional…

Holodeck: Language Guided Generation of 3D Embodied AI Environments

Yue YangFan-Yun SunLuca WeihsChristopher Clark

2025

Computer Vision and Pattern Recognition

3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To miti-gate this limitation,…

Multi-Attribute Constraint Satisfaction via Language Model Rewriting

Ashutosh BahetiDebanjana ChakrabortyFaeze BrahmanMaarten Sap

2025

TMLR

Obeying precise constraints on top of multiple external attributes is a common computational problem underlying seemingly different domains, from controlled text generation to protein engineering.…

ACE2: accurately learning subseasonal to decadal atmospheric variability and forced responses

Oliver Watt‐MeyerBrian HennJeremy McGibbonChristopher S. Bretherton

2025

NPJ Climate and Atmospheric Science

Existing machine learning models of weather variability are not formulated to enable assessment of their response to varying external boundary conditions such as sea surface temperature and…

Applying the ACE2 Emulator to SST Green's Functions for the E3SMv3 Global Atmosphere Model

Elynn WuF. RebassooPappu PaulChristopher S. Bretherton

2025

arXiv

Green's functions are a useful technique for interpreting atmospheric state responses to changes in the spatial pattern of sea surface temperature (SST). Here we train version 2 of the Ai2 Climate…

RewardBench: Evaluating Reward Models for Language Modeling

Nathan LambertValentina PyatkinJacob Daniel MorrisonHanna Hajishirzi

2025

NAACL Findings

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models.…

Superlatives in Context: Modeling the Implicit Semantics of Superlatives

Valentina PyatkinBonnie WebberIdo DaganReut Tsarfaty

2025

NAACL

Superlatives are used to single out elements with a maximal/minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the min/max property out of a set.…

1-10Next