
Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael J.Q. Zhang, Zhilin Wang, Jena D. Hwang, Valentina Pyatkin

2025

ICML

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes -- task underspecification,…

SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Sydney Levine

2025

ICML

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's…

Multi-Attribute Constraint Satisfaction via Language Model Rewriting

Ashutosh Baheti, Debanjana Chakraborty, Faeze Brahman, Maarten Sap

2025

TMLR

Obeying precise constraints on top of multiple external attributes is a common computational problem underlying seemingly different domains, from controlled text generation to protein engineering.…

RewardBench: Evaluating Reward Models for Language Modeling

Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Hanna Hajishirzi

2025

NAACL Findings

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models.…

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Bill Yuchen Lin, Yuntian Deng, K. Chandu, Yejin Choi

2025

ICLR

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully…

The Art of Saying No: Contextual Noncompliance in Language Models

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Hannaneh Hajishirzi

2024

NeurIPS Datasets & Benchmarks

Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the…

Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals

Yanai Elazar, Bhargavi Paranjape, Hao Peng, Noah A. Smith

2024

EMNLP Findings

The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to…

Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

Jaehun Jung, Peter West, Liwei Jiang, Yejin Choi

2024

NAACL

We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization, that distills a high-quality dataset and model from a low-quality teacher that itself cannot…

JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models

Jillian R. Fisher, Ximing Lu, Jaehun Jung, Yejin Choi

2024

NAACL

The permanence of online content combined with the enhanced authorship identification techniques calls for stronger computational methods to protect the identity and privacy of online authorship…

MacGyver: Are Large Language Models Creative Problem Solvers?

Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Faeze Brahman

2024

NAACL

We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600…
