An abstract illustration of swirling shapes, meant to denote a futuristic feeling.

Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

Applying the ACE2 Emulator to SST Green's Functions for the E3SMv3 Global Atmosphere Model

Elynn WuF. RebassooPappu PaulChristopher S. Bretherton

2025

Journal of Geophysical Research - Machine Learning

Green's functions are a useful technique for interpreting atmospheric state responses to changes in the spatial pattern of sea surface temperature (SST). Here we train version 2 of the Ai2 Climate…

RewardBench: Evaluating Reward Models for Language Modeling

Nathan LambertValentina PyatkinJacob Daniel MorrisonHanna Hajishirzi

2025

NAACL Findings

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models.…

Superlatives in Context: Modeling the Implicit Semantics of Superlatives

Valentina PyatkinBonnie WebberIdo DaganReut Tsarfaty

2025

NAACL

Superlatives are used to single out elements with a maximal/minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the min/max property out of a set.…

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Mingqi GaoYixin LiuXinyu HuArman Cohan

2025

NAACL

Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human…

Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences

Ruotong WangXinyi ZhouLin QiuAmy X. Zhang

2025

CHI

AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or…

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah WiegreffeOyvind TafjordYonatan BelinkovAshish Sabharwal

2025

ICLR

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have…

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad MajumderHarshit SuranaDhruv AgarwalPeter Clark

2025

ICLR

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of…

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Antonis AntoniadesXinyi WangYanai ElazarW. Wang

2025

ICLR

The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of…

Holistically Evaluating the Environmental Impact of Creating Language Models

Jacob MorrisonClara NaJared FernandezJesse Dodge

2025

ICLR

As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the…

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Parshin ShojaeeKazem MeidaniShashank GuptaChandan K Reddy

2025

ICLR

Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data…

Previous32-41Next