Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
What Sets the Tropical Cold Point in GSRMs During Boreal Winter? Overshooting Convection Versus Cirrus Lofting
The cold point tropopause, the minimum temperature within the tropical upper troposphere‐lower stratosphere region (UTLS), significantly impacts Earth's climate by influencing the amount of water…
Multi-Attribute Constraint Satisfaction via Language Model Rewriting
Obeying precise constraints on top of multiple external attributes is a common computational problem underlying seemingly different domains, from controlled text generation to protein engineering.…
ACE2: accurately learning subseasonal to decadal atmospheric variability and forced responses
Existing machine learning models of weather variability are not formulated to enable assessment of their response to varying external boundary conditions such as sea surface temperature and…
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To…
Applying the ACE2 Emulator to SST Green's Functions for the E3SMv3 Global Atmosphere Model
Green's functions are a useful technique for interpreting atmospheric state responses to changes in the spatial pattern of sea surface temperature (SST). Here we train version 2 of the Ai2 Climate…
RewardBench: Evaluating Reward Models for Language Modeling
Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models.…
Superlatives in Context: Modeling the Implicit Semantics of Superlatives
Superlatives are used to single out elements with a maximal/minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the min/max property out of a set.…
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human…
Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences
AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or…
Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have…