Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
ACE2: accurately learning subseasonal to decadal atmospheric variability and forced responses
Existing machine learning models of weather variability are not formulated to enable assessment of their response to varying external boundary conditions such as sea surface temperature and…
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To…
Applying the ACE2 Emulator to SST Green's Functions for the E3SMv3 Global Atmosphere Model
Green's functions are a useful technique for interpreting atmospheric state responses to changes in the spatial pattern of sea surface temperature (SST). Here we train version 2 of the Ai2 Climate…
RewardBench: Evaluating Reward Models for Language Modeling
Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models.…
Superlatives in Context: Modeling the Implicit Semantics of Superlatives
Superlatives are used to single out elements with a maximal/minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the min/max property out of a set.…
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human…
Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences
AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or…
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of…
Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have…
LLM-SR: Scientific Equation Discovery via Programming with Large Language Models
Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data…