Ai2

Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.


ACE2: accurately learning subseasonal to decadal atmospheric variability and forced responses

Oliver Watt‐Meyer, Brian Henn, Jeremy McGibbon, Christopher S. Bretherton
2025
NPJ Climate and Atmospheric Science

Existing machine learning models of weather variability are not formulated to enable assessment of their response to varying external boundary conditions such as sea surface temperature and… 

Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

William Merrill, Shane Arora, Dirk Groeneveld, Hanna Hajishirzi
2025
arXiv

The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To… 

Applying the ACE2 Emulator to SST Green's Functions for the E3SMv3 Global Atmosphere Model

Elynn Wu, F. Rebassoo, Pappu Paul, Christopher S. Bretherton
2025
arXiv

Green's functions are a useful technique for interpreting atmospheric state responses to changes in the spatial pattern of sea surface temperature (SST). Here we train version 2 of the Ai2 Climate… 

RewardBench: Evaluating Reward Models for Language Modeling

Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Hanna Hajishirzi
2025
NAACL Findings

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models.… 

Superlatives in Context: Modeling the Implicit Semantics of Superlatives

Valentina Pyatkin, Bonnie Webber, Ido Dagan, Reut Tsarfaty
2025
NAACL

Superlatives are used to single out elements with a maximal or minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the minimal or maximal property out of a set.…

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Mingqi Gao, Yixin Liu, Xinyu Hu, Arman Cohan
2025
NAACL

Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluation…

Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences

Ruotong Wang, Xinyi Zhou, Lin Qiu, Amy X. Zhang
2025
CHI

AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but they can be unhelpful or even annoying when they do not fit the group's preferences or…

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Peter Clark
2025
ICLR

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of… 

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Ashish Sabharwal
2025
ICLR

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have… 

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Parshin Shojaee, Kazem Meidani, Shashank Gupta, Chandan K Reddy
2025
ICLR

Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data…