Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Mingqi Gao, Yixin Liu, Xinyu Hu, Arman Cohan

2025

NAACL

Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human…

Social-RAG: Retrieving from Group Interactions to Socially Ground Proactive AI Generation to Group Preferences

Ruotong Wang, Xinyi Zhou, Lin Qiu, Amy X. Zhang

2025

CHI

AI agents are increasingly tasked with making proactive suggestions in online spaces where groups collaborate, but can be unhelpful or even annoying, due to not fitting the group's preferences or…

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Ashish Sabharwal

2025

ICLR

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have…

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Peter Clark

2025

ICLR

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of…

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Antonis Antoniades, Xinyi Wang, Yanai Elazar, W. Wang

2025

ICLR

The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of…

Holistically Evaluating the Environmental Impact of Creating Language Models

Jacob Morrison, Clara Na, Jared Fernandez, Jesse Dodge

2025

ICLR

As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the…

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Parshin Shojaee, Kazem Meidani, Shashank Gupta, Chandan K Reddy

2025

ICLR

Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data…

On Linear Representations and Pretraining Data Frequency in Language Models

Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar

2025

ICLR

Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on…

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Bill Yuchen Lin, Yuntian Deng, K. Chandu, Yejin Choi

2025

ICLR

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully…

Skilful global seasonal predictions from a machine learning weather model trained on reanalysis data

Chris Kent, Adam A. Scaife, N. Dunstone, Oliver Watt-Meyer

2025

arXiv

Machine learning weather models trained on observed atmospheric conditions can outperform conventional physics-based models at short- to medium-range (1-14 day) forecast timescales. Here we take the…