Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
We're Afraid Language Models Aren't Modeling Ambiguity
Ambiguity is an intrinsic feature of natural language. Managing ambiguity is a key part of human language understanding, allowing us to anticipate misunderstanding as communicators and revise our…
Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements
Despite the much discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects…
Language Models with Rationality
While large language models (LLMs) are proficient at question-answering (QA), the dependencies between their answers and other "beliefs" they may have about the world are typically unstated, and may…
Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy
When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer…
Editing Common Sense in Transformers
Editing model parameters directly in Transformers makes updating open-source transformer-based models possible without re-training. However, these editing methods have only been evaluated on…
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The…
Demystifying Prompts in Language Models via Perplexity Estimation
Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand…
Measuring and Narrowing the Compositionality Gap in Language Models
We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often…
Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning
Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models…
FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM…