Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training
We present UNIFIEDQA-v2, a QA model built with the same process as UNIFIEDQA, except that it utilizes more supervision – roughly 3× the number of datasets used for UNIFIEDQA. This generally leads to…
Vessel Detection in Sentinel-1 Imagery
In this document, we detail the approach in our xView3 submission. The xView3 dataset presents the challenge of detecting vessels and other maritime objects in synthetic aperture radar (SAR) images…
Tropical Cirrus in Global Storm‐Resolving Models: 2. Cirrus Life Cycle and Top‐of‐Atmosphere Radiative Fluxes
Cirrus clouds of various thicknesses and radiative characteristics extend over much of the tropics, especially around deep convection. They are difficult to observe due to their high altitude and…
Tropical Cirrus in Global Storm‐Resolving Models: 1. Role of Deep Convection
Pervasive cirrus clouds in the upper troposphere and tropical tropopause layer (TTL) influence the climate by altering the top‐of‐atmosphere radiation balance and stratospheric water vapor budget.…
DREAM: Improving Situational QA by First Elaborating the Situation
When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests that they form a mental picture of that…
Inherently Explainable Reinforcement Learning in Natural Language
We focus on the task of creating a reinforcement learning agent that is inherently explainable—with the ability to produce immediate local explanations by thinking out loud while performing a task…
CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
Constructing benchmarks that test the abilities of modern natural language understanding models is difficult – pre-trained language models exploit artifacts in benchmarks to achieve human…
FLEX: Unifying Evaluation for Few-Shot NLP
Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental…
MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers
As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure…
MERLOT: Multimodal Neural Script Knowledge Models
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model…