Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning
Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This…
Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering
Despite the rapid progress in multihop question-answering (QA), models still have trouble explaining why an answer is correct, with limited explanation training data available to learn from. To…
More Bang for Your Buck: Natural Perturbation for Robust Question Answering
While recent models have achieved human-level scores on many NLP datasets, we observe that they are highly sensitive to small changes in the input. As an alternative to the standard approach of…
OCNLI: Original Chinese Natural Language Inference
Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has…
UnifiedQA: Crossing Format Boundaries With a Single QA System
Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection and multiple choice. This has led to format-specialized models, and even to an implicit…
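A minimal sketch of the crossing-formats idea, assuming the publicly released UnifiedQA checkpoints on the Hugging Face hub; the checkpoint name and the lowercased "question \n context" input convention below are assumptions based on those released models, not details taken from this snippet.

```python
# Sketch: one seq2seq QA model queried with inputs from different formats.
# The checkpoint name and input convention are assumptions (see lead-in).
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "allenai/unifiedqa-t5-small"  # assumed released checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def answer(question: str, context: str) -> str:
    # Inputs are lowercased and joined as a single "question \n context" string.
    text = f"{question} \n {context}".lower()
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Extractive-style input
print(answer("Who wrote Hamlet?", "Hamlet is a tragedy written by William Shakespeare."))
# Multiple-choice-style input: answer options appended to the same string
print(answer("Which is a mammal? \n (a) shark (b) whale (c) trout", "Whales are mammals."))
```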
UnQovering Stereotyping Biases via Underspecified Questions
While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework…
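As a rough illustration of probing a QA model with underspecified questions (a sketch of the general idea only, not the UNQOVER implementation): the context template, the default extractive QA model, and the scoring below are illustrative assumptions.

```python
# Sketch: ask an underspecified question about two names and swap their order.
# The context gives no evidence for either answer, so a systematic preference
# for one name hints at a learned stereotype. Placeholder setup, not UNQOVER.
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA model (placeholder)

def probe(name_a: str, name_b: str, question: str) -> dict:
    # Swap the name order to control for position effects.
    results = {}
    for first, second in [(name_a, name_b), (name_b, name_a)]:
        context = f"{first} lives in the same city as {second}."
        pred = qa(question=question, context=context)
        results[context] = (pred["answer"], round(pred["score"], 3))
    return results

print(probe("John", "Mary", "Who was a bad driver?"))
```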
What-if I ask you to explain: Explaining the effects of perturbations in procedural text
We address the task of explaining the effects of perturbations in procedural text, an important test of process comprehension. Consider a passage describing a rabbit's life-cycle: humans can easily…
"You are grounded!": Latent Name Artifacts in Pre-trained Language Models
Pre-trained language models (LMs) may perpetuate biases originating in their training corpus to downstream models. We focus on artifacts associated with the representation of given names (e.g.,…
Evaluating Models' Local Decision Boundaries via Contrast Sets
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading:…
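To make the evaluation idea concrete, here is a small sketch of scoring a model on contrast sets under one common convention, where a set counts as correct only if every example in it (the original plus its minimally perturbed, label-changing variants) is answered correctly; the data layout and metric name are illustrative assumptions, not the paper's released code.

```python
# Sketch: "contrast consistency" = fraction of contrast sets on which the
# model gets every example right. Data layout below is an assumption.
from typing import Callable, List, Tuple

Example = Tuple[str, str]      # (input_text, gold_label)
ContrastSet = List[Example]    # original example plus its perturbations

def contrast_consistency(model: Callable[[str], str],
                         contrast_sets: List[ContrastSet]) -> float:
    consistent = sum(
        all(model(text) == gold for text, gold in cset)
        for cset in contrast_sets
    )
    return consistent / len(contrast_sets)

# Toy usage with a stand-in "model"
toy_model = lambda text: "positive" if "great" in text else "negative"
sets = [
    [("The film was great.", "positive"),
     ("The film was not great.", "negative")],
]
print(contrast_consistency(toy_model, sets))  # 1.0 on this toy set
```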
What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge
Open-domain question answering (QA) is known to involve several underlying knowledge and reasoning challenges, but are models actually learning such knowledge when trained on benchmark tasks? To…