Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
S2AND: A Benchmark and Evaluation System for Author Name Disambiguation
Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library…
COVR: A test-bed for Visually Grounded Compositional Generalization with real images
While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In…
Conversational Multi-Hop Reasoning with Neural Commonsense Knowledge and Symbolic Logic Rules
One of the challenges faced by conversational agents is their inability to identify unstated presumptions of their users’ commands, a task trivial for humans due to their common sense. In this…
General-Purpose Question-Answering with Macaw
Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available. In response, we present MACAW, a versatile, generative…
Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Comp
Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to…
Factorizing Perception and Policy for Interactive Instruction Following
Performing simple household tasks based on language directives is very natural to humans, yet it remains an open challenge for AI agents. The ‘interactive instruction following’ task attempts to…
It's not Rocket Science: Interpreting Figurative Language in Narratives
Figurative language is ubiquitous in English. Yet, the vast majority of NLP research focuses on literal language. Existing text representations by design rely on compositionality, while figurative…
Question Decomposition with Dependency Graphs
QDMR is a meaning representation for complex questions, which decomposes questions into a sequence of atomic steps. While state-of-the-art QDMR parsers use the common sequence-to-sequence (seq2seq)…
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run…
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce STRATEGYQA, a question…