Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
Mitigating Biases in CORD-19 for Analyzing COVID-19 Literature
At the behest of the Office of Science and Technology Policy in the White House, six institutions, including ours, have created an open research dataset called COVID-19 Research Dataset (CORD-19) to…
Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition…
PySBD: Pragmatic Sentence Boundary Disambiguation
In this paper, we present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical…
Fact or Fiction: Verifying Scientific Claims
We introduce the task of scientific fact-checking. Given a corpus of scientific articles and a claim about a scientific finding, a fact-checking model must identify abstracts that support or refute…
MedICaT: A Dataset of Medical Images, Captions, and Textual References
Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of…
SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search
The COVID-19 pandemic has sparked unprecedented mobilization of scientists, already generating thousands of new papers that join a litany of previous biomedical work in related areas. This deluge of…
SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search
With worldwide concerns surrounding the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), there is a rapidly growing body of literature on the virus. Clinicians, researchers, and…
TLDR: Extreme Summarization of Scientific Documents
We introduce TLDR generation for scientific papers, a new automatic summarization task with high source compression, requiring expert background knowledge and complex language understanding. To…
ABNIRML: Analyzing the Behavior of Neural IR Models
Numerous studies have demonstrated the effectiveness of pretrained contextualized language models such as BERT and T5 for ad-hoc search. However, it is not well understood why these methods are so…
Generative Data Augmentation for Commonsense Reasoning
Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been…