Papers
See AI2's Award Winning Papers
Learn more about AI2's Lasting Impact Award
Viewing 91-100 of 216 papers
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. WeldarXiv • 2021 Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks which can be reliably evaluated in an automatic…CLUE: A Chinese Language Understanding Evaluation Benchmark
L. Xu, X.Zhang, L. Li, H. Hu, C. Cao, W. Liu, J. Li, Y. Li, K. Sun, Y. Xu, Y. Cui, C. Yu, Q. Dong, Y. Tian, D. Yu, B. Shi, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Q. Zhao, C. Yue, X. Zhang, Z. Yang, et.al.COLING • 2020 We introduce CLUE, a Chinese Language Understanding Evaluation benchmark. It contains eight different tasks, including single-sentence classification, sentence pair classification, and machine reading comprehension. We evaluate CLUE on a number of existing…Belief Propagation Neural Networks
J. Kuck, Shuvam Chakraborty, Hao Tang, R. Luo, Jiaming Song, A. Sabharwal, S. ErmonNeurIPS • 2020 Learned neural solvers have successfully been used to solve combinatorial optimization and decision problems. More general counting variants of these problems, however, are still largely solved with hand-crafted solvers. To bridge this gap, we introduce…Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge
Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, Jonathan BerantNeurIPS • Spotlight Presentation • 2020 To what extent can a neural network systematically reason over symbolic facts? Evidence suggests that large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control. Recently, it has been shown that…From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project
Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, Michael SchmitzAI Magazine • 2020AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam…AI2 Lasting Impact AwardNeural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation
Atticus Geiger, Kyle Richardson, Christopher PottsEMNLP • BlackboxNLP Workshop • 2020 We address whether neural models for Natural Language Inference (NLI) can learn the compositional interactions between lexical entailment and negation, using four methods: the behavioral evaluation methods of (1) challenge test sets and (2) systematic…A Dataset for Tracking Entities in Open Domain Procedural Text
Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi Mishra, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard HovyEMNLP • 2020 We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky…A Simple Yet Strong Pipeline for HotpotQA
Dirk Groeneveld, Tushar Khot, Mausam, Ashish SabharwalEMNLP • 2020 State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However…IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
James Ferguson, Matt Gardner. Hannaneh Hajishirzi, Tushar Khot, Pradeep DasigiEMNLP • 2020 Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a…Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning
H. Trivedi, N. Balasubramanian, Tushar Khot, A. SabharwalEMNLP • 2020 Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the…