  • CLUE: A Chinese Language Understanding Evaluation Benchmark

    L. Xu, X. Zhang, L. Li, H. Hu, C. Cao, W. Liu, J. Li, Y. Li, K. Sun, Y. Xu, Y. Cui, C. Yu, Q. Dong, Y. Tian, D. Yu, B. Shi, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Q. Zhao, C. Yue, X. Zhang, Z. Yang. 2020.
    We introduce CLUE, a Chinese Language Understanding Evaluation benchmark. It contains eight different tasks, including single-sentence classification, sentence pair classification, and machine reading comprehension. We evaluate CLUE on a number of existing…
  • Belief Propagation Neural Networks

    J. Kuck, Shuvam Chakraborty, Hao Tang, R. Luo, Jiaming Song, A. Sabharwal, S. Ermon. NeurIPS, 2020.
    Learned neural solvers have successfully been used to solve combinatorial optimization and decision problems. More general counting variants of these problems, however, are still largely solved with hand-crafted solvers. To bridge this gap, we introduce…
  • Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

    Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, Jonathan Berant. NeurIPS (Spotlight Presentation), 2020.
    To what extent can a neural network systematically reason over symbolic facts? Evidence suggests that large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control. Recently, it has been shown that…
  • From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project

    Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, Michael Schmitz. AI Magazine 41 (4), Winter 2020.
    AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam…
  • Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

    Atticus Geiger, Kyle Richardson, Christopher Potts. EMNLP, BlackboxNLP Workshop, 2020.
    We address whether neural models for Natural Language Inference (NLI) can learn the compositional interactions between lexical entailment and negation, using four methods: the behavioral evaluation methods of (1) challenge test sets and (2) systematic…
  • A Dataset for Tracking Entities in Open Domain Procedural Text

    Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi Mishra, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard Hovy. EMNLP, 2020.
    We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky…
  • A Simple Yet Strong Pipeline for HotpotQA

    Dirk Groeneveld, Tushar Khot, Mausam, Ashish Sabharwal. EMNLP, 2020.
    State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However…
  • IIRC: A Dataset of Incomplete Information Reading Comprehension Questions

    James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, Pradeep Dasigi. EMNLP, 2020.
    Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a…
  • Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning

    H. Trivedi, N. Balasubramanian, Tushar Khot, A. Sabharwal. EMNLP, 2020.
    Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the…
  • Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering

    Harsh Jhamtani, P. Clark. EMNLP, 2020.
    Despite the rapid progress in multihop question-answering (QA), models still have trouble explaining why an answer is correct, with limited explanation training data available to learn from. To address this, we introduce three explanation datasets in which…