Learn more about AI2's Lasting Impact Award
All Years
Viewing 1-10 of 134 papers
  • Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan BerantTACL2021
    A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce STRATEGYQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, STRATEGYQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in STRATEGYQA are short, topicdiverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼ 66%
  • Temporal Reasoning on Implicit Events from Distant Supervision

    Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, D. RothNAACL2021
    Existing works on temporal reasoning among events described in text focus on modeling relationships between explicitly mentioned events and do not handle event end time effectively. However, human readers can infer from natural language text many implicit events that help them better understand the situation and, consequently, better reason about time. This work proposes a new crowd-sourced dataset, TRACIE, which evaluates systems' understanding of implicit events - events that are not mentioned explicitly in the text but can be inferred from it. This is done via textual entailment instances querying both start and end times of events. We show that TRACIE is challenging for state-of-the-art language models. Our proposed model, SymTime, exploits distant supervision signals from the text itself and reasons over events' start time and duration to infer events' end time points. We show that our approach improves over baseline language models, gaining 5% on the i.i.d. split and 9% on an out-of-distribution test split. Our approach is also general to other annotation schemes, gaining 2%-8% on MATRES, an extrinsic temporal relation benchmark.
  • Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models

    Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish SabharwalNAACL2021
    A common approach to solve complex tasks is by breaking them down into simple sub-problems that can then be solved by simpler modules. However, these approaches often need to be designed and trained specifically for each complex task. We propose a general approach, Text Modular Networks(TMNs), where the system learns to decompose any complex task into the language of existing models. Specifically, we focus on Question Answering (QA) and learn to decompose complex questions into sub-questions answerable by existing QA models. TMNs treat these models as blackboxes and learn their textual input-output behavior (i.e., their language) through their task datasets. Our next-question generator then learns to sequentially produce sub-questions that help answer a given complex question. These sub-questions are posed to different existing QA models and, together with their answers, provide a natural language explanation of the exact reasoning used by the model. We present the first system, incorporating a neural factoid QA model and a symbolic calculator, that uses decomposition for the DROP dataset, while also generalizing to the multi-hop HotpotQA dataset. Our system, ModularQA, outperforms a cross-task baseline by 10-60 F1 points and performs comparable to task-specific systems, while also providing an easy-to-read explanation of its reasoning.
  • Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2

    G. Betz, Kyle Richardson, Christian VoigtarXiv2021
    Thinking aloud is an effective meta-cognitive strategy human reasoners apply to solve difficult problems. We suggest to improve the reasoning ability of pre-trained neural language models in a similar way, namely by expanding a task’s context with problem elaborations that are dynamically generated by the language model itself. Our main result is that dynamic problem elaboration significantly improves the zero-shot performance of GPT-2 in a deductive reasoning and natural language inference task: While the model uses a syntactic heuristic for predicting an answer, it is capable (to some degree) of generating reasoned additional context which facilitates the successful application of its heuristic. We explore different ways of generating elaborations, including fewshot learning, and find that their relative performance varies with the specific problem characteristics (such as problem difficulty). Moreover, the effectiveness of an elaboration can be explained in terms of the degree to which the elaboration semantically coheres with the corresponding problem. In particular, elaborations that are most faithful to the original problem description may boost accuracy by up to 24%.
  • Information to Wisdom: Commonsense Knowledge Extraction and Compilation

    Simon Razniewski, Niket Tandon, Aparna S. Varde WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining2021
    Commonsense knowledge is a foundational cornerstone of artificial intelligence applications. Whereas information extraction and knowledge base construction for instance-oriented assertions, such as Brad Pitt's birth date, or Angelina Jolie's movie awards, has received much attention, commonsense knowledge on general concepts (politicians, bicycles, printers) and activities (eating pizza, fixing printers) has only been tackled recently. In this tutorial we present state-of-the-art methodologies towards the compilation and consolidation of such commonsense knowledge (CSK). We cover text-extraction-based, multi-modal and Transformer-based techniques, with special focus on the issues of web search and ranking, as of relevance to the WSDM community.
  • Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge

    Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, B. D. Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, P. ClarkarXiv2021
    We present the ARC-DA dataset, a direct-answer (“open response”, “freeform”) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple choice formats can be particularly susceptible to artifacts. The ARCDA dataset addresses these concerns by converting questions to direct-answer format using a combination of crowdsourcing and expert review. The resulting dataset contains 2985 questions with a total of 8436 valid answers (questions typically have more than one valid answer). ARC-DA is one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves. We describe the conversion approach taken, appropriate evaluation metrics, and several strong models. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We hope the dataset spurs further advances in complex questionanswering by the community.
  • GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

    Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. WeldarXiv2021
    Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks which can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency), and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available,1 and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
  • Critical Thinking for Language Models

    Gregor Betz, Christian Voigt, Kyle RichardsonarXiv2020
    This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models. We introduce a synthetic text corpus of deductively valid arguments, and use this artificial argument corpus to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on a few simple core schemes allows it to accurately complete conclusions of different, and more complex types of arguments, too. The language models seem to connect and generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for the GLUE and SNLI benchmarks. The findings suggest that there might exist a representative sample of paradigmatic instances of good reasoning that will suffice to acquire general reasoning skills and that might form the core of a critical thinking curriculum for language models.
  • ParsiNLU: A Suite of Language Understanding Challenges for Persian

    Daniel Khashabi, Arman Cohan, Siamak Shakeri, et al. arXiv2020
    Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this rich language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian language that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5$k$ new instances across 6 distinct NLU tasks. Besides, we present the first results on state-of-the-art monolingual and multi-lingual pre-trained language-models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.
  • CLUE: A Chinese Language Understanding Evaluation Benchmark

    L. Xu, X.Zhang, L. Li, H. Hu, C. Cao, W. Liu, J. Li, Y. Li, K. Sun, Y. Xu, Y. Cui, C. Yu, Q. Dong, Y. Tian, D. Yu, B. Shi, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Q. Zhao, C. Yue, X. Zhang, Z. Yang, 20202020
    We introduce CLUE, a Chinese Language Understanding Evaluation benchmark. It contains eight different tasks, including single-sentence classification, sentence pair classification, and machine reading comprehension. We evaluate CLUE on a number of existing full-network pre-trained models for Chinese. We also include a small hand-crafted diagnostic test set designed to probe specific linguistic phenomena using different models, some of which are unique to Chinese. Along with CLUE, we release a large clean crawled raw text corpus that can be used for model pre-training. We release CLUE, baselines and pre-training dataset on Github.
All Years