Learn more about AI2's Lasting Impact Award
AI2 Irvine
All Years
Viewing 21-28 of 28 papers
  • Global Reasoning over Database Structures for Text-to-SQL Parsing

    Ben Bogin, Matt Gardner, Jonathan BerantEMNLP2019
    State-of-the-art semantic parsers rely on auto-regressive decoding, emitting one symbol at a time. When tested against complex databases that are unobserved at training time (zero-shot), the parser often struggles to select the correct set of database constants in the new database, due to the local nature of decoding. In this work, we propose a semantic parser that globally reasons about the structure of the output query to make a more contextually-informed selection of database constants. We use message-passing through a graph neural network to softly select a subset of database constants for the output query, conditioned on the question. Moreover, we train a model to rank queries based on the global alignment of database constants to question words. We apply our techniques to the current state-of-the-art model for Spider, a zero-shot semantic parsing dataset with complex databases, increasing accuracy from 39.4% to 47.4%.
  • QuaRTz: An Open-Domain Dataset of Qualitative Relationship Questions

    Oyvind Tafjord, Matt Gardner, Kevin Lin, Peter ClarkEMNLP2019
    We introduce the first open-domain dataset, called QuaRTz, for reasoning about textual qualitative relationships. QuaRTz contains general qualitative statements, e.g., "A sunscreen with a higher SPF protects the skin longer.", twinned with 3864 crowdsourced situated questions, e.g., "Billy is wearing sunscreen with a lower SPF than Lucy. Who will be best protected from the sun?", plus annotations of the properties being compared. Unlike previous datasets, the general knowledge is textual and not tied to a fixed set of relationships, and tests a system's ability to comprehend and apply textual qualitative knowledge in a novel setting. We find state-of-the-art results are substantially (20%) below human performance, presenting an open challenge to the NLP community.
  • Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

    Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, Matt GardnerEMNLP2019
    Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark—the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.
  • Compositional Questions Do Not Necessitate Multi-hop Reasoning

    Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, Luke ZettlemoyerACL2019
    Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1---comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.
  • Question Answering is a Format; When is it Useful?

    Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, Sewon MinarXiv2019
    Recent years have seen a dramatic expansion of tasks and datasets posed as question answering, from reading comprehension, semantic role labeling, and even machine translation, to image and video understanding. With this expansion, there are many differing views on the utility and definition of "question answering" itself. Some argue that its scope should be narrow, or broad, or that it is overused in datasets today. In this opinion piece, we argue that question answering should be considered a format which is sometimes useful for studying particular phenomena, not a phenomenon or task in itself. We discuss when a task is correctly described as question answering, and when a task is usefully posed as question answering, instead of using some other format.
  • Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing

    Ben Bogin, Jonathan Berant, Matt GardnerACL2019
    Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In SPIDER, a recently-released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network, and this representation is later used at both encoding and decoding time. Evaluation shows that encoding the schema structure improves our parser accuracy from 33.8% to 39.4%, dramatically above the current state of the art, which is at 19.7%.
  • Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

    Robert L. Logan IV, Nelson F. Liu, Matthew E. Peters, Matt Gardner, Sameer SinghACL2019
    Modeling human language requires the ability to not only generate fluent text but also encode factual knowledge. However, traditional language models are only capable of remembering facts seen at training time, and often have difficulty recalling them. To address this, we introduce the knowledge graph language model (KGLM), a neural language model with mechanisms for selecting and copying facts from a knowledge graph that are relevant to the context. These mechanisms enable the model to render information it has never seen before, as well as generate out-of-vocabulary tokens. We also introduce the Linked WikiText-2 dataset, a corpus of annotated text aligned to the Wikidata knowledge graph whose contents (roughly) match the popular WikiText-2 benchmark. In experiments, we demonstrate that the KGLM achieves significantly better performance than a strong baseline language model. We additionally compare different language model's ability to complete sentences requiring factual knowledge, showing that the KGLM outperforms even very large language models in generating facts
  • DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt GardnerNAACL-HLT2019
    Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
AI2 Irvine
All Years