Learn more about AI2's Lasting Impact Award
All Years
Viewing 21-30 of 134 papers
  • More Bang for Your Buck: Natural Perturbation for Robust Question Answering

    Daniel Khashabi, Tushar Khot, Ashish SabharwalEMNLP2020
    While recent models have achieved human-level scores on many NLP datasets, we observe that they are considerably sensitive to small changes in input. As an alternative to the standard approach of addressing this issue by constructing training sets of completely new examples, we propose doing so via minimal perturbation of examples. Specifically, our approach involves first collecting a set of seed examples and then applying human-driven natural perturbations (as opposed to rule-based machine perturbations), which often change the gold label as well. Local perturbations have the advantage of being relatively easier (and hence cheaper) to create than writing out completely new examples. To evaluate the impact of this phenomenon, we consider a recent question-answering dataset (BoolQ) and study the benefit of our approach as a function of the perturbation cost ratio, the relative cost of perturbing an existing question vs. creating a new one from scratch. We find that when natural perturbations are moderately cheaper to create, it is more effective to train models using them: such models exhibit higher robustness and better generalization, while retaining performance on the original BoolQ dataset.
  • UnifiedQA: Crossing Format Boundaries With a Single QA System

    Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh HajishirziFindings of EMNLP2020
    Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems
  • What-if I ask you to explain: Explaining the effects of perturbations in procedural text

    Dheeraj Rajagopal, Niket Tandon, Peter Clark, Bhavana Dalvi, Eduard H. HovyFindings of EMNLP2020
    We address the task of explaining the effects of perturbations in procedural text, an important test of process comprehension. Consider a passage describing a rabbit's life-cycle: humans can easily explain the effect on the rabbit population if a female rabbit becomes ill -- i.e., the female rabbit would not become pregnant, and as a result not have babies leading to a decrease in rabbit population. We present QUARTET, a system that constructs such explanations from paragraphs, by modeling the explanation task as a multitask learning problem. QUARTET provides better explanations (based on the sentences in the procedural text) compared to several strong baselines on a recent process comprehension benchmark. We also present a surprising secondary effect: our model also achieves a new SOTA with a 7% absolute F1 improvement on a downstream QA task. This illustrates that good explanations do not have to come at the expense of end task performance.
  • "You are grounded!": Latent Name Artifacts in Pre-trained Language Models

    Vered Shwartz, Rachel Rudinger, Oyvind TafjordEMNLP2020
    Pre-trained language models (LMs) may perpetuate biases originating in their training corpus to downstream models. We focus on artifacts associated with the representation of given names (e.g., Donald), which, depending on the corpus, may be associated with specific entities, as indicated by next token prediction (e.g., Trump). While helpful in some contexts, grounding happens also in under-specified or inappropriate contexts. For example, endings generated for "Donald is a" substantially differ from those of other names, and often have more-than-average negative sentiment. We demonstrate the potential effect on downstream tasks with reading comprehension probes where name perturbation changes the model answers. As a silver lining, our experiments suggest that additional pre-training on different corpora may mitigate this bias.
  • ReadOnce Transformers: Reusable Representations of Text for Transformers

    Shih-Ting Lin, Ashish Sabharwal, Tushar KhotarXiv2020
    While large-scale language models are extremely effective when directly fine-tuned on many end-tasks, such models learn to extract information and solve the task simultaneously from end-task supervision. This is wasteful, as the general problem of gathering information from a document is mostly task-independent and need not be re-learned from scratch each time. Moreover, once the information has been captured in a computable representation, it can now be re-used across examples, leading to faster training and evaluation of models. We present a transformer-based approach, READONCE Transformers, that is trained to build such information-capturing representations of text. Our model compresses the document into a variable-length task-independent representation that can now be re-used in different examples and tasks, thereby requiring a document to only be read once. Additionally, we extend standard text-to-text models to consume our READONCE Representations along with text to solve multiple downstream tasks. We show our task-independent representations can be used for multi-hop QA, abstractive QA, and summarization. We observe 2x-5x speedups compared to standard text-totext models, while also being able to handle long documents that would normally exceed the length limit of current models.
  • Evaluating Models' Local Decision Boundaries via Contrast Sets

    M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D.Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S.Singh, N.A. Smith, S. Subramanian, et alFindings of EMNLP2020
    Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets---up to 25\% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
  • What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge

    Kyle Richardson, Ashish SabharwalTACL2020
    Open-domain question answering (QA) is known to involve several underlying knowledge and reasoning challenges, but are models actually learning such knowledge when trained on benchmark tasks? To investigate this, we introduce several new challenge tasks that probe whether state-of-theart QA models have general knowledge about word definitions and general taxonomic reasoning, both of which are fundamental to more complex forms of reasoning and are widespread in benchmark datasets. As an alternative to expensive crowd-sourcing, we introduce a methodology for automatically building datasets from various types of expert knowledge (e.g., knowledge graphs and lexical taxonomies), allowing for systematic control over the resulting probes and for a more comprehensive evaluation. We find automatically constructing probes to be vulnerable to annotation artifacts, which we carefully control for. Our evaluation confirms that transformerbased QA models are already predisposed to recognize certain types of structural lexical knowledge. However, it also reveals a more nuanced picture: their performance degrades substantially with even a slight increase in the number of “hops” in the underlying taxonomic hierarchy, or as more challenging distractor candidate answers are introduced. Further, even when these models succeed at the standard instance-level evaluation, they leave much room for improvement when assessed at the level of clusters of semantically connected probes (e.g., all Isa questions about a concept).
  • Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models

    Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish SabharwalarXiv2020
    A common approach to solve complex tasks is by breaking them down into simple sub-problems that can then be solved by simpler modules. However, these approaches often need to be designed and trained specifically for each complex task. We propose a general approach, Text Modular Networks(TMNs), where the system learns to decompose any complex task into the language of existing models. Specifically, we focus on Question Answering (QA) and learn to decompose complex questions into sub-questions answerable by existing QA models. TMNs treat these models as blackboxes and learn their textual input-output behavior (i.e., their language) through their task datasets. Our next-question generator then learns to sequentially produce sub-questions that help answer a given complex question. These sub-questions are posed to different existing QA models and, together with their answers, provide a natural language explanation of the exact reasoning used by the model. We present the first system, incorporating a neural factoid QA model and a symbolic calculator, that uses decomposition for the DROP dataset, while also generalizing to the multi-hop HotpotQA dataset. Our system, ModularQA, outperforms a cross-task baseline by 10-60 F1 points and performs comparable to task-specific systems, while also providing an easy-to-read explanation of its reasoning.
  • AdaWISH: Faster Discrete Integration via Adaptive Quantiles

    Fan Ding, Hanjing Wang, Ashish Sabharwal, Yexiang XueECAI2020
    Discrete integration in a high dimensional space of $n$ variables poses fundamental challenges. The WISH algorithm reduces the intractable discrete integration problem into $n$ optimization queries subject to randomized constraints, obtaining a constant approximation guarantee. The optimization queries are expensive, which limits the applicability of WISH. We propose AdaWISH, which is able to obtain the same guarantee, but accesses only a small subset of queries of WISH. For example, when the number of function values is bounded by a constant, AdaWISH issues only $O(\log n)$ queries. The key idea is to query adaptively, taking advantage of the shape of the weight function. In general, we prove that AdaWISH has a regret of no more than $O(\log n)$ relative to an oracle that issues queries at data-dependent optimal points. Experimentally, AdaWISH gives precise estimates for discrete integration problems, of the same quality as that of WISH and better than several competing approaches, on a variety of probabilistic inference benchmarks, while saving substantially on the number of optimization queries compared to WISH. For example, it saves $81.5\%$ of WISH queries while retaining the quality of results on a suite of UAI inference challenge benchmarks.
  • Approximating the Permanent by Sampling from Adaptive Partitions

    Jonathan Kuck, Tri Dao, Hamid Rezatofighi, Ashish Sabharwal, Stefano ErmonUAI2020
    Computing the permanent of a non-negative matrix is a core problem with practical applications ranging from target tracking to statistical thermodynamics. However, this problem is also #P-complete, which leaves little hope for finding an exact solution that can be computed efficiently. While the problem admits a fully polynomial randomized approximation scheme, this method has seen little use because it is both inefficient in practice and difficult to implement. We present AdaPart, a simple and efficient method for drawing exact samples from an unnormalized distribution. Using AdaPart, we show how to construct tight bounds on the permanent which hold with high probability, with guaranteed polynomial runtime for dense matrices. We find that AdaPart can provide empirical speedups exceeding 25x over prior sampling methods on matrices that are challenging for variational based approaches. Finally, in the context of multi-target tracking, exact sampling from the distribution defined by the matrix permanent allows us to use the optimal proposal distribution during particle filtering. Using AdaPart, we show that this leads to improved tracking performance using an order of magnitude fewer samples.
All Years