- Jieyu Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Kai-Wei Chang • ACL-IJCNLP • 2021 Is it possible to use natural language to intervene in a model’s behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model’s unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system’s social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today’s powerful neural language models are extremely poor ethical-advice takers; that is, they respond surprisingly little to ethical interventions, even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.
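A hypothetical sketch of the LEI setup: an ethical intervention is stated as a plain sentence inside the QA input. The strings and the concatenation scheme below are illustrative, not drawn from the actual benchmark.

```python
# Illustrative only: the benchmark's real templates and fields differ.
context = "Adam and Frank were walking home. One of them had visited a hospital that day."
intervention = "People should be judged by their actions, not by their names."
question = "Who was sick?"

# An ethical-advice-taking model should refuse to prefer either name
# without supporting evidence once the intervention is present.
model_input = f"{context} {intervention} Question: {question}"
print(model_input)
```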
- Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah A. Smith • ICLR • 2021 State-of-the-art neural machine translation models generate outputs autoregressively, where every step conditions on the previously generated tokens. This sequential nature causes inherent decoding latency. Non-autoregressive translation techniques, on the other hand, parallelize generation across positions and speed up inference at the expense of translation quality. Much recent effort has been devoted to non-autoregressive methods, aiming for a better balance between speed and quality. In this work, we re-examine the trade-off and argue that transformer-based autoregressive models can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a one-layer autoregressive decoder yields state-of-the-art accuracy with comparable latency to strong non-autoregressive models. Our findings suggest that the latency disadvantage for autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and we provide a new speed-quality baseline for future research toward fast, accurate translation.
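A rough sketch of the layer allocation this paper argues for, expressed with PyTorch's generic Transformer module; the 12-1 deep-shallow split follows the configurations studied in the paper, while the remaining hyperparameters are placeholders.

```python
import torch.nn as nn

# Deep encoder, shallow decoder: encoder layers run in parallel over the
# source sentence, so stacking them adds little latency, while each decoder
# layer is paid for again at every autoregressive generation step.
deep_shallow = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=12,  # deep: computed once per source sentence
    num_decoder_layers=1,   # shallow: computed once per generated token
)
```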
- Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong • ICLR • 2021 Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, an attention mechanism with linear time and space complexity that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA’s efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.
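A minimal numpy sketch of the linear-complexity core, assuming the trigonometric random features (Rahimi and Recht) that RFA builds on; the real model learns per-head projections, handles numerical stability, and adds the optional recency-bias gate.

```python
import numpy as np

def phi(x, W):
    """Map rows of x (n, d) to random features (n, 2D) after l2-normalizing."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    proj = x @ W.T                                    # (n, D)
    return np.concatenate([np.sin(proj), np.cos(proj)], -1) / np.sqrt(W.shape[0])

def rfa(Q, K, V, W):
    """Attention in O(n) time and space: no n-by-n matrix is ever formed."""
    q, k = phi(Q, W), phi(K, W)
    S = k.T @ V                       # (2D, d_v): sum_i phi(k_i) v_i^T
    z = k.sum(axis=0)                 # (2D,):     sum_i phi(k_i)
    return (q @ S) / (q @ z)[:, None]  # normalized kernel approximation

rng = np.random.default_rng(0)
n, d, D = 128, 64, 256
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = rfa(Q, K, V, rng.normal(size=(D, d)))           # (128, 64)
```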
- Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan Berant • TACL • 2021 A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are explicitly mentioned in it. In this work, we introduce STRATEGYQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, STRATEGYQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in STRATEGYQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼66%.
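The paper's running example ("Did Aristotle use a laptop?") makes the annotation structure concrete; the field names below are illustrative, not the dataset's exact schema, and the evidence snippets are paraphrased.

```python
# One StrategyQA-style example: an implicit-strategy yes/no question, its
# reasoning decomposition, and evidence paragraphs for each step.
example = {
    "question": "Did Aristotle use a laptop?",
    "answer": False,
    "decomposition": [
        "When did Aristotle live?",
        "When was the laptop invented?",
        "Is #2 before #1?",
    ],
    "evidence": [
        ["Aristotle (384-322 BC) was a Greek philosopher ..."],
        ["The first laptop computers appeared in the 1980s ..."],
        [],  # the final comparison step needs no retrieved paragraph
    ],
}
```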
- Alexis Ross, Ana Marasović, Matthew E. Peters • Findings of ACL • 2021 Humans give contrastive explanations that explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the important role that contrastivity plays in how people generate and evaluate explanations, this property is largely missing from current methods for explaining NLP models. We present MINIMAL CONTRASTIVE EDITING (MICE), a method for generating contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks—binary sentiment classification, topic classification, and multiple-choice question answering—show that MICE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MICE edits can be used for two use cases in NLP system development—uncovering dataset artifacts and debugging incorrect model predictions—and thereby illustrate that generating contrastive explanations is a promising research direction for model interpretability.
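A toy, self-contained illustration of the search for a minimal label-flipping edit. The real method uses a T5-based editor trained to infill masked spans; the word-swap "editor" and keyword "predictor" below are stand-ins so the loop is runnable.

```python
import itertools

POSITIVE, NEGATIVE = "positive", "negative"
SWAPS = {"great": "terrible", "loved": "hated", "fun": "dull"}  # toy edit proposals

def predictor(text):
    """Toy classifier: any negative keyword flips the label."""
    return NEGATIVE if any(w in text.split() for w in SWAPS.values()) else POSITIVE

def minimal_contrastive_edit(text, contrast_label):
    words = text.split()
    editable = [i for i, w in enumerate(words) if w in SWAPS]
    for k in range(1, len(editable) + 1):          # try smallest edits first
        for combo in itertools.combinations(editable, k):
            cand = list(words)
            for i in combo:
                cand[i] = SWAPS[cand[i]]
            cand_text = " ".join(cand)
            if predictor(cand_text) == contrast_label:
                return cand_text                   # minimal flip found
    return None

print(minimal_contrastive_edit("I loved this fun movie", NEGATIVE))
# -> "I hated this fun movie": one word changed, prediction flipped
```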
- Ori Ram, Yuval Kirstain, Jonathan Berant, A. Globerson, Omer Levy • ACL • 2021 In a number of question answering (QA) benchmarks, pretrained models have reached human parity through fine-tuning on the order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available. We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering. To address this, we propose a new pretraining scheme that is more suitable for extractive question answering. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks, e.g., 72.7 F1 with only 128 examples on SQuAD, while maintaining competitive (and sometimes better) performance in the high-resource setting. Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot setting.
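A rough sketch of the example-construction step described above. Span detection here is deliberately naive (repeated capitalized words), whereas the paper works with recurring n-gram spans and more careful filtering; the special token name is illustrative.

```python
from collections import Counter

MASK = "[QUESTION]"

def make_example(passage):
    """Mask all but one occurrence of a recurring span; the kept one is the answer."""
    tokens = passage.split()
    counts = Counter(tokens)
    recurring = [t for t, c in counts.items() if c > 1 and t[0].isupper()]
    if not recurring:
        return None
    span = recurring[0]
    occurrences = [i for i, t in enumerate(tokens) if t == span]
    keep = occurrences[0]                      # one occurrence stays: the answer span
    masked = [MASK if (t == span and i != keep) else t
              for i, t in enumerate(tokens)]
    return {"input": " ".join(masked), "answer_span": span, "answer_index": keep}

ex = make_example(
    "Einstein was born in Ulm . Later Einstein moved to Bern , where Einstein worked ."
)
print(ex["input"])   # all but one 'Einstein' become [QUESTION]
```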
- Alexander M. Hoyle, Ana Marasović, Noah A. Smith • Findings of ACL • 2021 Generating text from structured inputs, such as meaning representations or RDF triples, has often involved the use of specialized graph-encoding neural networks. However, recent applications of pretrained transformers to linearizations of graph inputs have yielded state-of-the-art generation results on graph-to-text tasks. Here, we explore the ability of these linearized models to encode local graph structures, in particular their invariance to the graph linearization strategy and their ability to reconstruct corrupted inputs. Our findings motivate solutions to enrich the quality of models’ implicit graph encodings via scaffolding. Namely, we use graph-denoising objectives implemented in a multi-task text-to-text framework. We find that these denoising scaffolds lead to substantial improvements in downstream generation in low-resource settings.
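A simplified sketch of the two ingredients: linearizing triples into a string for a text-to-text model, and corrupting the graph for a denoising scaffold. The delimiter tokens and the shuffle corruption are illustrative choices, not the paper's exact format.

```python
import random

def linearize(triples):
    """Flatten (subject, predicate, object) triples into one input string."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

def corrupt(triples, rng):
    shuffled = list(triples)
    rng.shuffle(shuffled)   # order corruption; other scaffolds drop or swap nodes
    return shuffled

triples = [("Alan_Bean", "occupation", "Test_pilot"),
           ("Alan_Bean", "mission", "Apollo_12")]
rng = random.Random(0)
denoising_pair = (linearize(corrupt(triples, rng)), linearize(triples))
# The multi-task setup would mix such (corrupted -> original) reconstruction
# pairs with the main graph-to-text generation pairs.
```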
- Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, Yejin Choi • ACL • 2021 Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DEXPERTS: Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with “expert” LMs and/or “anti-expert” LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts and unlikely by the anti-experts. We apply DEXPERTS to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DEXPERTS operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.
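At each decoding step the product of experts reduces to simple logit arithmetic; a minimal numpy sketch of that combination, where alpha controls the steering strength (variable names are ours).

```python
import numpy as np

def dexperts_step(base_logits, expert_logits, antiexpert_logits, alpha=2.0):
    """Shift the base LM's logits toward the expert and away from the anti-expert."""
    combined = base_logits + alpha * (expert_logits - antiexpert_logits)
    probs = np.exp(combined - combined.max())   # stable softmax
    return probs / probs.sum()                  # next-token distribution

vocab = 50_000
rng = np.random.default_rng(0)
p = dexperts_step(rng.normal(size=vocab),       # base LM logits
                  rng.normal(size=vocab),       # expert logits
                  rng.normal(size=vocab))       # anti-expert logits
```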
- Kaiser Sun and Ana Marasović • Findings of ACL • 2021 An attention matrix of a transformer self-attention sublayer can provably be decomposed into two components, and only one of them (effective attention) contributes to the model output. This leads us to ask whether visualizing effective attention gives different conclusions than interpretation of standard attention. Using a subset of the GLUE tasks and BERT, we carry out an analysis to compare the two attention matrices, and show that their interpretations differ. Effective attention is less associated with the features related to language modeling pretraining, such as the separator token, and it has more potential to illustrate linguistic features captured by the model for solving the end-task. Given these differences, we recommend using effective attention for studying a transformer’s behavior, since it is more pertinent to the model output by design.
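A small numpy sketch of the underlying decomposition, following Brunner et al.'s analysis on which effective attention builds: the component of each attention row that lies in the left null space of the value matrix cannot affect the sublayer output, so subtracting it leaves the effective attention.

```python
import numpy as np
from scipy.linalg import null_space

n, d = 16, 8                     # sequence length > head dim, so the null space is non-trivial
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(n), size=n)   # (n, n) attention matrix, rows sum to 1
V = rng.normal(size=(n, d))             # (n, d) value matrix

N = null_space(V.T)              # orthonormal basis of {a : a @ V = 0}
A_null = A @ N @ N.T             # component with no effect on the output
A_eff = A - A_null               # effective attention

assert np.allclose(A_eff @ V, A @ V)    # the output is unchanged by construction
```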
- Shaul Ravfogel, Hillel Taub-Tabib, Yoav Goldberg • ACL Demo Track • 2021 Domain experts often need to extract structured information from large corpora. We advocate for a search paradigm called “extractive search”, in which a search query is enriched with capture slots to allow for such rapid extraction. Such an extractive search system can be built around syntactic structures, resulting in high-precision, low-recall results. We show how the recall can be improved using neural retrieval and alignment. The goals of this paper are to concisely introduce the extractive-search paradigm and to demonstrate a prototype neural retrieval system for extractive search, along with its benefits and potential. Our prototype is available at https://spike.neural-sim.apps.allenai.org/ and a video demonstration is available at https://vimeo.com/559586687.
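A toy end-to-end illustration of the retrieve-then-align idea: find a corpus sentence similar to the query, then fill the query's capture slot from the best-aligned tokens. The hash-based "embeddings" stand in for a real neural encoder, and the capture-slot syntax is hypothetical, not SPIKE's actual query language.

```python
import numpy as np

def embed(text, dim=64):
    """Toy bag-of-tokens embedding; a real system uses a neural encoder."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

corpus = ["Marie Curie discovered polonium in 1898",
          "The weather in Paris was pleasant"]
query = "Fleming discovered :what"        # ':what' marks the capture slot

# Neural retrieval: rank corpus sentences against the slot-free query.
scores = [embed(s) @ embed(query.replace(":what", "")) for s in corpus]
best = corpus[int(np.argmax(scores))]

# Alignment: locate the anchor word in the retrieved sentence, then take
# the following token as the capture. Real systems align with contextual
# token embeddings instead of this positional shortcut.
tokens = best.split()
sims = [embed(t) @ embed("discovered") for t in tokens]
capture = tokens[int(np.argmax(sims)) + 1]
print(best, "->", capture)                # expects 'polonium'
```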