Papers

  • A Controllable Model of Grounded Response Generation

    Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe Zhang, Xiang Gao, Chris Quirk, Rik Koncel-Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi, Mari Ostendorf, Bill Dolan
    AAAI 2022
    Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process. This control is essential to ensure that users' semantic intents are satisfied and to impose a degree of specificity…
  • CommonsenseQA 2.0: Exposing the Limits of AI through Gamification

    Alon Talmor, Ori Yoran, Ronan Le Bras, Chandrasekhar Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant
    NeurIPS 2021
    Constructing benchmarks that test the abilities of modern natural language understanding models is difficult – pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors…
  • FLEX: Unifying Evaluation for Few-Shot NLP

    Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy
    NeurIPS 2021
    Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which…
  • Mauve: An Information Divergence Measure Between Neural Text and Human Text

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, S. Welleck, Yejin Choi, Z. Harchaoui
    NeurIPS 2021
    As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We propose Mauve, a comparison measure for open-ended text generation, which directly compares a…
  • MERLOT: Multimodal Neural Script Knowledge Models

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, J. S. Park, Jize Cao, Ali Farhadi, Yejin Choi
    NeurIPS 2021
    As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of…
  • NaturalProofs: Mathematical Theorem Proving in Natural Language

    S. Welleck, Jiachen Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, Kyunghyun Cho
    NeurIPS 2021
    Understanding and creating mathematics using natural mathematical language – the mixture of symbolic and natural language used by humans – is a challenging and important problem for driving progress in machine learning. As a step in this direction, we develop…
  • One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval

    Akari Asai, Xinyan Yu, Jungo Kasai, Hanna Hajishirzi
    NeurIPS 2021
    We present CORA, a Cross-lingual Open-Retrieval Answer Generation model that can answer questions across many languages even when language-specific annotated data or knowledge sources are unavailable. We introduce a new dense passage retrieval algorithm that…
  • Teach Me to Explain: A Review of Datasets for Explainable NLP

    Sarah Wiegreffe and Ana Marasović
    NeurIPS 2021
    Explainable NLP (ExNLP) has increasingly focused on collecting human-annotated explanations. These explanations are used downstream in three ways: as data augmentation to improve performance on a predictive task, as a loss signal to train models to produce…
  • Achieving Model Robustness through Discrete Adversarial Training

    Maor Ivgi, Jonathan Berant
    EMNLP 2021
    Discrete adversarial attacks are symbolic perturbations to a language input that preserve the output label but lead to a prediction error. While such attacks have been extensively explored for the purpose of evaluating model robustness, their utility for…
  • Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema

    Yanai Elazar, Hongming Zhang, Yoav Goldberg, Dan Roth
    EMNLP 2021
    The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. We…