Papers

  • SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

    Bill Yuchen Lin, Yicheng Fu, Karina Yang, Prithviraj Ammanabrolu, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Yejin Choi, Xiang Ren. NeurIPS 2023. We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting large…
  • SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

    Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman. EMNLP 2023. Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In…
  • A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

    Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo. EMNLP 2023. Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context…
  • Crystal: Introspective Reasoners Reinforced with Self-Feedback

    Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, Asli Celikyilmaz. EMNLP 2023. Extensive work has shown that the performance and interpretability of commonsense reasoning can be improved via knowledge-augmented reasoning methods, where the knowledge that underpins the reasoning process is explicitly verbalized and utilized. However…
  • Demystifying Prompts in Language Models via Perplexity Estimation

    Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke Zettlemoyer. EMNLP Findings 2023. Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best prompts. In this work…
  • Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

    Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov. EMNLP 2023. Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more…
  • Editing Common Sense in Transformers

    Anshita Gupta*, Debanjan Mondal*, Akshay Krishna Sheshadri*, Wenlong Zhao, Xiang Lorraine Li*, Sarah Wiegreffe*, Niket Tandon*. EMNLP 2023. Editing model parameters directly in Transformers makes updating open-source transformer-based models possible without re-training. However, these editing methods have only been evaluated on statements about encyclopedic knowledge with a single correct answer…
  • FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi. EMNLP 2023. Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2…
  • FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

    Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, Maarten Sap. EMNLP 2023. Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question…
  • Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy

    Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal. EMNLP 2023. When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms…