Papers
See AI2's Award Winning Papers
Learn more about AI2's Lasting Impact Award
Viewing 1-10 of 275 papers
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hanna HajishirziNeurIPS • 2023 Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF) - where human preference judgments on LM outputs are transformed into a…How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hanna HajishirziNeurIPS • 2023 In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied…RealTime QA: What's the Answer Right Now?
Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir R. Radev, Noah A. Smith, Yejin Choi, Kentaro InuiNeurIPS • 2023 We introduce R EAL T IME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). R E AL T IME QA inquires about the current world, and QA systems need to answer questions about…Demystifying Prompts in Language Models via Perplexity Estimation
Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke ZettlemoyerEMNLP Findings • 2023 Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best prompts. In this work…Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia TsvetkovEMNLP • 2023 Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more…Measuring and Narrowing the Compositionality Gap in Language Models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike LewisEMNLP Findings • 2023 We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate…Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements
Jiacheng Liu, Wenya Wang, Dianzhuo Wang, Noah A. Smith, Yejin Choi, Hanna HajishirziEMNLP • 2023 Despite the much discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects on the correctness of LM outputs, and introduce Vera, a…We're Afraid Language Models Aren't Modeling Ambiguity
Alisa Liu, Zhaofeng Wu, Julian Michael, Alane Suhr, Peter West, Alexander Koller, Swabha Swayamdipta, Noah A. Smith, Yejin ChoiEMNLP • 2023 Ambiguity is an intrinsic feature of natural language. Managing ambiguity is a key part of human language understanding, allowing us to anticipate misunderstanding as communicators and revise our interpretations as listeners. As language models (LMs) are…The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback
Nathan Lambert, Roberto CalandraarXiv • 2023 Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to prompt and more capable in complex settings. RLHF at its core is providing a new toolkit to optimize LLMs other than next…What's In My Big Data?
Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse DodgearXiv.org • 2023 Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose…