Papers

Learn more about AI2's Lasting Impact Award
Viewing 11-20 of 165 papers
  • Paloma: A Benchmark for Evaluating Language Model Fit

    Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, A. Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hanna Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse DodgearXiv2023 Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains$\unicode{x2013}$varying distributions of language. Rather than assuming perplexity on one distribution…
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hanna HajishirziNeurIPS2023 In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied…
  • SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

    Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey FeldmanEMNLP2023 Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In…
  • A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

    Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle LoEMNLP2023 Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context…
  • PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents

    Kyle Lo, Zejiang Shen, Benjamin Newman, Joseph Chee Chang, Russell Authur, Erin Bransom, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Bailey Kuehl, Amanpreet Singh, Chris Wilhelm, Angele Zamarron, Marti A. Hearst, Daniel S. Weld, Doug Downey, Luca SoldainiEMNLP2023 Despite growing interest in applying natural language processing (NLP) and computer vision (CV) models to the scholarly domain, scientific documents remain challenging to work with. They’re often in difficult-to-use PDF formats, and the ecosystem of models to…
  • RCT Rejection Sampling for Causal Estimation Evaluation

    Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit BhattacharyaTransactions on Machine Learning Research2023 Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to…
  • CHAMP: Efficient Annotation and Consolidation of Cluster Hierarchies

    Arie Cattan, Tom Hope, Doug Downey, Roy Bar-Haim, Lilach Eden, Yoav Kantor, Ido DaganConference on Empirical Methods in Natural Language Processing2023 Various NLP tasks require a complex hierarchical structure over nodes, where each node is a cluster of items. Examples include generating entailment graphs, hierarchical cross-document coreference resolution, annotating event and subevent relations, etc. To…
  • CARE: Extracting Experimental Findings From Clinical Literature

    Aakanksha Naik, Bailey Kuehl, Erin Bransom, Doug Downey, Tom HopearXiv.org2023 Extracting fine-grained experimental findings from literature can provide massive utility for scientific applications. Prior work has focused on developing annotation schemas and datasets for limited aspects of this problem, leading to simpler information…
  • LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks

    Mihir Parmar, Aakanksha Naik, Himanshu Gupta, Disha Agrawal, Chitta BaralarXiv.org2023 Many large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences such as a complete electronic health record (EHR) has not been systematically explored. Assessing these models on long…
  • Papeos: Augmenting Research Papers with Talk Videos

    Tae Soo Kim, Matt Latzke, Jonathan Bragg, Amy X. Zhang, Joseph Chee ChangUIST2023 Research consumption has been traditionally limited to the reading of academic papers—a static, dense, and formally written format. Alternatively, pre-recorded conference presentation videos, which are more dynamic, concise, and colloquial, have recently…