Papers
See AI2's Award Winning Papers
Learn more about AI2's Lasting Impact Award
Viewing 11-20 of 165 papers
Paloma: A Benchmark for Evaluating Language Model Fit
Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, A. Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hanna Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse DodgearXiv • 2023 Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains$\unicode{x2013}$varying distributions of language. Rather than assuming perplexity on one distribution…How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hanna HajishirziNeurIPS • 2023 In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied…SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey FeldmanEMNLP • 2023 Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In…A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents
Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle LoEMNLP • 2023 Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context…PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents
Kyle Lo, Zejiang Shen, Benjamin Newman, Joseph Chee Chang, Russell Authur, Erin Bransom, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Bailey Kuehl, Amanpreet Singh, Chris Wilhelm, Angele Zamarron, Marti A. Hearst, Daniel S. Weld, Doug Downey, Luca SoldainiEMNLP • 2023 Despite growing interest in applying natural language processing (NLP) and computer vision (CV) models to the scholarly domain, scientific documents remain challenging to work with. They’re often in difficult-to-use PDF formats, and the ecosystem of models to…RCT Rejection Sampling for Causal Estimation Evaluation
Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit BhattacharyaTransactions on Machine Learning Research • 2023 Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to…CHAMP: Efficient Annotation and Consolidation of Cluster Hierarchies
Arie Cattan, Tom Hope, Doug Downey, Roy Bar-Haim, Lilach Eden, Yoav Kantor, Ido DaganConference on Empirical Methods in Natural Language Processing • 2023 Various NLP tasks require a complex hierarchical structure over nodes, where each node is a cluster of items. Examples include generating entailment graphs, hierarchical cross-document coreference resolution, annotating event and subevent relations, etc. To…CARE: Extracting Experimental Findings From Clinical Literature
Aakanksha Naik, Bailey Kuehl, Erin Bransom, Doug Downey, Tom HopearXiv.org • 2023 Extracting fine-grained experimental findings from literature can provide massive utility for scientific applications. Prior work has focused on developing annotation schemas and datasets for limited aspects of this problem, leading to simpler information…LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks
Mihir Parmar, Aakanksha Naik, Himanshu Gupta, Disha Agrawal, Chitta BaralarXiv.org • 2023 Many large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences such as a complete electronic health record (EHR) has not been systematically explored. Assessing these models on long…Papeos: Augmenting Research Papers with Talk Videos
Tae Soo Kim, Matt Latzke, Jonathan Bragg, Amy X. Zhang, Joseph Chee ChangUIST • 2023 Research consumption has been traditionally limited to the reading of academic papers—a static, dense, and formally written format. Alternatively, pre-recorded conference presentation videos, which are more dynamic, concise, and colloquial, have recently…