The AllenNLP team envisions language-centered AI that equitably serves humanity. We work to improve NLP systems' performance and accountability, and advance scientific methodologies for evaluating and understanding those systems. We deliver high-impact research of our own and masterfully-engineered open-source tools to accelerate NLP research around the world.
A Python library for choreographing your machine learning research. Construct machine learning experiments out of repeatable, reusable steps.View
A natural language processing platform for building state-of-the-art models. A complete platform for solving natural language processing tasks in PyTorch.View
- Tom Sherborne, Naomi Saphra, Pradeep Dasigi, Hao PengICLR • 2024 By reducing the curvature of the loss surface in the parameter space, Sharpness-aware minimization (SAM) yields widespread robustness improvement under domain transfer. Instead of focusing on parameters, however, this work considers the transferability of…
- Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse DodgeICLR • 2024 Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose…
- Yanai Elazar, Jiayao Zhang, David Wadden, Boshen Zhang, Noah A. SmithCLearR • 2024 What is the effect of releasing a preprint of a paper before it is submitted for peer review? No randomized controlled trial has been conducted, so we turn to observational data to answer this question. We use data from the ICLR conference (2018--2022) and…
- Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-BurcharXiv • 2024 Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and…
- Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, A. Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Daniel Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna HajishirziarXiv • 2024 Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of…
Question Answering on Research Papers
A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.
13K reading comprehension questions on Wikipedia paragraphs that require following links in those paragraphs to other Wikipedia pages
IIRC is a crowdsourced dataset consisting of information-seeking questions requiring models to identify and then retrieve necessary information that is missing from the original context. Each original context is a paragraph from English Wikipedia and it comes with a set of links to other Wikipedia pages, and answering the questions requires finding the appropriate links to follow and retrieving relevant information from those linked pages that is missing from the original context.
ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.
ZEST tests whether NLP systems can perform unseen tasks in a zero-shot way, given a natural language description of the task. It is an instantiation of our proposed framework "learning from task descriptions". The tasks include classification, typed entity extraction and relationship extraction, and each task is paired with 20 different annotated (input, output) examples. ZEST's structure allows us to systematically test whether models can generalize in five different ways.
A benchmark for training and evaluating generative reading comprehension metrics.
Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train an evaluation metric: LERC, a Learned Evaluation metric for Reading Comprehension, to mimic human judgement scores.
AI’s Climate Impact Goes beyond Its Emissions
December 7, 2023
Peeking Inside Pandora’s Box: Unveiling the Hidden Complexities of Language Model Datasets with ‘What’s in My Big Data’? (WIMBD)
November 5, 2023
AI Is Becoming More Powerful—but Also More Secretive
October 19, 2023
Your Personal Information Is Probably Being Used to Train Generative AI Models
October 19, 2023
Inside the secret list of websites that make AI like ChatGPT sound smart
April 19, 2023
AI can help address climate change—as long as it doesn’t exacerbate it
February 15, 2023
How to Detect AI-Generated Text, According to Researchers
February 8, 2023
Could AI help you to write your next paper?
October 31, 2022
NLP Highlights is AllenNLP’s podcast for discussing recent and interesting work related to natural language processing. Hosts from the AllenNLP team at AI2 offer short discussions of papers and occasionally interview authors about their work.