The AllenNLP team envisions language-centered AI that equitably serves humanity. We work to improve NLP systems' performance and accountability, and advance scientific methodologies for evaluating and understanding those systems. We deliver high-impact research of our own and masterfully engineered open-source tools to accelerate NLP research around the world.
A Python library for choreographing your machine learning research. Construct machine learning experiments out of repeatable, reusable steps.
A natural language processing platform for building state-of-the-art models. A complete platform for solving natural language processing tasks in PyTorch.
Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling. Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hanna Hajishirzi, Sameer Singh, Roy Fox. arXiv • 2023. Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL…
- Alexander W. Fang, Simon Kornblith, Ludwig Schmidt. arXiv • 2023. Does progress on ImageNet transfer to real-world datasets? We investigate this question by evaluating ImageNet pre-trained models with varying accuracy (57% - 83%) on six practical image classification datasets. In particular, we study datasets collected with…
- Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, J. Jitsev. arXiv • 2022. Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale…
- Zhaofeng Wu, Robert L. Logan IV, Pete Walsh, Akshita Bhagia, Dirk Groeneveld, Sameer Singh, Iz Beltagy. EMNLP • 2022. Recently introduced language model prompting methods can achieve high accuracy in zero- and few-shot settings while requiring few to no learned task-specific parameters. Nevertheless, these methods still often trail behind full model finetuning. In this work…
- Anas Awadalla, Mitchell Wortsman, Gabriel Ilharco, Sewon Min, Ian H. Magnusson, Hannaneh Hajishirzi, Ludwig Schmidt. Findings of EMNLP • 2022. We conduct a large empirical evaluation to investigate the landscape of distributional robustness in question answering. Our investigation spans over 350 models and 16 question answering datasets, including a diverse set of architectures, model sizes, and…
Question Answering on Research Papers
A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.
13K reading comprehension questions on Wikipedia paragraphs that require following links in those paragraphs to other Wikipedia pages.
IIRC is a crowdsourced dataset of information-seeking questions that require models to identify and retrieve information missing from the original context. Each original context is a paragraph from English Wikipedia that comes with a set of links to other Wikipedia pages. Answering a question requires finding the appropriate links to follow and retrieving the relevant information from the linked pages.
ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.
ZEST tests whether NLP systems can perform unseen tasks in a zero-shot way, given a natural language description of the task. It is an instantiation of our proposed framework, "learning from task descriptions". The tasks include classification, typed entity extraction, and relationship extraction, and each task is paired with 20 different annotated (input, output) examples. ZEST's structure allows us to systematically test whether models can generalize in five different ways.
A benchmark for training and evaluating generative reading comprehension metrics.
Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train an evaluation metric: LERC, a Learned Evaluation metric for Reading Comprehension, to mimic human judgement scores.
Could AI help you to write your next paper?
October 31, 2022
How to shrink AI’s ballooning carbon footprint
July 19, 2022
These simple changes can make AI research much more energy efficient
July 6, 2022
Measuring AI’s Carbon Footprint
June 26, 2022
Why Historical Language Is a Challenge for Artificial Intelligence
November 16, 2021
The curse of neural toxicity: AI2 and UW researchers help computers watch their language
March 6, 2021
November 18, 2020
Your favorite A.I. language tool is toxic
September 29, 2020
NLP Highlights is AllenNLP’s podcast for discussing recent and interesting work related to natural language processing. Hosts from the AllenNLP team at AI2 offer short discussions of papers and occasionally interview authors about their work.