Aristo
Building the next generation of systems that can systematically reason, explain, and continually improve over time
- Systematic reasoning and explanation
- Teachable reasoning systems
- Continual learning with memory-based architectures
- Knowledge and belief
- Universal mathematical reasoning
Recent Updates
Towards Teachable Reasoning Systems
April 27, 2022
This paper describes our work towards Teachable Reasoning Systems. First, EntailmentWriter searches for a chain of reasoning from facts it believes…
Memory-assisted prompt editing to improve GPT-3 after deployment
April 20, 2022
Large LMs such as GPT-3 are powerful, but can make mistakes that are obvious to humans. Memory-assisted prompt editing allows users to give…
DREAM: Improving Situational QA by First Elaborating the Situation
March 1, 2022
When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests…
Explaining Answers with Entailment Trees
November 1, 2021
EntailmentBank is a unique dataset of multi-step entailment trees. Each tree shows how known facts combine to entail the answer to a question. From…
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief
November 1, 2021
Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions…
Research Areas
Teachable Reasoning Systems
By interacting with and giving feedback on a system’s reasoning, a user can teach the system so it continually improves over time – without model retraining.
Learn More:
Neuro-Symbolic Reasoning and Explanation
Solving problems by generating consistent, faithful chains of reasoning using neural components.
Learn More:
Modular Models
By learning to chain together existing models, complex problems can be solved, beyond the capabilities of the individual components.
Learn More:
Universal Mathematical Reasoners
Creating models with built-in mathematical reasoning skills that can be rapidly fine-tuned for a wide variety of mathematical tasks.
Learn More:
Macaw is a high-performance question-answering (QA) model capable of outperforming other popular current language models, all while being an order of magnitude smaller. This demo allows you to explore Macaw's answers and compare them to those of the popular GPT-3 language model on a benchmark set of questions.
Try the demo
Like RuleTaker, ProofWriter determines whether statements are True or False based on rules given in natural language, but it also generates a proof of its answers.
Try the demo
Recent Papers
Lila: A Unified Benchmark for Mathematical Reasoning
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan
EMNLP • 2022
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning…
Calibrating Trust of Multi-Hop Question Answering Systems with Decompositional Probes
Kaige Xie, Sarah Wiegreffe, Mark O. Riedl
Findings of EMNLP • 2022
Multi-hop Question Answering (QA) is a challenging task since it requires an accurate aggregation of information from multiple context paragraphs and a thorough understanding of the underlying reasoning chains. Recent work in multi-hop QA has shown that…
Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning
Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
EMNLP • 2022
Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning. Such a capability would allow better understanding of why a model produced the answer it did. Our approach…
Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning
Xiangyu Peng, Siyan Li, Sarah Wiegreffe, Mark O. Riedl
Findings of EMNLP • 2022
Transformer-based language model approaches to automated story generation currently provide state-of-the-art results. However, they still suffer from plot incoherence when generating narratives over time, and critically lack basic commonsense reasoning…
Towards Teachable Reasoning Systems: Using a Dynamic Memory of User Feedback for Continual System Improvement
Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Clark
EMNLP • 2022
Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time. Our approach is to augment a QA model with a dynamic memory of…
Recent Datasets
Lila
A math reasoning benchmark of over 140K natural language questions annotated with Python programs
A comprehensive benchmark for mathematical reasoning with over 140K natural language questions annotated with Python programs and natural language instructions. The dataset comes with multiple splits: Lila-IID (train, dev, test), Lila-OOD (train, dev, test), and Lila-Robust.
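As a rough illustration of the annotation format described above, each Lila question pairs natural language with an executable Python program that produces the answer. The record below is a hypothetical example; the field names and question are illustrative, not Lila's actual schema or data.

```python
# Hypothetical Lila-style record: a natural language math question
# annotated with a Python program that computes its answer.
# (Field names and content are illustrative, not the dataset's schema.)
record = {
    "question": "A grocery bag holds 6 apples. How many apples are in 4 bags?",
    "program": "answer = 6 * 4",
}

# Executing the annotated program yields the answer, which is how
# program annotations can be checked for correctness.
scope = {}
exec(record["program"], scope)
print(scope["answer"])  # 24
```

Executable annotations like this let a benchmark verify answers by running the program rather than string-matching free-form text.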
Entailer
Data for "Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning", EMNLP 2022
TeachMe
Supplementary data for "Towards Teachable Reasoning Systems: Using a Dynamic Memory ...", EMNLP 2022
Multihop Questions via Single-hop Question Composition
Multihop reading comprehension dataset with 2-4 hop questions.
MuSiQue is a multihop reading comprehension dataset with 2-4 hop questions, built by composing seed questions from 5 existing single-hop datasets. The dataset is constructed with a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step requires information from the other. This approach allows greater control over the properties of the resulting k-hop questions, allowing us to create a dataset that is substantially less cheatable (e.g., by shortcut-based or single-hop reasoning) and more challenging than prior similar datasets. MuSiQue comes in two variations: MuSiQue-Answerable, which contains only answerable questions, and MuSiQue-Full, which contains both answerable and unanswerable questions. In the latter, each answerable question from MuSiQue-Answerable is paired with a closely similar unanswerable question. In MuSiQue-Answerable, the task is to identify the answer and the supporting paragraphs, given a question and a context of up to 20 paragraphs. In MuSiQue-Full, the task is to first determine whether the question is answerable from the given context, and if it is, identify the answer and the supporting paragraphs.
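The bottom-up composition described above can be sketched in a few lines. This is a toy illustration of the idea, not MuSiQue's actual construction pipeline: two single-hop questions are "connected" when the first question's answer appears in the second question, so the second can be rewritten around the first to form a 2-hop question.

```python
# Toy sketch of composing two connected single-hop questions into a
# 2-hop question, in the spirit of MuSiQue's bottom-up construction.
# (Illustrative only; not the dataset's actual pipeline.)
hop1 = {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}
hop2 = {"question": "Where was William Shakespeare born?",
        "answer": "Stratford-upon-Avon"}

def compose(q1, q2, bridge_phrase):
    """Compose a 2-hop question by substituting a paraphrase of q1
    (bridge_phrase) for q1's answer inside q2's question.
    Requires the 'connected' condition: q1's answer appears in q2."""
    assert q1["answer"] in q2["question"], "questions are not composable"
    multihop_question = q2["question"].replace(q1["answer"], bridge_phrase)
    return {"question": multihop_question, "answer": q2["answer"]}

composed = compose(hop1, hop2, "the author of Hamlet")
print(composed["question"])  # Where was the author of Hamlet born?
```

Because composability is checked explicitly, a builder like this can filter for pairs where answering genuinely requires both hops, which is the control over shortcut reasoning the description above refers to.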
Recent Press
Researchers From Allen Institute for AI Introduce TeachMe: A Framework To Understand And Correct AI Models
January 17, 2023
Allen Institute for Artificial Intelligence Introduces MemPrompt: A New Method to “fix” GPT-3 After Deployment with User Interaction
December 18, 2022
Researchers at Allen Institute for AI Built a System Called DREAM-FLUTE to Explore Machine Learning ‘Mental Models’ for Figurative Language
December 1, 2022
Researchers at the Allen Institute for AI Propose Līla, a Unified Benchmark for Comprehensive Evaluation of the Mathematical Reasoning Abilities of Artificial Intelligence Systems
November 14, 2022
Perceptron: AI bias can arise from annotation instructions
August 27, 2022
Is AI2’s Macaw better than GPT-3?
January 28, 2022
AI2 shows off an open, Q&A-focused rival to GPT3
January 24, 2022
AI2 releases demo of question-answering model it claims outperforms GPT-3
January 21, 2022
Team
Peter Clark, Interim Chief Executive Officer
Bhavana Dalvi, Research
Matt Finlayson, Predoctoral Young Investigator
Yuling Gu, Predoctoral Young Investigator
Ashwin Kalyan, Research
Tushar Khot, Research
Kyle Richardson, Research
Ashish Sabharwal, Research
Oyvind Tafjord, Research
Niket Tandon, Research
Sarah Wiegreffe, Young Investigator