Aristo
Building the next generation of systems that can systematically reason, explain, and continually improve over time
- Systematic reasoning and explanation
- Teachable reasoning systems
- Continual learning with memory-based architectures
- Knowledge and belief
- Universal mathematical reasoning
Recent Updates
Towards Teachable Reasoning Systems
April 27, 2022
This paper describes our work towards Teachable Reasoning Systems. First, EntailmentWriter searches for a chain of reasoning from facts it believes…
Memory-assisted prompt editing to improve GPT-3 after deployment
April 20, 2022
Large LMs such as GPT-3 are powerful, but can make mistakes that are obvious to humans. Memory-assisted prompt editing allows users to give… (a minimal sketch of the idea follows these updates)
DREAM: Improving Situational QA by First Elaborating the Situation
March 1, 2022
When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests…
Explaining Answers with Entailment Trees
November 1, 2021
EntailmentBank is a unique dataset of multi-step entailment trees. Each tree shows how known facts combine to entail the answer to a question. From…
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief
November 1, 2021
Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions…
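The memory-assisted prompt-editing update above lends itself to a compact illustration. The sketch below is a simplified, hypothetical rendering of the idea, not the released MemPrompt code: it keeps a growing memory of (question, user feedback) pairs and, when a new question resembles a remembered one, prepends the stored feedback to the prompt before calling the model. The `query_lm` callable and the word-overlap retriever are stand-ins chosen for brevity.

```python
# Minimal sketch of memory-assisted prompt editing (hypothetical, not the
# released implementation). A memory of (question, feedback) pairs grows
# after deployment; retrieved feedback is prepended to future prompts.

def overlap(a: str, b: str) -> float:
    """Crude word-overlap similarity used as a stand-in retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

class PromptMemory:
    def __init__(self, threshold: float = 0.5):
        self.entries: list[tuple[str, str]] = []  # (question, feedback)
        self.threshold = threshold

    def store(self, question: str, feedback: str) -> None:
        """Record a user's correction so it can steer future prompts."""
        self.entries.append((question, feedback))

    def retrieve(self, question: str) -> str | None:
        """Return feedback from the most similar remembered question, if any."""
        best = max(self.entries, key=lambda e: overlap(e[0], question), default=None)
        if best and overlap(best[0], question) >= self.threshold:
            return best[1]
        return None

def answer(question: str, memory: PromptMemory, query_lm) -> str:
    """Edit the prompt with remembered feedback, then call the LM."""
    feedback = memory.retrieve(question)
    prompt = question if feedback is None else f"{feedback}\n{question}"
    return query_lm(prompt)  # query_lm: any text-in/text-out LM call
```

The key property is that the model itself never changes: all improvement after deployment lives in the memory, which is why no retraining is needed.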
Research Areas
Teachable Reasoning Systems
By interacting with and giving feedback on a system’s reasoning, a user can teach the system so it continually improves over time – without model retraining.
Neuro-Symbolic Reasoning and Explanation
Solving problems by generating consistent, faithful chains of reasoning using neural components.
Modular Models
By learning to chain together existing models, a system can solve complex problems beyond the capabilities of any individual component, as sketched below.
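As a rough illustration of the modular idea, the sketch below (written for exposition, not taken from an Aristo system) chains two text-to-text "modules" so that the output of a decomposition step feeds a solving step; `decompose` and `solve` are invented stand-ins for whatever learned components are being composed.

```python
# Hypothetical sketch of chaining existing models: each module is just a
# text-in/text-out callable, and a controller wires them into a pipeline.

from typing import Callable

Module = Callable[[str], str]

def chain(*modules: Module) -> Module:
    """Compose modules left to right: the output of one is the next's input."""
    def pipeline(text: str) -> str:
        for module in modules:
            text = module(text)
        return text
    return pipeline

# Stand-in modules; in practice these would be separate trained models.
decompose: Module = lambda q: f"sub-questions for: {q}"
solve: Module = lambda subs: f"answers to: {subs}"

answer_complex_question = chain(decompose, solve)
print(answer_complex_question("How do vaccines work?"))
```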
Universal Mathematical Reasoners
Creating models with built-in mathematical reasoning skills that can be rapidly fine-tuned for a wide variety of mathematical tasks.
Macaw is a high-performance question-answering (QA) model that outperforms other popular current language models while being an order of magnitude smaller. This demo lets you explore Macaw's answers and compare them to those of the popular GPT-3 language model on a benchmark set of questions.
Try the demo
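For readers who want to query Macaw locally rather than in the demo, the snippet below shows one plausible way to load it with Hugging Face Transformers. The `allenai/macaw-large` checkpoint name and Macaw's slot-style input format are assumptions based on the public release; verify both against the model card before relying on them.

```python
# Sketch of querying Macaw via Hugging Face Transformers. The checkpoint name
# and the "$answer$ ; $question$ = ..." slot format are assumptions taken from
# the public release; check the model card before relying on them.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allenai/macaw-large")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/macaw-large")

prompt = "$answer$ ; $question$ = What gas do plants absorb from the air?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```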

Like RuleTaker, ProofWriter determines whether statements are True or False based on rules given in natural language, but it also generates proofs for its answers.
Try the demo
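To make the rule-based setting concrete, here is a toy forward-chaining loop in the spirit of what ProofWriter models (only in spirit: ProofWriter itself is a neural generator, not this symbolic code). The facts and the (premises, conclusion) rule encoding are invented for this sketch.

```python
# Toy symbolic forward chaining, illustrating the kind of reasoning
# ProofWriter emulates neurally. Rules are (premises, conclusion) pairs;
# we derive new facts until a fixpoint, recording a proof for each.

facts = {"Erin is big.", "Erin is round."}
rules = [
    ({"Erin is big.", "Erin is round."}, "Erin is heavy."),
    ({"Erin is heavy."}, "Erin is slow."),
]

proofs = {f: "given" for f in facts}
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            proofs[conclusion] = f"{' & '.join(sorted(premises))} -> {conclusion}"
            changed = True

print(proofs["Erin is slow."])   # Erin is heavy. -> Erin is slow.
print("Erin is slow." in facts)  # True
```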
Recent Papers
Dyna-bAbI: unlocking bAbI's potential with dynamic synthetic benchmarking
Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson H S Liu, Reut Tsarfaty, Dafna Shahaf • *SEM • 2022
While neural language models often perform surprisingly well on natural language understanding (NLU) tasks, their strengths and limitations remain poorly understood. Controlled synthetic tasks are thus an increasingly important resource for diagnosing model…
Learning to Repair: Repairing model output errors after deployment using a dynamic memory of feedback
Niket Tandon, Aman Madaan, Peter Clark, Yiming Yang • Findings of EMNLP • 2022
Large language models (LMs), while powerful, are not immune to mistakes, but can be difficult to retrain. Our goal is for an LM to continue to improve after deployment, without retraining, using feedback from the user. Our approach pairs an LM with (i) a…
DREAM: Improving Situational QA by First Elaborating the Situation
Yuling Gu, Bhavana Dalvi Mishra, Peter Clark • NAACL • 2022
When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests that they form a mental picture of that situation before answering. While we do not know how language models…
Log-Precision Transformers are Uniform Threshold Circuits
William Merrill, Ashish Sabharwal • arXiv • 2022
We prove that transformer neural networks with logarithmic precision in the input length (and where the feedforward subnetworks are computable using linear space in their input length) can be simulated by constant-depth uniform threshold circuits. Thus, such… (the headline containment is restated formally below)
DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models
Gregor Betz, Kyle Richardson • *SEM • 2022
In this paper, we present and implement a multi-dimensional, modular framework for performing deep argument analysis (DeepA2) using current pre-trained language models (PTLMs). ArgumentAnalyst – a T5 model (Raffel et al. 2020) set up and trained within DeepA2…
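For readers who want the Merrill and Sabharwal result in symbols, the containment below restates the abstract's claim. The notation (log-precision transformers as a language class, uniform TC⁰) follows standard circuit-complexity conventions rather than the paper's exact definitions.

```latex
% Restatement of the headline claim, in standard complexity notation.
% LogPrecisionTransformer denotes the class of languages recognized by
% transformers whose arithmetic uses O(log n) bits on inputs of length n.
\[
  \mathsf{LogPrecisionTransformer} \subseteq \mathsf{uniform}\text{-}\mathsf{TC}^0
\]
% Consequence: such transformers cannot decide any problem outside TC^0,
% e.g., an NC^1-complete problem, unless TC^0 = NC^1.
```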
Recent Datasets
Multihop Questions via Single-hop Question Composition
Multihop reading comprehension dataset with 2-4 hop questions.
MuSiQue is a multihop reading comprehension dataset with 2-4 hop questions, built by composing seed questions from 5 existing single-hop datasets. The dataset is constructed with a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step requires information from the other. This approach allows greater control over the properties of the resulting k-hop questions, yielding a dataset that is substantially less cheatable (e.g., by shortcut-based or single-hop reasoning) and more challenging than prior similar datasets. MuSiQue comes in two variations: MuSiQue-Answerable, which contains only answerable questions, and MuSiQue-Full, which contains both answerable and unanswerable questions. In the latter, each answerable question from MuSiQue-Answerable is paired with a closely similar unanswerable question. In MuSiQue-Answerable, the task is to identify the answer and the supporting paragraphs, given a question and a context of up to 20 paragraphs. In MuSiQue-Full, the task is to first determine whether the question is answerable from the given context, and if it is, to identify the answer and the supporting paragraphs.
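The bottom-up composition step is easy to picture in code. The sketch below is a simplified, hypothetical rendering of the idea (not the MuSiQue construction pipeline): two single-hop questions are composable when the answer to the first is mentioned in the second, and the composed 2-hop question replaces that mention with a reference to the first question.

```python
# Simplified illustration of composing two single-hop questions into a 2-hop
# question, in the spirit of MuSiQue's bottom-up construction (not its actual
# pipeline). Composability: the first answer must appear in the second question.

from dataclasses import dataclass

@dataclass
class SingleHopQA:
    question: str
    answer: str

def compose(q1: SingleHopQA, q2: SingleHopQA) -> SingleHopQA | None:
    """Build a 2-hop question if q2 depends on q1's answer; else None."""
    if q1.answer not in q2.question:
        return None  # not connected, so not composable
    bridged = q2.question.replace(q1.answer, f"the answer to '{q1.question}'")
    return SingleHopQA(question=bridged, answer=q2.answer)

q1 = SingleHopQA("Who directed Jaws?", "Steven Spielberg")
q2 = SingleHopQA("Where was Steven Spielberg born?", "Cincinnati")
print(compose(q1, q2))
# question: "Where was the answer to 'Who directed Jaws?' born?"
```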
The Fermi Challenge
A challenge dataset of Fermi (estimation) problems, currently beyond the capabilities of modern methods.
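As an illustration of what a Fermi problem involves, here is the classic "piano tuners in Chicago" estimate worked as explicit arithmetic. Every quantity below is a textbook-style assumption, not a value from the dataset.

```python
# Classic Fermi estimate: roughly how many piano tuners work in Chicago?
# Every number below is an order-of-magnitude assumption, as is typical
# for Fermi problems; none of these values come from the Fermi dataset.

population = 2_500_000          # people in Chicago (assumed)
people_per_household = 2.5      # assumed average household size
households_with_piano = 1 / 20  # assumed fraction owning a piano
tunings_per_piano_per_year = 1  # assumed annual tuning rate

tunings_needed = (population / people_per_household) \
    * households_with_piano * tunings_per_piano_per_year

tunings_per_tuner_per_year = 4 * 5 * 50  # 4/day, 5 days/week, 50 weeks/year

tuners = tunings_needed / tunings_per_tuner_per_year
print(round(tuners))  # ~50: getting the order of magnitude right is the goal
```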
BeliefBank
A dataset of 4998 simple facts and 12147 constraints for testing, and improving, a model's accuracy and consistency.
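To show what "testing consistency" means here, the sketch below checks a set of model beliefs against implication constraints of the kind BeliefBank contains. The fact strings and the constraint encoding are invented stand-ins, not the dataset's actual schema.

```python
# Sketch of consistency checking against implication constraints, in the
# spirit of BeliefBank. The facts and the (premise -> conclusion) encoding
# are invented for illustration, not the dataset's actual schema.

beliefs = {                      # model's yes/no answers to fact questions
    "a swallow is a bird": True,
    "a swallow has feathers": False,   # inconsistent with the rule below
    "a swallow is a mammal": False,
}

constraints = [
    # (premise, conclusion): if the model believes the premise,
    # it should also believe the conclusion.
    ("a swallow is a bird", "a swallow has feathers"),
]

violations = [
    (p, c) for p, c in constraints
    if beliefs.get(p) is True and beliefs.get(c) is False
]
consistency = 1 - len(violations) / len(constraints)
print(violations)   # [('a swallow is a bird', 'a swallow has feathers')]
print(consistency)  # 0.0 on this toy example
```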
EntailmentBank
2k multi-step entailment trees, explaining the answers to ARC science questions
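Since each EntailmentBank explanation is a tree of entailment steps, a minimal data structure makes the format concrete. The field names below are illustrative guesses, not the dataset's released JSON schema.

```python
# Minimal data structure for a multi-step entailment tree, in the spirit of
# EntailmentBank. Field names are illustrative, not the released schema.

from dataclasses import dataclass, field

@dataclass
class EntailmentStep:
    premises: list[str]          # facts (or prior conclusions) combined
    conclusion: str              # what the premises jointly entail

@dataclass
class EntailmentTree:
    hypothesis: str              # the answer statement being explained
    steps: list[EntailmentStep] = field(default_factory=list)

tree = EntailmentTree(
    hypothesis="an eclipse happens when the moon blocks the sun",
    steps=[
        EntailmentStep(
            premises=["the moon orbits the earth",
                      "an object between the sun and earth can block sunlight"],
            conclusion="the moon can pass between the sun and the earth",
        ),
        EntailmentStep(
            premises=["the moon can pass between the sun and the earth",
                      "blocking sunlight causes an eclipse"],
            conclusion="an eclipse happens when the moon blocks the sun",
        ),
    ],
)
print(len(tree.steps))  # a 2-step explanation ending at the hypothesis
```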
Recent Press
Perceptron: AI bias can arise from annotation instructions
May 8, 2022
Is AI2's Macaw better than GPT-3?
January 28, 2022
AI2 shows off an open, Q&A-focused rival to GPT3
January 24, 2022
AI models are becoming better at answering questions, but they're not perfect
January 21, 2022
AI2 releases demo of question-answering model it claims outperforms GPT-3
January 21, 2022
Multimodal models are fast becoming a reality — consequences be damned
December 21, 2021
Allen Institute launches GENIE, a leaderboard for human-in-the-loop language model benchmarking
January 20, 2021
Paul Allen's 'Digital Aristotle' sets eyes on accomplishing practical tasks
February 5, 2020