ARISTO

Build machines that read, learn and reason.

The Aristo Project aims to build systems that demonstrate a deep understanding of the world, integrating technologies for reading, learning, reasoning, and explanation.

Our research integrates multiple AI technologies, including:
  • Natural language processing
  • Information extraction
  • Knowledge representation
  • Machine reasoning
  • Commonsense knowledge

Research Areas

Probing Reasoning with Language Models

Language models (LMs) have dominated much of AI recently. But what kind(s) of reasoning are they capable of? And how can they be taught to do more? We are developing analytical datasets to probe LMs and help answer these questions.

Multihop Reasoning

Many questions require multiple pieces of information to be combined to arrive at an answer. We are developing new multihop models capable of identifying and combining relevant facts to answer such questions.

Explanation

An intelligent system should not only answer questions correctly, but also be able to explain why its answers are correct. Such a capability is essential for practical acceptance of AI technology. It is also essential for the broader goals of communicating knowledge to a user, and receiving correction from the user when the system's answer is wrong.

Reasoning about Actions

A key aspect of intelligence is being able to reason about the dynamics of the world. This requires modeling what state the world might be in, and how different actions might affect that state. Such capabilities are essential for understanding what happens during a procedure or process, for planning, and for reasoning about "what if..." scenarios.
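
As a concrete illustration of the state tracking this requires, here is a minimal, hypothetical sketch (not an Aristo system): a small world state is a dictionary, each action is a function with preconditions that maps a state to a new state, and a "what if" question is answered by simulating an alternative action sequence. All entities, actions, and state variables are invented for illustration.

```python
import copy

# World state: a flat dictionary of properties that actions can change.
initial_state = {"door": "closed", "light": "off", "robot_location": "hallway"}

# Each action maps a state to a new state (the input state is never mutated).
def open_door(state):
    new = copy.deepcopy(state)
    new["door"] = "open"
    return new

def walk_into_room(state):
    new = copy.deepcopy(state)
    # Precondition: the robot can only enter if the door is open.
    if new["door"] == "open":
        new["robot_location"] = "room"
    return new

def flip_switch(state):
    new = copy.deepcopy(state)
    # Precondition: the switch only works if the robot is in the room.
    if new["robot_location"] == "room":
        new["light"] = "on" if new["light"] == "off" else "off"
    return new

def simulate(state, actions):
    """Apply a sequence of actions and return the resulting world state."""
    for action in actions:
        state = action(state)
    return state

# What happens during the procedure?
plan = [open_door, walk_into_room, flip_switch]
print(simulate(initial_state, plan))            # the light ends up "on"

# "What if..." reasoning: simulate an alternative where the door is never opened.
counterfactual = [walk_into_room, flip_switch]
print(simulate(initial_state, counterfactual))  # the robot never enters; the light stays "off"
```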

  • ModularQA
    ModularQA answers questions by breaking them down into a series of smaller, more specific ones. This produces answers in a human-like way that is more explainable than black-box systems.

    ModularQA is a neuro-symbolic question-answering system that answers complex questions by posing a series of sub-questions to existing simpler QA systems or symbolic modules. It explains each of its reasoning steps in language, as a simple question paired with the answer produced by a simpler model or a math calculator. In today's world of black-box models, this is an important step toward explainable AI. (A toy sketch of this decompose-and-answer pattern follows this list.)

    Try the demo
  • UnQover
    Uncovering stereotypical biases via underspecified questions

    This work focuses on identifying biases in question answering (QA) models. If these models are blindly deployed in real-life settings, the biases within them could cause real harm, which raises the question: how extensive are social stereotypes in question-answering models?

    Try the demo
    • Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?

      Jieyu Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Kai-Wei Chang. ACL-IJCNLP 2021
      Is it possible to use natural language to intervene in a model's behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model's unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system's social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today's powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.
    • Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

      Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. TACL 2021
      A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce STRATEGYQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, STRATEGYQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in STRATEGYQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼66%.
    • ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

      Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. Findings of ACL 2021
      Transformers have been shown to emulate logical deduction over natural language theories (logical rules expressed in natural language), reliably assigning true/false labels to candidate implications. However, their ability to generate implications of a theory has not yet been demonstrated, and methods for reconstructing proofs of answers are imperfect. In this work we show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them. In particular, iterating a 1-step implication generator results in proofs that are highly reliable, and represent actual model decisions (rather than post-hoc rationalizations). On the RuleTaker dataset, the accuracy of ProofWriter's proofs exceeds previous methods by +9% absolute, and in a way that generalizes to proof depths unseen in training and on out-of-domain problems. We also show that generative techniques can perform a type of abduction with high precision: given a theory and an unprovable conclusion, identify a missing fact that allows the conclusion to be proved, along with a proof. These results significantly improve the viability of neural methods for systematically reasoning over natural language.
    • ParsiNLU: A Suite of Language Understanding Challenges for Persian

      Daniel Khashabi, Arman Cohan, Siamak Shakeri, et al. TACL 2021
      Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on Persian, one of the most widely spoken languages in the world, for which few NLU datasets are available. The availability of high-quality evaluation datasets is a necessity for reliable assessment of progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark for Persian that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotation by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. In addition, we present the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.
    • Critical Thinking for Language Models

      Gregor Betz, Christian Voigt, and Kyle Richardson. IWCS 2021
      This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models. We introduce a synthetic text corpus of deductively valid arguments, and use this artificial argument corpus to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on a few simple core schemes allows it to accurately complete conclusions of different, and more complex types of arguments, too. The language models seem to connect and generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for the GLUE and SNLI benchmarks. The findings suggest that there might exist a representative sample of paradigmatic instances of good reasoning that will suffice to acquire general reasoning skills and that might form the core of a critical thinking curriculum for language models.
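
To make the decompose-and-answer pattern behind ModularQA concrete, here is a minimal, hypothetical Python sketch. It is not the actual ModularQA implementation: the fact table, the hard-coded decomposition, and both modules (a toy lookup-based QA model and a toy calculator) are invented for illustration. It shows how a final answer can be assembled from a chain of sub-question/answer steps, and how that chain doubles as a human-readable explanation; in the real system a trained model would choose each next sub-question rather than following a fixed script.

```python
# Toy sketch of modular, decompose-and-answer question answering.
# Everything here (facts, questions, module names) is hypothetical.

def simple_qa(question: str) -> str:
    """Toy single-fact QA module: answers by exact lookup in a tiny fact table."""
    facts = {
        "How many legs does a spider have?": "8",
        "How many legs does an ant have?": "6",
    }
    return facts[question]

def calculator(expression: str) -> str:
    """Toy symbolic module: evaluates a simple 'a - b' or 'a + b' expression."""
    left, op, right = expression.split()
    a, b = int(left), int(right)
    return str(a - b if op == "-" else a + b)

def modular_answer(question: str):
    """Answer a complex question via a chain of sub-questions.

    The decomposition is hard-coded for this toy example; a real system would
    generate the next sub-question from the original question and prior answers.
    """
    steps = []  # each entry is (sub-question, answer): the step-by-step explanation

    q1 = "How many legs does a spider have?"
    a1 = simple_qa(q1)
    steps.append((q1, a1))

    q2 = "How many legs does an ant have?"
    a2 = simple_qa(q2)
    steps.append((q2, a2))

    q3 = f"What is {a1} - {a2}?"
    a3 = calculator(f"{a1} - {a2}")
    steps.append((q3, a3))

    return a3, steps

if __name__ == "__main__":
    answer, steps = modular_answer(
        "How many more legs does a spider have than an ant?"
    )
    for i, (q, a) in enumerate(steps, 1):
        print(f"Step {i}: {q} -> {a}")
    print(f"Final answer: {answer}")
```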

    QuaRTz Dataset

    3864 questions about open domain qualitative relationships

    QuaRTz is a crowdsourced dataset of 3864 multiple-choice questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).

    QuaRel Dataset

    2771 story questions about qualitative relationships

    QuaRel is a crowdsourced dataset of 2771 multiple-choice story questions, including their logical forms.

    BeliefBank

    4998 facts and 12147 constraints to test a model's consistency

    A dataset of 4998 simple facts and 12147 constraints for testing, and improving, a model's accuracy and consistency.

    EntailmentBank

    2k multi-step entailment trees, explaining the answers to ARC science questions

    A dataset of roughly 2k multi-step entailment trees, each explaining the answer to an ARC science question.
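
The sketch below illustrates what a multi-step entailment tree is as a data structure: leaf sentences (supporting facts) combine into intermediate conclusions, which in turn entail the hypothesis that answers the question. The nested-dictionary format and the example sentences are assumptions made for illustration, not the released EntailmentBank format.

```python
# Illustrative (hypothetical) representation of a multi-step entailment tree.
tree = {
    "conclusion": "The northern hemisphere has winter when it is tilted away from the sun.",
    "premises": [
        {
            "conclusion": "A hemisphere tilted away from the sun receives less direct sunlight.",
            "premises": [
                {"conclusion": "The Earth's axis is tilted relative to its orbit.", "premises": []},
                {"conclusion": "Sunlight hits a tilted-away hemisphere at a lower angle.", "premises": []},
            ],
        },
        {"conclusion": "Less direct sunlight causes colder temperatures.", "premises": []},
    ],
}

def print_tree(node, depth=0):
    """Print the tree, labeling leaves as facts and internal nodes as derived conclusions."""
    label = "fact" if not node["premises"] else "derived"
    print("  " * depth + f"[{label}] {node['conclusion']}")
    for child in node["premises"]:
        print_tree(child, depth + 1)

print_tree(tree)
```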

    “Knowing is not enough, we must apply. Willing is not enough, we must do.”
    Johann Wolfgang von Goethe

    Paul Allen's 'Digital Aristotle' sets eyes on accomplishing practical tasks

    KOMO News
    February 5, 2020
    Read the Article

    Allen Institute launches GENIE, a leaderboard for human-in-the-loop language model benchmarking

    VentureBeat
    January 20, 2021
    Read the Article

    Artificial Intelligence System Passes Eighth-Grade Science Test with Distinction (translated from Hebrew)

    Haaretz
    September 6, 2019
    Read the Article

    Allen Institute's Aristo AI makes breakthrough, passes eighth-grade science test

    TechSpot
    September 5, 2019
    Read the Article

    How to tutor AI from an ‘F’ to an ‘A’

    Vulcan Inc
    September 4, 2019
    Read the Article

    A Breakthrough for A.I. Technology: Passing an 8th-Grade Science Test

    The New York Times
    September 4, 2019
    Read the Article

    Allen Institute’s Aristo AI system finally passes an eighth-grade science test

    GeekWire
    September 4, 2019
    Read the Article

    AI assistants say dumb things, and we’re about to find out why

    MIT Tech Review
    March 14, 2018
    Read the Article

    Team

    • Peter Clark (Research)
    • Bhavana Dalvi (Research)
    • Michal Guerquin (Engineering)
    • Ashwin Kalyan (Research)
    • Daniel Khashabi (Young Investigator)
    • Tushar Khot (Research)
    • Kyle Richardson (Research)
    • Ashish Sabharwal (Research)
    • Carissa Schoenick (Product)
    • Oyvind Tafjord (Research)
    • Niket Tandon (Research)