Aristo

Building the next generation of systems that can systematically reason, explain, and continually improve over time


Diagram showing entailment tree from hypothesis and text
Our research includes pioneering work on:
  • Systematic reasoning and explanation
  • Teachable reasoning systems
  • Continual learning with memory-based architectures
  • Knowledge and belief
  • Universal mathematical reasoning

Research Areas

Teachable Reasoning Systems

By interacting with and giving feedback on a system’s reasoning, a user can teach the system so it continually improves over time – without model retraining.

Modular Models

By learning to chain together existing models, a system can solve complex problems that are beyond the capabilities of its individual components.

Universal Mathematical Reasoners

Creating models with built-in mathematical reasoning skills that can be rapidly fine-tuned for a wide variety of mathematical tasks.

  • Macaw
    A QA model that outperforms other popular language models while being an order of magnitude smaller | Aristo, Research Visualization

    Macaw is a high-performance question-answering (QA) model capable of outperforming other popular current language models, all while being an order of magnitude smaller. This demo allows you to explore Macaw's answers and compare them to those of the popular GPT-3 language model on a benchmark set of questions.

    Try the demo
  • ProofWriter
    Generating Implications, Proofs, and Abductive Statements over Natural Language | Aristo

    Like RuleTaker, ProofWriter determines whether statements are True or False based on rules given in natural language, but it also generates proofs of its answers.

    Try the demo
    • Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

      Shashank Gupta, Vaishnavi Shrivastava, A. Deshpande, A. Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot
      ICLR 2024
      Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior…
    • Calibrating Large Language Models with Sample Consistency

      Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
      arXiv 2024
      Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and…
    • TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

      Yikai Zhang, Siyu Yuan, Caiyu Hu, Kyle Richardson, Yanghua Xiao, Jiangjie Chen
      arXiv 2024
      Despite remarkable advancements in emulating human-like behavior through Large Language Models (LLMs), current textual simulations do not adequately address the notion of time. To this end, we introduce TimeArena, a novel textual simulated environment that…
    • OLMo: Accelerating the Science of Language Models

      Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, A. Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Daniel Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi
      arXiv 2024
      Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of…
    • The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

      Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe
      arXiv.org 2024
      How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have…

    IfQA Counterfactual Reasoning Benchmark

    3,800 open-domain questions designed to assess counterfactual reasoning abilities of NLP models

    Counterfactual reasoning benchmark introduced in the EMNLP-2023 paper titled "IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions".

    Digital Socrates

    DS Critique Bank contains annotated critiques of answers and explanations from "student" models.

    DS Critique Bank (DSCB) is a dataset of multiple-choice questions with associated answers and explanations provided by "student models", along with "critiques" of the explanations provided by "critique models". Many of the instances have human annotations.

    ParRoT (Parts and Relations of Things)

    11,720 “X relation Y?” True/False questions on parts of everyday things and relational information about these parts

    This is the dataset in "Do language models have coherent mental models of everyday things?", ACL 2023.

    Belief and Reasoning Dataset

    BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability

    BaRDa is a new belief and reasoning dataset for evaluating the factual correctness ("truth") and reasoning accuracy ("rationality", or "honesty") of new language models. It was created in collaboration with, and with the support of, the Open Philanthropy organization.

    “Knowing is not enough, we must apply. Willing is not enough, we must do.”
    Johann Wolfgang von Goethe

    Persona-driven ChatGPT yields toxic, racist output

    TechXplore
    April 19, 2023
    Read the Article

    Changing ChatGPTs Persona Might Make It Malicious

    Digital Information World
    April 17, 2023
    Read the Article

    This AI Paper Shows How ChatGPT’s Toxicity Can Increase Up To Six-Fold When Assigned A Persona

    Marktechpost
    April 14, 2023
    Read the Article

    'They’re All So Dirty and Smelly:' Study Unlocks ChatGPT's Inner Racist

    Gizmodo
    April 13, 2023
    Read the Article

    New study reveals ChatGPT's inherent toxicity when assigned different personas

    Mashable Middle East
    April 13, 2023
    Read the Article

    ChatGPT can turn toxic just by changing its assigned persona, researchers say

    VentureBeat
    April 12, 2023
    Read the Article

    Researchers discover a way to make ChatGPT consistently toxic

    TechCrunch
    April 12, 2023
    Read the Article

    Researchers From Allen Institute for AI Introduce TeachMe: A Framework To Understand And Correct AI Models

    Marktechpost
    January 17, 2023
    Read the Article

    Team

    • Chris Callison-Burch, Research
    • Peter Clark, Research
    • Ben Bogin, Young Investigator
    • Bhavana Dalvi, Research
    • Yuling Gu, Predoctoral Young Investigator
    • Shashank Gupta, Research
    • Ashwin Kalyan, Research
    • Tushar Khot, Research
    • Bodhisattwa Prasad Majumder, Research
    • Kyle Richardson, Research
    • Ashish Sabharwal, Research
    • Oyvind Tafjord, Research
    • Niket Tandon, Research
    • Sarah Wiegreffe, Young Investigator