• 294,000 science-relevant tuples January 2017

    The Aristo Tuple KB contains 294,000 domain-targeted (subject, relation, object) tuples extracted from text using a high-precision extraction pipeline guided by domain vocabulary constraints.

  • 1,197,377 science-relevant sentences December 2016

    The Aristo Mini corpus contains 1,197,377 (very loosely) science-relevant sentences drawn from public data. It provides simple science-relevant text that may be useful to help answer elementary science questions. It is used in the Aristo Mini system and is also available here as a resource in its own right.

  • 9,659 real science exam question sets derived from a variety of regional and state science exams and item banks.

    These science exam question sets guide our research into multiple choice question answering at the elementary and middle school levels. These datasets contain multiple choice questions both with and without diagrams.

  • University of Arizona, Stony Brook University & AI2

    This is the dataset for the paper "What's in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams" (COLING'16). The data contains gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines.

  • 4,817 images

    AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering.

  • 1,080 questions

    These questions are created from the "AI2 Elementary School Science Questions (No Diagrams)" data set by replacing all of the incorrect answer options of each question with other related words. This dataset can be a good measure of the robustness of QA systems when they are tested on modified questions. More details can be found in this publication.

  • 9,850 videos
    These videos of daily indoor activities were collected through Amazon Mechanical Turk.

    This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities.

  • AI2 TabMCQ: Multiple Choice Questions aligned with the Aristo Tablestore
    AI2 & Carnegie Mellon University (Sujay Kumar Jauhar)

    This package contains a copy of the Aristo Tablestore (Nov. 2015 Snapshot), plus a large set of crowd-sourced multiple-choice questions covering the facts in the tables. Through the setup of the crowd-sourced annotation task, the package also contains implicit alignment information between questions and tables. For further information, see "TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions" (PDF included in this package).

  • AI2 Tablestore (November 2015 Snapshot)
    Created by AI2

    This package contains a collection of curated facts in the form of tables used by the Aristo Question-Answering System, collected using a mixture of manual and semi-automated techniques.

  • AI2 Arithmetic Questions
    AI2 & University of Washington (Hannaneh Hajishirzi) collaboration on arithmetic exam Question Answering.

    These questions guide our research into Question Answering for arithmetic exams. Focus is on high school level questions. Example: "Sandy has 10 books, Benny has 24 books, and Tim has 33 books. How many books do they have together?"

  • AI2 Geometry Questions
    AI2 & University of Washington collaboration on geometry exam Question Answering.

    These questions guide our research into Question Answering for geometry exams. Focus is on the high school level. Example (note: diagrams included in data file): "In circle O, diameter AB is perpendicular to chord CD at E. If CD = 8 and BE = 2, find AE."

  • AI2 Biology How/Why Corpus
    AI2 & University of Arizona (Mihai Surdeanu)

    This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in "Discourse Complements Lexical Semantics for Non-factoid Answer Reranking" (ACL 2014).

  • AI2 Conversational Dialog Traces

    Produced at AI2 as part of intern Ben Hixon's project on conversational dialog. The data contains questions, dialog traces, and extractions from KnowBot, an experimental dialog system that learns about its domain from conversational dialogs with the user.

  • AI2 Co-reference analysis
    This analysis was performed by AI2 based on the New York State Education Department, Grade 4 Elementary-Level Science Test (accessed July 2014).

    An understanding of co-reference (i.e. multiple references to the same thing) is necessary in order to understand the meaning of a text. This dataset is an analysis of co-reference types occurring in 4th-grade biology textbooks.

  • AI2 Meaningful Citations Data Set
    This data set was produced at AI2 as part of intern Marco Valenzuela's work for his paper, "Identifying Meaningful Citations".

    This dataset comprises annotations for 465 computer science papers. The annotations indicate whether or not a citation is important (i.e., refers to ongoing or continued work on the relevant topic) and assign the citation one of four importance rankings.

  • AI2 Odd-Man-Out Problem Set
    This data set was produced at AI2; categories are taken from the card game Anomia, which was used to drive the puzzle generation process.

    This collection consists of four sets of "odd-man-out" puzzles. There are two collections of "common noun" puzzles, where the answer options are largely common nouns, and two collections of "proper noun" puzzles, where the answer options are largely proper nouns. Each collection contains approximately 100 puzzles.

  • Open AI Resources

    Open AI Resources is a directory of open source software and data for the AI research community. The site was initially developed by the Allen Institute for Artificial Intelligence and InferLink Corporation, and is currently managed by the AI Access Foundation.

  • AI2 Paraphrase examples
    Subset of AI2 intern Ellie Pavlick’s analysis of PPDB paraphrases relevant to 4th-grade biology exams.

    Vocabulary used in questions may differ from that of sources contributing to our Question Answering knowledge base. Relevant paraphrases like these help the QA system understand connections between question vocabulary and knowledge base vocabulary. This dataset is an example of analysis done by AI2 intern Ellie Pavlick. Example: "get a better look at/view in more detail"

  • AI2 ProcessBank data
    AI2 & Stanford University (Jonathan Berant).

    This dataset was used to train a system to automatically extract process models from paragraphs that describe processes. The dataset consists of 200 paragraphs that describe biological processes. Each paragraph is annotated with its process structure, and accompanied by a few multiple-choice questions about the process. Each question has two possible answers of which exactly one is correct. The dataset contains three files:
    1. bioprocess-bank-questions.tar.gz: There is an XML file for each paragraph containing the paragraph ID, the questions, and the answers.
    2. process-bank-structures-train.tar.gz: These are the structure annotations used for training our structure predictor. Each paragraph has two files: one containing the text and one containing the annotation. This is standard BRAT format.
    3. process-bank-structures-test.tar.gz: These are structure annotations used for testing. They are also in BRAT format.
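
    For readers unfamiliar with BRAT standoff annotation, the following is a minimal Python sketch of reading the two most common line types: text-bound annotations (IDs starting with "T") and binary relations (IDs starting with "R"). The sample annotation text and the labels "Event", "Entity", and "Result" are made up for illustration and are not taken from the ProcessBank files; the function and class names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBound:
    """A text-bound line: ID <TAB> Label start end <TAB> covered text."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    """A relation line: ID <TAB> Label Arg1:T1 Arg2:T2."""
    ann_id: str
    label: str
    args: dict


def parse_brat(ann_text):
    """Parse T and R lines of a BRAT .ann file; other line types are skipped."""
    spans, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            label, offsets = fields[1].split(" ", 1)
            # Discontinuous spans use ';' between fragments; keep the outer bounds.
            numbers = offsets.replace(";", " ").split()
            text = fields[2] if len(fields) > 2 else ""
            spans[ann_id] = TextBound(ann_id, label, int(numbers[0]), int(numbers[-1]), text)
        elif ann_id.startswith("R"):
            parts = fields[1].split()
            args = dict(part.split(":", 1) for part in parts[1:])
            relations.append(Relation(ann_id, parts[0], args))
    return spans, relations


# Hypothetical annotation for the text "The cell produces energy".
SAMPLE = (
    "T1\tEvent 9 17\tproduces\n"
    "T2\tEntity 18 24\tenergy\n"
    "R1\tResult Arg1:T1 Arg2:T2\n"
)

spans, relations = parse_brat(SAMPLE)
for rel in relations:
    print(rel.label, spans[rel.args["Arg1"]].text, "->", spans[rel.args["Arg2"]].text)
```

    Event lines (IDs starting with "E"), attributes, and notes are ignored in this sketch; a reader working with the actual archives would need to check which line types and labels the annotation files use.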