Datasets

  • 13,679 science questions with supporting sentences

    The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.
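
    As a rough sketch of working with the data, each record can be read and displayed as below. The JSON field names ("question", "distractor1" through "distractor3", "correct_answer", "support") and the filename are assumptions based on the distributed SciQ JSON; verify against your copy.

      import json
      import random

      # Load SciQ records from one of the distributed splits (filename assumed).
      with open("train.json", encoding="utf-8") as f:
          records = json.load(f)

      for rec in records[:3]:
          options = [rec["correct_answer"], rec["distractor1"],
                     rec["distractor2"], rec["distractor3"]]
          random.shuffle(options)                  # shuffle the 4 answer options
          print(rec["question"])
          for i, opt in enumerate(options):
              print(f"  ({chr(ord('A') + i)}) {opt}")
          if rec.get("support"):                   # supporting paragraph, when present
              print("  evidence:", rec["support"][:80], "...")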

  • 5,059 real science exam questions derived from a variety of regional and state science exams

    The AI2 Science Questions dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is in 4-way multiple-choice format and may or may not include a diagram element.

  • 156K sentences for 4th grade questions, 107K sentences for 8th grade questions, and derived tuples

    The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. This dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by the Open IE v4 tuples using their simple format.
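
    A minimal sketch for iterating over these files, assuming each sentence line is followed by its tuple lines and that blocks are separated by blank lines (the filename and exact layout are assumptions; check the release before relying on them):

      # Yield (sentence, tuple_lines) pairs from a sentences-plus-tuples file.
      def read_blocks(path):
          with open(path, encoding="utf-8") as f:
              block = []
              for line in f:
                  line = line.rstrip("\n")
                  if not line:                     # a blank line closes a block
                      if block:
                          yield block[0], block[1:]
                      block = []
                  else:
                      block.append(line)
              if block:                            # flush the final block
                  yield block[0], block[1:]

      for sentence, tuples in read_blocks("4th-grade-sentences.txt"):  # hypothetical name
          print(sentence, len(tuples))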

  • 1,076 textbook lessons, 26,260 questions, 6,229 images

    The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula as described in Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks downloaded from ck12.org. Each lesson has a set of multiple choice questions that address concepts taught in that lesson. TQA has a total of 26,260 questions including 12,567 that have an accompanying diagram.

  • 9,356 science terms and sentences

    This is the dataset for the paper Leveraging Term Banks for Answering Complex Questions: A Case for Sparse Vectors. The dataset contains 9,356 science terms and, for each term, an average of 16,000 sentences that contain the term.

  • This is the dataset for the paper Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding. It includes the query log used in the paper, relevance judgements for the queries, ranking lists from Semantic Scholar, candidate documents, entity embeddings trained using the knowledge graph, and baselines, development methods, and alternative methods from the experiments.

  • 294,000 science-relevant tuples

    The Aristo Tuple KB contains 294,000 high-precision, domain-targeted (subject, relation, object) tuples extracted from text by an extraction pipeline guided by domain vocabulary constraints.
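
    A small loading sketch, assuming a tab-separated file whose first three columns are subject, relation, and object (the filename and column layout are assumptions; consult the release's README):

      import csv
      from collections import defaultdict

      # Index (relation, object) pairs by subject for quick lookup.
      by_subject = defaultdict(list)
      with open("aristo-tuple-kb.tsv", encoding="utf-8") as f:  # hypothetical name
          for row in csv.reader(f, delimiter="\t"):
              if len(row) >= 3:
                  by_subject[row[0]].append((row[1], row[2]))

      print(by_subject.get("butterfly", [])[:5])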

  • 1,197,377 science-relevant sentences

    The Aristo Mini corpus contains 1,197,377 (very loosely) science-relevant sentences drawn from public data. It provides simple science-relevant text that may be useful to help answer elementary science questions. It is used in the Aristo Mini system and is also available here as a resource in its own right.
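
    As a toy illustration of how such a corpus can support elementary QA, the sketch below scores each answer option by its word overlap with corpus sentences. This is a simplification for illustration, not the actual solver used in the Aristo Mini system:

      import re

      def words(text):
          return set(re.findall(r"[a-z]+", text.lower()))

      # Score an option by its best word overlap with any corpus sentence.
      def overlap_score(question, option, sentences):
          target = words(question + " " + option)
          return max(len(target & words(s)) for s in sentences)

      corpus = ["A thermometer measures temperature.",
                "Plants use sunlight to make food."]
      q = "Which instrument measures temperature?"
      best = max(["barometer", "thermometer"],
                 key=lambda opt: overlap_score(q, opt, corpus))
      print(best)  # -> thermometer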

  • 6,952 real science exam questions derived from a variety of item banks

    The AI2 Science Questions Mercury dataset consists of questions used in student assessments across elementary and middle school grade levels, provided under license by an AI2 research partner.

  • 1,363 gold explanation sentences

    This is the dataset for the paper What's in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams (COLING'16). The data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines. This dataset was produced by AI2, the University of Arizona, and Stony Brook University.

  • 4,817 images

    AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering.

  • 1,080 questions

    These questions were created from the "AI2 Elementary School Science Questions (No Diagrams)" data set by replacing all of the incorrect answer options of each question with other related words. This dataset can serve as a measure of robustness for QA systems when they are tested on modified questions. More details can be found in the paper Question Answering via Integer Programming over Semi-Structured Knowledge.
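
    Illustrative only: one way to generate this kind of perturbation is to swap each incorrect option for a related word while leaving the correct answer untouched. The related-word table below is a made-up stand-in, not the substitutions actually used in the dataset:

      RELATED = {"stem": "trunk", "leaf": "frond", "flower": "blossom"}  # hypothetical

      def perturb(options, correct):
          # Replace every incorrect option with a related word, if one is known.
          return [opt if opt == correct else RELATED.get(opt, opt)
                  for opt in options]

      print(perturb(["root", "stem", "leaf", "flower"], "root"))
      # -> ['root', 'trunk', 'frond', 'blossom']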

  • 9,850 videos

    This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities. These videos of daily indoor activities were collected through Amazon Mechanical Turk.

  • AI2 TabMCQ: Multiple Choice Questions aligned with the Aristo Tablestore
    9,092 crowd-sourced science questions and 68 tables of curated facts

    This package contains a copy of the Aristo Tablestore (Nov. 2015 Snapshot), plus a large set of crowd-sourced multiple-choice questions covering the facts in the tables. Because of how the crowd-sourced annotation task was set up, the package also contains implicit alignment information between questions and tables. For further information, see "TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions" (PDF included in this package). This dataset was produced by AI2 and Sujay Kumar Jauhar (Carnegie Mellon University).

  • AI2 Tablestore (November 2015 Snapshot)
    68 tables of curated facts

    This package contains a collection of curated facts in the form of tables used by the Aristo Question-Answering System, collected using a mixture of manual and semi-automated techniques.

  • AI2 Conversational Dialog Traces
    81 dialog traces and extractions

    Dialog traces and extractions from KnowBot, an experimental dialog system that learns about its domain from conversational dialogs with the user. Produced at AI2 as part of intern Ben Hixon's project on conversational dialog.

  • 108 real science exam questions

    These 4th grade science exam questions provide an important benchmark for measuring Aristo’s progress in our research into multiple choice question answering at the elementary science level.

  • AI2 Odd-Man-Out Problem Set
    404 "odd-man-out" puzzles

    This collection consists of four sets of "odd-man-out" puzzles. There are two collections of "common noun" puzzles, where the answer options are largely common nouns, and two collections of "proper noun" puzzles, where the answer options are largely proper nouns. Each collection contains approximately 100 puzzles. The categories are taken from the card game Anomia, which was used to drive the puzzle generation process.

  • AI2 Arithmetic Questions
    391 arithmetic questions (August 2014)

    These questions guide our research into Question Answering for arithmetic exams. The focus is on high-school-level questions. Example: "Sandy has 10 books, Benny has 24 books, and Tim has 33 books. How many books do they have together?". This dataset was produced by AI2 and Hannaneh Hajishirzi (University of Washington).

  • AI2 Geometry Questions
    100 geometry questions

    These questions guide our research into Question Answering for geometry exams. The focus is on the high school level. Example (note: diagrams included in data file): "In circle O, diameter AB is perpendicular to chord CD at E. If CD = 8 and BE = 2, find AE." This dataset was produced by AI2 and the University of Washington.

  • AI2 Biology How/Why Corpus
    378 biology questions

    This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in "Discourse Complements Lexical Semantics for Non-factoid Answer Reranking" (ACL 2014). This dataset was produced by AI2 and Mihai Surdeanu (University of Arizona).

  • AI2 Co-reference analysis
    Analysis of three co-reference types

    An understanding of co-reference (i.e., multiple references to the same thing) is necessary in order to understand the meaning of a text. This dataset is an analysis of co-reference types occurring in 4th-grade biology textbooks. This analysis was based on the New York State Education Department's Grade 4 Elementary-Level Science Test (accessed July 2014).

  • AI2 Meaningful Citations Data Set
    630 paper annotations

    This dataset comprises annotations for 465 computer science papers. The annotations indicate whether a citation is important (i.e., refers to ongoing or continued work on the relevant topic) or not, and then assign the citation one of four importance rankings. This data set was produced at AI2 as part of intern Marco Valenzuela's work for his paper, "Identifying Meaningful Citations".

  • AI2 Paraphrase Examples
    33 paraphrases

    Vocabulary used in questions may differ from that of sources contributing to our Question Answering knowledge base. Relevant paraphrases like these help the QA system understand connections between question vocabulary and knowledge base vocabulary. This dataset is an analysis of PPDB paraphrases relevant to 4th-grade biology exams done by AI2 intern Ellie Pavlick.

  • Open AI Resources
    2,600 open-source artificial intelligence resources

    Open AI Resources is a directory of open source software and data for the AI research community. The site was initially developed by AI2 and InferLink Corporation, and is currently managed by the AI Access Foundation.

  • AI2 ProcessBank Data
    200 annotated paragraphs about biological processes

    This dataset was used to train a system to automatically extract process models from paragraphs that describe processes. The dataset consists of 200 paragraphs that describe biological processes. Each paragraph is annotated with its process structure, and accompanied by a few multiple-choice questions about the process. Each question has two possible answers of which exactly one is correct. The dataset contains three files:
    1. bioprocess-bank-questions.tar.gz: There is an XML file for each paragraph containing the paragraph ID, the questions, and the answers.
    2. process-bank-structures-train.tar.gz: These are the structure annotations used for training our structure predictor. Each paragraph has two files: one containing the text and one containing the annotation, in standard BRAT format.
    3. process-bank-structures-test.tar.gz: These are structure annotations used for testing. They are also in BRAT format.
    The dataset was produced by AI2 and Jonathan Berant (Stanford University).
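
    A minimal reader for the .ann files, covering the two most common BRAT line types: text-bound annotations ("T" lines) and relations ("R" lines). Other BRAT constructs are skipped here; see the BRAT standoff documentation for the full format:

      # Parse one BRAT .ann file into text-bound entities and relations.
      def read_brat(ann_path):
          entities, relations = {}, []
          with open(ann_path, encoding="utf-8") as f:
              for line in f:
                  parts = line.rstrip("\n").split("\t")
                  if parts[0].startswith("T"):     # e.g. "T1<tab>Event 12 20<tab>divides"
                      label, offsets = parts[1].split(" ", 1)
                      # offsets kept as a raw string so discontinuous spans don't break
                      entities[parts[0]] = (label, offsets, parts[2])
                  elif parts[0].startswith("R"):   # e.g. "R1<tab>Cause Arg1:T1 Arg2:T2"
                      rel, a1, a2 = parts[1].split()
                      relations.append((rel, a1.split(":")[1], a2.split(":")[1]))
          return entities, relations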