• 488 richly annotated paragraphs about processes (containing 3,300 sentences)

    The ProPara dataset is designed to train and test comprehension of simple paragraphs describing processes (e.g., photosynthesis). The task is to predict, track, and answer questions about how entities change over the course of the process.

  • Over 39 million published research papers in Computer Science, Neuroscience, and Biomedicine

    This is a subset of the full Semantic Scholar corpus, comprising papers crawled from the web and passed through a number of filters.

  • PeerRead
    Over 14K paper drafts and over 10K textual peer reviews

    PeerRead is a dataset of scientific peer reviews, made available to help researchers study this important artifact.

  • 7,787 multiple choice science questions and associated corpora

    A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to the task, and an implementation of three neural baseline models for this dataset. We pose ARC as a challenge to the community.
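    The word co-occurrence baseline mentioned above can be sketched as a simple overlap scorer: count how many corpus sentences mention vocabulary from both the question and a candidate answer, then pick the highest-scoring option. This is a minimal illustration under assumed inputs, not ARC's actual baseline implementation; the example corpus and question are invented.

    ```python
    # Sketch of a word co-occurrence baseline for multiple-choice science QA.
    # Scores each answer option by the number of corpus sentences that mention
    # words from both the question and that option.
    def tokenize(text):
        return [w.lower().strip(".,?") for w in text.split()]

    def cooccurrence_score(question, option, corpus_sentences):
        q_words = set(tokenize(question))
        o_words = set(tokenize(option))
        score = 0
        for sentence in corpus_sentences:
            s_words = set(tokenize(sentence))
            # a sentence counts if it shares words with BOTH question and option
            if q_words & s_words and o_words & s_words:
                score += 1
        return score

    def answer(question, options, corpus_sentences):
        return max(options, key=lambda o: cooccurrence_score(question, o, corpus_sentences))

    corpus = [
        "Photosynthesis in plants converts sunlight into chemical energy.",
        "Granite is a common igneous rock.",
    ]
    q = "Which process lets plants store energy from sunlight?"
    print(answer(q, ["photosynthesis", "erosion", "condensation"], corpus))
    # -> photosynthesis
    ```

    The Challenge Set is defined precisely by the failure of such shallow scorers: its questions require more than lexical overlap with a retrieval corpus.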

  • Explanation graphs for 1,680 questions

    A collection of resources for studying explanation-centered inference, including explanation graphs for 1,680 questions, with 4,950 tablestore rows, and other analyses of the knowledge required to answer elementary and middle-school science questions. ExplanationBank was constructed by Peter Jansen (University of Arizona), in collaboration with AI2.

  • 27,026 statements

    The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.
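    The conversion of a question plus its correct answer into an assertive hypothesis can be illustrated with a toy rewrite sketch. The rules and examples below are illustrative assumptions; SciTail's actual construction handled question forms more carefully.

    ```python
    # Toy sketch of turning a question and its correct answer choice into an
    # assertive hypothesis statement, in the spirit of SciTail's construction.
    def to_hypothesis(question, answer):
        q = question.strip()
        if "___" in q:
            # fill-in-the-blank question: substitute the answer directly
            return q.replace("___", answer)
        for wh in ("What", "Which", "Who"):
            # crude wh-question rewrite (a hypothetical rule, for illustration)
            if q.startswith(wh + " "):
                rest = q[len(wh) + 1:].rstrip("?")
                return f"{answer.capitalize()} {rest}."
        return f"{q.rstrip('?')} {answer}."

    print(to_hypothesis("What causes tides?", "gravity"))
    # -> Gravity causes tides.
    print(to_hypothesis("Ice ___ when heated.", "melts"))
    # -> Ice melts when heated.
    ```

    Each resulting hypothesis is then paired with web sentences as premises, and the entailment label says whether the premise supports the hypothesis.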

  • 5,059 real science exam questions derived from a variety of regional and state science exams

    The AI2 Science Questions dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is in 4-way multiple-choice format and may or may not include a diagram element.

  • 13,679 science questions with supporting sentences

    The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry, Biology, and other subjects. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

  • 156K sentences for 4th grade questions, 107K sentences for 8th grade questions, and derived tuples

    The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). These sentences were collected from a large web corpus using training questions from 4th and 8th grade as queries. The dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by its Open IE v4 tuples in their simple format.

  • 1,076 textbook lessons, 26,260 questions, 6,229 images

    The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula as described in Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks. Each lesson has a set of multiple-choice questions that address concepts taught in that lesson. TQA has a total of 26,260 questions, including 12,567 that have an accompanying diagram.

  • 9,356 science terms and sentences

    This is the dataset for the paper Leveraging Term Banks for Answering Complex Questions: A Case for Sparse Vectors. The dataset contains 9,356 science terms and, for each term, an average of 16,000 sentences that contain the term.

  • 294,000 science-relevant tuples

    The Aristo Tuple KB contains 294,000 high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints.
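    A (subject, relation, object) tuple store of this kind can be queried with a simple index over subjects. The file format, field names, and example facts below are illustrative assumptions, not the Tuple KB's actual schema:

    ```python
    # Minimal sketch of indexing (subject, relation, object) tuples by subject
    # so that lookups like "what do we know about X?" are a single dict access.
    from collections import defaultdict

    def build_index(tuples):
        index = defaultdict(list)
        for subj, rel, obj in tuples:
            index[subj.lower()].append((rel, obj))
        return index

    # hypothetical example facts, for illustration only
    kb = [
        ("butterfly", "is a kind of", "insect"),
        ("insect", "has part", "six legs"),
        ("butterfly", "eats", "nectar"),
    ]
    index = build_index(kb)
    print(index["butterfly"])
    # -> [('is a kind of', 'insect'), ('eats', 'nectar')]
    ```

    The domain vocabulary constraints mentioned above restrict which subjects and objects are admitted, which is what keeps the KB targeted at science-relevant entities.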

  • 1,197,377 science-relevant sentences

    The Aristo Mini corpus contains 1,197,377 (very loosely) science-relevant sentences drawn from public data. It provides simple science-relevant text that may be useful to help answer elementary science questions. It is used in the Aristo Mini system and is also available here as a resource in its own right.

  • 6,952 real science exam questions derived from a variety of item banks

    The AI2 Science Questions Mercury dataset consists of questions used in student assessments across elementary and middle school grade levels, provided under license by an AI2 research partner.

  • 1,363 gold explanation sentences

    This is the dataset for the paper What's in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams (COLING'16). The data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines. This dataset was produced by AI2, the University of Arizona, and Stony Brook University.

  • 4,817 images

    AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering.

  • 1,080 questions

    These questions were created from the "AI2 Elementary School Science Questions (No Diagrams)" dataset by replacing all of the incorrect answer options of each question with other related words. This dataset can serve as a good measure of robustness for QA systems when tested on modified questions. More details can be found in the paper Question Answering via Integer Programming over Semi-Structured Knowledge.

  • 9,850 videos

    This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities. These videos of daily indoor activities were collected through Amazon Mechanical Turk.

  • 9,092 crowd-sourced science questions and 68 tables of curated facts

    This package contains a copy of the Aristo Tablestore (Nov. 2015 Snapshot), plus a large set of crowd-sourced multiple-choice questions covering the facts in the tables. Through the setup of the crowd-sourced annotation task, the package also contains implicit alignment information between questions and tables. For further information, see "TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions" (PDF included in this package). This dataset was produced by AI2 and Sujay Kumar Jauhar (Carnegie Mellon University).

  • 68 tables of curated facts

    This package contains a collection of curated facts in the form of tables used by the Aristo Question-Answering System, collected using a mixture of manual and semi-automated techniques.