    • 5,957 multiple-choice questions probing a book of 1,326 science facts

      OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension.

    • 488 richly annotated paragraphs about processes (containing 3,300 sentences)

      The ProPara dataset is designed to train and test comprehension of simple paragraphs describing processes (e.g., photosynthesis), supporting the task of predicting, tracking, and answering questions about how entities change during the process.

    • Over 39 million published research papers in Computer Science, Neuroscience, and Biomedicine

      This is a subset of the full Semantic Scholar corpus which represents papers crawled from the Web and subjected to a number of filters.

    • Over 14K paper drafts and over 10K textual peer reviews

      PeerRead is a dataset of scientific peer reviews available to help researchers study this important artifact.

    • 7,787 multiple choice science questions and associated corpora

      The AI2 Reasoning Challenge (ARC) is a new dataset of 7,787 genuine grade-school-level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to the task, and an implementation of three neural baseline models for this dataset. We pose ARC as a challenge to the community.

    • Explanation graphs for 1,680 questions

      A collection of resources for studying explanation-centered inference, including explanation graphs for 1,680 questions, with 4,950 tablestore rows, and other analyses of the knowledge required to answer elementary and middle-school science questions. ExplanationBank was constructed by Peter Jansen (University of Arizona), in collaboration with AI2.

    • 27,026 statements

      The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.
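
      To illustrate the conversion described above, here is a minimal, hypothetical sketch of turning a question and its correct answer choice into an assertive hypothesis statement. The `to_hypothesis` helper and the fill-in-the-blank handling are assumptions for illustration only; the actual SciTail hypotheses were constructed with more careful rewriting.

```python
def to_hypothesis(question: str, answer: str) -> str:
    """Form a simple assertive statement from a question and its answer.

    Hypothetical helper, not SciTail's actual pipeline: it handles a
    fill-in-the-blank question directly and otherwise just appends the
    answer to the question stem.
    """
    if "___" in question:
        # Substitute the answer for the blank and normalize the ending.
        return question.replace("___", answer).rstrip("?.") + "."
    # Fallback: drop the question mark and append the answer.
    return question.rstrip("?") + " " + answer + "."

print(to_hypothesis("Plants make food using ___.", "photosynthesis"))
# -> Plants make food using photosynthesis.
```

      Each such hypothesis is then paired with a web sentence as the premise, yielding an entailment example.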

    • 5,059 real science exam questions derived from a variety of regional and state science exams

      The AI2 Science Questions dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is in 4-way multiple-choice format and may or may not include a diagram element.

    • 13,679 science questions with supporting sentences

      The SciQ dataset contains 13,679 crowdsourced science exam questions covering Physics, Chemistry, Biology, and other subjects. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

    • 156K sentences for 4th grade questions, 107K sentences for 8th grade questions, and derived tuples

      The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. This dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by the Open IE v4 tuples using their simple format.
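
      As a rough sketch, a tuple in a simple "(arg1; relation; arg2)" style can be parsed as below. The exact line format in the released files may differ, so treat both the example line and the parsing logic as assumptions rather than the dataset's documented schema.

```python
import re

def parse_tuple(line: str):
    """Extract the semicolon-separated parts of one tuple line.

    Assumes a "(part; part; part)" layout; returns None when no
    parenthesized tuple is found on the line.
    """
    match = re.search(r"\(([^)]*)\)", line)
    if match is None:
        return None
    return tuple(part.strip() for part in match.group(1).split(";"))

print(parse_tuple("0.93 (the moon; orbits; the earth)"))
# -> ('the moon', 'orbits', 'the earth')
```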

    • 1,076 textbook lessons, 26,260 questions, 6,229 images

      The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula as described in Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks downloaded from ck12.org. Each lesson has a set of multiple choice questions that address concepts taught in that lesson. TQA has a total of 26,260 questions including 12,567 that have an accompanying diagram.

    • 9,356 science terms and sentences

      This is the dataset for the paper Leveraging Term Banks for Answering Complex Questions: A Case for Sparse Vectors. The dataset contains 9,356 science terms and, for each term, an average of 16,000 sentences that contain the term.

    • 294,000 science-relevant tuples

      The Aristo Tuple KB contains 294,000 high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints.
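
      A minimal sketch of how (subject, relation, object) tuples like these can be queried, indexing them by subject for lookup. The triples below are made-up examples, not actual Aristo Tuple KB entries.

```python
from collections import defaultdict

# Hypothetical example triples in the KB's (subject, relation, object) shape.
triples = [
    ("butterfly", "is-a", "insect"),
    ("butterfly", "has-part", "wing"),
    ("plant", "requires", "sunlight"),
]

# Index tuples by subject, a common way to query a triple store.
by_subject = defaultdict(list)
for subj, rel, obj in triples:
    by_subject[subj].append((rel, obj))

print(by_subject["butterfly"])
# -> [('is-a', 'insect'), ('has-part', 'wing')]
```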

    • 1,197,377 science-relevant sentences

      The Aristo Mini corpus contains 1,197,377 (very loosely) science-relevant sentences drawn from public data. It provides simple science-relevant text that may be useful to help answer elementary science questions. It is used in the Aristo Mini system and is also available here as a resource in its own right.

    • 6,952 real science exam questions derived from a variety of item banks

      The AI2 Science Questions Mercury dataset consists of questions used in student assessments across elementary and middle school grade levels, provided under license by an AI2 research partner.

    • 1,363 gold explanation sentences

      This is the dataset for the paper What's in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams (COLING'16). The data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines. This dataset was produced by AI2, the University of Arizona, and Stony Brook University.

    • 4,817 images

      AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering.

    • 1,080 questions

      These questions were created from the "AI2 Elementary School Science Questions (No Diagrams)" dataset by replacing all of the incorrect answer options of each question with other related words. This dataset can serve as a good measure of robustness for QA systems when they are tested on modified questions. More details can be found in the paper Question Answering via Integer Programming over Semi-Structured Knowledge.

    • 9,850 videos

      This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities. These videos of daily indoor activities were collected through Amazon Mechanical Turk.

    • 9,092 crowd-sourced science questions and 68 tables of curated facts

      This package contains a copy of the Aristo Tablestore (Nov. 2015 Snapshot), plus a large set of crowd-sourced multiple-choice questions covering the facts in the tables. Through the setup of the crowd-sourced annotation task, the package also contains implicit alignment information between questions and tables. For further information, see "TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions" (PDF included in this package). This dataset was produced by AI2 and Sujay Kumar Jauhar (Carnegie Mellon University).