Allen Institute for AI

Datasets

Viewing 21-30 of 47 datasets
  • PeerRead

    Over 14K paper drafts and over 10K textual peer reviews
    Aristo • 2018
    PeerRead is a dataset of scientific peer reviews available to help researchers study this important artifact.
  • ComplexWebQuestions

    34,689 complex questions with their answers, web snippets, and SPARQL queries
    AI2 Israel, Question Understanding • 2018
    ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set of complex questions in natural language and can be used in multiple ways: 1) by interacting with a search engine, which is the focus of our paper (Talmor and Berant, 2018); 2) as a reading comprehension task: we release 12,725,989 web snippets that are relevant to the questions and were collected during the development of our model; 3) as a semantic parsing task: each question is paired with a SPARQL query that can be executed against Freebase to retrieve the answer.
  • AI2 Reasoning Challenge (ARC) 2018

    7,787 multiple-choice science questions and associated corpora
    Aristo • 2018
    A new dataset of 7,787 genuine grade-school-level, multiple-choice science questions, assembled to encourage research in advanced question answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm (a minimal sketch of such a baseline appears after this list). We also include a corpus of over 14 million science sentences relevant to the task and an implementation of three neural baseline models for this dataset. We pose ARC as a challenge to the community.
  • ExplanationBank

    Explanation graphs for 1,680 questions
    Aristo • 2018
    A collection of resources for studying explanation-centered inference, including explanation graphs for 1,680 questions, 4,950 tablestore rows, and other analyses of the knowledge required to answer elementary- and middle-school science questions.
  • SciTail Dataset

    27,026 statements
    Aristo • 2017
    The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.
  • SciQ Dataset

    13,679 science questions with supporting sentences
    Aristo • 2017
    The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry, Biology, and other subjects. The questions are in multiple-choice format with four answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.
  • TupleInf Open IE Dataset

    156K sentences for 4th-grade questions, 107K sentences for 8th-grade questions, and derived tuples
    Aristo • 2017
    The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in the paper "Answering Complex Questions Using Open Information Extraction".
  • Science Terms and Sentences

    9,356 science terms and sentences
    Aristo • 2017
    The dataset contains 9,356 science terms and, for each term, an average of 16,000 sentences that contain the term.
  • Textbook Question Answering (TQA)

    1,076 textbook lessons, 26,260 questions, 6,229 images
    PRIOR • 2017
    The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science, and Physical Science textbooks, with 26,260 questions, 12,567 of which have an accompanying diagram.
  • Explicit Semantic Ranking Dataset

    March 2017
    Semantic Scholar • 2017
    This is the dataset for the paper "Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding". It includes the query log used in the paper, relevance judgments for the queries, ranking lists from Semantic Scholar, candidate documents, entity embeddings trained using the knowledge graph, and the baselines, development methods, and alternative methods from the experiments.
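
The ARC entry above contrasts a retrieval-based solver with a word co-occurrence baseline. Below is a minimal, illustrative Python sketch of that second idea: each answer choice is scored by how many corpus sentences mention both a question word and a choice word. The record layout (question/stem/choices/text/label keys) and the two-sentence corpus are assumptions made for the example, not the released ARC format or the paper's actual baseline.

```python
import re

def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cooccurrence_score(question, choice, corpus_sentences):
    """Count corpus sentences in which a question word and a choice-only
    word appear together; more such sentences suggests a better-supported choice."""
    q_words = tokenize(question)
    c_words = tokenize(choice) - q_words
    return sum(1 for sent in corpus_sentences
               if sent & q_words and sent & c_words)

def answer_question(record, corpus_sentences):
    """Pick the answer label whose text co-occurs most with the question stem.
    Field names ('question', 'stem', 'choices', 'text', 'label') are an
    assumed JSONL layout for ARC-style records, not a documented schema."""
    stem = record["question"]["stem"]
    best = max(record["question"]["choices"],
               key=lambda ch: cooccurrence_score(stem, ch["text"], corpus_sentences))
    return best["label"]

if __name__ == "__main__":
    # Toy corpus and question, purely for illustration.
    corpus = [tokenize(s) for s in (
        "Plants use sunlight to make food through photosynthesis.",
        "Photosynthesis occurs in the leaves of green plants.",
    )]
    record = {"question": {
        "stem": "What do plants use to make food?",
        "choices": [{"label": "A", "text": "sunlight"},
                    {"label": "B", "text": "soil alone"}],
    }}
    print(answer_question(record, corpus))  # prints "A"
```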