    • 9,850 videos

      This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities. The videos of daily indoor activities were collected through Amazon Mechanical Turk.

    • 9,092 crowd-sourced science questions and 68 tables of curated facts

      This package contains a copy of the Aristo Tablestore (Nov. 2015 Snapshot), plus a large set of crowd-sourced multiple-choice questions covering the facts in the tables. Through the setup of the crowd-sourced annotation task, the package also contains implicit alignment information between questions and tables. For further information, see "TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions" (PDF included in this package). This dataset was produced by AI2 and Sujay Kumar Jauhar (Carnegie Mellon University).

    • 68 tables of curated facts

      This package contains a collection of curated facts in the form of tables used by the Aristo Question-Answering System, collected using a mixture of manual and semi-automated techniques.

    • 81 dialog traces and extractions

      Questions, dialog traces, and extractions from KnowBot, an experimental dialog system that learns about its domain from conversational dialogs with the user. This dataset was produced at AI2 as part of intern Ben Hixon's project on conversational dialog.

    • Evaluations for 108 real science exam questions

      This work explores the use of Markov Logic Networks (MLNs) for answering elementary-level natural language science questions. The dataset contains the MLNs generated from three different formulations along with a README describing the format. Also see "Markov Logic Networks for Natural Language Question Answering" (StarAI '15) for a description of the formulations.

    • 391 arithmetic questions

      These questions guide our research into Question Answering for arithmetic exams. The focus is on high-school-level questions. Example: "Sandy has 10 books, Benny has 24 books, and Tim has 33 books. How many books do they have together?" This dataset was produced by AI2 and Hannaneh Hajishirzi (University of Washington).
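
      A minimal illustration of the reasoning such a question requires (the number-extraction approach below is a hypothetical sketch for pure addition problems, not AI2's actual system):

      ```python
      import re

      def answer_addition_question(question: str) -> int:
          """Naively answer a 'how many together?' question by
          extracting every number mentioned and summing them.
          Only handles questions that reduce to pure addition."""
          numbers = [int(n) for n in re.findall(r"\d+", question)]
          return sum(numbers)

      q = ("Sandy has 10 books, Benny has 24 books, and Tim has 33 books. "
           "How many books do they have together?")
      print(answer_addition_question(q))  # → 67
      ```

      Real questions in the dataset of course also involve subtraction and multi-step reasoning, where this naive strategy fails.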

    • 378 biology questions

      This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in "Discourse Complements Lexical Semantics for Non-factoid Answer Reranking" (ACL 2014). This dataset was produced by AI2 and Mihai Surdeanu (University of Arizona).

    • 630 paper annotations

      This dataset comprises annotations for 465 computer science papers. Each annotation indicates whether a citation is important (i.e., refers to ongoing or continued work on the relevant topic) or not, and assigns the citation one of four importance rankings. This dataset was produced at AI2 as part of intern Marco Valenzuela's work for his paper, "Identifying Meaningful Citations".

    • 100 geometry questions

      These questions guide our research into Question Answering for geometry exams. The focus is on the high school level. Example (note: diagrams are included in the data file): "In circle O, diameter AB is perpendicular to chord CD at E. If CD = 8 and BE = 2, find AE." This dataset was produced by AI2 and the University of Washington.
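
      The sample question above follows from the intersecting-chords theorem (AE · EB = CE · ED) plus the fact that a diameter perpendicular to a chord bisects it. A quick check of that arithmetic (this computation is illustrative only, not part of the dataset):

      ```python
      # Two chords intersecting at E satisfy AE * EB = CE * ED.
      # Diameter AB is perpendicular to chord CD, so E bisects CD.
      CD = 8
      BE = 2
      CE = ED = CD / 2       # each half-chord is 4
      AE = CE * ED / BE      # 4 * 4 / 2
      print(AE)              # → 8.0
      ```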

    • 200 annotated paragraphs about biological processes

      This dataset was used to train a system to automatically extract process models from paragraphs that describe processes. The dataset consists of 200 paragraphs that describe biological processes. Each paragraph is annotated with its process structure, and accompanied by a few multiple-choice questions about the process. Each question has two possible answers of which exactly one is correct. The dataset contains three files:
      1. bioprocess-bank-questions.tar.gz: an XML file for each paragraph containing the paragraph ID, the questions, and the answers.
      2. process-bank-structures-train.tar.gz: the structure annotations used for training our structure predictor. Each paragraph has two files, one containing the text and one containing the annotation, in standard BRAT format.
      3. process-bank-structures-test.tar.gz: the structure annotations used for testing, also in BRAT format.
      The dataset was produced by AI2 and Jonathan Berant (Stanford University).
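
      BRAT standoff format pairs each raw-text .txt file with a .ann file whose tab-separated lines define text-bound spans (T…) and relations between them (R…). A minimal sketch of reading such a file (the entity and relation labels below are invented for illustration; the actual label set is defined by the dataset's annotation scheme):

      ```python
      def parse_brat_ann(ann_text: str):
          """Split a BRAT .ann file into text-bound annotations (T*)
          and binary relations (R*). Fields are tab-separated."""
          entities, relations = {}, []
          for line in ann_text.strip().splitlines():
              fields = line.split("\t")
              if fields[0].startswith("T"):
                  label, start, end = fields[1].split(" ")
                  entities[fields[0]] = (label, int(start), int(end), fields[2])
              elif fields[0].startswith("R"):
                  rel, arg1, arg2 = fields[1].split(" ")
                  relations.append((rel, arg1.split(":")[1], arg2.split(":")[1]))
          return entities, relations

      # Hypothetical annotation for the text "Light drives photosynthesis."
      sample = ("T1\tEvent 0 5\tLight\n"
                "T2\tEvent 13 27\tphotosynthesis\n"
                "R1\tCause Arg1:T1 Arg2:T2")
      ents, rels = parse_brat_ann(sample)
      print(rels)  # → [('Cause', 'T1', 'T2')]
      ```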

    • March 2017

      This is the dataset for the paper "Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding". It includes the query log used in the paper, relevance judgments for the queries, ranking lists from Semantic Scholar, candidate documents, entity embeddings trained using the knowledge graph, and the baselines, development methods, and alternative methods from the experiments.