    • 68 tables of curated facts

      This package contains a collection of curated facts in the form of tables used by the Aristo Question-Answering System, collected using a mixture of manual and semi-automated techniques.

    • 81 dialog traces and extractions

      Dialog traces and extractions from KnowBot, an experimental dialog system that learns about its domain from conversational dialogs with the user. Produced at AI2 as part of intern Ben Hixon's project on Conversational Dialog Questions.

    • Evaluations for 108 real science exam questions

      This work explores the use of Markov Logic Networks (MLNs) for answering elementary-level natural language science questions. The dataset contains the MLNs generated from three different formulations along with a README describing the format. Also see "Markov Logic Networks for Natural Language Question Answering" (StarAI '15) for a description of the formulations.

    • 108 real science exam questions

      These 4th-grade science exam questions provide an important benchmark for measuring Aristo’s progress in our research into multiple-choice question answering at the elementary science level.

    • 404 "odd-man-out" puzzles

      This collection consists of four sets of "odd-man-out" puzzles. There are two collections of "common noun" puzzles, where the answer options are largely common nouns, and two collections of "proper noun" puzzles, where the answer options are largely proper nouns. Each collection contains approximately 100 puzzles. The categories are taken from the card game Anomia, which was used to drive the puzzle generation process.

    • 100 geometry questions

      These questions guide our research into Question Answering for geometry exams. Focus is on the high school level. Example (note: diagrams included in data file): "In circle O, diameter AB is perpendicular to chord CD at E. If CD = 8 and BE = 2, find AE." This dataset was produced by AI2 and the University of Washington.

    • 391 arithmetic questions

      These questions guide our research into Question Answering for arithmetic exams. Focus is on high school level questions. Example: "Sandy has 10 books, Benny has 24 books, and Tim has 33 books. How many books do they have together?". This dataset was produced by AI2 and Hannaneh Hajishirzi (University of Washington).

    • 378 biology questions

      This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in "Discourse Complements Lexical Semantics for Non-factoid Answer Reranking" (ACL 2014). This dataset was produced by AI2 and Mihai Surdeanu (University of Arizona).

    • Analysis of three co-reference types

      An understanding of co-reference (i.e., multiple references to the same thing) is necessary to understand the meaning of a text. This dataset is an analysis of co-reference types occurring in 4th-grade biology textbooks. This analysis was based on the New York State Education Department's Grade 4 Elementary-Level Science Test (accessed July 2014).

    • 630 paper annotations

      This dataset consists of annotations for 465 computer science papers. The annotations indicate whether a citation is important (i.e., refers to ongoing or continued work on the relevant topic) or not, and then assign the citation one of four importance rankings. This dataset was produced at AI2 as part of intern Marco Valenzuela's work for his paper, "Identifying Meaningful Citations".

    • 33 paraphrases

      Vocabulary used in questions may differ from that of the sources contributing to our Question Answering knowledge base. Relevant paraphrases like these help the QA system connect question vocabulary to knowledge base vocabulary. This dataset, produced by AI2 intern Ellie Pavlick, is an analysis of PPDB paraphrases relevant to 4th-grade biology exams.

    • 2600 open-source artificial intelligence resources

      Open AI Resources is a directory of open source software and data for the AI research community. The site was initially developed by AI2 and InferLink Corporation, and is currently managed by the AI Access Foundation.

    • 200 annotated paragraphs about biological processes

      This dataset was used to train a system to automatically extract process models from paragraphs that describe processes. The dataset consists of 200 paragraphs that describe biological processes. Each paragraph is annotated with its process structure, and accompanied by a few multiple-choice questions about the process. Each question has two possible answers of which exactly one is correct. The dataset contains three files:
      1. bioprocess-bank-questions.tar.gz: There is one XML file per paragraph, containing the paragraph ID, the questions, and the answers.
      2. process-bank-structures-train.tar.gz: These are the structure annotations used for training our structure predictor. Each paragraph has two files - one containing the text and one containing the annotation. This is standard BRAT format.
      3. process-bank-structures-test.tar.gz: These are structure annotations used for testing. They are also in BRAT format.
      The dataset was produced by AI2 and Jonathan Berant (Stanford University).
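
      For readers unfamiliar with the BRAT standoff format mentioned above, the following is a minimal sketch of how an annotation file could be read. It assumes the usual BRAT layout (a .txt file with the text and a .ann file with tab-separated annotation lines) and handles only text-bound ("T") and event ("E") lines. The sample sentence and the labels Entity, Trigger, Agent, and Theme are hypothetical and are not taken from the dataset; the labels actually used in the archives may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBound:
    """One text-bound ("T") annotation: a labeled character span."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


# (event id, trigger label, trigger id, [(role, argument id), ...])
Event = Tuple[str, str, str, List[Tuple[str, str]]]


def parse_brat_ann(ann_text: str) -> Tuple[Dict[str, TextBound], List[Event]]:
    """Parse "T" and "E" lines of a BRAT .ann file; other line types are skipped."""
    entities: Dict[str, TextBound] = {}
    events: List[Event] = []
    for line in ann_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and annotator notes
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # T1<TAB>Label start end<TAB>covered text
            label, span = fields[1].split(" ", 1)
            # Discontinuous spans look like "s1 e1;s2 e2"; keep the overall extent.
            offsets = span.replace(";", " ").split()
            covered = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = TextBound(
                ann_id, label, int(offsets[0]), int(offsets[-1]), covered
            )
        elif ann_id.startswith("E"):
            # E1<TAB>TriggerLabel:T2 Role:T1 Role:T3 ...
            parts = fields[1].split()
            trigger_label, trigger_id = parts[0].split(":", 1)
            args = [tuple(p.split(":", 1)) for p in parts[1:]]
            events.append((ann_id, trigger_label, trigger_id, args))
    return entities, events


# Hypothetical example, not taken from the dataset.
sample_text = "Plants absorb light."
sample_ann = (
    "T1\tEntity 0 6\tPlants\n"
    "T2\tTrigger 7 13\tabsorb\n"
    "T3\tEntity 14 19\tlight\n"
    "E1\tTrigger:T2 Agent:T1 Theme:T3\n"
)

entities, events = parse_brat_ann(sample_ann)
for ent in entities.values():
    # Offsets index into the paired text file.
    print(ent.ann_id, ent.label, repr(sample_text[ent.start:ent.end]))
print(events)
```

      Offsets in "T" lines are character positions in the paired text file, so slicing the text with them should reproduce the covered string. Relation, attribute, and other BRAT line types are ignored in this sketch.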

    • March 2017

      This is the dataset for the paper "Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding". It includes the query log used in the paper, relevance judgements for the queries, ranking lists from Semantic Scholar, candidate documents, entity embeddings trained using the knowledge graph, and the baselines, development methods, and alternative methods from the experiments.