This collection consists of four sets of "odd-man-out" puzzles. There are two collections of "common noun" puzzles, where the answer options are largely common nouns, and two collections of "proper noun" puzzles, where the answer options are largely proper nouns. Each collection contains approximately 100 puzzles. The categories are taken from the card game Anomia, which was used to drive the puzzle generation process.
Vocabulary used in questions may differ from that of the sources contributing to our Question Answering knowledge base. Relevant paraphrases help the QA system understand connections between question vocabulary and knowledge base vocabulary. This dataset is an analysis of PPDB paraphrases relevant to 4th-grade biology exams, conducted by AI2 intern Ellie Pavlick.
An understanding of co-reference (i.e., multiple references to the same thing) is necessary to understand the meaning of a text. This dataset is an analysis of co-reference types occurring in 4th-grade biology textbooks. This analysis was based on the New York State Education Department's Grade 4 Elementary-Level Science Test (accessed July 2014).
This dataset consists of 185 "how" and 193 "why" biology questions authored by a domain expert, with one or more gold answer passages identified in an undergraduate textbook. The expert was not constrained in any way during the annotation process, so gold answers might be smaller than a paragraph or span multiple paragraphs. This dataset was used for the question-answering system described in "Discourse Complements Lexical Semantics for Non-factoid Answer Reranking" (ACL 2014). This dataset was produced by AI2 and Mihai Surdeanu (University of Arizona).
This dataset was used to train a system to automatically extract process models from paragraphs that describe processes. The dataset consists of 200 paragraphs that describe biological processes. Each paragraph is annotated with its process structure, and accompanied by a few multiple-choice questions about the process. Each question has two possible answers of which exactly one is correct. The dataset contains three files:
1. bioprocess-bank-questions.tar.gz: One XML file per paragraph, containing the paragraph ID, the questions, and the answers.
2. process-bank-structures-train.tar.gz: These are the structure annotations used for training our structure predictor. Each paragraph has two files: one containing the text and one containing the annotation. This is standard BRAT format.
3. process-bank-structures-test.tar.gz: These are structure annotations used for testing. They are also in BRAT format.
The dataset was produced by AI2 and Jonathan Berant (Stanford University).
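For readers unfamiliar with the BRAT format mentioned above, the following is a minimal Python sketch of reading a BRAT standoff annotation. It assumes the usual BRAT conventions: a text file paired with an annotation file whose tab-separated lines start with an ID such as T1 (text-bound span), E1 (event), or R1 (relation). The sample sentence and the label names (Event, Entity, Cause) are hypothetical and chosen for illustration only; the actual label set and file names inside the archives are not specified here and may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBound:
    id: str
    type: str
    start: int  # character offset into the paired text file
    end: int
    text: str


@dataclass
class Link:
    """An event (E...) or relation (R...) line: a type plus role -> ID arguments."""
    id: str
    type: str
    args: Dict[str, str]


def parse_brat(ann: str) -> Tuple[Dict[str, TextBound], List[Link]]:
    """Parse BRAT standoff annotation text into spans and links.

    Only T, E and R lines are handled; attributes and notes are skipped.
    """
    spans: Dict[str, TextBound] = {}
    links: List[Link] = []
    for line in ann.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Format: "T1<TAB>Type start end<TAB>covered text".
            # Discontinuous spans ("0 5;10 15") are collapsed to their outer bounds.
            label, offsets = fields[1].split(" ", 1)
            numbers = offsets.replace(";", " ").split()
            covered = fields[2] if len(fields) > 2 else ""
            spans[ann_id] = TextBound(ann_id, label, int(numbers[0]), int(numbers[-1]), covered)
        elif ann_id.startswith("E"):
            # Format: "E1<TAB>Type:Trigger Role:ID ...".
            parts = fields[1].split()
            label, trigger = parts[0].split(":", 1)
            args = dict(p.split(":", 1) for p in parts[1:])
            args["Trigger"] = trigger
            links.append(Link(ann_id, label, args))
        elif ann_id.startswith("R"):
            # Format: "R1<TAB>Type Arg1:ID Arg2:ID".
            parts = fields[1].split()
            links.append(Link(ann_id, parts[0], dict(p.split(":", 1) for p in parts[1:])))
    return spans, links


# Hypothetical example: the contents of one text file and its annotation file.
sample_text = "Plants absorb light."
sample_ann = (
    "T1\tEvent 7 13\tabsorb\n"
    "T2\tEntity 14 19\tlight\n"
    "R1\tCause Arg1:T1 Arg2:T2\n"
)

spans, links = parse_brat(sample_ann)
for span in spans.values():
    # Offsets index into the paired text, so the slice should match the covered text.
    print(span.id, span.type, repr(sample_text[span.start:span.end]))
for link in links:
    print(link.id, link.type, link.args)
```

Because BRAT stores character offsets rather than token indices, checking that each slice of the text matches the covered string is a cheap sanity test after unpacking the archives.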