Datasets

  • Real Toxicity Prompts

    A dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.
    Mosaic • 2020
  • eQASC: Multihop Explanations for QASC

    98k annotated explanations for the QASC dataset.
    Aristo • 2020
    This dataset contains 98k 2-hop explanations for questions in the QASC dataset, with annotations indicating if they are valid (~25k) or invalid (~73k) explanations.
  • hasPart KB

    A high-quality KB of hasPart relations.
    Aristo • 2020
    A high-quality knowledge base of ~50k hasPart relationships, extracted from a large corpus of generic statements.
  • SciDocs

    Academic paper representation dataset accompanying the SPECTER paper/model.
    Semantic Scholar • 2020
    Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives…
  • GenericsKB

    A large knowledge base of generic sentences.
    Aristo • 2020
    The GenericsKB contains 3.4M+ generic sentences about the world, i.e., sentences expressing general truths such as "Dogs bark" and "Trees remove carbon dioxide from the atmosphere." Generics are potentially useful as a knowledge source for AI systems…
  • SciFact

    1.4K expert-written scientific claims paired with evidence-containing abstracts.
    Semantic Scholar • 2020
    Due to the rapid growth in the scientific literature, there is a need for automated systems to assist researchers and the public in assessing the veracity of scientific claims. To facilitate the development of systems for this task, we introduce SciFact, a…
  • TORQUE

    A new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.
    AllenNLP • 2020
    A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have…
  • Contrast Sets

    Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities.
    AllenNLP • 2020
    Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on…
  • CORD-19: COVID-19 Open Research Dataset

    Tens of thousands of scholarly articles about COVID-19 and related coronaviruses.
    Semantic Scholar • 2020
    CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
  • Break

    83,978 examples sampled from 10 question answering datasets over text, images and databases.
    AI2 Israel, Question Understanding • 2020
    Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases.