Datasets 11–20 of 59
- 98k annotated explanations for the QASC dataset (Aristo • 2020). This dataset contains 98k 2-hop explanations for questions in the QASC dataset, with annotations indicating whether they are valid (~25k) or invalid (~73k) explanations.
- A high-quality KB of hasPart relations (Aristo • 2020). A high-quality knowledge base of ~50k hasPart relationships, extracted from a large corpus of generic statements.
- Academic paper representation dataset accompanying the SPECTER paper/model (Semantic Scholar • 2020). Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose SPECTER, a new method to generate document-level embeddings of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
- A large knowledge base of generic sentences (Aristo • 2020). The GenericsKB contains 3.4M+ generic sentences about the world, i.e., sentences expressing general truths such as "Dogs bark," and "Trees remove carbon dioxide from the atmosphere." Generics are potentially useful as a knowledge source for AI systems requiring general world knowledge. The GenericsKB is the first large-scale resource containing naturally occurring generic sentences (as opposed to extracted or crowdsourced triples), and is rich in high-quality, general, semantically complete statements. Generics were primarily extracted from three large text sources, namely the Waterloo Corpus, selected parts of Simple Wikipedia, and the ARC Corpus. A filtered, high-quality subset is also available in GenericsKB-Best, containing 1,020,868 sentences. We recommend you start with GenericsKB-Best.
- 1.4K expert-written scientific claims paired with evidence-containing abstracts (Semantic Scholar • 2020). Due to the rapid growth in the scientific literature, there is a need for automated systems to assist researchers and the public in assessing the veracity of scientific claims. To facilitate the development of systems for this task, we introduce SciFact, a dataset of 1.4K expert-written claims, paired with evidence-containing abstracts annotated with veracity labels and rationales.
- Tens of thousands of scholarly articles about COVID-19 and related coronaviruses (Semantic Scholar • 2020). CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
- 83,978 examples sampled from 10 question answering datasets over text, images and databases (AI2 Israel, Question Understanding • 2020). Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases.
- A dataset of 2,985 grade-school level, direct-answer science questions derived from the ARC multiple-choice question set (Aristo • 2020). A dataset of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.
- The largest collection of machine-readable academic papers to date for NLP & text mining (Semantic Scholar • 2019). A large corpus of 81.1M English-language academic papers spanning many academic disciplines, with rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. We aggregated papers from hundreds of academic publishers and digital archives into a unified source, creating the largest publicly-available collection of machine-readable academic text to date.
- 24K QA pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs) (AllenNLP, AI2 Irvine • 2019). Quoref is a QA dataset which tests the coreferential reasoning capability of reading comprehension systems. In this span-selection benchmark containing 24K questions over 4.7K paragraphs from Wikipedia, a system must resolve hard coreferences before selecting the appropriate span(s) in the paragraphs for answering questions.
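To make the QASC explanations entry above concrete: a 2-hop explanation chains two facts to justify an answer, and each chain carries a human validity annotation. The record below is a minimal sketch; the field names and the example question are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record illustrating a 2-hop explanation for a QASC-style
# question. Field names and content are assumed for illustration only.
record = {
    "question": "What do plants need to make food?",
    "answer": "sunlight",
    "fact1": "Plants make food through photosynthesis.",
    "fact2": "Photosynthesis requires sunlight.",
    "valid": True,  # human annotation: do fact1 + fact2 justify the answer?
}

def is_valid_explanation(rec: dict) -> bool:
    """Return the human validity annotation attached to an explanation."""
    return bool(rec["valid"])

print(is_valid_explanation(record))  # True
```

With ~25k valid and ~73k invalid chains, a flag like this is what separates the two annotated subsets.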
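The hasPart KB entry above describes ~50k whole/part relationships. A sketch of how such triples might be represented and queried, with field names that are assumptions rather than the KB's actual format:

```python
# Hypothetical hasPart triples; keys and confidence scores are
# illustrative assumptions, not the KB's actual schema.
triples = [
    {"whole": "car", "part": "engine", "confidence": 0.97},
    {"whole": "car", "part": "wheel", "confidence": 0.95},
    {"whole": "tree", "part": "leaf", "confidence": 0.93},
]

def parts_of(whole: str, kb: list) -> list:
    """Return all parts recorded for a given whole entity."""
    return [t["part"] for t in kb if t["whole"] == whole]

print(parts_of("car", triples))  # ['engine', 'wheel']
```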
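The Break entry above centers on QDMRs: a question is decomposed into ordered steps, where later steps refer back to earlier ones with markers like `#1`. The example question and step wording below are assumptions for illustration; only the `#n` back-reference convention follows the QDMR format.

```python
import re

# Hypothetical QDMR-style decomposition. The question and step wording
# are illustrative; "#n" back-references follow the QDMR convention.
example = {
    "question": "What is the capital of the country with the largest population?",
    "decomposition": [
        "return countries",
        "return population of #1",
        "return #1 where #2 is highest",
        "return capital of #3",
    ],
}

def referenced_steps(step: str) -> list:
    """Extract the indices of earlier steps a QDMR step refers to."""
    return [int(m) for m in re.findall(r"#(\d+)", step)]

print(referenced_steps(example["decomposition"][2]))  # [1, 2]
```

Chasing these references yields the dependency graph of the decomposition, which is what makes QDMRs usable as an intermediate representation across text, image, and database QA.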