Open technologies - Open data

Datasets are the foundation of AI models, and their content holds the key to a better understanding of how models work and how to make them more useful, efficient, and safe. We’re committed to creating and sharing open datasets that can help move the field forward.

For a full list of our available datasets, visit us on Hugging Face.

Featured dataset - Dolma

Dolma is a dataset drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

Explore Dolma

WildChat

The WildChat Dataset is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts. It was constructed by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection.

Explore WildChat

Super-NaturalInstructions

1,616 diverse NLP tasks over 76 distinct task types along with expert-written instructions to measure how well NLP models can generalize to a variety of unseen tasks when provided with clear guidance.

Explore Super-NaturalInstructions

Self-Instruct

A framework that helps language models improve their ability to follow natural language instructions by using the model's own generations to create a large collection of instructional data.

Explore Self-Instruct

S2ORC

A large corpus of structured full text for English-language open access academic papers. It is the largest publicly available collection of machine-readable academic text, comprising over 10M documents. It aims to facilitate research and development of tools for text mining over academic text.

Explore S2ORC

S2AG

A collection of over 200M paper titles, abstracts, citations, and other metadata of open-access papers from the Semantic Scholar Academic Graph.

Explore S2AG

HellaSwag

A challenge dataset of questions that are trivial for humans (>95% accuracy) but that state-of-the-art models struggle with (<48%), created through a collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers.

Explore HellaSwag

WinoGrande

WinoGrande is a collection of 44K problems, inspired by the Winograd Schema Challenge but adjusted to improve scale and robustness against dataset-specific bias. It is formulated as a fill-in-the-blank task with binary options: the goal is to choose the right option for a given sentence, which requires commonsense reasoning.

Explore WinoGrande
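
To make the WinoGrande task format concrete, here is a minimal Python sketch of a fill-in-the-blank item with two options and a simple accuracy score. The `BlankItem` class, its field names, the `_` blank marker, and the example sentence are illustrative assumptions, not the dataset's published schema or contents.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BlankItem:
    """Hypothetical container for one binary fill-in-the-blank problem."""
    sentence: str  # contains exactly one "_" marking the blank (assumed convention)
    option1: str
    option2: str
    answer: str    # "1" or "2", naming the correct option


def fill(item: BlankItem, choice: str) -> str:
    """Return the sentence with the blank replaced by the chosen option."""
    option = item.option1 if choice == "1" else item.option2
    return item.sentence.replace("_", option, 1)


def accuracy(items: List[BlankItem], predict: Callable[[BlankItem], str]) -> float:
    """Fraction of items where the predictor picks the labeled option."""
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Made-up example in the style of a Winograd schema, not taken from the dataset.
item = BlankItem(
    sentence="The trophy doesn't fit in the suitcase because the _ is too big.",
    option1="trophy",
    option2="suitcase",
    answer="1",
)

print(fill(item, "1"))
print(accuracy([item], lambda it: "1"))
```

A real model would replace the constant predictor, for example by scoring both filled-in sentences and returning the more plausible one.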

SciRIFF

137K instruction-following demonstrations for 54 scientific literature understanding tasks. The tasks cover five essential scientific literature categories and span five domains.

Explore SciRIFF

KIWI

Instruction data collected for writing paragraph-level answers to multiple document-grounded NLP research questions. It was collected via 234 interactive sessions of NLP experts instructing different language models, culminating in 1.2K interaction turns.

Explore KIWI

CHIME

2.1K LLM-generated hierarchical organizations of medical studies on 472 research topics, with expert-provided corrections for a subset of 100 topics. This data can be used to assess and improve LLM-based tools to assist literature review.

Explore CHIME

SciFact

1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales to support the development of scientific claim verification systems. It’s been used in shared tasks like SCIVER and retrieval benchmarks like BEIR.

Explore SciFact

SciTLDR

5.4K extremely short (<30 words) expert-written summaries of 3.2K scientific papers, used to develop models for single document summarization and to develop the initial version of the TLDR feature on Semantic Scholar.

Explore SciTLDR

Ai2 Reasoning Challenge (ARC)

7,787 genuine grade-school level, multiple-choice science questions partitioned into a Challenge Set and an Easy Set, along with a corpus of over 14 million science sentences relevant to the task. Offered as a challenge to the machine reasoning community.

Explore ARC

DROP

A QA dataset that tests the comprehensive understanding of paragraphs. In this crowdsourced, adversarially-created, 96K question-answering benchmark, a system must resolve multiple references in a question, map them onto a paragraph, and perform discrete operations over them (such as addition, counting, or sorting).

Explore DROP
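
As a rough illustration of the discrete operations DROP questions call for (addition, counting, sorting), the Python sketch below pulls numbers out of a paragraph and applies one operation to them. The paragraph, the helper names, and the regex-based extraction are assumptions made for this example; they are not part of the benchmark or of any particular system evaluated on it.

```python
import re
from typing import Callable, Dict, List


def extract_numbers(paragraph: str) -> List[float]:
    """Collect every integer or decimal literal that appears in the text."""
    return [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", paragraph)]


# Hypothetical operation table covering the three operations named above.
OPERATIONS: Dict[str, Callable[[List[float]], object]] = {
    "add": lambda values: sum(values),
    "count": lambda values: len(values),
    "sort": lambda values: sorted(values),
}


def apply_operation(name: str, values: List[float]) -> object:
    """Look up an operation by name and apply it to the extracted values."""
    return OPERATIONS[name](values)


# Made-up paragraph, not drawn from the dataset.
paragraph = "The home team kicked field goals of 23, 41 and 35 yards."
numbers = extract_numbers(paragraph)

print(apply_operation("add", numbers))    # total yards across the kicks
print(apply_operation("count", numbers))  # how many field goals
print(apply_operation("sort", numbers))   # kicks ordered by distance
```

The hard part of the benchmark is deciding which spans and which operation a question refers to; this sketch only covers the final arithmetic step.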

Qasper

5K information-seeking questions over 1.5K scientific papers. Each question is asked by an expert researcher and answered by a different expert researcher using supporting evidence from the paper's full text. Qasper has been included in long-context benchmarks such as SCROLLS.

Explore Qasper

MS^2

20K biomedical literature review summaries synthesizing information from over 470K studies. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is one of the first large-scale, publicly available multi-document summarization datasets in the biomedical domain.

Explore MS^2

HCI alt texts

3,386 author-written alt texts from HCI publications, of which 547 have been annotated with semantic content. Most figures in scientific papers lack alt text, which harms accessibility. This dataset can be used to build tools for understanding and describing figures, leading to a higher prevalence of alt texts.

Explore HCI alt texts

ALFRED

ALFRED (Action Learning From Realistic Environments and Directives) is a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.

Ask for ALFRED

ProcTHOR

ProcTHOR enables Embodied AI to scale by orders of magnitude by procedurally generating interactive 3D environments.

Explore ProcTHOR

PixMo

PixMo is the training data for our multimodal model Molmo. PixMo includes two broad categories of data: (1) dense captioning data for multimodal pre-training and (2) supervised fine-tuning data for enabling a wide array of user interactions, including behaviors like question answering, document reading, and pointing.

PixMo on Hugging Face