
Open data - Dolma

Dolma, the pretraining dataset of OLMo, is an open dataset drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. Multiple versions of Dolma are openly available for download from the Hugging Face Hub under the ODC-BY license.

Why Dolma?

Dolma is designed to be truly open. We want everyone to create better versions of this dataset independently, study the relationship between the data and any model trained on it, and critique our curation practices, data artifacts, and models. We will continue to make improvements and add new sources to Dolma.

Recent updates:

Dolma 1.7: the latest version

We release Dolma 1.7 alongside OLMo 1.7-7B, with more diverse data sources, more precise filtering, and fuzzy deduplication. The full Dolma 1.7 collection totals 2.3 trillion tokens across all sources. Training on Dolma 1.7 yields significant improvements in OLMo 7B's performance on downstream benchmarks such as MMLU and GSM8k. Learn more about Dolma 1.7 in our blog post.

Dolma Toolkit

The Dolma toolkit is a high-performance solution for language model dataset curation, with source code, examples, and documentation. The toolkit is portable across computing environments and includes features such as built-in taggers, a fast deduplication tool, and support for extending it with custom taggers.

Learn More - Resources

Dolma research paper

Learn about Dolma’s design principles and construction details. Discover key insights about important data curation practices as you explore our analyses and experimental results.

Dolma on Hugging Face

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
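For a quick look at the data, a minimal sketch like the one below streams a few documents with the Hugging Face `datasets` library. The `v1_7` configuration name is an assumption; the configurations actually published are listed on the dataset card, and depending on your `datasets` version you may need to pass `trust_remote_code=True`.

```python
# Minimal sketch: stream a handful of Dolma documents from the Hugging Face Hub
# without downloading the full dataset. The config name "v1_7" is an assumption;
# see https://huggingface.co/datasets/allenai/dolma for the published configs.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", name="v1_7", split="train", streaming=True)

for i, doc in enumerate(dolma):
    # Each record is a JSON-like dict; the raw document text is under "text".
    print(doc["text"][:200])
    if i >= 2:
        break
```

Streaming avoids downloading the multi-terabyte corpus up front, which is usually the right choice for inspection or small-scale experiments.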

Open source tools

Dolma is both an open dataset and a high-performance toolkit that enables curation of large datasets for (pre)training ML models. The repository can be found on GitHub.