
Open data - Dolma

Dolma, the pretraining dataset of OLMo, is an open dataset drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. Multiple versions of Dolma are openly available for download from the Hugging Face Hub under the ODC-BY license.

Why Dolma?

Dolma is designed to be truly open. We want everyone to create better versions of this dataset independently, study the relationship between the data and any model trained on it, and critique our curation practices, data artifacts, and models. We will continue to make improvements and add new sources to Dolma.

Recent updates:

Dolma 1.7: the latest version

We release Dolma 1.7 alongside OLMo 1.7-7B, with more diverse data sources, more precise filtering, and fuzzy deduplication. The full Dolma 1.7 collection totals 2.3 trillion tokens across all sources. Training on Dolma 1.7 yields significant improvements in OLMo 7B's performance on downstream benchmarks such as MMLU and GSM8k. Learn more about Dolma 1.7 in our blog post.

Dolma Toolkit

The Dolma toolkit is a high-performance solution for language model dataset curation, with source code, examples, and documentation. The toolkit is portable across computing environments and includes features such as built-in taggers, a fast deduplication tool, and support for extending it with custom taggers.

Learn More - Resources

Dolma research paper

Learn about Dolma’s design principles and construction details. Discover key insights about important data curation practices as you explore our analyses and experimental results.

Dolma on Hugging Face

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
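For a quick look at the data, a minimal sketch like the one below streams a few documents with the Hugging Face `datasets` library. The `v1_7` configuration name is an assumption; the configurations actually published are listed on the dataset card, and depending on your `datasets` version you may need to pass `trust_remote_code=True`.

```python
# Minimal sketch: stream a handful of Dolma documents from the Hugging Face Hub
# without downloading the full dataset. The config name "v1_7" is an assumption;
# see https://huggingface.co/datasets/allenai/dolma for the published configs.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", name="v1_7", split="train", streaming=True)

for i, doc in enumerate(dolma):
    # Each record is a JSON-like dict; the raw document text is under "text".
    print(doc["text"][:200])
    if i >= 2:
        break
```

Streaming avoids downloading the multi-terabyte corpus up front, which is usually the right choice for inspection or small-scale experiments.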

Open source tools

Dolma is both an open dataset and a high-performance toolkit that enables curation of large datasets for (pre)training ML models. The repository can be found on GitHub.