Open data - Dolma
Dolma, the pretraining dataset for OLMo, is an open dataset built from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. Multiple versions of Dolma are openly available for download from the Hugging Face Hub under the ODC-BY license.
Why Dolma?
Dolma is designed to be truly open. We want everyone to create better versions of this dataset independently, study the relationship between the data and any model trained on it, and critique our curation practices, data artifacts, and models. We will continue to make improvements and add new sources to Dolma.
Recent updates:
- (April 17, 2024) Dolma 1.7: New data for OLMo
- (April 15, 2024) Dolma moves to ODC-BY license
- (August 18, 2023) Introducing Dolma
Dolma 1.7: the latest version
We release Dolma 1.7 alongside OLMo 1.7-7B, with more diverse data sources, more precise filtering, and fuzzy deduplication. The full Dolma 1.7 collection totals 2.3 trillion tokens across all sources. Training on Dolma 1.7 yields significant improvements in OLMo 7B’s performance on downstream benchmarks such as MMLU and GSM8K. Learn more about Dolma 1.7 in our blog post.
Dolma Toolkit
The Dolma toolkit is a high-performance solution for language model dataset curation, released with source code, examples, and documentation. It is portable across computing environments and includes built-in taggers, a fast deduplication tool, and support for extending the pipeline with custom taggers.
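As an illustration, extending the toolkit with a custom tagger amounts to subclassing its `BaseTagger` interface and registering the new class. The sketch below assumes the `dolma` Python package and the `TaggerRegistry`/`Span` interfaces described in the repository documentation; the tagger name, length threshold, and `short` attribute are hypothetical choices for this example.

```python
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger


# "short_doc_tagger_v1" is a hypothetical name; register under any unique string.
@TaggerRegistry.add("short_doc_tagger_v1")
class ShortDocTagger(BaseTagger):
    """Flags documents below a length threshold so a later mixing step can drop them."""

    MIN_CHARS = 200  # illustrative threshold, not a Dolma default

    def predict(self, doc: Document) -> DocResult:
        spans = []
        if len(doc.text) < self.MIN_CHARS:
            # Emit one span covering the whole document; downstream filtering
            # rules can key on the "short" attribute this produces.
            spans.append(Span(start=0, end=len(doc.text), type="short", score=1.0))
        return DocResult(doc=doc, spans=spans)
```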
Learn More - Resources
Dolma research paper
Learn about Dolma’s design principles and construction details, and explore our analyses and experimental results for key insights into data curation practices.
Dolma on Hugging Face
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
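To inspect a few documents without downloading the full corpus, the dataset can be streamed from the Hub with the `datasets` library. A minimal sketch, assuming the repository id `allenai/dolma` and a `v1_7` configuration (check the dataset card for the versions and schema actually published):

```python
from datasets import load_dataset

# Stream records lazily instead of downloading terabytes up front.
# The config name "v1_7" and the "text" field are assumptions; see the
# dataset card on the Hugging Face Hub for the exact schema.
dolma = load_dataset("allenai/dolma", name="v1_7", split="train", streaming=True)

for i, doc in enumerate(dolma):
    print(doc["text"][:200])  # preview the first 200 characters of each document
    if i >= 2:
        break
```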
Open source tools
Dolma is both an open dataset and a high-performance toolkit that enables curation of large datasets for (pre)training ML models. The repository can be found on GitHub.