OLMo 2
OLMo 2 is a family of fully-open language models, developed start-to-finish with open and accessible training data, open-source training code, reproducible training recipes, transparent evaluations, intermediate checkpoints, and more.
What OLMo 2 provides for researchers and developers
Models
Explore the collection of fully-open OLMo 2 models, including both pretrained and instruction-tuned variants.
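For example, released checkpoints can be loaded directly with Hugging Face transformers. The sketch below is a minimal illustration; the repository id `allenai/OLMo-2-1124-7B` is an assumption about the published checkpoint names, so substitute whichever pretrained or instruction-tuned variant you want to explore.

```python
# Minimal sketch: loading an OLMo 2 checkpoint with Hugging Face transformers.
# The repository id below is an assumed checkpoint name; swap in the variant you need.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Fully open language models make it possible to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```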
Data
Download and explore the underlying training data used across all stages: pre-training, mid-training, and post-training. This often-hidden secret sauce behind model capabilities is made freely available to support open scientific research.
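As a hedged illustration, the sketch below streams a single record from each stage's data mix with the Hugging Face datasets library. The repository ids and their default configurations are assumptions about the published dataset names; adjust them to the mixes you actually want to inspect.

```python
# Minimal sketch: peeking at the openly released training data without downloading
# the full corpora. Dataset ids below are assumed repository names for each stage.
from datasets import load_dataset

stage_mixes = {
    "pre-training": "allenai/olmo-mix-1124",        # assumed pre-training mix id
    "mid-training": "allenai/dolmino-mix-1124",     # assumed mid-training mix id
    "post-training": "allenai/tulu-3-sft-mixture",  # assumed post-training SFT mix id
}

# Stream the first record of each stage's mix and report its fields.
for stage, dataset_id in stage_mixes.items():
    stream = load_dataset(dataset_id, split="train", streaming=True)
    record = next(iter(stream))
    print(f"{stage}: {dataset_id}")
    print(f"  fields: {list(record.keys())}")
```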
Training
Use and extend our high-performance training code for OLMo 2, which we rely on internally for high-stakes language model training and experimentation.
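As an illustration of the kind of experimentation the open checkpoints and code are meant to support, the sketch below takes one training step from a released checkpoint using a plain PyTorch loop. This is a stand-in, not the OLMo training codebase itself; the model id and toy batch are assumptions, and a 7B model requires hardware with substantial memory.

```python
# Illustrative sketch (plain PyTorch + transformers, not the OLMo training codebase):
# one causal-LM training step starting from a released checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy batch for illustration; in practice this would come from your own corpus.
batch = tokenizer(["Fully open language models enable reproducible research."],
                  return_tensors="pt")
labels = batch["input_ids"].clone()

outputs = model(**batch, labels=labels)  # causal LM loss over the batch
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {outputs.loss.item():.3f}")
```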
Evaluation
Inspect the code and data used to produce OLMo 2’s results, which we make openly available for scientific reproduction and scrutiny.
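For intuition about what such evaluations involve, the sketch below scores a toy multiple-choice question by comparing the log-probability the model assigns to each answer option. This is an illustrative stand-in, not the OLMo 2 evaluation suite; the model id and the question are assumptions.

```python
# Illustrative sketch of likelihood-based multiple-choice scoring: each option is
# scored by the model's log-probability given the prompt, and the highest-scoring
# option is taken as the prediction. Not the OLMo 2 evaluation code itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

question = "Question: What is the capital of France?\nAnswer:"
options = [" Paris", " Berlin", " Madrid"]

def option_logprob(prompt: str, option: str) -> float:
    """Sum the log-probabilities the model assigns to the option tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

scores = {opt: option_logprob(question, opt) for opt in options}
print(max(scores, key=scores.get))  # expected: " Paris"
```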
Our philosophy
Early work on pretraining language models considered only a single stage of pretraining on trillions of tokens of unstructured text from massive web crawls. Since then, more sophisticated approaches have emerged, such as mid-training, data curricula, and attention to the relationship between training stability and performance, yet most successful models offer limited information on how to employ these techniques. By openly sharing our data, recipes, and findings, we hope to provide the open-source community with the resources needed to discover new and innovative approaches to improve model pretraining.