Open Language Model: OLMo

A State-of-the-Art, Truly Open LLM and Framework

Open Language Model (OLMo), the AI2 LLM framework, is intentionally designed to provide access to the data, training code, models, and evaluation code necessary to advance AI through open research and to empower academics and researchers to study the science of language models collectively.

OLMo and its framework include:

  • Full pretraining data: The model is built on AI2’s Dolma dataset, which features a three-trillion-token open corpus for language model pretraining, including the code that produces the training data.
  • Training code and model weights: The OLMo framework includes full model weights for four model variants at the 7B scale, each trained to at least 2T tokens. Inference code, training metrics and training logs are all provided.
  • Evaluation: We’ve released the evaluation suite used in development, including evaluation code under the umbrella of the Catwalk project, along with 500+ checkpoints per model from every 1000 steps of the training process.

Each model comes with the following:

  • The full training data used for these models, drawn from AI2’s Dolma, including the code that produces the training data and WIMBD for analyzing pretraining data.
  • Full model weights, training code, training logs, training metrics in the form of Weights & Biases logs, and inference code.
  • 500+ checkpoints per model, taken every 1000 steps during the training process and available as revisions on Hugging Face (see the loading sketch after this list).
  • Evaluation code under the umbrella of AI2’s Catwalk and Paloma.
  • Fine-tuning code and adapted models (with Open Instruct).
  • All code, weights, and intermediate checkpoints are released under the Apache 2.0 License.
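
As an example of working with these artifacts, the sketch below shows one way to pull an intermediate checkpoint from the Hugging Face Hub by revision. It assumes the `allenai/OLMo-7B` repository and the `ai2-olmo` package; the revision string is an illustrative placeholder, so check the model card for the exact revision names.

```python
# A minimal sketch of loading an intermediate OLMo checkpoint by revision.
# The repository ID and revision string below are assumed examples; the
# model card lists the actual revision names.
import hf_olmo  # from `pip install ai2-olmo`; registers OLMo with transformers
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

# Intermediate checkpoints are published as branches (revisions) of the repo.
refs = list_repo_refs("allenai/OLMo-7B")
print([branch.name for branch in refs.branches][:5])

# Load one specific training checkpoint by its revision name.
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B", revision="step1000-tokens4B"  # assumed example revision
)
```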

What OLMo provides researchers and developers

  • More Precision:

    With full insight into the training data behind the model, researchers can work faster and no longer need to depend on qualitative assumptions about model performance.
  • Less Carbon:

    Opening the full training and evaluation ecosystem radically reduces developmental redundancies, which is critical for decarbonizing AI.
  • Lasting results:

    Keeping models and their datasets in the open and not behind APIs enables researchers to learn from and build on previous models and work.

Now is the time for truly open AI research

“I’m enthusiastic about getting OLMo into the hands of AI researchers,” said Eric Horvitz, Microsoft’s Chief Scientific Officer and a founding member of the AI2 Scientific Advisory Board. “The new offering continues Allen AI's tradition of providing valuable open models, tools, and data, which have spurred numerous advancements in AI across the global community.”

“Open foundation models have been critical in driving a burst of innovation and development around generative AI,” said Yann LeCun, Chief AI Scientist at Meta. “The vibrant community that comes from open source is the fastest and most effective way to build the future of AI.”

Data: Dolma

Introducing Dolma, the OLMo pretraining dataset. Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It is generally available for download from the Hugging Face Hub and is the largest open dataset to date for LLM training.
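
Because Dolma is hosted on the Hub, it can be inspected without downloading the full corpus by streaming it with the `datasets` library. Here is a minimal sketch, assuming the `allenai/dolma` dataset ID and a `text` field per record, and that you have accepted the dataset's license terms and are logged in to the Hub.

```python
# A minimal sketch of streaming Dolma rather than downloading it in full.
# Dataset ID and field name are assumptions; see the dataset card for details.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Peek at the first few documents without materializing the corpus on disk.
for i, record in enumerate(dolma):
    print(record["text"][:200])
    if i == 2:
        break
```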

Evaluation: Paloma

Paloma is a benchmark for evaluating open language models across many different domains (ranging from niche artist communities to Reddit forums on mental health). We have already evaluated several models, including six 1-billion-parameter baseline models that we trained on different popular corpora (such as Dolma), to understand how language model performance varies across 585 different domains. We encourage the community to run our standardized inference code on additional models and submit their results to extend our benchmark.
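
To give a sense of what such an evaluation measures, the sketch below computes a rough per-domain perplexity with a `transformers` causal LM. It is an illustration only, not the official Paloma harness; the model ID is an assumed example, and Paloma's actual protocol (data splits, aggregation, decontamination checks) lives in the released evaluation code.

```python
# Illustrative per-domain perplexity, not the official Paloma evaluation code.
# Model ID is an assumed example; any causal LM loadable via transformers works.
import math
import torch
import hf_olmo  # from `pip install ai2-olmo`; lets transformers load OLMo checkpoints
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/OLMo-1B"  # assumed example repository ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def domain_perplexity(documents):
    """Exponentiated mean cross-entropy over the documents of one domain."""
    losses = []
    for text in documents:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))

print(domain_perplexity(["An example document from one of the 585 domains."]))
```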

Getting Started with OLMo
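
A minimal way to try the model, assuming the `ai2-olmo` package (`pip install ai2-olmo`) and the `allenai/OLMo-7B` repository on the Hugging Face Hub; this is a quick-start sketch rather than official installation instructions.

```python
# A quick-start sketch: load OLMo through transformers and generate text.
# Repository ID and generation settings are illustrative choices.
import hf_olmo  # registers the OLMo architecture with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")

prompt = "Language modeling is"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```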


Questions? Contact us.

For questions or feedback you can reach us at olmo at allenai dot org or open an issue on GitHub!

This work was made possible by our awesome partners!

OLMo would not be possible without the collaborative efforts of AMD, CSC - IT Center for Science (Finland), Mosaic/Databricks, the Kempner Institute at Harvard University, and the University of Washington. Additional thanks to EleutherAI, Meta, Stanford CRFM, TogetherAI, and HuggingFace.