About OLMo

AI2 is embarking on the creation of an open, state-of-the-art generative language model: AI2 OLMo (Open Language Model). At 70 billion parameters, OLMo will be comparable in scale to other state-of-the-art large language models, and it is expected in early 2024.

OLMo will be a uniquely open language model intended to benefit the research community by providing access and education around all aspects of model creation. It will give many people in the AI research community a new avenue to work directly on language models for the first time. We will be making all elements of the OLMo project accessible: not only will our data be available, but so will the code used to create the data. We will release the model, the training code, the training curves, and evaluation benchmarks. We will also openly share and discuss the ethical and educational considerations around the creation of this model to help guide the understanding and responsible development of language modeling technology.

This broad availability of all aspects of OLMo will allow the research community to directly take what we create and work to improve it. We believe that millions of people want to better understand and engage with language models, and we aim to create an environment where they actually can, leading to faster and safer progress for everyone. Our goal is to collaboratively build the best open language model in the world.


Licensing

The artifacts that AI2 creates as part of the OLMo initiative will be released under the new AI2 ImpACT License.

Artifacts

Data

Dolma

Introducing Dolma, the OLMo pretraining dataset. Dolma is an open dataset of 3 trillion tokens drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It is generally available for download from the HuggingFace Hub and is distributed under AI2's ImpACT license. It is the largest open dataset to date for LLM training.

Recent Updates

AI2 Unveils Dolma: A 3 Trillion Token Corpus Pioneering Transparency in Language Model Research

Marktechpost
August 23, 2023

AI2 drops biggest open dataset yet for training language models

TechCrunch
August 18, 2023

Allen Institute for AI takes new approach to managing AI risks and promoting transparency

GeekWire
August 7, 2023

Allen Institute for AI Announces OLMo: An Open Language Model Made By Scientists For Scientists

Marktechpost
May 21, 2023

Allen Institute for AI Unveils AI2 OLMo, An Open Source Language Model

Analytics India Magazine
May 12, 2023

Allen Institute for AI creating an open generative AI language model ‘by scientists, for scientists’

GeekWire
May 11, 2023

AI2 is developing a large language model optimized for science

TechCrunch
May 11, 2023

Collaborators

AI2 is working with several organizations and collaborators to make OLMo possible.