About OLMo
OLMo will be a uniquely open language model intended to benefit the research community by providing access and education around all aspects of model creation. It will give many people in the AI research community their first opportunity to work directly on language models. We will make all elements of the OLMo project accessible: not only will our data be available, but so will the code used to create the data. We will release the model, the training code, the training curves, and evaluation benchmarks. We will also openly share and discuss the ethical and educational considerations around the creation of this model to help guide the understanding and responsible development of language modeling technology.
This broad availability of all aspects of OLMo will allow the research community to directly take what we create and work to improve it. We believe that millions of people want to better understand and engage with language models, and we aim to create an environment where they actually can, leading to faster and safer progress for everyone. Our goal is to collaboratively build the best open language model in the world.
The artifacts that AI2 creates as part of the OLMo initiative will be released under the new AI2 ImpACT License.
Artifacts
Data
Dolma
Introducing Dolma, the OLMo pretraining dataset. Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It is generally available for download from the Hugging Face Hub and is distributed under AI2's ImpACT license. It is the largest open dataset to date for LLM training.
Recent Updates
NLP Highlights Episode 141 - Building an open source LM, with Iz Beltagy and Dirk Groeneveld
June 29, 2023
In this special episode of NLP Highlights, we discussed building and open sourcing language models. What is the usual recipe for building large…
Recent Press
AI2 Unveils Dolma: A 3 Trillion Token Corpus Pioneering Transparency in Language Model Research
August 23, 2023
AI2 drops biggest open dataset yet for training language models
August 18, 2023
Allen Institute for AI takes new approach to managing AI risks and promoting transparency
August 7, 2023
Allen Institute for AI Announces OLMo: An Open Language Model Made By Scientists For Scientists
May 21, 2023
Allen Institute for AI Unveils AI2 OLMo, An Open Source Language Model
May 12, 2023
Allen Institute for AI creating an open generative AI language model ‘by scientists, for scientists’
May 11, 2023
AI2 is developing a large language model optimized for science
May 11, 2023