OLMo: Open Language Model, A State-of-the-Art, Truly Open LLM and Framework
As the world races to deploy AI models that are effective and safe, the demand for open large language models (LLMs) has exploded. The rapid adoption of both open and closed models means that AI capabilities have outpaced our ability to understand how those models are created. AI2 has released OLMo 7B, a truly open, state-of-the-art large language model, together with its pre-training data and training code. This empowers researchers and developers to collectively advance the science of language models using the best open models available.
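For readers who want to try the released model directly, the snippet below is a minimal sketch of loading OLMo 7B through the Hugging Face transformers library. The hub ID allenai/OLMo-7B and this loading path are assumptions drawn from common release practice, not details stated in this post; consult the official OLMo repository for the supported workflow.

```python
# Minimal sketch: loading the released OLMo 7B weights via transformers.
# The hub ID "allenai/OLMo-7B" is an assumption, not stated in the post.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)

prompt = "Language modeling is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```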
OLMo is built on AI2’s Dolma dataset, an open, three-trillion-token corpus for language model pretraining drawn from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In an accompanying paper, the researchers document Dolma, including its design principles, the details of its construction, and a summary of its contents. They also open-source a high-performance curation toolkit that can reproduce Dolma and curate other datasets.
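Because a three-trillion-token corpus is far too large to download casually, streaming a few documents is the natural way to inspect Dolma. The sketch below assumes the corpus is mirrored on the Hugging Face Hub under allenai/dolma with a per-document text field; both the hub ID and the field name are assumptions rather than details from the paper.

```python
# Minimal sketch: streaming a few Dolma documents without a full download.
# The hub ID "allenai/dolma" and the "text" field are assumptions.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)
for i, doc in enumerate(dolma):
    print(doc["text"][:200])  # preview the first 200 characters of each document
    if i >= 2:
        break
```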
AI2’s new open-source LLM may reset the definition of ‘open AI’
"Being a researcher in the AI field and just working with APIs or closed models is like being an astronomer trying to research the Solar System and only having access to pictures of it from the newspaper,” says Hanna Hajishirzi, Senior Director of AllenNLP and one of the primary researchers behind OLMo. Open research will remove silos and improve efficiency in the AI research community.