OLMo 2 32B: First fully open model to outperform GPT 3.5 and GPT 4o mini
March 13, 2025
Today we release OLMo 2 32B, the most capable and largest model in the OLMo 2 family, scaling up the OLMo 2 training recipe used for our 7B and 13B models released in November. It is trained on up to 6T tokens and post-trained using Tülu 3.1. OLMo 2 32B is the first fully open model (all data, code, weights, and details are freely available) to outperform GPT-3.5 Turbo and GPT-4o mini on a suite of popular, multi-skill academic benchmarks. It is comparable to the leading open-weight models while requiring only a fraction of their training compute: for example, OLMo 2 32B costs only one third as much to train as Qwen 2.5 32B while reaching similar performance. The OLMo 2 family is now available in 7B, 13B, and 32B parameter sizes; all of these models can be finetuned on a single H100 GPU node, and all are available in the Ai2 playground.
- State-of-the-art results: OLMo 2 32B matches or outperforms GPT-3.5 Turbo, GPT-4o mini, Qwen 2.5 32B, and Mistral 24B, and approaches Qwen 2.5 72B and Llama 3.1 and 3.3 70B.
- Fully open, end-to-end recipe for language model training: Our pretraining and post-training team worked together to create a single, turn-key recipe with all ingredients openly available: data, training recipe, methodology, and software. Finally, AI researchers and developers can seamlessly build and customize a state-of-the-art pipeline for their project or application.
- Improved data and efficient pretraining: We trained OLMo 2 32B on the same resource-efficient pretraining and mid-training mixes released in November. For OLMo 2 32B, we rewrote our training codebase to create OLMo-core. This new framework better supports larger models, different training paradigms, and modalities beyond just text. It was designed to be efficient on modern hardware.
- Refined post-training and RLVR: Our models integrate our latest breakthrough in reinforcement learning with verifiable rewards (RLVR) as part of the Tülu 3.1 recipe, using Group Relative Policy Optimization (GRPO) and improved training infrastructure to further enhance their capabilities.
- Fostering open, scientific research and development: OLMo models enable researchers to study critical areas such as pretraining dynamics, the impact of data on model behavior, and how the different stages of training (pretraining, midtraining, finetuning) affect each other, along with any other work that requires access to all artifacts in the model pipeline. OLMo 2 32B is already supported in Hugging Face's Transformers library and in vLLM's main branch.
- Training infrastructure: OLMo 2 32B was trained on Augusta, a 160-node AI Hypercomputer provided by Google Cloud Engine. Each node has 8 H100 GPUs, and the nodes are connected with GPUDirect-TCPXO interconnect. Over the course of the training run, we sustained over 1800 tokens per second per GPU (~38% MFU).
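As a rough sanity check on those numbers, the standard 6ND FLOPs-per-token approximation together with an assumed H100 dense BF16 peak of ~989 TFLOP/s (a figure not stated in this post) puts 1800 tokens/second/GPU at roughly 35% MFU; the reported ~38% is consistent once attention FLOPs, which the 6ND approximation ignores, are counted.

```python
# Back-of-the-envelope MFU check. Assumptions (not from the post): the 6*N*D
# FLOPs-per-token approximation and an H100 dense BF16 peak of ~989 TFLOP/s.
params = 32e9                    # OLMo 2 32B parameter count
tokens_per_sec_per_gpu = 1800    # reported throughput
peak_flops = 989e12              # assumed H100 SXM peak, FLOP/s

achieved_flops = 6 * params * tokens_per_sec_per_gpu   # ~3.5e14 FLOP/s per GPU
print(f"approximate MFU: {achieved_flops / peak_flops:.0%}")   # ~35%
```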
Training OLMo 2 32B
We develop OLMo 2 32B in two broad phases: base model and instruct model. The base model is trained in two stages, commonly referred to as pretraining and midtraining, which together cover over 90% of the total training budget. The instruct model is derived from the base model using a series of techniques, namely supervised finetuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR), that collectively form our post-training strategy.
Pre-training
We train on OLMo-Mix-1124, a collection of 3.9 trillion tokens sourced from DCLM, Dolma, Starcoder, and Proof Pile II. Each OLMo base model is trained on successively more tokens with larger model size:
- OLMo 2 7B is trained for one epoch, up to 4T tokens,
- OLMo 2 13B is trained for 1.3 epochs, up to 5T tokens,
- OLMo 2 32B is trained for 1.5 epochs, up to 6T tokens.
Mid-training
We train on a curated collection called Dolmino, containing 843 billion tokens from several sources:
- High-quality documents re-sampled from the OLMo-Mix-1124 using a quality filter;
- Documents containing educational, math, or academic content that were not included in OLMo-Mix-1124;
- Instruction-tuning data, both synthetic and human generated, some of which is from the Tulu 3 Mix.
We perform the mid-training for each OLMo 2 model on Dolmino in different ways:
- For OLMo 2 7B, we continued training the final stage 1 checkpoint on a 50B-token subsample while linearly annealing the learning rate to zero. We performed this process three times, each with a different data order over the same 50B tokens, and averaged the three final checkpoints, a process called model souping (Wortsman et al., 2022).
- For OLMo 2 13B and 32B, we repeated this process with two samples of 100B and 300B tokens: we created three checkpoints from the 100B sample, each with a different data order, and one checkpoint from the 300B sample, then souped these four model variants, as sketched below.
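Model souping here is simply an elementwise average of the annealed checkpoints' weights. A minimal sketch with PyTorch, assuming each checkpoint is a plain state_dict file with identical keys (the filenames are hypothetical):

```python
import torch

def soup(checkpoint_paths):
    """Average the parameters of checkpoints that share the same architecture and keys."""
    souped = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if souped is None:
            souped = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                souped[k] += v.float()
    n = len(checkpoint_paths)
    return {k: v / n for k, v in souped.items()}

# Hypothetical filenames for three anneals over the same data with different orders.
souped_state = soup(["anneal_order0.pt", "anneal_order1.pt", "anneal_order2.pt"])
```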
Post-training
OLMo 2 32B’s post-training recipe follows Tülu 3 closely, as done with the 7B and 13B models. Tülu 3 uses a three-stage approach to training: supervised finetuning on high-quality instructions, preference tuning with on-policy preference data, and reinforcement learning with verifiable rewards (RLVR). The results for OLMo 2 32B Instruct place it in line with state-of-the-art open-weight models in its size class and make it competitive with smaller closed API models.
In training OLMo 2 32B Instruct we made some minor modifications to the recipe. Specifically, we:
- Filtered out instructions from the SFT dataset and the chosen responses of the preference data that included mentions of a date cutoff from the synthetic data generation process. This resulted in new versions of the instruction dataset, Tulu 3 SFT Mixture 0225, and the preference dataset, OLMo-2-32B-pref-mix-0325.
- Used majority voting to improve the quality of answers to our synthetic math questions. For the Persona MATH and Grade School Math datasets from Tülu 3, we only include prompts and completions where the model reaches a majority vote across 5 completions (see the sketch after this list). New versions of the math and grade school math datasets are available.
- Applied the RLVR training stage on GSM8K, IFEval, and MATH prompts, as with our previous OLMo 2 models. For the 32B, we trained similarly to our latest model, Llama 3.1 8B Tülu 3.1, with Group Relative Policy Optimization (GRPO). We have released intermediate RL checkpoints on Hugging Face so researchers can study the impact of RLVR on instruction models.
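The majority-voting filter mentioned above can be sketched as follows: sample several completions per prompt, check whether a strict majority of the extracted answers agree, and keep the prompt with a matching completion only if they do. This is an illustrative sketch, not the exact Tülu pipeline; extract_answer stands in for whatever answer-parsing logic is used.

```python
from collections import Counter

def majority_vote_filter(prompt, completions, extract_answer, k=5):
    """Keep (prompt, completion) only if a strict majority of k sampled answers agree."""
    answers = [extract_answer(c) for c in completions[:k]]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    if votes <= k // 2:          # no strict majority -> drop this prompt entirely
        return None
    # Return the prompt paired with one completion that carries the majority answer.
    for completion, a in zip(completions, answers):
        if a == answer:
            return prompt, completion
    return None
```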
For more details on the evaluation setting, see the Tülu 3 or OLMo 2 papers. For now, note the following details:
- The 7B and 13B OLMo 2 Instruct checkpoints show improved safety numbers relative to their original release, which was affected by a bug in our evaluation tooling.
- The evaluations for GPT-3.5 and GPT-4o-mini were taken from the Tülu 3 report and still represent the latest models OpenAI has available.
- For some models, such as GPT-4o-mini and Qwen-32B-Instruct, there are formatting issues causing major drops in evaluations such as DROP. In future work we plan to improve our evaluation suite to move away from few-shot evaluations where formatting can cause substantial variations across peer models.
- For Qwen QwQ 32B we conducted evaluations by removing the thinking tokens and grading the following answer. It was evaluated with the recommended sampling parameters in the model card (32K context length, sampling with temperature 0.6, top_p 0.95, min_p 0, top_k 30) for all evaluations except safety, which used a shorter context length of 8K tokens. Multiple-choice evaluations, PopQA, and TruthfulQA had challenges with answer extraction, where the model would return the answer within the <think> tokens, so we did not report a score. Even outside of extraction issues, the very long context generation of reasoning models has caused challenges for many pieces of open evaluation tooling, which we need to improve.
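Removing the thinking tokens before grading amounts to stripping the <think>…</think> block and scoring only the text that follows; when the model places its final answer inside that block (the failure mode noted above), nothing is left to grade. A minimal sketch of this preprocessing step, assuming QwQ's <think> tag convention:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> reasoning (or an unterminated <think> block) before grading."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)   # closed blocks
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)            # unterminated block
    return text.strip()

assert strip_thinking("<think>work it out...</think>The answer is 42.") == "The answer is 42."
assert strip_thinking("<think>The answer is 42, but I never closed the tag") == ""  # nothing gradable
```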
We’re excited to continue to push the limits of our post-training recipe with RLVR and move further into the space of open-source reasoning models.
OLMo-core Trainer
OLMo-core is the result of a major overhaul of our first-generation pretraining codebase to better support larger models, different training paradigms, and modalities beyond just text. It was designed to be highly efficient on modern hardware, scalable through 4D+ parallelism, and flexible enough to handle a variety of use cases.
Efficiency
- Asynchronous distributed checkpointing: checkpoint state is first copied to CPU and then asynchronously saved to disk and/or uploaded to cloud storage while training continues. Under the hood we use PyTorch's distributed.checkpoint API, which ensures each rank in the distributed process group only saves its local shard of the model's parameters and corresponding optimizer state, effectively parallelizing the saving process. Moreover, the storage format allows us to seamlessly load those checkpoints with a different distributed topology. A minimal sketch follows this list.
- Minimal host-device syncs: ideally there would be no synchronization points throughout the training loop, but certain operations like logging the loss require transferring data from GPU to CPU, and therefore a host-device sync. OLMo-core minimizes host-device syncs like this by keeping a small running buffer of metrics like the loss on CUDA, and then periodically flushing that buffer to CPU on a configurable interval to perform logging operations.
- PyTorch best practices: we effectively use new PyTorch features like compile, FSDP2, and selective activation checkpointing following the best practices from torchtitan.
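Here is a minimal sketch of sharded asynchronous checkpointing with torch.distributed.checkpoint (dcp.async_save is available in recent PyTorch releases); the trainer wiring is illustrative and not OLMo-core's actual code:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

def save_checkpoint_async(model, optimizer, step: int, save_dir: str):
    """Start a sharded checkpoint in the background: each rank writes only its own
    shard of parameters and optimizer state, and training continues immediately."""
    model_state, optim_state = get_state_dict(model, optimizer)
    future = dcp.async_save(
        {"model": model_state, "optim": optim_state},
        checkpoint_id=f"{save_dir}/step{step}",  # local path; cloud uploads need a custom storage writer
    )
    return future  # wait on (or poll) this future before kicking off the next checkpoint
```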
Scalability
- 4D+ parallelism: OLMo-core supports pipeline parallelism, data parallelism (through DDP, FSDP, or HSDP), context parallelism, tensor parallelism, and expert parallelism for MoE models.
- Fine-grained activation checkpointing: the library has a highly configurable activation checkpointing API to allow for fine-grained strategies that can target any submodule or even specific operations. We took advantage of this to tune the configuration of the 32B to different cluster sizes throughout the training run.
Flexibility
- Modular design: OLMo-core has a modular design that makes it easy to plug-in custom model components, optimizers, learning rate schedules, data loaders, etc. The trainer itself makes no assumptions about the model architecture, parallelism strategy, or data format, and has a flexible callback API for simple customization of the training loop.
- Unified API for local and remote filesystems: the trainer knows when its save folder is a cloud storage URL instead of a local path and automatically uploads checkpoints directly to the remote folder. Similarly the data loader will seamlessly stream data from cloud storage. This allows us to move training jobs between clusters with minimal effort.
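The local-vs-remote dispatch can be illustrated with a small helper that branches on the URL scheme of the save folder; this is just a sketch of the pattern (upload_to_cloud is a hypothetical helper), not OLMo-core's actual implementation:

```python
import shutil
from urllib.parse import urlparse

def is_remote(folder: str) -> bool:
    """Treat URLs like gs://bucket/run1 or s3://bucket/run1 as remote; bare paths as local."""
    return urlparse(folder).scheme in {"gs", "s3", "http", "https"}

def persist_checkpoint(local_tmp_dir: str, save_folder: str) -> None:
    if is_remote(save_folder):
        upload_to_cloud(local_tmp_dir, save_folder)  # hypothetical backend-specific upload helper
    else:
        shutil.copytree(local_tmp_dir, save_folder, dirs_exist_ok=True)
```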
Training on Google Cloud Engine
As with any large training run, the team continued to tweak and improve the setup as training progressed. The Augusta cluster, which we used for OLMo 2 32B, offered unique opportunities to investigate and fix issues like the ones below in collaboration with Google engineering support.
Underperforming compute nodes
With Google engineering support, we investigated underperforming nodes, i.e., nodes that don’t produce errors, but run slower than the others. Since the whole job runs at the speed of the slowest components, these nodes would have a disproportionate impact on performance. With the right tooling, identifying them is quick, and replacing the host fixes the issue within minutes. Google is working on making tooling for this generally available.
Network topology
Our cluster management software, Beaker, does not take the topology of the underlying hardware into account when it assigns nodes to jobs. Google engineers advised us that jobs will run faster and more reliably if we make sure adjacent NCCL ranks are as close as possible to each other in the network. We implemented this change, and saw a drastic difference in the way the training job uses the network.
Asynchronous checkpointing improvements
When it is time to write a checkpoint, the trainer creates a copy of the model in CPU memory, hands it off to the checkpointer, and continues training while the checkpointer uploads the checkpoint. This saves a lot of time during which GPUs would otherwise be idle. Over the course of training, we realized that even with asynchronous checkpointing, there is still a performance impact. Google Storage allows such fast uploads that uploading checkpoints from 1024 processes at the same time slowed down the cluster management network, and training made very little progress. To mitigate this problem, we a) reduced the frequency of checkpointing, and b) throttled the speed at which we upload to Google Storage.
This kind of change is not captured in common efficiency metrics like MFU, but when taking a training job in aggregate, it makes a big difference: in our case, this change alone improved throughput by 30%. OLMo-core logs both instantaneous throughput and overall throughput since the beginning of the job (see the sketch below). Seeing the large gap between the two made it easy to identify this issue.
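Tracking both numbers takes only a few lines; the sketch below shows the pattern (it is not OLMo-core's actual metrics code). A persistent gap between the instantaneous and overall figures points at periodic stalls such as checkpoint uploads.

```python
import time

class ThroughputTracker:
    """Track instantaneous (since the last log) and overall (since job start) tokens/sec."""

    def __init__(self):
        self.start = self.last = time.monotonic()
        self.total_tokens = 0
        self.tokens_since_last = 0

    def update(self, batch_tokens: int) -> None:
        self.total_tokens += batch_tokens
        self.tokens_since_last += batch_tokens

    def log(self):
        now = time.monotonic()
        instantaneous = self.tokens_since_last / (now - self.last)
        overall = self.total_tokens / (now - self.start)
        self.last, self.tokens_since_last = now, 0
        return instantaneous, overall
```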
Hybrid sharding
Most large-scale trainers use pipeline parallelism to shard the model across GPUs, but OLMo uses ZeRO-3 / FSDP. For GPU counts above 256, FSDP is known to compromise performance. Instead we use hybrid sharding, where the model gets sharded across smaller groups of 256 GPUs (or fewer, depending on model size), and several of these groups work together using data parallelism. OLMo-core used a version of PyTorch that had a bug preventing this configuration from working. Fortunately, PyTorch 2.6 fixed the bug, and switching versions gave us an immediate speed-up of around 20%.
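Hybrid sharding maps naturally onto a 2D device mesh: parameters are sharded within each group and replicated across groups, which is the HSDP configuration mentioned under Scalability above. A sketch with PyTorch's FSDP2-style fully_shard API (module paths vary slightly across PyTorch versions; the group size and wiring are illustrative, not the exact OLMo-core configuration):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2; under torch.distributed._composable.fsdp in older releases

def apply_hybrid_sharding(model: torch.nn.Module, world_size: int, shard_group_size: int = 256):
    """Shard parameters within groups of `shard_group_size` GPUs and replicate across groups."""
    mesh = init_device_mesh(
        "cuda",
        (world_size // shard_group_size, shard_group_size),
        mesh_dim_names=("replicate", "shard"),
    )
    # With a 2D mesh, fully_shard shards along the "shard" dim and replicates along "replicate" (HSDP).
    fully_shard(model, mesh=mesh)
    return model
```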
Switching PyTorch versions during a run is not recommended because it runs the risk of unexpected side effects. In our case, the calculation of z-loss in the new version is subtly different (likely due to precision issues), and z-loss shows a slightly different trend after the switch. However, we decided that the speed-up was worth the risk and continued training.
Faster NCCL Collectives
Towards the end of our training run, Google engineers made a new version of their NCCL drivers available to us. These drivers (together with a backend change) enable the NCCL LL128 algorithm, which promises large speedups at the message sizes OLMo 2 32B used during training. By itself this would likely not result in an immediate end-to-end speed-up, since OLMo training is already as fast as the GPUs allow, but it gives us more flexibility for future training runs with regard to batch sizes and recomputation.
Artifacts
- Demo
- OLMo 2 Hugging Face Collection
- OLMo-2-0325-32B
- OLMo-2-0325-32B-Instruct
- OLMo-2-0325-32B-SFT
- OLMo-2-0325-32B-DPO
- Pretraining dataset: OLMo-mix-1124
- Mid-training dataset: Dolmino-Mix-1124
- Post-training dataset: Tülu 3 SFT Mix (updated)
- Preference data for OLMo 2 32B
- RLVR Mix
Join us on Discord
Join Ai2's Discord server to talk about our OLMoE app or any of our open models, share your projects, and connect with other researchers and developers working on truly open AI.