
Introducing Olmo Hybrid: Combining transformers and linear RNNs for superior scaling

March 5, 2026

Ai2


Hybrid language models – architectures that mix transformer attention with linear recurrent layers – have been gaining momentum across the field, with recent efforts from projects like Samba, Nemotron-H, Qwen3-Next, Kimi Linear, and Qwen 3.5. By combining transformers' ability to recall precise details from earlier in a sequence with recurrent layers' efficiency at tracking evolving state, hybrids promise to be both more capable and cheaper to run at long context lengths. But the community has lacked consensus on whether the purported benefits of hybrid architectures justify the cost of scaling them up.

Today we're releasing Olmo Hybrid, a new 7B-parameter fully open model family that provides compelling evidence in favor of hybrid models by showing clear performance gains in a controlled comparison to Olmo 3 7B. Additionally, our report dives deep into explaining why hybrid models outperform transformers via theoretical analysis and scaling experiments. Our new study shows that hybrid architectures are fundamentally more expressive than pure transformers or pure linear RNNs alone, and that this expressivity advantage translates directly to more efficient scaling during pretraining. On MMLU, a widely used benchmark for general knowledge and reasoning, Olmo Hybrid reaches the same accuracy as Olmo 3 using 49% fewer tokens — roughly 2× data efficiency. That means you can train to the same capability with half the data, or train on the same data and get a meaningfully better model.

Before diving into results, it’s worth understanding why we think hybrid architectures are an important direction for language modeling.

The transformer architecture has dominated the field of language modeling since its introduction in 2017. At its core, a transformer processes text using “self-attention,” a mechanism that lets the model look at every preceding word in a sequence simultaneously and decide which words are most relevant to each next-word prediction. The parallelism inherent to their internal computations makes transformers extremely efficient to train on modern hardware, and their ability to directly access any part of the input sequence gives them remarkable in-context recall.

But transformers have limitations. Their attention mechanism scales quadratically with sequence length – processing a sequence twice as long takes four times as much computation – so inference gets increasingly expensive as context grows. And while they excel at recall tasks, transformers don’t naturally represent robust state tracking—the kind of computation where you need to update a running tally or maintain a mental model of changing conditions (for example, the state of a chessboard as players make different moves). Our past theoretical work has explored this.
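
To make the mechanism and its cost concrete, here is a minimal NumPy sketch of single-head causal self-attention. This is an illustrative simplification, not any production implementation: real transformers use learned query/key/value projections and multiple heads.

```python
import numpy as np

def self_attention(x, causal=True):
    """Minimal single-head self-attention. For simplicity the query/key/value
    projections are the identity; real transformers learn them."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (T, T): one score per pair of positions
    if causal:
        mask = np.tril(np.ones((T, T), dtype=bool))  # only attend to the past
        scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x

x = np.random.default_rng(0).normal(size=(8, 4))  # 8 tokens, 4-dim embeddings
out = self_attention(x)
# The (T, T) score matrix is the source of quadratic cost: doubling the
# sequence length quadruples the number of pairwise scores.
```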

Recurrent neural networks take a fundamentally different approach. Instead of looking at the entire sequence at once, an RNN processes text one token at a time, maintaining a hidden "state" that gets updated with each new input. This makes RNNs naturally suited for state tracking, but traditional RNNs are difficult to train at scale because their sequential nature prevents parallelization.
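
A minimal sketch of this recurrence, with random stand-in weight matrices rather than a trained model:

```python
import numpy as np

def rnn_step_through(tokens, W_h, W_x):
    """Classic (nonlinear) RNN: read one token at a time and fold it into a
    single fixed-size hidden state. Each step depends on the previous one,
    which is what blocks parallel training."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:                      # strictly sequential loop
        h = np.tanh(W_h @ h + W_x @ x)    # update the running state
    return h                              # bounded summary of the whole input

rng = np.random.default_rng(0)
d = 4
W_h = 0.1 * rng.normal(size=(d, d))       # random stand-in weights
W_x = 0.1 * rng.normal(size=(d, d))
h_final = rnn_step_through(rng.normal(size=(16, d)), W_h, W_x)
```

Note that whatever the sequence length, the model's entire memory is the single vector `h`.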

Recent work on parallelizable linear RNNs and state-space-style models has revived interest in recurrent approaches by redesigning recurrence to be trainable efficiently. These models scale linearly with sequence length at inference, but because they compress past information into a bounded state, they can struggle with tasks requiring precise recall from earlier in a sequence.
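
The core trick behind such models can be illustrated with a first-order linear recurrence, h_t = a_t * h_{t-1} + b_t (elementwise). Because the update is linear, composing two steps is an associative operation, so every prefix can be computed with a parallel scan in O(log T) rounds instead of a sequential loop. A simplified NumPy sketch, where the gates `a` and inputs `b` are random stand-ins for what a real model would compute from its input:

```python
import numpy as np

def linear_recurrence_seq(a, b):
    """h_t = a_t * h_{t-1} + b_t, one step at a time (h_0 = 0)."""
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def linear_recurrence_scan(a, b):
    """The same recurrence via an associative combine: doing step (a1, b1)
    then step (a2, b2) is equivalent to (a1 * a2, a2 * b1 + b2). A
    Hillis-Steele scan therefore reaches every prefix in O(log T) rounds."""
    a, b = a.copy(), b.copy()
    step = 1
    while step < len(a):
        b[step:] = a[step:] * b[:-step] + b[step:]  # fold in the earlier prefix
        a[step:] = a[step:] * a[:-step]
        step *= 2
    return b

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 0.99, size=(32, 4))  # per-step decay gates in (0, 1)
b = rng.normal(size=(32, 4))              # per-step inputs
```

Both routes produce identical states; only the dependency structure differs, which is what makes the scan version trainable at scale.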

This brings us to hybrid models like Olmo Hybrid, which mix transformer and linear RNN layers to get the benefits of each architecture. Moreover, we show that hybrid models are more expressive than either transformers or linear RNNs in isolation. This theoretical motivation led us to explore scaling up hybrid models, which we found translated to improved pretraining performance relative to Olmo 3.

Olmo Hybrid at a glance

Our hybrid model interleaves transformer layers with Gated DeltaNet layers, a modern linear RNN design that remains parallelizable during training while offering expressive state dynamics. 

We developed Olmo Hybrid through a series of increasingly large experiments: first at 1B scale, where we found that hybrid models consistently beat transformers on bits-per-byte evaluations and iterated on the RNN and hybrid architecture; then at 7B scale, where we confirmed the pattern held and hybrids matched transformer baselines with substantially fewer tokens. The full 6T-token pretraining run confirmed that these gains persist at scale: they appear to be a property of the architecture rather than an artifact of training dynamics.

Olmo Hybrid uses a 3:1 pattern—three DeltaNet sublayers followed by one multihead attention sublayer, repeated throughout the network. That replaces 75% of attention mixing with Gated DeltaNet, giving the model architectural paths for both state tracking (via DeltaNet) and precise recall (via attention), with attention appearing often enough to prevent information from getting “stuck” in a bounded recurrent state.
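
For illustration only, the interleaving can be sketched as a simple layer plan. The function name and layer labels here are hypothetical, not the actual Olmo Hybrid implementation:

```python
def hybrid_layer_plan(n_sublayers,
                      pattern=("deltanet", "deltanet", "deltanet", "attention")):
    """Sketch of a 3:1 interleaving: three linear-RNN sublayers for every
    full-attention sublayer, repeated through the network."""
    return [pattern[i % len(pattern)] for i in range(n_sublayers)]

plan = hybrid_layer_plan(32)
# In a 32-sublayer plan, 24 sublayers mix tokens with DeltaNet and 8 with
# attention, so attention recurs every fourth sublayer.
```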

Olmo Hybrid is a 7B-parameter model pretrained on 6 trillion tokens using the improved data mix from Olmo 3 32B. Training was carried out on 512 GPUs—starting on NVIDIA H100s before migrating to NVIDIA HGX B200s hosted on Lambda's infrastructure roughly halfway through pretraining, making Olmo Hybrid one of the first state-of-the-art fully open models trained on B200s.

Olmo Hybrid closely follows the Olmo 3 blueprint except for the hybrid substitution. Training throughput was matched to Olmo 3—both models train at comparable speeds with similar parameter counts, which suggests the efficiency gains come from the hybrid architecture itself rather than from trading speed for performance.

Improved data and compute efficiency in controlled studies

Olmo Hybrid reaches better performance than Olmo 3 models of the same size with substantially less training data. On MMLU, we see roughly 2× token efficiency: the hybrid model reaches the same accuracy as Olmo 3 using 49% fewer tokens. On a Common Crawl evaluation slice, Olmo Hybrid reaches parity with 35% fewer tokens. And because training throughput is matched between the two architectures, these token savings correspond to proportional reductions in total training compute.
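
For reference, converting "parity with X% fewer tokens" into a multiplicative efficiency factor is a one-line calculation:

```python
def token_efficiency(fraction_fewer):
    """Convert 'reaches parity with X% fewer tokens' into a multiplier:
    training on (1 - X) of the data is 1 / (1 - X) times the efficiency."""
    return 1.0 / (1.0 - fraction_fewer)

mmlu_factor = token_efficiency(0.49)          # about 1.96x, i.e. roughly 2x
common_crawl_factor = token_efficiency(0.35)  # about 1.54x
```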

By the end of pretraining, Olmo Hybrid does noticeably better on a selected set of math and science benchmarks but is slightly worse on coding tasks and general question-answering compared to Olmo 3. After mid-training, those gaps close—Olmo Hybrid outperforms Olmo 3 across every primary evaluation domain, and these gains largely persist after long-context extension. On held-out evaluations not used during Olmo 3 development, the hybrid model posts gains on BBH and MMLU Pro, with small regressions on LBPP and DM Math.

After long-context extension, Olmo Hybrid shows substantial gains over Olmo 3 on RULER, a standard long-context benchmark. At shorter contexts (4k tokens), the hybrid model trails Olmo 3 slightly, but it overtakes at 8k and the gap widens with context length.

We evaluated two approaches to long-context adaptation – YaRN and DRoPE – which allow models to handle longer inputs than they were originally trained on. At 64k context length, Olmo Hybrid with DRoPE scores 85.0 on RULER, compared to 70.9 for Olmo 3 7B with YaRN. Even using the same YaRN method, the hybrid architecture outperforms the transformer baseline, scoring 76.9; with DRoPE, the gains at very long context lengths are particularly striking.

Expressivity and scaling

A common motivation for hybrid models has been inference efficiency at long context lengths. Our results point to a different, more fundamental strength: hybrid models can represent useful computations that neither pure transformers nor pure linear RNNs can easily express alone, and we argue theoretically that this expressivity advantage likely explains the better pretraining scaling we observe in practice.

To more systematically quantify the pretraining efficiency gains of hybrid models, we fit scaling-law curves to compare architectures under matched training conditions. In the unconstrained fit, the point estimates favor Olmo Hybrid over Olmo 3, but uncertainty is large enough that coefficient differences aren’t statistically conclusive. 

These fitted laws also predict that the token-savings factor grows with scale, rising from ~1.3× at 1B parameters to ~1.9× at 70B parameters at a fixed target loss.
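
As a sketch of how such a comparison works, one can invert a Chinchilla-style data term to ask how many tokens each architecture needs to hit a target loss. The coefficients below are made up for illustration; they are not our fitted values:

```python
def tokens_to_reach(loss_target, E, B, b):
    """Invert a Chinchilla-style data term, L(D) = E + B / D**b, to get the
    token count D needed to reach a target loss."""
    return (B / (loss_target - E)) ** (1.0 / b)

E = 1.8  # shared irreducible loss (hypothetical)
baseline = dict(E=E, B=410.0, b=0.280)  # transformer-like data exponent
hybrid = dict(E=E, B=400.0, b=0.285)    # slightly better data exponent

target = 2.1
savings = tokens_to_reach(target, **baseline) / tokens_to_reach(target, **hybrid)
# A small improvement in the exponent b compounds into a sizable token-savings
# factor, and the factor grows as the target loss drops (i.e. with scale).
```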

Why should more expressive models scale better with data? One intuition: as highlighted in many recent analyses of scaling laws, language modeling consists of learning many discrete subtasks, and each subtask is either expressible by the architecture (and eventually gets learned) or inexpressible (and contributes to irreducible loss). If hybrids can express more of the subtasks that appear in natural language, they can lower loss more efficiently per token seen. We formalize this explanation by proving that, under an idealized model of neural scaling laws called the quantization model, increased expressivity indeed translates to more efficient scaling trends.
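
A toy numerical version of this argument, assuming Zipf-distributed subtask frequencies and treating the "expressible fraction" as a free parameter (a deliberate simplification of the actual quantization model):

```python
import numpy as np

def quanta_loss(D, expressible_frac, n_tasks=10_000, threshold=100.0):
    """Toy quantization-model loss. Subtask k has Zipfian frequency p_k.
    A subtask counts as learned once the model has seen about `threshold`
    examples of it (D * p_k >= threshold) AND the architecture can express
    it; every unlearned subtask contributes its frequency to the loss.
    Treating the most frequent tasks as the expressible ones is a
    deliberate simplification."""
    ranks = np.arange(1, n_tasks + 1)
    p = (1.0 / ranks) / np.sum(1.0 / ranks)   # Zipf frequencies
    expressible = ranks <= expressible_frac * n_tasks
    learned = expressible & (D * p >= threshold)
    return float(np.sum(p[~learned]))

tokens = 1e8
loss_more_expressive = quanta_loss(tokens, expressible_frac=0.9)
loss_less_expressive = quanta_loss(tokens, expressible_frac=0.6)
loss_early = quanta_loss(1e6, expressible_frac=0.9)  # same model, less data
```

In this toy setting, loss falls with data for both architectures, but the less expressive one plateaus at a higher floor because more subtasks are simply out of reach.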

What's next

We're continuing to explore the hybrid architecture's potential, including expanded evaluations across generative and reasoning benchmarks and further investigation of inference efficiency advantages at long context lengths.

Alongside the models, we're releasing a technical report covering the main empirical results, the theoretical basis for expressivity benefits described above, scaling-law analysis connecting expressivity to data efficiency, and implementation details including ablations on the hybrid ratio and RNN layer design. We also present comparisons with other open models (hybrid and otherwise) and preliminary investigations into post-training hybrid models.

It's important to note once again that we're not the first to explore this direction. Olmo Hybrid complements other recent hybrid model releases by being closely comparable to Olmo 3 across all aspects of training; the fact that we see dramatic pretraining and mid-training gains over Olmo 3 provides compelling evidence for hybrid models. We think hybrid models represent a promising direction for the field, one grounded in both theoretical insight and empirical results. We encourage you to download Olmo Hybrid, dig into the technical report, and let us know what you find.

This research benefited greatly from the computational resources and technical expertise of Lambda to train Olmo Hybrid. We thank them for their support.
