OlmPool: How small architectural choices compound to undermine long context extension
April 23, 2026
Ai2
Most language models are trained on short sequences of text – measured in tokens, the word-sized fragments that models use as their basic unit of input – and then taught to handle much longer inputs through additional training on longer documents, a process called context extension. A large share of the published work on extending context has been developed and validated on Llama-family models, partly because of Llama's popularity, but also because Llama 3 happens to extend very easily.
But since the pretraining data behind Llama 3 is proprietary, it has been difficult to tell whether that ease of extension comes from architectural decisions, the training data, or both. This matters because researchers building on other architectures have had to assume that the same extension recipes will transfer.
In this work, we show that they often do not, and that architecture is a primary driver of how well a model handles long context after extension. Four architectural choices – each present in at least one of the Olmo, Llama, or Qwen model families – have a compounding negative effect on long context performance. Any one of these choices alone has a minor impact. Combining three or more can drop scores on long context benchmarks by up to 47%. To study this, we developed OlmPool, a controlled suite of 26 7B models that isolates these architectural differences. We're releasing the whole suite with full checkpoints before and after context extension.
The four architectural choices
Each model in OlmPool was pretrained for 140 billion tokens on the same data, then extended to 64K context using the same long context data mix and procedure. The only thing that varies across models is the architecture; the total cost of constructing OlmPool is approximately 160,000 GPU hours of training.
All four design decisions we tested affect attention, the mechanism that determines which parts of the input the model focuses on when making a prediction. As inputs get longer, the demands on attention increase, and these design decisions shape how well the model adapts:
QK normalization. QK norm is a technique that normalizes the query and key vectors inside each attention layer, typically added to improve training stability and prevent large, erratic attention scores. It’s used in Olmo 3, Qwen 3, and Gemma 3. A variant called headwise QK norm applies normalization separately to each attention head rather than across the full layer; Qwen 3 and Gemma 3 use this headwise variant.
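To make the distinction concrete, here is a minimal sketch (our own illustration with arbitrary shapes, not the implementation in Olmo 3, Qwen 3, or Gemma 3): layerwise QK norm normalizes the full query projection at once, while headwise QK norm normalizes each head's slice of it separately. Keys are handled the same way.

```python
# Minimal sketch of layerwise vs. headwise QK norm (illustrative shapes only).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale each vector along the last dimension to unit RMS, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

n_heads, head_dim = 8, 64
hidden = n_heads * head_dim
x = torch.randn(2, 16, hidden)                  # (batch, seq, hidden)
q = nn.Linear(hidden, hidden, bias=False)(x)    # query projection (keys are analogous)

# Layerwise QK norm: one norm over the concatenated heads.
q_layerwise = RMSNorm(hidden)(q)

# Headwise QK norm: split into heads first, then normalize each head's slice.
q_headwise = RMSNorm(head_dim)(q.view(2, 16, n_heads, head_dim))
```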
Grouped-query attention (GQA). GQA is an efficiency technique that shares key-value parameters across multiple attention heads, reducing memory usage during inference. The tradeoff is reduced model capacity: fewer independent key-value heads means the model has less flexibility in how it retrieves and combines information from prior context. GQA is used in Llama 3, Qwen 3, Gemma 3, and many other recent models.
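A minimal sketch of the mechanism, with illustrative head counts rather than any particular model's configuration: a small number of key-value heads is repeated so that several query heads attend over the same keys and values.

```python
# Minimal GQA sketch: 8 query heads share 2 key-value heads (illustrative sizes).
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 128, 64
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads              # query heads served by each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each KV head so every query head has a (shared) counterpart.
k = k.repeat_interleave(group, dim=1)        # -> (batch, n_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# With n_kv_heads == n_q_heads this reduces to standard multi-head attention;
# fewer KV heads saves KV-cache memory but removes independent KV projections.
```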
Sliding window attention. This restricts most attention layers to look at only a local window of nearby tokens rather than the full input. A smaller number of layers retain full attention over the entire context. Sliding window attention is used in Olmo 3 and Gemma 3. The Olmo 3 configuration uses three local-attention layers for every one full-attention layer.
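A rough sketch of the corresponding mask (window size chosen arbitrarily for illustration): each query position may attend only to itself and a fixed number of preceding tokens, while full-attention layers keep the ordinary causal mask.

```python
# Minimal sketch of a sliding-window causal mask (True = attention allowed).
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # never attend to future tokens
    local = (i - j) < window                 # only the most recent `window` tokens
    return causal & local

local_mask = sliding_window_mask(seq_len=8, window=4)
full_mask = sliding_window_mask(seq_len=8, window=8)  # window >= seq_len: full causal
# In a 3:1 layout like the one described above, three layers would use the
# local mask for every one layer that uses the full causal mask.
```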
Pretraining context length. Some model families pretrain at shorter sequence lengths (e.g., 4,096 tokens) and rely entirely on context extension to reach longer contexts later. Others pretrain at longer lengths (e.g., 8,192 tokens), giving the model some exposure to longer-range patterns before extension.
Analysis and benchmark results
We evaluated every model on three established long context benchmarks: HELMET, which tests in-context learning, retrieval, and question answering at various context lengths; RULER, a set of synthetic retrieval tasks of increasing difficulty; and LongPPL, a variant of perplexity focused on tokens that depend on long-range context. All three correlate closely—for readability, we primarily report HELMET scores below.
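For readers unfamiliar with LongPPL, the rough idea is perplexity restricted to the tokens that depend on long-range context. The sketch below uses a placeholder mask for that token selection; the actual selection procedure is defined in the LongPPL paper and is not reproduced here.

```python
# Rough sketch of a long-context-focused perplexity: average loss only over
# tokens flagged as long-range-dependent. The flagging step is a placeholder.
import torch
import torch.nn.functional as F

def masked_perplexity(logits: torch.Tensor,          # (seq, vocab) next-token logits
                      targets: torch.Tensor,         # (seq,) target token ids
                      long_range_mask: torch.Tensor  # (seq,) bool, True = long-range token
                      ) -> torch.Tensor:
    token_loss = F.cross_entropy(logits, targets, reduction="none")
    return torch.exp(token_loss[long_range_mask].mean())
```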
For these benchmark evaluations, models were pretrained on 1T tokens from the OLMoE pretraining corpus, followed by a 50B-token annealing phase. For full results, see our technical report.
Short context metrics don't predict long context performance. Standard training signals give almost no indication of how well a model will handle long context after extension. Training loss, validation perplexity (how well the model predicts held-out text), and a suite of 16 short context benchmarks all fail to predict which models will score well at 32K or 64K context lengths. Even HELMET scores at 8K – the shortest context split of the same benchmark – fail to anticipate double-digit swings in post-extension performance. Models that look nearly identical on standard evaluations can diverge by more than 26 points on HELMET at 32K once extended.
Context extension typically happens late in the development cycle, well after architecture decisions have been locked in. But while standard training metrics miss these issues, we found that running a context extension experiment early in pretraining can surface problems at a fraction of the cost.
These effects compound. In paired comparisons between models that differ in only one architectural feature, most individual features have a modest effect. QK normalization has the single largest individual impact: on the Olmo architecture, removing QK norm and switching to a different normalization ordering yields a 6-point gain on HELMET at 32K. Headwise QK norm causes an additional slight degradation beyond the standard layerwise version. GQA and shorter pretraining context length each cause smaller drops, and sliding window attention costs about 1 point on HELMET in isolation.
But when these choices are combined, the effects are much larger than the sum of their parts. Adding sliding window attention to a model that also uses GQA drops performance by around 9 points on average. The worst-scoring configurations in OlmPool combine two or more choices that constrain how flexibly the model can attend over its full input.
Indeed, we found that the single best predictor of long context performance is simply counting how many of the four architectural choices in OlmPool are present—that count alone explains more of the variation across models than a statistical model using the four choices as separate variables.
Llama 3 is strong for long context, but not necessarily the best. In OlmPool, where data is held constant, the Llama 3 configuration is one of the strongest performers, but it isn't the best in every case: several other models measurably beat it. This confirms that Llama 3's long context success is primarily architectural, and it suggests that extension recipes validated on Llama may need adaptation for other model families.
Architecture-driven gaps don't wash out with more data. We tested this in two ways:
- First, we ran context extension at three data scales – 1B, 10B, and 50B tokens – on three representative models. All three improve with more data, but the architecture-originated deltas remain. Even after 50B tokens of context extension, representing 26% of total training, the worst architecture doesn’t reach the performance the Llama architecture achieves after just 1B tokens.
- Second, we performed context extensions at multiple points during much longer pretraining runs, from 70B up to 2T tokens. The relative ranking of architectures stays consistent from 140B tokens onward.
Attention patterns help explain why. We analyzed how all 26 models in OlmPool distribute attention across their context and found that models without QK norm develop stronger attention sinks—positions early in the input (typically among the first few tokens) that consistently receive a large share of attention, even when they aren’t relevant to the current prediction. Researchers have generally considered attention sinks undesirable, since they can complicate model compression. But in OlmPool, stronger sinks correlate with better long context performance. In the absence of other mechanisms for managing excess attention weight, sinks appear to be the default strategy learned by models without QK norm to support retrieval over long inputs.
We also tested whether models could retrieve specific information embedded in long documents, using a needle-in-a-haystack setup where a target fact is placed somewhere in a long passage. Models with QK norm placed less attention on the target information, consistent with their weaker long context performance overall.
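Both analyses reduce to the same measurement: how much post-softmax attention mass falls on a particular span of the input, whether that span is the first few positions (sink strength) or the needle's location (retrieval). A minimal sketch of that measurement, using a hypothetical helper of our own rather than the analysis code from the report:

```python
# Minimal sketch: fraction of attention mass a layer places on a span of the input.
# `attn` is one layer's post-softmax attention, shape (heads, query_len, key_len).
import torch

def attention_mass_on_span(attn: torch.Tensor, start: int, end: int) -> float:
    # Sum the weight each query gives to keys in [start, end), then average
    # over heads and query positions.
    return attn[:, :, start:end].sum(dim=-1).mean().item()

# Sink strength: mass on the first few tokens, e.g. attention_mass_on_span(attn, 0, 4).
# Needle retrieval: mass on the needle's token span within the haystack.
```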
OlmPool: a resource for the community
Each of the four architectural choices we studied has a clear benefit in other contexts—QK norm improves training stability, shorter pretraining context length is more compute-efficient, and GQA and sliding window attention both reduce inference cost. But our work shows that the combination can produce long context performance far below what practitioners would expect, and that this outcome isn’t visible from standard training signals.
We're releasing all 26 OlmPool models with 38 checkpoints each, covering the full pretraining and context extension process. We hope these models are useful both for developing better context extension methods and for studying other phenomena in early pretraining.