Modern language models are trained on many kinds of data—web text, code, PDFs, math, and more. How you combine these sources matters enormously for the quality of the final model, but figuring out the right recipe is one of the messiest parts of building an LM. There's no universal formula, and the "best" mix depends on dozens of design choices that practitioners often have to guess at. In practice, many teams end up manually tweaking the mix, with little way to know whether they're leaving performance on the table.
Making things harder, training data isn't static throughout model development. In the weeks and months leading up to when the final model starts training, you're constantly adding new datasets, removing low-quality ones, filtering, and reorganizing—this is the reality of training an LM. Every time your data changes, you potentially need to figure out the mix all over again.
Olmix, which we’re releasing today, is our answer to these challenges. It's a framework for data mixing designed to keep up with how LMs are actually built, addressing two problems we repeatedly encounter in practice:
- There's surprisingly little guidance on how to configure a data mixing method. Existing approaches make different choices about model sizes, how many experiments to run, what modeling techniques to use, and so on—and these choices often conflict or aren't concretely justified in the literature. Practitioners are left to figure it out themselves.
- Recomputing your data mix from scratch every time your datasets change gets expensive fast. As your training corpus evolves through dozens of iterations, mixing can become a tax on every modification.
Olmix provides empirically grounded defaults so you're not guessing at configuration choices, and it introduces mixture reuse techniques that let you efficiently update your mix as your data evolves throughout LM development. In our experiments, this translated to a mix that is 12% better on downstream tasks and 3x more data-efficient than no mixing at all.
What actually matters in data mixing
A standard approach to finding a good data mix is to train a bunch of smaller proxy models on different mixtures (a “swarm”), see how they perform, and use a regression model to predict what mix will work best for your full-scale model. (In our paper, we call it the "offline mixing schema.") That's straightforward in theory. In practice, there's little consensus on how to actually set this up, let alone how to keep doing it throughout development.
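To make that concrete, here's a minimal sketch of the loop in Python. The Dirichlet sampling, function names, and arguments are our own illustrative choices, not Olmix's API; you would plug in your own proxy training, evaluation, and regression code.

```python
import numpy as np

def offline_mixing(num_domains, num_proxy_runs, train_and_eval, fit_regression,
                   n_candidates=100_000, seed=0):
    """Illustrative sketch of the offline mixing schema.

    `train_and_eval(weights) -> bpb` trains one proxy model on the given
    mixture and returns its downstream BPB; `fit_regression(X, y)` returns a
    callable that predicts BPB for a batch of mixtures. Both are supplied by
    the user.
    """
    rng = np.random.default_rng(seed)

    # 1. Sample candidate mixtures: points on the probability simplex.
    swarm = rng.dirichlet(np.ones(num_domains), size=num_proxy_runs)

    # 2. Train one small proxy per mixture and record its downstream BPB.
    bpb = np.array([train_and_eval(w) for w in swarm])

    # 3. Fit a regression model mapping mixture weights -> predicted BPB.
    predict = fit_regression(swarm, bpb)

    # 4. Search the simplex for the mixture the regression predicts is best.
    candidates = rng.dirichlet(np.ones(num_domains), size=n_candidates)
    return candidates[np.argmin(predict(candidates))]
```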
Olmix is built around the idea that mixing should be a repeatable workflow that can keep up with modern LM development. It has two main components: OlmixBase, a data mixing method that instantiates the offline schema with research-validated configuration choices, giving you a strong starting mix at the beginning of development; and a set of mixture reuse techniques for efficiently updating that mix as your data evolves throughout the rest of LM development—without recomputing from scratch each time.
OlmixBase: Taking the guesswork out of configuring a mixing method
We ran a comprehensive study to answer the questions practitioners typically face when setting up a data mixing method according to the offline schema: How small can your proxy models be? How many proxy runs do you need? What regression model should you use? The findings from this study form the basis of OlmixBase, our recommended data mixing method.
For our study, we trained 1B-parameter models on DCLM data, a curated web text corpus, partitioned into 24 topic-based domains, and evaluated them across 52 downstream tasks spanning math, code, and commonsense QA. We measured performance via bits-per-byte (BPB), or the negative log-likelihood of the correct answer normalized by answer length in UTF-8 bytes.
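For reference, BPB can be computed from the model's negative log-likelihood of the correct answer in a couple of lines; this generic sketch assumes the NLL is reported in nats.

```python
import math

def bits_per_byte(nll_nats: float, answer_text: str) -> float:
    """Negative log-likelihood of the answer, converted to bits, per UTF-8 byte.

    `nll_nats` is the summed negative log-likelihood (natural log) the model
    assigns to the answer tokens. Lower BPB is better.
    """
    n_bytes = len(answer_text.encode("utf-8"))
    return nll_nats / (math.log(2) * n_bytes)

# e.g. a 12-byte answer with a summed NLL of 9.2 nats:
# bits_per_byte(9.2, "hello world!")  ->  ~1.11 bits per byte
```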
Here's some of what we found:
Proxy models: small, but not too small. Proxies above roughly 15M parameters achieve strong rank correlation (ρ > 0.89) with 1B-parameter target models. Substantially smaller proxies become unreliable—1M-parameter models achieve only ρ = 0.73, which is too noisy for confident mixture decisions. Notably, while RegMix recommends a 1M proxy configuration, their public code reveals that this 1M implementation is actually closer to 15M, which our results suggest is a good proxy size. If you've been using extremely tiny proxies to save compute, you may be paying for those savings in noisy results.
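To make the reliability check concrete, here's a toy example of how such a rank correlation can be measured: evaluate the same candidate mixtures with a proxy and with the target, then compute Spearman's ρ between the two BPB rankings (the numbers below are made up).

```python
from scipy.stats import spearmanr

# BPB of the same five candidate mixtures, measured with a small proxy model
# and with the 1B target (illustrative values only).
proxy_bpb  = [0.92, 0.88, 0.95, 0.90, 0.86]
target_bpb = [0.81, 0.80, 0.84, 0.78, 0.77]

rho, _ = spearmanr(proxy_bpb, target_bpb)
print(f"Rank correlation between proxy and target rankings: {rho:.2f}")
```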
Mixing costs scale linearly with domain count. Existing methods use anywhere from 20 to over 500 proxy runs, but few explain how this should scale with your domain set size. We found that the required number scales linearly – O(m) runs for m domains – giving practitioners a concrete prescription for allocating compute. Even as your domain set grows, mixing costs don't have to explode.
Log-linear regression is a strong default. Different papers use different regression models to predict performance from mixture weights – power law, log-linear, gradient boosting, Gaussian processes, etc. – often with little justification. We found that swarm size (the number of proxy models you train to explore the mixture space) is a key confounding factor—different models excel at different swarm sizes, which may explain the lack of consensus in the literature. Across settings, though, log-linear models achieve the best overall fit while remaining competitive on downstream validation.
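As a concrete illustration, here's one common way to parameterize such a regressor: predict log(BPB) as a linear function of the mixture weights and fit it by least squares. This is our own sketch of the model family, not necessarily the exact form used in Olmix, and it slots into the `fit_regression` step of the earlier schema sketch.

```python
import numpy as np

def fit_log_linear(mixtures: np.ndarray, bpb: np.ndarray):
    """Fit log(BPB) ≈ b + t·w by least squares; return a batch predictor.

    `mixtures` has shape (n_runs, n_domains) and `bpb` has shape (n_runs,).
    This is one common log-linear parameterization, shown for illustration.
    """
    X = np.hstack([np.ones((len(mixtures), 1)), mixtures])  # add an intercept
    coef, *_ = np.linalg.lstsq(X, np.log(bpb), rcond=None)
    b, t = coef[0], coef[1:]
    return lambda W: np.exp(b + np.atleast_2d(W) @ t)       # predicted BPB
```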
Don't let your optimizer over-repeat scarce data. It's a common failure mode: your mixing method suggests allocating a large portion of training to code (or some other data type) when code is only a fraction of your available data, forcing harmful data repetition. Unlike existing methods that assume unlimited data, OlmixBase incorporates explicit repetition constraints so your mix stays grounded in actual data availability.
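One way such a constraint could be enforced at the mixture-search step, sketched under our own assumptions (the sampling search and the 4-epoch cap are illustrative choices, not Olmix's exact formulation): sample candidate mixtures, drop any that would force a domain past the epoch cap given your token budget, and score the rest with the fitted regressor.

```python
import numpy as np

def feasible_best_mix(predict, available_tokens, total_tokens,
                      max_epochs=4.0, n_candidates=100_000, seed=0):
    """Pick the predicted-best mixture that respects data availability.

    A domain with weight w_i would contribute w_i * total_tokens training
    tokens; candidates are rejected if that exceeds `max_epochs` passes over
    the domain's available tokens. Assumes at least one candidate is feasible.
    """
    rng = np.random.default_rng(seed)
    m = len(available_tokens)
    candidates = rng.dirichlet(np.ones(m), size=n_candidates)

    # Repetition constraint: w_i * total_tokens <= max_epochs * available_i.
    ok = (candidates * total_tokens
          <= max_epochs * np.asarray(available_tokens)).all(axis=1)
    feasible = candidates[ok]

    return feasible[np.argmin(predict(feasible))]
```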
All these findings are baked into OlmixBase—proxy models large enough to be reliable but small enough to remain cheap, swarm sizes that scale linearly with domains, log-linear regression, and feasibility constraints for data-limited regimes. If you've wanted a sensible starting point for mixing that isn't just guesswork, OlmixBase is designed to be exactly that.
Mixture reuse: Efficient mixing for iterative development
The second component of Olmix addresses what happens after your initial mix.
In our experience developing Olmo 1–3, we repeatedly added, removed, revised, and partitioned datasets—a pattern also observed in other iterative development efforts, such as SmolLM 1–3. These operations are how real training pipelines actually evolve. The question is whether you have to pay the full cost of mixing from scratch every time the data changes.
Olmix introduces mixture reuse: when your domain set changes, you don't need to recompute everything. The key insight is that most changes only touch a few domains, not the entire corpus. So instead of recomputing from scratch, you keep the relative ratios among the unchanged domains fixed and only recompute the ratios for the domains that actually changed. In practice, this means bundling all unchanged domains into a single "virtual domain," solving a much smaller mixing problem over just that virtual domain plus whatever has changed, and then expanding the result back into a full mixture. Since mixing cost scales linearly with the number of domains you're optimizing over, this dramatically cuts compute—especially when only a handful of domains change at a time. We call our default approach Full Mixture Reuse, which preserves the ratios for all unchanged domains.
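Under our reading of that description, the bookkeeping looks roughly like the sketch below. The helper names are illustrative rather than Olmix's API: `solve_mix` stands in for running OlmixBase on the smaller domain set (inside it, the virtual domain would be sampled according to the old unchanged ratios), and `changed` is the set of changed or newly added domains present in the updated corpus.

```python
def full_mixture_reuse(old_mix: dict, changed: set, solve_mix) -> dict:
    """Sketch of Full Mixture Reuse (helper names are illustrative).

    `old_mix` maps domain -> weight from the previous mix; `solve_mix(domains)`
    solves the much smaller mixing problem over the given domain list and
    returns domain -> weight. Unchanged domains keep their relative ratios.
    """
    unchanged = {d: w for d, w in old_mix.items() if d not in changed}

    # Solve a small mixing problem over one "virtual" domain + changed domains.
    small_mix = solve_mix(["__virtual__"] + sorted(changed))

    # Expand: split the virtual domain's weight among the unchanged domains in
    # proportion to their old weights.
    total_unchanged = sum(unchanged.values())
    new_mix = {d: small_mix["__virtual__"] * w / total_unchanged
               for d, w in unchanged.items()}
    new_mix.update({d: small_mix[d] for d in changed})
    return new_mix
```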
However, Full Mixture Reuse doesn't always match full recomputation (mixing from scratch on all domains). Our theoretical analysis reveals that coupling—when unchanged and changed domains both impact the same downstream tasks—is a key factor driving performance degradation. We see this play out empirically; when adding code data, for example, the "software development" slice of the web corpus and the new code data both influence coding tasks, competing to serve the same purpose. In this case, both of their ratios should be recomputed.
This is where our Partial Mixture Reuse approach comes in: rather than preserving the ratios of all unchanged domains, you selectively recompute a subset of them (e.g., web software development) alongside the changed domains. This approach can reduce coupling effects and can close the gap to full recomputation—while requiring only a few more proxy runs than Full Mixture Reuse.
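In terms of the sketch above, Partial Mixture Reuse amounts to treating the coupled unchanged domains as if they had changed, so their ratios get re-optimized alongside the new data. The domain names here are placeholders, not the actual partition used in our experiments.

```python
# Re-optimize the coupled web slice alongside the new code data; every other
# unchanged domain keeps its old relative ratio (reuses the earlier sketch).
coupled = {"web_software_development"}
new_mix = full_mixture_reuse(old_mix,
                             changed={"code_repos"} | coupled,
                             solve_mix=solve_mix)
```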
To test these approaches, we simulated a realistic development sequence of 5 updates, ending with 64 total domains. When training 1B-parameter models on 100B tokens, Full Mixture Reuse achieves 95% of the improvement of full recomputation while using 74% fewer proxy runs (216 vs. 832), and Partial Mixture Reuse reaches 98% while using 67% fewer proxy runs. Furthermore, our best mix obtained via mixture reuse is 12.2% better than the natural distribution (a no-mixing baseline with implicit ratios proportional to domain sizes) and is 3.05x more data-efficient. Our mix looks qualitatively similar to what you'd get from full recomputation, up-weighting high-value domains like arXiv, FineMath, and code.
Who is Olmix for?
Olmix is for anyone training LMs on diverse data who's tired of guessing at configuration choices—or tired of re-running expensive mixing experiments every time their training corpus changes. It provides a principled starting configuration and an efficient update mechanism so you can keep improving your mix as your data evolves.
Fundamentally, we think data mixing has been underserved by the research community. It's a first-order lever on model quality, but most practitioners copy configurations from prior work without clear justification, rely on expensive sweeps from scratch every time something changes, or resort to manual, intuition-driven tweaking. Olmix is our attempt to change that—grounded in experimentation and designed for the realities of iterative large-scale LM development.
The full framework, including code and paper, is available now.