Tracing knowledge cutoffs
"Olmo lets others constructively engage with the work, forming hypotheses, testing them, and building new work." — Marc Marone, PhD Student at Johns Hopkins University
When you talk to an AI model and it says “I was trained up to March 2023,” you’re trusting that it has current enough knowledge for your needs. But what does “trained up to” really mean in practice?
That’s the question a team led by Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme asked, using Olmo to find out. In their paper “Dated Data: Tracing Knowledge Cutoffs in Large Language Models,” they show that an LLM’s claimed knowledge cutoff often doesn’t match what the model actually knows about different topics.
The key insight is that not all parts of a model’s training data are equally fresh. Even a recent web crawl can include outdated pages, and deduplication systems (which remove repeated or near-duplicate text) can mask which versions of a page survive in the final training mix. As a result, models may treat some topics as if they’re current even when they rely on stale information.
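To make the deduplication point concrete, here is a minimal sketch in Python of first-seen near-duplicate filtering. It is an illustration only, not the pipeline actually used to build Olmo’s corpus: if an older crawl of a page is encountered before a lightly edited, fresher one, the stale version is what survives.

```python
def shingles(text, n=3):
    """Word n-grams used as a crude fingerprint of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def near_duplicate(a, b, threshold=0.7):
    """Treat two texts as duplicates if their shingle overlap (Jaccard) is high."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return bool(union) and len(sa & sb) / len(union) >= threshold


def dedup_first_seen(documents, threshold=0.7):
    """Keep only the first-seen copy of each near-duplicate cluster.

    `documents` is a list of (crawl_date, text) pairs in whatever order the
    pipeline encounters them -- which is not necessarily newest-first.
    """
    kept = []
    for crawl_date, text in documents:
        if not any(near_duplicate(text, seen, threshold) for _, seen in kept):
            kept.append((crawl_date, text))
    return kept


# The 2019 crawl is seen first, so the lightly updated 2023 version is
# discarded as a near-duplicate and the stale figure is what gets trained on.
docs = [
    ("2019-06-01", "The city has grown steadily over the past decade and now has "
                   "a population of 2.1 million people according to the most recent census."),
    ("2023-03-01", "The city has grown steadily over the past decade and now has "
                   "a population of 2.4 million people according to the most recent census."),
]
print(dedup_first_seen(docs))  # only the 2019 version survives
```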
Here’s where Olmo’s openness makes a difference. Because Olmo’s pre-training corpus, processing pipelines, and training history are public and inspectable, researchers can cross-check exactly which document versions entered the training set. That makes it possible to tell whether a mismatch stems from stale pages in the web crawl or from which versions happened to survive deduplication.
“A model with a documented training process and dataset doesn’t need to be the absolute best model,” Marone, a PhD student at Johns Hopkins University, says. “Olmo lets others constructively engage with the work, forming hypotheses, testing them, and building new work. None of these are possible without open models and datasets.”
Marone and the team downloaded Olmo and the associated datasets from Hugging Face, using Ai2’s GitHub resources to examine the source code. They were able to run Olmo on their own GPUs, ensuring the data stayed local.
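As a rough illustration of that local workflow (the checkpoint name below is an example, not necessarily the exact version the team used), a released Olmo model can be pulled from Hugging Face and run entirely on your own GPU with the standard transformers API:

```python
# Example setup: download a released Olmo checkpoint from Hugging Face and
# run it on a local GPU so the data never leaves the machine. The checkpoint
# name is illustrative; other Olmo sizes and revisions are published too.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.to("cuda").eval()

prompt = "The knowledge cutoff of a language model is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```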
In the study, the researchers assembled version histories of documents over time, fed these time-tagged texts to Olmo and various other models, and measured perplexity—a statistic that shows how “surprised” a model is by an input. Low perplexity means the model expects that text; high perplexity means it finds it unfamiliar. By tracking when a model’s perplexity started rising, the authors could estimate the point in time up to which the model had reliable information about a particular subject.
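In spirit, the measurement can be sketched as follows. This is a simplified, single-document version of the idea rather than the paper’s exact procedure, and it reuses a model and tokenizer loaded as above: score each dated version of a document and take the date the model finds least surprising as its effective cutoff for that topic.

```python
import math

import torch


def perplexity(model, tokenizer, text, device="cuda"):
    """Perplexity = exp(mean token-level negative log-likelihood)."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # cross-entropy loss over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


def effective_cutoff(model, tokenizer, dated_versions):
    """Estimate when the model's knowledge of a document stops tracking reality.

    `dated_versions` maps a snapshot date (e.g. "2022-04") to that snapshot's
    text. Heuristic: the snapshot the model finds least surprising is taken
    as its effective cutoff for this document.
    """
    scores = {date: perplexity(model, tokenizer, text)
              for date, text in sorted(dated_versions.items())}
    return min(scores, key=scores.get), scores
```

Aggregating estimates like this over many documents, and comparing the resulting dates to the advertised cutoff, is what exposes the gap the paper reports.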
The results are striking. In many cases, a model’s effective cutoff is significantly older than its claimed one. In other words, when a model says it was trained up to a specific year, it may in fact be drawing on noticeably older documents when it answers factual questions.
What’s the practical upshot? In domains where correctness matters, like medicine, users often trust the model’s knowledge cutoff as a guardrail. But if the effective cutoff is older than advertised, the model might be repeating outdated facts as if they were current. Identifying that gap allows teams to flag inputs that risk staleness, retrain on fresher data, or tighten version filtering to ensure that only the most recent facts survive.
Because Olmo is designed as a fully open research stack, it serves as a diagnostic instrument in studies like these—not just a model. You can trace mismatch patterns, rerun experiments, and audit whether your model’s knowledge is genuinely fresh—or secretly stuck in the past.
“Open models like Olmo gave us confidence that we could make statements about what was or wasn’t in the training data,” Marone says. “It’s still very hard to analyze huge piles of data for model training, but being able to download the raw data and having a documented process for how it was constructed are essential. We analyzed some closed models using the same methods, but of course can’t verify our findings like we could for Olmo.”