Watching an LLM learn math skills
“Olmo’s accessibility makes it the perfect artifact for researchers to work with.” — Shubhra Mishra, PhD Candidate at KTH
How do models learn math skills? That’s what Stanford researchers Shubhra Mishra, Gabriel Poesia, and Noah D. Goodman set out to uncover with MathCAMPS, a dataset that turns the Common Core math curriculum into thousands of fine-grained test questions for LLMs.
The way models learn – or fail to learn – math has far-reaching implications. Because math has clear right answers and step-by-step solutions, it’s a clean window into whether a model is reasoning or just pattern-matching. It also gives developers and researchers hard evidence of how changes in data – or training – influence a model’s skill acquisition.
“LLMs, despite being trained on the single objective of predicting the next most likely token, demonstrate surprisingly strong abilities—for example, the ability to reason mathematically,” Mishra says. “Benchmarks have tried to quantify this. What we still don’t understand, though, is how such abilities evolve during training.”
Each of the questions in MathCAMPS is tied to a specific math skill—fractions, decimals, systems of equations, and more. The researchers used the dataset to evaluate the training checkpoints of different models to better understand when models acquire particular skills.
This is where Olmo comes in. Because Olmo is fully open source – including its intermediate checkpoints – the researchers could use it to observe the points at which its math skills emerged during pre-training. They were also able to compare what happens after instruction tuning, the short post-training phase that teaches models to follow natural-language directions.
The team ran many of Olmo’s checkpoints and fine-tuned variants on local hardware.
“We first downloaded the models from Hugging Face,” Mishra says. “To actually run the models, I used either NVIDIA A40 or A100 GPUs, depending on what was free in the lab at the time. We didn’t need any fine-tuning, since the only thing the model got was a math problem [from our dataset]. We also gave the model two-shot examples of other problems just to guide it to generate an answer in the correct formatting.”
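For readers who want to try something similar, here is a minimal sketch of that setup using standard Hugging Face tooling. The model id, revision tag, and example problems are placeholders rather than the team’s exact configuration; intermediate Olmo checkpoints are published as named revisions of the model repository.

```python
# Minimal sketch (not the team's exact script): load one Olmo checkpoint from
# Hugging Face and prompt it with two worked examples followed by a new problem,
# mirroring the two-shot format described above. Model id and revision are
# placeholders; intermediate checkpoints are exposed as named revisions on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMo-2-1124-7B"   # any Olmo release works
REVISION = "main"                  # swap in an intermediate-checkpoint revision to probe mid-training

tokenizer = AutoTokenizer.from_pretrained(MODEL, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, revision=REVISION, torch_dtype=torch.bfloat16, device_map="auto"
)

# Two-shot prompt: worked examples steer the model toward the expected answer
# format, then the benchmark problem follows (illustrative items, not MathCAMPS).
prompt = (
    "Problem: 3 + 5 = ?\nAnswer: 8\n\n"
    "Problem: 12 - 4 = ?\nAnswer: 8\n\n"
    "Problem: 7 * 6 = ?\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```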
Two findings stand out. First, many math abilities don’t appear all at once during training; they tend to come online in an order that roughly mirrors the way we teach math in schools—simple arithmetic first, then algebra, and so on—even though models see their training data in no particular curriculum order. Second, when probed with simple follow-up questions (e.g., “now change one number and solve again”), even strong models can stumble, revealing where their reasoning is brittle.
Those results were clearly visible because the team could test multiple Olmo checkpoints and trace the learning curve of each skill over time.
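Tracing a skill-wise learning curve comes down to simple bookkeeping: grade each checkpoint’s answers, then compute accuracy per skill. The sketch below assumes a hypothetical list of graded results; the field names and checkpoint labels are illustrative, not MathCAMPS’s actual schema.

```python
# Minimal sketch: given graded results for several checkpoints, compute accuracy
# per (checkpoint, skill) so each skill's learning curve can be traced over training.
from collections import defaultdict

# Each record says which checkpoint answered, which skill the item tests,
# and whether the answer was correct (illustrative data only).
results = [
    {"checkpoint": "step100000", "skill": "fractions", "correct": True},
    {"checkpoint": "step100000", "skill": "systems_of_equations", "correct": False},
    {"checkpoint": "step400000", "skill": "fractions", "correct": True},
    {"checkpoint": "step400000", "skill": "systems_of_equations", "correct": True},
]

totals = defaultdict(lambda: [0, 0])  # (checkpoint, skill) -> [num_correct, num_items]
for r in results:
    key = (r["checkpoint"], r["skill"])
    totals[key][0] += int(r["correct"])
    totals[key][1] += 1

for (checkpoint, skill), (num_correct, num_items) in sorted(totals.items()):
    print(f"{checkpoint:>12}  {skill:<22}  {num_correct / num_items:.2f}")
```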
“Olmo’s accessibility makes it the perfect artifact for researchers to work with,” adds Mishra. “We were able to run a subset of intermediate Olmo model checkpoints on our benchmark and get a skill-wise breakdown of reasoning acquired during training. Additionally, the transparency about Olmo’s data use during training helped us hypothesize how different aspects of the data impacted reasoning skill acquisition.”
By combining a project like MathCAMPS with Olmo, researchers can see how models acquire skills, when those skills emerge, and how training choices like instruction tuning change them. It’s evidence people can verify and build on: openness doing real diagnostic work.
“Because so many people use Olmo, there are thousands of guides out there on setting up, fixing issues, and so on,” Mishra says. “It’s a two-way street of Olmo contributing to the scientific community, and the scientific community in turn returning its love to Olmo. I believe that sharing findings the way that Ai2 does is crucial to the equitable acceleration of science, especially in a world where things are increasingly becoming closed source or benefitting people already at the top.”