
OMEGA: Can LLMs Reason Outside the Box in Math?

June 24, 2025

Nouha Dziri - Ai2


Large language models (LLMs) like GPT-4, Claude, and DeepSeek-R1 have made headlines for their impressive performance on mathematical competitions, sometimes approaching or even exceeding human expert levels on Olympiad problems. Yet a fundamental question remains: Are they truly reasoning, or are they just recalling familiar strategies without inventing new ones?

At the heart of this question is generalization: the ability to extend learned skills to unfamiliar problems, combine ideas in new ways or discover strategies not explicitly seen during training. While LLMs excel at solving problems similar to those in their training data, they often falter when required to adapt, integrate, or rethink.

To investigate these limitations, we introduce OMEGA: a controlled math benchmark that systematically evaluates LLMs along three axes of reasoning, each designed to probe a distinct type of cognitive leap.

Why Another Math Benchmark?

There are dozens of math benchmarks out there (e.g., GSM8K, DeepMath), but few allow us to meticulously test specific reasoning skills or isolate generalization failures. Most are either broad but coarse, offering thousands of problems across various types without a clear way to attribute success to particular skills or strategies, or controlled but narrow, focusing on single-skill reasoning tasks with limited structural and domain diversity.

Large-scale datasets like Numina-Math, Omni-Math, and DeepMath aggregate diverse math problems spanning multiple domains and complexity levels, but this breadth comes at the cost of interpretability. These corpora mix arithmetic, algebra, geometry, and beyond into a single training stream, making it difficult to isolate which specific reasoning skill an RL-tuned model actually learned or whether its success stems from pattern matching, memorization, or genuine abstraction. On the flip side, highly controlled datasets like GSM-Symbolic, GSM-PLUS, and GSM-Infinite offer cleaner scaffolds for causal analysis but focus on limited domains, such as integer arithmetic or symbolic manipulation, which narrows the scope of generalization. 

OMEGA addresses this gap by combining both key properties: it is controlled (built from 40 programmatic templates across six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic & puzzles) and diverse (covering problem types ranging from combinatorics and number theory to puzzles and logic).
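To give a feel for what "programmatic templates" and controlled complexity mean in practice, here is a minimal, hypothetical generator in that spirit. The template name, fields, and complexity knob are illustrative assumptions for this post, not the benchmark's actual schema:

```python
import random
from dataclasses import dataclass

@dataclass
class Problem:
    domain: str       # e.g., "arithmetic"
    template: str     # which generator produced the problem
    complexity: int   # the controlled difficulty knob
    question: str
    answer: str

def multiplication_template(complexity: int, rng: random.Random) -> Problem:
    """Hypothetical template: multiply two n-digit integers, with n = complexity."""
    lo, hi = 10 ** (complexity - 1), 10 ** complexity - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return Problem(
        domain="arithmetic",
        template="multiplication",
        complexity=complexity,
        question=f"Compute {a} * {b}.",
        answer=str(a * b),
    )

# A small, fully controlled evaluation set: the same skill at rising complexity levels.
rng = random.Random(0)
dataset = [multiplication_template(level, rng) for level in range(2, 8) for _ in range(5)]
```

Because every problem is generated rather than scraped, success or failure can be attributed to a specific skill at a specific complexity level.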

Three Axes of Mathematical Generalization

Inspired by Margaret Boden’s typology of creativity, we define three distinct reasoning leaps:

  1. Exploratory Generalization: Can the model apply a known strategy to more complex instances within the same problem domain? (e.g., extending a method from an octagon to a dodecagon)
  2. Compositional Generalization: Can it combine distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways? (e.g., mix GCD computation with root-solving)
  3. Transformative Generalization: Can it adopt novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively? (e.g., replace exhaustive enumeration with a clever counting trick; see the sketch below)
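To make the third axis concrete, the sketch below contrasts the two strategies in the enumeration example: a brute-force count of monotone lattice paths versus the closed-form counting argument that replaces it. The specific task is my own illustration, not one of the benchmark's problems:

```python
from itertools import product
from math import comb

def count_paths_by_enumeration(n: int) -> int:
    """Familiar strategy: enumerate every sequence of 2n moves (R or U)
    and keep those with exactly n of each. Exponential in n."""
    return sum(1 for moves in product("RU", repeat=2 * n) if moves.count("R") == n)

def count_paths_by_formula(n: int) -> int:
    """Counting trick: choose which n of the 2n steps go right -> C(2n, n)."""
    return comb(2 * n, n)

print(count_paths_by_enumeration(5), count_paths_by_formula(5))  # both 252
```

A model that only ever rehearsed the enumeration pattern has no path to the second strategy once the instance grows too large to enumerate.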

Experimental Setup

We benchmarked four top-tier models (DeepSeek-R1, Claude 3.7, OpenAI o3-mini, o4-mini) and fine-tuned Qwen2.5 models (Qwen2.5-7B-Instruct and Qwen2.5-Math-7B) using reinforcement learning (GRPO) across all settings. For each generalization paradigm, we do RL on a training set of 1,000 problems and evaluate on both in-domain (ID) and out-of-distribution (OOD) test sets.

  • Exploratory: Can the model apply a known strategy to more complex versions of the same problem?
    • → Train on easy examples, test on harder ones from the same template.
  • Compositional: Can it combine two familiar skills to solve a novel problem that requires both?
    • → Train on each skill in isolation, test on problems requiring their integration.
  • Transformative: Can it abandon familiar strategies and discover a new, more effective approach?
    • → Train on standard tactics, test on problems where those tactics fail, requiring a shift in reasoning approach.
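GRPO (Group Relative Policy Optimization) is an existing RL algorithm; as a minimal sketch of the signal it optimizes in a verifiable-math setting like this one, the snippet below scores a group of sampled solutions with a binary verifier and normalizes rewards within the group. The function names and the exact-match verifier are simplifying assumptions, and the sampling and policy-update steps are omitted:

```python
from statistics import mean, pstdev

def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """Binary reward from an exact-match verifier (a deliberate simplification)."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each sampled response is scored relative to the
    mean and standard deviation of its own group (same prompt, several samples)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled solutions to one training problem, one of them correct.
samples = ["41", "42", "The answer is 7", "40"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)  # the correct sample gets a positive advantage
```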

What We Found

Our key findings reveal both the strengths and fundamental constraints in how AI systems approach mathematical reasoning:

1. Reasoning LLMs' performance degrades with increasing problem complexity

LLMs show strong performance on low-complexity problems—but accuracy degrades sharply as complexity rises. Even with more inference compute (Pass@k), performance degrades at higher levels.

What does this mean? Imagine a model that can easily count rectangles in a simple 4-sided polygon, performs well on an 8-sided octagon, but struggles with a 12-sided dodecagon despite the underlying strategy being identical. Or consider multiplication: a model might reliably multiply two 6-digit numbers but fail completely when asked to multiply two 7-digit numbers, even though the algorithm is fundamentally the same. This isn't a context length problem (the problems fit well within the models' context windows) but rather reveals natural limits in handling extended multi-step reasoning.

What's particularly striking is how quickly this degradation occurs—tasks requiring just 2-3 additional reasoning steps can cause performance to drop from 80% to near zero, suggesting that current models haven't fully internalized the underlying algorithms, but rather learned patterns at specific complexity levels.
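A note on the Pass@k numbers above: Pass@k is conventionally estimated with the unbiased formula from Chen et al. (2021), which asks how likely at least one of k sampled solutions is correct, given n samples of which c are correct. Assuming that convention, a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), the probability that
    at least one of k draws from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 2 correct -> pass@1 = 0.125, pass@8 ≈ 0.77
print(pass_at_k(16, 2, 1), pass_at_k(16, 2, 8))
```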

2. When Thinking Too Much Hurts: Chain-of-Thought Patterns Reveal Early Solutions and Error Spirals

We noticed that many failures stem not from lack of knowledge, but from overthinking. Models often find the right answer early in their chain of thought (CoT), only to spiral into self-corrections and abandon correct solutions.

Analysis of DeepSeek-R1's reasoning traces revealed two concerning patterns:

  • "Correct → Wrong" transitions: ~38% of incorrect responses initially contained the right answer, but models talked themselves out of it
  • Reasoning spirals: Models get trapped in cycles of failed verification attempts, sometimes consuming over 10,000 tokens while moving further from the solution

Why This Matters: This challenges the assumption that "more reasoning = better results." Sometimes the models' self-correction mechanisms can inadvertently hurt performance, suggesting opportunities to refine Chain-of-Thought approaches.
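The paper details how these traces were scored; purely as an illustration of the "correct → wrong" measurement, the sketch below flags traces that propose the gold answer at some point but commit to a different final answer. The answer-extraction regex is a deliberate simplification, not the actual analysis pipeline:

```python
import re

def extract_candidate_answers(trace: str) -> list[str]:
    """Very rough answer extraction: grab integers that follow the phrase
    'the answer is' anywhere in the chain of thought."""
    return re.findall(r"answer is\s*(-?\d+)", trace, flags=re.IGNORECASE)

def is_correct_then_wrong(trace: str, gold: str) -> bool:
    """Flag a trace that proposes the gold answer at some point
    but commits to a different final answer."""
    candidates = extract_candidate_answers(trace)
    return bool(candidates) and gold in candidates and candidates[-1] != gold

trace = "So the answer is 12. Wait, let me re-check... actually the answer is 10."
print(is_correct_then_wrong(trace, "12"))  # True: the model found 12, then abandoned it
```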

3. RL Shows Strong Performance with Important Limitations

Can RL Effectively Generalize from Easy to Hard Problems? Strong Early Gains, but Generalization Plateaus with Task Complexity

Reinforcement learning delivers substantial improvements on familiar problem types (+38 percentage points on average) and meaningful gains on moderately harder versions (+19 points). In some domains like logic puzzles, we saw dramatic improvements—jumping from 30% to over 80% accuracy without any supervised training.

However, while RL effectively helps models master problems within a certain complexity range, the benefits don't extend indefinitely. Training on levels 1–4 provides minimal improvement on level 5, suggesting there are natural limits to how far learned strategies can be stretched.

Can RL Learn to Compose Math Skills into Integrated Solutions? Strong Performance on Isolated Skills, but Limited Compositional Generalization

Our results suggest that reinforcement learning can significantly improve model performance on individual mathematical skills. For instance, when trained on isolated tasks such as computing greatest common divisors (GCD) or solving polynomial equations, models often reach high accuracy, indicating that RL is effective at reinforcing skill-specific solution patterns. These improvements are robust across a variety of problem types when the skill boundaries are clear and the solution path is consistent with those seen during training.

However, this success does not readily transfer to compositional tasks—problems that require combining two or more learned skills in a coherent, integrated solution. For example, when models are asked to first compute the GCD of two numbers and then use that result within a polynomial root-finding subproblem, performance drops sharply. Despite having mastered both components individually, models fail to integrate them effectively when the steps must be composed within a single reasoning chain.
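As a concrete (and hypothetical) instance of such a composed problem, in the spirit of the GCD-plus-roots example above: skill A produces a value that skill B then consumes, so neither isolated training template covers the full solution path on its own:

```python
from math import gcd, isqrt

# Skill A (trained in isolation): compute gcd(a, b).
a, b = 84, 60
g = gcd(a, b)                # 12

# Skill B (also trained in isolation): solve x^2 - s*x + p = 0 over the integers.
# Composition: the GCD from skill A supplies the coefficient s.
s, p = g, 32                 # x^2 - 12x + 32 = 0
disc = s * s - 4 * p         # 16, a perfect square
roots = ((s - isqrt(disc)) // 2, (s + isqrt(disc)) // 2)

print(g, roots)              # 12 (4, 8): each step is easy alone, but they must be chained
```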

This gap underscores a key limitation of current RL approaches: they are effective at optimizing for well-scoped, atomic skills but struggle to induce flexible reasoning policies that generalize across skill boundaries. In contrast to human learners—who routinely integrate known techniques to solve novel problems—RL-trained models appear to lack the inductive bias or learning signal needed to form compositional abstractions.

Can RL Go Beyond Familiar Skills to Discover New Reasoning Abilities? Creative Discovery Remains Limited

Transformative generalization, where models must go beyond rehearsed procedures to discover entirely new solution strategies, remains a significant challenge for current reinforcement learning methods. Our findings show that while RL can substantially enhance performance on tasks that follow familiar patterns observed during training, it struggles when success depends on creative insight or reasoning strategies not explicitly demonstrated in the data.

For instance, in tasks requiring an unfamiliar combination of mathematical ideas or a shift from procedural recall to structural abstraction, models typically achieve near-zero performance, even after RL training. This gap indicates that RL primarily reinforces solution patterns the model has already seen or implicitly encoded from pretraining. It is far less effective at encouraging the emergence of novel reasoning policies or adaptive behaviors in out-of-distribution settings.

Toward Smarter Scaling

Yes, today’s models can master increasingly difficult math problems. But they often do so within the boundaries of what they’ve seen. Beyond that, they struggle.

Our findings show that RL is highly effective within those boundaries: it can substantially boost performance on known problem types and even support moderate generalization to harder variants. However, the most demanding aspects of mathematical reasoning remain largely unresolved: scaling to higher complexity, creatively integrating skills, and discovering entirely new solution strategies.

Recent works reinforce this point. Spurious Rewards [Shao et al., 2025] shows that RLVR can boost Qwen model accuracy by more than 10 percentage points even when the reward signal is random or incorrect, simply by surfacing reasoning patterns already acquired during pre-training. The Illusion of Thinking paper [Shojaee et al., 2025] demonstrates that these same patterns fail to generalize once problem complexity surpasses a threshold, leading to a sharp drop in accuracy. Together with our results, these studies indicate that current improvements often reflect amplification of existing priors rather than the acquisition of new algorithms at test time.

Beyond highlighting these limitations, we hope this study encourages the community to explore smarter scaling solutions rather than brute-force approaches. Although many of the identified failure cases could potentially be patched through targeted data augmentation or synthetic scaffolding, such short-term fixes may obscure deeper, structural weaknesses in model reasoning. Our objective is not only to expose these limitations, but also to inspire strategies that fundamentally equip models with robust, efficient mathematical reasoning capabilities: ones that go beyond what can be solved through more data or larger models alone.

Paper: http://arxiv.org/abs/2506.18880
