Theorizer: Turning thousands of papers into scientific laws
January 28, 2026
Ai2
Automated discovery is the effort to build AI systems that can do science: generate hypotheses, design experiments, run them, analyze results, and iterate. Most of the work in this space has focused on the experimental side. Our own CodeScientist, for instance, can take a research question and generate code-based experiments before executing them and surfacing candidate discoveries.
But in real science, experiments are usually in service of something higher-level: theory building. Theories compress many results into compact laws that explain and predict—they're how fields consolidate knowledge and make lasting progress. For example, Kepler's laws distilled centuries of astronomical observations into a handful of statements that describe planetary motion.
That raises a question we've been chasing: Can an AI system synthesize scientific theories by reading the research literature?
Today we're releasing Theorizer, a multi-LLM framework that takes a user query ("make me theories about X"), reads the relevant scientific literature, and generates structured claims about what patterns hold across that research and where they apply. It's a way for scientists to get oriented in a new domain in minutes rather than months—and a step toward automating a part of science that has thus far remained stubbornly manual: building theories from scattered findings.
Alongside Theorizer, we're publishing a dataset of approximately 3,000 theories generated from a broad cross-section of AI/NLP research—useful as a starting point for anyone working in those areas, or as a benchmark for researchers interested in automated theory generation techniques.
What Theorizer does and how it works
Scientists typically focus on developing or testing individual theories, often over long periods of time. But what if thousands of theories could be synthesized at once—by automatically analyzing the full body of research within a field?
There are lots of AI tools that summarize papers or produce detailed literature reviews. Theorizer instead identifies regularities (patterns that hold consistently across multiple studies) and expresses them as testable claims with defined scope and supporting evidence.
Each theory that Theorizer outputs is structured as a set of ⟨LAW, SCOPE, EVIDENCE⟩ tuples:
- The law is a qualitative or quantitative statement—a regularity Theorizer believes holds. A qualitative law might express a directional relationship ("X increases Y" or "A causes B"), while a quantitative law specifies explicit numerical bounds.
- The scope indicates where the law should hold, including domain constraints, boundary conditions, and known exceptions (e.g., "applies only for small R" or "does not hold when P is present").
- The evidence is extracted empirical or observational support traced back to specific papers in the input corpus (additional details below), including experimental findings, reported effects, and quantitative results that bear on the law.
Theorizer also produces a theory name and high-level description to situate it within the broader literature. In practice, each generated theory typically contains one or two laws.
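To make the structure concrete, here is a minimal sketch of how one of these theory records might be represented. The Python classes and field names below are illustrative, not the format used in the released code or dataset:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """Empirical support traced back to a specific paper in the input corpus."""
    paper_id: str    # identifier of the source paper
    finding: str     # extracted result or effect that bears on the law

@dataclass
class LawScopeEvidence:
    """One <LAW, SCOPE, EVIDENCE> tuple."""
    law: str                       # qualitative or quantitative regularity
    scope: str                     # domain constraints, boundary conditions, known exceptions
    evidence: list[Evidence] = field(default_factory=list)

@dataclass
class Theory:
    """A named theory; in practice it typically contains one or two laws."""
    name: str
    description: str
    tuples: list[LawScopeEvidence] = field(default_factory=list)
```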
A multi-stage theory generation pipeline
At a high level, Theorizer is a literature-to-theory pipeline with three main stages.
Literature discovery. Given an initial query, Theorizer builds a corpus by retrieving up to 100 relevant papers to serve as evidence—a practical balance between context-window limits and extraction time. The pipeline reformulates the query into a literature search query, submits it to PaperFinder, uses Semantic Scholar to find open-access PDFs, downloads them, and converts them to full text via an OCR-based extraction pipeline. If it comes up short, it expands the candidate set by scanning retrieved full text for additional references, ranking them by relevance, and backfilling with the highest-scoring papers.
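A rough sketch of this stage is below. Every helper it calls (reformulate_query, search_paperfinder, fetch_open_access_pdf, ocr_extract, find_references, rank_by_relevance) is a placeholder standing in for the corresponding PaperFinder, Semantic Scholar, or OCR component, not a real API:

```python
MAX_PAPERS = 100  # practical balance between context-window limits and extraction time

def build_corpus(query: str) -> list[dict]:
    """Illustrative sketch of literature discovery; all helpers are placeholders."""
    search_query = reformulate_query(query)      # rewrite the user query for literature search
    corpus = []
    for paper in search_paperfinder(search_query)[:MAX_PAPERS]:
        pdf = fetch_open_access_pdf(paper)       # open-access PDF located via Semantic Scholar
        if pdf is not None:
            corpus.append({"paper": paper, "text": ocr_extract(pdf)})

    # Backfill: if retrieval came up short, mine references from the retrieved
    # full text, rank them by relevance to the query, and add the highest-scoring
    # candidates until the corpus reaches MAX_PAPERS.
    if len(corpus) < MAX_PAPERS:
        refs = [r for doc in corpus for r in find_references(doc["text"])]
        for ref in rank_by_relevance(refs, query):
            if len(corpus) >= MAX_PAPERS:
                break
            pdf = fetch_open_access_pdf(ref)
            if pdf is not None:
                corpus.append({"paper": ref, "text": ocr_extract(pdf)})
    return corpus
```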
Evidence extraction. To gather evidence, Theorizer generates an extraction schema tailored to the initial query—a structured template specifying which entities, variables, and empirical results matter for generating a theory. For a query about memory-augmented LMs, for example, the schema might capture the task type examined (e.g., question answering), specific memory mechanisms used (e.g., causal memory, spatial memory), evaluation setting, and performance comparisons with and without memory. An inexpensive extraction model then populates this schema for each paper, producing JSON-formatted extraction records that become input to theory synthesis.
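For the memory-augmented-LM query above, the generated schema and one populated record might look roughly like this. The field names and numbers are invented for illustration; the real schema is generated fresh for each query:

```python
# Illustrative schema: which fields the extraction model should fill in per paper.
extraction_schema = {
    "task_type": "string",            # e.g., "question answering"
    "memory_mechanism": "string",     # e.g., "causal memory", "spatial memory"
    "evaluation_setting": "string",   # benchmark or dataset used
    "score_with_memory": "number",
    "score_without_memory": "number",
}

# One populated extraction record for a single paper (hypothetical values).
extraction_record = {
    "paper_id": "example-paper-001",
    "task_type": "question answering",
    "memory_mechanism": "causal memory",
    "evaluation_setting": "long-context QA benchmark",
    "score_with_memory": 71.2,
    "score_without_memory": 64.8,
}
```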
Theory synthesis and refinement. The synthesis stage aggregates extracted evidence across papers and induces candidate theories using a prepopulated prompt. After generation, the initial batch is refined with a self-reflection step intended to improve internal consistency, evidence attribution, and specificity. As part of this process, the system also generates self-assessments of novelty for each law, filtering out laws that are too close to existing, well-known claims. Depending on how much evidence is extracted, the total token count can exceed the generating model's context window; when that happens, evidence is randomly subsampled.
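The subsampling procedure itself isn't prescribed; a simple version might look like the sketch below, where count_tokens and max_tokens stand in for whatever tokenizer and limit apply to the generation model:

```python
import random

def fit_evidence_to_context(records, max_tokens, count_tokens, seed=0):
    """Randomly subsample extraction records until they fit in the context window.

    `records` are per-paper extraction records; `count_tokens` is any function
    that estimates a record's token cost. Illustrative only.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    kept, used = [], 0
    for rec in shuffled:
        cost = count_tokens(rec)
        if used + cost > max_tokens:
            continue  # skip records that would overflow the token budget
        kept.append(rec)
        used += cost
    return kept
```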
Benchmarking theory generation
How does Theorizer know what makes a strong theory versus a weak or implausible one? In our technical report, we specify five desiderata:
- Specificity—laws should make testable claims
- Empirical support—claims should be consistent with existing evidence
- Predictive accuracy—claims should hold up against future results
- Novelty—laws should introduce insights not explicitly stated in prior work
- Plausibility—laws should have a reasonable scientific rationale
Evaluating thousands of theories by running new experiments isn't feasible, so we used two complementary approaches: LLM-as-a-judge scoring on the theory-quality desiderata described above, and a literature backtesting setup for predictive accuracy.
LLM-as-a-judge evaluation. We compared two modes of generation: parametric-only (what the model already "knows") versus literature-supported (Theorizer's default approach, which grounds generation in extracted evidence). We tested both modes under accuracy-focused and novelty-focused prompting conditions. Literature-supported theories scored substantially higher on specificity and empirical support in both settings, and higher on plausibility as well.
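To give a sense of how the judging works, the sketch below shows one way to prompt a judge model on these desiderata. The rubric, the 1-5 scale, and the call_llm hook are illustrative, not the exact setup from our evaluation:

```python
import json

JUDGE_TEMPLATE = """You are evaluating a scientific law generated from the literature.

Law: {law}
Scope: {scope}
Evidence summary: {evidence}

Rate the law from 1 to 5 on each criterion and reply as JSON with these keys:
- specificity: does the law make a concrete, testable claim?
- empirical_support: is the claim consistent with the cited evidence?
- novelty: does it add insight not explicitly stated in prior work?
- plausibility: is there a reasonable scientific rationale?
"""

def judge_law(law: str, scope: str, evidence: str, call_llm) -> dict:
    """`call_llm` is a placeholder for whichever judge model you use;
    it should take a prompt string and return the model's text reply."""
    reply = call_llm(JUDGE_TEMPLATE.format(law=law, scope=scope, evidence=evidence))
    return json.loads(reply)
```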
Backtesting predictive accuracy. To test whether Theorizer's theories actually predict future findings, we used a backtesting paradigm: generate theories with a fixed knowledge cutoff, then evaluate their predictions against subsequently published literature. Concretely, in our setup (sketched in code after this list):
- A language model expands each law into a list of specific predictions
- PaperFinder retrieves recent papers that might speak to those predictions
- Each paper is judged as supporting, contradicting, or providing no evidence for each prediction
- These judgments are tallied into precision and recall estimates
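The tallying step can be pictured roughly as follows; this is an illustrative estimator, and the technical report gives the exact definitions:

```python
def tally(judgments: dict[str, list[str]]) -> tuple[float, float]:
    """Turn per-(law, paper) judgments into rough precision/recall estimates.

    `judgments[law_id]` is the list of labels for that law's predictions,
    each one of "supports", "contradicts", or "no_evidence". Illustrative only.
    """
    supports = sum(labels.count("supports") for labels in judgments.values())
    contradicts = sum(labels.count("contradicts") for labels in judgments.values())
    # A law counts as "tested" if at least one retrieved paper bears on its predictions.
    tested_laws = sum(
        1 for labels in judgments.values() if any(v != "no_evidence" for v in labels)
    )
    precision = supports / (supports + contradicts) if (supports + contradicts) else 0.0
    recall = tested_laws / len(judgments) if judgments else 0.0
    return precision, recall
```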
Theorizer's default generation model (GPT-4.1) has a reported June 2024 knowledge cutoff. We used the first 12 months after that cutoff to supplement parametric knowledge with literature during generation, while holding out the most recent 6 months for evaluation. In total, we tested 2,983 laws against 4,554 unique papers across 16,713 law–paper evaluations.
Results. We found that while the literature-supported method is almost 7× more expensive than parametric-only generation, it yields more accurate and more predictive theories. In accuracy-focused generation, both methods achieve high precision (around 0.88–0.90), meaning predictions are almost always supported when tested. But literature-supported laws achieve higher recall (0.51 vs. 0.45)—more of them generate predictions that can actually be evaluated against subsequent work. In novelty-focused generation, the gap widens: literature support dramatically improves both precision (0.34 to 0.61) and recall (0.04 to 0.16).
Novelty and diversity. When repeatedly generating theories for the same query, we found that parametric-only generation quickly saturates into duplicates—the model recycles what it knows without new input. Literature-supported generation maintains a lower duplicate rate, likely because each retrieval surfaces different papers and evidence. When comparing the two approaches head-to-head, we found that after generating 40 theories, 32% of statements were non-duplicates—suggesting the two methods explore meaningfully different parts of the hypothesis space.
Caveats and future directions
It’s important to keep in mind that Theorizer is a research tool, and its outputs are hypotheses—not truth.
Backtesting has limited recall: roughly 51% of accuracy-focused theories have at least one paper that tests their predictions, and this drops sharply in novelty-focused generation. The literature is also biased toward positive results, which can make contradictory evidence harder to surface. Theorizer's cost and runtime are non-trivial (approximately 15–30 minutes per query, though parallelizable), and coverage depends on open-access papers, so the system currently works best in fields like AI/NLP.
Errors are also possible—Theorizer can produce partially accurate or misleading theories, so we recommend treating its output as a starting point. That said, we think this direction is promising. Scientific knowledge is growing faster than any one person can synthesize, and theory building remains a largely manual process. If automated systems can help compress the literature into structured, testable theories, they could become a useful tool for making sense of what we collectively know.
What we're releasing
Our full release includes Theorizer code on GitHub, with a user interface, API, and all prompts used in the pipeline. In the experiments reported in our technical report, the generation model was GPT-4.1 for schema generation, theory generation, and theory reflection, while GPT-5 mini was used for large-scale evidence extraction. You don't need to use those exact models to run Theorizer, but they're part of our reference implementation.
We're also releasing a dataset of approximately 3,000 theories spanning a wide cross-section of AI/NLP topics. To produce it, we ran Theorizer at scale using 13,744 source papers to synthesize 2,856 theories from 100 theory queries, with each theory typically containing one or two laws. Those 100 queries were designed to be broad and representative—we randomly sampled 50 papers from recent NLP/AI venues near the model's knowledge cutoff (ACL 2023, EMNLP 2023, AAAI 2023, NeurIPS 2024), then automatically generated two theory queries per paper (one general, one specific).
The dataset includes the theories themselves plus short LLM-generated summaries of the supporting evidence from each source paper. For the full technical details, read our report and explore the Theorizer release on GitHub.