DR Tulu: An open, end-to-end training recipe for long-form deep research
November 18, 2025
Ai2
Deep research is about building agentic systems that can plan, search, and synthesize information from diverse sources to produce in-depth, well-attributed answers to complex questions. Done well, these capabilities could accelerate scientific discovery and allow students and professionals to explore unfamiliar domains with expert-level rigor, backed by transparent citations and reasoning traces.
Given the increasing success of proprietary deep research systems, there has been growing interest in building open alternatives. Many recent approaches rely on Reinforcement Learning from Verifiable Rewards (RLVR)—training agents on short-form QA tasks where answers can be automatically verified through comparison to a ground-truth answer.

However, these existing RLVR recipes don't directly transfer to open-ended deep research tasks. Training agents to handle long-form, tool-intensive research workflows is difficult: models must integrate evidence across many sources while justifying each step, meaning that there isn’t a single "correct" answer to verify against. Evaluating long-form responses is intrinsically challenging—the criteria for quality are often underspecified, static rubrics can't capture the full range of response quality, and LM judges must keep pace with a rapidly evolving, incredibly vast body of world knowledge.

Because of these difficulties, prior work often resorts to fixed, hand-crafted report generation pipelines built on closed models. To our knowledge, the community still lacks both a clear understanding and a practical recipe for training fully open deep research agents.
To address these challenges, we introduce Deep Research Tulu (DR Tulu), the first open model that is directly trained for long-form deep research tasks through an end-to-end training recipe that combines supervised fine-tuning (SFT) and Reinforcement Learning with Evolving Rubrics (RLER). DR Tulu starts from a strong base model and progresses through multiple training stages: SFT on high-quality, naturally occurring information-seeking queries, followed by online RL with RLER tailored to long-form research.
Combined with our agent stack, which lets DR Tulu flexibly choose among multiple search and browsing tools, this training recipe yields an open model that substantially outperforms prior open deep research models – including much larger systems – across rigorous industry benchmarks. Our Qwen3-based DR Tulu-8B example model even matches or exceeds several proprietary research agents while remaining significantly smaller – and cheaper – per deep research query.
To make this work reproducible and extensible, we’re releasing all the components of DR Tulu: the full training recipe and code, our DR Tulu-8B checkpoint, our RLER rubric generation and training framework, and dr-agent-lib, an open research library built on MCP with multi-tool search, asynchronous tool calling, and an accompanying evaluation suite.
Demo | Models & data | Code | Technical report
Conducting deep research in steps
Our core challenge was building a model that can flexibly adapt its depth of response, switching between concise answers and multi-paragraph reports depending on a question’s complexity. Deep research is inherently dynamic: as the model searches and acquires new information, the space of possible outputs evolves during execution, making fixed rubrics inadequate. On top of that, evaluating multi-source synthesis requires verifying that claims are faithfully grounded across multiple documents and reasoning steps, which is far harder than checking short-form answers.
At inference time, DR Tulu runs an auto-search loop and chooses among three actions:
- think for internal planning
- call_tool to invoke a search or browsing tool
- answer to produce a final response
Inside the final answer, the model wraps claims in citation tags that link back to supporting sources.
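To make this concrete, here is a minimal sketch of the auto-search loop in Python. The method names, argument shapes, and tag handling are illustrative assumptions; the actual control flow is implemented in dr-agent-lib and our released inference code.

```python
# Minimal sketch of the auto-search loop (hypothetical helper names and parsing;
# the real control flow lives in dr-agent-lib and may differ).
def run_deep_research(model, tools, question, max_steps=20):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model.generate(history)  # emits a think, call_tool, or answer action
        history.append({"role": "assistant", "content": step.text})
        if step.action == "call_tool":
            # e.g. tools["google_search"](query=...) or tools["web_browse"](url=...)
            result = tools[step.tool_name](**step.tool_args)
            history.append({"role": "tool", "content": result})
        elif step.action == "answer":
            return step.text  # final report with citation tags linking claims to sources
    return None  # step budget exhausted without a final answer
```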
When given a research question, the model begins by planning what information it needs and which sources to consult. It then iteratively searches and gathers evidence from multiple places, synthesizing findings, identifying gaps, and refining its strategy based on what it learns. DR Tulu adapts its search depth to question complexity—simple queries might need one or two searches, while complex research questions could involve many tool calls exploring multiple angles.
Research questions demand varied information sources. Scientific research benefits from scholarly databases, healthcare queries need authoritative medical sources, and general inquiries work best with broad web search. To support this diversity, we built our inference system using the Model Context Protocol (MCP), treating tools as swappable components. In our default setup, DR Tulu has access to three search tools:
- google_search, which returns top web snippets
- web_browse, which extracts full-page text from URLs
- paper_search, which retrieves relevant paragraphs from open-access research papers
This MCP-based design lets you bring your own tools – API search, local retrieval and reranking, site-specific readers, or domain-specific databases – via a unified protocol. Our agent library, dr-agent-lib, provides a programmable MCP-based frontend for experimenting with prompt templates, multi-stage workflows, and fine-grained tool-calling strategies, without retraining the underlying model.
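As an example of what "bring your own tools" can look like, the sketch below exposes a custom retrieval tool as an MCP server using FastMCP from the reference Python SDK. The tool body is a stub, and the server name and function signature are assumptions; how dr-agent-lib discovers and calls the server depends on your configuration.

```python
# Sketch of exposing a custom retrieval tool over MCP with the reference Python
# SDK (FastMCP). The tool body is a placeholder; swap in your own index or API.
from mcp.server.fastmcp import FastMCP

server = FastMCP("local-retrieval")

@server.tool()
def local_paper_search(query: str, top_k: int = 5) -> list[str]:
    """Return the top_k most relevant paragraphs for a query (stub implementation)."""
    return [f"[placeholder paragraph {i} for: {query}]" for i in range(top_k)]

if __name__ == "__main__":
    server.run()  # serve over stdio so an MCP client (such as the agent) can call the tool
```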
Training a long-form research agent with RLER
Training research agents typically relies on reward signals. Recently, RLVR has proven effective for complex tasks with clear success criteria (e.g., math problems with verifiable answers, pass/fail coding tasks). But this approach struggles with open-ended deep research, where there's no single, clear evaluation criterion.
Using a general-domain LM-as-judge that offers high-level feedback on helpfulness or writing quality provides convenient but biased signals. Models learn to optimize superficial patterns rather than genuine quality: they produce responses that sound comprehensive without synthesizing diverse sources, cite cherry-picked evidence that's easy to retrieve rather than the most relevant, or exploit consistent judge preferences for certain writing styles or response lengths. These issues compound as the model discovers and exploits weaknesses in static evaluation criteria.
Inspired by the success of evaluating long-form responses with detailed rubrics, including our ScholarQA-CS line of benchmarks, we address this with RLER. RLER makes the reward function itself adaptive, evolving along three main axes:
- Instance-specific, search-grounded criteria. Rather than applying generic evaluation prompts, we build a rubric tailored to each question. For every training query, we first run a web and paper search seeded by the original question and feed the question plus retrieved context into a rubric-generator model. This produces a persistent set of search-based rubrics that encodes up-to-date, instance-specific criteria for what a good answer should contain.
- Positive and negative evolving rubrics. During online RL, we periodically sample multiple rollouts from the current policy and ask the rubric generator to propose new criteria that compare these responses. This yields two types of evolving rubrics: positive and negative rubrics. Positive rubrics reward new, high-value strategies or evidence the model has discovered but that isn’t yet captured in the rubric pool (e.g., consulting an underused data source or providing an especially useful intermediate analysis). Negative rubrics explicitly penalize failure modes and reward hacking, such as copying retrieved text verbatim, padding answers with irrelevant content just to increase apparent coverage, or overusing citations without adding genuine synthesis. By keeping both positive and negative rubrics in play, RLER both encourages emerging good behaviors and quickly suppresses emerging exploitative ones.
- Dynamic rubric buffer and auxiliary citation rewards. As training proceeds, we maintain a rubric buffer that filters and ranks rubrics based on how discriminative they are. Rubrics that no longer differentiate between good and bad responses – those with near-zero reward variance – are dropped, and we retain only a bounded number of the most informative rubrics for each question. Alongside rubric-based rewards, we add small format and citation rewards that check whether the model follows the expected output protocol and attaches faithful citations that actually support its key claims. A small sketch of how these reward signals combine appears after this list.
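As a rough illustration, the sketch below scores one rollout against the current rubric pool and prunes low-variance rubrics from the buffer. The judge interface, reward weights, and buffer parameters are assumptions for readability, not the released RLER implementation.

```python
# Illustrative sketch of RLER-style reward aggregation and rubric-buffer pruning.
# judge(response, rubric_text) is assumed to return a score in [0, 1].
import statistics
from dataclasses import dataclass

@dataclass
class Rubric:
    text: str
    sign: float  # +1.0 for positive rubrics, -1.0 for negative rubrics

def rubric_reward(response: str, rubrics: list[Rubric], judge) -> float:
    if not rubrics:
        return 0.0
    return sum(r.sign * judge(response, r.text) for r in rubrics) / len(rubrics)

def total_reward(response, rubrics, judge, format_ok: bool, citation_score: float,
                 w_format: float = 0.1, w_cite: float = 0.2) -> float:
    # The rubric term dominates; format and citation checks are small auxiliary rewards.
    return (rubric_reward(response, rubrics, judge)
            + w_format * float(format_ok)
            + w_cite * citation_score)

def prune_buffer(rubrics: list[Rubric], scores_per_rubric: list[list[float]],
                 max_size: int = 16, min_var: float = 1e-3) -> list[Rubric]:
    """Drop rubrics whose scores barely vary across rollouts; keep the most discriminative."""
    scored = [(statistics.pvariance(scores), rubric)
              for rubric, scores in zip(rubrics, scores_per_rubric)]
    scored = [(var, rubric) for var, rubric in scored if var > min_var]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [rubric for _, rubric in scored[:max_size]]
```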
Equipped with these mechanisms, we train DR Tulu in multiple high-level steps: prompt curation, SFT with teacher-generated trajectories to establish basic research skills, and online RL with RLER to refine tool use, synthesis quality, and citation behavior.
SFT for cold start
In preliminary experiments, we found that applying RLER directly to a base LM led to weak tool use and low-quality reports, and failed to promote effective exploration. To bootstrap deep research skills, we first run a distillation SFT stage.
We start by curating diverse prompts that reflect realistic open-ended deep research tasks. We draw from publicly available user-interaction logs – SearchArena and our OpenScholar platform – and filter them with an LM that rates each prompt on a 1–5 scale for whether it truly demands multi-step search, planning, and synthesis. This yields a pool of realistic long-form research questions. We also mix in a moderate amount of short-form, verifiable data from HotpotQA, TaskCraft, WebWalker-Silver, and MegaScience, plus additional challenging synthetic prompts inspired by PopQA. These questions ensure DR Tulu can still answer concise fact-based queries and reason about shorter chains of evidence, even though the model is primarily optimized for long-form research.
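As an illustration of this filtering step, the snippet below sketches a rating-based filter; the prompt wording and keep threshold are hypothetical, not the exact ones used to build our training pool.

```python
# Hypothetical sketch of the LM-based difficulty filter for candidate prompts.
RATING_PROMPT = (
    "Rate the following user query from 1 to 5 for how much it requires "
    "multi-step search, planning, and synthesis to answer well. "
    "Reply with a single integer.\n\nQuery: {query}"
)

def keep_prompt(query: str, rate_with_lm, threshold: int = 4) -> bool:
    """rate_with_lm sends a prompt to a rater LM and returns its text reply."""
    reply = rate_with_lm(RATING_PROMPT.format(query=query)).strip()
    return reply.isdigit() and int(reply) >= threshold
```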
We use GPT-5 as a teacher to generate full trajectories – explicit “thinking” traces, tool calls, tool outputs, and final answers – on our curated pool of long-form and short-form prompts. GPT-5 is given a detailed system prompt that describes the deep research workflow and exposes the same three tools (google_search, paper_search, web_browse), and we ask it to produce entire trajectories end-to-end.
Because GPT-5 doesn’t expose its native internal reasoning, we instruct it to emit explicit mock thinking tokens before each tool call or answer. We then apply two lightweight rejection-sampling filters, sketched after the list, to ensure trajectories satisfy our requirements:
- For all prompts, we verify that trajectories follow the expected tool-calling and answer formats (e.g., properly structured tool calls, well-formed final answers with citations).
- For short-form prompts, we discard trajectories whose final answer does not exactly match the gold answer, following prior RLVR-style work on verifiable QA.
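A minimal version of these two checks might look like the following. The tag names and normalization are illustrative assumptions; the exact trajectory format and matching rules are defined in our released data pipeline.

```python
# Illustrative rejection-sampling filters; tag names and matching rules are
# assumptions, not the exact checks used in the released pipeline.
import re

def has_valid_format(trajectory: str) -> bool:
    """Keep trajectories with balanced tool-call tags and a cited final answer."""
    has_answer = "<answer>" in trajectory and "</answer>" in trajectory
    has_citation = re.search(r"<cite[^>]*>.+?</cite>", trajectory, re.DOTALL) is not None
    balanced_calls = trajectory.count("<call_tool>") == trajectory.count("</call_tool>")
    return has_answer and has_citation and balanced_calls

def short_form_matches_gold(final_answer: str, gold: str) -> bool:
    """Exact match after light normalization, for short-form verifiable prompts."""
    normalize = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return normalize(final_answer) == normalize(gold)
```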
The resulting filtered dataset contains a mixture of high-quality long-form and short-form trajectories. Training DR Tulu-8B on this data gives the model a reasonable initial strategy for planning, calling tools, and citing sources before we introduce RLER.
RLER with asynchronous RL after cold start
After SFT establishes basic deep research capabilities, we further train DR Tulu-8B with online RL using RLER, allowing the model to explore and improve both its tool use and answer quality in a web-enabled environment.
We build on GRPO, using a variant that supports multi-rollout training and scalable, tool-augmented generation. For RL, we train exclusively on long-form questions from our curated pool (including additional SearchArena/OpenScholar prompts and questions from this dataset). At each training step, we:
- Generate multiple rollouts per query with real tool calls.
- Compute rewards by combining search-based persistent rubrics, evolving positive and negative rubrics, and small auxiliary format and citation rewards.
- Update the model using a GRPO-style loss, with sample packing and a 1-step asynchronous training setup that lets us overlap generation and learning (a sketch of the group-normalized advantage and clipped loss follows this list).
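For readers less familiar with GRPO, the sketch below shows the two standard ingredients this step builds on: group-normalized advantages and a clipped surrogate loss. It is simplified and omits sample packing, asynchronous scheduling, and other details handled by the released training code.

```python
# Standard GRPO-style pieces: group-normalized advantages and a clipped
# surrogate loss (simplified; omits packing and asynchronous machinery).
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_rollouts] for one query -> per-rollout normalized advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
              advantages: torch.Tensor, clip: float = 0.2) -> torch.Tensor:
    """Clipped surrogate over token log-probs; shapes [rollouts, tokens] and [rollouts]."""
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)            # broadcast per-rollout advantage over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
    return -torch.minimum(unclipped, clipped).mean()
```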
To reduce computational overhead and keep tool usage efficient, we employ asynchronous tool calling: tool requests are sent immediately when the model triggers them, rather than waiting for an entire batch to finish generating. When a rollout issues a tool call, that rollout is temporarily put to sleep while other rollouts continue generating, allowing search and generation to overlap wherever possible. Our tool calls are mediated by dr-agent-lib, which manages concurrency, caching, and rate limits under the MCP abstraction.
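The asyncio sketch below illustrates the idea: each rollout awaits only its own tool result, so the event loop keeps other rollouts generating in the meantime. The policy and tool interfaces are hypothetical stand-ins for the real components managed by dr-agent-lib.

```python
# Conceptual asyncio sketch of asynchronous tool calling during rollout
# generation; interfaces are hypothetical stand-ins for the real components.
import asyncio

async def run_rollout(policy, tools, prompt):
    state = await policy.start(prompt)
    while not state.finished:
        step = await policy.generate_step(state)   # yields control between generation chunks
        if step.is_tool_call:
            # Only this rollout waits on the tool; others keep generating meanwhile.
            result = await tools.call(step.tool_name, step.tool_args)
            state = await policy.observe(state, result)
        else:
            state = step.next_state
    return state.trajectory

async def generate_batch(policy, tools, prompts):
    # All rollouts share one event loop, so search and generation overlap.
    return await asyncio.gather(*(run_rollout(policy, tools, p) for p in prompts))
```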
Strong performance on industry benchmarks
We evaluated DR Tulu-8B across seven benchmarks covering both long-form research synthesis and short-form factual retrieval.
Four benchmarks – ScholarQA-CSv2 (part of AstaBench), ResearchQA, DeepResearch Bench, and HealthBench – evaluate multi-source synthesis with citation requirements. ScholarQA-CSv2 focuses on literature synthesis over scientific papers, ResearchQA assesses synthesis of up-to-date scientific literature, DeepResearch Bench covers general-domain questions spanning technology and policy, and HealthBench targets healthcare research requiring expert-level guidance. Our experiments show that supporting multiple search modalities (general web, browsing, and scholarly search) is important for performance across these tasks.
Three additional benchmarks – SimpleQA, WebWalkerQA, and 2Wiki – test retrieval and factual grounding for concise answers, emphasizing precision over synthesis depth.
Long-form results. Our best 8B DR Tulu agent – DR Tulu-8B (RL) – substantially outperforms open baselines on all four long-form benchmarks—ScholarQA-CSv2, HealthBench, ResearchQA, and DeepResearch Bench. On ScholarQA-CSv2, it reaches 86.7 versus 42.5 for WebExplorer-8B and 32.9 for WebThinker-32B-DPO, a model 4x its size. On ResearchQA and DeepResearch Bench, DR Tulu-8B (RL) scores 71.1 and 41.8, compared to 64.8/36.7 for WebExplorer-8B and 48.6/23.3 for WebThinker-32B-DPO.
Beyond overall scores, DR Tulu-8B (RL) also improves the quality of long-form deep research reports. On ScholarQA-CSv2, it attains a rubric score of 84.8 with large gains in citation precision and recall (90.6 and 76.1), meaning answers are not only more comprehensive but also better grounded in the underlying literature.
Proprietary systems. Compared to proprietary deep-research agents, DR Tulu-8B (RL) matches or exceeds performance on several long-form benchmarks. On ScholarQA-CSv2, it outperforms all proprietary systems we evaluated, including OpenAI Deep Research (79.6) and Perplexity Deep Research (67.3). On ResearchQA and DeepResearch Bench, it remains competitive with Claude-Sonnet Search, Perplexity Sonar (high), Perplexity Deep Research, Gemini Deep Research, GPT-5+Search, and OpenAI Deep Research, often trailing by only a few points.
Despite similar average scores, model behavior can look quite different. On ScholarQA-CSv2, for example, OpenAI Deep Research generates answers that are roughly three times longer than DR Tulu’s and contain about twice as many citations, while achieving similar overall scores. DR Tulu’s reports are much more compact, suggesting that RLER encourages focused synthesis rather than simply adding length and references.
DR Tulu-8B (RL) is inexpensive to run for the high performance it delivers. Under our evaluation configuration (and counting only external API calls, assuming hardware and hosting costs are already paid), a typical deep research query costs approximately $0.00008. Even if the agent issues the absolute maximum number of search calls allowed during evaluation (10), the per-query API cost is still capped at about $0.0075.
By contrast, OpenAI Deep Research costs about $1.80 per ScholarQA-CSv2 query, and our Asta pipeline using Claude Sonnet costs about $1.30 per query.
HealthBench remains challenging. On HealthBench, DR Tulu-8B (RL) achieves 43.7, clearly ahead of open deep research models but still leaving significant headroom. Because most proprietary systems do not yet report HealthBench scores, we treat it as an open area for improvement, especially for queries that require expert-level clinical guidance.
Clinically grounded case study. Beyond industry benchmarks, we applied DR Tulu to a real-world challenge: investigating disease-causing gene variants using expert-curated data from the N=1 Collaborative. This task is a natural fit for deep research agents: the model has to search across bioinformatics databases, research papers, and case reports, then synthesize sparse, heterogeneous evidence into a coherent assessment of whether a variant is eligible for antisense oligonucleotide (ASO) gene therapies. Unlike our standard benchmarks, this isn’t an existing leaderboard task—it was constructed specifically for this study to test how well DR Tulu generalizes to a new, clinically meaningful setting.
We create a new evaluation dataset, GeneticDiseasesQA, consisting of 47 questions derived from 24 disease-causing genetic variants. Questions focus on information genetics experts use to assess variant eligibility for gene therapy strategies—reasoning about molecular properties, disease-causing mechanisms, and potential therapeutic approaches. For each question, we prompt deep research systems to generate a long-form report that both answers the question and provides supporting evidence for each claim.
Outputs are scored with GPT-4.1 along several dimensions, including final answer (whether the expert-annotated key fact is mentioned in the response), evidence quality (whether the right kinds of evidence appear in the cited statements), and evidence synthesis (whether the model links multiple sources together in a coherent, relationship-level explanation). We compare DR Tulu-8B (RL) to several baselines, including an auto-search baseline (Qwen3-8B + our search stack), our Asta agent using Claude Sonnet, OpenAI Deep Research (o4-mini), and GPT-5 + OpenAI Search. For DR Tulu, Qwen3-8B + Our Search, and Asta, we average over three runs; for OpenAI Deep Research and GPT-5 + Search, we report one run each due to cost and inference time.
The results show that DR Tulu consistently outperforms the Qwen3-8B + Our Search baseline across all categories, indicating that our method generalizes to tasks unseen during training. DR Tulu also outperforms both Asta (Claude Sonnet) and OpenAI Deep Research on the overall GeneticDiseasesQA score, with especially large margins over Asta on final answer and evidence quality. GPT-5 + OpenAI Search achieves the highest overall score, but DR Tulu achieves stronger evidence synthesis, producing reports that more clearly connect evidence across papers and databases.
Short-form results. Although DR Tulu-8B (RL) is optimized for long-form research, it also remains competitive on short-form QA. We evaluate on SimpleQA, 2Wiki, and WebWalkerQA, using top-1 predictions judged by GPT-4.1 under a unified evaluation pipeline. Across these tasks, DR Tulu-8B (RL) substantially outperforms our naive Qwen3-8B RAG baseline and improves the average short-form score by 4.4 points over our SFT-only model, showing that our long-form-oriented training recipe generalizes well to concise, factoid questions.
What we're releasing
We’re releasing the entire DR Tulu research and training stack under a permissive license.
Releasing all of DR Tulu’s components serves three goals. First, it enables reproducibility and transparency: we release our curated prompt datasets, training and evaluation code (including our RLER implementation), and our 8B model checkpoint so others can replicate our results and study how reward functions and tool configurations shape behavior. Second, it provides deployment flexibility—you can run the agent with your own MCP tool stack, infrastructure, and privacy constraints. Third, it supports extensibility: the dr-agent-lib agent library lets you plug in domain-specific tools and retrieval systems without retraining by simply describing new tools to the model. Taken together, these artifacts make DR Tulu the first fully open, end-to-end deep research framework.
We encourage you to experiment with different tool configurations, audit the agent’s research steps, and test how DR Tulu handles your domain's research questions. If you find issues or ways to improve the approach, we'd love to hear about them.