
Tülu 3: The next era in open post-training

November 21, 2024

Try Tülu 3 in the Ai2 Playground: https://playground.allenai.org


Post-training — a collection of techniques including instruction tuning followed by reinforcement learning from human feedback — has become a vital step in refining behaviors and unlocking new capabilities in language models. Since early approaches such as InstructGPT and the original ChatGPT, the sophistication and complexity of post-training approaches have continued to increase, moving towards multiple rounds of training, model merging, leveraging synthetic data, AI feedback, and multiple training algorithms and objectives (such as in the Llama 3.1 report).

Early in 2023, the open ecosystem thrived on multiple versions of synthetic data for instruction fine-tuning, building on techniques like Self-Instruct. Later in the year, the community gained further performance by aligning models with Direct Preference Optimization (DPO) on small preference datasets. Since then, progress in truly open methods has largely stalled, as measured by key evaluations such as MATH and IFEval, and by community standards such as ChatBotArena, where open data is not available for any model in the top 50.

This lack of transparency creates challenges for reproducibility and hinders progress in understanding how specific fine-tuning strategies impact model performance. With Tülu 3, we are releasing state-of-the-art post-trained models with every step in the pipeline open – training datasets, data curation tools, data decontamination scripts, training code, evaluation suites, etc. We believe this will both close the gap to closed recipes for post training and act as a foundation for the next chapter of open post-training research.

Our best-performing recipe yields Tülu 3, a state-of-the-art post-trained model outperforming open-weight post-trained models of the same size such as Llama 3.1-Instruct, Qwen2.5-Instruct, Mistral-Instruct, and Nemotron on our comprehensive evaluation suite – highlighted in the tables and plot below. We also demonstrate that our open-source models not only achieve state-of-the-art performance but also close the gap to the capabilities of their proprietary counterparts.

Tülu 3 Overview

The advances in Tülu 3 come from careful data curation and new permissively licensed training datasets targeting core skills (Tülu 3 Data), improved training infrastructure (Tülu 3 Code), a reproducible evaluation toolkit (Tülu 3 Eval), and innovative methodologies and guidance through the training stages (Tülu 3 Recipe).

Tülu 3 Recipe

We produce Tülu 3 models (8B and 70B) through a five-part post-training recipe on top of pre-trained language models (namely Llama 3 Base). This includes:

(1) careful prompt curation and synthesis targeting core skills,

(2) supervised finetuning (SFT) on our carefully selected mix of prompts and their completions,

(3) Direct Preference Optimization (DPO) on both off- and on-policy preference data,

(4) a new RL-based method to enhance specific skills with verifiable rewards, and

(5) a standardized evaluation suite for development, decontamination, and the final evaluation stage.

Tülu 3 Data

Prompts represent the diverse ways users may interact with models, and serve as the essential component for all post-training stages. To target the desired core skills, we curate a diverse and high quality set of prompts from publicly available datasets with clear provenance and synthetically generate prompts to fill any gaps.

Open Source Data. We start this process with a broad survey of public datasets, including those annotated by dedicated workers, sourced from real users, and synthesized with models. We then manually review each individual dataset and select those that offer: (1) diversity, to enhance models’ generalization; (2) coverage of challenging skills such as complex reasoning, coding, and precise instruction following; (3) clear data provenance, only allowing datasets with correct and clear licenses; and (4) rigorous decontamination, removing any training set that overlaps with more than 2% of our evaluation suite.
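
To make the decontamination criterion concrete, below is a minimal sketch of the kind of n-gram overlap check described above; the 8-gram size, tokenization, and function names are illustrative assumptions rather than the released Tülu 3 decontamination scripts.

```python
# Minimal sketch of n-gram-overlap decontamination (illustrative assumptions;
# the released Tulu 3 scripts may differ in n, tokenization, and matching).

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(train_texts: list[str], eval_texts: list[str], n: int = 8) -> float:
    """Fraction of evaluation examples sharing at least one n-gram with the training set."""
    train_grams: set[tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    hits = sum(1 for text in eval_texts if ngrams(text, n) & train_grams)
    return hits / max(len(eval_texts), 1)

# A training set would be dropped when its overlap exceeds 2% of the evaluation suite:
# keep_dataset = contaminated_fraction(train_texts, eval_texts) <= 0.02
```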

Synthetic data curation. To address the growing need for diverse and skill-specific datasets, we incorporate synthetic data generation as a complementary approach. Our synthetic data generation effort follows the persona-driven methodology of Chan et al. (2024). The key idea is to use different personas (e.g., "A machine learning researcher focused on neural networks") with a data synthesis prompt (e.g., "create a coding problem") to steer an LLM to synthesize data with corresponding perspectives. Using this approach, we generate prompts for specific skills such as precise instruction following, math, and coding.
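
As a rough illustration of pairing personas with synthesis instructions, here is a minimal sketch; the template, example personas, and the idea of sending the resulting string to a generator model are assumptions for illustration, not the exact prompts used for Tülu 3.

```python
# Illustrative sketch of persona-driven prompt synthesis (template, personas,
# and generator call are assumptions; Tulu 3's actual prompts may differ).
import random

PERSONAS = [
    "A machine learning researcher focused on neural networks",
    "A high-school math teacher preparing competition problems",
    "A backend engineer who reviews Python pull requests",
]

TEMPLATE = (
    "Adopt the following persona: {persona}\n"
    "From this perspective, {task}\n"
    "Return only the problem statement, without a solution."
)

def build_synthesis_prompt(task: str) -> str:
    """Pair a random persona with a data-synthesis instruction for an LLM."""
    persona = random.choice(PERSONAS)
    return TEMPLATE.format(persona=persona, task=task)

if __name__ == "__main__":
    # The resulting string would be sent to an LLM to synthesize one new prompt,
    # e.g. for coding: build_synthesis_prompt("create a coding problem")
    print(build_synthesis_prompt("create a math word problem requiring multi-step reasoning"))
```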

In total, we collect 939,344 prompts to use with our later training recipes, of which 57% are sourced from public resources and 43% are synthetically generated in house.

Supervised Finetuning: Data Collection and Composition

To design our final SFT mix, we start by building skill-specific data mixtures and models, keeping the mixtures that led to the best performance on individual skills while ignoring other evaluations. This strategy was employed to approximate the upper performance limit for each skill. We then combine these mixtures to create our initial Tülu 3 preview mix and iteratively add or remove datasets to improve lagging skills, decontaminating against our evaluations and downsampling particularly large datasets, as sketched below. We found that doing SFT on our carefully curated final mix led to substantial performance improvements across several tasks, as shown in the table.
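
The combining and downsampling step can be thought of as simple per-source capping; a minimal sketch under that assumption (the cap values, data structure, and function name are hypothetical):

```python
# Illustrative sketch of combining skill-specific SFT mixtures while
# downsampling very large sources (cap values are assumptions).
import random

def combine_sft_mix(sources: dict[str, list[dict]], caps: dict[str, int], seed: int = 0) -> list[dict]:
    """Concatenate per-skill datasets, randomly downsampling any source above its cap."""
    rng = random.Random(seed)
    mix: list[dict] = []
    for name, examples in sources.items():
        cap = caps.get(name, len(examples))
        if len(examples) > cap:
            examples = rng.sample(examples, cap)
        mix.extend(examples)
    rng.shuffle(mix)
    return mix

# Example: cap a particularly large coding set while keeping smaller sets whole.
# mix = combine_sft_mix({"math": math_data, "coding": coding_data}, caps={"coding": 50_000})
```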

All of our new synthetic SFT datasets contain responses created by either GPT-4o or Claude 3.5 Sonnet (for coding), which produced the highest-quality responses.

Preference Tuning: Data and Algorithms

We convert a subset of our collected prompts (roughly 200-300K) into preference data using both on-policy (the Tülu 3 suite) and off-policy models (other available instruct models). We extend the UltraFeedback pipeline to scale our preference data. We first sample prompts from our prompt data pool (we deliberately select both used and unused SFT prompts at this stage). For a given prompt, we randomly choose multiple models from our model pool to generate responses. To include on-policy data, we also generate responses from our Tülu 3 SFT model. Finally, we use an LLM-as-a-judge, specifically GPT-4o-2024-0806, to rate each response from 1 to 5 across four aspects: helpfulness, instruction-following, honesty, and truthfulness. After computing each response's average score, we take the highest-rated response as the chosen response and randomly sample one of the responses with a lower mean as the rejected response.
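
A minimal sketch of the chosen/rejected selection step, assuming the per-aspect judge ratings have already been collected (the data structure and tie-handling are assumptions, not the exact Tülu 3 pipeline):

```python
# Illustrative sketch of turning judged responses into one DPO preference pair.
# Assumes each response already carries 1-5 ratings for the four judged aspects.
import random
from statistics import mean

ASPECTS = ("helpfulness", "instruction_following", "honesty", "truthfulness")

def to_preference_pair(prompt: str, judged: list[dict], seed: int = 0) -> dict:
    """Pick the highest-mean response as chosen and a random lower-mean one as rejected."""
    rng = random.Random(seed)
    scored = [(mean(r["scores"][a] for a in ASPECTS), r) for r in judged]
    best_score, chosen = max(scored, key=lambda pair: pair[0])
    lower = [r for score, r in scored if score < best_score]
    # If every response ties, fall back to any other response as the rejected one.
    rejected = rng.choice(lower) if lower else rng.choice([r for _, r in scored if r is not chosen])
    return {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}
```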

We additionally curate preference datasets for targeted skills such as precise instruction following by rewriting the SFT instruction-following prompts with modified constraints and generating rejected responses for the modified prompts.

We experimented with several preference tuning algorithms, including variants of DPO and PPO. We found that the results are roughly similar across algorithms given careful hyperparameter tuning and length normalization. We therefore prioritized simplicity and efficiency, using DPO throughout development and for training our final models, in lieu of more costly investigations into PPO-based methods. We performed several rounds of data mixture ablations and extensive hyperparameter tuning, similar to our SFT step, to maximize average performance on the development evaluations while also excelling at targeted skills. Below is a summary of what we found:

Findings

  • Length-normalized DPO achieved better performance than several other preference tuning algorithms, including PPO, standard DPO, and SimPO (a sketch of the length-normalized loss follows this list).
  • Scaling the number of unique prompts improved downstream DPO performance.
  • The presence of new prompts in the DPO mix (as opposed to reusing prompts from SFT) can help improve downstream DPO performance.
  • Including on-policy data improved aggregated downstream DPO performance compared to a completely off-policy dataset where completions were sampled from other models.
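
For reference, here is a minimal sketch of a length-normalized DPO loss for a single preference pair, where each summed log-probability is divided by its response length before the usual DPO margin is formed; the function and variable names are ours, not the Tülu 3 training code.

```python
# Minimal sketch of a length-normalized DPO loss for one preference pair.
# Inputs are summed token log-probabilities under the policy and the frozen
# reference model (assumed precomputed); names are illustrative.
import math

def length_normalized_dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                               ref_chosen_logp: float, ref_rejected_logp: float,
                               chosen_len: int, rejected_len: int,
                               beta: float = 0.1) -> float:
    """-log sigmoid(beta * (normalized chosen log-ratio - normalized rejected log-ratio))."""
    chosen_ratio = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    logits = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits)); adequate for a sketch
```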

New Methodology: Reinforcement Learning on Verifiable Rewards

In Tülu 3, we introduce Reinforcement Learning with Verifiable Rewards (RLVR), a novel method for training language models on tasks with verifiable outcomes such as mathematical problem-solving and instruction following.

RLVR leverages the existing RLHF objective but replaces the reward model with a verification function. When applied to domains with verifiable answers, such as mathematics and verifiable instruction following tasks, RLVR demonstrates targeted improvements on benchmarks like GSM8K while maintaining performance across other tasks.
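
In other words (notation ours), the KL-regularized objective stays the same, but the learned reward model is swapped for a verification function that pays a fixed reward only when the output checks out:

```latex
% Sketch of the RLVR objective: standard KL-regularized RLHF with the reward
% model replaced by a verification function v (alpha is a fixed positive reward).
\[
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, v(x, y) \,\big]
\;-\; \beta\, \mathrm{KL}\!\left[\, \pi_\theta(\cdot \mid x)\,\middle\|\,\pi_{\mathrm{ref}}(\cdot \mid x) \,\right],
\qquad
v(x, y) =
\begin{cases}
\alpha & \text{if } y \text{ is verifiably correct},\\[2pt]
0 & \text{otherwise}.
\end{cases}
\]
```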

RLVR can be seen as a simplified form of existing approaches for bootstrapping LM reasoning (Eric Zelikman et al., Du Phan et al.) or a simpler form of RL with execution feedback, in which we simply use answer matching or constraint verification as a binary signal to train the model. In other words, the policy only receives a reward when its generated responses are verifiably correct.
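
As a concrete illustration of such binary signals, here is a minimal sketch of the two kinds of verifiers mentioned above, answer matching for math-style problems and constraint verification for instruction-following prompts; the normalization and the specific constraints checked are assumptions, not the exact Tülu 3 implementation.

```python
# Illustrative binary verifiers for RLVR-style training signals (the answer
# normalization and constraint checks are assumptions, not the exact Tulu 3 code).
import re

def math_answer_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final number in the completion matches the gold answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def instruction_constraint_reward(completion: str, min_words: int, required_keyword: str) -> float:
    """1.0 only if every verifiable constraint on the output is satisfied."""
    satisfied = (
        len(completion.split()) >= min_words
        and required_keyword.lower() in completion.lower()
    )
    return 1.0 if satisfied else 0.0

# During RL training, these replace a learned reward model: the policy is
# rewarded only when its response is verifiably correct.
```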

We found that integrating RLVR as a component of the generalist training pipeline yields improvements of up to 1.7, 3.3, and 1.3 points over the DPO checkpoint on MATH, GSM8K, and IFEval, respectively. Starting RLVR from the SFT checkpoint results in bigger gains, but we found the best final models came from training with DPO before RLVR. Surprisingly, RLVR also led to improvements on tasks it was not optimized for, including BigBenchHard, DROP, and AlpacaEval 2.

Tülu 3 Evaluation

A key factor in the success of our post-training approach is establishing clear performance goals and evaluation tools to guide improvement through these stages. With Tülu 3 Eval, we release a unified, standardized evaluation suite and a toolkit to guide the development and assessment of final models and to decontaminate training data against evaluation benchmarks.

Tülu 3 Eval is constructed with the following goals:

  1. Our evaluations should be reproducible.
  2. We should evaluate models' generalization to unseen tasks, not just the specific benchmarks we use for development.
  3. Our evaluation setup (e.g., templates and strategies for prompting) should be fair to a wide range of models.

With all the released resources, others can take open base models and finetune them to high performance on any task of interest, laying the foundation for post-training research within complex, multi-objective, multi-stage training regimes.

