Language models - Tülu 3
Tülu 3 is a leading instruction-following model family, offering fully open-source data, code, and recipes designed to serve as a comprehensive guide for modern post-training techniques.
What Tülu 3 provides for researchers and developers
Models
Explore the collection of open-sourced instruct models created from our open data and recipes.
Data
The underlying training data for fine-tuning processes is the most important piece of the puzzle but often the element with the least transparency. Tülu 3 changes that.
Training
We open-source our scalable codebase for supervised finetuning (SFT), Direct Preference Optimization (DPO), Reinforcement Learning with Verifiable Rewards (RLVR), and all the other algorithms we considered when training Tülu.
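For readers less familiar with these objectives, here is a minimal, self-contained sketch of the DPO loss as it is commonly written. It is an illustration under our own naming assumptions, not the Tülu 3 training code itself.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) of the chosen/rejected completions under the policy being trained
    and under a frozen reference model.
    """
    # Implicit reward of each completion: scaled log-ratio of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the probability that the chosen completion outranks the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Example with dummy log-probabilities for a batch of two preference pairs:
loss = dpo_loss(
    torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
    torch.tensor([-12.8, -8.5]), torch.tensor([-14.2, -9.0]),
)
```

In practice the log-probabilities come from forward passes of the policy and a frozen copy of the SFT checkpoint; the loss above is then backpropagated through the policy only.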
Evaluation
We're sharing the codebase used to produce Tülu 3's results to make these evaluations more standardized and reproducible.
Our philosophy
Early work in language model post-training followed a standard recipe pioneered by models like InstructGPT, consisting of instruction tuning followed by preference fine-tuning. Since then, the sophistication and complexity of post-training approaches have continued to increase; however, most successful post-trained models offer limited information about their training data, code, or recipes. Tülu 3 pushes the boundaries of research in post-training and closes the gap between open and closed fine-tuning recipes. By openly sharing our data, recipes, and findings, we hope to uncover which paths lead to success for the open-source community and which do not, enabling the community to explore new and innovative post-training approaches.
Our approach
The Tülu 3 effort began with identifying key desirable capabilities for generalist language models, including knowledge, reasoning, mathematics, coding, instruction following, general chat, and safety – areas where current open post-training recipes often fall behind.
Our success is rooted in careful data curation, rigorous experimentation, innovative methodologies, and improved training infrastructure. In particular, we produce Tülu 3 models through a four-stage post-training recipe on top of pre-trained language models (namely Llama 3 Base): (1) careful prompt curation and synthesis, (2) supervised finetuning on a carefully selected mix of prompts and their completions targeting core skills, (3) preference tuning on a combination of off- and on-policy preference data, and (4) RLVR, a new RL-based method that enhances specific skills with verifiable rewards.
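To give a concrete flavor of stage (4), the sketch below shows one way a verifiable reward can be defined for a math prompt: instead of scoring completions with a learned reward model, the final answer is checked against a known ground truth and the policy receives a binary reward. This is an illustrative assumption, not the exact Tülu 3 implementation; the function names and the boxed-answer convention are ours.

```python
import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a model completion (a common
    convention on math benchmarks); returns None if no boxed answer is found."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.

    A reward like this can be plugged into a standard RL objective in place of
    a learned reward model, so the policy is only reinforced on completions
    that can be programmatically verified as correct.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


# Example usage with a hypothetical completion:
print(verifiable_reward("So the area is \\boxed{42}.", "42"))  # 1.0
```

Similar checks can be written for other verifiable skills, such as running unit tests against generated code or matching constrained instruction-following outputs.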
Our results
Tülu 3 models achieve state-of-the-art performance on our multi-skill evaluation compared to models of equivalent size and some closed, API-based models.