Camels in a changing climate: Enhancing LM adaptation with Tulu 2
Hamish Ivison / December 8, 2023
Earlier this year, we released the Tulu suite of models, where we explored the space of instruction fine-tuning in light of the many instruction datasets and models being released. But the community didn't stop there - since the release of Tulu, better base models have been released, better datasets have been developed, and new adaptation methods (e.g., DPO) have shown promise. With Tulu 2, we have tested and incorporated these recent advancements into a new set of models, pushing the limits of open-source models further and providing the community with an open-source ChatGPT equivalent. Our new Tulu 2 models achieve at or near state-of-the-art performance on AlpacaEval and LMSYS's Chatbot Arena among openly-released models, and at the time of release were state-of-the-art on MT-Bench among all open models.
Alongside our improved models, we have also been hard at work improving our evaluation setup, adding new tasks and improving its speed! By making use of vLLM, we can get results across a varied set of benchmarks in under 30 minutes (for 7B-size models). Our evaluation benchmark now includes MMLU, GSM8k, TydiQA, HumanEval (which we call 'Codex-Eval'), AlpacaEval, ToxiGen, and TruthfulQA. We report averages across these benchmarks in the post below, but we encourage readers to read our preprint for per-task results!
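For a sense of what evaluation with vLLM looks like in practice, here is a minimal sketch of batched generation; the model path, prompts, and sampling settings are illustrative placeholders rather than our exact evaluation configuration (which lives in open-instruct):

```python
# Minimal sketch of batched generation with vLLM for evaluation.
# Model path, prompts, and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Question: What is 17 * 24? Answer:",
    "Translate to French: 'The weather is nice today.'",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

llm = LLM(model="allenai/tulu-2-7b")  # load the model once, reuse across benchmarks
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

Loading the model once and generating over large prompt batches is what makes running the full benchmark suite in under 30 minutes feasible for 7B models.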
Let's briefly go over what's changed from Tulu 1. The two biggest additions in Tulu 2 are our new dataset mixture and our use of DPO training on preference data. We also switched to Llama 2 as our base model since the original Tulu release, which provides a large boost in performance on its own.
1 - New dataset mixture
Since the release of Tulu 1, the community has doubled down on and improved distilled datasets, with methods like Evol-Instruct and Orca used to improve the quality of data distilled from existing strong models.
Additionally, recent work (e.g., LIMA, LIMIT) has suggested that "a few high-quality samples are all you need". Inspired by this, and wanting to reduce compute costs, we downsample the larger components of our original mixture, such as FLAN. Our Tulu 2 mixture contains 100k fewer samples than our original mixture while performing significantly better! We suspect further, more in-depth data curation may yield additional gains and shrink the mixture even more.
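As a rough sketch of what this kind of downsampling looks like with Hugging Face datasets - the dataset names, the assumption of a shared chat schema, and the 50k cap are all illustrative, not the exact Tulu 2 mixture recipe:

```python
# Illustrative sketch of downsampling one large component of an instruction
# mixture. Dataset names are placeholders; we assume both splits have already
# been converted to a shared chat format so they can be concatenated.
from datasets import load_dataset, concatenate_datasets

flan = load_dataset("your-org/flan-v2-converted", split="train")   # placeholder
oasst = load_dataset("your-org/oasst-converted", split="train")    # placeholder

# keep only a random 50k-sample subset of the large FLAN component
flan_small = flan.shuffle(seed=42).select(range(50_000))

mixture = concatenate_datasets([flan_small, oasst]).shuffle(seed=42)
print(f"mixture size: {len(mixture)}")
```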
2 - DPO Training
Inspired by the success of Zephyr-beta, we applied and scaled their DPO recipe to Llama 2 models of all sizes. Surprisingly, we found that this worked straight away, and led to significant improvements on open-ended generation benchmarks like AlpacaEval and MT-Bench. Interestingly, DPO finetuning did not significantly degrade performance on most capabilities, apart from TydiQA (likely due to the lack of multilingual data in our finetuning and DPO-training data!).
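For readers less familiar with DPO, here is a simplified sketch of the objective from the DPO paper, written over a batch of preference pairs; this is an illustration, not our actual training code (which lives in open-instruct and EasyLM):

```python
# Sketch of the DPO objective over a batch of preference pairs.
# Each input is the summed log-probability of the full response under either
# the policy being trained or the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log-ratios of policy vs. reference for chosen and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # push the policy to prefer the chosen response over the rejected one,
    # with beta controlling how far it can drift from the reference model
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Because the loss only needs log-probabilities from the policy and a frozen reference model, it avoids training a separate reward model or running online RL, which is part of why it was straightforward to scale to 70B.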
In our original Tulu paper, we also noted a strong correlation between model output length and performance on model-based benchmarks like AlpacaEval. We observed that this trend held with more recent AlpacaEval results, and found that while DPO improved AlpacaEval performance significantly, it also increased model output lengths.
However, our models are still significantly less verbose than most other models with similar scores.
While this all suggests DPO training is very useful for chat performance, we note that training on GPT-distilled outputs and evaluating with GPT-4-based metrics may result in inflated scores. To test our model more rigorously, the folks at LMSYS added Tulu 2+DPO 70B, our largest DPO-trained model, to Chatbot Arena, where real-world users could compare our model to other models with their own prompts. We find that Tulu 2+DPO 70B achieves the same rating as GPT-3.5-turbo-0314, and is the best overall open model tested!
Additional Experiments
We also ran several other ablations and experiments that are likely of interest to anyone working on LLM finetuning! I'll highlight two key ones here: QLoRA training and CodeLlama training.
1 - QLoRA Training
We initially explored QLoRA training as a way to further reduce compute costs, allowing us to fit a 70B model on a single 80GB A100. We started by exploring whether QLoRA training with various hyperparameters could match the original Alpaca model (i.e., when training only on Alpaca data). We found that while it was easier to match performance on classification tasks like MMLU, performance on open-ended generation tasks like AlpacaEval tended to fall behind full finetuning.
As such, we eventually decided against using QLoRA. However, since LoRA modules may prove useful on their own, we have trained and released QLoRA-trained Llama 2 models on the Tulu 2 mixture and benchmarked them against our fully-finetuned models. These results mirror our earlier experiments: while QLoRA does improve significantly over the base model, it doesn't match full-finetuning performance.
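For readers who want to reproduce this kind of setup, here is a rough sketch of QLoRA-style 4-bit loading plus LoRA adapters using transformers, bitsandbytes, and peft; the base model, rank, and target modules are illustrative choices, not the exact hyperparameters we used:

```python
# Rough sketch of a QLoRA-style setup: load the base model in 4-bit and
# attach trainable LoRA adapters. Hyperparameters here are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

The frozen base weights stay in 4-bit precision, which is what makes single-GPU training of a 70B model possible, at the cost of the generation-quality gap described above.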
2 - CodeLlama Training
One of the biggest drawbacks of using Llama 2 is its poor code performance, due to its pretraining mixture. We explored remedying this by using Llama 2 models further pretrained on code (i.e., CodeLlama) as the base model instead of Llama 2 - we call these models CodeTulu 2. We found that while coding performance improved significantly, performance on other tasks tended to lag behind the base Llama 2 models (comparing average performance on non-coding tasks between the two model families). However, these models may prove more useful than our base Tulu 2 models for coding or structured-output tasks!
Wrapping it all up
As part of our commitment to open science, we have released all aspects of this project publicly:
Our models and datasets are available here: https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101
Our training and evaluation codebase is here: https://github.com/allenai/open-instruct
Our Jax codebase (used for large model training) is here: https://github.com/hamishivi/EasyLM
We hope that our models, data, and code make studying LLM training easier, and provide a solid foundation for further improvements and investigations into LLM adaptation! We also encourage interested readers to read our paper for more details and a more nuanced view of which skills different finetuning methods help and hinder.