
Hybrid preferences: learning to route instances for human vs. AI feedback

Lj Miranda / October 28, 2024

Many of the recent advancements in large language models (LLMs) have been powered by human feedback, usually in the form of preference datasets. Think of a preference as a judgment between two (or more) model outputs given a user prompt. Say, for example, you're deciding between two ad copies for a sourdough business you're starting: which one do you prefer?

LLM researchers typically collect thousands of these judgments to train models using techniques like Reinforcement Learning from Human Feedback or Direct Preference Optimization. The common approach is to obtain these judgments directly from human annotators; however, this tends to be costly and time-consuming. Human annotators can also make mistakes, especially when asked to annotate complex examples or content beyond their expertise.

On the other hand, we can also use other language models (LMs) to annotate, but they do not always reflect the nuances of human annotators and can be prone to certain biases or errors in judgment. Human annotators and LMs tend to excel and struggle in different scenarios, suggesting a potential avenue in combining both sources of feedback.

In our paper, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback, we showed that we can obtain high-quality and cost-efficient preference data by finding the right combination of direct human preferences and synthetic preferences from LMs.

Our approach: a routing framework for allocating preference instances

We present a routing framework for allocating preference instances to either human or LM annotators, resulting in a set of hybrid annotations. Our main approach is to identify specific instances that will benefit from direct human annotation, while the rest are routed to an LM. Our framework is composed of a performance prediction model (PPM) and a routing strategy based on that model.

Our routing framework finds the right combination of human and LM annotations for preference data.

The PPM first learns to predict how a preference dataset will perform when we train a reward model on it. Once we have a trained PPM, we can simulate several candidate datasets with hybrid annotations and predict their expected performance without actually training a reward model on each. The candidate dataset with the highest predicted performance is then selected as our final mix of human and AI annotations.
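As a rough sketch of this selection step (not the paper's actual implementation), the snippet below scores randomly sampled candidate routings with a trained PPM and keeps the best one. The names `ppm`, `featurize`, and `sample_candidate_routing` are hypothetical, and the PPM is assumed to expose a scikit-learn-style `predict` method.

```python
import random

def sample_candidate_routing(instances, human_budget):
    """Randomly pick which instances go to human annotators (hypothetical helper)."""
    human_idx = set(random.sample(range(len(instances)), human_budget))
    return ["human" if i in human_idx else "lm" for i in range(len(instances))]

def best_hybrid_split(instances, ppm, featurize, human_budget, n_candidates=1000):
    """Score candidate hybrid annotation splits with the PPM and keep the best one."""
    best_score, best_routing = float("-inf"), None
    for _ in range(n_candidates):
        routing = sample_candidate_routing(instances, human_budget)
        # featurize() turns a routing decision into the PPM's input features,
        # e.g., counts of human-routed instances with certain characteristics.
        score = ppm.predict([featurize(instances, routing)])[0]
        if score > best_score:
            best_score, best_routing = score, routing
    return best_routing, best_score
```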

To train this PPM, we need training data. In our framework, the features of the PPM's training dataset are the counts of instances routed to human annotators that have certain characteristics, such as the length of the prompt or the similarity between the two model outputs. The target is the score, on RewardBench (a popular RM benchmark), of a reward model (RM) trained on that dataset. We trained a quadratic regression model to predict this performance given these characteristics.

A visualization of the training dataset for the performance prediction model.
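As a rough illustration (not the paper's actual code), a quadratic regression model of this kind can be fit with scikit-learn; the feature matrix `X` and targets `y` below are random placeholders standing in for the characteristic counts and RewardBench scores described above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# X: one row per simulated hybrid dataset, one column per characteristic count,
#    e.g., "# human-routed instances with long prompts", "# with similar outputs", ...
# y: RewardBench score of the reward model trained on that dataset.
X = np.random.rand(200, 8)   # placeholder features
y = np.random.rand(200)      # placeholder targets

# Quadratic regression: degree-2 polynomial feature expansion followed by a linear fit.
ppm = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
ppm.fit(X, y)

# Predict the expected RewardBench score of a new candidate hybrid dataset.
candidate_features = np.random.rand(1, 8)
predicted_score = ppm.predict(candidate_features)[0]
```

One appeal of a quadratic model is that it can capture simple interactions between characteristics while remaining cheap to fit on a modest number of simulated datasets.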

Introducing MultiPref: a new preference dataset

To obtain training data for the PPM, we constructed MultiPref, a new preference dataset of 10k+ instances with both human and GPT-4 annotations, with prompts drawn from a variety of open resources. We used Prolific, a data annotation platform, and worked with several crowdworkers to obtain high-quality preference annotations.

Data pipeline for constructing MultiPref, a new preference dataset

We believe that the use of MultiPref extends beyond training a prediction model: you can use it as a preference training dataset for language modeling, or as a basis for analyzing annotator disagreements and subjectivity on preference data!

Impact: better preference mixes and understanding of when human annotations are helpful

We tested our routing framework on different existing preference datasets such as HelpSteer2, ChatArena, and AlpacaFarm. We found that we can outperform a 100% human-annotated dataset on RewardBench with a smaller human annotation budget. For instance, we need only ~37% of the human annotations to outperform a fully human-annotated MultiPref. We find similar patterns in other preference datasets:

RewardBench (overall) performance for different preference datasets and the best hybrid mix.

In addition, our routing framework allowed us to understand what factors make a preference instance benefit from human annotation. For example, we find a trend where prompts with a moderate safety concern (i.e., the degree of discomfort caused by the user instruction) or a moderately complex user intent (e.g., the number of constraints, goals, or requirements in the instruction) are better routed to human annotators. One possible explanation is that simple examples may not need human annotation, while complex examples may be equally or even more challenging for humans.

The gain in performance for routing an instance to human annotators in MultiPref, with examples of a preference instance with high and low gains.
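To illustrate the kind of analysis involved (a hypothetical sketch, not the paper's code), one could estimate the gain of routing a single instance to humans as the change in the PPM's predicted score when that instance is flipped from LM to human annotation; `featurize` and the routing representation follow the earlier sketch.

```python
def routing_gain(instance_idx, instances, routing, ppm, featurize):
    """Estimate how much the PPM's predicted score changes if one instance
    is flipped from LM annotation to human annotation (hypothetical sketch)."""
    base = ppm.predict([featurize(instances, routing)])[0]
    flipped = list(routing)
    flipped[instance_idx] = "human"
    with_human = ppm.predict([featurize(instances, flipped)])[0]
    return with_human - base
```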

Overall, our work shows that we can obtain high-quality data at lower costs by combining the strengths of human and LM annotators. In addition, it's a step forward in understanding what qualities of a preference instance render it fit for human annotation. We're excited to share our paper, code, and the MultiPref dataset in the links below:
