The unreasonable effectiveness of easy training data
Peter Hase / January 16, 2024
We typically train AI systems to answer questions in specific domains (like STEM) by finetuning a model on example questions and answers. But what happens when it's hard to collect training examples, because only experts can answer the questions, or because the questions are so hard that even experts often get them wrong?
In a new paper, we present results showing that language models can perform well on hard, domain-specific questions when trained only on easy questions. Below, we see that a language model's exam scores on college-level STEM questions are almost as good when it's trained on 3rd grade questions as when it's trained on college questions!
In fact, we do equally well on college STEM questions whether we train the model on college-level or high-school-level questions. (Here, training can mean using in-context learning, training a linear classifier head, or finetuning with QLoRA.)
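To make the setup concrete, here is a minimal sketch of the QLoRA route, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the model name and hyperparameters are illustrative choices, not the exact settings from the paper. (In-context learning, by contrast, needs no parameter updates at all: you simply place easy question-answer pairs in the prompt before the hard test question.)

```python
# Sketch: QLoRA finetuning on easy QA pairs (illustrative settings, not the paper's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit (NF4) so it fits in limited GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "meta-llama/Llama-2-7b-hf"  # any open causal LM; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Attach small low-rank adapters to the attention projections;
# only these adapter weights are trained, never the 4-bit base model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, training is a standard causal-LM loop (e.g., transformers.Trainer)
# over tokenized easy question-answer pairs, such as grade-school science QA.
```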
Why would this matter in the real world? Because gathering data in specialized domains like medicine and law is expensive, and even experts can give noisy answers to hard questions. If we can perform well on hard questions by training on easier data, then we might save a lot of time and effort while still producing reliable models that are useful to people. This challenge is known as the scalable oversight problem: it is difficult to properly train (oversee) models to accurately answer questions in domains of increasing (scaling) complexity.
Our findings imply that easy training data can be better than hard training data in practice, since hard data is generally noisier and costlier to collect:
In the paper, we demonstrate the same conclusion when hard data is costlier to collect than easy data: you can do better on hard test data by training on easy data! These scenarios are common in practice, too. Some kinds of "hard" data can easily cost twice as much to collect, and be twice as noisy, as "easy" data in the same domain.
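As a back-of-the-envelope illustration, using the 2x cost and 2x noise ratios from the sentence above (all other numbers are invented for the example), a fixed labeling budget buys you more, and cleaner, easy data:

```python
# Illustrative budget math: hard labels cost 2x as much and are 2x as noisy.
# The budget, unit costs, and error rates are made-up values for illustration.
budget = 1000.0                     # total annotation budget (arbitrary units)
easy_cost, hard_cost = 1.0, 2.0     # hard examples cost twice as much to label
easy_err, hard_err = 0.05, 0.10     # hard labels are twice as likely to be wrong

easy_n = budget / easy_cost         # 1000 easy examples
hard_n = budget / hard_cost         # 500 hard examples

# Expected number of correctly labeled examples under each strategy:
easy_clean = easy_n * (1 - easy_err)   # 950
hard_clean = hard_n * (1 - hard_err)   # 450

print(f"easy: {easy_n:.0f} examples, ~{easy_clean:.0f} correctly labeled")
print(f"hard: {hard_n:.0f} examples, ~{hard_clean:.0f} correctly labeled")
```

Whether the larger, cleaner easy set actually wins on hard test questions then comes down to how well easy-to-hard generalization holds, which is exactly what the paper measures.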
Will these results hold up as language models continue to improve? Interestingly, we observe good easy-to-hard generalization across model sizes from 7b to 70b parameters:
This kind of scaling result is important because it suggests that as models become more capable over time, easy-to-hard generalization will continue to be about as good as hard-to-hard generalization in settings like these. This means that we can do well on hard test questions without having to label hard training questions!
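One simple way to summarize such a comparison is the fraction of the hard-supervision gain that easy supervision recovers. The sketch below uses placeholder accuracies, not numbers from the paper, and the ratio is just one illustrative summary statistic rather than necessarily the paper's exact metric:

```python
# Placeholder accuracies on hard test questions (NOT results from the paper),
# for three model scales: (unsupervised, trained-on-easy, trained-on-hard).
results = {
    "7b":  (0.45, 0.58, 0.60),
    "13b": (0.50, 0.64, 0.65),
    "70b": (0.58, 0.74, 0.75),
}

for size, (unsup, easy, hard) in results.items():
    # Fraction of the gain from hard supervision that easy supervision recovers:
    recovered = (easy - unsup) / (hard - unsup)
    print(f"{size}: easy training recovers {recovered:.0%} of the hard-training gain")
```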
Conclusion. It appears that easy-to-hard generalization in LMs is often surprisingly strong, suggesting that the scalable oversight problem may be easier than previously thought. That said, we encourage future research into easy-to-hard generalization, to help models perform well on even harder tasks.
Our experiments use open models up to 70b in size and four publicly available question-answering datasets, with difficulty ranging from 3rd grade science questions to college-level STEM questions and general-knowledge trivia. Check out our public code here!
And see the paper here!