Are you thirsty for social chitchat data?
We give you SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Hyunwoo Kim / November 28, 2023
Although social conversations occur every day and everywhere around you, they are often not recorded as data. And when they are (e.g., text messages), research use is rightly restricted due to privacy and legal concerns. As a result, collecting high-quality, everyday social conversations at scale has long been recognized as a difficult task. It's like searching for drinking water in the sea: it's there, but not in a usable form, leaving many with a thirst for large-scale, quality social chitchat data.
In this blog post, we introduce SODA, the first million-scale high-quality social chitchat dataset that will quench this thirst. What's even better? Our recent paper, "SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization", which has been accepted to EMNLP as an oral presentation, shows how anyone can obtain substantially larger and more diverse social chitchat data with better quality.
Too good to be true. Large, diverse, and high-quality?
Yes, we can achieve all three by leveraging the power of large language models (LLMs) and symbolic commonsense knowledge graphs. More concretely, we use OpenAI's InstructGPT and Atomic10x to distill social conversations. That said, you can use other open-source LLMs as well, such as Llama-2.
Isn't it obvious that you can generate dialogues with LLMs?
Yes, it is very obvious. However, if you're trying to generate a vast number of coherent dialogues spanning an exceptionally broad range of everyday scenarios, the problem becomes challenging. Repeatedly prompting an LLM with "Generate 10 coherent dialogues with diverse topics, but don't let the dialogue topics overlap" won't cover all the various scenarios of everyday life. This is where symbolic commonsense knowledge graphs come in to save the day.
What's a symbolic commonsense knowledge graph?
A symbolic commonsense knowledge graph is a way to organize general human knowledge about everyday life in a format that both computers and people can understand. It's made up of "nodes," which are like points representing different ideas or things, and "edges," which are the lines that connect these points to show how they're related. A connection between two nodes forms something called a "triple": for instance, you could have one point for "PersonX moves a step closer to the goal" and another for "take the first step," connected by a relationship called "xNeed." This makes the triple (PersonX moves a step closer to the goal, xNeed, take the first step). By linking many of these triples together, the graph creates a big web of common knowledge that's easy for AI systems to navigate.
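To make the structure concrete, here's a minimal sketch of how such a triple could be represented in Python. The `Triple` type and its field names are our own illustration, not code from the SODA release:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A single commonsense knowledge triple: (Head event, Relation, Tail event)."""
    head: str      # node describing an event, e.g., "PersonX moves a step closer to the goal"
    relation: str  # edge type, e.g., "xNeed" (what PersonX needed beforehand)
    tail: str      # node describing the related event, e.g., "take the first step"

triple = Triple(
    head="PersonX moves a step closer to the goal",
    relation="xNeed",
    tail="take the first step",
)
print(triple.head, "--", triple.relation, "->", triple.tail)
```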
So how do you go from knowledge graphs to social dialogues?
We obtain dialogues by contextualizing (i.e., adding more context information to) the commonsense triples in a step-by-step manner. These triples are the distilled essence of our social experiences, abstracted into narratives and ultimately crystallized into concise pieces of knowledge. By leveraging LLMs, we reverse this abstraction process: we take the commonsense knowledge triples and expand them back into the short narratives and conversations that might have originally contained that knowledge.
We do this in three steps:
1. First, we convert the symbolic knowledge triple into sentence form in a rule-based manner. For example, the triple above is converted to "Madeleine took the first step. Madeleine moves a step closer to the goal." (PersonX is replaced with a sampled name.)
2. Next, we use the LLM to generate a short narrative based on the sentence-form commonsense knowledge. We also use the LLM to infer the likely conversation participants (e.g., Madeleine and her coach).
3. Finally, with the conversation participants and the narrative as input, we prompt the LLM to generate a full, multi-turn conversation.
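Here is a minimal Python sketch of these three steps, assuming the official `openai` client (v1+). The prompts, model name, template, and name pool are illustrative placeholders rather than the exact ones from the paper; see the sodaverse repo for the real implementation:

```python
import random
from openai import OpenAI  # assumes the official `openai` Python package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment
NAMES = ["Madeleine", "Jaxon", "Priya"]  # illustrative name pool
TEMPLATES = {"xNeed": "{name} needed to {tail}. {head}."}  # one sketched rule per relation

def complete(prompt: str) -> str:
    """One LLM call; the model name is a placeholder, not the paper's InstructGPT variant."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def triple_to_sentences(head: str, relation: str, tail: str) -> str:
    # Step 1: rule-based conversion; PersonX is replaced with a sampled name.
    name = random.choice(NAMES)
    head = head.replace("PersonX", name)
    return TEMPLATES[relation].format(name=name, tail=tail, head=head)

def soda_style_dialogue(head: str, relation: str, tail: str) -> str:
    sentences = triple_to_sentences(head, relation, tail)
    # Step 2: expand the sentences into a short narrative and infer the speakers.
    narrative = complete(f"Rewrite this as a short two-sentence story:\n{sentences}")
    speakers = complete(f"Story: {narrative}\nWho are the two people most likely talking in this scene?")
    # Step 3: generate the full multi-turn conversation.
    return complete(
        f"Story: {narrative}\nSpeakers: {speakers}\n"
        "Write the full conversation between them, one utterance per line."
    )

print(soda_style_dialogue("PersonX moves a step closer to the goal", "xNeed", "take the first step"))
```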
The final dataset of SODA comprises 1.5 million conversations with more than 11 million utterances, making it the largest publicly available social chitchat dataset.
How good is the quality of SODA?
To assess the relative quality of the corpus, we conducted head-to-head human evaluations comparing SODA with two widely used open-domain dialogue datasets: DailyDialog and BlendedSkillTalk. We randomly sample 300 dialogues from each dataset and evaluate them according to six criteria: (1) natural flow, (2) context dependence, (3) topic consistency, (4) speaker consistency, (5) specificity, and (6) overall quality. For each criterion, judges are asked to select the better of the two dialogues.
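As a rough illustration of how such head-to-head judgments turn into reported numbers, here is a small sketch that tallies per-criterion win rates. The `judgments` format is our own assumption, not the paper's actual evaluation code:

```python
from collections import Counter

CRITERIA = ["natural flow", "context dependence", "topic consistency",
            "speaker consistency", "specificity", "overall"]

def win_rates(judgments: list[dict]) -> dict:
    """judgments: one dict per (dialogue pair, judge), mapping each
    criterion to the name of the dataset whose dialogue was preferred.
    Returns the percentage of wins per dataset for each criterion."""
    rates = {}
    for criterion in CRITERIA:
        counts = Counter(j[criterion] for j in judgments)
        total = sum(counts.values())
        rates[criterion] = {name: 100 * n / total for name, n in counts.items()}
    return rates

# e.g., win_rates([{c: "SODA" for c in CRITERIA}, {c: "DailyDialog" for c in CRITERIA}])
```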
Despite being fully machine-generated, human raters judge SODA to be of higher quality than both DailyDialog and BlendedSkillTalk across all axes by a large margin, with the single exception of context dependence when compared with BlendedSkillTalk. In particular, evaluators rate the flow of SODA's conversations as significantly more natural than that of the other datasets, which were collected through crowdsourcing.
Any other characteristics of SODA?
SODA also contains rich emotion-related information. Since the commonsense knowledge in Atomic10x includes people's emotional reactions to events (i.e., the xReact triples), conversations with rich emotional content are included in SODA as well. In total, SODA includes 385K conversations generated from 1.7K unique emotion descriptions in the xReact triples. It therefore contains far more descriptive emotion labels (i.e., the Tail node) than other datasets, which typically have a fixed number of emotion classes. Furthermore, because we construct conversations bottom-up from those emotional reactions, we know which speaker in the conversation is experiencing the emotion (i.e., PersonX) and what caused it (i.e., the Head node).
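As a rough sketch, extracting these emotion-bearing triples could look like the following, assuming Atomic10x is stored as tab-separated (head, relation, tail) rows. The file name and layout are assumptions for illustration, not the actual Atomic10x schema:

```python
import csv
from collections import Counter

def emotion_triples(path: str):
    """Yield (head, tail) pairs from xReact rows of an Atomic10x-style TSV.
    Assumes three tab-separated columns: head, relation, tail."""
    with open(path, newline="") as f:
        for head, relation, tail in csv.reader(f, delimiter="\t"):
            if relation == "xReact":
                # head = the event that caused the emotion,
                # tail = the free-text emotion PersonX feels (the label).
                yield head, tail

# Count the unique free-text emotion labels (SODA draws on ~1.7K of them).
labels = Counter(tail for _, tail in emotion_triples("atomic10x.tsv"))
print(len(labels), labels.most_common(5))
```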
How strong would the model be if trained on SODA?
We compared COSMO, our 3B-parameter model trained on SODA, with four other conversational agents (i.e., BlenderBot, GODEL, Koala, and Vicuna) on DailyDialog, which is an out-of-domain dataset for all models. We performed head-to-head comparisons between two responses, each from a different model. We randomly sample 100 test examples and ask three human judges on Amazon Mechanical Turk to select the better of the two responses according to four criteria: (1) naturalness, (2) consistency, (3) specificity, and (4) overall quality.
Although COSMO is trained on a significantly smaller amount of data (1.5M dialogues vs. 1.5B Reddit comments or 551M Reddit threads) and is itself significantly smaller (3B vs. 7B parameters), it outperforms all the other models by a significant margin across all aspects. Most surprisingly, human judges even prefer COSMO's responses over the original ground-truth responses in the dataset. This suggests that dialogue models trained on SODA generalize well and respond naturally, even in unseen conversations.
Conclusion
SODA is not only orders of magnitude larger than existing popular dialogue datasets; it is also perceived to be significantly better across multiple aspects (e.g., naturalness, specificity, consistency). Furthermore, our distillation framework offers a cost- and time-efficient way to collect rich social chitchat data. With SODA, we hope to alleviate the data scarcity issue in social chitchat research.
Please check out our resources if you're interested in more details:
- Paper on Semantic Scholar
- Code for making your own SODA: https://github.com/skywalker023/sodaverse
The SODA dataset and COSMO are publicly available under the permissive CC BY 4.0 license on the Hugging Face Hub.
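For instance, assuming the Hugging Face `datasets` and `transformers` libraries and the hub IDs `allenai/soda` and `allenai/cosmo-xl` (please verify the exact identifiers on the hub or in the sodaverse repo), loading them could look like:

```python
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hub IDs assumed from the project pages; verify before use.
soda = load_dataset("allenai/soda")
print(soda["train"][0])  # one conversation with its narrative and speakers

tokenizer = AutoTokenizer.from_pretrained("allenai/cosmo-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/cosmo-xl")
```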