PolygloToxicityPrompts: multilingual evaluation of neural toxic degeneration in large language models
June 24, 2024
Low-quality data on the internet instills undesirable, unsafe, or toxic knowledge in large language models (LLMs). Their deployment in chatbots further increases the risk of exposing people to this content: there have already been instances where chatbots provided harmful advice to users and exhibited aggressive, offensive behavior.
Moreover, recent advancements in LLMs and their multilingual capabilities have led to their extensive deployment in global contexts. However, existing toxicity evaluation datasets either focus on English or are translated from English benchmarks, and as a result they underestimate toxic degeneration in state-of-the-art LLMs. In new joint work with CMU, we study how large language models generate toxicity in multiple languages, how toxicity varies across differently-resourced languages, and how design decisions like model size and alignment method impact toxicity.
We introduce PolygloToxicityPrompts, a dataset of 425K naturally occurring prompts across 17 languages with varying degrees of toxicity. Building on previous work (RealToxicityPrompts [1]), we created the dataset by extracting documents from the web and splitting each one in half: the first half serves as the prompt and the second as its continuation. Compared to existing toxicity evaluation benchmarks, our naturally occurring multilingual prompts capture more toxicity in LLMs, as shown in the graph above.
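As a rough illustration of the prompt construction step, the sketch below splits a document in half at a whitespace boundary. The function name and the simple half-split heuristic are our own simplification for exposition; the paper's actual pipeline includes additional filtering and tokenization details.

```python
def split_document(document: str) -> tuple[str, str]:
    """Split a web document into a prompt (first half) and a
    continuation (second half) at a whitespace boundary.

    A minimal sketch of the half-split idea from RealToxicityPrompts;
    the exact preprocessing used for PolygloToxicityPrompts may differ.
    """
    tokens = document.split()
    midpoint = len(tokens) // 2
    prompt = " ".join(tokens[:midpoint])
    continuation = " ".join(tokens[midpoint:])
    return prompt, continuation


# Example usage on a toy document.
doc = "The quick brown fox jumps over the lazy dog near the river bank today."
prompt, continuation = split_document(doc)
print(prompt)        # first half, used to prompt the model
print(continuation)  # second half, the reference continuation
```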
PolygloToxicityPrompts will allow researchers to further address the risk of multilingual neural toxic degeneration in models and is available to download.
We used Perspective API to measure the toxicity of prompts and generations, and we compute a model's average toxicity as the mean toxicity score over all of its continuations.
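The sketch below shows how such scoring could be wired up with the public Perspective API client and then averaged over a model's continuations. It assumes you have your own API key (the `API_KEY` placeholder), and the generation and aggregation details are a simplified stand-in for the paper's evaluation setup.

```python
# Sketch: score texts with Perspective API and average the scores.
# Request format follows the public Perspective API client documentation.
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder, supply your own key

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity(text: str, language: str = "en") -> float:
    """Return the Perspective API TOXICITY score (in [0, 1]) for one text."""
    request = {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def average_toxicity(continuations: list[str], language: str = "en") -> float:
    """A model's average toxicity: mean score over all its continuations."""
    scores = [toxicity(c, language) for c in continuations]
    return sum(scores) / len(scores)
```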
The graph above shows that state-of-the-art multilingual LLMs have the lowest toxicity levels in Russian and Dutch and the highest in Hindi and Czech. More generally, model toxicity increases as the availability of high-quality language data decreases. Overall, high toxicity scores across all languages (including English) emphasize the current gap in multilingual toxicity mitigation.
We also investigated the impact of model parameter count on toxicity. Within a model family of base LLMs, that is, models trained with only the language modeling objective (e.g., GPT-2 or GPT-3 equivalents), toxicity increases with model size, as shown in the graph on the left for the Pythia model suite. This suggests that as a model's size and capacity increase, it learns more of the toxicity in its training data, though the effect plateaus beyond a certain scale.
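A minimal sketch of this kind of size sweep is shown below, using the public EleutherAI Pythia checkpoints and the `average_toxicity` helper above. The generation settings (sampling, continuation length) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: compare average toxicity across Pythia model sizes.
from transformers import AutoModelForCausalLM, AutoTokenizer

PYTHIA_SIZES = [
    "EleutherAI/pythia-70m",
    "EleutherAI/pythia-410m",
    "EleutherAI/pythia-1.4b",
    "EleutherAI/pythia-6.9b",
]

def generate_continuations(model_name: str, prompts: list[str]) -> list[str]:
    """Generate one sampled continuation per prompt with a base LLM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    continuations = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=30,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        continuations.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    return continuations

# Example sweep (prompts would come from PolygloToxicityPrompts):
# for name in PYTHIA_SIZES:
#     conts = generate_continuations(name, prompts)
#     print(name, average_toxicity(conts))
```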
Next, we investigated the impact of instruction- and preference-tuning on toxicity (graph on the right). Unsurprisingly, instruction- and preference-tuned models (e.g., GPT-3.5 or GPT-4 equivalents) are less toxic than base models (e.g., GPT-2 or GPT-3 equivalents).
Surprisingly, however, the choice of preference-tuning algorithm does not have a significant impact on model toxicity, as shown in the graph on the left. Given the same base model, preference-tuning with AI feedback also leads to lower toxicity than preference-tuning with human feedback for the language(s) targeted by the technique (graph on the right).
Finally, we studied the extent to which toxicity and safety overlap by comparing Perspective API, a toxicity detector, with Llama Guard [2], a safety detector. While their scores are correlated, Perspective API excels at detecting explicit toxicity and hate speech, whereas Llama Guard can flag subtler unsafe generations (e.g., inappropriate URL domains without seeing the page's content) and extends to other axes of AI safety. We conclude that toxicity and safety are related but distinct concepts, each requiring its own solutions.
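The sketch below shows how the two signals could be compared on a single prompt/continuation pair: a Perspective API score from the helper above versus a Llama Guard verdict. The model ID and chat-template usage follow the public Llama Guard release; the 0.5 toxicity threshold is an illustrative choice, not a value from the paper.

```python
# Sketch: cross-check a Perspective API score against a Llama Guard verdict.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/LlamaGuard-7b"
guard_tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard_model = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.float16, device_map="auto"
)

def llama_guard_verdict(prompt: str, continuation: str) -> str:
    """Return Llama Guard's verdict: "safe", or "unsafe" plus violated categories."""
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": continuation},
    ]
    input_ids = guard_tokenizer.apply_chat_template(
        chat, return_tensors="pt"
    ).to(guard_model.device)
    output = guard_model.generate(
        input_ids=input_ids, max_new_tokens=20, pad_token_id=0
    )
    return guard_tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    ).strip()

# Compare the two signals for one example:
# tox = toxicity(continuation)                      # Perspective API, in [0, 1]
# verdict = llama_guard_verdict(prompt, continuation)
# print(f"toxic={tox > 0.5}, llama_guard={verdict}")
```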
Next Steps
Our findings highlight crucial gaps in current research, namely the need for open-source multilingual toxicity detection and for methods to safeguard multilingual LLMs. We also encourage further empirical and theoretical investigation of how toxic degeneration is affected by prompt language, model size, and alignment method. Finally, we call for the use of PolygloToxicityPrompts to evaluate toxicity in current and future LLMs, and for extensions of our work to even more languages.
References
- Jain, Devansh, et al. "PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models." arXiv preprint arXiv:2405.09373 (2024).
- [1] Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020).
- [2] Inan, Hakan, et al. "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv preprint arXiv:2312.06674 (2023).