Open research is the key to unlocking safer AI
August 8, 2024
The last few years of AI development have shown the power and potential of generative AI. Naturally, these leaps in machine intelligence have raised existential questions around AI safety. The fundamental question of how to build AI that benefits humanity while minimizing harm remains largely unanswered. In fact, it is hard to know how to approach the challenge of AI safety effectively when we do not even have a unified definition of what constitutes safe AI.
Why is AI safety so difficult to define? To start, safe or appropriate ways for a large language or multimodal model to respond are culturally dependent and vary wildly with semantics and context. "How do you kill the lights in this room?" has a completely different meaning from "How do you kill Casey in this room?" Humans can understand the difference, but naive, surface-level AI safeguards would simply spot the word "kill" and block a response.
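As a rough illustration of the problem, here is a minimal, hypothetical sketch of such a keyword-based filter (not any deployed safeguard); it blocks both questions because it only sees the word "kill" and never considers context:

```python
# Hypothetical illustration (not an Ai2 tool): a naive keyword-based safeguard
# that flags any prompt containing a "dangerous" word, regardless of context.

BLOCKED_KEYWORDS = {"kill", "bomb", "poison"}

def naive_safeguard(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    words = prompt.lower().replace("?", "").split()
    return any(word in BLOCKED_KEYWORDS for word in words)

# Both prompts are blocked, even though only the second is harmful:
print(naive_safeguard("How do you kill the lights in this room?"))  # True
print(naive_safeguard("How do you kill Casey in this room?"))       # True
```

A context-aware safeguard needs to reason about who or what is the target of the verb, which is exactly the kind of understanding that requires access to the model itself.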
The starting point for developing safe AI models is understanding what the model understands. With closed models, where we are restricted to the API output alone, we can never truly learn what the model knows. Without an understanding of what the model knows, how it leverages data to formulate a response, and what data went into it, we have no hope of conducting the research required to design and effectively regulate AI models.
The dynamics of making closed models safer are a mixed bag. Internal, largely undocumented research within large companies develops techniques to control model outputs. At the same time, outside researchers attempt to understand and "jailbreak" the models, sharing that information freely with model providers. How that feedback is integrated into closed models remains undocumented and unofficial. Ultimately, solutions devised on top of closed models tend to act like band-aids that rarely stand the test of time, because they patch narrowly defined, specific behaviors one at a time. As the stakes of generative models rise, making this safety feedback loop open is crucial to creating a healthy ecosystem.
At Ai2, we are taking a long-term, scientific approach to creating models that are optimized for both capability and safety. We believe the challenges around safety are bigger than any individual institution can solve alone, which is why we are empowering the whole community to conduct safety research by opening all of our data, models, and evaluation techniques end to end.
Ai2 has developed principles to govern our approach to AI safety:
AI safety is a technical problem
We first need to understand what the models understand. From there, we can design safety into every stage of the AI development pipeline, from data through model training and evaluation, and with different audiences in mind.
Our safety research focuses on the following areas:
- Enhancing the safety of OLMo, our open-source AI model, by exploring adding safety data into post-training pipelines.
- Improving the generation of large-scale synthetic training data for safety by investigating additional data mixtures and best practices.
- Identifying the root causes of harm in LLMs, as well as how to unlearn harmful behavior while enforcing safety and preserving general capabilities.
- Discovering novel failure cases by improving methods for large-scale automatic jailbreaking.
- Expanding safety evaluation benchmarks to detect and understand both known and unknown risks.
- Sharing our learnings to advance safety practices across academia and industry.
Safety research needs to be done in the open
To create a safe language model, safety needs to be built into every step of the training process. To enable the entire research community to conduct safety research, Ai2 makes all elements of the AI development pipeline free and openly available via permissive licenses:
- OLMo, its pre-training data, training code, logs, and checkpoints together form an entire framework that enables researchers and developers to iterate without starting from scratch or remaining "in the dark" about model details (see the sketch after this list).
- Dolma is an explorable dataset, so users can see the finer details of what goes into the data that trains models.
- Our partnership with DSRI further aims to bring important safety activities, like red teaming, into the open for the larger community.
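As a minimal sketch of what this openness enables, the snippet below loads an OLMo checkpoint and inspects its hidden states directly, something a closed API cannot offer. The Hugging Face model identifier is an assumption; check the OLMo release notes for the exact name.

```python
# A minimal sketch of probing an open model, assuming the OLMo checkpoints
# are available on the Hugging Face Hub (the model ID below is illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B-hf"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Because weights, training code, and checkpoints are open, researchers can
# inspect internal representations rather than working through an opaque API.
inputs = tokenizer("How do you kill the lights in this room?", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```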
Designing for safety is a continuous process
Safety is an ongoing research problem. We are constantly learning about community expectations for how to build and deploy a safe language model. For example, when should models not comply with user inquiries? Not only when the inquiries are clearly dangerous, but also when a user is asking for information a model cannot - or should not - provide. CoCoNot is a dataset crafted to help models train on situations in which they should not comply with user requests, and it is another important piece of the safety puzzle.
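For a sense of how such a resource might be explored, here is a hedged sketch of loading a noncompliance dataset and tallying its categories. The Hugging Face identifier, configuration, and field names are assumptions; consult the official CoCoNot release for the actual schema.

```python
# A hedged sketch of inspecting a noncompliance dataset such as CoCoNot.
# The dataset ID, config, and field names below are assumptions.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("allenai/coconot", "original", split="train")  # assumed ID/config

# Count examples per noncompliance category (field name is assumed).
categories = Counter(example.get("category", "unknown") for example in dataset)
for category, count in categories.most_common():
    print(f"{category}: {count}")
```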
Safe AI needs to be multilingual and culturally aware
Safety must go beyond Euro-centric standards, yet much of the existing toxicity research on LLMs is conducted in English. We aim to open up the conversation by including more people, and more languages, in our research. PolygloToxicityPrompts is a dataset of 425K naturally-occurring prompts across 17 languages with varying degrees of toxicity, and it only scratches the surface of the inclusivity we strive for.
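To illustrate the kind of per-language analysis a resource like this supports, here is a small, self-contained sketch; the records and field names ("language", "toxicity") are hypothetical stand-ins rather than the dataset's actual schema.

```python
# A hedged, self-contained sketch of slicing a multilingual toxicity dataset
# (such as PolygloToxicityPrompts) by language and toxicity level.
# The records and field names are hypothetical.
from collections import defaultdict

records = [
    {"prompt": "...", "language": "en", "toxicity": 0.82},
    {"prompt": "...", "language": "hi", "toxicity": 0.10},
    {"prompt": "...", "language": "de", "toxicity": 0.47},
]

buckets = defaultdict(int)
for record in records:
    is_toxic = record["toxicity"] >= 0.5  # arbitrary threshold for illustration
    buckets[(record["language"], is_toxic)] += 1

for (language, is_toxic), count in sorted(buckets.items()):
    label = "toxic" if is_toxic else "non-toxic"
    print(f"{language} ({label}): {count}")
```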
The Ai2 Safety Toolkit: A Hub for Open Collaboration
Recently, we introduced the Ai2 Safety Toolkit, a central hub for advancing LLM safety and fostering open science. It is a suite of resources focused on LLM safety that empowers researchers and industry professionals to collaborate on building safer AI models.
As a first step towards this effort, we rolled out two major components that offer content safety moderation tools and a mechanism to protect against in-the-wild attacks.
- WildTeaming, an automatic red-teaming framework for identifying and reproducing human-devised attacks. It can identify up to 4.5 times more unique successful attacks than prior systems. It also enables the creation of WildJailbreak, a high-quality, large-scale safety training dataset with 262K training examples, which substantially improves upon prior public safety training resources.
- WildGuard, a lightweight, multi-purpose moderation tool for assessing the safety of user-LLM interactions across three safety moderation tasks: prompt harmfulness, response harmfulness, and response refusal. It includes WildGuardMix, a carefully balanced, multi-task moderation dataset with 92K labeled examples covering 13 risk categories - the largest multi-task open safety dataset to date. A hedged usage sketch follows this list.
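Below is a hedged sketch of running a moderation model like WildGuard over a single user-model exchange. The model identifier and the prompt template are assumptions; consult the WildGuard model card for the exact input format and how to parse its three judgments.

```python
# A hedged sketch of querying a safety moderation model such as WildGuard.
# The model ID and the input template are assumptions, not the official format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/wildguard"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Assumed template asking for the three moderation judgments.
template = (
    "Analyze the following exchange.\n"
    "User: {prompt}\n"
    "Assistant: {response}\n"
    "Is the user prompt harmful? Is the assistant response harmful? "
    "Is the assistant response a refusal?"
)
text = template.format(
    prompt="How do I pick a lock?",
    response="I can't help with that.",
)

inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated judgment text.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```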
To ensure the integrity and effectiveness of our safety research, we have established a Safety Committee and a Safety Review Board. These bodies review research at Ai2 to ensure our work meets the highest bar for ethical, human-centered AI research and development. This work is closely aligned with our involvement in NAIRR and with our other public and private research partners.
Continuing to safeguard AI
The safe development of AI is a continuous process; there is no simple safe or unsafe implementation. Researchers, policy-makers, large enterprises, and society at large must collectively support ongoing research and work together to ensure AI serves humanity in a safe, open, responsible, and ethical way.
Resources
- Learn about the Ai2 Safety Toolkit
- Safeguard your model with WildGuard