Introducing Ai2’s Beaker
Michal Guerquin / October 24, 2022
Today I’d like to introduce you to Beaker, a computation platform we use and develop at AI2. It lets our researchers run hundreds of thousands of easy-to-track experiments on infrastructure they don’t need to administer. In the last year, we ran over 100,000 unique experiments on Beaker, representing over 300,000 hours of work. Much of that ran on our multi-million-dollar cluster of physical machines, while researchers could still burst to the cloud to meet deadlines. We want to share the story of what we’ve built because it’s been a force multiplier for the high-quality research going on at AI2. As a small platform engineering group, there’s no way we would’ve kept pace with our researchers without it.
Setting the stage
AI2 researchers are great problem solvers. Starting from scratch on a newly provisioned machine, they can install GPU drivers and dependencies, run code, debug and iterate, capture results to a storage bucket in the cloud and set permissions for others in the group to read them. Oh, and track those steps in a spreadsheet or text file.
But this administrative activity sure is a lot of work, and it is often a distraction from the main goal of experimenting with ideas and making discoveries. And when every researcher invents their own way to track experiments or share data, it’s hard to collaborate.
Years ago, we anticipated this problem and introduced Beaker as an in-house system for running experiments. It abstracts away the boilerplate so that researchers can focus on the core task: provide some input files and a program to run, tell Beaker how to run it, and watch as results are collected. It’s easy to compare inputs and outputs on Beaker’s website to discover patterns and to share what was done with collaborators. In effect, we can move fast and learn things.
How researchers use it
Let’s say we are writing code to train a deep learning model on a corpus of common sense statements. To do so, we’ll need a GPU, which a laptop might not have. With the Beaker CLI, we can request an interactive shell in a container with one or more GPUs and get a prompt in seconds. There we can confirm our code runs on a sample of data, and debug it if it doesn’t. When things are looking good, Beaker will save the program and any library dependencies in a Docker image.
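Starting such a session might look like this; a minimal sketch, assuming a recent Beaker CLI (subcommand and flag names can vary by version):

```bash
# Request an interactive shell backed by one GPU (flags are illustrative).
beaker session create --gpus 1
```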
Next we’ll want to train on a larger corpus. Using the same Beaker CLI program, we can upload the large corpus to Beaker (or use an existing corpus that someone already shared!) and tell Beaker how to run our code unattended. Here’s an example specification in YAML format:
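(A minimal sketch follows, using field names from Beaker’s public v2 experiment spec; the image, dataset, and argument names are hypothetical stand-ins for this walkthrough.)

```yaml
version: v2
description: Train a common sense model
tasks:
  - name: train
    image:
      beaker: ai2/common_sense_trainer   # the Docker image Beaker saved earlier
    command: ["python", "train.py"]
    arguments: ["--corpus", "/corpus", "--epochs", "10"]
    datasets:
      - mountPath: /corpus               # where the input corpus appears in the container
        source:
          beaker: ai2/common-sense-corpus
    result:
      path: /output                      # Beaker saves this directory as a result dataset
    resources:
      gpuCount: 1
    context:
      cluster: ai2/mosaic-compute-a100
```

Assuming the corpus was uploaded with something like `beaker dataset create --name common-sense-corpus ./corpus` and the spec saved as `spec.yaml`, submission is a single CLI call (again, command names may differ by version):

```bash
# Submit the experiment described by the spec above.
beaker experiment create spec.yaml
```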
Once submitted, we can watch progress on the Beaker website or use the Beaker CLI to get updates.
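For example, we can follow the log output from the terminal; the experiment ID below is a placeholder, and the command name follows the public Beaker CLI:

```bash
# Stream the experiment's log output to check on progress.
beaker experiment logs <experiment-id>
```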
Without further involvement, Beaker will find a machine with a free GPU in the cluster ai2/mosaic-compute-a100, download the input corpus and the common_sense_trainer program to that machine, start it with the arguments provided, log output, and save the trained model in /output as a dataset.
As the training proceeds, we can tail output logs to confirm expected progress. When it’s done, we can download the trained model from the website or with the Beaker CLI program.
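Downloading the result might look like this; the dataset ID is a placeholder, and the output flag is our assumption about the CLI:

```bash
# Fetch the result dataset containing the trained model.
beaker dataset fetch <result-dataset-id> --output ./trained-model
```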
Behind the scenes
When researchers use Beaker, they frame the work in terms of specific ideas:
- Experiments: A specification for how Beaker should run a program.
- Datasets: An immutable directory of files with provenance. This can be an input to an experiment (which Beaker provides automatically) or a result from an experiment (which Beaker saves automatically).
- Resources: How much memory and how many CPUs and GPUs are required.
With these ideas in mind, we’ve created Beaker to execute batch experiments, to automatically provision machines in the cloud to meet demand surges, and to provide requested resources for interactive shell sessions.
Researchers don’t have to worry about copying code or data to machines, or saving results, because Beaker safely archives results as soon as an experiment completes. This lets AI2 researchers submit hundreds or thousands of experiments without babysitting their execution. If machines are idle, low-priority work can fill the gaps to maximize utilization. And if there’s a bug in an experiment, Beaker provides timestamped stdout logs, so diagnosing failures is easy. Finally, because researchers can see each other’s work, they can share runtime parameters and results effortlessly.
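Priority is expressed in the experiment spec itself; a hedged sketch, reusing the v2 context block from the spec above (the exact priority levels may vary):

```yaml
# Spec fragment: run a task at low priority so it only fills
# otherwise-idle capacity on the cluster.
context:
  cluster: ai2/mosaic-compute-a100
  priority: low
```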
Future
While it may seem like Beaker is a panacea, we’re still growing and learning as our needs change. For example:
- We used to support only cloud computation, but we’ve since extended Beaker to physical machines to support custom, cutting-edge hardware. This works well for most of our workloads, though bursting to the cloud remains a necessity before deadlines.
- More recently, we’ve made it easy to run Python-based batch experiments, even though Beaker’s core is language-agnostic (it runs containers).
On the engineering side, our experience with managing clusters of machines, their storage, and networking has been a reminder that this is a hard problem. GPUs can get too hot, disks can fail, fiber optic cables can become dusty. Resource demand can be unpredictable, and it’s hard to isolate experiments from one another. So as researchers push the limits, we adapt Beaker. Because Beaker’s stability is critical, we prepare testing environments to iterate safely and quickly. Every week we make improvements to Beaker to smooth over rough spots and to enable new ways of making discoveries.
We’re excited about Beaker’s future, and the opportunity we have to catalyze world-class ML research by working closely with scientists.