MolmoAct: An Action Reasoning Model that reasons in 3D space
August 12, 2025
Ai2
People instinctively use space to think and communicate. We storyboard narratives to visualize sequences of events. We annotate maps to plan routes. We diagram abstract concepts to make them tangible. Unlike written language, which is how large language models reason, spatial reasoning conveys visuo-structural ideas directly. It allows us to externalize thought, anticipate how our movements will exert force on the things around us, navigate with maps, and even draft blueprints for complex structures.
Today, language models, and specifically vision-language models, are being used as the brains behind robots. Since language models mostly reason in written language, they struggle to make sense of space. Similarly, people find it hard to learn how to perform a sports maneuver or dance move just by reading about it. Text simply doesn’t capture all the complexities needed to power movement in three dimensions.
To overcome these limitations, we introduce a new class of models, called Action Reasoning Models (ARMs), which reason about actions in space. We have trained and are releasing the very first ARM, MolmoAct. Built on Molmo, our open-source family of vision-language models, MolmoAct bridges the gap between language and action, enabling machines to precisely follow instructions and reason in 3D space.
In this way, MolmoAct is truly innovative—the first model able to “think” in three dimensions.
True to Ai2’s mission, MolmoAct, like Molmo, is also completely open source.
A Generalizable Action Reasoning Model
While vision-language-action models (VLAs) rely on language to carry out actions – language that’s frequently insufficient to represent the motions and physics of the 3D world – MolmoAct grounds scene semantics through depth-aware perception tokens. Given an instruction, the model sketches out a visual reasoning trace via waypoints in image space and converts that plan into detailed low-level action commands for robotics hardware.
At a high level, MolmoAct reasons through three autoregressive stages (sketched in code after the list below):
- Understanding the physical world. The model first outputs spatially grounded perception tokens—special tokens pre-trained and extracted with a VQVAE. These tokens are unique; unlike the textual tokens many VLAs use to represent and decompose spatial information, perception tokens encode geometric structure via depth information and positional embeddings. This lets the model estimate distances between objects and incorporate them into its reasoning.
- Planning in image space. Conditioned on the perception tokens, MolmoAct predicts a sequence of image-space waypoints that act as intermediate goals, visually outlining how the task should unfold while remaining independent of a machine’s embodiment.
- Action decoding. Finally, conditioned on the waypoints, MolmoAct outputs actions for hardware like end-effectors and grippers. These are denormalized using a machine’s kinematic configuration to produce low-level motor commands.
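Read together, these stages form a single autoregressive pass in which each stage conditions on the output of the one before it. The Python sketch below is purely illustrative: the function and method names (run_molmoact_step, decode_perception_tokens, and so on) are our own placeholders rather than the released API, and it assumes actions are normalized to [-1, 1] and that the robot's action bounds are known from its configuration.

```python
# Illustrative sketch of MolmoAct's three autoregressive stages.
# All model methods here are hypothetical placeholders, not the released API.
import numpy as np

def run_molmoact_step(model, image, instruction, robot_config):
    # Stage 1: depth-aware perception tokens (VQVAE-style codes) that
    # ground the scene geometrically before any planning happens.
    perception_tokens = model.decode_perception_tokens(image, instruction)

    # Stage 2: an embodiment-agnostic plan as waypoints in image space,
    # conditioned on the perception tokens.
    waypoints = model.decode_waypoints(image, instruction, perception_tokens)

    # Stage 3: normalized low-level actions, conditioned on the waypoints.
    normalized_action = model.decode_action(
        image, instruction, perception_tokens, waypoints
    )

    # Denormalize into motor commands using the robot's own action bounds
    # (per-embodiment configuration), mapping [-1, 1] back to real units.
    low = np.asarray(robot_config["action_low"])
    high = np.asarray(robot_config["action_high"])
    action = low + (np.asarray(normalized_action) + 1.0) * 0.5 * (high - low)
    return waypoints, action
```

The ordering is the point: the model commits to a geometric understanding of the scene, then to a visual plan, and only then to robot-specific commands, which is what keeps the first two stages reusable across embodiments.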
With only minimal fine‑tuning, we’ve found that MolmoAct can adapt to different embodiments (e.g., robotic humanoids, gripper arms) and tasks more effectively than even strong baselines like Physical Intelligence’s π0 and OpenVLA.
Open Model—and Data
There’s no shortage of contributions to the field of VLAs and other perception-action models from commercial enterprises and industrial research labs. As robotics hardware becomes cheaper and easier to obtain, the volume of open models continues to grow.
But while there are many open-weight VLA models, few are open in the sense that the techniques behind them are inspectable and verifiable. Most can't be reproduced from scratch and lack the code, data, and other artifacts required to fine-tune and evaluate them.
By contrast, MolmoAct is architected to be a genuinely open, inspectable, and highly steerable model. We envisioned MolmoAct as a fully open model from the start—a foundation for impactful research.
MolmoAct-7B – the first in a family of MolmoAct models – was pre-trained on a highly curated subset of the Open-X Embodiment data and a multimodal reasoning dataset, then post-trained on the MolmoAct post-training dataset, an openly available set containing ~10,000 distinct “robot episodes.” To create this dataset, we spent months curating videos of robots performing actions in diverse household settings, from arranging pillows on a living room couch to putting away laundry in a bedroom.
We think the post-training data in particular – a brand-new dataset released alongside MolmoAct-7B – will be useful to teams in the research community. Building post-training sets for robotics models is often an expensive and time-consuming endeavor that draws on scores of human annotators. We've curated a dataset that links high-level reasoning to concrete actions, transforming robot demonstrations into action chain-of-thought sequences that expose each reasoning stage.
By training on action chain-of-thought and auxiliary robot reasoning tasks, we demonstrate that we can extract more knowledge from new and existing data, such as the Open-X Embodiment data, to enhance MolmoAct’s abilities.
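To make "action chain-of-thought" concrete, a single post-training record could pair an instruction and observation with the intermediate reasoning stages and the final action. The schema below is a hypothetical illustration: the field names and values are ours, not the released dataset's.

```python
# Hypothetical action chain-of-thought record; field names and values
# are illustrative only and do not reflect the released dataset schema.
episode_step = {
    "instruction": "put the pillow on the couch",
    "image": "frame_0042.png",                    # RGB observation
    "perception_tokens": [812, 97, 455, 230],     # depth-aware VQVAE codes
    "waypoints": [[0.62, 0.41], [0.55, 0.47], [0.48, 0.52]],  # image-space plan
    "action": [0.02, -0.01, 0.03, 0.0, 0.0, 0.1, 1.0],  # end-effector delta + gripper
}
```

Exposing the intermediate stages this way is what lets the same demonstration supervise perception, planning, and control rather than only the final action.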
Efficient and Performant
MolmoAct-7B was trained far more efficiently than many of the VLA models used to control robots, without sacrificing performance. In fact, MolmoAct-7B beats a number of leading models on key robotics benchmarks, demonstrating the strength of its reasoning techniques.
We pre-trained MolmoAct-7B on 26.3 million samples using a cluster of 256 NVIDIA H100 GPUs, finishing in about a day. Fine-tuning on 64 H100s took about two hours. By comparison, NVIDIA's GR00T-N1-2B model was trained on 600 million samples with 1,024 H100s, while Physical Intelligence trained π0 on 900 million samples and an undisclosed number of chips.
We evaluated MolmoAct’s pre-training capabilities through SimplerEnv, a benchmark containing a set of simulated test environments for common real robot manipulation setups. MolmoAct achieved a state-of-the-art out-of-distribution task success rate of 72.1%, beating models from Physical Intelligence, Google, Microsoft, NVIDIA, and others.
Beyond pre-training, MolmoAct is a powerful generalist foundation model that can quickly adapt to new tasks, embodiments, and domains. On the LIBERO simulation benchmark, which measures knowledge transfer in multitask and lifelong robot learning problems, MolmoAct delivers state-of-the-art results, topping models from a number of major labs with an average success rate of 86.6% using parameter-efficient fine-tuning.
We ran extensive real-world experiments to probe MolmoAct’s ability to generalize. After training MolmoAct, OpenVLA, and π0 Fast on the same fixed set of multi-task demonstrations, we systematically introduced perturbations—ranging from paraphrased language commands to entirely novel objects. Across every variant, MolmoAct delivered the highest success rates, confirming its superior generalization.
Interpretable and Steerable
With its exceptional reasoning capabilities, MolmoAct enables more intuitive control: users can instruct the model using natural language or by drawing a visual trace on a screen.
Before issuing any control commands, MolmoAct grounds its internal reasoning in pixel space and overlays its planned motion trajectory directly onto images that the model takes as input. This visual trace gives a preview of intended movements, offering a means of correcting mistakes or preventing unwanted behaviors.
Users can further guide MolmoAct by sketching target poses or paths on a device such as a smartphone, tablet, or laptop. MolmoAct integrates these free-form annotations in real time, delivering a safer and more explainable experience.
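As a rough illustration of the overlay described above, the snippet below draws a waypoint trace onto an input frame with OpenCV. It is a minimal sketch under our own assumptions (waypoints given as normalized (x, y) image coordinates) and is not part of the MolmoAct release.

```python
# Minimal sketch: render a motion trace over an input image with OpenCV.
# The waypoint format (normalized x, y in [0, 1]) is an assumption.
import cv2
import numpy as np

def overlay_trace(image_bgr, waypoints, color=(0, 255, 0)):
    h, w = image_bgr.shape[:2]
    pts = np.array([[int(x * w), int(y * h)] for x, y in waypoints], dtype=np.int32)
    out = image_bgr.copy()
    cv2.polylines(out, [pts], isClosed=False, color=color, thickness=2)
    for x, y in pts:
        cv2.circle(out, (int(x), int(y)), radius=4, color=color, thickness=-1)
    return out
```

A trace sketched on a tablet or phone could be converted into the same normalized waypoint format and passed back as a conditioning signal, which is the sense in which the model can be steered by drawing.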
A Foundation for Future Work
From mobile manipulators to humanoids, MolmoAct enables coherent, grounded, and transparent behavior in robotics and other hardware, opening the door to robust generalization across diverse real-world settings.
We’re releasing everything you need to inspect, customize, and validate MolmoAct. This includes evaluation scripts and data that robotics models rarely make public, plus benchmark tools to verify the model works as intended.
MolmoAct is a strong base model, but we have many ideas for where we'd like to take it. In the future, we plan to run more real-world robotics experiments and broaden our evaluation to a larger set of simulation benchmarks.
We encourage you to learn more about MolmoAct in our technical report, and to download the model and model artifacts – including training checkpoints and evaluations – from our repositories.
We look forward to sharing more about our spatial reasoning research soon.