AI has rapidly advanced in perception and reasoning, but its next frontier is action. Major technology players are racing to build robots that can operate reliably in homes, hospitals, warehouses, and public spaces. The core challenge lies in training these robots to interact with the real world—and until now, researchers have depended heavily on expensive, manually collected demonstrations to do it.
Most approaches that use simulation data today treat it as an auxiliary source, mixing it with on-target real-world data. But what if simulation data became not just the primary data source, but the only data source? Conventional wisdom suggests the sim2real gap would prove insurmountable, but we hypothesize that we can close it by dramatically expanding the diversity of simulated environments, objects, and camera conditions. If training no longer depends on proprietary, manually collected data and is instead rooted in scalable simulation, robotics research becomes more reproducible and broadly accessible.
To test that thesis, today we're releasing MolmoBot, an open robotic manipulation model suite trained entirely on simulation data. The suite spans two robotic platforms – the Rainbow Robotics RB-Y1 mobile manipulator and the Franka FR3 tabletop arm – and includes multiple policy architectures at different points on the capability/compute tradeoff. In our evaluations, our best model transfers zero-shot to real-world static and mobile manipulation tasks on unseen objects and environments without any fine-tuning, and performs competitively with prior methods including π0 and π0.5 under standard benchmarking protocols.
MolmoBot includes the full stack – training data, data generation pipelines, training code, and a technical report – so others can reproduce, extend, and stress-test our methodology. We believe that simulation can lower the barriers to entry and democratize access to robot learning, putting capable manipulation within reach of academic labs without large-scale teleoperation setups and of organizations exploring manipulation without extensive data collection infrastructure.
The case for simulated data
The most capable robotic manipulation systems today are built on large amounts of real-world data—data that's often closed-source. Projects like Open X-Embodiment and DROID illustrate the scale involved: Open X-Embodiment combines over one million real robot trajectories from 22 embodiments collected across 21 institutions, while DROID includes 76,000 teleoperated trajectories – roughly 350 hours – across 564 scenes and 86 tasks, gathered with the same hardware setup at 13 institutions. These types of projects have driven progress, but they’re expensive to build and difficult to scale.
Our earlier work on SPOC suggested an alternative. Focused on navigation, SPOC showed that training at sufficient scale on cheap simulation supervision can produce systems that generalize to the real world—without reinforcement learning, with RGB-only sensing, without human trajectory collection, and without real-world fine-tuning.
Manipulation is harder, though: it demands precise, contact-rich physics simulation, which platforms like our recently released MolmoSpaces provide.
Several recent efforts have explored synthetic data for manipulation, but most still start with real-world demonstrations. NVIDIA's GR00T platform uses a "data pyramid" where teleoperated robot data sits at the top—synthetic pipelines augment human demonstrations, but real data remains essential. Google DeepMind's RT-1 required 130,000 episodes collected over 17 months with human teleoperators. Physical Intelligence's π series is trained on teleoperated data. A few projects have moved closer to sim-only training: GraspVLA pretrains entirely on synthetic grasping data, though on a single static platform with a fixed camera and without releasing their data or engine, and InternVLA demonstrates sim-to-real transfer but only when fine-tuned on a digital twin closely matching the real evaluation setup.
MolmoBot goes further, training entirely in simulation across contact-rich tasks with fully randomized cameras—transferring zero-shot to real robots across two platforms including mobile manipulation and releasing everything openly.
Underpinning this work is MolmoSpaces, our open ecosystem for embodied AI, which provides the infrastructure for reproducible trajectory generation and the procedurally generated environments behind MolmoBot-Data. MolmoBot-Data is a large-scale dataset of millions of expert manipulation trajectories produced by combining MuJoCo simulation, aggressive domain randomization, and procedural environment generation with variation in objects, placements, viewpoints, lighting, textures, and dynamics across training runs. We also source rigid assets from iTHOR and Objaverse to broaden object coverage. Although the pipeline can produce richer signals (including depth and privileged simulator metadata), our training runs use RGB observations for policy learning, which makes the transfer results more notable.
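The aggressive domain randomization described above can be pictured as sampling a fresh scene configuration per trajectory. The sketch below is purely illustrative: the field names and numeric ranges are our own invention, not the actual MolmoBot-Data pipeline or the MolmoSpaces API.

```python
import random

def sample_scene_config(rng: random.Random) -> dict:
    """Sample one randomized scene configuration (hypothetical ranges)."""
    return {
        # Camera extrinsics: fully randomized viewpoints, as in the text.
        "camera": {
            "azimuth_deg": rng.uniform(0.0, 360.0),
            "elevation_deg": rng.uniform(10.0, 60.0),
            "distance_m": rng.uniform(0.6, 1.4),
        },
        # Lighting and appearance randomization.
        "lighting": {"intensity": rng.uniform(0.3, 1.5),
                     "color_jitter": rng.uniform(0.0, 0.2)},
        "table_texture_id": rng.randrange(1000),
        # Dynamics randomization (friction, mass) for sim2real robustness.
        "dynamics": {"friction": rng.uniform(0.4, 1.2),
                     "object_mass_scale": rng.uniform(0.8, 1.25)},
        # Object layout: a few assets placed at random poses.
        "objects": [
            {"asset_id": rng.randrange(5000),
             "xy": (rng.uniform(-0.3, 0.3), rng.uniform(-0.2, 0.2)),
             "yaw_deg": rng.uniform(0.0, 360.0)}
            for _ in range(rng.randint(2, 6))
        ],
    }

rng = random.Random(0)
config = sample_scene_config(rng)
```

Because every trajectory draws a new configuration, the policy never sees the same viewpoint, lighting, or layout twice, which is the diversity the transfer argument relies on.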
One suite, many tasks
MolmoBot is a suite of manipulation policies trained on MolmoBot-Data, spanning several core task categories evaluated across two robotic platforms:
- Pick-and-place. Tabletop grasping and precise object placement, evaluated on the Franka FR3.
- Articulated object manipulation. Opening and closing drawers, cabinets, microwaves, and other articulated objects across several categories, evaluated on the RB-Y1.
- Door opening. Approaching, grasping, and pulling or pushing doors through their full range of motion, evaluated on the RB-Y1.
For clarity, MolmoBot focuses on manipulation and articulation—navigation is out of scope.
You can specify tasks in natural language or through point-based commands (e.g., "pick," "place," and "close").
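To make the two command modes concrete, here is a rough sketch of what such an interface could look like. The schema below is our own illustration, not the MolmoBot API: a task is either free-form language or a primitive verb grounded at an image point.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TaskCommand:
    """One manipulation task: free-form language, or a verb plus a pixel."""
    instruction: Optional[str] = None        # e.g. "open the top drawer"
    verb: Optional[str] = None               # e.g. "pick", "place", "close"
    point: Optional[Tuple[int, int]] = None  # (u, v) pixel the verb acts on

    def is_point_based(self) -> bool:
        return self.verb is not None and self.point is not None

# Natural-language task
lang_task = TaskCommand(instruction="open the top drawer")
# Point-based task: pick up whatever object is at pixel (212, 148)
point_task = TaskCommand(verb="pick", point=(212, 148))
```

Point-based commands are useful when the target is easier to indicate than to describe, such as one of several identical objects on a cluttered table.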
One dataset, many architectures
The MolmoBot suite includes three policy architectures, all trained via behavior cloning on the same synthetic data.
MolmoBot is our primary VLM-based manipulation policy. Built on the Molmo2 vision-language backbone, it processes multiple timesteps of RGB observations and language instructions through an image encoder, language encoder, and action decoder. MolmoBot achieves the highest performance across our evaluations.
MolmoBot-SPOC is a lightweight transformer policy adapted from the original SPOC navigation architecture. It offers competitive performance with significantly fewer parameters, making it well-suited for compute-constrained settings.
MolmoBot-Pi0 uses the PaliGemma backbone with an action head, matching the architecture used by Physical Intelligence's π0. We included these specifically to enable controlled, apples-to-apples comparisons—isolating the effect of synthetic vs. real-world data.
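All three architectures share the same training recipe: behavior cloning, i.e. supervised regression from observations to the expert's actions. A minimal sketch of that objective on toy 1-D data (plain Python, no ML framework; nothing here reflects the actual model architectures or data):

```python
# Behavior cloning on toy data: fit action = w * obs by minimizing
# mean squared error against expert actions with gradient descent.
expert = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (obs, action)

w = 0.0    # single policy parameter
lr = 0.05
for _ in range(200):
    # dL/dw for L = mean((w*obs - action)^2)
    grad = sum(2 * (w * o - a) * o for o, a in expert) / len(expert)
    w -= lr * grad

# w converges toward 2.0, the slope of the expert's behavior
```

The real policies replace the scalar `w` with millions of transformer parameters and the toy pairs with MolmoBot-Data trajectories, but the supervised objective is the same, which is what makes cheap expert trajectories from simulation directly usable.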
How it performs
We evaluated MolmoBot in simulation and in the real world, testing robustness to visual distribution shift with evaluation-time perturbations not seen during training—including camera changes, lighting changes, and an alternate renderer. Because these comparisons can be sensitive to differences in task definitions and success criteria, we matched protocols where possible and report head-to-head results under the same setup.
Without any real-world fine-tuning, MolmoBot achieves zero-shot sim-to-real transfer on both the RB-Y1 and Franka FR3. On pick-and-place benchmarks, MolmoBot outperforms π0.5, a model trained on large-scale real-world demonstration data—suggesting that synthetic training with sufficient scale and diversity can approach or match methods that depend on expensive data collection.
Why this matters
The biggest constraint in robotics today is expensive, manually collected data. Our results suggest that robots can instead be trained entirely in simulation, shifting the priority from collecting manual demonstrations to generating diverse virtual environments with platforms such as MolmoSpaces. That shift lowers the barrier to entry, speeds up experimentation, and makes it possible for far more labs and organizations to build capable physical AI systems: robots that can grasp unfamiliar objects, manipulate articulated surfaces, and operate reliably in unstructured environments.
We see MolmoBot as a test of whether fully synthetic training can work for manipulation. Our results suggest it can, without expensive real-world data collection, task-specific fine-tuning, photorealistic rendering, or complex domain adaptation. The practical outcome is that the bottleneck moves from manually collecting data to designing better virtual worlds—a problem we can scale with compute and open infrastructure.
If you're working on manipulation, sim-to-real transfer, or grounded instruction-following, we'd love for you to try MolmoBot. Download the models and test on your robot or benchmark setup, generate your own synthetic training data with MolmoSpaces, and build with us. We're especially eager to see where it breaks—the failure cases will shape what comes next.
The more researchers experimenting with MolmoBot, the faster the community will learn what synthetic training can and can't do—and what it will take to close the remaining gaps. The future of robot learning should be open, and we're building it that way.
