The next wave of AI will act in the physical world, but realizing this potential requires robots that generalize across new spaces. As the field pursues more general robotic systems, robot environments need to keep pace—generating training and testing data across many unique scenarios rather than tightly controlled settings. Through our past work on the Ai2-THOR family, we've seen how open simulation can accelerate research in robotic navigation, and we've been building the assets and infrastructure to support the next generation of robotic manipulation.
Today, we're launching MolmoSpaces, a large-scale, fully open platform for studying embodied learning. MolmoSpaces supports physics-grounded navigation and manipulation, unifying over 230,000 indoor scenes and more than 130,000 object models – curated from Objaverse and THOR – with over 42 million annotated robotic grasps in a single ecosystem. It includes tooling for scene conversion, grasp integration, and benchmarking, all designed to support large-scale asset generation and systematic evaluation. As a bonus, our assets are compatible with common simulators including MuJoCo, ManiSkill, and NVIDIA Isaac Lab/Sim through a USD conversion script.
High-fidelity physics as a foundation
Our earlier simulation environments relied on Unity's simplified physics and "magic grasps," where an object is considered grasped once it enters a sphere around the gripper—without modeling realistic contact forces. MolmoSpaces instead uses physics engines (such as MuJoCo) with carefully validated physical parameters.
For rigid objects, we verify mass and density by comparing simulated values to LLM-annotated estimates and adjusting density as needed. For articulated objects, we use a teleoperation suite with a simulated Franka FR3 to tune joint properties and movable-part densities; the FR3 itself is tuned via system identification on real trajectories of pushing and picking cubes of known weight.
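To make the density-adjustment step concrete, here is a minimal sketch of the idea, not our actual tooling: measure the mass MuJoCo derives from an asset's MJCF, compare it to a target estimate (e.g., an LLM annotation), and rescale the geom densities to match. It assumes the object's geoms carry explicit `density` attributes.

```python
# Minimal sketch of density adjustment (illustrative, not the release's pipeline).
# Assumes geoms define explicit "density" attributes; target_mass_kg would come
# from an LLM-annotated estimate.
import xml.etree.ElementTree as ET
import mujoco

def match_density_to_mass(mjcf_path: str, body_name: str, target_mass_kg: float) -> None:
    # Mass MuJoCo derives from the current densities and geometry.
    model = mujoco.MjModel.from_xml_path(mjcf_path)
    sim_mass = float(model.body(body_name).mass[0])
    scale = target_mass_kg / sim_mass

    # Scale every explicit density attribute by the same factor and re-save.
    tree = ET.parse(mjcf_path)
    for geom in tree.iter("geom"):
        if "density" in geom.attrib:
            geom.set("density", str(float(geom.get("density")) * scale))
    tree.write(mjcf_path)
```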
We also manually prepare and annotate colliders and meshes for stable contact-rich simulation. Collider meshes are generated with CoACD, and we annotate primitive colliders for all object assets. Receptacle-heavy rigid objects (tables, dressers) primarily use primitives to avoid mesh-mesh contact issues; manipulable objects use convex decomposition for higher fidelity, again with primitives for small or thin objects.
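For reference, convex decomposition with the open-source CoACD package looks roughly like the following (the file names and concavity threshold are illustrative, not the settings used in our pipeline):

```python
# Hedged sketch of convex decomposition with the CoACD Python package.
import coacd
import trimesh

mesh = trimesh.load("mug.obj", force="mesh")
parts = coacd.run_coacd(
    coacd.Mesh(mesh.vertices, mesh.faces),
    threshold=0.05,  # concavity tolerance; lower -> more, tighter convex pieces
)
# Each part is a (vertices, faces) pair that can be exported as a collider mesh.
for i, (verts, faces) in enumerate(parts):
    trimesh.Trimesh(verts, faces).export(f"mug_collider_{i:02d}.obj")
```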
MolmoSpaces-Bench for evaluating generalization
MolmoSpaces includes MolmoSpaces-Bench, a benchmark for evaluating generalist policies with a focus on generalization under systematic, controlled variation. The framework enables researchers to measure performance across multiple axes – object properties (shape, size, weight, articulation), layouts (multi-room, multi-floor, clutter), task complexity (single-step to hierarchical), sensory conditions (lighting, viewpoints), dynamics (friction, mass), and task semantics (instruction phrasing) – rather than reporting a single aggregate success rate. This systematic variation allows for distributional analysis and helps identify out-of-distribution failure modes.
Task definitions include atomic manipulation skills (pick, place, open, close) and compositions, and explicitly include navigation objectives. Assets and environments can be instantiated across multiple simulator backends, enabling comparisons on shared foundations.
This gives robotics researchers something the field has long lacked: the ability to systematically vary one factor at a time while holding others fixed across thousands of realistic scenes. Want to probe grasp robustness to object mass? Study how policies handle lighting or clutter? Expose prompt fragility or object-frequency biases? MolmoSpaces-Bench is designed to enable these controlled experiments while also measuring how training diversity affects sim-to-real transfer through systematic real-world validation.
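As a generic illustration of single-factor variation (a sketch, not MolmoSpaces-Bench's actual API), one can hold the scene, task, and policy fixed and sweep only an object's mass; `run_episode` below is a placeholder for whatever rollout code produces a success signal:

```python
# Sketch: sweep one physical property (object mass) while everything else stays fixed.
import mujoco
import numpy as np

def sweep_object_mass(scene_xml, body_name, masses, run_episode):
    results = {}
    for target_mass in masses:
        model = mujoco.MjModel.from_xml_path(scene_xml)
        body = model.body(body_name)
        scale = target_mass / float(body.mass[0])
        body.mass[:] *= scale      # rescale mass...
        body.inertia[:] *= scale   # ...and inertia consistently
        data = mujoco.MjData(model)
        results[target_mass] = run_episode(model, data)  # e.g., a success boolean
    return results

# Example: probe a pick policy across a 10x mass range (file/body names are made up).
# outcomes = sweep_object_mass("kitchen.xml", "mug", np.linspace(0.1, 1.0, 5), run_episode)
```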
Assets and scenes at scale
MolmoSpaces unifies custom object assets with a curated Objaverse-derived bank, all provided in MJCF and converted to USD for portability across simulators including MuJoCo, ManiSkill, and NVIDIA Isaac Lab/Sim.
From THOR assets, we extracted and converted 1,600+ rigid, graspable object instances across 134 categories. We also expanded the library with articulated household objects – fridges, microwaves, ovens, dishwashers, doors, dressers, and more – annotating joint type (hinge or slide), axis, position, and range so articulated behaviors are defined explicitly rather than through simulator-specific workarounds.
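Because these annotations live directly in the MJCF, they can be read back with standard MuJoCo tooling; the snippet below is a small illustrative example (the asset file name is made up):

```python
# List the joint annotations (type, axis, anchor, range) of an articulated asset.
import mujoco

model = mujoco.MjModel.from_xml_path("fridge.xml")
for j in range(model.njnt):
    name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_JOINT, j)
    if model.jnt_type[j] == mujoco.mjtJoint.mjJNT_HINGE:
        jtype = "hinge"
    elif model.jnt_type[j] == mujoco.mjtJoint.mjJNT_SLIDE:
        jtype = "slide"
    else:
        jtype = "other"
    print(name, jtype,
          "axis:", model.jnt_axis[j],    # rotation/translation axis
          "pos:", model.jnt_pos[j],      # anchor point in the body frame
          "range:", model.jnt_range[j])  # radians for hinges, meters for slides
```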
For Objaverse assets, our pipeline starts from 625,000 assets and applies filtering for metadata completeness, single-object validation, scale normalization, texture quality (score ≥ 4), cross-renderer fidelity (CLIP similarity ≥ 0.6), geometry efficiency (< 1.5 MB), and receptacle validation. The result is 129,000 curated assets spanning approximately 3,000 WordNet synsets, split into train/val/test subsets. For procedural house generation, additional placement filters yield ~92,000 assets suitable for automatic scene population.
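The filtering stage amounts to a conjunction of per-asset checks; the sketch below uses hypothetical metadata field names, but the thresholds mirror the ones listed above:

```python
# Rough sketch of the per-asset filter; the AssetRecord fields are hypothetical.
from dataclasses import dataclass

@dataclass
class AssetRecord:
    has_metadata: bool             # metadata completeness
    is_single_object: bool         # single-object validation
    texture_score: float           # texture quality rating
    clip_similarity: float         # cross-renderer CLIP agreement
    mesh_size_mb: float            # geometry efficiency
    passes_receptacle_check: bool  # receptacle validation

def keep_asset(a: AssetRecord) -> bool:
    return (a.has_metadata
            and a.is_single_object
            and a.texture_score >= 4.0
            and a.clip_similarity >= 0.6
            and a.mesh_size_mb < 1.5
            and a.passes_receptacle_check)
```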
These assets populate scenes drawn from multiple datasets – iTHOR-120, ProcTHOR-10K, ProcTHOR-Objaverse, and Holodeck – spanning hundreds of thousands of indoor environments across homes, offices, classrooms, hospitals, schools, museums, and more. Scene creation modes include hand-crafted environments, manually reproduced digital twins, heuristic procedural generation, and LLM-assisted procedural generation, supporting evaluation across both curated and highly diverse settings.
The scene collection undergoes extensive validation. For rigid manipulables, we apply small external forces; objects that don't move beyond 2 cm are treated as stuck. For articulated objects, we apply joint forces and reject assets that fail to move through at least 60% of their joint range. We also simulate environments to detect drifting and intersections—more than 95% of scenes pass these tests. Occupancy maps identify collision-free starting poses for robots.
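The stuck-object test can be pictured as follows; this is a simplified sketch of the idea rather than the release's validation code:

```python
# Apply a small lateral force and flag the object if it moves less than 2 cm.
import mujoco
import numpy as np

def is_stuck(model, body_name, force_n=2.0, steps=500, min_disp_m=0.02):
    data = mujoco.MjData(model)
    mujoco.mj_forward(model, data)
    body_id = mujoco.mj_name2id(model, mujoco.mjtObj.mjOBJ_BODY, body_name)
    start = data.xpos[body_id].copy()

    data.xfrc_applied[body_id, :3] = np.array([force_n, 0.0, 0.0])  # persistent push
    for _ in range(steps):
        mujoco.mj_step(model, data)

    return np.linalg.norm(data.xpos[body_id] - start) < min_disp_m
```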
Grasps for scalable development
MolmoSpaces includes over 42 million 6-DoF grasp poses across 48,111 objects (up to ~1,000 per object). Grasps are sampled directly from MJCF geometry using a Robotiq 2F-85 gripper model; for articulated objects, sampling is restricted to leaf components (often handles), and grasps that collide with non-leaf geometry are discarded.
We select grasps to be both diverse and robust: they're clustered in full 6-DoF pose space and selected uniformly across clusters, with contact-point preferences (mid-fingerpad vs. fingertip for thin objects). Rigid grasps are tested with linear and rotational perturbations. For articulated objects, we evaluate robustness via actuation feasibility—requiring stable actuation through at least 70% of the valid joint range in both directions while maintaining contact.
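The clustering-based diversity selection could be sketched along the lines below. The exact pose embedding and clustering method aren't spelled out here, so this uses a simple position-plus-rotation-vector embedding with k-means rather than our actual procedure:

```python
# Pick a diverse subset of grasps by clustering in a 6-DoF pose embedding.
import numpy as np
from scipy.spatial.transform import Rotation
from sklearn.cluster import KMeans

def select_diverse_grasps(positions, quaternions, n_select, rot_weight=0.05):
    # Embed each grasp as [x, y, z, w * rotvec]; rot_weight trades meters vs radians.
    # scipy expects quaternions in (x, y, z, w) order.
    rotvecs = Rotation.from_quat(quaternions).as_rotvec()
    feats = np.hstack([positions, rot_weight * rotvecs])

    labels = KMeans(n_clusters=n_select, n_init=10).fit_predict(feats)
    chosen = []
    for k in range(n_select):
        idx = np.flatnonzero(labels == k)
        center = feats[idx].mean(axis=0)
        # Take the grasp closest to each cluster center -> roughly uniform coverage.
        chosen.append(idx[np.argmin(np.linalg.norm(feats[idx] - center, axis=1))])
    return np.array(chosen)
```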
We verify graspability with a floating Robotiq gripper placed at the sampled poses, which must lift or open/close each object.
These grasps can be embedded directly into environments via a grasp loader, and an accompanying trajectory-generation pipeline enables reproducible demonstrations conditioned on grasp data—supporting dataset creation and imitation learning at scale.
What's included
Every component of MolmoSpaces is open and modular. Researchers can inspect and modify the underlying MJCF, regenerate grasps, plug in new robots and controllers, and replicate or extend experiments across multiple simulators and embodiments—including single-arm and dual-arm systems.
This release includes:
- Assets: 130K+ USD/MJCF object assets plus meshes/materials, physics parameters, and rich metadata including descriptions, scale, mass, and synsets/categories
- Scene datasets: 230K+ environment definitions and metadata for fully physics-enabled scenes, ranging from handcrafted single-room scenes to procedurally generated 10+ room commercial and residential spaces
- Grasps: over 42M 6-DoF grasp annotations across 48,000+ objects (rigid + articulated)
- Tools: loaders/utilities for using assets across simulators (including Isaac Lab/Sim via USD; ManiSkill loader)
MolmoSpaces also supports teleoperation-based data collection using mobile tools like Teledex, letting researchers gather demonstrations directly from their phones. The interface is compatible with all of our existing embodiment setups, including DROID and CAP, with no special configuration required.
We invite you to explore MolmoSpaces and train generalist policies on the resulting data. We're eager to see how the community uses it.
Learn more and get started:
Tech Report | Data | Code | Demo