Most vision systems can tell you what is in an image. Far fewer can tell you where those objects sit in three dimensions – how far away they are, how large they are, and how they're oriented – from a single photograph. This is the core challenge of spatial intelligence: understanding not just what objects are, but how they exist in the physical world. An autonomous vehicle navigating a construction zone, a warehouse robot sorting packages, an AR app placing directions over a street—all need precise 3D understanding, and they need it to work for any object, from any camera.
Recent years have brought rapid progress in finding and labeling objects in 2D images using natural language. But recovering 3D structure from a single image remains fundamentally harder, especially when the system needs to work beyond a fixed category list, handle different ways of specifying what to look for, and generalize across cameras with different resolutions, aspect ratios, and optics. Most approaches cover only a narrow domain like driving or indoor scenes, support a single prompt type, or assume a specific hardware setup—and few can take advantage of extra depth cues when available.
Today we're releasing WildDet3D, an open model for monocular 3D detection. Given a single RGB image, it predicts 3D bounding boxes – estimating an object's position, size, and orientation in metric coordinates – and accepts multiple prompt types including text queries, point prompts, and 2D bounding boxes. Enter a category like "fire hydrant" and it finds every instance in the scene; tap an object and it returns the full 3D bounding box; or pass in a 2D detection from another model and it lifts it into 3D.
WildDet3D can handle inputs such as a cropped phone photo, a wide-angle action-camera frame, or a robotic camera feed without fine-tuning. And when additional geometric signals are available – such as sparse depth from a LiDAR or time-of-flight (ToF) sensor – WildDet3D folds them in to sharpen its predictions.
Alongside the model, we're releasing WildDet3D-Data: over one million images with 3.7 million verified 3D annotations spanning more than 13K object categories, including over 100K human-annotated images, along with evaluation materials and an interactive demo. We're also releasing an iOS demo app that uses live camera input and LiDAR depth to render 3D bounding boxes as AR overlays in real time.
Everything is openly available—because we believe progress in spatial intelligence should be inspectable, reproducible, and built on by the broader research community.
One architecture, many prompt types
WildDet3D supports several prompt modalities within a single geometry-aware architecture. Category-name prompts let you query by object type—enter "chair" and the model finds every chair in the scene, localized in 3D. Point prompts let you click on an object for interactive selection. Box prompts let you supply a 2D bounding box and have the system infer the full 3D extent.
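To make the unified-interface idea concrete, here is a minimal sketch of how three prompt modalities can be normalized into one query path. The class and function names below are illustrative only – they are not the WildDet3D API – but the shape of the dispatch mirrors the behavior described above.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical prompt types mirroring the three modalities described above.
# These names are illustrative, not part of the actual WildDet3D interface.

@dataclass
class TextPrompt:
    category: str            # e.g. "chair"

@dataclass
class PointPrompt:
    x: float                 # pixel coordinates of a user click
    y: float

@dataclass
class BoxPrompt:
    x1: float                # 2D box from an upstream detector
    y1: float
    x2: float
    y2: float

Prompt = Union[TextPrompt, PointPrompt, BoxPrompt]

def describe_query(prompt: Prompt) -> str:
    """Route any prompt type into a single detection interface."""
    if isinstance(prompt, TextPrompt):
        return f"find all instances of '{prompt.category}'"
    if isinstance(prompt, PointPrompt):
        return f"segment the object at pixel ({prompt.x}, {prompt.y})"
    if isinstance(prompt, BoxPrompt):
        return f"lift 2D box ({prompt.x1}, {prompt.y1}, {prompt.x2}, {prompt.y2}) to 3D"
    raise TypeError(f"unsupported prompt: {prompt!r}")
```

Because every modality resolves to the same downstream query, adding a new prompt type means adding one branch, not a new model.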
For richer interaction, WildDet3D can be paired with a vision-language model like Molmo 2: the VLM interprets what a user is asking about, then hands the relevant region to WildDet3D for 3D localization. This also means WildDet3D can serve as a spatial reasoning layer in larger pipelines, adding 3D understanding to any system that can produce a category name, a point, or a 2D box.
This flexibility also opens the door to zero-shot 3D tracking. Because WildDet3D can accept a 2D bounding box from any upstream detector or tracker and lift it into 3D frame by frame, it can provide continuous 3D localization of objects across a video stream without ever having been trained on tracking data. Pair it with a wearable camera – like smart glasses – and the architecture could support persistent spatial awareness of the objects around you, driven entirely by the visual feed (though the full model currently requires server-side compute or further optimization for real-time on-device use).
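The tracking recipe above is simple enough to sketch end to end: a 2D tracker supplies per-frame boxes with stable track IDs, and a box-prompted lifting call turns each into a 3D box. The `stub_lift` function below is a toy stand-in for the model (a fixed-size box whose depth shrinks as the 2D box grows); everything else is just the loop structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box2D = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels

@dataclass
class Box3D:
    center: Tuple[float, float, float]      # metric xyz in camera coordinates
    size: Tuple[float, float, float]        # width, height, length in meters
    yaw: float                              # heading in radians

def track_in_3d(
    frames: List[Dict[int, Box2D]],         # per-frame {track_id: 2D box} from any tracker
    lift: Callable[[Box2D], Box3D],         # box-prompted 3D lifting (stands in for the model)
) -> Dict[int, List[Box3D]]:
    """Turn each 2D track into a 3D trajectory, one lifted box per frame."""
    trajectories: Dict[int, List[Box3D]] = {}
    for frame in frames:
        for track_id, box2d in frame.items():
            trajectories.setdefault(track_id, []).append(lift(box2d))
    return trajectories

def stub_lift(box: Box2D) -> Box3D:
    """Toy lifter: depth inversely proportional to the 2D box height."""
    x1, y1, x2, y2 = box
    height_px = max(y2 - y1, 1.0)
    depth = 1000.0 / height_px              # crude pinhole-style depth proxy
    cx = (x1 + x2) / 2.0
    return Box3D(center=(cx, 0.0, depth), size=(0.5, 1.0, 0.5), yaw=0.0)
```

The point of the sketch is that no tracking-specific training is involved: temporal association lives entirely in the upstream tracker, and the 3D lift is applied per frame.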
Under the hood, three components work together. First, a 2D detector built on the SAM3 vision backbone accepts all three prompt types and identifies objects in the image. Second, a separate geometry backend – a frozen DINOv2 encoder with a trainable depth decoder – estimates per-pixel depth and produces geometry-aware features. These two branches run in parallel for efficiency. Third, a 3D detection head fuses the 2D detections with the depth features through cross-attention, lifting the 2D evidence into full 3D bounding box predictions that include position, dimensions, and orientation.
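The fusion step can be sketched in a few lines of numpy: object queries from the 2D branch attend over geometry-aware features from the depth branch, so each candidate detection gathers the depth evidence most relevant to it. This is a single-head, projection-free simplification for illustration – the real head would use learned Q/K/V projections, multiple heads, and many layers – but the information flow is the same.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, geo_feats):
    """Single-head cross-attention: 2D-detection queries (N, d) attend to
    geometry-aware depth features (M, d). Learned projections are omitted
    for clarity; this keeps only the attention mechanics."""
    d_k = queries.shape[-1]
    scores = queries @ geo_feats.T / np.sqrt(d_k)   # (N, M) similarities
    weights = softmax(scores, axis=-1)              # attention over depth features
    return weights @ geo_feats                      # (N, d) fused features

rng = np.random.default_rng(0)
obj_queries = rng.normal(size=(4, 32))    # 4 candidate detections from the 2D branch
depth_feats = rng.normal(size=(100, 32))  # flattened per-pixel geometry features
fused = cross_attend(obj_queries, depth_feats)     # inputs to the 3D box head
```

In the full model, these fused per-object features would then be decoded into position, dimensions, and orientation.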
A key design choice is that the geometry backend is modular—decoupled from the detection backbone so that different depth models can be swapped in without rearchitecting the system. The backend also uses a ray-aware decoder that bakes camera geometry directly into its features using spherical harmonic encodings of camera ray directions, eliminating the need for a separate camera calibration branch.
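A minimal version of that ray-aware conditioning looks like this: back-project every pixel through the inverse intrinsics to get a unit ray direction, then encode the directions with real spherical harmonics. The sketch below uses only degrees 0 and 1 (four channels); the degree the model actually uses isn't specified here, and the intrinsics matrix is a made-up example.

```python
import numpy as np

def ray_directions(K, height, width):
    """Unit ray direction for every pixel, from camera intrinsics K (3x3)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # back-project into the camera frame
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def sh_encode(dirs):
    """Real spherical harmonics of unit directions, degrees 0-1 (4 channels)."""
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    c0 = 0.28209479177387814    # Y_0^0 constant: 1 / (2*sqrt(pi))
    c1 = 0.4886025119029199     # degree-1 constant: sqrt(3 / (4*pi))
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=-1)

K = np.array([[500.0, 0.0, 320.0],     # example intrinsics for a 640x480 camera
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
dirs = ray_directions(K, height=480, width=640)
enc = sh_encode(dirs)                  # (480, 640, 4) geometry channels for the decoder
```

Because the encoding depends only on where each pixel's ray points, a change of focal length or aspect ratio shows up directly in the features, which is what lets one decoder serve many cameras without a separate calibration branch.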
When sparse or partial depth data is available at inference time – from a LiDAR sensor, an RGB-D camera, or a stereo setup – it feeds seamlessly into this backend, improving localization without requiring any changes to the overall pipeline.
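One simple way to see why even a handful of metric depth samples helps: they can anchor a relative depth prediction to absolute scale. The sketch below fits a global scale and shift by least squares at the measured pixels – a classic closed-form stand-in; the model's actual fusion of sparse depth is learned, not this formula.

```python
import numpy as np

def align_depth(pred_depth, sparse_uv, sparse_z):
    """Fit scale a and shift b so that a*pred + b best matches the sparse
    metric depths at the measured pixels (least squares), then apply the
    correction globally. An illustrative stand-in for folding sparse
    LiDAR/ToF points into a dense prediction."""
    p = np.array([pred_depth[v, u] for u, v in sparse_uv])
    A = np.stack([p, np.ones_like(p)], axis=1)          # columns: [pred, 1]
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(sparse_z), rcond=None)
    return a * pred_depth + b

# Toy example: the prediction is relative depth, off by scale 2 and shift 1.
true = np.linspace(1.0, 5.0, 25).reshape(5, 5)
pred = (true - 1.0) / 2.0
uv = [(0, 0), (2, 2), (4, 4)]                           # (u, v) pixels with sensor depth
z = [true[v, u] for u, v in uv]
metric = align_depth(pred, uv, z)                       # recovers the metric map
```

Three well-placed samples are enough to pin down the two unknowns here; a learned fusion can do better still by correcting local, not just global, errors.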
Better 3D perception doesn't just require better models—it requires training data that reflects the variety of objects beyond standard benchmarks. Complementing WildDet3D, WildDet3D-Data was built by generating candidate 3D boxes for objects in existing large-scale 2D detection datasets – COCO, LVIS, Objects365, and V3Det – using five complementary 3D estimation methods, then refining and filtering the candidates through VLM-based and human selection. This curation process yields over one million images with 3.7 million verified 3D annotations covering more than 13K categories, with a carefully human-selected core of over 100K images—far broader than what established 3D datasets offer alone.
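To give a feel for the filtering stage, here is one way candidates from several estimators might be reconciled before any VLM or human review: keep a box only when enough methods agree on where the object is. The thresholds and the median-consensus rule below are illustrative assumptions, not the pipeline's actual criteria.

```python
import numpy as np

def consensus_box(candidates, agree_dist=0.5, min_agree=3):
    """Given candidate 3D box centers for one object from several estimation
    methods, return the median center only if enough methods land close to
    it; otherwise drop the object. An illustrative stand-in for the
    refine-and-filter stage; the real pipeline additionally applies
    VLM-based and human selection."""
    centers = np.asarray(candidates, dtype=float)       # (n_methods, 3) in meters
    median = np.median(centers, axis=0)
    dists = np.linalg.norm(centers - median, axis=1)
    if (dists < agree_dist).sum() >= min_agree:
        return median
    return None                                         # too much disagreement

# Five methods: four roughly agree, one is an outlier -> the box survives.
cands = [[1.0, 0.0, 5.0], [1.1, 0.0, 5.1], [0.9, 0.1, 4.9],
         [1.0, -0.1, 5.0], [4.0, 2.0, 9.0]]
kept = consensus_box(cands)
```

Cheap agreement checks like this let a pipeline discard obviously inconsistent candidates early, so the expensive VLM and human passes only see plausible boxes.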
Training on this data is what enables WildDet3D to generalize beyond narrow benchmark taxonomies. As we show below, it lifts in-the-wild performance across 700+ object categories.
Strong across benchmarks and in zero-shot transfer
We evaluated WildDet3D across several settings to test both accuracy on established benchmarks and the ability to generalize to new domains and categories.
On Omni3D – the standard suite for monocular 3D detection, spanning six indoor and outdoor datasets across 50 categories – WildDet3D reaches 34.2 AP (Average Precision, a measure of how accurately predicted 3D boxes match ground truth in position and size) with text prompts, a 5.8-point improvement over the previous best (3D-MOOD), and 36.4 AP with oracle box prompts, surpassing DetAny3D by 2.0 points. It achieves this with just 12 training epochs compared to 80-120 for prior methods, enabled by high-quality pretrained representations from SAM3 and DINOv2. When sparse depth is provided at test time, performance climbs further: 41.6 AP (text) and 45.8 AP (oracle), with the largest jumps on indoor datasets where depth sensors are common.
To test generalization beyond Omni3D's training distribution, we evaluated zero-shot on datasets including Argoverse 2, an autonomous driving dataset with 26 object categories, and ScanNet, an indoor scene dataset with 18 categories. Performance is measured by Open Detection Score (ODS), a composite metric combining precision, translation accuracy, scale, and orientation quality.
WildDet3D achieves 40.3 ODS on Argoverse 2, nearly doubling the previous best of 23.8, and 48.9 ODS on ScanNet, a 17.4-point gain. The improvements are most striking on novel categories—objects absent from Omni3D. On those, it scores 38.6 ODS on Argoverse 2 versus 14.8 for the prior best, and 45.8 versus 15.7 on ScanNet, suggesting the model's visual backbone transfers far more effectively to unfamiliar objects than previous architectures. We also see the same pattern on Stereo4D, a zero-shot benchmark with real stereo depth. Without depth, WildDet3D is already competitive in box-prompt mode at 7.5 AP. When real stereo depth is provided at test time, it climbs to 27.7 AP in the oracle box-prompt setting—evidence that the same architecture can generalize beyond Omni3D and make strong use of real geometric signals when they’re available.
In-the-wild evaluation over 700+ object categories. To test even broader generalization, we evaluated WildDet3D on WildDet3D-Bench, our in-the-wild benchmark spanning over 700 object categories grouped by how often they appear: rare (fewer than 5 samples), common (5–20), and frequent (more than 20). Even when trained on Omni3D alone, WildDet3D reaches 6.8 AP in text-prompt mode, already outperforming the strongest 3D-MOOD baseline (Swin-T) at 2.3 AP. With additional training data, it climbs to 22.6 AP in text-prompt mode. When ground-truth depth is available at test time, the full model hits 41.6 AP. The gains hold across all frequency buckets, with the biggest jump on rare categories, where WildDet3D reaches 47.4 AP versus 2.4 for 3D-MOOD—an especially strong sign that the model transfers to long-tail, open-world objects rather than only the categories seen most often in training.
Why this matters—and what's next
WildDet3D represents a meaningful advance in spatial intelligence. It brings together multiple prompt types in one model, making 3D detection more extensible and practical. It demonstrates that open-vocabulary 3D perception can generalize far beyond narrow taxonomies, particularly on categories the model was never trained on. It shows that monocular 3D systems don't have to ignore richer geometry when it's available—the same architecture can reason from RGB alone and still benefit when additional depth cues are present. And it accomplishes this with substantially less training compute than prior methods.
With this release, we're making available the WildDet3D model, WildDet3D-Data, an iOS app, supporting materials for evaluation and experimentation, and an interactive demo—all openly accessible.
Spatial intelligence is core to where AI is heading. The same model that helps an AR app place directions over a street can help a robot estimate the size of a package on a shelf, or power 3D-aware applications on smart glasses—and we think the most interesting applications are the ones no one has built yet.
