
MolmoPoint: Better pointing architecture for vision-language models

March 18, 2026

Ai2


Grounding is one of the most useful capabilities in modern vision-language models. It's what lets a model do more than describe an image or video in the abstract—a grounded model can indicate where something is: the right place for a robot to grasp a mug to pick it up, the correct UI element in a screenshot, or the object being counted and tracked across video frames. That matters for robotics, computer-use agents, visual reasoning, and any setting where the model needs to connect language to specific parts of visual input.

But most models still point in a fairly unnatural way. They generate coordinates as text or emit tokens that correspond to coordinate bins. That works, but it forces the model to learn an awkward coordinate system, uses a lot of output tokens, and can make high-resolution grounding brittle. 

In MolmoPoint, which we're releasing today, we take a more intuitive approach. Instead of asking the model to describe a location by spelling out coordinates, we let it point by directly selecting parts of its input features. We provide three models – one for general image and video tasks (MolmoPoint-8B), one specialized for software interfaces like apps and websites (MolmoPoint-GUI-8B), and one optimized for video (MolmoPoint-Vid-4B) – along with MolmoPoint-GUISyn, a new open dataset of 36K high-resolution screenshots with over 2 million annotated points. We also introduce MolmoPoint-TrackData, a new tracking dataset that augments our previously released Molmo2-VideoPoint data with human-annotated tracks covering a broader range of objects and scenes, along with synthetically generated tracks featuring complex occlusion and motion dynamics.

All models, code, and data are open source.

From writing coordinates to selecting visual evidence

MolmoPoint replaces text-coordinate pointing with a coarse-to-fine grounding mechanism built around three special tokens: <PATCH>, <SUBPATCH>, and <LOCATION>. 

First, the model chooses a coarse image or video patch by attending over visual tokens. Then it refines that choice to a finer-grained subpatch using lower-level ViT features. Finally, it predicts a location within that subpatch. The result is a pointing system that is more directly tied to the model's internal visual representation, rather than to an external coordinate format.
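The three-stage selection can be sketched as follows. This is an illustrative simplification, not the released implementation: the function and variable names (`point_coarse_to_fine`, `loc_head`, the feature shapes) are hypothetical, and the real model runs these steps autoregressively as the `<PATCH>`, `<SUBPATCH>`, and `<LOCATION>` tokens are generated.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def point_coarse_to_fine(query, patch_feats, subpatch_feats, loc_head):
    """Select a point by attending over visual features instead of
    emitting coordinate text.

    query:          (d,)     pointing query (hidden state at the token)
    patch_feats:    (P, d)   pooled features, one per coarse patch
    subpatch_feats: (P, S, d) finer ViT features within each patch
    loc_head:       maps a subpatch feature to an (x, y) offset
    """
    # 1) <PATCH>: attend over coarse patch features, take the argmax.
    patch_probs = softmax(patch_feats @ query)
    p = int(np.argmax(patch_probs))

    # 2) <SUBPATCH>: refine within the chosen patch using
    #    lower-level ViT features.
    sub_probs = softmax(subpatch_feats[p] @ query)
    s = int(np.argmax(sub_probs))

    # 3) <LOCATION>: predict a continuous offset inside that subpatch.
    dx, dy = loc_head(subpatch_feats[p, s])
    return p, s, (dx, dy)
```

Because each stage scores the model's own visual tokens directly, the point is expressed in terms of the features the model already computed, rather than an external coordinate format.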

This design comes with a few important additions. MolmoPoint uses rotary embeddings to encode how far each candidate patch is from the previously selected one, which helps the model generate points in a consistent order and avoid double-pointing. It also adds a no-more-points class, letting the model explicitly stop instead of being forced to select another patch when nothing relevant remains. 
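A minimal sketch of the decoding loop with an explicit stop class is below. For simplicity, a plain distance penalty stands in for the rotary distance encoding, and all names (`decode_points`, `stop_vec`, `dist_weight`) are hypothetical; the point is only to show how a learned no-more-points score competes with the patch scores at every step.

```python
import numpy as np

def decode_points(query_fn, patch_feats, stop_vec, grid,
                  max_points=10, dist_weight=0.1):
    """Greedy pointing loop with a 'no-more-points' class.

    query_fn(selected): next pointing query given points so far
    patch_feats: (P, d) patch features
    stop_vec:    (d,)   learned no-more-points embedding
    grid:        (P, 2) patch centers; a distance penalty toward the
                 previous selection stands in for the rotary encoding
    """
    selected = []
    for _ in range(max_points):
        q = query_fn(selected)
        scores = patch_feats @ q
        if selected:
            # Prefer patches near the last selection, encouraging a
            # consistent pointing order.
            last = grid[selected[-1]]
            scores = scores - dist_weight * np.linalg.norm(grid - last, axis=1)
            # Never re-select a patch: no double-pointing.
            scores[selected] = -np.inf
        # Stop explicitly when nothing relevant scores higher than
        # the no-more-points class.
        if stop_vec @ q > scores.max():
            break
        selected.append(int(np.argmax(scores)))
    return selected
```

The stop class matters for counting-style queries: without it, the model would be forced to emit another patch even when every relevant object has already been pointed to.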

The practical upside is that, because the model no longer has to memorize a coordinate system, pointing becomes easier to learn and more robust across resolutions. It also takes fewer tokens to express each point—down from 8 tokens to 3. And because the pointing query directly matches visual embeddings, the model can transition from recognition to pointing more naturally and efficiently.

To train our GUI-specialized model, we built MolmoPoint-GUISyn, a synthetic dataset of roughly 36,000 high-resolution screenshots spanning desktop, web, and mobile environments. We generated each screenshot by prompting an LLM to produce HTML mimicking real software, then used Playwright, a browser automation tool that can programmatically inspect page elements, to extract bounding boxes for every visible element, and generated five pointing instructions per element. The resulting dataset contains over 2 million annotated points – 54 per image on average – dense enough that all annotations for a single screenshot can be packed into one training sequence, which makes training efficient.
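The Playwright step of that pipeline might look roughly like this. It is a sketch under assumptions: the element selector, viewport size, and helper names are guesses, not the released pipeline, and only standard Playwright calls (`set_content`, `query_selector_all`, `is_visible`, `bounding_box`) are used.

```python
def element_boxes(html, selector="a, button, input, select, img, [role]"):
    """Render LLM-generated HTML headlessly and return a bounding box
    per visible element. (The selector is an assumption; the real
    pipeline may inspect every element.)"""
    # Imported here so the rest of the module works without a browser.
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1920, "height": 1080})
        page.set_content(html)
        boxes = []
        for el in page.query_selector_all(selector):
            if el.is_visible():
                box = el.bounding_box()  # {'x', 'y', 'width', 'height'}
                if box is not None:
                    boxes.append(box)
        browser.close()
        return boxes

def box_to_point(box):
    """Reduce a bounding box to a single annotated point at its center."""
    return (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)
```

Each box then anchors several natural-language pointing instructions, which is what makes the per-image annotation count so high.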

Stronger pointing on images and video

We evaluate MolmoPoint across several domains. For natural images, we use PointBench, which tests a range of pointing skills, including spatial reasoning, affordance recognition, and counting, and PixMo-Points, which measures how well models can locate objects across diverse images. For GUI grounding, we use ScreenSpot-Pro and OSWorldG, which ask models to identify specific interface elements in high-resolution screenshots of real software. For video, we use counting and pointing benchmarks from BURST and Molmo 2, as well as a human preference evaluation comparing model outputs side by side. 

We additionally evaluate MolmoPoint’s tracking performance on academic benchmarks like MeViS, which requires models to understand dynamic actions and temporal context, and Molmo2-Track. Together, these span diverse video domains and object categories.

The headline result is that MolmoPoint-8B reaches a new state of the art on PointBench with 70.7% average accuracy, up from 68.7% for Molmo 2 (8B). On PixMo-Points, it reaches 89.2 F1, compared with 85.2 for Molmo 2 (8B). The gains are especially strong in reasoning and spatial reasoning on PointBench, where we see roughly 5-point improvements.
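For readers unfamiliar with F1 on a pointing task, one common way to score it is greedy one-to-one matching of predicted points to ground-truth points within a distance threshold, then precision/recall over the matches. This is a generic scheme for illustration; the exact PixMo-Points protocol and threshold may differ.

```python
def pointing_f1(pred, gold, thresh=0.05):
    """F1 for point predictions: each predicted point greedily matches
    the nearest unused gold point within `thresh` (normalized units)."""
    pred, gold = list(pred), list(gold)
    used = set()
    tp = 0
    for px, py in pred:
        best, best_d = None, thresh
        for j, (gx, gy) in enumerate(gold):
            if j in used:
                continue
            d = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            used.add(best)
            tp += 1  # a matched prediction is a true positive
    prec = tp / len(pred) if pred else 1.0
    rec = tp / len(gold) if gold else 1.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Under a metric like this, a model is penalized both for missing objects and for the degenerate runs of extra points that coordinate-based decoding sometimes produces.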

For GUI grounding, MolmoPoint-GUI-8B reaches 61.1 on ScreenSpot-Pro and 70.0 on OSWorldG – state-of-the-art among fully open models. We also compared against a Molmo 2 baseline fine-tuned on the exact same GUI data – the only difference being how the model points – and see a 2-to-9 point gap in favor of grounding tokens.

For video pointing and counting, MolmoPoint-8B improves counting metrics and wins human preference comparisons 59.1% of the time (excluding ties), and MolmoPoint-Vid-4B pushes further with a 58.7 close-accuracy on Molmo2-VideoCount, which asks models to count objects across video frames by pointing to each one. Qualitatively, we see cleaner behavior across the board—fewer degenerate runs of incorrect points, better localization of small targets, and more precise grounding overall.

For tracking, MolmoPoint-8B shows substantial gains over Molmo 2 (8B), reaching state-of-the-art results on MeViS and improving by +5.7 J&F overall on Molmo2-Track. Our ablation studies suggest that both grounding tokens and the new data contribute meaningfully, with grounding tokens driving the majority of the improvement and the new tracking data improving performance across more diverse object types and scenes.

Beyond the benchmark numbers, MolmoPoint turns out to be easier to train than its coordinate-based equivalent. With only 8,192 training examples, it outperforms the baseline by about 20 F1 points, and during full pretraining, it reaches peak pointing performance faster. Grounding tokens aren't just a different way to point—they're an easier one for models to learn.

Why this matters

Pointing sounds like a simple task, but it underpins a lot of what we want models to do—like navigating software interfaces, tracking objects in video, guiding a robot's grasp, or showing a user exactly which part of an image it's talking about. All of that depends on whether the model can point accurately and efficiently. MolmoPoint suggests the field may be using the wrong abstraction when it treats pointing as text generation over coordinates. A model already has visual tokens—letting it point by selecting those tokens turns out to be simpler, faster, and better.

The same idea could extend to other modalities (for example, text or audio tokens). But our finding here is compelling already—grounding tokens offer a better foundation for multimodal models that need to point.
