
Molmo learns to point and act

April 29, 2026

Ai2


When we released Molmo, it was a bet on openness: that an open vision-language model could not only match or outperform closed alternatives, but also give researchers and developers something closed systems cannot—models they can inspect, adapt, reproduce, and build on. Molmo 2 extended that foundation to video, adding tracking, multi-frame reasoning, and temporal grounding.

In just a few short years, the Molmo family’s openness is already compounding. 

Researchers at Harvard and the Broad Institute built an autonomous agent that relies on Molmo’s pointing capabilities to track animal behavior in experimental videos without any retraining. A team at the University of Edinburgh incorporated Molmo into a multimodal debate framework for AI oversight, where its detailed visual descriptions helped an automated judge catch reasoning errors. And scientists at the University of Trento drew on Molmo’s fully open training pipeline to probe and reshape how VLMs understand spatial relationships.

These projects were possible because Molmo’s weights, data, and code were open to build on. As Molmo grows, that openness is helping it evolve into a broader ecosystem for AI systems that see – and even act – in the real world.

MolmoPoint: A better way to point

Pointing is one of the most practically useful things a vision model can do. It's what lets a robot know where to grasp a mug or an automation app know which on-screen buttons to tap. 

It's also core to how people understand what a model is actually “seeing.”

"Having models that can point is important for many things, including interpretability, since the model can show the user exactly where to look," says Chris Clark, Molmo research lead. "It matters for complex tasks like counting, since the model can count by pointing at things one at a time, the way a human would do it. And it matters for robotics and computer-use agents."

The issue is that teaching a general-purpose VLM to point well is harder than it sounds, often requiring far more training and data-mixture tuning than other tasks. "When we had difficulties training Molmo and Molmo 2, it was often because pointing was lagging behind the performance we were expecting,” Clark says.

Most models point by generating text coordinates, an indirect and often brittle process. MolmoPoint, released in March, takes a more intuitive approach. Instead of outputting coordinates as text, the model points by selecting directly from what it sees, first picking a coarse region and then zeroing in on the exact spot.

The idea came from approaching pointing as a cross-modal problem. 

"Giving an X and Y coordinate is fine for images,” Clark explains, “but it wouldn't work if you wanted to point to some input text or even something like an input audio clip. So we asked how you could point with the same mechanism across many modalities, and pointing directly to input [data] was the obvious answer."

The payoff surprised even Clark and the rest of the research team behind MolmoPoint. Among open models of comparable size, it sets new state-of-the-art results across pointing, screen element identification, and object tracking benchmarks. The result is pointing that is more accurate, more efficient, and more robust than in previous-generation Molmo models, particularly at high resolutions and in cluttered UIs with lots of small, tightly packed buttons and menus.

"The biggest shocks were just how much of a jump we saw in our training efficiency evaluations, and how much more end-task performance improved than I expected," Clark says. 

We’ve released MolmoPoint variants tuned for general image and video tasks, software interfaces, and video tracking, along with new open datasets, including thousands of annotated screenshots and human-labeled object tracks, so others can train their own pointing models.

Better grounding – and grounding that's simpler to tweak – opens up a lot of possibilities, Clark says: "Making it easier to train grounding VLMs without having to extensively tune data mixture or dedicate a large percentage of the training mixture to pointing data will make training easier and cheaper.”

MolmoWeb: An open agent for the open web

Another place where pointing directly translates into action is the web. 

MolmoWeb is a suite of multimodal web agents that can navigate websites and complete tasks on behalf of users. Given an instruction and a screenshot, MolmoWeb predicts the next browser action, working from the visual interface alone without relying on HTML or accessibility trees.

“MolmoWeb is Ai2's first step toward building visual agents for automating tasks in a browser environment using the same interface as humans,” says Tanmay Gupta, MolmoWeb lead. “It’s perception via screenshots and manipulation via mouse and keyboard.”
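As a rough picture of that interface, here's a minimal sketch of a screenshot-only agent loop. The action schema, the stubbed predict_next_action, and the browser callbacks are assumptions for illustration, not MolmoWeb's actual API; a real setup would swap in the model and a browser driver such as Playwright.

```python
# A minimal sketch of a screenshot-only browser agent loop in the spirit of
# MolmoWeb. Everything here is hypothetical scaffolding: predict_next_action
# stands in for the model, and the callbacks stand in for a real browser
# driver; neither reflects MolmoWeb's actual interface.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", "scroll", or "stop"
    x: float = 0.0       # normalized screen coordinates for "click"
    y: float = 0.0
    text: str = ""       # text payload for "type"

def predict_next_action(instruction: str, screenshot: bytes, history: list) -> Action:
    """Placeholder for the VLM: instruction + screenshot + past actions -> next action."""
    # A real agent would call the model here; this stub simply stops.
    return Action(kind="stop")

def run_episode(instruction: str, take_screenshot, execute_action, max_steps: int = 20) -> list:
    """Perception via screenshots, manipulation via mouse/keyboard actions."""
    history = []
    for _ in range(max_steps):
        shot = take_screenshot()                        # pixels only, no HTML or a11y tree
        action = predict_next_action(instruction, shot, history)
        if action.kind == "stop":
            break
        execute_action(action)                          # e.g. click at (x, y) or type text
        history.append(action)
    return history

# Toy usage with no-op browser callbacks.
trace = run_episode("Find the cheapest flight to Tokyo",
                    take_screenshot=lambda: b"",
                    execute_action=lambda a: None)
print(len(trace), "actions taken")
```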

Vision-based perception was a deliberate design decision for MolmoWeb. Screenshots are more robust to website changes than underlying page code, and capturing them is cheaper than processing text since a single image can replace thousands of lines of a webpage’s structure. “We not only want agents to do more, but to do so more reliably,” adds Gupta.

It’s also part of the larger bet behind MolmoWeb. 

“Building an agent that uses the same interface as humans means that any and every economically useful human activity on the web is within reach as capabilities improve,” says Gupta. “No website, no piece of information would be inaccessible; developers won't need custom APIs or special instrumentation. Just describe in plain language what you need done and the agent does it. And it can do that not just once, but a million times over with massive parallelization.”

MolmoWeb outperforms comparable open-weight models on major web browsing benchmarks, and the most capable version also surpasses agents built on much larger proprietary models like GPT-4o despite having fewer parameters and seeing only screenshots.

Getting there took time. “When we started the project, there were a lot of moving parts – a synthetic data pipeline built entirely on LLMs, human trajectory annotation, a browser eval harness, model training – and we weren't fully sure what kind of performance we'd get from supervised fine-tuning on all the data we were collecting,” Gupta says. 

To prove their approach was sound, the team set a preliminary goal last year: build an agent that works on just 20 websites with 5–10 templated tasks each. By early 2026, the focus had shifted to scaling training data and making evals more robust.

“In agentic research, evals are uniquely hard and expensive because you're not evaluating isolated predictions—you're evaluating a sequence of actions where a single failure can cascade through the rest of the trajectory,” says Gupta. “We spent a lot of time visualizing trajectories and tracking down inconsistencies across data generation, training, and evaluation—both to get a clean read on model performance and to close the gaps that were holding it back.”
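A toy calculation, with made-up numbers, shows why that cascade hurts: per-step accuracy can look strong while whole-task success lags far behind, because a single bad action anywhere in a trajectory fails the episode.

```python
# Illustrative only: invented per-step outcomes, not real MolmoWeb results.

def step_accuracy(trajectories):
    steps = [ok for traj in trajectories for ok in traj]
    return sum(steps) / len(steps)

def task_success_rate(trajectories):
    # An episode counts only if every step in it succeeded.
    return sum(all(traj) for traj in trajectories) / len(trajectories)

# Each trajectory is a list of per-step outcomes (True = correct action).
trajs = [
    [True, True, True, True, True],    # clean episode
    [True, True, False, True, True],   # one slip mid-trajectory
    [True, False, True, True, True],   # one slip early
    [True, True, True, True, True],
]
print(f"step accuracy:     {step_accuracy(trajs):.0%}")      # 90%
print(f"task success rate: {task_success_rate(trajs):.0%}")  # 50%
```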

We believe agents for the web should be open source. That's why we've released model checkpoints, training data that includes the largest publicly available set of human web-task demonstrations, and a unified evaluation harness so that others can reproduce, build on, and improve our work. 

“Vision-language models have come a long way since the early captioning and visual question answering models of circa 2015, but to realize their full potential, we need to figure out how to use them to drive actions that actually move the needle economically—not just output descriptive captions,” Gupta says. “With MolmoWeb, we open-sourced everything because we believe this is a problem the community needs to solve together.”

Gupta hopes MolmoWeb will encourage more researchers to “roll up their sleeves” and help build autonomous visual agents designed to augment human work—not replace it. 

“Our North Star is digital assistants that free people to focus on what only humans can do,” he adds. “The models behind computer-use and web-use agents are going to transform human digital activity, impacting everyone from our grandparents to our children who will be raised in an agent-native digital world. I'm excited about the training techniques and architectures that get us there; I want to help shape the technology that defines humanity's next interface to computers—one that's not just more powerful, but more accessible to everyone.” 

Building blocks, not silos

MolmoPoint and MolmoWeb tackle different dimensions of the same problem: helping models understand and act on what they see. MolmoPoint makes pointing more precise, and MolmoWeb uses what's on screen to navigate the open web. Both build on Molmo 2, sharing its vision backbone and the same open development philosophy.

With MolmoBot and MolmoSpaces already providing open infrastructure for robotics, and WildDet3D opening up new possibilities in areas like AR/VR and digital scene understanding, the Molmo ecosystem now covers pointing, web interaction, 3D perception, and physical manipulation.

Every piece of this ecosystem is open source, which means a university lab can fine-tune MolmoPoint for a specific use case or a tinkerer can build an agent on top of MolmoWeb without taking on vendor dependency. We’re building these tools so progress in visual intelligence isn’t limited by who has access—and so the next breakthrough can come from anywhere.
