Web agents – systems that can navigate and complete tasks in a browser on your behalf – are one of the most promising applications of multimodal AI. They represent a natural next step for vision-language models, moving from understanding images through captions and visual question answering to actually using that understanding to take action in the world. But the most capable web agents today are proprietary, trained on undisclosed data with undisclosed methods. The open-source community lacks not just the models but the training data, infrastructure, and evaluation tools needed to build competitive alternatives. That gap limits reproducibility, slows research progress, and makes it difficult to understand how these systems actually work. In many ways, web agents today are where LLMs were before Olmo—the community needs an open foundation to build on.
Today we're announcing MolmoWeb, an open visual web agent built on our Molmo 2 multimodal model family and available in two sizes (4B and 8B parameters). We're releasing the weights, training data, code (training code coming soon), and evaluation tools used to build it. Designed for self-hosted deployment – whether locally or on cloud services – MolmoWeb can operate a browser by interpreting the same visual interface that humans see, connecting perception and action: given a task instruction and a live webpage, the model observes the page through screenshots, predicts the next step, and executes browser actions such as clicking, typing, or scrolling.
Unlike other open-weight web agents, MolmoWeb was trained without distilling from proprietary vision-based agents—our data comes from synthetic trajectories generated by text-only accessibility-tree agents and human demonstrations.
Alongside the model we’re releasing MolmoWebMix, a large and diverse dataset for training web agents, along with a complete training and evaluation pipeline, reproducible model checkpoints, and tools for collecting web-interaction data. Together these provide a full recipe for building web agents – from data collection to deployment – enabling researchers and developers to inspect and improve every part of the stack.
From looking to doing
Molmo models are trained for multimodal understanding, excelling at tasks such as captioning, visual reasoning, and grounding language in images. MolmoWeb extends these capabilities to browser control.
The system works in a simple loop—look at the screen, decide what to do, do it. At each step it receives a task instruction (e.g., "Find the cheapest nonstop flights from Seattle to Tokyo"), a screenshot of the current browser view, and the history of recent actions. The model then produces a short natural-language thought describing its reasoning, followed by the next browser action to execute.
Supported actions include navigating to URLs, clicking at screen coordinates, typing text into fields, scrolling pages, opening or switching browser tabs, and sending a message back to the user. These actions operate directly in the browser viewport, with click locations represented as normalized coordinates and converted to pixels when executed.
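The loop described above can be sketched in a few lines. This is a minimal illustration, not MolmoWeb's actual API: `model.predict` and the `browser` methods are hypothetical stand-ins, and the click branch shows the normalized-to-pixel coordinate conversion mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # e.g. "click", "type", "scroll", "goto", "message"
    x: float = 0.0  # normalized [0, 1] coordinates, used for clicks
    y: float = 0.0
    text: str = ""

def to_pixels(action, viewport_w, viewport_h):
    """Convert normalized click coordinates to viewport pixels."""
    return round(action.x * viewport_w), round(action.y * viewport_h)

def run_agent(model, browser, instruction, max_steps=30):
    """Hypothetical perceive-decide-act loop; `model` and `browser` stand in
    for the real model and browser-controller interfaces."""
    history = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()
        thought, action = model.predict(instruction, screenshot, history)
        history.append((thought, action))
        if action.kind == "message":   # agent reports back to the user
            return action.text
        if action.kind == "click":
            px, py = to_pixels(action, *browser.viewport_size())
            browser.click(px, py)
        elif action.kind == "type":
            browser.type(action.text)
        elif action.kind == "scroll":
            browser.scroll(action.text or "down")
        elif action.kind == "goto":
            browser.goto(action.text)
    return None  # step budget exhausted without a final answer
```

In a real deployment the `browser` object would wrap an automation layer such as Playwright or the Chrome DevTools Protocol, and the history passed back to the model would include recent screenshots as well as actions.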
This design allows MolmoWeb to interact with websites the same way people do—by interpreting visual layout rather than relying on structured page representations like HTML or accessibility trees. Working from screenshots brings practical advantages. A single screenshot is far more compact than a serialized page representation, which can consume tens of thousands of tokens. Visual interfaces also remain stable even when underlying page structures change, and because the model reasons about the same interface the user sees, its behavior is easier to interpret and debug.
In practice, this means MolmoWeb can carry out a wide range of everyday web tasks – navigating multi-page websites, filling out forms, searching and filtering product listings, or retrieving information from a target webpage – all without needing a dedicated API for any particular website or service. The model decomposes instructions into sequences of actions, maintaining context about previous steps while responding to what appears on the page.
A typical interaction might involve navigating to a website, identifying a search field, entering a query, interpreting the results, opening the relevant page by clicking on links, and extracting or presenting the answer. Throughout, the agent's internal reasoning and action trace remain visible, allowing users to inspect the process and intervene if needed.
A dataset for training web agents
One major challenge in building web agents is the lack of public training data: most prior systems rely on undisclosed corpora. To address this, we created MolmoWebMix, a large open dataset that combines synthetically generated data with human-annotated examples, designed specifically for training multimodal web agents.
MolmoWebMix combines several complementary components.
Human demonstrations. Crowdworkers performed various browsing tasks using a custom Chrome extension that recorded actions and screenshots, capturing realistic behavior across tasks such as search, navigation, and form filling. The resulting dataset includes 30K human task trajectories – the largest publicly released dataset of human web task execution to date – spanning over 590K individual subtask demonstrations across more than 1.1K websites.
Synthetic trajectories. To scale beyond what human annotation alone can provide, we generated additional trajectories using automated agents that operate on webpage accessibility trees. These include single-agent runs filtered for task success, multi-agent pipelines that decompose tasks into subgoals and verify completion, and deterministic navigation paths constructed by systematically exploring link structures across hundreds of websites. Together, these methods produce a large and diverse set of browsing trajectories without requiring further manual effort.
GUI perception data. Finally, MolmoWebMix includes training data that teaches the model to interpret webpage screenshots. This covers element grounding tasks – identifying where a UI element appears on screen – and screenshot question-answering tasks that require reading and reasoning about page content. The screenshot QA portion alone contains over 2.2 million question-answer pairs drawn from nearly 400 websites.
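To make the two perception tasks concrete, here are illustrative record shapes. The field names and values are hypothetical, not the released MolmoWebMix schema.

```python
# Hypothetical record shapes for the two GUI perception tasks; field names
# are illustrative only, not the released MolmoWebMix schema.
grounding_example = {
    "image": "screenshot_0142.png",
    "instruction": "Point to the 'Add to cart' button.",
    "target": {"x": 0.82, "y": 0.37},  # normalized screen coordinates
}

qa_example = {
    "image": "screenshot_0142.png",
    "question": "What is the listed price of the item?",
    "answer": "$24.99",
}
```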
For a detailed breakdown of each data source, including collection methodology and filtering criteria, see our technical report.
Benchmarks
We evaluate MolmoWeb on four widely used web-agent benchmarks that require interacting with live websites: WebVoyager, Online-Mind2Web, DeepShop, and WebTailBench. WebVoyager tests general web navigation across 15 popular websites such as arXiv and GitHub. Online-Mind2Web covers a more diverse range of multi-step tasks spanning 136 websites. DeepShop focuses on complex shopping-related queries on Amazon, such as comparing products and filtering listings. WebTailBench evaluates instruction-following across a curated set of tasks designed to stress-test agent reliability.
In each, a VLM judge evaluates whether the agent successfully completed the task.
Despite their compact size, both the 4B and 8B MolmoWeb models achieve state-of-the-art results among open-weight web agents. MolmoWeb (8B) scores 78.2% on WebVoyager, 42.3% on DeepShop, and 49.5% on WebTailBench, outperforming leading open-weight models like Fara-7B across all four benchmarks. On DeepShop, even the smaller 4B model outperforms Fara-7B at matching step budgets – and still wins when limited to just 30 steps against Fara's 100. MolmoWeb also outperforms agents built on much larger proprietary models like GPT-4o that rely on annotated screenshots and structured page data – a striking result given that those models work from substantially richer input representations and have orders of magnitude more parameters.
Beyond task completion, MolmoWeb also demonstrates strong visual grounding—the ability to precisely locate UI elements on screen. On the ScreenSpot and ScreenSpot v2 benchmarks, a dedicated 8B grounding model trained on our data outperforms both open-weight models like Fara-7B and much larger proprietary systems including Claude 3.7 and OpenAI CUA. Even MolmoWeb (4B), trained as a general web agent rather than a grounding specialist, scores competitively on these benchmarks while also handling full task completion.
We also find that running multiple independent agent rollouts and selecting the best result significantly improves performance. With this test-time scaling approach, the 8B model reaches 94.7% pass@4 on WebVoyager and 60.5% on Online-Mind2Web (compared to 78.2% and 35.3% with a single rollout), demonstrating that additional compute at inference time can substantially improve reliability.
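This best-of-n selection can be sketched as follows; `run_rollout` and `judge_score` are hypothetical stand-ins for an independent agent attempt and the VLM judge's success estimate.

```python
def best_of_n(run_rollout, judge_score, task, n=4):
    """Run n independent rollouts and keep the one the judge scores highest.

    `run_rollout` and `judge_score` are hypothetical stand-ins for an agent
    attempt and a VLM-judge scoring call, respectively.
    """
    best_result, best_score = None, float("-inf")
    for seed in range(n):
        result = run_rollout(task, seed=seed)  # independent attempt
        score = judge_score(task, result)      # judge's success estimate
        if score > best_score:
            best_result, best_score = result, score
    return best_result, best_score
```

The trade-off is straightforward: n rollouts cost roughly n times the inference compute, in exchange for a higher chance that at least one attempt succeeds.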
Limitations and safety considerations
MolmoWeb has several known limitations. As a purely vision-based model, it can make mistakes when reading text from screenshots. It can also be thrown off track by incorrect actions—for example, scrolling before a page has finished loading and missing relevant content. Performance degrades as instructions become more ambiguous or involve many constraints, and certain actions like scrolling within a specific page element or drag-and-drop remain challenging. Additionally, MolmoWeb is not trained on tasks that require logins or financial transactions, due to safety and privacy concerns. These are all active areas for improvement.
On the safety side, MolmoWeb was designed with transparency as a core goal—every component is open for inspection and audit. Our hosted demo includes additional safeguards: it is restricted to a set of whitelisted websites, uses the Google Cloud Natural Language API to flag and reject unsafe queries, checks input field types before typing, and blocks actions on password and credit card fields. These restrictions are specific to the demo environment rather than built into the model itself, and we encourage the research community to develop and experiment with additional safety mechanisms as the field matures.
What this unlocks
MolmoWeb is available through Hugging Face and GitHub, along with all training data, evaluation tools, and an inference library for running the model locally. Developers can start self-hosting MolmoWeb to automate everyday browser tasks—running routine tasks on a fixed schedule, executing templated queries with different parameters to gather information across websites or products, and chaining simpler queries into complex workflows where each step picks up from the last browser state.
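The templated-query pattern above can be sketched simply; `run_task`, the template, and the parameter lists are all hypothetical placeholders for a self-hosted MolmoWeb call.

```python
# Sketch of running templated queries across parameters; `run_task` is a
# hypothetical wrapper around a self-hosted MolmoWeb agent.
TEMPLATE = "Find the current price of {product} on {site}"

def gather_prices(run_task, products, sites):
    """Fan one instruction template out over a grid of parameters."""
    results = {}
    for product in products:
        for site in sites:
            task = TEMPLATE.format(product=product, site=site)
            results[(product, site)] = run_task(task)
    return results
```

Chained workflows follow the same shape, except each `run_task` call would start from the browser state the previous step left behind rather than a fresh session.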
Because the full training pipeline is open, developers can also fine-tune the model on their own data to work well for their specific use cases. Researchers, meanwhile, can inspect and build on every component to advance the science of multimodal web agents, from improving the models and expanding the training data to developing new training methods.
Deploying capable agents on the open web raises important unsolved questions. How should agents respect the terms and conditions of websites they interact with? How do we prevent agents from accessing illegal or inappropriate content? How do we ensure safe financial transactions and protect users' personal information? How do we prevent irreversible actions? Making the full system open allows more people to participate in answering these questions and developing the safety practices needed for trustworthy automation on the web.
The web is the world's largest software platform. Agents that can navigate it reliably could dramatically expand access to information and digital services. Just as importantly, MolmoWeb represents a step in an exciting scientific direction—pushing multimodal models beyond passive understanding of images toward systems that can act on what they see.