Molmo 2: State-of-the-art video understanding, pointing, and tracking

December 11, 2025

Ai2


From your smartphone to autonomous vehicles and industrial sensors, video is quickly becoming the dominant language of data. Understanding how the world changes over time underpins research into the next generation of multimodal intelligence—robotics and assistive technology, traffic monitoring and safety, scientific measurement, and more.

In 2024, we introduced Molmo, a family of open multimodal models that helped redefine image understanding. Molmo set a new state-of-the-art on static-image benchmarks and pioneered image pointing. It was downloaded millions of times and deployed across research, education, and industry—and the accompanying PixMo dataset became a go-to resource for teams looking to replace larger but noisier corpora with high-quality captioning data.

Today we're releasing Molmo 2, which is poised to bring that same impact to video. Molmo 2 takes Molmo's strengths in grounded vision and expands them to video and multi-image understanding.

Three variants serve different needs:

  • Molmo 2 (8B) is Qwen 3-based and our best overall model for video grounding and QA. 
  • Molmo 2 (4B) – also Qwen 3-based – is optimized for efficiency. 
  • Molmo 2-O (7B) is built on Olmo, offering a fully open end-to-end model flow including the underlying LLM. This Olmo-backed variant is particularly useful for researchers who want full control over every part of the stack—vision encoder, connector, and language model.

These smaller models punch above their weight. Molmo 2 (8B) outperforms the original Molmo (72B) on key image pointing and grounding benchmarks, delivering stronger localization and reasoning in a far more efficient package. On video tracking, Molmo 2 leapfrogs Gemini 3 Pro along with strong open-weight alternatives, making it the top overall tracker across domains in our evals. Most impressively, Molmo 2 achieves all this while training on less than one-eighth of the video data used by Meta's PerceptionLM – 9.19M videos versus 72.5M – demonstrating the power of careful curation and grounding-focused objectives.

Just as Molmo helped make image pointing standard across the open community, Molmo 2 brings video pointing, tracking, and dense captioning to the same level of accessibility.

State-of-the-art on core multimodal evaluations 

Molmo 2 sets a new open model high-water mark across several dimensions of our multimodal evaluation suite, leading or tying for the best results among open peers on image QA, short-video QA, video counting, video tracking, and human preference—while remaining competitive with substantially larger proprietary systems.

Video tracking. Molmo 2 is the strongest tracker in our evaluations, outperforming both open-weight VLM baselines and specialized open trackers including Sa2VA variants and a Molmo + SAM 2 baseline. It also beats proprietary systems like Gemini 3 Pro by a wide margin.

Image and multi-image reasoning. On our 11-benchmark image average, Molmo 2 (8B) leads all open-weight models, with the 4B variant close behind and Molmo 2-O (7B) providing a strong fully open option. Against proprietary APIs, Molmo 2 lands just behind GPT-5 and GPT-5 mini, and ahead of Gemini 2.5 Pro.

Molmo 2 vs closed peers

Human preference. Molmo 2 (8B) leads our open-weight human preference evaluation, narrowly ahead of Qwen3-VL-8B, with the 4B and fully open 7B variants close behind. Proprietary systems still lead overall, but Molmo 2 outperforms GPT-5 and Claude Sonnet 4.5 in this evaluation.

Short-video QA. Averaged across seven benchmarks including NextQA, PerceptionTest, MVBench, and Video-MME, Molmo 2 (8B) posts the best open-weight score, with the 4B model offering nearly identical performance for users prioritizing speed and efficiency.

Molmo 2 vs open peers

Video grounding (pointing + counting). On our video counting benchmark, Molmo 2 leads all open models we tested by a comfortable margin. These results highlight both progress and remaining headroom—video grounding is still hard, and no model yet reaches 40% accuracy. Importantly, Molmo 2 answers "how many?" questions by returning concrete visual evidence through spatial and temporal grounding, not just a number. Against proprietary APIs, Molmo 2 is competitive with the top systems.

Long-video QA. Molmo 2 delivers strong results on long-video understanding, though larger proprietary systems maintain a clear lead, and one open-weight baseline (Eagle2.5-8B) scores slightly higher.

Compared to earlier Molmo models. Molmo 2 (8B) shows consistent gains over both Molmo (7B) and Molmo (72B) on most core benchmarks, with the largest improvements on grounding and counting tasks like Point-Bench, PixMo-Count, and CountBenchQA. One exception is InfoQA, where Molmo (72B) remains ahead—though Molmo 2 still substantially improves over Molmo (7B) on that benchmark.

Molmo 2 (8B) vs Molmo (7B, 72B)

From image pointing to video-native intelligence

Molmo 2 natively supports single images, multi-image inputs, and video clips of varying length. Where Molmo showed where a model was looking in an image, Molmo 2 extends this idea to space and time.

One of our core design goals was to close a major gap in open models: grounding. With Molmo 2, models can move beyond simple descriptive answers to pinpointing events. Ask "How many times does the robot grasp the red block?" and the model returns points and timestamps for each grasp event. Ask "When did the cup fall?" and it returns the timestamp and location of the fall. Ask "Which player scored the goal?" and it identifies and locates that player in the relevant frames.
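
To make this concrete, here's a minimal sketch of how grounded answers like the grasp-counting example could be consumed downstream. The markup, attribute names, and percent-based coordinate convention below are assumptions for illustration only; see the tech report and model cards for Molmo 2's actual output format.

```python
import re
from dataclasses import dataclass

# Hypothetical markup for illustration only -- the real Molmo 2 output
# format may differ. Example: <point t="3.2" x="41.0" y="55.5" id="1">grasp</point>
POINT_RE = re.compile(
    r'<point t="(?P<t>[\d.]+)" x="(?P<x>[\d.]+)" y="(?P<y>[\d.]+)"'
    r'(?: id="(?P<id>\d+)")?>(?P<label>[^<]*)</point>'
)

@dataclass
class GroundedPoint:
    t: float       # timestamp in seconds
    x: float       # horizontal position, percent of frame width
    y: float       # vertical position, percent of frame height
    track_id: int  # persistent object ID (0 if absent)
    label: str

def parse_points(answer: str) -> list[GroundedPoint]:
    """Extract grounded points; the event count is the number of points."""
    return [
        GroundedPoint(
            t=float(m.group("t")),
            x=float(m.group("x")),
            y=float(m.group("y")),
            track_id=int(m.group("id") or 0),
            label=m.group("label").strip(),
        )
        for m in POINT_RE.finditer(answer)
    ]

answer = ('The robot grasps the red block twice: '
          '<point t="3.2" x="41.0" y="55.5" id="1">grasp</point> '
          '<point t="9.7" x="44.3" y="52.1" id="1">grasp</point>')
events = parse_points(answer)
print(len(events), "grasp events at", [p.t for p in events])
```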

These grounded outputs open up a range of capabilities:

  • Counting-by-pointing.
  • Multi-object tracking with persistent IDs that follow objects across occlusions and re-entries.
  • Dense video captioning with highly descriptive and searchable narratives of long clips.
  • Anomaly detection that flags rare or surprising events.
  • Artifact detection that points to flaws in generative video, such as inconsistent lighting or broken object geometry.
  • Subtitle-aware QA that combines visual evidence with in-video subtitles.

For a prompt like "Point out every instance where the person in the striped shirt flexes their muscles," Molmo 2 analyzes the entire clip, emits the coordinates and timestamps for each event, and maintains a stable ID for the person across the sequence to avoid double-counting. Given requests such as "Find the window above the kitchen sink," "Identify the voice-capturing device held by the woman in yellow," or "Locate the animal that makes the plank tip downward," Molmo 2 resolves these referring expressions, localizes the relevant regions, and returns approximate locations and times.

At inference time, Molmo 2 scales gracefully. You can process more frames directly for maximum fidelity, or adopt a SlowFast-style strategy that uses high resolution on key frames and lower resolution on others—maintaining similar accuracy on long-video tasks while using significantly fewer vision tokens.
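
As a rough sketch of what such a token budget looks like, the snippet below assigns a high resolution to sparse key frames and a lower resolution to the rest, then estimates the resulting vision-token count. The resolutions, patch size, and key-frame spacing are placeholders rather than the values Molmo 2 uses; only the 3×3 pooling mirrors the description later in this post.

```python
import numpy as np

def slowfast_plan(num_frames: int, key_every: int = 8,
                  hi_res: int = 576, lo_res: int = 224):
    """Assign a target resolution per frame: full resolution on sparse "slow"
    key frames, reduced resolution on the dense "fast" frames in between.
    All values here are illustrative."""
    return [{"frame_index": i,
             "resolution": hi_res if i % key_every == 0 else lo_res}
            for i in range(num_frames)]

def token_count(plan, patch: int = 14, pool: int = 3):
    """Rough vision-token estimate: (res/patch)^2 patches per frame,
    pooled into pool x pool windows before the connector."""
    total = 0
    for p in plan:
        patches_per_side = p["resolution"] // patch
        pooled_per_side = int(np.ceil(patches_per_side / pool))
        total += pooled_per_side ** 2
    return total

uniform = [{"frame_index": i, "resolution": 576} for i in range(128)]
print("uniform high-res tokens:", token_count(uniform))
print("SlowFast-style tokens:  ", token_count(slowfast_plan(128)))
```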

Open and extensible architecture

Molmo 2's architecture consists of a vision encoder that processes images or video frames into visual tokens and a language model backbone (Qwen 3 or Olmo) that consumes those tokens alongside text. A lightweight connector interleaves visual tokens with timestamps, image indices, and text so the model can reason jointly over space, time, and language.
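
As a schematic illustration of that data flow (not the released implementation), the sketch below projects per-frame visual tokens into the language model's embedding space and interleaves them with per-frame timestamp markers ahead of the text prompt. All module names, dimensions, and the exact interleaving order are assumptions.

```python
import torch
import torch.nn as nn

class MolmoStyleConnector(nn.Module):
    """Schematic sketch of a vision-encoder -> connector -> LLM data flow;
    sizes and structure are illustrative, not Molmo 2's actual code."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # vision -> LLM token space

    def forward(self, frame_feats, timestamp_embeds, text_embeds):
        # frame_feats:      (num_frames, tokens_per_frame, vision_dim)
        # timestamp_embeds: (num_frames, llm_dim), one marker per frame
        # text_embeds:      (num_text_tokens, llm_dim)
        visual = self.proj(frame_feats)               # project visual tokens
        sequence = []
        for t in range(visual.shape[0]):
            sequence.append(timestamp_embeds[t:t+1])  # "this is time/frame t"
            sequence.append(visual[t])                # that frame's visual tokens
        sequence.append(text_embeds)                  # text prompt follows
        return torch.cat(sequence, dim=0)             # joint sequence for the LLM

connector = MolmoStyleConnector()
frames = torch.randn(8, 36, 1024)   # 8 frames, 36 pooled tokens each
times = torch.randn(8, 4096)
prompt = torch.randn(20, 4096)
fused = connector(frames, times, prompt)
print(fused.shape)                   # (8 * (1 + 36) + 20, 4096)
```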

All Molmo 2 variants are trained in two stages. The first focuses on pretraining for alignment and grounding through joint image captioning and image pointing. This stage uses a mix of 60% captioning, 30% pointing, and 10% natural language data, with natural language supervision including supervised fine-tuning (SFT) data from Tulu to preserve strong language capabilities.

The second stage involves SFT on a multimodal mixture integrating images, multi-image sets, videos, and pure text across categories including captions, image QA, video QA, pointing, tracking, and NLP. Each category receives a sampling rate tuned via empirical experiments, and within a category, datasets are sampled roughly proportional to the square root of their size, with manual adjustments to avoid over-representing large synthetic sources. We train for 25,000 steps with a batch size of 128 and maximum sequence length of 16,384 tokens.
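
For illustration, here is a minimal sketch of that within-category sampling rule; the category rates and dataset sizes below are placeholders, not the actual mixture.

```python
import math

# Placeholder category rates (tuned empirically in the real mixture).
category_rates = {"video_qa": 0.30, "pointing": 0.20, "captioning": 0.25,
                  "image_qa": 0.15, "nlp": 0.10}

# Placeholder dataset sizes within one category (number of examples).
video_qa_datasets = {"dataset_a": 1_000_000, "dataset_b": 140_000,
                     "dataset_c": 45_000}

def sqrt_proportional_weights(sizes: dict[str, int]) -> dict[str, float]:
    """Sample datasets roughly proportional to the square root of their size,
    so very large synthetic sources don't dominate the mixture."""
    roots = {name: math.sqrt(n) for name, n in sizes.items()}
    total = sum(roots.values())
    return {name: r / total for name, r in roots.items()}

within = sqrt_proportional_weights(video_qa_datasets)
# Overall draw probability = category rate * within-category weight
# (before any manual adjustment).
overall = {name: category_rates["video_qa"] * w for name, w in within.items()}
print(within)
print(overall)
```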

For videos, we sample up to 128 frames at ≤2 fps per clip and encode them with a vision transformer. To keep long clips tractable without discarding temporal structure, patches are pooled into 3×3 windows and passed through a connector where they're interleaved with text and timing information before entering the LLM. Crucially, we allow visual tokens – even across different frames or images – to attend to one another, which significantly boosts multi-image and video performance.
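
A simplified sketch of that preprocessing is below; only the 128-frame cap, the ≤2 fps rate, and the 3×3 pooling windows come from the description above, while the patch-grid size and other values are illustrative assumptions.

```python
import numpy as np

def sample_frame_times(duration_s: float, max_frames: int = 128,
                       max_fps: float = 2.0) -> np.ndarray:
    """Pick frame timestamps at <= max_fps, capped at max_frames: short clips
    are sampled at 2 fps, longer clips are spread uniformly."""
    n = min(max_frames, max(1, int(duration_s * max_fps)))
    return np.linspace(0.0, duration_s, num=n, endpoint=False)

def pool_patches(patch_feats: np.ndarray, window: int = 3) -> np.ndarray:
    """Average-pool an (H, W, D) grid of patch features into non-overlapping
    window x window blocks (edge blocks may be smaller)."""
    H, W, D = patch_feats.shape
    out_h, out_w = -(-H // window), -(-W // window)   # ceiling division
    pooled = np.zeros((out_h, out_w, D), dtype=patch_feats.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = patch_feats[i*window:(i+1)*window, j*window:(j+1)*window]
            pooled[i, j] = block.mean(axis=(0, 1))
    return pooled

times = sample_frame_times(duration_s=300.0)   # 5-minute clip
print(len(times), "frames sampled")            # capped at 128
feats = np.random.randn(27, 27, 1024)          # hypothetical 27x27 patch grid
print(pool_patches(feats).shape)               # (9, 9, 1024)
```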

Several additional techniques help squeeze more performance out of the same compute: a token-weighting scheme during fine-tuning balances learning across diverse tasks, sequence packing and a message-tree schedule increase throughput, and bi-directional attention between visual tokens yields further gains on grounding and tracking.
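
To make the last point concrete, here is a minimal sketch of an attention mask that stays causal for text but is bi-directional among visual tokens; it illustrates the idea rather than Molmo 2's actual masking code.

```python
import torch

def mixed_attention_mask(is_visual: torch.Tensor) -> torch.Tensor:
    """Build an attention mask that is causal for text tokens but lets all
    visual tokens attend to one another, even across frames or images.
    is_visual: (seq_len,) bool tensor marking visual positions.
    Returns a (seq_len, seq_len) bool mask (True = may attend)."""
    L = is_visual.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    # Visual-to-visual pairs get full (bi-directional) attention.
    vis_pair = is_visual.unsqueeze(0) & is_visual.unsqueeze(1)
    return causal | vis_pair

# Toy sequence: 4 visual tokens from two frames, then 3 text tokens.
is_visual = torch.tensor([True, True, True, True, False, False, False])
print(mixed_attention_mask(is_visual).int())
```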

Scaling up data curation for video grounding

To teach Molmo 2 spatio-temporal grounding, we extended the Molmo architecture to video and constructed an open, video-centric multimodal corpus of over 9 million examples. This includes nine new datasets designed specifically for dense captioning, long-form QA, and grounded pointing/tracking across images, multi-image sets, and videos.

| Dataset | Description | Scale |
| --- | --- | --- |
| Molmo2-Cap | Dense video captioning | 104k videos + 431k clips |
| Molmo2-AskModelAnything | Human-authored video QA pairs (fine-grained) | 140k QA pairs |
| Molmo2-CapQA | Synthetic video QA generated from captions + metadata | 1M QA pairs (200k videos) |
| Molmo2-SubtitleQA | Synthetic video QA requiring reasoning over visuals + transcripts | 300k QA pairs (100k videos) |
| Molmo2-VideoPoint | Video pointing for counting + spatial-temporal localization | 300k+ queries (160k videos) |
| Molmo2-VideoTrack | Point-based object tracking with natural-language queries | 3.6k clips; 15k queries |
| Molmo2-MultiImageQA | QA over semantically related image sets (2–5 images) | 45k sets; 72k QA (96k images) |
| Molmo2-MultiImagePoint | Multi-image pointing + counting (2–5 related images) | 470k+ samples |
| Molmo2-SynMultiImageQA | Synthetic multi-image QA on text-rich images (charts/tables/docs) | 188k examples |
We create nine new datasets to train Molmo 2 and also incorporate a suite of image and language data from academic datasets into our training mix.

At the core is a new long-form captioning pipeline. Human annotators narrate video clips in rich, spoken descriptions, which are then transcribed and enriched with frame-level details from Molmo itself to ensure subtle visual cues aren't missed. The result is Molmo2-Cap, where captions average hundreds of words per video—significantly longer and more detailed than typical large-scale video caption datasets. This density gives Molmo 2 much richer supervision about events, relationships, and rare details in each clip.

To cover medium-length videos and diverse content, we use our captioning model to summarize and annotate extended videos, segment them into clips, and formulate QA pairs from captions and transcripts. We generate large-scale video grounding data for both pointing and tracking, covering high-frequency everyday objects and actions, complex referring expressions, and long-horizon dynamics where objects reappear after occlusions. We also convert and integrate several academic datasets to ensure coverage across domains and scenarios.

Beyond video, we build a grounding dataset with multi-image documents and pointing instructions, enabling Molmo 2 to ground answers where multiple images provide context. New human-annotated QA sets – Molmo2-AskModelAnything and Molmo2-MultiImageQA – add around 215,000 long-form QA instances spanning video and multi-image inputs, while long-video QA datasets contribute roughly 1.3 million additional examples on several-minute clips.

As part of Molmo 2, we're releasing captions for over 100,000 unique videos plus 431,000 clip-level captions, forming a large, descriptive open corpus for the community to study, benchmark against, and extend.

Ready to use on day one

Molmo 2 is intended for research and educational use in accordance with Ai2’s Responsible Use Guidelines and is designed to be easy to use for a wide audience, from practitioners to researchers. We're releasing the three core model variants alongside specialized versions tuned for pointing and tracking. We're also releasing all newly introduced Molmo 2 open datasets with detailed data recipes, plus benchmarks and tools for grounded video evaluation.

You can try Molmo 2 in the Ai2 Playground, where we've enabled video and multi-image workflows. Upload a clip or multiple images, run video summarization, counting, tracking, or grounded QA, and see exactly where the model is "looking" as it answers.

Soon, Molmo 2 will also be available via API, and we’ll be releasing the training code under an open-source license.

Open, state-of-the-art models for image and video understanding are critical for building systems that anyone can reuse, customize, and improve. We invite you to download the Molmo 2 models and datasets, explore our cookbooks and examples, and read the technical report. Tell us what you build—your feedback will help shape the future Molmo models we're training.

Try in the Ai2 Playground | Download | Data | Tech report

Molmo 2 is licensed under Apache 2.0 and is trained on third-party datasets that are subject to academic and non-commercial research use only. Please review the sources when evaluating whether Molmo 2 is appropriate for your use case.
