Molmo2
A fully open Vision-Language Model with video grounding capabilities
Molmo2 (Multimodal Open Language Model 2) is a fully open Vision-Language Model (VLM) family developed by the Allen Institute for AI (AI2) and the University of Washington. Its distinguishing feature is video grounding: the ability to indicate precisely “when and where” specific events or objects occur within a video.
Trained with 9 new datasets, all constructed without relying on proprietary models, Molmo2 achieves state-of-the-art performance among open models. On video pointing and tracking in particular, it surpasses proprietary models such as Gemini 3 Pro.
Paper: arXiv:2601.10611
Code: github.com/allenai/molmo2
Demo: playground.allenai.org
Key Contributions
- 9 new datasets: Built entirely without distillation from proprietary models
- Video grounding: Spatiotemporal pointing and tracking
- Ultra-dense video captions: Average of 924 words/video (approximately 2-12x more than existing datasets)
- Fully open: Models, data, and code are all publicly released
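To make the “when and where” idea concrete, here is a minimal sketch of how a spatiotemporal track could be represented and queried. The `TrackPoint`/`Track` names and fields are illustrative assumptions for this example, not Molmo2’s actual output schema.

```python
from dataclasses import dataclass

@dataclass
class TrackPoint:
    """A single 2D point at a specific video timestamp (hypothetical format)."""
    t: float  # timestamp in seconds
    x: float  # normalized horizontal coordinate in [0, 1]
    y: float  # normalized vertical coordinate in [0, 1]

@dataclass
class Track:
    """A labeled object track: the 'where' over the 'when'."""
    label: str
    points: list  # list[TrackPoint], ordered by timestamp

    def span(self):
        """Return the (start, end) timestamps covered by the track."""
        ts = [p.t for p in self.points]
        return min(ts), max(ts)

    def at(self, t):
        """Return the track point closest in time to t."""
        return min(self.points, key=lambda p: abs(p.t - t))

# Example: a made-up track for a dog crossing the frame left to right.
dog = Track(
    label="dog",
    points=[
        TrackPoint(t=1.0, x=0.10, y=0.60),
        TrackPoint(t=2.0, x=0.45, y=0.58),
        TrackPoint(t=3.0, x=0.80, y=0.55),
    ],
)
print(dog.span())     # (1.0, 3.0)  -> "when" the dog appears
print(dog.at(2.2).x)  # 0.45       -> "where" it is around t = 2.2 s
```

A grounded answer to “when does the dog cross the frame?” would then be the track’s time span together with its per-timestamp points, rather than a free-text description alone.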
Model Sizes
- Molmo2-4B: Based on Qwen3 LLM
- Molmo2-8B: Based on Qwen3 LLM
- Molmo2-O-7B: Based on Olmo LLM (fully open)