Molmo2

VLM
Multimodal
A fully open Vision-Language Model with video grounding capabilities
Author: Naoto Iwase

Published: February 3, 2026

Molmo2 (Multimodal Open Language Model 2) is a fully open Vision-Language Model (VLM) family developed by the Allen Institute for AI (AI2) and the University of Washington. Its key distinguishing feature is video grounding capability, which enables the model to precisely indicate “when and where” specific events or objects occur within a video.

Using 9 new datasets (constructed entirely without relying on proprietary models), Molmo2 achieves state-of-the-art performance among open-source models. In particular, it surpasses proprietary models such as Gemini 3 Pro in video pointing and tracking.

Paper: arXiv:2601.10611

Code: github.com/allenai/molmo2

Demo: playground.allenai.org

Key Contributions

  • 9 new datasets: Built entirely without distillation from proprietary models
  • Video grounding: Spatiotemporal pointing and tracking
  • Ultra-dense video captions: Average of 924 words per video (roughly 2–12× the caption length of existing datasets)
  • Fully open: Models, data, and code are all publicly released
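To make the "when and where" idea concrete, here is a minimal sketch of a spatiotemporal point track: a labeled sequence of 2D points tied to timestamps. The schema below (timestamps in seconds, coordinates as percentages of frame size) is purely illustrative and is not Molmo2's actual output format.

```python
from dataclasses import dataclass

@dataclass
class TrackPoint:
    t: float  # time in seconds
    x: float  # horizontal position, percent of frame width (0-100)
    y: float  # vertical position, percent of frame height (0-100)

@dataclass
class PointTrack:
    """A hypothetical spatiotemporal track: one object, many timed points."""
    label: str
    points: list[TrackPoint]

    def span(self) -> tuple[float, float]:
        """Temporal extent of the track: (first, last) timestamp."""
        ts = [p.t for p in self.points]
        return (min(ts), max(ts))

# Example: an object pointed at across two video frames.
track = PointTrack(
    label="red car",
    points=[TrackPoint(1.0, 42.5, 60.1), TrackPoint(1.5, 45.0, 59.8)],
)
print(track.span())  # (1.0, 1.5)
```

Tracking, in this view, is just producing such a sequence for a queried object; pointing is the single-timestamp case.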

Model Sizes

  • Molmo2-4B: Based on Qwen3 LLM
  • Molmo2-8B: Based on Qwen3 LLM
  • Molmo2-O-7B: Based on Olmo LLM (fully open)

Contents