Molmo2
A fully open Vision-Language Model with video grounding capabilities
Molmo2 (Multimodal Open Language Model 2) is a fully open Vision-Language Model (VLM) family developed by the Allen Institute for AI (AI2) and the University of Washington. Its distinguishing feature is video grounding: the ability to indicate precisely “when and where” specific events or objects occur within a video.
Trained with 9 new datasets, all constructed without relying on proprietary models, Molmo2 achieves state-of-the-art performance among open models. On video pointing and tracking in particular, it surpasses proprietary models such as Gemini 3 Pro.
Paper: arXiv:2601.10611
Code: github.com/allenai/molmo2
Demo: playground.allenai.org
Key Contributions
- 9 new datasets: Built entirely without distillation from proprietary models
- Video grounding: Spatiotemporal pointing and tracking
- Ultra-dense video captions: Average of 924 words/video (approximately 2-12x more than existing datasets)
- Fully open: Models, data, and code are all publicly released
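To make the “when and where” idea concrete, here is a minimal sketch of how a spatiotemporal track could be represented and queried. The `TrackPoint`/`Track` names and fields are illustrative assumptions for this example, not Molmo2’s actual output schema.

```python
from dataclasses import dataclass

@dataclass
class TrackPoint:
    """A single 2D point at a specific video timestamp (hypothetical format)."""
    t: float  # timestamp in seconds
    x: float  # normalized horizontal coordinate in [0, 1]
    y: float  # normalized vertical coordinate in [0, 1]

@dataclass
class Track:
    """A labeled object track: the 'where' over the 'when'."""
    label: str
    points: list  # list[TrackPoint], ordered by timestamp

    def span(self):
        """Return the (start, end) timestamps covered by the track."""
        ts = [p.t for p in self.points]
        return min(ts), max(ts)

    def at(self, t):
        """Return the track point closest in time to t."""
        return min(self.points, key=lambda p: abs(p.t - t))

# Example: a made-up track for a dog crossing the frame left to right.
dog = Track(
    label="dog",
    points=[
        TrackPoint(t=1.0, x=0.10, y=0.60),
        TrackPoint(t=2.0, x=0.45, y=0.58),
        TrackPoint(t=3.0, x=0.80, y=0.55),
    ],
)
print(dog.span())     # (1.0, 3.0)  -> "when" the dog appears
print(dog.at(2.2).x)  # 0.45       -> "where" it is around t = 2.2 s
```

A grounded answer to “when does the dog cross the frame?” would then be the track’s time span together with its per-timestamp points, rather than a free-text description alone.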
Model Sizes
- Molmo2-4B: Based on Qwen3 LLM
- Molmo2-8B: Based on Qwen3 LLM
- Molmo2-O-7B: Based on Olmo LLM (fully open)