Dense Video Captioning
Overview
Molmo2-Cap is an ultra-dense video captioning dataset used for Molmo2 pretraining. Its descriptions are substantially more detailed than those in conventional video captioning datasets, averaging 924 words per video.
Unlike the short, superficial captions produced by conventional VLMs, Molmo2-Cap captures both dynamic events and fine-grained visual details, providing a foundation for models to develop deep spatiotemporal understanding of video.
Dataset scale:
- 104k video-level captions
- 431k clip-level captions
- Average of 924 words per video in ultra-dense descriptions
Why Dense Captioning Matters
Challenges of Video Understanding
Video captioning is inherently more difficult than image captioning because annotators must describe both of the following:
- Dynamic events: Occurrences, actions, and state transitions that change over time
- Fine-grained visual details: Object appearances, spatial arrangements, and attribute changes
Many existing video captioning datasets are limited to superficial descriptions, making them insufficient for learning video grounding (understanding when, where, and what happened). Molmo2-Cap was designed to bridge this gap.
The Importance of Density
More detailed captions provide models with the following capabilities:
- Spatiotemporal understanding: Accurately grasping “when,” “where,” and “what” happened
- Fine-grained visual recognition: Capturing small objects, subtle actions, and attribute changes
- Contextual understanding: Learning causal relationships and temporal dependencies between events
Comparison with Existing Datasets
Molmo2-Cap achieves significantly greater descriptive volume than existing video captioning datasets:
| Dataset | Avg. Words/Video | Characteristics |
|---|---|---|
| Molmo2-Cap | 924 words | Human spoken descriptions + Molmo integration |
| LLaVA-Video-178K | 547 words | GPT-based synthetic captions |
| ShareGPT4-Video | 280 words | GPT-based synthetic captions |
| RDCap | 100 words | Existing dataset |
| RCap | 89 words | Existing dataset |
| Video Localized Narratives | 75 words | Human annotation |
Molmo2-Cap contains 1.7x more words than LLaVA-Video, 3.3x more than ShareGPT4-Video, and over 12x more than Video Localized Narratives.
Key differences:
- Molmo2-Cap is built on a fully open pipeline that does not rely on proprietary models (such as GPT)
- It is based on human spoken descriptions, which are more natural and detailed than synthetic data
- Frame-level caption integration ensures that low-level visual details are comprehensively described
Data Collection Pipeline
Molmo2-Cap employs an innovative four-stage pipeline for data collection.
Stage 1: Video Sourcing and Selection
- Initial pool construction: Over 10M video clips are collected from multiple large-scale sources (YT-Temporal, YouTube, etc.)
- Information content filtering:
- Audio tracks are removed and frames are uniformly sampled at 1 fps
- Clips are encoded with H.264, and a normalized information content score is computed: bits / (duration × W × H)
- Videos scoring below mean − 1σ are excluded, removing videos with low visual and temporal diversity (this filter is sketched in code after the list)
- Diversity-based sampling:
- Frames are segmented with SAM 2 to estimate visual complexity
- Frames are captioned with Molmo, and keywords are extracted via the MetaCLIP pipeline
- Greedy sampling targeting entropy maximization (over keyword distribution and segment count distribution)
- Approximately 100k videos are ultimately selected (sampling rate of 1%)
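A minimal sketch of the filtering and sampling steps above, assuming per-video records with `bytes`, `duration`, `w`, `h`, and `keywords` fields (the field names are assumptions, and the greedy objective here maximizes keyword entropy only; the actual pipeline also balances the SAM 2 segment-count distribution):

```python
import math
from collections import Counter

def information_score(h264_bytes: int, duration_s: float, width: int, height: int) -> float:
    """Normalized information content: encoded H.264 bits per pixel-second."""
    return (h264_bytes * 8) / (duration_s * width * height)

def filter_low_information(videos: list[dict]) -> list[dict]:
    """Drop videos scoring below mean - 1 sigma of the score distribution."""
    scores = [information_score(v["bytes"], v["duration"], v["w"], v["h"]) for v in videos]
    mean = sum(scores) / len(scores)
    sigma = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [v for v, s in zip(videos, scores) if s >= mean - sigma]

def entropy(counts: Counter) -> float:
    """Shannon entropy of a keyword count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def greedy_diverse_sample(videos: list[dict], k: int) -> list[dict]:
    """Greedily add the video whose keywords most increase keyword entropy."""
    selected: list[dict] = []
    counts: Counter = Counter()
    pool = list(videos)
    while pool and len(selected) < k:
        best = max(pool, key=lambda v: entropy(counts + Counter(v["keywords"])))
        selected.append(best)
        counts += Counter(best["keywords"])
        pool.remove(best)
    return selected
```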
Stage 2: Human Annotation
Clip Splitting Algorithm
Videos are split into variable-length clips (10–30 seconds). Clips with higher information density are assigned shorter durations, equalizing annotator workload while encouraging detailed descriptions (a minimal sketch of this rule follows below).
- Average of 4–5 clips per video
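The exact density-to-duration rule is not specified here, so the linear mapping in this sketch is an illustrative assumption; only the 10–30 second bounds and the "denser means shorter" behavior come from the source:

```python
def clip_duration(density: float, d_min: float = 10.0, d_max: float = 30.0) -> float:
    """Map a normalized information density in [0, 1] to a clip length in seconds.

    Denser content gets a shorter clip so that every clip carries a similar
    annotation load. NOTE: the linear mapping is an illustrative assumption.
    """
    density = min(max(density, 0.0), 1.0)
    return d_max - density * (d_max - d_min)

def split_video(total_s: float, density_at) -> list[tuple[float, float]]:
    """Cut a video into variable-length clips of roughly 10-30 seconds.

    `density_at(t)` is assumed to return the local information density at time t.
    """
    clips, t = [], 0.0
    while t < total_s:
        end = min(t + clip_duration(density_at(t)), total_s)
        clips.append((t, end))
        t = end
    return clips

# Example: a video whose second half is much busier than its first half.
print(split_video(120.0, lambda t: 0.2 if t < 60 else 0.9))
```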
Collecting Spoken Descriptions
Spoken captions have the following advantages over typed descriptions:
- Faster description speed: Annotators can naturally convey details more quickly than by typing
- Natural language expression: Spoken language tends to be more fluent and richer than written language
- Reduced cognitive load: Annotators can focus on the video without being distracted by typing
This approach was also adopted in PixMo-Cap (an image captioning dataset), where it has been demonstrated to be effective for producing high-quality captions.
Annotation process:
Clip description:
- Annotators verbally describe the content of each short clip (audio is muted)
- They narrate in detail what is happening on screen
- Real-time transcription (Whisper-1) runs automatically (a minimal example of such a call is sketched after this list)
- Annotators edit the transcript to correct recognition errors
Overall video summary:
- After all clip descriptions are completed, a comprehensive description of the entire video is written
Question-based prompts:
- A set of predefined questions is presented to encourage annotators to describe “dynamic visual details”
- Example: “How did objects or events change over time?”
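The annotation tool itself is not published; as a rough illustration, the underlying Whisper-1 transcription call could look like the following with the OpenAI Python SDK (the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe one recorded clip narration; the draft is then shown to the
# annotator for manual correction. The file name is a placeholder.
with open("clip_0001_narration.wav", "rb") as audio:
    result = client.audio.transcriptions.create(model="whisper-1", file=audio)

draft_caption = result.text
```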
Stage 3: LLM-Based Text Refinement
Since Whisper transcriptions contain incomplete sentences and colloquial expressions, a text-only LLM is used to perform the following:
- Organize sentence structure and ensure consistency
- Remove redundancy and improve readability
- Convert to fluent text while preserving the original meaning
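The actual refinement prompt is not given in the source; the sketch below is one plausible shape for it (the wording is an assumption, and the output would go to an open, text-only LLM):

```python
REFINE_PROMPT = (
    "Below is a raw speech transcript describing a video clip. Rewrite it as "
    "fluent written prose: complete the incomplete sentences, remove filler "
    "words and repetition, and preserve every visual detail. Do not add any "
    "information that is not in the transcript.\n\n"
    "Transcript:\n{transcript}"
)

def build_refinement_prompt(transcript: str) -> str:
    # The resulting prompt would be sent to an open, text-only LLM.
    return REFINE_PROMPT.format(transcript=transcript)
```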
Stage 4: Frame-Level Integration with Molmo
This stage supplements low-level visual details that human descriptions tend to overlook:
Generate frame-level captions with Molmo:
- Individual frames are captioned using Molmo (an early version)
- Colors, textures, fine-grained object attributes, and other details are described
Merge with LLM:
- Clip-level captions and frame-level captions are integrated
- Duplicates are removed to produce a coherent, long-form caption
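As with Stage 3, the real merge prompt is not published; this sketch only illustrates the shape of the input the merging LLM would receive (wording and structure are assumptions):

```python
def build_merge_prompt(clip_caption: str, frame_captions: list[str]) -> str:
    """Assemble the input for the merging LLM: one clip-level caption plus
    per-frame Molmo captions. Prompt wording is illustrative."""
    frames = "\n".join(f"- Frame {i}: {c}" for i, c in enumerate(frame_captions))
    return (
        "Merge the clip description and the per-frame details below into one "
        "coherent long-form caption. Keep every event from the clip "
        "description, fold in frame-level details (colors, textures, object "
        "attributes), and drop duplicated information.\n\n"
        f"Clip description:\n{clip_caption}\n\nFrame details:\n{frames}"
    )
```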
Pipeline Diagram
```mermaid
flowchart TB
    S1["<b>Stage 1: Video Selection</b><br/>10M+ clips → Filter<br/>→ Diversity Sampling → 100k"]
    S2["<b>Stage 2: Human Annotation</b><br/>Split → Voice → Whisper → Edit"]
    S3["<b>Stage 3: LLM Refinement</b><br/>Organize → Convert to coherent text"]
    S4["<b>Stage 4: Molmo Integration</b><br/>Molmo frame captions<br/>→ LLM merge → Final caption"]
    S1 --> S2 --> S3 --> S4
```
Dataset Statistics
Video sources:
- YT-Temporal
- YouTube keyword search
- Multiple large-scale video datasets
License: Creative Commons (applies to a subset of the videos)
Filtering:
- Videos with low visual and temporal diversity are excluded
- Low-quality captions containing repetitive patterns are removed using heuristic rules
Usage in Training
Molmo2-Cap is used in the pretraining phase of Molmo2:
- Length-conditioned caption generation: The model is trained to generate captions of a specified length
- Weighted sampling: A fixed weight of 0.1 is assigned to video captioning data (balanced with other tasks)
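A sketch of how these two mechanisms might look in practice; only the 0.1 video-captioning weight comes from the source, while the rest of the mixture and the conditioning phrasing are illustrative assumptions:

```python
import random

# Illustrative pretraining mixture: only the 0.1 weight for video
# captioning comes from the source; "other_tasks" is a stand-in for
# the rest of the Molmo2 pretraining mix.
MIXTURE = {"molmo2_cap": 0.1, "other_tasks": 0.9}

def sample_task(rng: random.Random) -> str:
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

def length_conditioned_prompt(n_words: int) -> str:
    # Hypothetical conditioning phrasing; the exact format is not published.
    return f"Describe the video in detail in approximately {n_words} words."

rng = random.Random(0)
print([sample_task(rng) for _ in range(8)])
print(length_conditioned_prompt(900))
```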
Impact and Contributions
Molmo2-Cap makes the following significant contributions:
- Open-source foundation: A fully open pipeline that does not rely on proprietary models (such as GPT)
- Foundation for video grounding: Ultra-dense descriptions enable learning spatiotemporal pointing and tracking
- New standard for data quality: An average of 924 words per video sets a new benchmark for video captioning datasets
- Reproducible methodology: The clear pipeline of spoken descriptions + LLM refinement + Molmo integration can be reused in other projects
Evaluation: Molmo2-CapTest
To evaluate Molmo2’s video captioning capabilities, an evaluation set called Molmo2-CapTest has been constructed:
- 693 Creative Commons licensed videos
- Collected using the same protocol as Molmo2-Cap, but annotated by manually selected high-quality annotators
- Multiple reference captions are provided for each video
- Caption quality is evaluated using the F1 score
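The precise definition of the F1 metric is not spelled out here; as a heavily hedged stand-in, the sketch below computes a bag-of-words F1 against the best-matching of the multiple references:

```python
from collections import Counter

def word_f1(candidate: str, reference: str) -> float:
    """Bag-of-words F1 between a candidate caption and one reference."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def caption_f1(candidate: str, references: list[str]) -> float:
    """Score against the best-matching of multiple reference captions."""
    return max(word_f1(candidate, ref) for ref in references)
```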
Summary
Molmo2-Cap is the most detailed video captioning dataset to date, thanks to the following design choices:
- Spoken descriptions + Whisper: Efficiently collecting natural and fluent descriptions
- LLM refinement: Converting colloquial expressions into readable text
- Molmo integration: Supplementing low-level visual details
- Diversity-based sampling: Constructing a visually and semantically diverse video set
This dataset serves as an indispensable foundation for Molmo2’s video grounding (pointing and tracking) capabilities, demonstrating the potential of fully open video VLMs.
Related sections:
- Video Grounding: Pointing & Tracking – Video pointing and tracking dataset
- Multi-Image Understanding – Multi-image understanding dataset