flowchart TD
A[Visual Input<br/>Image or Video] --> B[ViT Encoding<br/>Patch-level Features]
B --> C[Multi-layer Feature<br/>Extraction]
C --> D{Input Type}
D -->|Image| E[2x2 Attention<br/>Pooling]
D -->|Video Frame| F[3x3 Attention<br/>Pooling]
E --> G[Shared MLP<br/>Projection]
F --> G
G --> H[Visual Tokens<br/>for LLM]
style C fill:#e6f0ff
style E fill:#ffe6f0
style F fill:#ffe6f0
style G fill:#f0ffe6
Vision-Language Connector
Overview
The Vision-Language Connector is a critical module that transforms visual features extracted by the Vision Transformer (ViT) into a format that the Large Language Model (LLM) can process. Molmo2 follows the standard VLM architecture [@Clark2024Molmo] and adopts a design that can uniformly process both images and video.
Architecture Details
Multi-Layer Feature Usage
The Molmo2 Vision-Language Connector extracts features from multiple layers of the ViT, rather than a single layer.
- Third-to-last layer: High-level semantic features
- Ninth-from-last layer: Mid-level features
This design follows the prior work Molmo [@Clark2024Molmo], combining visual information at different levels of abstraction to achieve richer representations.
flowchart TD
L0["Layer 0 (Input)"]
L1["Layer 1"]
Ldots1["..."]
LN9["<b>Layer N-9</b> ← 9th-from-last"]
Ldots2["..."]
LN3["<b>Layer N-3</b> ← 3rd-to-last"]
LN2["Layer N-2"]
LN1["Layer N-1"]
LN["Layer N (Output)"]
C["Connector"]
L0 --- L1 --- Ldots1 --- LN9 --- Ldots2 --- LN3 --- LN2 --- LN1 --- LN
LN9 -. "Features used<br/>by Connector" .-> C
LN3 -. "Features used<br/>by Connector" .-> C
style LN9 fill:#ffe6f0,stroke:#c44a6e,stroke-width:2px
style LN3 fill:#ffe6f0,stroke:#c44a6e,stroke-width:2px
style C fill:#f0ffe6,stroke:#6ec44a,stroke-width:2px
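The layer selection above can be sketched in a few lines. This is a minimal illustration, not the Molmo2 implementation: the function name `select_connector_features`, the nested-list data layout, and the choice to combine the two layers by concatenating along the feature axis are all assumptions made for the example.

```python
def select_connector_features(hidden_states, offsets=(3, 9)):
    """Concatenate per-patch features taken from selected ViT layers.

    hidden_states: list of layers in order; each layer is a list of
        per-patch feature vectors (lists of floats).
    offsets: positions counted from the end (3 = third-to-last layer).
    Concatenation along the feature axis is an assumption of this sketch.
    """
    picked = [hidden_states[-k] for k in offsets]
    num_patches = len(picked[0])
    return [
        [value for layer in picked for value in layer[p]]
        for p in range(num_patches)
    ]


# Toy ViT: 12 layers, 4 patches, 2 features per patch.
# Every value equals the index of the layer it came from.
toy_states = [[[float(i)] * 2 for _ in range(4)] for i in range(12)]
features = select_connector_features(toy_states)
print(len(features), len(features[0]))  # 4 4
print(features[0])  # [9.0, 9.0, 3.0, 3.0]
```

In the toy example the first two values come from the third-to-last layer (index 9 of 12) and the last two from the ninth-to-last layer (index 3), so each patch carries both levels of abstraction.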
Attention Pooling
Attention Pooling reduces the number of patch-level features. Each window of patches is aggregated into a single vector, with the mean of the patches serving as the attention query.
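To make the mechanism concrete, here is a minimal single-head sketch of pooling one window. It assumes the query is the mean of the patches in that window and omits the learned query, key, and value projections a real model would likely have; the function name `attention_pool_window` is hypothetical.

```python
import math


def attention_pool_window(patches):
    """Pool one window of patch vectors into a single vector.

    The query is the mean of the window's patches; keys and values are the
    patches themselves (no learned projections, which is a simplification).
    """
    dim = len(patches[0])
    query = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    # Scaled dot-product score between the mean query and each patch.
    scores = [
        sum(q * x for q, x in zip(query, p)) / math.sqrt(dim) for p in patches
    ]
    # Numerically stable softmax over the patches in the window.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the patch vectors.
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]


window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # one 2x2 window
pooled = attention_pool_window(window)
print(pooled)  # one vector with the same dimension as a single patch
```

Unlike plain average pooling, the weights depend on how similar each patch is to the window mean, so patches that agree with the window's overall content contribute more to the pooled vector.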
Images: 2x2 Pooling
Input Patches (4x4 example):
| | Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|---|
| Row 1 | P₁ | P₂ | P₃ | P₄ |
| Row 2 | P₅ | P₆ | P₇ | P₈ |
| Row 3 | P₉ | P₁₀ | P₁₁ | P₁₂ |
| Row 4 | P₁₃ | P₁₄ | P₁₅ | P₁₆ |
After 2x2 Attention Pooling:
| | Col 1 | Col 2 |
|---|---|---|
| Row 1 | T₁ (P₁, P₂, P₅, P₆) | T₂ (P₃, P₄, P₇, P₈) |
| Row 2 | T₃ (P₉, P₁₀, P₁₃, P₁₄) | T₄ (P₁₁, P₁₂, P₁₅, P₁₆) |
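The grouping of patches into windows can be checked with a short helper. This is an illustrative sketch: `pooling_windows` is a hypothetical name, patches are numbered row by row as in the tables above, and the grid is assumed to divide evenly by the window size.

```python
def pooling_windows(rows, cols, k):
    """Return, for each output token, the 1-based patch indices it pools.

    Patches are numbered row by row; rows and cols are assumed to be
    divisible by the window size k.
    """
    windows = []
    for top in range(0, rows, k):
        for left in range(0, cols, k):
            windows.append([
                r * cols + c + 1
                for r in range(top, top + k)
                for c in range(left, left + k)
            ])
    return windows


for token, patches in enumerate(pooling_windows(4, 4, 2), start=1):
    print(f"T{token}: {patches}")
# T1: [1, 2, 5, 6]
# T2: [3, 4, 7, 8]
# T3: [9, 10, 13, 14]
# T4: [11, 12, 15, 16]
```

The same helper with `k=3` on a 9x9 grid yields 9 windows of 9 patches each, which is the video-frame case.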
Video Frames: 3x3 Pooling
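Because a video contributes many frames, each frame is pooled with a larger 3x3 window: 9 patches collapse into one token, compared with 4 for images, which further reduces the token count.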
Input Patches (9x9 example):
| | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| R1 | P₁ | P₂ | P₃ | P₄ | P₅ | P₆ | P₇ | P₈ | P₉ |
| R2 | … | … | … | … | … | … | … | … | … |
| … | … | … | … | … | … | … | … | … | … |
After 3x3 Attention Pooling:
| | Col 1 | Col 2 | Col 3 |
|---|---|---|---|
| Row 1 | T₁ (9 patches) | T₂ (9 patches) | T₃ (9 patches) |
| … | … | … | … |
Cropping Strategy
Image Cropping
Molmo2 employs a multi-crop strategy.
- One downscaled full crop + up to K overlapping tile crops
- During training: K = 8
- During inference: K = 24 (high-resolution processing)
Images that are too large to be tiled with K crops are downscaled first.
flowchart TD
Orig["Original High-res Image"]
Orig --> DS["Downscaled Full Crop"]
Orig --> Tiles["K Overlapping Tile Crops<br/>C₁, C₂, C₃, ..., Cₖ"]
style Orig fill:#e6f0ff,stroke:#4a86c8
style DS fill:#ffe6f0,stroke:#c44a6e
style Tiles fill:#f0ffe6,stroke:#6ec44a
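The effect of the crop budget K can be illustrated with a toy grid search. The text does not describe how the tile grid is chosen, so everything here is a hypothetical heuristic: the function `choose_tiling`, the 336-pixel `crop_size`, the rule of preserving as much resolution as possible and then preferring fewer tiles, and the decision to ignore tile overlap.

```python
def choose_tiling(width, height, max_crops, crop_size=336):
    """Pick a (rows, cols) tile grid with rows * cols <= max_crops.

    Hypothetical heuristic: keep as much of the original resolution as
    possible, then prefer fewer tiles. Overlap between tiles is ignored.
    Returns (rows, cols, scale); scale < 1 means the image has to be
    downscaled to fit the grid.
    """
    best = None
    for rows in range(1, max_crops + 1):
        for cols in range(1, max_crops // rows + 1):
            scale = min(cols * crop_size / width, rows * crop_size / height, 1.0)
            key = (scale, -(rows * cols))
            if best is None or key > best[0]:
                best = (key, (rows, cols, scale))
    return best[1]


print(choose_tiling(1344, 672, max_crops=8))       # (2, 4, 1.0)
print(choose_tiling(4032, 4032, max_crops=8)[:2])   # (2, 2)
print(choose_tiling(4032, 4032, max_crops=24)[:2])  # (4, 4)
```

Under these assumptions, a wide 1344x672 image fits a 2x4 grid at full resolution, while a 4032x4032 image must be downscaled in both settings; raising K from 8 to 24 lets it keep twice the linear resolution.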
For multi-crop images, column tokens are included in the input to the LLM. This conveys aspect ratio information of the image to the LLM.
Column tokens are not included for single-crop images (which are always square).
Video Cropping
For video, the following strategy is adopted to reduce computational cost:
- Sampling rate: S = 2 fps (one frame every 0.5 seconds)
- Each frame is processed as a single crop (downscaled as needed)
- Maximum frame count: F = 128 (standard training) or F = 384 (long-context training)
flowchart LR
T0["0s"] --- T1["0.5s"] --- T2["1.0s"] --- T3["1.5s"] --- T4["2.0s"] --- T5["2.5s ..."]
T1 --> F1["Frame₁"]
T2 --> F2["Frame₂"]
T3 --> F3["Frame₃"]
T4 --> F4["..."]
T5 --> F5["Frame_n"]
style F1 fill:#e6f0ff,stroke:#4a86c8
style F2 fill:#e6f0ff,stroke:#4a86c8
style F3 fill:#e6f0ff,stroke:#4a86c8
style F5 fill:#ffe6f0,stroke:#c44a6e
If the video is longer than F/S seconds (64 seconds for F = 128 at S = 2 fps), F frames are sampled uniformly instead.
In either case the last frame of the video is always included, because many video players display it after playback ends, which makes it potentially significant to the user.
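The sampling rule can be sketched as follows. This is an illustration under stated assumptions rather than the actual Molmo2 code: the name `sample_frame_times`, the choice to start at 0 seconds, and the exact handling of videos right at the length limit are not specified in the text.

```python
def sample_frame_times(duration, fps=2.0, max_frames=128):
    """Return the timestamps (in seconds) of the frames given to the model.

    Frames are taken at `fps`. If that yields more than `max_frames`
    frames, `max_frames` timestamps are spread uniformly over the video
    instead. The final timestamp is always the end of the video.
    """
    times = [i / fps for i in range(int(duration * fps) + 1)]
    if times[-1] < duration:
        times.append(duration)  # always keep the last frame
    if len(times) > max_frames:
        # Video too long: sample uniformly, ending exactly on the last frame.
        times = [i * duration / (max_frames - 1) for i in range(max_frames)]
    return times


print(sample_frame_times(3.2))
# [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.2]

long_video = sample_frame_times(100.0)
print(len(long_video), long_video[-1])  # 128 100.0
```

A 3.2-second clip keeps its regular 2 fps grid plus the final frame, while a 100-second video exceeds the 128-frame budget and falls back to uniform sampling.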
Bi-directional Attention
In Molmo2, visual tokens are allowed to attend to one another inside the LLM, regardless of their order in the sequence [@Miao2024LongVU; @Wu2024DoubleLLaVA].
In a standard LLM, the causal mask restricts each token to attending only to itself and the tokens that precede it. Molmo2 relaxes this mask for visual tokens, which attend to each other bi-directionally, while text tokens remain causal.
Standard Causal Attention (rows are queries, columns are keys; ● = may attend, × = masked):
| | T₁ | T₂ | T₃ | T₄ |
|---|---|---|---|---|
| T₁ | ● | × | × | × |
| T₂ | ● | ● | × | × |
| T₃ | ● | ● | ● | × |
| T₄ | ● | ● | ● | ● |
Bi-directional Attention on Vision Tokens (rows are queries, columns are keys):
| | V₁ | V₂ | V₃ | T₁ | T₂ |
|---|---|---|---|---|---|
| V₁ | ● | ● | ● | × | × |
| V₂ | ● | ● | ● | × | × |
| V₃ | ● | ● | ● | × | × |
| T₁ | ● | ● | ● | ● | × |
| T₂ | ● | ● | ● | ● | ● |
This allows visual tokens from different frames or different images to exchange information, enabling learning of spatiotemporal relationships.
Ablation studies confirmed that bi-directional attention on visual tokens improves performance.
It is particularly effective for tasks that require capturing relationships between multiple frames/images, such as video tracking and multi-image understanding.
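The mask in the second table can be generated from a per-token flag. This is a minimal sketch with a hypothetical helper name, `build_attention_mask`; it combines the usual causal rule with full attention between any pair of visual tokens.

```python
def build_attention_mask(is_visual):
    """mask[i][j] is True when token i (query) may attend to token j (key).

    is_visual: one bool per token, in sequence order. Attention is causal
    everywhere, plus unrestricted among visual tokens.
    """
    n = len(is_visual)
    return [
        [j <= i or (is_visual[i] and is_visual[j]) for j in range(n)]
        for i in range(n)
    ]


mask = build_attention_mask([True, True, True, False, False])  # V1 V2 V3 T1 T2
for row in mask:
    print(" ".join("●" if allowed else "×" for allowed in row))
# ● ● ● × ×
# ● ● ● × ×
# ● ● ● × ×
# ● ● ● ● ×
# ● ● ● ● ●
```

Text tokens keep the ordinary causal pattern, so generation is unaffected; only the visual block becomes fully connected.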
Input Format to the LLM
Visual tokens generated by the Vision-Language Connector are fed to the LLM in the following formats.
Video
<image_start> [Visual Tokens for Frame1] <timestamp>0.5s</timestamp>
<image_start> [Visual Tokens for Frame2] <timestamp>1.0s</timestamp>
...
[Subtitle text] <timestamp>0.5s-2.0s</timestamp>
- Timestamps are appended to each frame’s visual tokens
- If subtitles are available, they are added as timestamped text
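As a rough illustration of this layout, the snippet below assembles the text skeleton of a video prompt. The helper `format_video_prompt` is hypothetical, the bracketed placeholders stand in for the actual visual token embeddings, and the one-decimal timestamp formatting is an assumption taken from the example above.

```python
def format_video_prompt(frame_times, subtitles=()):
    """Lay out per-frame placeholders followed by timestamped subtitles.

    frame_times: timestamps in seconds, one per sampled frame.
    subtitles: (text, start, end) tuples, times in seconds.
    """
    lines = [
        f"<image_start> [Visual Tokens for Frame{i}] <timestamp>{t:.1f}s</timestamp>"
        for i, t in enumerate(frame_times, start=1)
    ]
    lines += [
        f"{text} <timestamp>{start:.1f}s-{end:.1f}s</timestamp>"
        for text, start, end in subtitles
    ]
    return "\n".join(lines)


prompt = format_video_prompt([0.5, 1.0], subtitles=[("Hello there", 0.5, 2.0)])
print(prompt)
# <image_start> [Visual Tokens for Frame1] <timestamp>0.5s</timestamp>
# <image_start> [Visual Tokens for Frame2] <timestamp>1.0s</timestamp>
# Hello there <timestamp>0.5s-2.0s</timestamp>
```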
Multi-Image
<image_start> [Visual Tokens for Image1] <image>1</image>
<image_start> [Visual Tokens for Image2] <image>2</image>
...
- An image index is appended to each image
Multi-Crop Images
<image_start> [Column Tokens] [Visual Tokens for Full Crop]
[Visual Tokens for Crop1] [Visual Tokens for Crop2] ...
- Column tokens convey aspect ratio
- Tokens from the full crop and partial crops are concatenated
Summary
The Molmo2 Vision-Language Connector has the following characteristics:
- Multi-layer features: Extracts features from multiple ViT layers (3rd-to-last, 9th-from-last)
- Adaptive pooling: 2x2 Attention Pooling for images, 3x3 for video frames
- Shared parameters: Unified MLP projection for images and video
- Multi-crop strategy: Uses up to 24 crops for high-resolution processing
- Efficient video processing: 2 fps sampling + last frame retention
- Bi-directional attention: Allows mutual interaction among visual tokens (improves performance)
- Column tokens: Conveys aspect ratio information for multi-crop images
This design enables Molmo2 to process images and video uniformly while balancing computational efficiency and representational power.
References
- Clark, C., et al. (2024). Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models. arXiv:2409.17146.
- Miao, X., et al. (2024). LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. arXiv:2410.17434.
- Wu, H., et al. (2024). DoubleLLaVA: Efficient Long Video Understanding with Grouped Frame Tokens. arXiv:2410.00907.