Vision-Language Connector

Overview

The Vision-Language Connector is the module that transforms visual features extracted by the Vision Transformer (ViT) into a format that the Large Language Model (LLM) can process. Molmo2 follows the standard vision-language model (VLM) architecture [@Clark2024Molmo] and adopts a design that processes images and video uniformly.

flowchart TD
    A[Visual Input<br/>Image or Video] --> B[ViT Encoding<br/>Patch-level Features]
    B --> C[Multi-layer Feature<br/>Extraction]
    C --> D{Input Type}
    D -->|Image| E[2x2 Attention<br/>Pooling]
    D -->|Video Frame| F[3x3 Attention<br/>Pooling]
    E --> G[Shared MLP<br/>Projection]
    F --> G
    G --> H[Visual Tokens<br/>for LLM]

    style C fill:#e6f0ff
    style E fill:#ffe6f0
    style F fill:#ffe6f0
    style G fill:#f0ffe6

Figure 1: Data flow of the Vision-Language Connector

Architecture Details

Multi-Layer Feature Usage

The Molmo2 Vision-Language Connector extracts features from multiple layers of the ViT, rather than a single layer.

Third-to-last layer: High-level semantic features
Ninth-from-last layer: Mid-level features

This design follows the prior work Molmo [@Clark2024Molmo], combining visual information at different levels of abstraction to achieve richer representations.

flowchart TD
    L0["Layer 0 (Input)"]
    L1["Layer 1"]
    Ldots1["..."]
    LN9["<b>Layer N-9</b> ← 9th-from-last"]
    Ldots2["..."]
    LN3["<b>Layer N-3</b> ← 3rd-to-last"]
    LN2["Layer N-2"]
    LN1["Layer N-1"]
    LN["Layer N (Output)"]
    C["Connector"]

    L0 --- L1 --- Ldots1 --- LN9 --- Ldots2 --- LN3 --- LN2 --- LN1 --- LN

    LN9 -. "Features used<br/>by Connector" .-> C
    LN3 -. "Features used<br/>by Connector" .-> C

    style LN9 fill:#ffe6f0,stroke:#c44a6e,stroke-width:2px
    style LN3 fill:#ffe6f0,stroke:#c44a6e,stroke-width:2px
    style C fill:#f0ffe6,stroke:#6ec44a,stroke-width:2px

Figure 2: Multi-layer feature extraction from ViT (layers used by the Connector)
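
The layer selection can be illustrated with a minimal Python sketch. It assumes the ViT exposes its per-layer outputs as a list with one entry per layer, so that negative indices count from the last layer. The function name `select_connector_features` and the concatenation order are illustrative assumptions, not taken from the Molmo2 code.

```python
def select_connector_features(hidden_states, layer_offsets=(-3, -9)):
    """Concatenate per-patch features from the chosen ViT layers.

    hidden_states: list over layers; each entry is a list of patch vectors
    (lists of floats). Negative offsets count from the last layer, so -3 is
    the third-to-last layer and -9 the ninth-from-last layer.
    The order of concatenation is an assumption made for this sketch.
    """
    chosen = [hidden_states[offset] for offset in layer_offsets]
    num_patches = len(chosen[0])
    # Concatenate along the feature dimension, patch by patch.
    return [
        [value for layer in chosen for value in layer[p]]
        for p in range(num_patches)
    ]


# Toy input: 12 layers, 3 patches, 2 features per patch; layer l holds the value l.
hidden_states = [[[float(layer)] * 2 for _ in range(3)] for layer in range(12)]
features = select_connector_features(hidden_states)
print(features[0])  # [9.0, 9.0, 3.0, 3.0]
```

Each output vector is twice as wide as a single layer's features, which is why the pooled result is projected afterwards.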

Attention Pooling

Attention Pooling reduces the number of patch-level features. Patches are grouped into windows; within each window, the mean of the patches serves as the query, and attention over that window's patches aggregates them into a single vector.
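
As a rough illustration of the mechanism, the sketch below pools one window of patch vectors with the mean patch as the query. It is deliberately simplified: it uses a single head and treats the raw patches as both keys and values, whereas a real implementation would typically add learned projections and multiple heads. The function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(window):
    """Pool a window of patch vectors (lists of floats) into one vector.

    Simplifying assumptions: single head, no learned query/key/value
    projections; the patches act as both keys and values.
    """
    dim = len(window[0])
    # Query = mean of the patches in the window.
    query = [sum(p[d] for p in window) / len(window) for d in range(dim)]
    # Scaled dot-product score between the query and each patch.
    scores = [
        sum(q * x for q, x in zip(query, patch)) / math.sqrt(dim)
        for patch in window
    ]
    # Numerically stable softmax over the window.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the patches.
    return [sum(w * patch[d] for w, patch in zip(weights, window)) for d in range(dim)]


# A 2x2 window (4 patches) collapses into a single 2-dimensional vector.
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
print(len(pooled))  # 2
```

Unlike plain average pooling, the weights depend on how similar each patch is to the window mean, so patches are not forced to contribute equally.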

Images: 2x2 Pooling

Input Patches (4x4 example):

Table 1: 4x4 input patch layout

	Col 1	Col 2	Col 3	Col 4
Row 1	P₁	P₂	P₃	P₄
Row 2	P₅	P₆	P₇	P₈
Row 3	P₉	P₁₀	P₁₁	P₁₂
Row 4	P₁₃	P₁₄	P₁₅	P₁₆

After 2x2 Attention Pooling:

Table 2: Token layout after 2x2 Attention Pooling (16 → 4 tokens, 1/4 reduction)

	Col 1	Col 2
Row 1	T₁ (P₁, P₂, P₅, P₆)	T₂ (P₃, P₄, P₇, P₈)
Row 2	T₃ (P₉, P₁₀, P₁₃, P₁₄)	T₄ (P₁₁, P₁₂, P₁₅, P₁₆)

Video Frames: 3x3 Pooling

Since videos have many frames, a 3x3 window is used to further reduce the token count.

Input Patches (9x9 example):

Table 3: 9x9 input patch layout

	C1	C2	C3	C4	C5	C6	C7	C8	C9
R1	P₁	P₂	P₃	P₄	P₅	P₆	P₇	P₈	P₉
R2	…	…	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…	…	…

After 3x3 Attention Pooling:

Table 4: Token layout after 3x3 Attention Pooling (81 → 9 tokens, 1/9 reduction)

	Col 1	Col 2	Col 3
Row 1	T₁ (9 patches)	T₂ (9 patches)	T₃ (9 patches)
…	…	…	…
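
The grouping of patches into pooled tokens for both window sizes can be reproduced with a short sketch. It assumes non-overlapping windows (stride equal to the window size) and row-major, 1-based patch numbering as in the tables above. The helper name `pool_windows` and the clipping of windows at the grid edge are assumptions for illustration.

```python
def pool_windows(rows, cols, window):
    """Return, for each pooled token, the 1-based indices of the patches it covers.

    Assumes non-overlapping windows; windows at the grid edge are clipped
    when the grid size is not a multiple of the window size.
    """
    tokens = []
    for top in range(0, rows, window):
        for left in range(0, cols, window):
            tokens.append([
                r * cols + c + 1
                for r in range(top, min(top + window, rows))
                for c in range(left, min(left + window, cols))
            ])
    return tokens


image_tokens = pool_windows(4, 4, 2)   # 2x2 pooling on a 4x4 grid
video_tokens = pool_windows(9, 9, 3)   # 3x3 pooling on a 9x9 grid
print(image_tokens[0])     # [1, 2, 5, 6]
print(len(image_tokens))   # 4
print(len(video_tokens))   # 9
```

For the same patch grid, a 3x3 window yields 4/9 as many tokens as a 2x2 window, which is the saving that motivates the larger window for video frames.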

Shared MLP Projection

Finally, the pooled features are projected by a Shared MLP. This MLP shares parameters between images and video frames, learning a unified visual representation.

flowchart LR
    A[ViT Layer N-9] --> C[Concat]
    B[ViT Layer N-3] --> C
    C --> D{Pooling Window}
    D -->|Image| E[2x2 Attention]
    D -->|Video| F[3x3 Attention]
    E --> G[Shared MLP]
    F --> G
    G --> H[Visual Tokens]

    style C fill:#e6f0ff
    style G fill:#f0ffe6

Figure 3: Architecture of the Vision-Language Connector

Cropping Strategy

Image Cropping

Molmo2 employs a multi-crop strategy.

One downscaled full crop + up to K overlapping tile crops
During training: K = 8
During inference: K = 24 (high-resolution processing)

An image that is too large to be covered by K tile crops is downscaled until it fits.

flowchart TD
    Orig["Original High-res Image"]
    Orig --> DS["Downscaled Full Crop"]
    Orig --> Tiles["K Overlapping Tile Crops<br/>C₁, C₂, C₃, ..., Cₖ"]

    style Orig fill:#e6f0ff,stroke:#4a86c8
    style DS fill:#ffe6f0,stroke:#c44a6e
    style Tiles fill:#f0ffe6,stroke:#6ec44a

Figure 4: Multi-crop strategy: generate a downscaled full crop + K overlapping tile crops from the original image
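
To make the crop budget concrete, the following sketch picks a tile grid of at most K crops and reports how much the image must be downscaled to fit. The selection heuristic (prefer the least downscaling, then the fewest tiles) is an assumption made for illustration and is not taken from the Molmo2 implementation; the sketch also ignores the overlap between tiles. `choose_tiling` and `crop_size` are hypothetical names, and the sizes in the example are made up.

```python
def choose_tiling(width, height, crop_size, max_crops):
    """Pick (rows, cols, scale) with rows * cols <= max_crops.

    scale is the factor applied to the image so that it fits inside the
    rows x cols grid of crop_size tiles (1.0 means no downscaling).
    Heuristic (assumed): least downscaling first, then fewest tiles.
    Tile overlap is ignored in this sketch.
    """
    best_key, best = None, None
    for rows in range(1, max_crops + 1):
        for cols in range(1, max_crops // rows + 1):
            # Never upscale: cap the scale at 1.0.
            scale = min(1.0, cols * crop_size / width, rows * crop_size / height)
            key = (scale, -(rows * cols))
            if best_key is None or key > best_key:
                best_key, best = key, (rows, cols, scale)
    return best


# Fits within the training budget (K = 8) without downscaling.
print(choose_tiling(400, 200, crop_size=100, max_crops=8))     # (2, 4, 1.0)
# Too large even for the inference budget (K = 24): the image is downscaled.
print(choose_tiling(1000, 1000, crop_size=100, max_crops=24))  # (4, 4, 0.4)
```

The downscaled full crop is produced separately and is not counted against K in this sketch.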

Column Tokens

For multi-crop images, column tokens are included in the input to the LLM. This conveys aspect ratio information of the image to the LLM.

Column tokens are not included for single-crop images (which are always square).

Video Cropping

For video, the following strategy is adopted to reduce computational cost:

Sampling rate: S = 2 fps (one frame every 0.5 seconds)
Each frame is processed as a single crop (downscaled as needed)
Maximum frame count: F = 128 (standard training) or F = 384 (long-context training)

flowchart LR
    T0["0s"] --- T1["0.5s"] --- T2["1.0s"] --- T3["1.5s"] --- T4["2.0s"] --- T5["2.5s ..."]
    T1 --> F1["Frame₁"]
    T2 --> F2["Frame₂"]
    T3 --> F3["Frame₃"]
    T4 --> F4["..."]
    T5 --> F5["Frame_n"]

    style F1 fill:#e6f0ff,stroke:#4a86c8
    style F2 fill:#e6f0ff,stroke:#4a86c8
    style F3 fill:#e6f0ff,stroke:#4a86c8
    style F5 fill:#ffe6f0,stroke:#c44a6e

Figure 5: Video frame sampling (2 fps)

If the video is longer than F/S seconds (64 seconds for F = 128 at S = 2 fps), F frames are sampled uniformly instead, and the last frame is always included.
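
The sampling rule can be sketched in a few lines of Python. With S = 2 fps, the threshold F/S is 64 seconds for F = 128 and 192 seconds for F = 384. The sketch assumes that the first sampled frame sits at 1/S seconds (matching the 0.5s timestamp used in the input-format example) and that `duration` is the timestamp of the final frame; both are assumptions, as is the function name `sample_frame_times`.

```python
def sample_frame_times(duration, fps=2.0, max_frames=128):
    """Return the timestamps (in seconds) of the frames to keep.

    Assumptions: sampling starts at 1 / fps seconds, and `duration` is the
    timestamp of the video's last frame.
    """
    if duration > max_frames / fps:
        # Long video: spread max_frames timestamps uniformly, ending on the last frame.
        step = duration / max_frames
        return [(i + 1) * step for i in range(max_frames - 1)] + [duration]
    # Short video: one frame every 1 / fps seconds.
    times = [(i + 1) / fps for i in range(int(duration * fps))]
    # Always keep the last frame, even if it falls between sampling points.
    if not times or times[-1] < duration:
        times.append(duration)
    return times


print(sample_frame_times(3.2))         # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.2]
print(len(sample_frame_times(600.0)))  # 128
```

In the long-video case the effective sampling rate drops below 2 fps, so the frame budget stays fixed regardless of video length.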

Special Handling of the Last Frame

The last frame of a video is always included. This is because many video players display the last frame after playback ends, making it potentially significant to the user.

Bi-directional Attention

In Molmo2, the LLM lets visual tokens attend to one another in both directions [@Miao2024LongVU; @Wu2024DoubleLLaVA].

In a standard LLM, the causal mask restricts each token to attending only to itself and the tokens preceding it. Molmo2 lifts this restriction for visual tokens, while text tokens remain causal.

Standard Causal Attention:

Table 5: Standard causal attention mask (rows: attending token, columns: attended token; ●: allowed, ×: masked)

	T₁	T₂	T₃	T₄
T₁	●	×	×	×
T₂	●	●	×	×
T₃	●	●	●	×
T₄	●	●	●	●

Bi-directional Attention on Vision Tokens:

Table 6: Bi-directional attention mask on vision tokens (V: Vision tokens, T: Text tokens)

	V₁	V₂	V₃	T₁	T₂
V₁	●	●	●	×	×
V₂	●	●	●	×	×
V₃	●	●	●	×	×
T₁	●	●	●	●	×
T₂	●	●	●	●	●

This allows visual tokens from different frames or different images to exchange information, enabling learning of spatiotemporal relationships.
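
A mask of this kind can be built from a per-token flag that marks visual tokens. The sketch below is a minimal illustration and assumes that every visual token may attend to every other visual token in the sequence, regardless of position; the function name `build_attention_mask` is hypothetical.

```python
def build_attention_mask(is_visual):
    """Return mask[i][j] == True when token i may attend to token j.

    is_visual: list of booleans, one per token (True for visual tokens).
    A token may attend to position j if j is not in the future (causal rule)
    or if both tokens are visual (bi-directional rule, as assumed here).
    """
    n = len(is_visual)
    return [
        [j <= i or (is_visual[i] and is_visual[j]) for j in range(n)]
        for i in range(n)
    ]


# Three visual tokens followed by two text tokens, as in Table 6.
mask = build_attention_mask([True, True, True, False, False])
for row in mask:
    print(" ".join("o" if allowed else "x" for allowed in row))
```

With all flags set to False the function reduces to the standard causal mask of Table 5.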

Effect of Bi-directional Attention

Ablation studies confirmed that bi-directional attention on visual tokens improves performance.

It is particularly effective for tasks that require capturing relationships between multiple frames/images, such as video tracking and multi-image understanding.

Input Format to the LLM

Visual tokens generated by the Vision-Language Connector are fed to the LLM in the following formats.

Video

<image_start> [Visual Tokens for Frame1] <timestamp>0.5s</timestamp>
<image_start> [Visual Tokens for Frame2] <timestamp>1.0s</timestamp>
...
[Subtitle text] <timestamp>0.5s-2.0s</timestamp>

Timestamps are appended to each frame’s visual tokens
If subtitles are available, they are added as timestamped text
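
The video layout above could be assembled along the following lines. The sketch uses the tag names from the example and a bracketed placeholder where the real visual tokens would go. Placing subtitles after all frames, the one-decimal timestamp formatting, and the function name `format_video_prompt` are assumptions for illustration.

```python
def format_video_prompt(frame_times, subtitles=()):
    """Build the text layout for a video.

    frame_times: timestamps in seconds, one per sampled frame.
    subtitles: iterable of (text, start_seconds, end_seconds) tuples.
    The bracketed text is a placeholder for the actual visual tokens.
    """
    lines = [
        f"<image_start> [Visual Tokens for Frame{index}] <timestamp>{time:.1f}s</timestamp>"
        for index, time in enumerate(frame_times, start=1)
    ]
    # Assumed placement: subtitles follow the frames, each with its time span.
    for text, start, end in subtitles:
        lines.append(f"{text} <timestamp>{start:.1f}s-{end:.1f}s</timestamp>")
    return "\n".join(lines)


print(format_video_prompt([0.5, 1.0], subtitles=[("[Subtitle text]", 0.5, 2.0)]))
```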

Multi-Image

<image_start> [Visual Tokens for Image1] <image>1</image>
<image_start> [Visual Tokens for Image2] <image>2</image>
...

An image index is appended to each image

Multi-Crop Images

<image_start> [Column Tokens] [Visual Tokens for Full Crop]
[Visual Tokens for Crop1] [Visual Tokens for Crop2] ...

Column tokens convey aspect ratio
Tokens from the full crop and partial crops are concatenated

Summary

The Molmo2 Vision-Language Connector has the following characteristics:

Multi-layer features: Extracts features from multiple ViT layers (3rd-to-last, 9th-from-last)
Adaptive pooling: 2x2 Attention Pooling for images, 3x3 for video frames
Shared parameters: Unified MLP projection for images and video
Multi-crop strategy: Uses up to 8 tile crops during training and up to 24 at inference for high-resolution processing
Efficient video processing: 2 fps sampling + last frame retention
Bi-directional attention: Allows mutual interaction among visual tokens (improves performance)
Column tokens: Conveys aspect ratio information for multi-crop images

This design enables Molmo2 to process images and video uniformly while balancing computational efficiency and representational power.

References

Clark, C., et al. (2024). Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models. arXiv:2409.17146.
Miao, X., et al. (2024). LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. arXiv:2410.17434.
Wu, H., et al. (2024). DoubleLLaVA: Efficient Long Video Understanding with Grouped Frame Tokens. arXiv:2410.00907.