Multi-Image Understanding

Multi-Image Understanding is the ability to process several images at once and reason about their relationships and differences. Traditional single-image processing treats each image independently, so it cannot capture these cross-image relationships.

Differences from Single-Image Processing

Single-Image Processing:

  • Performs question answering or caption generation for a single image
  • Cannot compare or understand relationships between images
  • Struggles with multi-page documents and before-and-after comparisons

Multi-Image Understanding:

  • Processes sets of 2-5 semantically related images
  • Understands commonalities and differences between images
  • Enables cross-image question answering and grounding

Molmo2-MultiImageQA Dataset

Molmo2-MultiImageQA is a question-answering dataset for semantically related image sets.

Dataset Scale:

  • 45,000 image sets (composed of 96,000 unique images)
  • 72,000 QA pairs
  • 2-5 images per set (average 2.73)
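
As a quick sanity check on the figures above, the following sketch multiplies the number of sets by the average set size and compares the result with the number of unique images. It assumes the reported numbers are rounded, so the outputs are only approximate; the variable names are illustrative.

```python
# Figures as reported above (assumed to be rounded).
num_sets = 45_000
avg_images_per_set = 2.73
unique_images = 96_000

# Total image slots across all sets (an image counted once per set it appears in).
image_slots = num_sets * avg_images_per_set

# Average number of sets each unique image appears in.
reuse_factor = image_slots / unique_images

print(f"image slots:  {image_slots:,.0f}")
print(f"reuse factor: {reuse_factor:.2f}")
```

Under these assumptions there are more image slots than unique images, which suggests that some images appear in more than one set.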

Collection Method: The dataset was constructed through human annotation using the following process:

  1. Generate captions for each image using a model trained on PixMoCap
  2. Group images based on sentence-level similarity of captions
  3. Annotators create questions for each set
  4. Improve answers through an iterative loop with Claude Sonnet 4.5
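
Step 2 above can be illustrated with a minimal sketch. It is not the authors' implementation: a bag-of-words vector stands in for a real sentence embedding, and the greedy grouping strategy, the `threshold` value, and all function names are assumptions made for illustration. Only the 2-5 set size comes from the text.

```python
import math
import re
from collections import Counter


def embed(caption: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", caption.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_images(captions, threshold=0.6, min_size=2, max_size=5):
    """Greedily group image ids whose captions are similar to a seed caption.

    Groups smaller than min_size are discarded; groups stop growing at max_size.
    """
    vectors = {img: embed(text) for img, text in captions.items()}
    unassigned = list(captions)
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        for other in list(unassigned):  # iterate over a copy while removing
            if len(group) == max_size:
                break
            if cosine(vectors[seed], vectors[other]) >= threshold:
                group.append(other)
                unassigned.remove(other)
        if len(group) >= min_size:
            groups.append(group)
    return groups


captions = {
    "img_a": "a waterfall in a green forest",
    "img_b": "a tall waterfall in a forest",
    "img_c": "a red car parked on a street",
    "img_d": "a red car on a city street",
    "img_e": "a bowl of noodle soup",
}
print(group_images(captions))
```

With these toy captions the two waterfall images and the two car images end up in separate sets, and the unrelated image is dropped because it cannot form a set of at least two.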

Because annotators write the questions over semantically related image sets, the resulting dataset reflects the kinds of multi-image queries that arise in practice.

Molmo2-MultiImagePoint Dataset

Molmo2-MultiImagePoint is a pointing and counting dataset spanning multiple images.

Dataset Scale:

  • Over 470,000 pointing and counting examples
  • 2-5 images per set (average 3.24)

Collection Method: The dataset was synthetically constructed using the following pipeline.

Data Collection Pipeline

flowchart TD
    S1["<b>Step 1: Soft Clustering of Images</b><br/>- Use images from PixMo-Points<br/>- Combine single-token &amp; sentence-level embedding<br/>- Generate semantically related sets (2-5 images)"]
    S2["<b>Step 2: Label Normalization</b><br/>- Lowercase, punctuation/whitespace normalization<br/>- Synonym consolidation"]
    S3["<b>Step 3: Canonical Label Generation</b><br/>- Use LLM to merge normalized labels<br/>- Create single canonical description<br/>- Define shared entity/concept across all images"]
    S4["<b>Step 4: Training-time Sampling</b><br/>- Sample from original annotations (not just canonical)<br/>- Preserve lexical diversity &amp; improve robustness"]

    S1 --> S2 --> S3 --> S4

    style S1 fill:#e6f0ff,stroke:#4a86c8
    style S2 fill:#e6f0ff,stroke:#4a86c8
    style S3 fill:#e6f0ff,stroke:#4a86c8
    style S4 fill:#e6f0ff,stroke:#4a86c8
Figure 1: Molmo2-MultiImagePoint Data Collection Pipeline
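
Step 2 of the pipeline (label normalization) could look roughly like the sketch below. The synonym table and the function names are hypothetical, since the source does not list the actual mapping or rules beyond lowercasing, punctuation and whitespace normalization, and synonym consolidation.

```python
import string

# Hypothetical synonym table; the real mapping is not given in the source.
SYNONYMS = {"automobile": "car", "cars": "car", "falls": "waterfall"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_label(label: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, and map synonyms."""
    tokens = label.lower().translate(_PUNCT_TO_SPACE).split()
    return " ".join(SYNONYMS.get(tok, tok) for tok in tokens)


def consolidate(labels):
    """Return the sorted set of distinct normalized labels."""
    return sorted({normalize_label(label) for label in labels})


raw = ["Red Car", "red   automobile!", " red car. ", "Waterfall", "falls"]
print(consolidate(raw))
```

The consolidated labels would then be passed to Step 3, where an LLM merges them into a single canonical description. That step is not sketched here.
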
Note: Role of Canonical Labels

A canonical label is a standardized description that unifies multiple human annotations within an image set. For example, different expressions such as “waterfall,” “taki” (Japanese), and “bakufu” (Chinese) are unified into a single canonical label: “waterfall.”

However, training does not always use the canonical label. The original annotations are also sampled probabilistically, which preserves lexical diversity and helps the model handle varied expressions.
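
The sampling idea can be sketched as follows. The function name, the `p_canonical` parameter, and its default value are assumptions for illustration; the source does not state the actual sampling probabilities.

```python
import random


def sample_label(canonical, originals, p_canonical=0.5, rng=None):
    """Pick the label text for one training example.

    With probability p_canonical (an assumed value) return the canonical
    label; otherwise return one of the original annotations at random.
    """
    rng = rng or random
    if not originals or rng.random() < p_canonical:
        return canonical
    return rng.choice(originals)


rng = random.Random(0)  # seeded only to make the demo repeatable
originals = ["waterfall", "falls", "cascading water"]
labels = [sample_label("waterfall", originals, rng=rng) for _ in range(5)]
print(labels)
```

Setting `p_canonical` to 1.0 always yields the canonical label, while 0.0 always draws from the original annotations.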

Molmo2-SynMultiImageQA Dataset

Molmo2-SynMultiImageQA is a synthetic multi-image dataset specialized for text-rich images.

Dataset Scale:

  • 188,000 synthetic multi-image QA examples

Collection Method: The dataset was built by extending CoSyn [172]. CoSyn is a framework that synthetically generates question-answering pairs for text-rich images such as charts, tables, and documents.

Target Image Types:

  • Charts
  • Tables
  • Documents

Text-rich images of this kind are directly relevant to practical tasks such as document understanding and cross-document comparison.

Tip: Practical Examples (Applications of Multi-Image Understanding)

Document Understanding:

  • Comparing clauses across multiple pages of a contract
  • Consistency checking between different sections of a report
  • Content comparison across multiple invoices

Multi-Image Comparison:

  • Comparing product photos from different angles to understand features
  • Change detection in before-and-after photos
  • Trend analysis across multiple charts and graphs

Grounding:

  • Cross-image pointing such as “Point to the waterfall in all images”
  • Counting such as “How many images contain a red car?”
  • Detecting common objects across the entire set
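
To make the grounding examples concrete, here is a minimal sketch of how a cross-image pointing result might be represented and queried. The class names, fields, and the (x, y) coordinate convention are hypothetical; the source does not specify the actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class ImagePoints:
    """Points for one image in a set; coordinates are assumed (x, y) pairs."""
    image_index: int
    points: list = field(default_factory=list)


@dataclass
class MultiImagePointExample:
    """One pointing query (e.g. "red car") answered across an image set."""
    label: str
    per_image: list

    def total_count(self) -> int:
        """Number of pointed-to instances across all images."""
        return sum(len(ip.points) for ip in self.per_image)

    def images_containing(self) -> int:
        """Number of images with at least one pointed-to instance."""
        return sum(1 for ip in self.per_image if ip.points)


example = MultiImagePointExample(
    label="red car",
    per_image=[
        ImagePoints(0, [(12.5, 40.0), (63.0, 41.5)]),
        ImagePoints(1, []),
        ImagePoints(2, [(30.0, 55.0)]),
    ],
)
print(example.total_count(), example.images_containing())
```

In this representation, "How many images contain a red car?" corresponds to `images_containing()`, while the total number of red cars across the set corresponds to `total_count()`.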

Dataset Statistics

Dataset                | Scale            | Image Set Size         | Collection Method           | Purpose
-----------------------|------------------|------------------------|-----------------------------|--------------------
Molmo2-MultiImageQA    | 45k sets, 72k QA | 2-5 images (avg. 2.73) | Human                       | General QA
Molmo2-MultiImagePoint | 470k examples    | 2-5 images (avg. 3.24) | Synthetic                   | Pointing & Counting
Molmo2-SynMultiImageQA | 188k examples    | -                      | Synthetic (CoSyn extension) | Text-rich image QA

Importance of Multi-Image Understanding

Multi-Image Understanding enables the following capabilities, which single-image processing cannot provide.

Information Integration: It integrates information from multiple sources (images) to provide comprehensive understanding.

Comparison and Contrast: It can clearly identify commonalities and differences between images.

Document Processing: It enables understanding across multi-page documents or multiple related documents.

Real-World Application: Scenarios involving multiple images arise frequently in practice, for example product images on e-commerce sites, time-series comparison of medical images, and multiple surveillance camera angles.

Molmo2 achieves state-of-the-art Multi-Image Understanding among open-source models by leveraging these three datasets.