Multi-Image Understanding

Multi-Image Understanding is the ability to process several images at once and reason about how they relate and differ. Traditional single-image processing treats each image independently, so it cannot capture these cross-image relationships.

Differences from Single-Image Processing

Single-Image Processing:

Performs question answering or caption generation for a single image
Cannot compare or understand relationships between images
Struggles with multi-page documents and before-and-after comparisons

Multi-Image Understanding:

Processes sets of 2-5 semantically related images
Understands commonalities and differences between images
Enables cross-image question answering and grounding

Molmo2-MultiImageQA Dataset

Molmo2-MultiImageQA is a question-answering dataset for semantically related image sets.

Dataset Scale:

45,000 image sets (composed of 96,000 unique images)
72,000 QA pairs
2-5 images per set (average 2.73)
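As a quick consistency check on these figures, the arithmetic below treats the rounded numbers as exact (an assumption). It suggests that the sets contain more image slots than there are unique images, so some images likely appear in more than one set.

```python
# Figures quoted above; they are rounded in the source, so results are approximate.
num_sets = 45_000
unique_images = 96_000
qa_pairs = 72_000
avg_images_per_set = 2.73

# Total image slots across all sets (an image counts once per set it appears in).
image_slots = num_sets * avg_images_per_set

# Average number of sets in which each unique image appears.
sets_per_image = image_slots / unique_images

# Average number of QA pairs written per image set.
qa_per_set = qa_pairs / num_sets

print(f"image slots: {image_slots:,.0f}")
print(f"sets per unique image: {sets_per_image:.2f}")
print(f"QA pairs per set: {qa_per_set:.2f}")
```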

Collection Method: The dataset was constructed through human annotation using the following process:

Generate captions for each image using a model trained on PixMoCap
Group images based on sentence-level similarity of captions
Annotators create questions for each set
Improve answers through an iterative loop with Claude Sonnet 4.5

This approach produced a high-quality dataset that supports real-world multi-image queries.
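The caption-based grouping step can be illustrated with a minimal sketch. The source does not specify the embedding model, similarity threshold, or grouping algorithm, so everything here is an assumption: a bag-of-words vector stands in for a real sentence embedding, and the names `embed`, `cosine`, `group_by_caption`, the stopword list, and the 0.5 threshold are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list so that filler words do not dominate similarity.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "with", "at"}


def embed(caption: str) -> Counter:
    """Toy stand-in for a sentence embedding: bag-of-words counts."""
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_caption(captions, threshold=0.5, min_size=2, max_size=5):
    """Greedily build sets of 2-5 images whose captions resemble a seed caption."""
    vectors = {image_id: embed(text) for image_id, text in captions.items()}
    unassigned = list(captions)
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        # Rank the remaining images by similarity to the seed caption.
        scored = sorted(
            ((cosine(vectors[seed], vectors[other]), other) for other in unassigned),
            reverse=True,
        )
        similar = [other for score, other in scored if score >= threshold]
        members = [seed] + similar[: max_size - 1]
        if len(members) >= min_size:
            groups.append(members)
            unassigned = [i for i in unassigned if i not in members]
        # A seed with no sufficiently similar neighbour is simply left out.
    return groups


captions = {
    "img_a": "A tall waterfall in a green forest",
    "img_b": "A waterfall in a forest with green moss",
    "img_c": "A red car parked on a city street",
    "img_d": "A red car on a street at night",
    "img_e": "A bowl of ramen on a wooden table",
}
groups = group_by_caption(captions)
print(groups)
```

In this toy run the two waterfall images and the two car images form sets, while the ramen image has no related partner and is dropped. A real pipeline would replace `embed` with a learned sentence encoder.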

Molmo2-MultiImagePoint Dataset

Molmo2-MultiImagePoint is a pointing and counting dataset spanning multiple images.

Dataset Scale:

Over 470,000 pointing and counting examples
2-5 images per set (average 3.24)

Collection Method: The dataset was synthetically constructed using the following pipeline.

Data Collection Pipeline

flowchart TD
    S1["<b>Step 1: Soft Clustering of Images</b><br/>- Use images from PixMo-Points<br/>- Combine single-token &amp; sentence-level embedding<br/>- Generate semantically related sets (2-5 images)"]
    S2["<b>Step 2: Label Normalization</b><br/>- Lowercase, punctuation/whitespace normalization<br/>- Synonym consolidation"]
    S3["<b>Step 3: Canonical Label Generation</b><br/>- Use LLM to merge normalized labels<br/>- Create single canonical description<br/>- Define shared entity/concept across all images"]
    S4["<b>Step 4: Training-time Sampling</b><br/>- Sample from original annotations (not just canonical)<br/>- Preserve lexical diversity &amp; improve robustness"]

    S1 --> S2 --> S3 --> S4

    style S1 fill:#e6f0ff,stroke:#4a86c8
    style S2 fill:#e6f0ff,stroke:#4a86c8
    style S3 fill:#e6f0ff,stroke:#4a86c8
    style S4 fill:#e6f0ff,stroke:#4a86c8

Figure 1: Molmo2-MultiImagePoint Data Collection Pipeline
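Step 2 of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the synonym table is invented for the example (the source does not list the actual mappings), and `pick_canonical` is only a majority-vote placeholder for Step 3, which the source says is performed by an LLM.

```python
import string
from collections import Counter

# Hypothetical synonym table; the real mappings are not given in the source.
SYNONYMS = {"automobile": "car", "cars": "car", "puppy": "dog"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_label(label: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, consolidate synonyms."""
    text = label.lower().translate(_PUNCT_TO_SPACE)
    tokens = [SYNONYMS.get(token, token) for token in text.split()]
    return " ".join(tokens)


def pick_canonical(labels: list) -> str:
    """Placeholder for Step 3: choose the most frequent normalized label.

    The pipeline described above uses an LLM to merge labels instead.
    """
    counts = Counter(normalize_label(label) for label in labels)
    return counts.most_common(1)[0][0]


raw_labels = ["Red car", "  red  Automobile! ", "a vehicle"]
normalized = [normalize_label(label) for label in raw_labels]
canonical = pick_canonical(raw_labels)
print(normalized)
print(canonical)
```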

Role of Canonical Labels

A canonical label is a standardized description that unifies multiple human annotations within an image set. For example, different expressions such as “waterfall,” “taki” (Japanese), and “bakufu” (Chinese) are unified into a single canonical label: “waterfall.”

Training does not always use the canonical label, however. The original annotations are also sampled with some probability, so the model learns to handle diverse expressions of the same concept.
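The sampling between canonical and original labels might look like the sketch below. The function name `sample_label` and the parameter `p_canonical` are hypothetical, and the 0.5 default is an arbitrary assumption; the source does not state the actual sampling probability.

```python
import random
from typing import List, Optional


def sample_label(
    canonical: str,
    originals: List[str],
    p_canonical: float = 0.5,  # assumed value; not given in the source
    rng: Optional[random.Random] = None,
) -> str:
    """Return the canonical label with probability p_canonical,
    otherwise a uniformly chosen original annotation."""
    rng = rng or random.Random()
    if not originals or rng.random() < p_canonical:
        return canonical
    return rng.choice(originals)


rng = random.Random(0)  # seeded only to make the demo repeatable
originals = ["waterfall", "taki", "bakufu"]
draws = [sample_label("waterfall", originals, rng=rng) for _ in range(1000)]
print({label: draws.count(label) for label in sorted(set(draws))})
```

Setting `p_canonical` to 1.0 recovers canonical-only training, while lower values expose the model to more of the annotators' original wording.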

Molmo2-SynMultiImageQA Dataset

Molmo2-SynMultiImageQA is a synthetic multi-image dataset specialized for text-rich images.

Dataset Scale:

188,000 synthetic multi-image QA examples

Collection Method: The dataset was built by extending CoSyn [172]. CoSyn is a framework that synthetically generates question-answering pairs for text-rich images such as charts, tables, and documents.

Target Image Types:

Charts
Tables
Documents

Text-rich images matter because they map directly onto practical tasks such as document understanding and cross-document comparison.

Practical Examples: Applications of Multi-Image Understanding

Document Understanding:

Comparing clauses across multiple pages of a contract
Consistency checking between different sections of a report
Content comparison across multiple invoices

Multi-Image Comparison:

Comparing product photos from different angles to understand features
Change detection in before-and-after photos
Trend analysis across multiple charts and graphs

Grounding:

Cross-image pointing such as “Point to the waterfall in all images”
Counting such as “How many images contain a red car?”
Detecting common objects across the entire set
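To make the grounding tasks concrete, the sketch below represents a cross-image pointing result and derives counts from it. The `ImagePoints` structure and its coordinate convention are hypothetical illustrations, not the actual output format of Molmo2.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ImagePoints:
    """Points predicted for one image in a set (hypothetical format)."""
    image_index: int
    points: List[Tuple[float, float]] = field(default_factory=list)  # (x, y)


def total_count(results: List[ImagePoints]) -> int:
    """Count objects across the whole set by counting points."""
    return sum(len(r.points) for r in results)


def images_containing(results: List[ImagePoints]) -> List[int]:
    """Indices of images in which at least one point was predicted."""
    return [r.image_index for r in results if r.points]


# Example response to a query such as "Point to the red car in all images".
results = [
    ImagePoints(0, [(12.5, 40.0)]),
    ImagePoints(1, []),
    ImagePoints(2, [(30.0, 55.0), (70.2, 48.9)]),
]
print(total_count(results))        # 3
print(images_containing(results))  # [0, 2]
```

With this representation, "How many images contain a red car?" reduces to `len(images_containing(results))`, and detecting an object common to the entire set means checking that every image has at least one point.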

Dataset Statistics

Dataset	Scale	Image Set Size	Collection Method	Purpose
Molmo2-MultiImageQA	45k sets, 72k QA pairs	2-5 images (avg. 2.73)	Human	General QA
Molmo2-MultiImagePoint	470k+ examples	2-5 images (avg. 3.24)	Synthetic	Pointing & Counting
Molmo2-SynMultiImageQA	188k examples	-	Synthetic (CoSyn extension)	Text-rich image QA

Importance of Multi-Image Understanding

Multi-Image Understanding enables the following capabilities, which single-image processing cannot provide.

Information Integration: It integrates information from multiple sources (images) to provide comprehensive understanding.

Comparison and Contrast: It can clearly identify commonalities and differences between images.

Document Processing: It enables understanding across multi-page documents or multiple related documents.

Real-World Application: Scenarios involving multiple images arise frequently in practice, for example product images on e-commerce sites, time-series comparison of medical images, and multiple surveillance camera angles.

Molmo2 achieves state-of-the-art Multi-Image Understanding among open-source models by leveraging these three datasets.