Midtraining

Olmo 3 training includes an additional stage called Midtraining after pretraining. This phase uses 100B high-quality tokens to strengthen critical capabilities such as mathematical reasoning, code generation, question answering, instruction following, and chain-of-thought reasoning.

Overview

Midtraining bridges pretraining and the subsequent SFT (Supervised Fine-Tuning) stage. It uses the Dolma 3 Dolmino Mix dataset and has the following characteristics:

  • 100B tokens of high-quality data
  • Data source selection targeted at specific capabilities
  • Decontamination to remove overlap with evaluation benchmarks
  • Effective data mix design through Microanneal and integration tests

Methodological Framework

Data curation for midtraining follows a two-part framework (Figure 11).

+-----------------------+     +-------------------------+
| Distributed           |     | Centralized             |
| Exploration           |     | Assessment              |
+-----------------------+     +-------------------------+
| - Individual data     | --> | - Combine candidate     |
|   source testing      |     |   datasets              |
| - Lightweight         |     | - Full 100B integration |
|   feedback loops      |     |   tests                 |
| - Microanneal (10B)   |     | - Post-SFT evaluation   |
+-----------------------+     +-------------------------+

Distributed Exploration

Each data source is evaluated through lightweight feedback loops to assess its effectiveness.

  • Microanneal: 5B tokens of target data + 5B web data
  • Baseline: 10B tokens of web-only data
  • Rapid evaluation to identify promising data sources
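The microanneal recipe above (5B candidate tokens + 5B web tokens, against a 10B web-only baseline) can be sketched as a simple token-budgeted interleaver. This is an illustrative sketch, not the report's actual pipeline: the function name, the greedy mixing rule, and whitespace tokenization are all assumptions made for the example.

```python
def mix_streams(candidate_docs, web_docs, candidate_fraction=0.5,
                token_budget=10_000_000_000):
    """Interleave documents from two streams so that roughly
    `candidate_fraction` of emitted tokens come from the candidate
    source, stopping once `token_budget` tokens are emitted.

    Greedy rule (an assumption for this sketch): at each step, pull
    from whichever stream is currently under-represented. Tokens are
    approximated by whitespace splitting. Stops early if a stream
    runs dry.
    """
    cand_it, web_it = iter(candidate_docs), iter(web_docs)
    emitted_cand = emitted_total = 0
    out = []
    while emitted_total < token_budget:
        take_cand = (emitted_cand / max(emitted_total, 1)) < candidate_fraction
        doc = next(cand_it if take_cand else web_it, None)
        if doc is None:
            break  # one stream exhausted; a real pipeline would resample
        out.append(doc)
        n = len(doc.split())
        emitted_total += n
        if take_cand:
            emitted_cand += n
    return out
```

The web-only baseline is the same call with `candidate_fraction=0.0`, so the two runs differ only in data composition, not token budget.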

Centralized Assessment

Selected candidate datasets are combined and subjected to integration testing.

  • Integration tests: Full annealing runs with 100B tokens
  • Evaluates interactions between data sources
  • Measures performance after SFT training as well

Midtraining Data Composition

The Dolmino Mix shown in Table 5 organizes data sources by the following target capabilities.

Capability  | Dataset              | Tokens | Description
------------|----------------------|--------|-------------------------------------
Math        | TinyMATH             | ~5B    | Math problem-solution pairs
            | CraneMath            | ~3B    | Mathematical reasoning
            | MegaMatt             | ~2B    | Advanced mathematics
            | Dolmino Math         | ~4B    | Curated math corpus
Code        | Stack-Edu (FIM)      | ~10B   | Educational code with Fill-In-Middle
            | CraneCode            | ~5B    | High-quality code snippets
QA          | Reddit-to-Flashcards | ~3B    | Question-answer extraction
            | Wiki-to-RCQA         | ~4B    | Reading comprehension QA
            | Nemotron             | ~2B    | Synthetic QA pairs
Instruction | Tulu3 SFT            | ~2B    | Instruction-following examples
            | Flan                 | ~3B    | Task-oriented instructions
Thinking    | Meta-reasoning       | ~2B    | Chain-of-thought reasoning
            | Program-verifiable   | ~1B    | Verifiable reasoning traces
            | OMR rewrite          | ~1B    | Reasoning rewriting
Web         | Dolma v1.7 Web       | ~50B   | General web content (baseline)
Note: Design Philosophy of Dolmino Mix

By combining multiple data sources for each capability, the design avoids dependence on any single dataset and improves generalization performance across capabilities.
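The per-capability balance of the mix can be read directly off the approximate token counts in Table 5. The following sketch just tallies those published "~" estimates (which sum to ~97B, consistent with the ~100B budget); the dictionary layout is this example's own, not a format from the report.

```python
# Approximate token counts from Table 5, in billions of tokens.
dolmino_mix = {
    "Math":        {"TinyMATH": 5, "CraneMath": 3, "MegaMatt": 2, "Dolmino Math": 4},
    "Code":        {"Stack-Edu (FIM)": 10, "CraneCode": 5},
    "QA":          {"Reddit-to-Flashcards": 3, "Wiki-to-RCQA": 4, "Nemotron": 2},
    "Instruction": {"Tulu3 SFT": 2, "Flan": 3},
    "Thinking":    {"Meta-reasoning": 2, "Program-verifiable": 1, "OMR rewrite": 1},
    "Web":         {"Dolma v1.7 Web": 50},
}

total = sum(sum(d.values()) for d in dolmino_mix.values())
print(total)  # 97 (the "~" estimates sum to just under the 100B budget)
for cap, d in dolmino_mix.items():
    subtotal = sum(d.values())
    print(f"{cap:12s} {subtotal:3d}B  {subtotal / total:5.1%}")
```

Roughly half the mix is web data, which keeps the annealed model anchored to the pretraining distribution while the other half targets specific capabilities.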

Capability Improvements

The following summarizes the improvement results for each target capability (Section 3.5.2).

Math (Mathematical Reasoning)

  • TinyMATH: Basic arithmetic and algebra problems
  • CraneMath: Complex equation processing and proofs
  • MegaMatt: University-level mathematics problems
  • Dolmino Math: A curated corpus integrating the above sources

Code (Code Generation)

  • Stack-Edu (FIM): Educational code in Fill-In-Middle format
  • CraneCode: High-quality code snippets across multiple languages
Tip: Fill-In-Middle (FIM)

Fill-In-Middle is a task that predicts the middle portion of code, closely simulating real-world code completion scenarios in IDEs.
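A common way to build FIM training examples is the prefix-suffix-middle (PSM) rearrangement, sketched below. This is a generic illustration of the technique, not the Stack-Edu pipeline: the sentinel strings are placeholders (real tokenizers reserve special tokens for them), and the random character-level split is an assumption.

```python
import random

# Placeholder sentinels; actual FIM training uses reserved special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_fim(code, rng):
    """Split `code` at two random points and rearrange it into
    prefix-suffix-middle order, so a left-to-right model learns to
    generate the middle span conditioned on both prefix and suffix."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

example = to_fim("def add(a, b):\n    return a + b\n", random.Random(0))
print(example)
```

Because the middle comes last, the standard next-token objective trains exactly the "complete the gap" behavior an IDE needs.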

QA (Question Answering)

  • Reddit-to-Flashcards: Extracts QA pairs from Reddit discussions
  • Wiki-to-RCQA: Generates reading comprehension questions from Wikipedia articles
  • Nemotron: Synthetic QA dataset

Instruction (Instruction Following)

  • Tulu3 SFT: Diverse instruction-following tasks
  • Flan: Task-oriented instruction data

Thinking (Chain-of-Thought Reasoning)

  • Meta-reasoning: Chain-of-Thought (CoT) style reasoning
  • Program-verifiable: Program-verifiable reasoning traces
  • OMR rewrite: Rewriting of reasoning processes

Decontamination

The decontamination process is detailed in Section 3.5.3.

A new decon package was developed to remove overlaps with evaluation datasets.

  • N-gram based matching
  • Contamination detection against evaluation benchmarks
  • Exclusion of contaminated samples from training data
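The steps above can be sketched as a minimal n-gram matcher. This is an illustrative toy, not the `decon` package itself: the whitespace tokenization, n-gram size of 8, and 10% overlap threshold are all assumptions made for the example.

```python
def ngrams(text, n=8):
    """All n-grams of a text under simple lowercase whitespace tokenization."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_eval_index(eval_texts, n=8):
    """Union of n-grams across all evaluation benchmark texts."""
    index = set()
    for t in eval_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(doc, eval_index, n=8, threshold=0.1):
    """Flag a training document if more than `threshold` of its n-grams
    also appear in some evaluation benchmark."""
    grams = ngrams(doc, n)
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in eval_index)
    return overlap / len(grams) > threshold

def decontaminate(train_docs, eval_texts, n=8, threshold=0.1):
    """Drop training documents that overlap evaluation data."""
    index = build_eval_index(eval_texts, n)
    return [d for d in train_docs if not is_contaminated(d, index, n, threshold)]
```

A production decontamination tool additionally handles normalization (punctuation, casing, unicode) and scales the index to billions of n-grams, but the core exclude-on-overlap logic is the same.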
Warning: Risk of Evaluation Data Contamination

High-quality datasets may contain samples that overlap with evaluation benchmarks. Decontamination ensures fair evaluation by removing these overlaps.

Key Findings

The main findings from Section 3.5.4 are as follows.

  • Effectiveness of Microanneal: Lightweight tests with 10B tokens can predict the results of full 100B runs
  • Complementarity of data sources: Combining multiple data sources yields greater benefits than any single dataset
  • Synergy with SFT: Capabilities strengthened during midtraining continue to improve after SFT
  • Necessity of decontamination: Removing contamination significantly improves evaluation accuracy