Midtraining

Olmo 3 training includes an additional stage called Midtraining after pretraining. This phase uses 100B high-quality tokens to strengthen critical capabilities such as mathematical reasoning, code generation, question answering, instruction following, and chain-of-thought reasoning.

Overview

Midtraining bridges pretraining and the subsequent SFT (Supervised Fine-Tuning) stage. It uses the Dolma 3 Dolmino Mix dataset and has the following characteristics:

  • 100B tokens of high-quality data
  • Data source selection targeted at specific capabilities
  • Decontamination to remove overlap with evaluation benchmarks
  • Effective data mix design through Microanneal and integration tests

Methodological Framework

Data curation for midtraining follows a two-part framework (Figure 11).

+-----------------------+     +-------------------------+
| Distributed           |     | Centralized             |
| Exploration           |     | Assessment              |
+-----------------------+     +-------------------------+
| - Individual data     | --> | - Combine candidate     |
|   source testing      |     |   datasets              |
| - Lightweight         |     | - Full 100B integration |
|   feedback loops      |     |   tests                 |
| - Microanneal (10B)   |     | - Post-SFT evaluation   |
+-----------------------+     +-------------------------+

Distributed Exploration

Each data source is evaluated through lightweight feedback loops to assess its effectiveness.

  • Microanneal: 5B tokens of target data + 5B web data
  • Baseline: 10B tokens of web-only data
  • Rapid evaluation to identify promising data sources
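The microanneal recipe above (5B candidate tokens + 5B web tokens, against a 10B web-only baseline) can be sketched as a simple token-budgeted interleaver. This is an illustrative sketch, not the report's actual pipeline: the function name, the greedy mixing rule, and whitespace tokenization are all assumptions made for the example.

```python
def mix_streams(candidate_docs, web_docs, candidate_fraction=0.5,
                token_budget=10_000_000_000):
    """Interleave documents from two streams so that roughly
    `candidate_fraction` of emitted tokens come from the candidate
    source, stopping once `token_budget` tokens are emitted.

    Greedy rule (an assumption for this sketch): at each step, pull
    from whichever stream is currently under-represented. Tokens are
    approximated by whitespace splitting. Stops early if a stream
    runs dry.
    """
    cand_it, web_it = iter(candidate_docs), iter(web_docs)
    emitted_cand = emitted_total = 0
    out = []
    while emitted_total < token_budget:
        take_cand = (emitted_cand / max(emitted_total, 1)) < candidate_fraction
        doc = next(cand_it if take_cand else web_it, None)
        if doc is None:
            break  # one stream exhausted; a real pipeline would resample
        out.append(doc)
        n = len(doc.split())
        emitted_total += n
        if take_cand:
            emitted_cand += n
    return out
```

The web-only baseline is the same call with `candidate_fraction=0.0`, so the two runs differ only in data composition, not token budget.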

Centralized Assessment

Selected candidate datasets are combined and subjected to integration testing.

  • Integration tests: Full annealing runs with 100B tokens
  • Evaluates interactions between data sources
  • Measures performance after SFT training as well

Midtraining Data Composition

The Dolmino Mix shown in Table 5 organizes data sources by the following target capabilities.

Capability  | Dataset              | Tokens | Description
------------|----------------------|--------|-------------------------------------
Math        | TinyMATH             | ~5B    | Math problem-solution pairs
            | CraneMath            | ~3B    | Mathematical reasoning
            | MegaMatt             | ~2B    | Advanced mathematics
            | Dolmino Math         | ~4B    | Curated math corpus
Code        | Stack-Edu (FIM)      | ~10B   | Educational code with Fill-In-Middle
            | CraneCode            | ~5B    | High-quality code snippets
QA          | Reddit-to-Flashcards | ~3B    | Question-answer extraction
            | Wiki-to-RCQA         | ~4B    | Reading comprehension QA
            | Nemotron             | ~2B    | Synthetic QA pairs
Instruction | Tulu3 SFT            | ~2B    | Instruction-following examples
            | Flan                 | ~3B    | Task-oriented instructions
Thinking    | Meta-reasoning       | ~2B    | Chain-of-thought reasoning
            | Program-verifiable   | ~1B    | Verifiable reasoning traces
            | OMR rewrite          | ~1B    | Reasoning rewriting
Web         | Dolma v1.7 Web       | ~50B   | General web content (baseline)
Note: Design Philosophy of Dolmino Mix

By combining multiple data sources for each capability, the design avoids dependence on any single dataset and improves generalization performance across capabilities.
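The per-capability balance of the mix can be read directly off the approximate token counts in Table 5. The following sketch just tallies those published "~" estimates (which sum to ~97B, consistent with the ~100B budget); the dictionary layout is this example's own, not a format from the report.

```python
# Approximate token counts from Table 5, in billions of tokens.
dolmino_mix = {
    "Math":        {"TinyMATH": 5, "CraneMath": 3, "MegaMatt": 2, "Dolmino Math": 4},
    "Code":        {"Stack-Edu (FIM)": 10, "CraneCode": 5},
    "QA":          {"Reddit-to-Flashcards": 3, "Wiki-to-RCQA": 4, "Nemotron": 2},
    "Instruction": {"Tulu3 SFT": 2, "Flan": 3},
    "Thinking":    {"Meta-reasoning": 2, "Program-verifiable": 1, "OMR rewrite": 1},
    "Web":         {"Dolma v1.7 Web": 50},
}

total = sum(sum(d.values()) for d in dolmino_mix.values())
print(total)  # 97 (the "~" estimates sum to just under the 100B budget)
for cap, d in dolmino_mix.items():
    subtotal = sum(d.values())
    print(f"{cap:12s} {subtotal:3d}B  {subtotal / total:5.1%}")
```

Roughly half the mix is web data, which keeps the annealed model anchored to the pretraining distribution while the other half targets specific capabilities.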

Capability Improvements

The following summarizes the improvement results for each target capability (Section 3.5.2).

Math (Mathematical Reasoning)

  • TinyMATH: Basic arithmetic and algebra problems
  • CraneMath: Complex equation processing and proofs
  • MegaMatt: University-level mathematics problems
  • Dolmino Math: A curated corpus integrating the above sources

Code (Code Generation)

  • Stack-Edu (FIM): Educational code in Fill-In-Middle format
  • CraneCode: High-quality code snippets across multiple languages
Tip: Fill-In-Middle (FIM)

Fill-In-Middle is a task that predicts the middle portion of code, closely simulating real-world code completion scenarios in IDEs.
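A common way to build FIM training examples is the prefix-suffix-middle (PSM) rearrangement, sketched below. This is a generic illustration of the technique, not the Stack-Edu pipeline: the sentinel strings are placeholders (real tokenizers reserve special tokens for them), and the random character-level split is an assumption.

```python
import random

# Placeholder sentinels; actual FIM training uses reserved special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_fim(code, rng):
    """Split `code` at two random points and rearrange it into
    prefix-suffix-middle order, so a left-to-right model learns to
    generate the middle span conditioned on both prefix and suffix."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

example = to_fim("def add(a, b):\n    return a + b\n", random.Random(0))
print(example)
```

Because the middle comes last, the standard next-token objective trains exactly the "complete the gap" behavior an IDE needs.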

QA (Question Answering)

  • Reddit-to-Flashcards: Extracts QA pairs from Reddit discussions
  • Wiki-to-RCQA: Generates reading comprehension questions from Wikipedia articles
  • Nemotron: Synthetic QA dataset

Instruction (Instruction Following)

  • Tulu3 SFT: Diverse instruction-following tasks
  • Flan: Task-oriented instruction data

Thinking (Chain-of-Thought Reasoning)

  • Meta-reasoning: Chain-of-Thought (CoT) style reasoning
  • Program-verifiable: Program-verifiable reasoning traces
  • OMR rewrite: Rewriting of reasoning processes

Decontamination

The decontamination process is detailed in Section 3.5.3.

A new decon package was developed to remove overlaps with evaluation datasets.

  • N-gram based matching
  • Contamination detection against evaluation benchmarks
  • Exclusion of contaminated samples from training data
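The steps above can be sketched as a minimal n-gram matcher. This is an illustrative toy, not the `decon` package itself: the whitespace tokenization, n-gram size of 8, and 10% overlap threshold are all assumptions made for the example.

```python
def ngrams(text, n=8):
    """All n-grams of a text under simple lowercase whitespace tokenization."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_eval_index(eval_texts, n=8):
    """Union of n-grams across all evaluation benchmark texts."""
    index = set()
    for t in eval_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(doc, eval_index, n=8, threshold=0.1):
    """Flag a training document if more than `threshold` of its n-grams
    also appear in some evaluation benchmark."""
    grams = ngrams(doc, n)
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in eval_index)
    return overlap / len(grams) > threshold

def decontaminate(train_docs, eval_texts, n=8, threshold=0.1):
    """Drop training documents that overlap evaluation data."""
    index = build_eval_index(eval_texts, n)
    return [d for d in train_docs if not is_contaminated(d, index, n, threshold)]
```

A production decontamination tool additionally handles normalization (punctuation, casing, unicode) and scales the index to billions of n-grams, but the core exclude-on-overlap logic is the same.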
Warning: Risk of Evaluation Data Contamination

High-quality datasets may contain samples that overlap with evaluation benchmarks. Decontamination ensures fair evaluation by removing these overlaps.

Key Findings

The main findings from Section 3.5.4 are as follows.

  • Effectiveness of Microanneal: Lightweight tests with 10B tokens can predict the results of full 100B runs
  • Complementarity of data sources: Combining multiple data sources yields greater benefits than any single dataset
  • Synergy with SFT: Capabilities strengthened during midtraining continue to improve after SFT
  • Necessity of decontamination: Removing contamination significantly improves evaluation accuracy