```mermaid
flowchart LR
A["WebOrganizer<br/>(24 topics)"] --> B["Science, Math,<br/>and Technology"]
A --> C["Software<br/>Development"]
A --> D["Arts and<br/>Entertainment"]
A --> E["... (21 more topics)"]
B --> B1["Quality tiers (1-20)"]
C --> C1["Quality tiers (1-20)"]
D --> D1["Quality tiers (1-20)"]
E --> E1["Quality tiers (1-20)"]
```
Data Mixing
Overview
Data Mixing is a technique for combining multiple data sources at optimal ratios to maximize model performance. In Dolma 3, two innovative methods—Token-constrained Mixing and Quality-aware Upsampling—were introduced to compose a 6T-token training mix from a 9T-token data pool. These methods achieve an optimal data allocation under a fixed token budget.
Purpose of Data Mixing
Token Budget Constraints
Model training is subject to a token budget constraint:
- Computational cost: The number of tokens available for training is limited by computational resources
- Need for optimal allocation: Within a limited budget, the proportion of each data source must be decided
- Balancing diversity and quality: High-quality data must be prioritized while maintaining data diversity
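The constraint above can be made concrete with a small sketch: given a pool of sources and target proportions, spend a fixed budget by giving each source its weighted share, capping at what the pool actually holds. All pool sizes, weights, and the budget below are illustrative stand-ins, not Dolma 3's actual values.

```python
# Hypothetical sketch: splitting a fixed token budget across data sources.
# Pool sizes (in trillions of tokens), weights, and the budget are illustrative.

def allocate(budget, pool, weights):
    """Give each source its weighted share of the budget, capping at the
    pool size and redistributing leftover budget among uncapped sources."""
    alloc = {s: 0.0 for s in weights}
    active = set(weights)
    remaining = budget
    while active and remaining > 1e-12:
        total_w = sum(weights[s] for s in active)
        # Sources whose proportional share exceeds what remains in their pool.
        capped = {s for s in active
                  if remaining * weights[s] / total_w >= pool[s] - alloc[s]}
        if not capped:
            for s in active:
                alloc[s] += remaining * weights[s] / total_w
            remaining = 0.0
        for s in capped:
            remaining -= pool[s] - alloc[s]
            alloc[s] = pool[s]
            active.discard(s)
    return alloc

pool = {"web": 7.0, "code": 1.0, "math": 1.0}     # available tokens (T)
weights = {"web": 0.7, "code": 0.2, "math": 0.1}  # target proportions
alloc = allocate(6.0, pool, weights)              # spend exactly a 6T budget
```

Here the code source is exhausted (its 1.2T share exceeds its 1T pool), so its overflow is redistributed to web and math in proportion to their weights.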
Determining the Optimal Mix Ratio
When composing a 6T-token training mix from a 9T-token data pool, the following factors are considered:
- Data source characteristics: Features unique to each source—web text, academic PDFs, code, math, etc.
- Topic balance: Optimal allocation across topics such as STEM, software development, and general knowledge
- Quality considerations: Preferential selection of high-quality documents
Token-constrained Mixing
Token-constrained Mixing is a method for determining the optimal data mix under a token budget constraint.
Swarm-based Methods
Many small proxy models are trained, and the optimal mix is estimated from their results:
Procedure:
- Swarm construction: Train many small proxy models with different mixing ratios
- Per-task regression: Fit a regression from mixing weights to performance on each task, using the proxy runs as data points
- Mix optimization: Find the mixing ratios that minimize average task BPB (bits-per-byte)
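The procedure above can be sketched in miniature. Here `proxy_bpb` is a deterministic stand-in: in the real pipeline each point is a small proxy model trained at that mix, and a per-task regressor interpolates between the sampled mixes rather than a direct lookup.

```python
import itertools

# Hypothetical swarm-style search over (web, code, math) mixing weights.
# proxy_bpb is a fake, deterministic stand-in for training a proxy model.

def proxy_bpb(mix):
    """Fake per-task BPB (lower is better) for a given mix."""
    web, code, math_ = mix
    return {"qa":   0.9 - 0.2 * web,
            "code": 1.1 - 0.3 * code,
            "gsm":  1.0 - 0.25 * math_}

# Swarm: evaluate a grid of candidate mixes on the simplex (weights sum to 1).
grid = [(w / 10, c / 10, 1 - w / 10 - c / 10)
        for w, c in itertools.product(range(11), repeat=2)
        if w + c <= 10]
results = {mix: proxy_bpb(mix) for mix in grid}

# Mix optimization: pick the candidate minimizing average task BPB.
best = min(results, key=lambda m: sum(results[m].values()) / 3)
```

Because each grid point is independent, the proxy runs parallelize trivially, which is the practical appeal of the swarm approach.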
Advantages:
- Computational efficiency: Experiments with small-scale models allow estimating the optimal mix before training a large-scale model
- Parallelism: Multiple proxy models can be trained in parallel
- Iterative refinement: The mix can be improved incrementally based on results
In Dolma 3, many 1B-parameter proxy models were trained to evaluate performance under different mixing ratios. These proxies were trained at 5x Chinchilla (five times the Chinchilla-optimal token count) so that the effect of each data mix could be measured accurately.
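As a back-of-envelope check on that budget: the Chinchilla scaling result puts the compute-optimal token count at roughly 20 tokens per parameter, so 5x Chinchilla for a 1B-parameter proxy works out to about 100B tokens per run.

```python
# Back-of-envelope: Chinchilla-optimal training is ~20 tokens per parameter.
params = 1e9                       # 1B-parameter proxy model
chinchilla_tokens = 20 * params    # ~20B tokens at 1x Chinchilla
proxy_tokens = 5 * chinchilla_tokens
print(f"{proxy_tokens / 1e9:.0f}B tokens per proxy run")  # → 100B tokens per proxy run
```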
Conditional Mixing
A conditional mixing procedure is adopted to accommodate continuous improvements to data sources:
Features:
- Flexibility: When data sources are updated, the entire mix does not need to be recomputed
- Modularity: Individual data sources can be improved independently
- Scalability: New data sources can be added easily
Adapting to the development cycle:
- Continuous improvement of data sources
- Incremental introduction of new data sources
- Dynamic adjustment of mix ratios
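One way to picture conditional mixing is as a factorized weight table: top-level source weights and per-source internal weights are stored separately, so improving one source only touches its own conditional distribution. The sources, subsets, and numbers below are hypothetical, not Dolma 3's actual structure.

```python
# Hypothetical sketch of conditional (factorized) mixing weights.
source_weights = {"web": 0.7, "code": 0.2, "math": 0.1}
within_source = {
    "web":  {"high_quality": 0.6, "mid_quality": 0.4},
    "code": {"python": 0.5, "other": 0.5},
    "math": {"proofs": 1.0},
}

def marginal(source, subset):
    """Overall sampling weight = P(source) * P(subset | source)."""
    return source_weights[source] * within_source[source][subset]

# Updating one source (e.g. a new release of the code data) leaves the
# web and math distributions, and the top-level weights, untouched.
within_source["code"] = {"python": 0.6, "typescript": 0.2, "other": 0.2}
```

Because each conditional distribution sums to 1 on its own, the overall mix remains a valid distribution after any per-source update, which is what makes the scheme modular.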
Quality-aware Upsampling
Quality-aware Upsampling is a method that selectively reintroduces high-quality documents into a deduplicated clean dataset.
Selective Introduction of Duplicates
High-quality documents are selectively restored from data removed during deduplication:
Approach:
- Deduplication as the foundation: First, build a clean dataset by removing all duplicates
- Quality assessment: Compute a quality score for each document
- Selective upsampling: Selectively repeat high-quality documents
Effects:
- Improved quality: Increasing the proportion of high-quality data improves model performance
- Efficient repetition: Repetition is concentrated on high-quality data while minimizing overall repetition
- Token efficiency: The limited token budget is preferentially allocated to high-quality data
Some of the documents removed during deduplication are high quality. Selectively restoring them avoids the quality loss that deduplication would otherwise cause, while raising the overall quality of the dataset.
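The two-step approach can be sketched as dedup-then-repeat. The quality scores, threshold, and repeat count below are illustrative placeholders, not the values used in Dolma 3.

```python
# Hypothetical sketch of quality-aware upsampling: deduplicate first,
# then repeat documents whose quality score clears a threshold.

def dedup(docs):
    """Keep only the first occurrence of each text."""
    seen, kept = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            kept.append(d)
    return kept

def upsample(docs, threshold=0.8, repeats=3):
    """Repeat high-quality documents; keep the rest at one copy."""
    out = []
    for d in docs:
        out.extend([d] * (repeats if d["quality"] >= threshold else 1))
    return out

docs = [
    {"text": "a", "quality": 0.9},
    {"text": "a", "quality": 0.9},   # duplicate, removed by dedup
    {"text": "b", "quality": 0.3},
]
mix = upsample(dedup(docs))  # "a" appears 3x, "b" once
```

The point of ordering the steps this way is that repetition is a deliberate, quality-driven choice rather than an accident of the crawl.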
Classification by Topic and Quality
In Dolma 3, web text is classified along both topic and quality axes to achieve fine-grained mixing.
24-Topic Classification with WebOrganizer
WebOrganizer is a tool that classifies web text into 24 major topics:
Major topics (examples):
- Science, Math, and Technology
- Software Development
- Arts and Entertainment
- Business and Finance
- Health and Medicine
- Education
- News and Current Events
- 17 additional topics
Benefits of classification:
- Per-topic weighting: Assign optimal weights to each topic
- STEM reinforcement: Preferentially allocate Science, Math, and Technology topics
- Balanced mix: Adjust to avoid overrepresentation of any single topic
fastText Quality Classifier
Within each topic, documents are further classified by quality score:
Quality classification:
- 20 quality tiers: Each topic is divided into 20 quality tiers
- fastText-based classifier: Fast and accurate quality estimation
- Objective quality metric: Consistent quality assessment across documents
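A simple way to realize 20 equal-count tiers is rank-based binning of the classifier's scores. The sketch below uses synthetic scores; in practice they would come from the fastText quality classifier.

```python
# Hypothetical sketch: bin classifier scores into 20 equal-count quality tiers.

def to_tiers(scores, n_tiers=20):
    """Map each score to a tier 1..n_tiers by rank (n_tiers = best)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    tiers = [0] * len(scores)
    for rank, i in enumerate(order):
        tiers[i] = rank * n_tiers // len(scores) + 1
    return tiers

scores = [i / 100 for i in range(100)]  # 100 documents, increasing scores
tiers = to_tiers(scores)                # 5 documents per tier
```

Equal-count tiers make per-tier weights comparable across topics, since every tier holds the same fraction of its topic's documents.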
480 Subsets
24 topics × 20 quality tiers = 480 subsets:
Fine-grained mixing:
- Per-subset weights: Individual weights are assigned to each subset
- Quality and topic alignment: High-quality data in important topics is prioritized
- Flexible tuning: Data allocation optimization at fine granularity
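The 480-way grid can be represented as a weight table keyed by (topic, tier). The weighting rule below (favor higher tiers, boost a couple of topics) is purely illustrative; Dolma 3's actual per-subset weights come from the mixing optimization.

```python
import itertools

# Hypothetical sketch of the 24-topic x 20-tier subset grid with weights.
topics = [f"topic_{i}" for i in range(24)]   # stand-ins for WebOrganizer topics
tiers = range(1, 21)                         # quality tiers 1 (low) .. 20 (high)

boosted = {"topic_0", "topic_1"}             # e.g. STEM, Software Development

raw = {
    (t, q): q * (2.0 if t in boosted else 1.0)  # higher tier -> higher weight
    for t, q in itertools.product(topics, tiers)
}
total = sum(raw.values())
weights = {k: v / total for k, v in raw.items()}  # normalized sampling weights
```

Sampling training documents proportionally to `weights[(topic, tier)]` then gives fine-grained control over both the topic balance and the quality profile of the mix.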
Mixing Strategy Results
Token-constrained Mixing and Quality-aware Upsampling determined the optimal ratios of data sources.
Per-Topic Weights (Figure 9a)
In the topic distribution of web text, the following trends are observed:
Upweighted topics:
- Science, Math, and Technology: STEM domains are significantly upweighted
- Software Development: Programming and software development are reinforced
- Education: Educational content is emphasized
Downweighted topics:
- Entertainment-related topics
- General news and social media content
Results:
- Training a 1B-parameter model at 5x Chinchilla achieved an average improvement of 0.056 BPB
- Performance degradation was observed on only 13 out of 54 tasks, with a maximum degradation of 0.035 BPB
Comparison with DCLM Baseline (Figure 9b)
Compared to the DCLM (DataComp for Language Models) Baseline, the following improvements were confirmed:
Improvements:
- STEM tasks: Substantial performance gains on science, math, and technology tasks
- Coding tasks: Improved programming ability
- General knowledge: Performance improvements on a wide range of knowledge tasks
Trade-offs:
- Slight performance degradation on some tasks
- Overall, performance gains on important tasks are prioritized
Optimization of data mixing has a significant impact on model performance. By prioritizing STEM domains, performance on scientific and technical tasks improves, forming a core strength of Olmo 3.
Programming Language Distribution in Stack-Edu
An optimal mix of programming languages was also determined for code data:
Upweighted languages:
- Python: Highest weight (importance in machine learning and data science)
- JavaScript/TypeScript: Primary languages for web development
- C++/Rust: Systems programming languages
Downweighted languages:
- Java: Relatively lower weight (high proportion of verbose code)
- Markdown: Documentation files are given limited weight
Results:
- Improvements achieved on nearly all coding benchmarks
- Particularly notable improvements on Python-centric tasks
Summary
Data Mixing is a critical process that determines the quality of Dolma 3. Two innovative methods—Token-constrained Mixing and Quality-aware Upsampling—achieve an optimal data allocation under a fixed token budget.
Key features:
- Token-constrained Mixing: Optimization via swarm-based methods
- Quality-aware Upsampling: Selective reintroduction of high-quality data
- 480 subsets: Fine-grained classification by topic and quality
- Conditional Mixing: Accommodates continuous improvement of data sources
- Demonstrated improvement: Average improvement of 0.056 BPB compared to the DCLM Baseline
These methods make Dolma 3 the foundation supporting the high performance of the Olmo 3 Base model.