Delta Learning

Delta Learning is a novel approach to preference tuning. This method leverages the “delta” between an SFT (Supervised Fine-Tuning) model and a Base model to generate high-quality contrastive data, maximizing the effectiveness of DPO (Direct Preference Optimization).

Core Principle

The central idea of Delta Learning is to explicitly capture the capability gap between models.

+------------------------------------------------------------------+
|                      Delta Learning Concept                      |
+------------------------------------------------------------------+
|                                                                  |
|  Base Model    -->  Limited reasoning capability                 |
|  SFT Model     -->  Enhanced reasoning capability                |
|  Delta         -->  The "learned" reasoning ability              |
|                                                                  |
|  Goal: Amplify the delta through preference optimization         |
+------------------------------------------------------------------+

The Delta Between Models

Compared with the Base model, the SFT model acquires the following capabilities:

  • More structured reasoning processes
  • Step-by-step problem-solving approaches
  • Application of task-specific knowledge

Delta Learning harnesses these “acquired capabilities” to generate preferred responses.

Application in Dolci Think DPO

Dolci Think uses Delta Learning to improve reasoning capability (Section 4.3).

Synthetic Data Generation

+------------------------------------------------------------------+
|                   Dolci Think Data Generation                    |
+------------------------------------------------------------------+
|                                                                  |
|  Step 1: Sample question from training set                       |
|  Step 2: Generate response using SFT model (Preferred)           |
|  Step 3: Generate response using Base model (Dispreferred)       |
|  Step 4: Apply quality filtering                                 |
|                                                                  |
+------------------------------------------------------------------+
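Assuming generation is wrapped behind simple callables (the real pipeline would call the Dolci Think SFT and OLMo2 7B Base checkpoints), the four steps above can be sketched in Python. All names here are illustrative, not from the source:

```python
def build_preference_pairs(questions, sft_generate, base_generate, keep_pair):
    """Delta Learning pair construction.

    For each sampled question, the SFT model's answer becomes the
    preferred ("chosen") response and the Base model's answer the
    dispreferred ("rejected") one; each pair then passes a quality filter.
    """
    pairs = []
    for question in questions:                     # Step 1: sample questions
        chosen = sft_generate(question)            # Step 2: SFT model -> preferred
        rejected = base_generate(question)         # Step 3: Base model -> dispreferred
        if keep_pair(question, chosen, rejected):  # Step 4: quality filtering
            pairs.append(
                {"prompt": question, "chosen": chosen, "rejected": rejected}
            )
    return pairs


# Toy stand-ins for the real models, used only to exercise the pipeline.
sft_stub = lambda q: f"Step 1: restate the problem. Step 2: solve. Answer: {q.upper()}"
base_stub = lambda q: f"Answer: {q}"
keep_stub = lambda q, chosen, rejected: "Step" in chosen and "Step" not in rejected

pairs = build_preference_pairs(["alpha", "beta"], sft_stub, base_stub, keep_stub)
```

Because generation and filtering are injected as callables, the same skeleton works whether the responses come from toy stubs or from real model inference.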

Creating Preferred vs Dispreferred Responses

Preferred responses:

  • Generated by the Dolci Think SFT model
  • Include step-by-step reasoning processes
  • Arrive at the correct final answer

Dispreferred responses:

  • Generated by the OLMo2 7B Base model
  • Lack sufficient reasoning depth
  • Reach incorrect conclusions or produce incomplete reasoning

Quality Filtering

The generated pairs are filtered according to the following criteria:

  • The preferred response contains the correct answer
  • The dispreferred response is incorrect or incomplete
  • A clear quality gap exists between the two responses

This process yields approximately 1M high-quality preference pairs.
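A minimal version of this filter, assuming each question comes with a gold answer and a task-specific `extract_answer` helper (both of which are assumptions for illustration, not details from the published pipeline), could look like:

```python
def passes_quality_filter(gold_answer, chosen, rejected, extract_answer):
    """Keep a pair only when there is a clear quality gap:
    the preferred response must be correct, and the dispreferred
    response incorrect or incomplete."""
    chosen_correct = extract_answer(chosen) == gold_answer
    rejected_correct = extract_answer(rejected) == gold_answer
    return chosen_correct and not rejected_correct


# Illustrative extractor: take the text after the last "Answer:" marker,
# returning None when the response never states a final answer.
def extract_answer(text):
    _, sep, tail = text.rpartition("Answer:")
    return tail.strip() if sep else None


kept = passes_quality_filter(
    "42",
    "Step 1: decompose. Step 2: compute. Answer: 42",  # preferred: correct
    "Answer: 41",                                       # dispreferred: incorrect
    extract_answer,
)
```

A pair where both responses are correct (or both wrong) is discarded, since it carries no clear contrastive signal.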

Application in Dolci Instruct DPO

Dolci Instruct uses Delta Learning for multi-turn dialogue optimization (Section 5.3).

Multi-turn Preference Data

+------------------------------------------------------------------+
|                  Dolci Instruct Data Generation                  |
+------------------------------------------------------------------+
|                                                                  |
|  Source: Approximately 500K multi-turn prompts                   |
|                                                                  |
|  Preferred:                                                      |
|    - Generated by Dolci Instruct SFT                             |
|    - Concise, well-structured responses                          |
|                                                                  |
|  Dispreferred:                                                   |
|    - Generated by OLMo2 7B Base                                  |
|    - Verbose or poorly structured responses                      |
|                                                                  |
+------------------------------------------------------------------+
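One such record, shown in the chosen/rejected layout that common DPO trainers (e.g. Hugging Face TRL) consume; the exact field names and chat format are an assumption here, not taken from the source:

```python
# A single multi-turn preference record: the shared prompt carries the
# dialogue history, and only the final assistant turn differs.
record = {
    "prompt": [
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Use list.reverse() in place, or slicing."},
        {"role": "user", "content": "Which option returns a new list?"},
    ],
    # Preferred: concise, well-structured (would come from the Dolci Instruct SFT model)
    "chosen": [
        {"role": "assistant", "content": "Slicing: new = old[::-1] returns a new list."}
    ],
    # Dispreferred: verbose, poorly structured (would come from the OLMo2 7B Base model)
    "rejected": [
        {"role": "assistant", "content": (
            "Well, there are several ways one could think about this, and "
            "depending on the situation you might want to consider various "
            "approaches before deciding which is most appropriate here."
        )}
    ],
}
```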

Response Length Optimization

Delta Learning enables the following improvements:

  • Maintaining conciseness: Eliminating unnecessary verbosity
  • Increasing information density: Conveying important information efficiently
  • Improving structure: Producing responses with logical flow

Implementation Details

Preference pairs are generated from approximately 500K multi-turn prompts to improve response quality.

Effects and Benefits

Preference tuning with Delta Learning provides several advantages.

Performance Beyond SFT

The additional optimization through DPO achieves performance levels that SFT alone cannot reach.

+------------------------------------------------------------------+
|                     Performance Progression                      |
+------------------------------------------------------------------+
|                                                                  |
|  Base Model  -->  SFT Model  -->  DPO Model (with Delta)         |
|                                                                  |
|  Limited     -->  Enhanced   -->  Optimized reasoning            |
|  reasoning        reasoning       and preference alignment       |
|                                                                  |
+------------------------------------------------------------------+
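The DPO objective that drives the last step of this progression can be written down directly. Here is a single-pair version in plain Python, assuming the inputs are summed token log-probabilities from the policy being trained and from the frozen SFT reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:

        -log sigmoid(beta * [(pi_chosen - ref_chosen)
                             - (pi_rejected - ref_rejected)])

    pi_* are log-probabilities under the policy, ref_* under the
    frozen reference (SFT) model; beta scales the implicit reward.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# At initialization the policy equals the reference, so the margin is
# zero and the loss is exactly log 2.
start = dpo_loss(-5.0, -9.0, -5.0, -9.0)

# As the policy raises the chosen response's log-probability relative
# to the reference, the loss falls below log 2.
improved = dpo_loss(-4.0, -9.0, -5.0, -9.0)
```

Minimizing this loss pushes the policy to widen the likelihood gap between chosen and rejected responses, which is exactly how the SFT-over-Base delta gets amplified.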

Priming for RL

DPO with Delta Learning also serves as a foundation for a subsequent reinforcement learning (RL) stage.

  • Reward model alignment: Improves alignment with human preferences
  • Exploration efficiency: Provides a better initial policy
  • Improved stability: Facilitates convergence of RL training

Enhanced Reasoning Capability

The application in Dolci Think demonstrates the following improvements:

  • Strengthened step-by-step approaches to complex problems
  • Increased depth and accuracy of reasoning
  • Expansion of the reasoning frontier

Comparison with Other Preference Tuning Methods

Conventional DPO:

  • Relies on human-labeled data
  • High cost of data collection
  • Limited scalability

RLHF (Reinforcement Learning from Human Feedback):

  • Requires training a reward model
  • Complex implementation and tuning
  • High computational cost

Advantages of Delta Learning:

  • Scalability: Synthetic data enables large-scale training
  • Cost efficiency: No human annotation required
  • Quality assurance: The capability gap between models produces clear contrastive signals
  • Flexibility: Easily applicable to different tasks and domains

Delta Learning maximizes the capabilities acquired through SFT to achieve efficient and effective preference tuning.

Summary

Delta Learning plays a central role in preference tuning for OLMo2 3B.

Key points:

  • Leverages the delta between SFT and Base models
  • Automatically generates high-quality contrastive data
  • Improves performance in both reasoning ability and response quality
  • A scalable and cost-effective method

Through this approach, Dolci Think and Dolci Instruct achieve state-of-the-art performance in their respective domains.