Delta Learning
Delta Learning is an approach to preference tuning that leverages the “delta” — the capability gap — between an SFT (Supervised Fine-Tuning) model and its Base model to generate high-quality contrastive data, maximizing the effectiveness of DPO (Direct Preference Optimization).
Core Principle
The central idea of Delta Learning is to explicitly capture the capability gap between models.
+------------------------------------------------------------------+
|                      Delta Learning Concept                      |
+------------------------------------------------------------------+
|                                                                  |
|   Base Model  -->  Limited reasoning capability                  |
|   SFT Model   -->  Enhanced reasoning capability                 |
|   Delta       -->  The "learned" reasoning ability               |
|                                                                  |
|   Goal: Amplify the delta through preference optimization        |
+------------------------------------------------------------------+
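In DPO terms, the delta surfaces as an implicit reward margin between the SFT-generated (chosen) and Base-generated (rejected) responses. A minimal sketch of the standard per-pair DPO objective, assuming summed response log-probabilities under the policy and a frozen reference model are already available:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss. Each argument is the summed log-probability
    of a full response under the policy or the frozen reference model.
    beta is the usual inverse-temperature hyperparameter."""
    # Implicit reward margin between chosen and rejected responses
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: minimized when the policy
    # ranks the chosen response above the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Training pushes the margin positive, i.e. it amplifies the delta the chosen (SFT-side) responses carry over the rejected (Base-side) ones.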
The Delta Between Models
The SFT model acquires the following capabilities over the Base model:
- More structured reasoning processes
- Step-by-step problem-solving approaches
- Application of task-specific knowledge
Delta Learning harnesses these “acquired capabilities” to generate preferred responses.
Application in Dolci Think DPO
Dolci Think uses Delta Learning to improve reasoning capability (Section 4.3).
Synthetic Data Generation
+------------------------------------------------------------------+
|                   Dolci Think Data Generation                    |
+------------------------------------------------------------------+
|                                                                  |
|   Step 1: Sample question from training set                      |
|   Step 2: Generate response using SFT model (Preferred)          |
|   Step 3: Generate response using Base model (Dispreferred)      |
|   Step 4: Apply quality filtering                                |
|                                                                  |
+------------------------------------------------------------------+
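The four steps above can be sketched as a simple loop; `sft_generate`, `base_generate`, and `is_correct` are hypothetical stand-ins for the actual models and answer checker, not names from the source:

```python
def build_preference_pairs(questions, sft_generate, base_generate, is_correct):
    """Sketch of the pair-generation loop described above."""
    pairs = []
    for q in questions:                       # Step 1: sample a question
        chosen = sft_generate(q)              # Step 2: SFT model -> preferred
        rejected = base_generate(q)           # Step 3: Base model -> dispreferred
        # Step 4 (simplified): keep only clearly contrastive pairs
        if is_correct(q, chosen) and not is_correct(q, rejected):
            pairs.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting prompt/chosen/rejected records are the usual input format for DPO training.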
Creating Preferred vs Dispreferred Responses
Preferred responses:
- Generated by the Dolci Think SFT model
- Include step-by-step reasoning processes
- Arrive at the correct final answer
Dispreferred responses:
- Generated by the OLMo2 7B Base model
- Lack sufficient reasoning depth
- Reach incorrect conclusions or produce incomplete reasoning
Quality Filtering
The generated pairs are filtered according to the following criteria:
- The preferred response contains the correct answer
- The dispreferred response is incorrect or incomplete
- A clear quality gap exists between the two responses
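The three criteria above can be expressed as a single predicate; the answer checker, quality scorer, and `min_gap` threshold are illustrative assumptions, not details from the source:

```python
def passes_filter(pair, is_correct, quality_score, min_gap=0.5):
    """Hypothetical quality filter mirroring the three criteria.
    pair is a {"prompt", "chosen", "rejected"} record."""
    # Criterion 1: the preferred response must contain the correct answer
    if not is_correct(pair["prompt"], pair["chosen"]):
        return False
    # Criterion 2: the dispreferred response must be incorrect or incomplete
    if is_correct(pair["prompt"], pair["rejected"]):
        return False
    # Criterion 3: require a clear quality gap between the two responses
    gap = quality_score(pair["chosen"]) - quality_score(pair["rejected"])
    return gap >= min_gap
```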
This process yields approximately 1M high-quality preference pairs.
Application in Dolci Instruct DPO
Dolci Instruct uses Delta Learning for multi-turn dialogue optimization (Section 5.3).
Multi-turn Preference Data
+------------------------------------------------------------------+
|                  Dolci Instruct Data Generation                  |
+------------------------------------------------------------------+
|                                                                  |
|   Source: Approximately 500K multi-turn prompts                  |
|                                                                  |
|   Preferred:                                                     |
|     - Generated by Dolci Instruct SFT                            |
|     - Concise, well-structured responses                         |
|                                                                  |
|   Dispreferred:                                                  |
|     - Generated by OLMo2 7B Base                                 |
|     - Verbose or poorly structured responses                     |
|                                                                  |
+------------------------------------------------------------------+
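A sketch of how one such pair might be assembled from a conversation history; the generator callables are hypothetical stand-ins for the Dolci Instruct SFT and OLMo2 7B Base models:

```python
def build_multiturn_pair(history, sft_generate, base_generate):
    """history: list of {"role": ..., "content": ...} turns ending with
    a user message. Both models answer the same conversation; the SFT
    reply is preferred, the Base reply dispreferred."""
    return {
        "prompt": history,
        "chosen": sft_generate(history),     # Dolci Instruct SFT reply
        "rejected": base_generate(history),  # OLMo2 7B Base reply
    }
```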
Response Length Optimization
Delta Learning enables the following improvements:
- Maintaining conciseness: Eliminating unnecessary verbosity
- Increasing information density: Conveying important information efficiently
- Improving structure: Producing responses with logical flow
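One simple way such a length preference could be enforced during pair selection, as an illustrative heuristic rather than the documented procedure:

```python
def concise_pair_ok(chosen, rejected):
    """Illustrative check (an assumption, not from the source): keep a
    pair only if the preferred response is no longer, in words, than
    the dispreferred one, so DPO rewards conciseness."""
    return len(chosen.split()) <= len(rejected.split())
```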
Implementation Details
Preference pairs are generated from approximately 500K multi-turn prompts to improve response quality.
Effects and Benefits
Preference tuning with Delta Learning provides several advantages.
Performance Beyond SFT
The additional optimization through DPO achieves performance levels that SFT alone cannot reach.
+------------------------------------------------------------------+
|                     Performance Progression                      |
+------------------------------------------------------------------+
|                                                                  |
|   Base Model  -->  SFT Model  -->  DPO Model (with Delta)        |
|                                                                  |
|   Limited     -->  Enhanced   -->  Optimized reasoning           |
|   reasoning        reasoning       and preference alignment      |
|                                                                  |
+------------------------------------------------------------------+
Priming for RL
DPO with Delta Learning serves as a foundation for future Reinforcement Learning.
- Reward model alignment: Improves alignment with human preferences
- Exploration efficiency: Provides a better initial policy
- Improved stability: Facilitates convergence of RL training
Enhanced Reasoning Capability
The application in Dolci Think demonstrates the following improvements:
- Strengthened step-by-step approaches to complex problems
- Increased depth and accuracy of reasoning
- Expansion of the reasoning frontier
Comparison with Other Methods
Conventional DPO:
- Relies on human-labeled data
- High cost of data collection
- Limited scalability
RLHF (Reinforcement Learning from Human Feedback):
- Requires training a reward model
- Complex implementation and tuning
- High computational cost
Advantages of Delta Learning:
- Scalability: Synthetic data enables large-scale training
- Cost efficiency: No human annotation required
- Quality assurance: The capability gap between models produces clear contrastive signals
- Flexibility: Easily applicable to different tasks and domains
By exploiting the capabilities acquired through SFT, Delta Learning makes preference tuning both efficient and effective.
Summary
Delta Learning plays a central role in preference tuning for OLMo2 3B.
Key points:
- Leverages the delta between SFT and Base models
- Automatically generates high-quality contrastive data
- Improves performance in both reasoning ability and response quality
- A scalable and cost-effective method
Through this approach, Dolci Think and Dolci Instruct achieve state-of-the-art performance in their respective domains.