Delta Learning

Delta Learning is a novel approach to preference tuning. This method leverages the “delta” between an SFT (Supervised Fine-Tuning) model and a Base model to generate high-quality contrastive data, maximizing the effectiveness of DPO (Direct Preference Optimization).

Core Principle

The central idea of Delta Learning is to explicitly capture the capability gap between models.

+------------------------------------------------------------------+
|                      Delta Learning Concept                      |
+------------------------------------------------------------------+
|                                                                  |
|  Base Model    -->  Limited reasoning capability                 |
|  SFT Model     -->  Enhanced reasoning capability                |
|  Delta         -->  The "learned" reasoning ability              |
|                                                                  |
|  Goal: Amplify the delta through preference optimization         |
+------------------------------------------------------------------+

The Delta Between Models

Compared with the Base model, the SFT model acquires the following capabilities:

  • More structured reasoning processes
  • Step-by-step problem-solving approaches
  • Application of task-specific knowledge

Delta Learning harnesses these “acquired capabilities” to generate preferred responses.

Application in Dolci Think DPO

Dolci Think uses Delta Learning to improve reasoning capability (Section 4.3).

Synthetic Data Generation

+------------------------------------------------------------------+
|                   Dolci Think Data Generation                    |
+------------------------------------------------------------------+
|                                                                  |
|  Step 1: Sample question from training set                       |
|  Step 2: Generate response using SFT model (Preferred)           |
|  Step 3: Generate response using Base model (Dispreferred)       |
|  Step 4: Apply quality filtering                                 |
|                                                                  |
+------------------------------------------------------------------+
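Assuming generation is wrapped behind simple callables (the real pipeline would call the Dolci Think SFT and OLMo2 7B Base checkpoints), the four steps above can be sketched in Python. All names here are illustrative, not from the source:

```python
def build_preference_pairs(questions, sft_generate, base_generate, keep_pair):
    """Delta Learning pair construction.

    For each sampled question, the SFT model's answer becomes the
    preferred ("chosen") response and the Base model's answer the
    dispreferred ("rejected") one; each pair then passes a quality filter.
    """
    pairs = []
    for question in questions:                     # Step 1: sample questions
        chosen = sft_generate(question)            # Step 2: SFT model -> preferred
        rejected = base_generate(question)         # Step 3: Base model -> dispreferred
        if keep_pair(question, chosen, rejected):  # Step 4: quality filtering
            pairs.append(
                {"prompt": question, "chosen": chosen, "rejected": rejected}
            )
    return pairs


# Toy stand-ins for the real models, used only to exercise the pipeline.
sft_stub = lambda q: f"Step 1: restate the problem. Step 2: solve. Answer: {q.upper()}"
base_stub = lambda q: f"Answer: {q}"
keep_stub = lambda q, chosen, rejected: "Step" in chosen and "Step" not in rejected

pairs = build_preference_pairs(["alpha", "beta"], sft_stub, base_stub, keep_stub)
```

Because generation and filtering are injected as callables, the same skeleton works whether the responses come from toy stubs or from real model inference.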

Creating Preferred vs Dispreferred Responses

Preferred responses:

  • Generated by the Dolci Think SFT model
  • Include step-by-step reasoning processes
  • Arrive at the correct final answer

Dispreferred responses:

  • Generated by the OLMo2 7B Base model
  • Lack sufficient reasoning depth
  • Reach incorrect conclusions or produce incomplete reasoning

Quality Filtering

The generated pairs are filtered according to the following criteria:

  • The preferred response contains the correct answer
  • The dispreferred response is incorrect or incomplete
  • A clear quality gap exists between the two responses

This process yields approximately 1M high-quality preference pairs.
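A minimal version of this filter, assuming each question comes with a gold answer and a task-specific `extract_answer` helper (both of which are assumptions for illustration, not details from the published pipeline), could look like:

```python
def passes_quality_filter(gold_answer, chosen, rejected, extract_answer):
    """Keep a pair only when there is a clear quality gap:
    the preferred response must be correct, and the dispreferred
    response incorrect or incomplete."""
    chosen_correct = extract_answer(chosen) == gold_answer
    rejected_correct = extract_answer(rejected) == gold_answer
    return chosen_correct and not rejected_correct


# Illustrative extractor: take the text after the last "Answer:" marker,
# returning None when the response never states a final answer.
def extract_answer(text):
    _, sep, tail = text.rpartition("Answer:")
    return tail.strip() if sep else None


kept = passes_quality_filter(
    "42",
    "Step 1: decompose. Step 2: compute. Answer: 42",  # preferred: correct
    "Answer: 41",                                       # dispreferred: incorrect
    extract_answer,
)
```

A pair where both responses are correct (or both wrong) is discarded, since it carries no clear contrastive signal.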

Application in Dolci Instruct DPO

Dolci Instruct uses Delta Learning for multi-turn dialogue optimization (Section 5.3).

Multi-turn Preference Data

+------------------------------------------------------------------+
|                  Dolci Instruct Data Generation                  |
+------------------------------------------------------------------+
|                                                                  |
|  Source: Approximately 500K multi-turn prompts                   |
|                                                                  |
|  Preferred:                                                      |
|    - Generated by Dolci Instruct SFT                             |
|    - Concise, well-structured responses                          |
|                                                                  |
|  Dispreferred:                                                   |
|    - Generated by OLMo2 7B Base                                  |
|    - Verbose or poorly structured responses                      |
|                                                                  |
+------------------------------------------------------------------+
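One such record, shown in the chosen/rejected layout that common DPO trainers (e.g. Hugging Face TRL) consume; the exact field names and chat format are an assumption here, not taken from the source:

```python
# A single multi-turn preference record: the shared prompt carries the
# dialogue history, and only the final assistant turn differs.
record = {
    "prompt": [
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Use list.reverse() in place, or slicing."},
        {"role": "user", "content": "Which option returns a new list?"},
    ],
    # Preferred: concise, well-structured (would come from the Dolci Instruct SFT model)
    "chosen": [
        {"role": "assistant", "content": "Slicing: new = old[::-1] returns a new list."}
    ],
    # Dispreferred: verbose, poorly structured (would come from the OLMo2 7B Base model)
    "rejected": [
        {"role": "assistant", "content": (
            "Well, there are several ways one could think about this, and "
            "depending on the situation you might want to consider various "
            "approaches before deciding which is most appropriate here."
        )}
    ],
}
```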

Response Length Optimization

Delta Learning enables the following improvements:

  • Maintaining conciseness: Eliminating unnecessary verbosity
  • Increasing information density: Conveying important information efficiently
  • Improving structure: Producing responses with logical flow

Implementation Details

Preference pairs are generated from approximately 500K multi-turn prompts to improve response quality.

Effects and Benefits

Preference tuning with Delta Learning provides several advantages.

Performance Beyond SFT

The additional optimization through DPO achieves performance levels that SFT alone cannot reach.

+------------------------------------------------------------------+
|                     Performance Progression                      |
+------------------------------------------------------------------+
|                                                                  |
|  Base Model  -->  SFT Model  -->  DPO Model (with Delta)         |
|                                                                  |
|  Limited     -->  Enhanced   -->  Optimized reasoning            |
|  reasoning        reasoning       and preference alignment       |
|                                                                  |
+------------------------------------------------------------------+
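The DPO objective that drives the last step of this progression can be written down directly. Here is a single-pair version in plain Python, assuming the inputs are summed token log-probabilities from the policy being trained and from the frozen SFT reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:

        -log sigmoid(beta * [(pi_chosen - ref_chosen)
                             - (pi_rejected - ref_rejected)])

    pi_* are log-probabilities under the policy, ref_* under the
    frozen reference (SFT) model; beta scales the implicit reward.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# At initialization the policy equals the reference, so the margin is
# zero and the loss is exactly log 2.
start = dpo_loss(-5.0, -9.0, -5.0, -9.0)

# As the policy raises the chosen response's log-probability relative
# to the reference, the loss falls below log 2.
improved = dpo_loss(-4.0, -9.0, -5.0, -9.0)
```

Minimizing this loss pushes the policy to widen the likelihood gap between chosen and rejected responses, which is exactly how the SFT-over-Base delta gets amplified.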

Priming for RL

DPO with Delta Learning also serves as a foundation for a subsequent reinforcement learning (RL) stage.

  • Reward model alignment: Improves alignment with human preferences
  • Exploration efficiency: Provides a better initial policy
  • Improved stability: Facilitates convergence of RL training

Enhanced Reasoning Capability

The application in Dolci Think demonstrates the following improvements:

  • Strengthened step-by-step approaches to complex problems
  • Increased depth and accuracy of reasoning
  • Expansion of the reasoning frontier

Comparison with Other Preference Tuning Methods

Conventional DPO:

  • Relies on human-labeled data
  • High cost of data collection
  • Limited scalability

RLHF (Reinforcement Learning from Human Feedback):

  • Requires training a reward model
  • Complex implementation and tuning
  • High computational cost

Advantages of Delta Learning:

  • Scalability: Synthetic data enables large-scale training
  • Cost efficiency: No human annotation required
  • Quality assurance: The capability gap between models produces clear contrastive signals
  • Flexibility: Easily applicable to different tasks and domains

Delta Learning maximizes the capabilities acquired through SFT to achieve efficient and effective preference tuning.

Summary

Delta Learning plays a central role in preference tuning for OLMo2 3B.

Key points:

  • Leverages the delta between SFT and Base models
  • Automatically generates high-quality contrastive data
  • Improves performance in both reasoning ability and response quality
  • A scalable and cost-effective method

Through this approach, Dolci Think and Dolci Instruct achieve state-of-the-art performance in their respective domains.