OlmoRL / GRPO: Efficient Reinforcement Learning
OlmoRL is a framework developed to improve the efficiency of reinforcement learning training for reasoning models. It is built upon Group Relative Policy Optimization (GRPO) and employs the Reinforcement Learning with Verifiable Rewards (RLVR) approach.
As the third stage of post-training, reinforcement learning is conducted across multiple domains including math, coding, instruction-following, and general chat, combining verifiable rewards with LM-judge rewards.
Overview of OlmoRL
OlmoRL is a system that tightly integrates algorithmic improvements and engineering infrastructure to address the challenges of reinforcement learning with long reasoning traces. It extends RLVR, which was previously limited to math and code, to a broader range of verifiable tasks.
Key features:
- Algorithmic improvements: Built on GRPO, integrating recent improvements from DAPO and Dr GRPO
- Large-scale dataset: Dolci-Think-RL (approximately 100K prompts across 4 domains)
- Efficient infrastructure: A distributed training system that efficiently handles long sequences (up to 32K tokens)
- 4x speedup: Achieves approximately 4x speedup compared to OLMo 2’s RL infrastructure
OlmoRL Algorithm Details
The reinforcement learning stage of OlmoRL is built upon GRPO (Shao et al., 2024), integrating recent algorithmic improvements such as DAPO (Yu et al., 2025) and Dr GRPO (Liu et al., 2025b).
Improvements over Vanilla GRPO
OlmoRL implements the following improvements over Vanilla GRPO:
1. Zero gradient signal filtering:
- Excludes groups where all rewards are identical (i.e., the standard deviation of advantages is zero)
- Avoids training on zero-gradient samples (similar to DAPO)
2. Active sampling:
- Maintains a consistent batch size despite zero gradient filtering
- Implements an improved version of dynamic sampling (details in Section 4.4.3)
3. Token-level loss:
- Normalizes the loss by the total number of tokens across the batch, rather than per sample
- Avoids length bias
4. No KL loss:
- Removes the KL loss term (following the practice of GLM-4.5, DAPO, Dr GRPO, and others)
- Enables less restrictive policy updates without causing over-optimization or training instability
5. Clip higher:
- Sets the upper clipping bound slightly higher than the lower bound
- Allows larger updates for tokens (Yu et al., 2025)
6. Truncated importance sampling:
- Adjusts for differences in log-probabilities between the inference engine and training engine
- Multiplies the loss by a truncated importance sampling ratio (Yao et al., 2025)
7. No standard deviation normalization:
- Does not normalize by the standard deviation when computing advantages (Liu et al., 2025b)
- Removes difficulty bias (prevents advantages from being inflated disproportionately for problems with low reward standard deviation)
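Improvements 1 and 7 above can be sketched together in a few lines: compute group-relative advantages without standard-deviation normalization, and flag zero-gradient groups for filtering. This is a minimal illustration, not the actual OlmoRL implementation.

```python
def group_advantages(rewards):
    """GRPO-style advantages for one prompt's rollout group.

    Follows the no-std-normalization variant (Dr GRPO): subtract the
    group mean but do not divide by the group standard deviation.
    Returns None for zero-gradient groups (all rewards identical),
    which the trainer filters out.
    """
    if len(set(rewards)) == 1:   # all rewards equal -> zero advantage everywhere
        return None
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A group of 4 rollouts with binary verifiable rewards
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # None -> filtered out
```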
OlmoRL Objective Function
The final objective function incorporates token-level loss, truncated importance sampling, clip-higher, and no standard deviation normalization in the advantage computation:
\[ J(\theta) = \frac{1}{\sum_{i=1}^{G} |y_i|} \sum_{i=1}^{G} \sum_{t=1}^{|y_i|} \min\left(\frac{\pi(y_{i,t} | x, y_{i,<t}; \theta_{\text{old}})}{\pi_{\text{vllm}}(y_{i,t} | x, y_{i,<t}; \theta_{\text{old}})}, \rho\right) \times \min(r_{i,t} A_{i,t}, \text{clip}(r_{i,t}, 1 - \varepsilon_{\text{low}}, 1 + \varepsilon_{\text{high}}) A_{i,t}) \]
Where:
- \(r_{i,t} = \frac{\pi(y_{i,t}|x,y_{i,<t};\theta)}{\pi(y_{i,t}|x,y_{i,<t};\theta_{\text{old}})}\)
- \(\varepsilon_{\text{low}}\) and \(\varepsilon_{\text{high}}\) are the clipping hyperparameters
- \(\rho\) is the upper bound for truncated importance sampling
- The advantage \(A_{i,t}\) is computed based on relative rewards within group \(G\):
\[ A_{i,t} = r(x, y_i) - \text{mean}(\{r(x, y_j)\}_{j=1}^{G}) \]
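As a sanity check on the objective, it can be written as a (negated) token-level loss over flat per-token arrays, so that a single mean implements token-level normalization. The hyperparameter values below are illustrative, not the paper's settings.

```python
import numpy as np

def olmo_rl_loss(logp_new, logp_old, logp_vllm, adv,
                 eps_low=0.2, eps_high=0.28, rho=2.0):
    """Sketch of the OlmoRL objective as a token-level loss.

    All inputs are 1-D arrays over every token in the batch; a single
    mean over them is the token-level normalization 1 / sum_i |y_i|.
    """
    r = np.exp(logp_new - logp_old)                      # policy ratio r_{i,t}
    clipped = np.clip(r, 1 - eps_low, 1 + eps_high)      # clip-higher bounds
    surrogate = np.minimum(r * adv, clipped * adv)       # PPO-style min term
    tis = np.minimum(np.exp(logp_old - logp_vllm), rho)  # truncated IS ratio
    return -float(np.mean(tis * surrogate))              # negate to maximize J

# On-policy sanity check: identical log-probs give ratio 1 everywhere,
# so the loss reduces to the negative mean advantage.
z = np.zeros(4)
print(olmo_rl_loss(z, z, z, np.array([1.0, 1.0, 0.0, 0.0])))  # -0.5
```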
GRPO is a critic-free variant of PPO; the main difference lies in how the advantage baseline is computed.
PPO:
- Estimates advantages with a learned value function (critic) trained alongside the policy
- Uses this as a global baseline across the batch
GRPO:
- Normalizes rewards within a group of responses generated from the same prompt
- Improves training stability through group-based relative quality assessment
- Particularly effective when output lengths vary significantly, as in reasoning models
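The contrast between a single global baseline and per-group baselines can be shown on a toy example (with the global baseline simplified to a batch mean; real PPO learns a critic instead):

```python
def global_baseline_advantages(groups):
    """PPO-style (simplified): one baseline over all responses in the batch."""
    flat = [r for g in groups for r in g]
    b = sum(flat) / len(flat)
    return [[r - b for r in g] for g in groups]

def group_baseline_advantages(groups):
    """GRPO-style: a separate baseline per prompt's response group."""
    return [[r - sum(g) / len(g) for r in g] for g in groups]

# Two prompts of different difficulty: an easy one (mostly solved) and a
# hard one (mostly failed). A global baseline treats every response the
# same; group baselines adapt to each prompt's difficulty.
groups = [[1, 1, 1, 0], [1, 0, 0, 0]]
print(global_baseline_advantages(groups))  # single baseline 0.5
print(group_baseline_advantages(groups))   # baselines 0.75 and 0.25
```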
OlmoRL improvements:
- Implements 7 major improvements over Vanilla GRPO
- Zero gradient filtering and active sampling substantially improve training efficiency and stability
Verifiers
OlmoRL extends verifiable rewards beyond OLMo 2’s math domain to general domains. Different custom verifiers are used for each domain:
Math:
- Rule-based verifier
- Performs basic normalization and compares against the reference answer using SymPy
- Returns 1 if the answer matches the reference, 0 otherwise
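A minimal sketch of such a rule-based verifier, using SymPy for symbolic comparison (the actual OlmoRL verifier's normalization rules are more involved than this):

```python
from sympy import simplify, sympify

def math_reward(answer: str, reference: str) -> int:
    """Rule-based math verifier sketch: light normalization, then a
    SymPy check for symbolic equivalence with the reference answer."""
    norm = lambda s: s.strip().replace(" ", "")
    if norm(answer) == norm(reference):   # fast path: exact string match
        return 1
    try:
        return int(simplify(sympify(answer) - sympify(reference)) == 0)
    except Exception:                     # unparsable answer -> no reward
        return 0

print(math_reward("2/4", "1/2"))    # 1 (symbolically equal)
print(math_reward("x+1", "1+x"))    # 1
print(math_reward("3", "1/2"))      # 0
```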
Code:
- Test-case-based verifier
- Runs a set of test cases against the response
- Either (a) uses the proportion of passed test cases as the reward, or (b) returns 1 if all test cases pass, 0 otherwise
Instruction-following:
- Passes the response through a set of functions that check compliance with constraints from the prompt
- Returns 1 if all constraints are satisfied, 0 otherwise
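This amounts to running every constraint checker and requiring all of them to pass. The constraint functions below are hypothetical examples, not OlmoRL's actual checkers:

```python
def if_reward(response: str, constraints) -> int:
    """Instruction-following reward sketch: each constraint is a checker
    function over the response; reward is 1 only if all are satisfied."""
    return int(all(check(response) for check in constraints))

constraints = [
    lambda r: len(r.split()) <= 50,    # hypothetical word-count limit
    lambda r: "summary" in r.lower(),  # hypothetical keyword requirement
]
print(if_reward("Summary: all tests passed.", constraints))  # 1
print(if_reward("No keyword here.", constraints))            # 0
```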
Chat – reference:
- When a ground-truth response is available, uses an LM judge to compare the model’s response against the reference answer
- Assigns a score in [0, 1] based on response quality
Chat – open-ended:
- Without a reference answer, uses an LM judge to assign a score in [0, 1] based on response quality
Dolci-Think-RL Dataset
Dolci-Think-RL is a large and diverse dataset consisting of approximately 100K samples across four domains (math, coding, instruction-following, and general chat). It supports robust RL across diverse reasoning tasks while maintaining general helpfulness.
Dataset Composition
| Category | Dataset | Think RL Prompts | Instruct RL Prompts |
|---|---|---|---|
| Precise IF | IF-RLVR | 30,186 | 38,000 |
| Math | Open-Reasoner-Zero | 3,000 | 14,000 |
| | DAPO-Math | 2,584 | 7,000 |
| | AceReason-Math | 6,602 | - |
| | Polaris-Dataset | - | 14,000 |
| | KlearReasoner-MathSub | 3,000 | 9,000 |
| | OMEGA-train | 15,000 | 20,000 |
| Coding | AceCoder | 9,767 | 20,000 |
| | KlearReasoner-Code | 8,040 | - |
| | Nemotron Post-training Code | 2,303 | - |
| | SYNTHETIC-2 | 3,000 | - |
| General Chat | Tulu 3 SFT | 7,129 | 18,955 |
| | Wildchat-4.8M | 7,129 | 18,761 |
| | Multi-Subject RLVR | 7,129 | 12,234 |
| Total | | 104,869 | 171,950 |
Data Construction Process
Step 1: Prompt sourcing:
High-quality prompts are collected and curated from each domain.
- Math: Open-Reasoner-Zero, DAPO-Math, AceReason-Math, KlearReasoner-MathSub, OMEGA, etc.
- Coding: AceCoder, Klear-Reasoner Code, Nemotron Post-training Code, SYNTHETIC-2, etc.
- Instruction-following: IF-RLVR (up to 5 constraints, sampled from IFEval and IFBench-Train)
- General chat: Tulu 3 SFT, WildChat-4.8M, Multi-subject-RLVR
Step 2: Offline difficulty filtering:
- Generates 8 rollouts per prompt from the model’s initial checkpoint
- Excludes samples that the model easily solves (pass rate > 62.5%)
- Samples at temperature 1.0, top-p 1.0 (matching RL training settings)
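The filtering rule in Step 2 reduces to a pass-rate threshold over 8 rollouts per prompt, as in this sketch:

```python
def difficulty_filter(prompts, pass_rates, threshold=0.625):
    """Offline difficulty filtering sketch: drop prompts that the
    initial checkpoint already solves too often (pass rate above
    5/8 = 62.5%, measured over 8 rollouts per prompt)."""
    return [p for p, rate in zip(prompts, pass_rates) if rate <= threshold]

# 8 rollouts each: p1 passes 6/8 (too easy, dropped), p2 passes 3/8 (kept)
print(difficulty_filter(["p1", "p2"], [0.75, 0.375]))  # ['p2']
```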
Step 3: Data mixing:
- Conducts domain-specific experiments and observes downstream evaluation trends over the first 500-1000 steps
- Upweights high-quality datasets
- Uses roughly equal data volumes across domains (with slightly more emphasis on math and instruction-following)
- Downsamples specific subtasks from OMEGA
OlmoRL Infrastructure
OlmoRL introduces substantial improvements to the reinforcement learning infrastructure for handling long sequences and achieving faster overall throughput.
Compute Resources
A major technical challenge in RL is managing inference (rollouts). For the final model, rollouts are generated with a maximum length of 32K tokens, averaging over 10K tokens.
Resource allocation (for the 32B model):
- Training: 8 H100 nodes
- Inference: 20 nodes
- GPU utilization: Inference uses approximately 5x the compute of training
Due to the high cost of autoregressive inference, the learner spends 75% of its time waiting for data.
Key Technical Innovations
+------------------------------------------------------------------+
| OlmoRL Infrastructure Components |
+------------------------------------------------------------------+
| |
| 1. Fully Asynchronous Training |
| - Centralized learner across multiple nodes (DeepSpeed) |
| - Large pool of actors (independent vLLM instances) |
| - Prompts queue & Results queue |
| |
| 2. Continuous Batching |
| - Remove compute waste for long generations |
| - Constantly enqueue new generations as each one finishes |
| - Up to 54% compute savings vs static batching |
| |
| 3. Active Sampling |
| - Continuously pull completions and resample prompts |
| - Filter until desired batch size is reached |
| - More efficient than dynamic oversampling (3x reduction) |
| |
| 4. Inflight Updates |
| - Update weights without pausing generation engine |
| - Thread-safe, no KV cache invalidation |
| - Up to 4x throughput increase |
| |
+------------------------------------------------------------------+
Fully Asynchronous Training:
- Distributes a centralized learner across multiple nodes (using DeepSpeed)
- Large pool of actors, each running an independent vLLM instance
- The learner enqueues prompts and distributes them to actors
- Actors interact with the environment and return results through a queue
Continuous Batching:
- Constantly enqueues new generations as each one finishes
- Reduces compute waste on long generations compared to static batching
- In Olmo 3, with a 32K generation length, the average is 14,628 tokens and the maximum is 32K tokens
- Static batching would waste up to 54% of compute
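The waste figure follows directly from the reported lengths: static batching pads every rollout in a batch to the longest generation, so the padded fraction is roughly one minus the average-to-maximum ratio (assuming 32K = 32,000 tokens here):

```python
# Static batching pads every rollout to the batch's longest generation.
# With a 32K token cap and Olmo 3's reported average of 14,628 generated
# tokens, the wasted fraction of compute is roughly:
max_len, avg_len = 32_000, 14_628
waste = 1 - avg_len / max_len
print(f"{waste:.0%}")  # 54%
```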
Active Sampling:
- To compensate for filtered instances, continuously pulls completions from actors and resamples prompts into the queue
- Actively samples and filters until the desired batch size of non-zero-gradient completions is reached
- More efficient than DAPO’s dynamic sampling (which requires 3x oversampling)
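The core loop can be sketched as follows: keep pulling finished completion groups and retain only those with a nonzero gradient signal until a full batch is collected (in the real system, replacement prompts are simultaneously resampled into the queue):

```python
def active_sampling(completion_stream, batch_size, has_gradient):
    """Active sampling sketch: pull completion groups from the actors,
    keep only groups with a gradient signal, stop at a full batch."""
    batch = []
    for group in completion_stream:
        if has_gradient(group):
            batch.append(group)
            if len(batch) == batch_size:
                break
    return batch

# Groups of binary rewards; constant groups carry zero gradient.
groups = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0], [1, 0, 1, 1]]
nonconstant = lambda g: len(set(g)) > 1
print(active_sampling(iter(groups), 2, nonconstant))  # [[1, 0, 0, 1], [1, 0, 1, 1]]
```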
Inflight Updates:
- After each training step, immediately updates weights without pausing the generation engine
- Relies on a thread-safe generation framework to continue generation without invalidating the KV cache
- Achieves up to 4x speedup with the same resources
Impact of Infrastructure Improvements
| Configuration | Total Tokens (Mtok) | Tokens/sec | MFU (%) | MBU (%) |
|---|---|---|---|---|
| OLMo 2 | 6.34 | 881 | 0.30 | 12.90 |
| + continuous batching | 7.02 | 975 | 0.33 | 14.29 |
| + better threading | 9.77 | 1358 | 0.46 | 19.89 |
| + inflight updates (Olmo 3) | 21.23 | 2949 | 1.01 | 43.21 |
The addition of inflight updates provides the most dramatic improvement.
Extended Training of Olmo 3.1 Think 32B
Olmo 3.1 Think 32B demonstrated performance gains through extended OlmoRL training. Additional epochs on the Dolci-Think-RL dataset yielded the following improvements:
Performance gains:
- AIME 2024: +4 points
- IFBench: +20 points
- Other benchmarks: Performance maintained
These results confirm that longer RL training can continue to improve targeted capabilities without catastrophic forgetting on other benchmarks.
Key Findings
Delta Learning Provides a Stronger Initialization for RLVR
Performing preference tuning with Delta Learning before applying RLVR achieves better overall performance than SFT alone.
Both DPO and SFT Benefit from RL, but DPO Is a Better Starting Point
Running the final RL mix on a DPO model consistently yields superior performance compared to running it on an SFT model.
Key differences:
- On evaluations where RL does not improve performance, the DPO model often outperforms and maintains its advantage throughout RL training (e.g., AlpacaEval)
- On evaluations explicitly targeted by RL, both DPO and SFT models achieve similar final performance (e.g., OMEGA)
- On evaluations targeted by RL where further improvement from DPO is difficult, the SFT model improves to approach DPO performance (e.g., AIME 2025)
Rewards Steadily Increase Across All Domains
During RL training, rewards steadily increase across all domains, though at different rates. Instruction-following data shows the most consistent increase, while code rewards increase most slowly.
Summary
OlmoRL is an efficient reinforcement learning framework built on GRPO, achieving the following:
Key contributions:
- Algorithmic improvements: 7 important improvements over Vanilla GRPO
- Large-scale dataset: Dolci-Think-RL (100K prompts across 4 domains)
- Efficient infrastructure: 4x speedup through continuous batching, active sampling, and inflight updates
- Multi-domain support: Math, Code, Instruction-following, and General chat
- Extended training: Significant performance gains through 2,300 steps of extended training on Olmo 3.1 Think 32B
OlmoRL substantially improves the efficiency of reasoning model training and provides a fully open RL research environment.