```mermaid
flowchart TB
subgraph Conventional["Conventional (Flow Matching / Diffusion)"]
direction TB
CT["Training: Learn v(z_t, t) or score function"]
CI["Inference: z₀ → z₁ → z₂ → ... → z_T<br/>(iterative pushforward at inference time)"]
CT --> CI
end
subgraph Drifting["Drifting Models"]
direction TB
DT["Training: θ₀ → θ₁ → θ₂ → ... → θ_N<br/>q₀ → q₁ → q₂ → ... → q_N ≈ p_data<br/>(distribution evolves during training)"]
DI["Inference: x = f_{θ_N}(ε)<br/>(single forward pass only)"]
DT --> DI
end
style Conventional fill:#fff3e0,stroke:#e65100
style Drifting fill:#e8f5e9,stroke:#2e7d32
```
Drifting Models: A New Paradigm via Distribution Evolution During Training
Generative Modeling via Drifting (Deng, Li, Li, Du, He; MIT, 2026) proposes a method that fundamentally overturns the inference paradigm of generative models. While conventional Diffusion Models and Flow Matching perform iterative computation at inference time, Drifting Models evolve the pushforward distribution during training and generate samples with only a single forward pass at inference time. The method achieves FID 1.54 (latent space) / 1.61 (pixel space) with 1-NFE on ImageNet 256x256, establishing a new state of the art for one-step generation.
Background and Motivation
Inference cost is a major barrier to deploying generative models in real-time applications. Conventional methods all share one framework: perform iterative computation at inference time to transform a noise distribution into the data distribution:
- Diffusion Models: Learn a score function \(\nabla \log p_t(x)\) and solve an SDE/ODE over tens to hundreds of steps
- Flow Matching: Learn an instantaneous velocity field \(v(z_t, t)\) and solve an ODE
- MeanFlow: Introduce mean velocity \(u(z_t, t)\) to reduce the number of steps
- Terminal Velocity Matching (TVM): Promote one-step generation through terminal-time regularization
What these methods share is the structure of fixing a trained network \(f_\theta\) and performing iterative pushforward at inference time. While MeanFlow and TVM achieve 1-NFE, they remain extensions within the Flow Matching framework.
Drifting Models abandons this structure altogether. It eliminates iteration at inference time and proposes an entirely different paradigm in which the training process itself drives the evolution of the distribution.
Pushforward Distribution Evolution
Conventional Approach
In conventional one-step generation, the network \(f_\theta\) aims to directly transform the noise distribution \(p_\varepsilon\) into the data distribution \(p_\text{data}\):
\[ f_{\theta\#} p_\varepsilon \approx p_\text{data} \]
Here, \(f_{\theta\#} p_\varepsilon\) is the pushforward distribution of \(p_\varepsilon\) through \(f_\theta\). That is, for \(\varepsilon \sim p_\varepsilon\), it is the distribution followed by samples \(f_\theta(\varepsilon)\).
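As a concrete toy illustration of the pushforward: with a 1-D affine map standing in for the network, the noise distribution \(\mathcal{N}(0,1)\) is pushed forward to \(\mathcal{N}(\text{shift}, \text{scale}^2)\). This NumPy sketch and its parameterization are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy generator f_theta: an affine map standing in for the network.
# theta = (scale, shift) is a hypothetical parameterization for illustration.
def f_theta(eps, scale=2.0, shift=1.0):
    return scale * eps + shift

# Noise samples eps ~ p_eps = N(0, 1)
eps = rng.standard_normal(100_000)

# Pushforward samples: x = f_theta(eps) follows f_theta# p_eps = N(shift, scale^2)
x = f_theta(eps)
print(x.mean(), x.std())  # close to 1.0 and 2.0
```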
The Drifting Approach
In Drifting Models, the parameters \(\theta_i\) are updated at each training iteration \(i\), and the pushforward distribution \(q_{\theta_i} = f_{\theta_i\#} p_\varepsilon\) gradually approaches \(p_\text{data}\):
\[ q_{\theta_0} \to q_{\theta_1} \to q_{\theta_2} \to \cdots \to q_{\theta_N} \approx p_\text{data} \]
The core of this design lies in absorbing the cost of iterative computation into training time. At inference time, the final \(f_{\theta_N}\) is applied only once.
Drifting Field
The theoretical foundation of Drifting Models is a vector field called the Drifting Field \(V_{p,q}(x)\). This determines the direction in which sample \(x\) should move, playing the role of bringing the generated distribution \(q\) closer to the data distribution \(p\).
Decomposition into Attraction and Repulsion
The Drifting Field decomposes into two components:
\[ V_{p,q}(x) := V_p^+(x) - V_q^-(x) \]
where:
- Attraction term: \(V_p^+(x) = \frac{1}{Z_p(x)} \mathbb{E}_{y^+ \sim p}\left[ k(x, y^+)(y^+ - x) \right]\)
- Repulsion term: \(V_q^-(x) = \frac{1}{Z_q(x)} \mathbb{E}_{y^- \sim q}\left[ k(x, y^-)(y^- - x) \right]\)
The normalization terms are \(Z_p(x) = \mathbb{E}_{y^+ \sim p}[k(x, y^+)]\) and \(Z_q(x) = \mathbb{E}_{y^- \sim q}[k(x, y^-)]\), respectively.
Intuitively:
- Attraction term \(V_p^+\): Pulls samples toward data samples \(y^+\)
- Repulsion term \(V_q^-\): Repels from other generated samples \(y^-\), preventing mode collapse
Kernel Function
The kernel \(k(x, y)\) is defined as an exponential kernel:
\[ k(x, y) = \exp\left(-\frac{\|x - y\|}{\tau}\right) \]
Here, \(\tau\) is a temperature parameter and \(\|\cdot\|\) is the \(\ell_2\) norm. The division by the normalization terms \(Z_p(x)\) and \(Z_q(x)\) corresponds to softmax normalization and bears similarity to the InfoNCE loss.
Anti-symmetry
The Drifting Field satisfies an important anti-symmetry property:
\[ V_{p,q}(x) = -V_{q,p}(x), \quad \forall x \]
This means that swapping the roles of attraction and repulsion reverses the direction of the vector field.
Equilibrium Condition
An important consequence that immediately follows from anti-symmetry:
\[ q = p \implies V_{p,q}(x) = 0, \quad \forall x \]
That is, when the generated distribution \(q\) matches the data distribution \(p\), the Drifting Field becomes zero and the distribution evolution naturally halts. This is a property that guarantees convergence as a fixed-point iteration.
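A minimal NumPy estimator of the field from finite sample batches makes both properties easy to check numerically; the batch sizes, toy distributions, and function names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def drifting_field(x, pos, neg, tau=1.0):
    """Finite-sample estimate of V_{p,q}(x) = V_p^+(x) - V_q^-(x)."""
    def drift(ys):
        # Exponential kernel k(x, y) = exp(-||x - y|| / tau), normalized:
        # E[k(x, y)(y - x)] / E[k(x, y)]  (the softmax-style weighting)
        w = np.exp(-np.linalg.norm(ys - x, axis=1) / tau)
        return (w[:, None] * (ys - x)).sum(axis=0) / w.sum()
    return drift(pos) - drift(neg)

x = rng.standard_normal(2)
p_batch = rng.standard_normal((64, 2)) + 1.0  # stand-in data samples y+ ~ p
q_batch = rng.standard_normal((64, 2))        # stand-in generated samples y- ~ q

V = drifting_field(x, p_batch, q_batch)
print(np.allclose(V, -drifting_field(x, q_batch, p_batch)))  # anti-symmetry: True
print(np.allclose(drifting_field(x, p_batch, p_batch), 0.0)) # q = p gives V = 0: True
```

Anti-symmetry holds exactly by construction here, since swapping the batches swaps the two terms of the difference.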
Training Objective
From Fixed-Point Iteration to MSE Loss
Distribution updates using the Drifting Field are formulated as fixed-point iteration:
\[ f_{\theta_{i+1}}(\varepsilon) \leftarrow f_{\theta_i}(\varepsilon) + V_{p, q_{\theta_i}}\left(f_{\theta_i}(\varepsilon)\right) \]
The training objective is this update rule converted to an MSE loss:
\[ \mathcal{L} = \mathbb{E}_{\varepsilon}\left[\left\| f_\theta(\varepsilon) - \text{stopgrad}\left(f_\theta(\varepsilon) + V_{p, q_\theta}(f_\theta(\varepsilon))\right) \right\|^2\right] \tag{1}\]
Role of Stop-Gradient
In Equation 1, \(\text{stopgrad}(\cdot)\) is an operator that blocks gradient propagation. It “freezes” the target \(f_\theta(\varepsilon) + V(f_\theta(\varepsilon))\) and computes gradients only with respect to the network output \(f_\theta(\varepsilon)\).
This design achieves:
- Avoids backpropagation through the Drifting Field \(V\), stabilizing computation
- The network learns to “follow” toward the target
- The loss value \(\mathcal{L} = \mathbb{E}[\|V(f(\varepsilon))\|^2]\) corresponds to minimizing the magnitude of the Drifting Field, converging toward the equilibrium condition \(V = 0\)
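To make the fixed-point view concrete, this toy NumPy sketch applies the update \(x \leftarrow x + V(x)\) directly to a batch of samples, with the batch itself serving as the negatives; the samples stand in for the pushforward of a network, and all distributions and hyperparameters are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def drifting_field(xs, pos, tau=1.0):
    """V for each x in the batch xs, using xs itself as the negatives y- ~ q."""
    out = np.empty_like(xs)
    for i, x in enumerate(xs):
        def drift(ys):
            w = np.exp(-np.linalg.norm(ys - x, axis=1) / tau)
            return (w[:, None] * (ys - x)).sum(axis=0) / w.sum()
        out[i] = drift(pos) - drift(xs)   # attraction minus repulsion
    return out

data = rng.standard_normal((256, 2)) + 3.0   # p_data: Gaussian centered at (3, 3)
x = rng.standard_normal((256, 2))            # initial pushforward samples ~ q_0

for step in range(100):
    V = drifting_field(x, data)
    # The MSE against the frozen target stopgrad(x + V) reduces to E[||V||^2];
    # in an autograd framework the target would be (x + V).detach().
    loss = np.mean(np.sum(V ** 2, axis=1))
    # Fixed-point update applied directly to the samples (no network, for clarity):
    x = x + V

print(np.linalg.norm(x.mean(axis=0) - data.mean(axis=0)))  # small: q has drifted onto p
```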
Extension to Feature Space
Limitations of Pixel Space
When computing the Drifting Field directly in pixel space, distance computation in high dimensions struggles to capture meaningful similarity. The \(\ell_2\) distance of pixel values tends to diverge from human perceptual similarity, causing the kernel \(k(x, y)\) to function ineffectively.
Introduction of Feature Encoder
To solve this problem, Drifting is performed in the feature space of a pretrained encoder \(\phi\). The loss function is extended using multi-scale features:
\[ \mathcal{L} = \sum_j \mathbb{E}\left[\left\| \phi_j(x) - \text{stopgrad}\left(\phi_j(x) + V(\phi_j(x))\right) \right\|^2\right] \]
Here, \(\phi_j\) is the feature extractor at the \(j\)-th layer of the encoder. Features are extracted at multiple scales from a ResNet-type encoder, and the Drifting Field is computed independently at each scale, guiding samples from coarse structure to fine texture from multiple perspectives.
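A structural sketch of the multi-scale loss, with average pooling as a stand-in for the encoder layers \(\phi_j\) and placeholder per-scale fields \(V_j\); both are illustrative assumptions, not the paper's encoder:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical phi_j: average pooling at stride 2**j stands in for the
# j-th layer of a pretrained encoder (the paper uses a ResNet-type encoder).
def phi(x, j):
    s = 2 ** j
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3)).ravel()

x = rng.standard_normal((16, 16))   # one generated sample (toy "image")

loss = 0.0
for j in range(3):                  # scales j = 0, 1, 2
    feats = phi(x, j)
    V_j = 0.1 * rng.standard_normal(feats.shape)  # placeholder drifting field at scale j
    target = feats + V_j            # frozen (stop-gradient) target in feature space
    loss += np.mean((feats - target) ** 2)
print(loss)
```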
Encoder Selection
The quality of the feature encoder has a decisive impact on results. The following compares encoders using the B/2 model at 100 epochs:
| Encoder | Method | Feature Dim | FID |
|---|---|---|---|
| ResNet | SimCLR | 256 | 11.05 |
| ResNet | MoCo-v2 | 256 | 8.41 |
| ResNet | Latent-MAE | 640 | 3.36 |
Latent-MAE, trained directly on the VAE latent space, substantially outperforms generic self-supervised learning methods (SimCLR, MoCo-v2). This demonstrates the importance of feature representations specialized for the space in which the generative model operates (VAE latent space).
Classifier-Free Guidance Integration
For class-conditional generation, Classifier-Free Guidance (CFG) can be integrated. Unlike conventional CFG, which is based on score interpolation at inference time, Drifting Models achieve guidance through mixing of negative samples.
The negative distribution is defined as:
\[ \tilde{q}(\cdot | c) := (1 - \gamma) q_\theta(\cdot | c) + \gamma \, p_\text{data}(\cdot | \varnothing) \]
Here, \(c\) is the class label, \(\varnothing\) denotes unconditional, and \(\gamma \in [0, 1)\) is the mixing ratio. When \(\gamma > 0\), unconditional generation samples are mixed into the negative samples, and conditional generation samples repel from them, improving fidelity to the condition.
At inference time, the CFG scale \(\alpha\) can be freely adjusted to control the trade-off between quality and diversity.
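One way the negative-sample mixing could be implemented is to replace each negative with an unconditional data sample with probability \(\gamma\); this sampler is a hedged sketch, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_negatives(n, gen_cond, data_uncond, gamma=0.2):
    """Draw n negatives from q~(.|c) = (1-gamma) q_theta(.|c) + gamma p_data(.|∅).

    gen_cond:    generated samples for class c (from q_theta)
    data_uncond: unconditional data samples (from p_data)
    gamma:       mixing ratio in [0, 1)
    """
    use_uncond = rng.random(n) < gamma          # pick the mixture component per sample
    idx_g = rng.integers(len(gen_cond), size=n)
    idx_d = rng.integers(len(data_uncond), size=n)
    return np.where(use_uncond[:, None], data_uncond[idx_d], gen_cond[idx_g])

gen = rng.standard_normal((128, 2))           # conditional generated samples
uncond = rng.standard_normal((128, 2)) + 3.0  # unconditional data samples
negs = sample_negatives(64, gen, uncond, gamma=0.25)
print(negs.shape)  # (64, 2)
```

With `gamma=0` this reduces to the plain repulsion term over generated samples.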
Experimental Results
ImageNet 256x256
Evaluation is performed in both the latent space (VAE latent space) and pixel space:
Latent space generation:
| Model | Epochs | FID |
|---|---|---|
| B/2 | 100 | 3.36 |
| B/2 | 320 | 2.51 |
| B/2 | 1280 | 1.75 |
| L/2 | 1280 | 1.54 |
Pixel space generation:
| Model | FID |
|---|---|
| B/16 | 1.76 |
| L/16 | 1.61 |
The L/2 model’s FID 1.54 sets a new state of the art for 1-NFE generative models. Furthermore, the achievement of FID 1.61 in pixel space is noteworthy, demonstrating that high-quality generation is possible even without going through latent space.
Comparison with Other Methods
Comparison with other one-step generation methods in pixel space:
- StyleGAN-XL: FID 2.30 (1574G FLOPs)
- Drifting L/16: FID 1.61 (87G FLOPs)
Drifting Models outperform GANs in FID while substantially reducing computational cost.
Scaling Properties
Scaling up from B/2 to L/2 yields a substantial FID improvement from 3.36 to 1.54. Additionally, FID monotonically improves with increasing training epochs (100 to 1280 epochs), suggesting high training stability.
Application to Robot Control
Drifting Models are applicable beyond image generation. Evaluated as an alternative to Diffusion Policy in robot control, they achieve performance equal to or better than 100-NFE diffusion models with 1-NFE:
| Task | Diffusion (100 NFE) | Drifting (1 NFE) |
|---|---|---|
| Lift (state) | 0.98 | 1.00 |
| Can (state) | 0.96 | 0.98 |
| BlockPush Phase 1 | 0.36 | 0.56 |
Notably, Drifting significantly outperforms Diffusion on the BlockPush task, suggesting that the low latency of one-step inference is advantageous for real-time control.
Ablations
Importance of Anti-symmetry
The impact of breaking anti-symmetry is dramatic (B/2, 100 epochs):
| Setting | FID |
|---|---|
| Anti-symmetric (default) | 8.46 |
| 1.5x attraction (excessive increase) | 41.05 |
| 1.5x repulsion (excessive increase) | 46.28 |
Breaking anti-symmetry causes FID to deteriorate from 8.46 to over 41. When the balance between attraction and repulsion is disrupted, samples fail to converge to the data distribution, and generation quality degrades catastrophically. This provides experimental evidence that the equilibrium condition described earlier depends on anti-symmetry.
Number of Positive and Negative Examples
Under a fixed computational budget, increasing the number of positive examples (data samples) and negative examples (generated samples) improves quality. With \(N_\text{pos} = 64\) and \(N_\text{neg} = 64\), FID 8.46 is achieved, and increasing the sample count has the effect of improving the estimation accuracy of the Drifting Field.
Feature Encoder Selection
As shown in the encoder comparison table above, the choice of encoder results in more than a 3x difference in FID. The gap between SimCLR (11.05) and Latent-MAE (3.36) demonstrates that the quality of the feature space determines the effectiveness of kernel-based similarity computation.
Comparison with GANs
Drifting Models and GANs (Generative Adversarial Networks) share superficial similarities:
Similarities:
- Generation in a single step
- Implicit distribution matching (no explicit likelihood computation)
- Generator directly outputs samples
Differences:
- Elimination of adversarial optimization: GANs solve a min-max game between Generator and Discriminator, but Drifting Models convert fixed-point iteration via the Drifting Field into an MSE loss, without adversarial training
- Mode collapse risk: GANs are prone to mode collapse, but in Drifting Models, the repulsion term \(V_q^-\) functions as a mechanism to maintain diversity among generated samples, reducing the risk of mode collapse
- Computational efficiency: While StyleGAN-XL requires 1574G FLOPs, Drifting L/16 requires only 87G FLOPs—approximately 18x more efficient while outperforming in FID
- Training stability: GAN training tends to be unstable, but Drifting Models achieve stable training based on stop-gradient and MSE loss
Comparison with MeanFlow and TVM
Drifting Models aim for the same goal of one-step generation as MeanFlow and TVM, but their approach is fundamentally different.
MeanFlow:
- Within the Flow Matching framework, learns mean velocity instead of instantaneous velocity
- Inherits the ODE structure with time parameter \(t\)
- Connects mean velocity and instantaneous velocity via the MeanFlow Identity
TVM:
- Within the Flow Matching framework, regularizes the velocity field at the terminal time
- Theoretically derives an upper bound between the displacement map and 2-Wasserstein distance
- Requires architecture modifications for Lipschitz continuity
Drifting Models:
- A fundamentally different paradigm from the Flow / Diffusion framework
- Has no time parameter \(t\) (does not use ODE / SDE structure)
- An entirely new formulation based on distribution evolution during training
- Implicit distribution matching through a kernel-based attraction-repulsion mechanism
All three methods were published by Kaiming He’s group at MIT, and one can trace a progressive research trajectory toward the shared goal of one-step generation: from refinement of Flow Matching (MeanFlow to TVM) to the creation of a new paradigm (Drifting).
Summary
Drifting Models is a method that overturns the fundamental premise of “iterative computation at inference time” in generative models. Through the attraction-repulsion mechanism and anti-symmetry of the Drifting Field, the pushforward distribution naturally converges to the data distribution during training, enabling high-quality generation with only a single forward pass at inference time.
FID 1.54 in latent space is exceptionally strong performance for one-step generation, achieved without distillation or pretraining and surpassing both GANs and multi-step diffusion models. The demonstrated application to robot control further illustrates the method's versatility.
Unlike the improvements to Flow Matching (MeanFlow, TVM), Drifting Models presents a new paradigm that completely departs from the ODE/SDE framework, charting a new direction for generative model research.
- Deng, M., Li, H., Li, T., Du, Y., & He, K. (2026). Generative Modeling via Drifting. arXiv:2602.04770. [CC BY 4.0]
- Project page: https://lambertae.github.io/projects/drifting/