```mermaid
flowchart LR
    X0["X₀<br/>(noise)"] -->|"p^θ_{1|0}"| X1["X₁"]
    X1 -->|"p^θ_{2|1}"| X2["X₂"]
    X2 -->|"..."| XT["X_T<br/>(data)"]
    style X0 fill:#f9f,stroke:#333
    style XT fill:#9f9,stroke:#333
```
Transition Matching: A Unified Framework via Discrete-Time Markov Transitions
Background and Motivation
Flow Matching is a generative model based on continuous-time deterministic ODEs that learns smooth trajectories from noise to data. However, this formulation has several limitations:
- Dependence on continuous time: Since actual inference discretizes using methods like the Euler method, a trade-off between the number of steps and accuracy is unavoidable
- Deterministic trajectories: Only a single trajectory exists for each noise point, precluding stochastic exploration
- Disconnection from other paradigms: It occupies a different framework from diffusion models (stochastic) and autoregressive models (discrete, causal)
Shaul et al. (2025) proposed Transition Matching to simultaneously resolve these limitations through a unified framework. The core question is:
Can diffusion models, Flow Matching, and autoregressive models be unified as discrete-time Markov transitions?
Transition Matching answers this question affirmatively. By formulating the generative process as a sequence of stochastic transition kernels, it achieves a flexible framework that encompasses all three paradigms.
General Framework of Transition Matching
Generation via Markov Transition Kernels
Transition Matching formulates the generative process as a sequence of Markov transition kernels at discrete times \(t = 0, 1, \ldots, T\). Specifically, the generative process to be learned takes the following form:
\[ p^\theta(X_0, X_1, \ldots, X_T) = p(X_0) \prod_{t=0}^{T-1} p^\theta_{t+1|t}(X_{t+1} | X_t) \tag{1}\]
Here, \(p(X_0)\) is the noise distribution (e.g., standard normal distribution) and \(p^\theta_{t+1|t}\) is the transition kernel parameterized by learnable parameters \(\theta\). Generation starts from \(X_0 \sim p(X_0)\) and sequentially samples \(X_{t+1} \sim p^\theta_{t+1|t}(\cdot | X_t)\).
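The ancestral sampling implied by Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_kernel` is a hypothetical stand-in for a learned transition kernel \(p^\theta_{t+1|t}\).

```python
import numpy as np

def sample_chain(transition_kernels, dim, rng=None):
    """Ancestral sampling of Eq. (1): draw X_0 ~ p(X_0) = N(0, I),
    then sample X_{t+1} ~ p^theta_{t+1|t}(. | X_t) for t = 0..T-1."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(dim)       # X_0 from the noise distribution
    trajectory = [x]
    for kernel in transition_kernels:  # one learned kernel per time step
        x = kernel(x, rng)             # one Markov transition
        trajectory.append(x)
    return trajectory                  # [X_0, X_1, ..., X_T]

def toy_kernel(x, rng):
    """Hypothetical stand-in for a learned stochastic transition."""
    return 0.9 * x + 0.1 * rng.standard_normal(x.shape)

traj = sample_chain([toy_kernel] * 4, dim=2, rng=0)  # T = 4 steps
```

Only the chain structure matters here: each step consumes the current state alone, which is exactly the Markov property that DTM and ARTM assume (and that FHTM later relaxes).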
Supervising Process
The learning objective is to imitate a supervising process \(q\) whose terminal marginal \(q_T\) is the data distribution. The supervising process is defined by the following joint distribution:
\[ q(X_0, X_1, \ldots, X_T) = p(X_0) \prod_{t=0}^{T-1} q_{t+1|t}(X_{t+1} | X_t) \]
This supervising process is typically constructed from an interpolation path connecting noise and data (e.g., linear interpolation \(X_t = (1-\alpha_t) X_0 + \alpha_t X_T\)).
Loss Function
Learning is performed by minimizing the divergence between the supervising process transition kernels and the model’s transition kernels at each time step:
\[ \mathcal{L}(\theta) = \sum_{t=0}^{T-1} \mathbb{E}_{q(X_t)} \left[ D\left( q_{t+1|t}(\cdot | X_t) \,\|\, p^\theta_{t+1|t}(\cdot | X_t) \right) \right] \tag{2}\]
Here, \(D\) is a divergence measure between probability distributions (such as KL divergence). A crucial point is that the transition kernels at each time step can be matched independently, which is the source of the framework’s flexibility.
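To make the per-step loss concrete: when both kernels are Gaussian with the same fixed variance \(\sigma^2\), the KL term in Eq. (2) reduces exactly to \(\|\mu_q - \mu_\theta\|^2 / (2\sigma^2)\), so kernel matching becomes mean regression. The sketch below assumes this Gaussian special case; the function name is illustrative, not from the paper.

```python
import numpy as np

def per_step_kl(q_mean, model_mean, sigma):
    """KL( N(q_mean, s^2 I) || N(model_mean, s^2 I) ) for shared variance:
    the closed form is ||q_mean - model_mean||^2 / (2 s^2), i.e. one term
    of Eq. (2) turns into a squared-error regression on the kernel mean."""
    return np.sum((q_mean - model_mean) ** 2) / (2 * sigma ** 2)

# Example: unit-distance means in 3 dimensions, unit variance.
loss = per_step_kl(np.ones(3), np.zeros(3), sigma=1.0)
```

Because each time step contributes its own independent term, the steps can be trained in any order or in parallel, which is the flexibility the text refers to.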
Kernel Parameterization via Latent Variables
Directly approximating the supervising process transition kernel \(q_{t+1|t}\) can be difficult in some cases. Transition Matching introduces a latent variable \(Y\) to parameterize the kernel:
\[ q_{t+1|t}(X_{t+1} | X_t) = \int q_{t+1|t,Y}(X_{t+1} | X_t, Y) \, q_{Y|t}(Y | X_t) \, dY \]
Different choices of latent variable \(Y\) give rise to different variants. This design freedom underpins the versatility of Transition Matching.
- Supervising process: Target \(q_{t+1|t}(X_{t+1} | X_t)\)
- Learning: Imitate with \(p^\theta_{t+1|t}(X_{t+1} | X_t)\)
- Loss: Minimize \(D(q_{t+1|t} \| p^\theta_{t+1|t})\) at each time step independently
DTM: Difference Transition Matching
Formulation
DTM (Difference Transition Matching) is the most fundamental variant of Transition Matching and serves as a natural discrete-time generalization of Flow Matching.
It adopts the difference \(Y = X_T - X_0\) (the difference between data and noise) as the latent variable. This means that at each transition step, the model predicts the “direction” from noise to data. The transition kernel is defined as:
\[ p^\theta_{t+1|t}(X_{t+1} | X_t) = \mathcal{N}\left(X_{t+1}; X_t + (\alpha_{t+1} - \alpha_t) f_\theta(X_t, t), \sigma_t^2 I \right) \]
Here, \(f_\theta(X_t, t)\) is a neural network that predicts the difference (the direction of \(X_T - X_0\)), \(\alpha_t\) is the interpolation schedule parameter, and \(\sigma_t\) is the magnitude of stochastic noise.
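One DTM sampling step follows directly from the Gaussian kernel above. This is a minimal sketch under the stated parameterization; `f_theta` here is a hypothetical placeholder for the trained difference predictor.

```python
import numpy as np

def dtm_transition(f_theta, x_t, t, alphas, sigmas, rng=None):
    """One DTM step:
    X_{t+1} ~ N( X_t + (a_{t+1} - a_t) * f_theta(X_t, t), sigma_t^2 I ),
    where f_theta predicts the difference direction X_T - X_0."""
    rng = np.random.default_rng(rng)
    mean = x_t + (alphas[t + 1] - alphas[t]) * f_theta(x_t, t)
    return mean + sigmas[t] * rng.standard_normal(x_t.shape)

# Toy run: with sigma_t = 0 the step collapses to a deterministic,
# Euler-like update -- the Flow Matching connection made precise below.
alphas = np.linspace(0.0, 1.0, 5)   # interpolation schedule, T = 4
sigmas = np.zeros(4)
x_next = dtm_transition(lambda x, t: np.ones(3), np.zeros(3),
                        t=0, alphas=alphas, sigmas=sigmas)
```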
Theoretical Relationship with Flow Matching
An important theorem establishing the relationship between DTM and FM has been proven.
Theorem: The expected value of one DTM step coincides with a Flow Matching Euler step. That is:
\[ \mathbb{E}\left[X_{t+1} \mid X_t\right] = X_t + (\alpha_{t+1} - \alpha_t)\, f_\theta(X_t, t) \]
This is structurally identical to the FM Euler discretization:
\[ z_{t+\Delta t} = z_t + \Delta t \cdot v_\theta(z_t, t) \]
Furthermore, in the limit \(T \to \infty\) (infinitely many time steps), the Euler discretization error vanishes and DTM converges to the continuous-time Flow Matching process itself.
This theorem has two important implications:
- DTM is theoretically justified as a rigorous discrete-time version of FM
- At finite steps, DTM has an additional stochastic noise term \(\sigma_t\) compared to FM
Additionally, the paper provides a new elementary proof of FM’s marginal velocity field. While the original FM formulation required somewhat complex arguments for proving the existence and uniqueness of the marginal velocity field, the Transition Matching framework yields a more direct and transparent proof.
Backbone-Head Architecture
The practical implementation of DTM adopts the Backbone-Head architecture. This is a critically important design from the perspective of computational efficiency.
```mermaid
flowchart TB
    Input["Input X₀"] --> Backbone["Backbone (heavy)<br/>e.g., UNet, DiT<br/>(run once per sample)"]
    Backbone -->|"shared features"| H1["Head 1<br/>(t=1)"]
    Backbone -->|"shared features"| H2["Head 2<br/>(t=2)"]
    Backbone -->|"shared features"| H3["Head 3<br/>(t=3)"]
    Backbone -->|"shared features"| HT["Head T<br/>(t=T)"]
    style Backbone fill:#ffd,stroke:#333
    style H1 fill:#dff,stroke:#333
    style H2 fill:#dff,stroke:#333
    style H3 fill:#dff,stroke:#333
    style HT fill:#dff,stroke:#333
```
Backbone forward pass count is reduced from 128 (FM) to 16 (DTM), an 8x reduction in backbone compute.
The key points of this architecture are:
- Backbone: A heavy network such as UNet or DiT that extracts common features from the input. It is run only once per sample
- Head: Lightweight networks specialized for each time step \(t\) that predict transitions from the Backbone’s output
- Speedup: While conventional FM requires 128 Backbone forward passes, DTM requires only 16, an 8x reduction in backbone compute
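The compute pattern in the bullet points above can be sketched as follows. This is an illustrative skeleton only: `backbone` and `make_head` are toy stand-ins, not the paper's networks, and what matters is that the heavy function runs once while the cheap per-step heads reuse its output.

```python
import numpy as np

def backbone(x):
    """Stand-in for the heavy feature extractor (e.g., UNet or DiT).
    In DTM this is evaluated once and its features are shared."""
    return np.tanh(x)

def make_head(t):
    """Stand-in for a lightweight head specialized to time step t,
    predicting the transition from the shared backbone features."""
    def head(features):
        return (t + 1) * features  # toy per-step computation
    return head

x0 = np.ones(4)
features = backbone(x0)                      # 1 heavy forward pass
heads = [make_head(t) for t in range(4)]
predictions = [h(features) for h in heads]   # 4 cheap forward passes
```

Contrast this with plain FM sampling, where every one of the (e.g., 128) steps pays the full backbone cost.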
ARTM: Autoregressive Transition Matching
ARTM (Autoregressive Transition Matching) is a variant that incorporates the structure of autoregressive models into Transition Matching.
Independent Linear Processes and Causal Structure
In ARTM, an independent linear process is defined for each token position \(i\):
\[ X_t^{(i)} = (1 - \alpha_t) X_0^{(i)} + \alpha_t X_T^{(i)} \]
Here, \(X_t^{(i)}\) is the state at time \(t\) for token position \(i\). The crucial point is that this process has a causal structure. That is, the transition at position \(i\) depends only on information from positions \(1, \ldots, i-1\).
This naturally integrates the left-to-right sequential generation structure of autoregressive models with the noise-to-data transformation structure of Flow Matching. The velocity at each token position is learned independently, and causal masks control the flow of information.
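The two ingredients of ARTM, per-position linear processes and a causal mask, are easy to write down. The sketch below uses the standard lower-triangular attention mask (position \(i\) sees positions \(\le i\)); a strictly-causal variant as described in the text would drop the diagonal. Function names are illustrative.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: entry (i, j) is True iff position i
    may read from position j <= i, enforcing left-to-right causality."""
    return np.tril(np.ones((n, n), dtype=bool))

def interpolate_tokens(x0, xT, alpha_t):
    """Independent linear process applied per token position i:
    X_t^(i) = (1 - alpha_t) * X_0^(i) + alpha_t * X_T^(i)."""
    return (1 - alpha_t) * x0 + alpha_t * xT

mask = causal_mask(4)
x_mid = interpolate_tokens(np.zeros(3), np.ones(3), alpha_t=0.5)
```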
Relationship with Autoregressive Models
ARTM can be interpreted as an extension of discrete-token autoregressive generation to continuous space. When the number of tokens is fixed to 1 and the number of steps is \(T=1\), it degenerates to a single step of a standard autoregressive model.
FHTM: Full History Transition Matching
FHTM (Full History Transition Matching) is the most expressive variant and occupies an important position in integration with LLM architectures.
Access to Full History
While DTM and ARTM predict the next state based only on the current state \(X_t\), FHTM has access to the full history \(X_0, X_1, \ldots, X_t\):
\[ p^\theta_{t+1|0:t}(X_{t+1} | X_0, X_1, \ldots, X_t) \]
This means abandoning the Markov property in exchange for utilizing richer contextual information.
Training with Teacher-Forcing
FHTM training uses teacher-forcing. This is the standard training technique for autoregressive models in natural language processing, where the true history from the supervising process is used as input during training:
\[ \mathcal{L}_{\text{FHTM}}(\theta) = \sum_{t=0}^{T-1} \mathbb{E}_{q(X_0, \ldots, X_t)} \left[ D\left( q_{t+1|0:t}(\cdot | X_0, \ldots, X_t) \,\|\, p^\theta_{t+1|0:t}(\cdot | X_0, \ldots, X_t) \right) \right] \]
Teacher-forcing ensures efficient and stable training. At inference time, the model uses its own generated history sequentially.
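A teacher-forced FHTM training step can be sketched as below. This is a hedged illustration assuming the Gaussian-kernel special case where the divergence reduces to a squared error on the predicted mean; `model` is a hypothetical stand-in for the full-history predictor \(p^\theta_{t+1|0:t}\).

```python
import numpy as np

def teacher_forced_loss(model, history):
    """FHTM training sketch: the ground-truth history X_0..X_t from the
    supervising process (not the model's own samples) is fed as the prefix,
    and the model predicts X_{t+1} from the full prefix at every step."""
    loss = 0.0
    for t in range(len(history) - 1):
        prefix = history[: t + 1]       # true X_0, ..., X_t
        pred = model(prefix)            # mean of p^theta_{t+1|0:t}
        loss += np.mean((pred - history[t + 1]) ** 2)
    return loss / (len(history) - 1)

# Oracle model that copies the last state: zero loss on a constant chain.
history = [np.ones(2)] * 4
loss = teacher_forced_loss(lambda prefix: prefix[-1], history)
```

As in language modeling, every time step is supervised in one pass over the true trajectory; only at inference time does the model condition on its own generated history.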
Innovation as a Fully Causal Model
The most notable achievement of FHTM is that it was the first fully causal model to surpass Flow Matching. Previously, generative models with causal structures were thought to be inferior to bidirectional models. FHTM overturned this conventional wisdom, demonstrating that leveraging full history information more than compensates for the constraints of causal structure.
FHTM can be directly implemented with standard LLM architectures (Transformer decoder). The reasons are as follows:
- Causal mask: FHTM’s causal structure perfectly aligns with the autoregressive causal mask of LLMs
- Teacher-forcing: The standard LLM training technique can be applied directly
- Representation as token sequences: The state \(X_t\) at each time step can be treated as a token and processed as a temporal sequence
This compatibility opens the possibility of seamlessly integrating text and image generation. For example, after autoregressive generation of text tokens, the same model with the same architecture could perform step-by-step refinement of images. The practical advantage of being able to directly leverage the massive LLM ecosystem (optimization methods, inference engines, hardware support) is also significant.
Experimental Results
DTM Image Generation Performance
DTM was trained on 350M Shutterstock image-text pairs and evaluated on text-conditioned image generation. The evaluation metrics cover both image quality and prompt alignment.
| Metric | DTM (16 steps) | FM (128 steps) | Notes |
|---|---|---|---|
| CLIPScore | Surpasses | Baseline | Text-image alignment |
| PickScore | Surpasses | Baseline | Human preference-based evaluation |
| ImageReward | Surpasses | Baseline | Reward model score |
| Aesthetics | Surpasses | Baseline | Aesthetic quality |
| Backbone forwards | 16 | 128 | 8x fewer passes |
DTM surpasses 128-step FM across all metrics with only 16 Backbone forward passes. This demonstrates the effectiveness of the Backbone-Head architecture.
FHTM Performance
FHTM as a fully causal model showed noteworthy results in the following respects:
- Surpassing FM: The first instance of a causal model outperforming a bidirectional model
- LLM architecture: Demonstrated implementability with a standard Transformer decoder
- Effectiveness of teacher-forcing: Confirmed training stability and efficiency
Comparison of the Three Variants
Summarizing the positioning of each variant:
| Variant | Latent Variable \(Y\) | Structure | Key Advantage |
|---|---|---|---|
| DTM | \(X_T - X_0\) (difference) | Markov | Discrete-time version of FM, 8x fewer backbone passes |
| ARTM | Independent linear processes | Causal | Bridge between AR models and FM |
| FHTM | Full history | Fully causal | First to surpass FM, LLM-compatible |
Significance and Positioning
The contribution of Transition Matching extends beyond individual performance improvements. This framework provides a unified perspective for treating three generative model paradigms that have previously developed independently:
- Diffusion models: Can be expressed as stochastic transition kernels
- Flow Matching: Recoverable as the \(T \to \infty\) limit of DTM
- Autoregressive models: Subsumed as special cases of ARTM and FHTM
This unification not only deepens theoretical understanding but also opens new design spaces in practice. For example, it becomes possible to mix stochastic and deterministic transitions, or to adopt causal structure for some steps and bidirectional structure for others.
The fact that FHTM is compatible with LLM architectures is particularly significant as an important step toward unified text and image generation. In the context of one-step generation, it occupies a complementary position to the other methods discussed in the main document, approaching the shared goal of high-quality generation with fewer steps from a unique angle of discrete-time, stochastic formulation.
As a theoretical byproduct of Transition Matching, a new elementary proof of Flow Matching’s marginal velocity field has been obtained.
In the original Flow Matching formulation, deriving the marginal velocity field \(u_t(x)\) from the conditional velocity field \(u_t(x | x_1)\) required going through the theory of probability flows or the continuity equation.
In the Transition Matching framework, starting from discrete-time transition kernels and taking the \(T \to \infty\) limit directly yields the existence and expression of the marginal velocity field. Specifically:
\[ u_t(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \mathbb{E}\left[X_{t+\Delta t} - X_t \mid X_t = x\right] \]
This proof provides the perspective of understanding continuous-time FM as the limit of discrete time, placing the theoretical foundations of Flow Matching on firmer ground.
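For the linear interpolation path \(X_t = (1-\alpha_t) X_0 + \alpha_t X_T\) with a differentiable schedule \(\alpha\), the limit above can be evaluated in one line (a sketch, under these assumptions):

\[
X_{t+\Delta t} - X_t = (\alpha_{t+\Delta t} - \alpha_t)(X_T - X_0)
\quad\Longrightarrow\quad
u_t(x) = \dot\alpha_t \, \mathbb{E}\left[ X_T - X_0 \mid X_t = x \right]
\]

which is precisely the conditional-expectation form of the marginal velocity field: the average of the per-sample velocities \(\dot\alpha_t (X_T - X_0)\) over all noise-data pairs whose path passes through \(x\) at time \(t\).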