Diffusion Models as Generative Particle Dynamics

Diffusion models power most text-to-image systems today (Stable Diffusion¹, Imagen²). At a high level the recipe is simple: start from noise, apply a learned update many times, and you get an image. Under the hood, this is a transport process of “measure”/”mass”.

In an earlier post I covered the SDE formalism. Here I want to build geometric intuition instead: denoising defines a vector field, and generation describes particles following that field.

Brief history. Sohl-Dickstein et al.³ proposed learning the reverse of a noising process. Ho et al.⁴ (DDPM) simplified the loss to noise prediction and matched GANs on image quality. Song & Ermon⁵ and Song et al.⁶ connected the whole thing to score functions and continuous-time SDEs. That unified view opened ODE-based samplers (e.g., Flow Matching⁷⁸⁹) and is the backbone of current systems.

In this post I want to answer a simple question: why does repeatedly applying a small denoising correction produce good samples?

Start from denoising

Let’s start from the denoising problem itself. Take a clean sample $x$ and corrupt it:

\[\tilde{x} = x + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).\]

A denoiser trained on this task learns more than a point estimate of $x$. Tweedie’s identity¹⁰ says:

\[\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}).\]

Rearranging the identity gives:

\[\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2}.\]

The denoising residual is the score (up to $\sigma^2$). So if a denoiser works across noise levels, it is implicitly parameterizing the score field $\nabla_x \log p_\sigma(x)$ over the space¹¹. Milanfar & Delbracio¹² give a useful operator-theoretic interpretation: denoisers behave like gradient fields of smoothed energies.

Now think in particles

Once you view denoising as a score field, the sampling step becomes very concrete. You initialize $N$ particles $x_1, \dots, x_N$ at noise and run Langevin dynamics:

\[x_i \leftarrow x_i + \eta \, s_\theta(x_i, t) + \sqrt{2\eta}\,\xi_i, \qquad \xi_i \sim \mathcal{N}(0, I),\]

where $s_\theta(x,t) \approx \nabla_x \log p_t(x)$ is the learned score at noise level $t$. The noise term keeps exploration alive and helps avoiding mode and point collapse.

Sampling code (click to expand):

Show sampling code

import numpy as np

def langevin_step_euler(x, score_fn, eta):
    """One Euler-Maruyama Langevin step.

    Args:
        x: Particle locations with shape (N, D).
        score_fn: Callable that maps x -> score with shape (N, D).
        eta: Step size.
    """
    grad = score_fn(x)
    noise = np.random.randn(*x.shape)
    return x + eta * grad + np.sqrt(2.0 * eta) * noise

def annealed_langevin(initial_x, score_fn, sigmas, steps_per_sigma=100, eta_factor=0.5, seed=42):
    """Run Langevin updates across a noise schedule.

    Uses eta = eta_factor * sigma**2 at each noise level.
    """
    np.random.seed(seed)
    x = initial_x.copy()
    for sigma in sigmas:
        eta = eta_factor * (sigma ** 2)
        for _ in range(steps_per_sigma):
            x = langevin_step_euler(x, lambda y: score_fn(y, sigma), eta)
    return x

def dummy_score(x, sigma):
    return -x / (1.0 + sigma ** 2)

np.random.seed(42)
particles = np.random.randn(500, 2) * 10.0
sigmas = np.geomspace(10.0, 0.05, num=10)
samples = annealed_langevin(particles, dummy_score, sigmas, steps_per_sigma=100)

Each step has a clear role: the score term moves particles toward higher density, and the noise term keeps exploration alive so the cloud does not collapse too early. Over many steps, the particles travels from the initial noise distribution toward $p_{\mathrm{data}}$.

Franceschi et al.¹³ call this viewpoint “generative particle dynamics.” In that language, the model defines a velocity field on sample space, and generation is just integrating that field. Whether you use Langevin noise, an ODE solver, or something in between is a discretization choice. The underlying object is the same: transport under a learned field.

A toy example: analytic score on a mixture

Before training a score network, it helps to isolate the transport story with an exact score. A Gaussian mixture is perfect for that. Take a $K$-component mixture of Gaussians (MoG) $p(x)=\sum_k \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k)$.

Adding noise $\mathcal{N}(0,\sigma^2 I)$ just adds up on each component’s covariance¹⁴:

\[p_\sigma(x)=\sum_{k=1}^{K} \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k+\sigma^2 I).\]

Appendix: click to expand the convolution derivation

Start from¹⁴

\[\begin{aligned} p_\sigma(x) &= \int p(y)\,\mathcal{N}(x;y,\sigma^2 I)\,dy \\ &= \sum_k \pi_k \int \mathcal{N}(y;\mu_k,\Sigma_k)\,\mathcal{N}(x;y,\sigma^2 I)\,dy. \end{aligned}\]

Fix one component $k$. The product in $y$ is Gaussian after completing the square:

\[\mathcal{N}(y;\mu_k,\Sigma_k)\,\mathcal{N}(x;y,\sigma^2 I) = C_k(x)\,\mathcal{N}(y;m_k,\Lambda_k),\]

with

\[\Lambda_k^{-1}=\Sigma_k^{-1}+\sigma^{-2}I, \qquad m_k=\Lambda_k\big(\Sigma_k^{-1}\mu_k+\sigma^{-2}x\big),\]

and

\[C_k(x)=\mathcal{N}(x;\mu_k,\Sigma_k+\sigma^2 I).\]

Integrate over $y$:

\[\int \mathcal{N}(y;m_k,\Lambda_k)\,dy=1 \quad\Rightarrow\quad \int \mathcal{N}(y;\mu_k,\Sigma_k)\,\mathcal{N}(x;y,\sigma^2 I)\,dy =\mathcal{N}(x;\mu_k,\Sigma_k+\sigma^2 I).\]

Apply this to every mixture term:

$p_\sigma(x)=\sum_k \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k+\sigma^2 I).$

Differentiating $\log p_\sigma$ gives the score in closed form:

\[\nabla_x\log p_\sigma(x) = \sum_k w_k(x)\,S_k^{-1}(\mu_k-x), \qquad w_k(x)=\frac{\pi_k\,\mathcal{N}(x;\mu_k,S_k)}{p_\sigma(x)}, \qquad S_k=\Sigma_k+\sigma^2 I.\]

We can read it as: the score at $x$ is a weighted average of pulls toward each mode mean $\mu_k$, weighted by how likely component $k$ explains $x$ (the posterior responsibility $w_k$), and scaled by that component’s precision $S_k^{-1}$.

At large $\sigma$ the pulls are weak and roughly uniform ($S_k\approx\sigma^2 I$, so precision is tiny). As $\sigma$ shrinks the pulls sharpen into separate basins around modes. Particles follow this field, and injected noise helps them jump across low-density barriers between basins.

In this MoG example, removing the noise term makes particles settle near mode centers and under-sample the lower-density regions between modes.

I use a 6-component mixture: six modes at irregular positions in $[-10, 10]^2$, non-uniform weights ($\pi_0\approx 0.55$ dominates), and anisotropic variances. No symmetry to exploit.

Because the score is available in closed form, there is no network training in this toy setup. That is intentional: we can study transport dynamics directly, without approximation error from a learned model.

The score field at three noise levels:

Score field at decreasing noise levels. At high $\sigma$ the field is nearly uniform (all arrows point inward), and at low $\sigma$ individual modes become visible as separate basins.

With that setup in place, we can run annealed Langevin dynamics: a geometric schedule from $\sigma=10$ down to $\sigma=0.05$, with 100 Langevin steps per level and step size $\eta = 0.5 \cdot \sigma^2$.

500 particles evolving under the analytic score. They start as isotropic noise and progressively concentrate around the six mixture components.

As a sanity check, compare the KDE of final Langevin samples to true samples drawn directly from the mixture:

Langevin samples (left) recover the target density (right). The dominant mode at $(-6.5, -6)$ gets the most mass, matching the mixture weight $\pi_0 \approx 0.55$.

Adding batch interactions

So far, each particle evolves independently. Each $x_i$ only sees the external score $\nabla_x \log p_\sigma(x_i)$, and the rest of the batch does not enter its update.

A natural next step is to let nearby particles interact through a kernel. This is exactly the SVGD idea¹⁵. The deterministic update is

\[\phi_i(x_{1:N}, t) = \frac{1}{N}\sum_{j=1}^{N} \left[ k_h(x_j, x_i)\,s_\theta(x_j, t) + \lambda \, \nabla_{x_j} k_h(x_j, x_i) \right],\]

and then moves particles by

\[x_i \leftarrow x_i + \eta \, \phi_i(x_{1:N}, t).\]

You can read this update as a balance between attraction and diversity. The first term is a kernel-smoothed score that pulls particles toward high-density regions. The second term is the kernel-gradient interaction. Pairwise, its direction depends on which variable you differentiate with respect to, so it is better to think of it as a batch-level spreading effect rather than a literal point-to-point repulsive force. If we also add Langevin noise, we get a stochastic variant:

\[x_i \leftarrow x_i + \eta \, \phi_i(x_{1:N}, t) + \sqrt{2\eta}\,\xi_i.\]

With a radial basis function (RBF) kernel $k_h$ at bandwidth $h>0$, nearby pairs interact strongly and far-away pairs barely interact. In aggregate over the batch, this term counteracts particle collapse and improves coverage. Without this interaction you get something closer to mean shift. With it, you get SVGD plus optional Langevin noise.

For the kernel used in the code,

\[k_h(x_j, x_i) = \exp\!\left(-\frac{\|x_j-x_i\|_2^2}{h^2}\right),\]

the gradient with respect to the first argument is

\[\nabla_{x_j} k_h(x_j, x_i) = -\frac{2}{h^2} k_h(x_j, x_i)(x_j-x_i) = \frac{2}{h^2} k_h(x_j, x_i)(x_i-x_j).\]

The batch effect is easier to see by summing over neighbors:

\[\sum_j k_h(x_j,x_i)(x_i-x_j) = \left(\sum_j k_h(x_j,x_i)\right)\left(x_i-\bar{x}_{k,i}\right), \qquad \bar{x}_{k,i}=\frac{\sum_j k_h(x_j,x_i)x_j}{\sum_j k_h(x_j,x_i)}.\]

So the interaction pulls each particle away from its kernel-weighted local mean $\bar{x}_{k,i}$ when viewed through the full batch update.

So the interaction is just the original kernel value scaled by a displacement and $1/h^2$. Larger bandwidth weakens this spreading effect everywhere, and the exponential still ignores interactions for distant pairs.

Interacting sampling code (click to expand):

Show interacting particle code

import numpy as np

def rbf_kernel(x, bandwidth):
  """Compute an RBF kernel matrix and pairwise offsets.

  Args:
    x: Particle locations with shape (N, D).
    bandwidth: RBF bandwidth h.

  Returns:
    kernel: Shape (N, N), where kernel[i, j] = k(x_i, x_j).
    diff: Shape (N, N, D), where diff[i, j] = x_i - x_j.
  """
  diff = x[:, None, :] - x[None, :, :]
  sqdist = np.sum(diff ** 2, axis=-1)
  kernel = np.exp(-sqdist / (bandwidth ** 2))
  return kernel, diff


def stein_interacting_step(x, score_fn, sigma, eta, bandwidth, repulsion=1.0):
  """One noisy SVGD-style update.

  Args:
    x: Particle locations with shape (N, D).
    score_fn: Callable score_fn(x, sigma) -> array with shape (N, D).
    sigma: Current noise level.
    eta: Step size.
    bandwidth: RBF bandwidth h.
    repulsion: Weight on the kernel-gradient interaction term.
  """
  score = score_fn(x, sigma)
  kernel, diff = rbf_kernel(x, bandwidth)
  n_particles = x.shape[0]

  pooled_score = (kernel @ score) / n_particles
  repulsive_term = (2.0 / (bandwidth ** 2)) * np.sum(kernel[:, :, None] * diff, axis=1)
  repulsive_term /= n_particles

  drift = pooled_score + repulsion * repulsive_term
  noise = np.random.randn(*x.shape)
  return x + eta * drift + np.sqrt(2.0 * eta) * noise


def annealed_interacting_sampler(
  initial_x,
  score_fn,
  sigmas,
  steps_per_sigma=100,
  eta_factor=0.5,
  bandwidth_factor=1.0,
  repulsion=1.0,
  seed=42,
):
  """Run annealed interacting sampling with an SVGD-style drift.

  Uses h = bandwidth_factor * sigma and eta = eta_factor * sigma**2.
  """
  np.random.seed(seed)
  x = initial_x.copy()
  for sigma in sigmas:
    eta = eta_factor * (sigma ** 2)
    bandwidth = bandwidth_factor * sigma
    for _ in range(steps_per_sigma):
      x = stein_interacting_step(
        x,
        score_fn=score_fn,
        sigma=sigma,
        eta=eta,
        bandwidth=bandwidth,
        repulsion=repulsion,
      )
  return x

Are we double-counting with score and kernel gradients?

At first glance, it can feel like double-counting: both the MoG score and the RBF kernel gradient look like an exponential weight times a displacement vector. But they play different roles.

The score term is attraction to high-density regions of the target:

\[\nabla_x \log p_\sigma(x) = \sum_k w_k(x)\,S_k^{-1}(\mu_k-x).\]

It points toward where probability mass lives.

The kernel-gradient term is interaction between particles:

\[\nabla_{x_j} k_h(x_j, x_i) = \frac{2}{h^2}k_h(x_j,x_i)(x_i-x_j).\]

For fixed $(i,j)$, this vector points from $x_j$ toward $x_i$ because the derivative is taken with respect to $x_j$. But in the summed SVGD update for each particle, the net effect is to discourage particle collapse and improve coverage across modes.

You can think of it this way: the score asks, “where should this particle move if it were alone?” The kernel term asks, “given where everyone else already is, how should this finite set rearrange to cover the target better?”

So the two terms are not redundant. One encodes the target density, and the other controls how a finite cloud of particles covers that density. If you remove the score term, particles only spread and never lock onto the right regions. If you remove the interaction term, particles follow the score independently and can pile into the dominant basin.

This also clarifies the Langevin connection. Classical Langevin is

\[x_i \leftarrow x_i + \eta\,\nabla_x\log p(x_i) + \sqrt{2\eta}\,\xi_i.\]

Vanilla SVGD is deterministic (no injected Gaussian noise): it replaces the single-particle drift with a kernel-coupled batch drift. So SVGD is not “Langevin where noise prevents collapse”. In SVGD, anti-collapse comes primarily from the deterministic kernel-gradient interaction. If you add Gaussian noise to SVGD, you get a stochastic SVGD/Langevin-SVGD hybrid.

These distinctions become concrete when we run all three settings side by side on the same imbalanced MoG:

\[\text{(A)}\; x \leftarrow x + \eta\, s(x,\sigma), \qquad \text{(B)}\; x \leftarrow x + \eta\, s(x,\sigma) + \sqrt{2\eta}\,\xi, \qquad \text{(C)}\; x \leftarrow x + \eta\, \phi_{\mathrm{SVGD}}(x_{1:N},\sigma).\]

(A) Score-only: collapses into the dominant basin.

(B) Score + Langevin noise: robust coverage across all modes.

(A) collapses almost entirely into the dominant basin — score-only drift has no mechanism to maintain diversity. (B) gives good coverage out of the box, reaching TV $= 0.0037$ to the target mode weights. (C) can also reach TV $= 0.037$, but only with a broader kernel and stronger repulsion. With weak repulsion it still collapses, because the kernel interaction is local and cannot move particles back across basins once mass has pooled. Noise and repulsion both encourage exploration, but differently: noise is stochastic and global, while the kernel interaction is deterministic and local to nearby particles.

Three ways to evolve the same batch on the 6-mode MoG. (A) Score-only particles collapse into the dominant basin. (B) Score plus Langevin noise covers all modes robustly. (C) Deterministic SVGD without noise can also cover the weaker modes when the kernel interaction is made strong enough.

Why this framing helps

The particle picture strips diffusion models down to mechanics: the score is a vector field, sampling is integration, and the noise schedule controls how far particles can jump between basins. You do not need the full SDE formalism to build good intuition for what happens during generation.

The practical recipe is still short: corrupt data at many noise levels, train a denoiser (which is a score estimator in disguise), initialize particles from noise, and iterate the learned update. Independent Langevin is the simplest version. If you want particles to coordinate within a batch, kernel-coupled updates (SVGD-style¹⁵) give a natural extension, at the cost of an extra bandwidth parameter.

References

Rombach et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. ↩
Saharia et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS 2022. ↩
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. ↩
Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. ↩
Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. ↩
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. ↩
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M. & Le, M. (2022). Flow Matching for Generative Modeling. ICLR 2023. ↩
Albergo, M.S. & Vanden-Eijnden, E. (2022). Building Normalizing Flows with Stochastic Interpolants. ICLR 2023. ↩
Liu, X., Gong, C. & Liu, Q. (2022). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR 2023. ↩
Robbins, H. (1956). An Empirical Bayes Approach to Statistics. Proceedings of the Third Berkeley Symposium. Historical source for the empirical Bayes normal-means identity that later became known as Tweedie’s formula. For explicit modern derivations, see Efron (2011) and Luo (2022, Sec. 4.3). ↩
Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. JMLR 2005. ↩
Milanfar, P. & Delbracio, M. (2024). Denoising: A Powerful Building-Block for Imaging, Inverse Problems, and Machine Learning. arXiv:2409.06219. ↩
Franceschi, J.-Y., de Bézenac, E., Ayed, I., Chen, M., Gallinari, P. & Ollivier, Y. (2024). Unifying GANs and Score-Based Diffusion as Generative Particle Models. ICLR 2024. ↩
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. See Section 2.3 (Gaussian identities: products and sums/convolutions). ↩ ↩²
Liu, Q. & Wang, D. (2016). Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. NeurIPS 2016. ↩ ↩²