Terminology

| Term | Meaning |
| --- | --- |
| Diffusion Models | models that can sample from a highly complex probability distribution (e.g. images of cars) |
| Non-equilibrium thermodynamics | a branch of thermodynamics that deals with physical systems that are not in thermodynamic equilibrium, where "there are no net macroscopic flows of matter nor of energy within a system or between systems". It is often used by diffusion models as a technique to sample from a distribution |
| Diffusion | the net movement of anything, generally from a region of higher concentration to a region of lower concentration. Also a technique of non-equilibrium thermodynamics |
| DDPM | a model that improves the performance of diffusion models via variational inference |
| DDIM | a generalized version of DDPM with much faster sampling, at some cost in diversity and quality |
| Jensen's inequality | for a convex function $f$: $f(\sum\limits a_{i}x_{i}) \le \sum\limits a_{i}f(x_{i})$, where $a_{i} \ge 0, \sum\limits a_{i} = 1$. In other words, the expected value of a convex function is $\ge$ the value of the function at the expected input |
| Variational Lower Bound (VLB, a.k.a. Evidence Lower BOund, ELBO) | an easy-to-train lower bound of the log-likelihood, derived by using a tractable distribution $q$ (hence "variational") to approximate an intractable posterior |
| Jacobian matrix | the matrix of all first-order partial derivatives of a vector-valued function. For $f: \mathbb{R}^{n} \to \mathbb{R}^{m}$: $J = \left[\frac{\partial{f}}{\partial{x_{1}}} \cdots \frac{\partial{f}}{\partial{x_{n}}}\right]$ |

Notations

| Notation | Meaning |
| --- | --- |
| $x_{0}$ | the original data point ($T$ denotes the total number of timesteps) |
| $x_{t}$ | the data after applying $t$ iterations of the forward process |
| $\epsilon_t$ | the (standard Gaussian) noise at timestep $t$ |
| $\epsilon_{\theta}(x_t,t)$ | our model to predict the noise at each timestep |
| $\mu_{\theta}(x_t, t)$ | parameterized model to predict $x_{t-1}$ at time $t$ |
| $p(x_{0:T})$ | the joint distribution of $x_{0}, x_{1}, \dots, x_{T}$ |

Introduction

This article will introduce the definitions of DDPM and DDIM.

As stated earlier, the work of DDIM is based on DDPM.

DDPM

Diffusion models typically involve modeling two processes:

  • forward process: data point ($x_{0}$) to noise ($x_{T}$)
  • reverse process: noise to data point, the reversal of the forward process

Forward Process

As an improvement of diffusion models, DDPM models the forward process as:

$$ \begin{equation} x_{t} = \alpha_{t} x_{t-1} + \beta_{t}\epsilon_{t}, \quad \epsilon_{t} \sim \mathcal{N}(0,I), \quad 1 \le t \le T, t \in \mathbb{Z} \end{equation} $$

where $\alpha_{t}, \beta_{t} > 0$ and $\alpha_{t}^{2}+ \beta_{t}^{2} = 1$. This can be viewed as:

  • the remains from the previous data: $\alpha_{t} x_{t-1}$
  • the destruction by introducing noise $\epsilon_{t}$

Accordingly, the conditional probability of $x_t$ would be: $$ p(x_{t}| x_{t-1}) = \mathcal{N}(x_{t};\alpha_{t}x_{t-1}, \beta_{t}^{2}I) $$

[!NOTE]

  • All $\alpha, \beta, T$ are constants
  • Clearly, this is a Markov process: $x_{t}$ depends only on $x_{t-1}$
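
As a quick illustration, here is a minimal sketch of one forward step in PyTorch (the function name and tensor interface are ours, not from the paper):

```python
import torch

def forward_step(x_prev, alpha_t, beta_t):
    """One forward step: x_t = alpha_t * x_{t-1} + beta_t * eps, with alpha_t^2 + beta_t^2 = 1."""
    eps = torch.randn_like(x_prev)  # standard Gaussian noise epsilon_t
    return alpha_t * x_prev + beta_t * eps
```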

Reverse Process

By applying the forward process $T$ times, we obtain $T$ pairs of $(x_{t-1}, x_{t})$. This is our training data.

Reversing the forward process, the task of the reverse process is: learn how to get $x_0$ from $x_{T}$, formally $x_{T} \to x_{0}$.

DDPM splits this process into $T$ steps of $x_t \to x_{t-1}$.

[!TIP] The methodology of DDPM: DDPM is a likelihood-based model.

In the paper, they model each single step as a gaussian transition: $$ p_{\theta}(x_{t-1}|x_{t}) = \mathcal{N}(x_{t-1};\mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_{t},t)) $$ where:

  • $\mu_{\theta}(x_{t}, t)$: the mean value
  • $\Sigma_{\theta}(x_{t},t)$: the variance predictor (of reverse process).

[!NOTE]

  • The noise is not Gaussian noise multiplied by a factor, but predicted directly.
  • Not to be confused:
    • $\Sigma_{\theta}(x_{t},t)$: the variance of the noise injected in the reverse process; it contributes diversity to the reverse process
    • $\epsilon_\theta(x_{t},t) \to \epsilon_{t}$: the noise predictor (of the forward process)

Mean Value Predictor: $\mu_{\theta}(x_{t}, t)$

In order to model $\mu_{\theta}(x_{t}, t)$, from Bayes' theorem we have: $$ p(x_{t-1}| x_{t},x_0)= \frac{p(x_t|x_{t-1}) p(x_{t-1} | x_0)}{p(x_{t}|x_{0}) } $$

The process of induction would be:

  1. Predict $x_t$ and $x_{t-1}$ from $x_0$
  2. Replace $x_{0}$ in Equation 4 with its estimate from $x_{t}$

Predict $x_{0}$ with $x_{t}$

Applying the forward process $p(x_{t}|x_{t-1})$ $t$ times, we can rewrite $x_{t}$ as:

$$ x_{t} = \bar \alpha_{t} x_{0} + \bar \beta_{t}\epsilon_{t}, \quad \bar \alpha_{t} = \prod_{i=1}^{t} \alpha_{i}, \quad \bar \beta_{t} = \sqrt{1-\bar \alpha_{t}^{2}} $$ And the probability version: $$ p(x_t|x_{0}) = \mathcal{N}(x_{t}; \bar \alpha_{t} x_{0}, \bar \beta_{t} ^ {2}I) $$

Now that we have $x_t$ in terms of $x_0$, we can update Eq 4 (since every factor is Gaussian, the result is Gaussian as well):

$$ p(x_{t-1}| x_{t},x_{0}) = \mathcal{N}\left(x_{t-1}; \underbrace{\frac{\alpha_{t}\bar \beta_{t-1}^{2}}{\bar \beta_{t}^{2}}x_{t} + \frac{\bar \alpha_{t-1}\beta_{t}^{2}}{\bar \beta_{t}^{2}}x_{0}}_{\text{$\tilde \mu_t(x_{t}, x_{0})$}},\frac{\bar \beta_{t-1}^{2}\beta_{t}^{2}}{\bar \beta_{t}^{2}}I\right) $$

Let’s define the predicted mean value of $x_{t-1}$ as $\tilde \mu_t(x_{t}, x_{0}) = \frac{\alpha_{t}\bar \beta_{t-1}^{2}}{\bar \beta_{t}^{2}}x_{t} + \frac{\bar \alpha_{t-1}\beta_{t}^{2}}{\bar \beta_{t}^{2}}x_{0}$.

Notice the meaning of it: with Bayes' theorem, using $x_{t}$ and $x_{0}$, we can derive the mean of $x_{t-1}$.

So naturally, we can make our $\mu_{\theta}(x_{t}, t)$, which should produce the same estimate as $\tilde \mu_t(x_{t}, x_{0})$, learn to match it: $$ \mu_{\theta}(x_{t}, t) = \tilde \mu_t(x_{t}, x_{0}) $$

[!NOTE] Different ways of modeling $\mu_\theta$ are also acceptable; this is just one that happens to work well.

However, we don’t have $x_0$ to pass to $\tilde \mu_t(x_{t}, x_{0})$. Luckily, we can predict $x_0$ from rewriting Equation 7: $$ x_{0}= \frac{x_{t} - \sqrt{1- \bar \alpha_{t}}}{\sqrt{\bar \alpha_{t}}}\epsilon_{t} $$

[!TIP] This is actually an embodiment of the predictor–corrector method

Since we don’t have $\epsilon$ in the reverse process, we can make a neural work to learn it: $\epsilon_\theta(x_t,t) \to \epsilon_t$ : $$ x_{0}= \frac{x_{t} - \sqrt{1- \bar \alpha_{t}}}{\sqrt{\bar \alpha_{t}}}\epsilon_{\theta}(x_{t}, t) $$

Updating Eq 10: $$ \mu_{\theta}(x_{t}, t) = \tilde \mu_t(x_{t}, x_{0}) = \tilde \mu_t\left(x_{t}, \frac{x_{t} - \bar \beta_{t}\epsilon_{\theta}(x_{t}, t)}{\bar \alpha_{t}}\right)= \frac{1}{\alpha_{t}}\left(x_{t} - \frac{\beta_{t}^2}{\bar\beta_{t}}\epsilon_{\theta}(x_{t}, t)\right) $$

Reverse Noise Predictor: $\Sigma_{\theta}(x_{t},t)$

It still remains to design $\Sigma_{\theta}(x_{t},t)$, which controls the diversity of the samples.

The DDPM paper suggested not learning it, since learning it:

resulted in unstable training and poorer sample quality

Instead, it is fixed at $\Sigma_{\theta}(x_{t},t) = \sigma_{t}^{2}I$, where either $\sigma_{t}^{2} = \beta_{t}$ or $\sigma_{t}^{2} = \tilde{\beta}_{t}$ yielded similar performance.

Training & Defining Loss

In conclusion, we have defined only one trainable model: $\epsilon_{\theta}(x_t, t)$

To reconstruct $x_{0}$

The training target can be MLE, with the objective function being the log-likelihood of reconstructing $x_{0}$:

$$ \begin{align*} \ln p(x_{0}) &= \ln \int{p(x_{0:T})dx_{1:T}} & \text{marginalization over $x_{1:T}$}\\\\ &= \ln \int q(x_{1:T}|x_{0})\frac{p(x_{0:T})}{q(x_{1:T}|x_{0})}dx_{1:T}\\\\ &= \ln \mathbb{E}_{q(x_{1:T}|x_{0})}\left[\frac{p(x_{0:T})}{q(x_{1:T}|x_{0})}\right]\\\\ &\ge \mathbb{E}_{q(x_{1:T}|x_{0})}\left[\ln \frac{p(x_{0:T})}{q(x_{1:T}|x_{0})}\right] & \text{Jensen's inequality for $\ln$}\\\\ &= \underbrace{\mathbb{E}_{q(x_{1}|x_{0})}\left[\ln p_{\theta}({x_{0}|x_{1}})\right]}_{\text{reconstruction term}} - \sum_{t = 2}^{T}\mathbb{E}_{q(x_{t}|x_{0})}\left[D_{KL}(q(x_{t-1}|x_{t},x_{0})||p_{\theta}(x_{t-1}|x_{t}))\right]\\\\ &= -\sum\limits_{t=1}^{T} \gamma_{t}\, \mathbb{E}_{q(x_{t}|x_{0})}[||\epsilon_{t} - \epsilon_{\theta}(x_{t}, t) || ^{2}] + \text{const} & \text{$\gamma_{t}$ being some constants} \end{align*} $$
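
To make this concrete, below is a minimal PyTorch sketch of the training objective, assuming `model` is the noise predictor $\epsilon_{\theta}(x_t, t)$ and `alpha_bar` is a precomputed tensor of $\bar\alpha_t$ (the $\gamma_t$ weights are dropped, as is common in practice):

```python
import torch

def ddpm_loss(model, x0, alpha_bar):
    """Simplified DDPM objective: sample a random t, noise x0 into x_t, predict the noise.
    alpha_bar[t] is the cumulative product of alpha_i (this article's notation), so
    x_t = alpha_bar[t] * x0 + sqrt(1 - alpha_bar[t]^2) * eps."""
    t = torch.randint(1, len(alpha_bar), (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *[1] * (x0.dim() - 1))  # broadcast over data dims
    x_t = a * x0 + torch.sqrt(1 - a ** 2) * eps
    return ((eps - model(x_t, t)) ** 2).mean()
```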

To optimize the pixels

We design the loss function of $\theta$ as the squared Euclidean distance between the true and predicted mean of $x_{t-1}$, using the one-step relation $x_{t-1} = (x_{t} - \beta_{t}\epsilon_{t})/\alpha_{t}$: $$ \begin{align*} \ell &= ||x_{t-1} - \hat x_{t-1}|| ^ 2 \newline &= ||x_{t-1} - \mu_\theta(x_{t},t)|| ^ 2 \newline &= \left|\left|\frac{1}{\alpha_{t}}(x_{t} - \beta_{t}\epsilon_{t}) - \frac{1}{\alpha_{t}}(x_{t} - \beta_{t}\epsilon_{\theta}(x_{t}, t))\right|\right| ^ {2}\newline &= \frac{\beta_{t}^{2}}{\alpha_{t}^{2}} ||\epsilon_{t} - \epsilon_{\theta}(x_{t}, t) || ^2 \end{align*} $$

DDIM

While the original DDPM is capable of generating satisfying images, it is known for slow sampling: the denoising (reverse) diffusion process usually takes $T \sim 1000$ rounds of noise prediction.

DDIM is proposed to speed up the reverse process by generalizing it to a non-Markovian process: it directly takes any model trained with the DDPM objective and samples only $T_{ddim} \le T$ timesteps, with the remaining timesteps skipped. As a side effect, the quality is compromised a little.

Reverse Process

It still takes the same approach as DDPM: predict $x_0$ from $x_t$ first.

From Eq 4, we can see that training only involves $x_t$ through the marginal $p(x_{t}|x_{0})$, and doesn't actually involve the forward transition $p(x_{t}|x_{t-1})$: any process with the same marginals is compatible with a DDPM-trained model. DDIM therefore defines:

$$ q_\sigma(x_{t-1}|x_{t}, x_0) = \mathcal{N}(x_{t-1};\sqrt{\bar \alpha_{t-1}}x_0 + \sqrt{1 - \bar \alpha_{t-1} - \sigma_t^2}\frac{x_t - \sqrt{\bar \alpha_t}x_0}{\sqrt{1 - \bar \alpha_t}}, \sigma_t^2I) $$

Hence, the relation between $x_{t-1}$ and $x_t$ is:

$$ x_{t-1} = \sqrt{\bar \alpha_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1 - \bar \alpha_{t}}\epsilon_{\theta}(x_{t},t)} {\sqrt{\bar \alpha_{t}}} \right)}_{\text{predicted $x_0$}} + \underbrace{\sqrt{1 - \bar \alpha_{t-1} - \sigma_{t}^{2}}\,\epsilon_{\theta}(x_{t},t)}_{\text{predicted noise}} + \underbrace{\sigma_{t} \epsilon_{t}}_{\text{random noise}} $$ where: $$ \sigma_t = \eta \sqrt{\frac{1 - \bar \alpha_{t-1}}{1 - \bar \alpha_{t}} \left(1 - \frac{\bar \alpha_t}{\bar \alpha_{t-1}}\right)} $$ where $\eta \in [0,1]$ is a constant, indicating the level of random noise:

  • $\eta = 1$: the random noise is maximized, which recovers DDPM.
  • $\eta = 0$: the random noise is totally removed, making it a deterministic process (an implicit model), which is DDIM. It relies entirely on the predicted noise, sacrificing some diversity by lowering the random-noise level.

As for which timesteps are chosen, they are determined empirically.

[!TIP] In fact, $\eta$ represents the degree of moving some of the noise from the predicted noise $\epsilon_{\theta}$ to the sampled noise $\epsilon$: the bigger the $\eta$, the less deterministic the process, and the more random noise is introduced into the reverse process.
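
A sketch of one DDIM update implementing the equation above (a hypothetical helper; `alpha_bar` holds the cumulative $\bar\alpha$, and `t_prev` is the previous timestep kept in the sub-sequence):

```python
import torch

def ddim_step(model, x_t, t, t_prev, alpha_bar, eta=0.0):
    """One DDIM reverse step; eta=0 is deterministic DDIM, eta=1 recovers DDPM."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = model(x_t, t)
    x0_pred = (x_t - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)   # predicted x_0
    sigma = eta * torch.sqrt((1 - a_prev) / (1 - a_t) * (1 - a_t / a_prev))
    dir_xt = torch.sqrt(1 - a_prev - sigma ** 2) * eps              # predicted noise
    noise = sigma * torch.randn_like(x_t) if eta > 0 else 0.0       # random noise
    return torch.sqrt(a_prev) * x0_pred + dir_xt + noise
```

Iterating this step over a decreasing sub-sequence of timesteps with $\eta = 0$ reproduces deterministic DDIM sampling.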

Short Summary

In summary, both models apply the same forward process and have the same target: $x_{T} \to x_{0}$, though they differ in the reverse process:

  • DDPM maximizes the random noise and, in order to mitigate its negative effects, takes more timesteps in the reverse process
  • DDIM boosts performance by selecting only some of the timesteps and reducing the random-noise level

Conditioned Generation

While the models above can generate high-quality images at reasonable speed, generating conditioned output is a common requirement.

Given a condition $y$, our goal is to derive $p(x_{t-1}|x_{t},y)$.

Classifier Guided Diffusion

Using bayes’ rule, we have: $$ p(x_{t-1}|x_{t},y) = \frac{p(x_{t-1}|x_{t})p(y|x_{t-1},x_{t})}{p(y|x_{t})} $$

Using the notations in Reverse Process: $$ p(x_{t-1}|x_{t},y) \propto \exp\left(-||x_{t-1}-\mu(x_{t})-\Sigma_{t}^{2}\underbrace{\nabla_{x_{t}}\log p(y|x_{t})}_{\text{classifier}}||^{2}/2\Sigma_{t}^{2}\right) $$

So $\mu_{\theta}(x_{t}, t,y) = \mu(x_{t})+\Sigma_{t}^{2}\nabla_{x_{t}}\log p(y|x_{t})$, with $\Sigma_{t} = \sigma_{t}$.

[!NOTE]

  • The gradient of the probability is easy to get with autograd, provided the classifier outputs the probability
  • The classifier guides the model only at inference time
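
A sketch of the mean shift, assuming `classifier` returns logits for $p(y|x_t)$ (the `scale` knob is a common extension; `scale=1.0` recovers the formula above):

```python
import torch

def guided_mean(mu, sigma_t, classifier, x_t, y, scale=1.0):
    """Shift the reverse-process mean by the classifier gradient grad log p(y|x_t)."""
    x = x_t.detach().requires_grad_(True)
    log_prob = classifier(x).log_softmax(dim=-1)[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(log_prob, x)[0]   # autograd gives the gradient for free
    return mu + scale * sigma_t ** 2 * grad
```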

Classifier-Free Diffusion

To infer without a classifier, we need to blend the condition $y$ into the training process.

By directly modeling the conditioned reverse process as $p(x_{t-1}|x_{t},y) = \mathcal{N}(x_{t-1};\mu(x_{t},y), \sigma_{t}^{2}I)$, following the modeling of Eq. 11, we have: $$ \mu(x_{t}, y) = \frac{1}{\alpha_{t}}\left(x_{t} - \frac{\beta_{t}^2}{\bar\beta_{t}}\epsilon_{\theta}(x_{t},y, t)\right) $$

The $\epsilon_{\theta}(x_{t},y, t)$ can be trained to predict the noise under condition.
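
A minimal sketch of training the conditional noise predictor: it is the unconditional objective from earlier with the condition `y` passed through to the model (interface assumed):

```python
import torch

def conditional_ddpm_loss(model, x0, y, alpha_bar):
    """Same objective as the unconditional loss, but the condition y is fed to the model."""
    t = torch.randint(1, len(alpha_bar), (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *[1] * (x0.dim() - 1))
    x_t = a * x0 + torch.sqrt(1 - a ** 2) * eps
    return ((eps - model(x_t, y, t)) ** 2).mean()
```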

[!WARNING] The conditioned noise predictor depends on $y$, so retraining is required if the condition $y$ changes.

Score-based generative models

| Notation | Meaning |
| --- | --- |
| Ancestral Sampling | a sampling method; auto-regressive |
| Score Distillation Sampling | a sampling method (sampler) to generate samples from a diffusion model by optimizing a loss function. Basically, it utilizes (distills) the score function of a teacher diffusion model to train a larger model, with the final result as a sample (as $t \to 0$) |
| score function | the gradient of the log-likelihood function, $\nabla_{x}\log q(x)$ |
| $U(x) = -\log q(x)$ | an energy function. The lower the likelihood, the higher the energy |
| Langevin dynamics | a Markov chain Monte Carlo (MCMC) method for obtaining random samples |
| Fisher divergence | $\mathbb{E}_{p}[\|\nabla_{x}\log p(x) - \nabla_{x}\log q(x)\|^{2}]$, the expected squared distance between two score functions |

Most often, we don’t care about the probability $q(x)$ of a certain input $x$, but how it changes through time: therefore, we can utilize score function(gradient, changes) $s(x):=\nabla_{x}\log q(x)$

[!TIP] Another advantage of modeling the score: we don't have to make sure the probability sums to 1.

With $s(x)$ allowing us to sample from $q(x)$ via (thermodynamics-inspired) Langevin dynamics, our goal changes to modeling $s(x)$:

$$ dx_{t} = \nabla_{x} \log q(x_{t})\, dt + \sqrt{2}\, dW_{t} $$

Loss

We learn a model $s_{\theta}$ to match (approximate) the score $\nabla \log q$:

$$ s_{\theta}(x) \approx \nabla_{x} \log q(x) $$ This is score matching.

Typically, score matching is formalized as minimizing the Fisher divergence: $$ \mathcal{L} = \mathbb{E}_{q}[||s_{\theta}(x) - \nabla_{x} \log q(x)||^{2}] $$ However, this is infeasible as-is, since it requires access to the unknown score $\nabla_{x} \log q(x)$.

Fortunately, by expanding the square and performing an integration by parts, score matching techniques (e.g. the Hyvärinen scoring rule) minimize the Fisher divergence without knowledge of the ground-truth score: $$ \mathcal{L} = \mathbb{E}_{q}\left[\operatorname{tr}(\nabla_{x} s_{\theta}(x))+ \frac{1}{2}||s_{\theta}(x)||_{2}^{2}\right] $$

Since $s_{\theta}$ is modeled by ourselves, its output and gradients can be easily calculated. We use Monte Carlo estimates with gradient descent to optimize it.

Sample / Inference

But how do we generate a sample?

Once we have trained a score-based model $s_{\theta}(x)$, we can use an iterative procedure called Langevin dynamics to draw samples from it: $$ x_{i + 1} \leftarrow x_{i} + \epsilon \nabla_{x} \log p(x_{i}) + \sqrt{2\epsilon}\, z_{i}, \quad z_{i} \sim \mathcal{N}(0, I) $$ Notice that some white noise is injected to prevent all samples from collapsing into a few local optima.
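
A minimal sketch of this sampler, assuming `score_model` approximates $\nabla_{x} \log p(x)$:

```python
import torch

def langevin_sample(score_model, x, eps=1e-4, n_steps=1000):
    """Unadjusted Langevin dynamics driven by a learned score."""
    for _ in range(n_steps):
        z = torch.randn_like(x)                                  # injected white noise
        x = x + eps * score_model(x) + (2 * eps) ** 0.5 * z
    return x
```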

This seems decent, but in fact the estimated scores are inaccurate in low-density regions.

It’s natural to augment the low-density regions by perturbing our datapoint: injecting $\mathcal{N}$. It can solve the problem in low-density, however since the training data is perturbed, the generated samples are too.

The fix: multiple (decreasing) noise levels $\sigma$ are applied as an extra input to the score function $s$, with the output of the previous level $i$ as the input of the next level $i+1$. The whole process resembles Annealed Langevin Dynamics.
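
A sketch of annealed Langevin dynamics; the per-level step-size schedule $\epsilon_i \propto \sigma_i^2$ is a common choice and an assumption here:

```python
import torch

def annealed_langevin(score_model, x, sigmas, eps0=2e-5, n_steps=100):
    """Run Langevin dynamics at each noise level, largest sigma first."""
    for sigma in sigmas:                          # sigmas sorted in decreasing order
        eps = eps0 * (sigma / sigmas[-1]) ** 2    # smaller steps at lower noise
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + eps * score_model(x, sigma) + (2 * eps) ** 0.5 * z
    return x
```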

SDE

| Notation | Meaning |
| --- | --- |
| SDE | a differential equation in which one or more of the terms is a stochastic process |
| $\mathcal{U}(T_{a},T_{b})$ | uniform distribution over the time interval $[T_{a}, T_{b}]$ |

In DDPM, we defined $t$ as discrete timestamps; however, it's more natural to model time as continuous.

Forward Process

With this premise, we model the forward process with a stochastic differential equation (SDE) rather than a function of discrete timestamps: $$ dx = f_{t}(x)\,dt + g_{t}\,dw $$

Reverse Process

Similarly, we want to model $p(x_{t}|x_{t + \Delta{t}})$: $$ \begin{align} p(x_{t}|x_{t + \Delta{t}}) &= \frac{p(x_{t + \Delta{t}} | x_{t})p(x_{t})}{p(x_{t+ \Delta{t}})} \\ &= p(x_{t + \Delta{t}} | x_{t})\exp(\log p(x_{t}) - \log p(x_{t+\Delta{t}})) \\ &\propto \exp\left(-\frac{||x_{t + \Delta{t}} - x_{t} - f_{t}(x_{t})\Delta{t}||^{2}}{2g_{t}^{2}\Delta t}+ \log p(x_{t}) - \log p(x_{t+\Delta{t}})\right) \end{align} $$

In order to calculate the unknown difference, we apply a Taylor expansion: $$ \log p(x_{t+\Delta{t}}) \approx \log p(x_{t}) + (x_{t+\Delta t} - x_{t}) \cdot \nabla_{x_{t}}\log p(x_{t}) + \underbrace{\Delta t \frac{\partial \log p(x_{t})}{\partial t}}_{\text{derivative of $\log p(x_{t})$ w.r.t. $t$}} $$

Updating Equation 26-3 with it, we have: $$ p(x_{t}|x_{t + \Delta{t}}) \approx \mathcal{N}\left(x_{t};\ x_{t+\Delta t} - \left[f_{t+\Delta t}(x_{t + \Delta t}) - g_{t + \Delta t}^{2}\nabla_{x_{t + \Delta t}} \log p(x_{t + \Delta t})\right]\Delta t,\ g_{t + \Delta t}^{2}\Delta t\, I\right) $$

and the SDE of reverse process: $$ dx = [f_{t}(x) - g_{t}^{2}\nabla_{x}\log p_{t}(x)]dt + g_{t}dw $$

Training

$$ \mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(0, T)}\mathbb{E}_{p_{t}(x)}[\lambda(t)||\nabla_{x}\log p_{t}(x) - s_{\theta}(x,t)||^{2}_{2}] $$ where:

  • $\lambda : \mathbb{R} \to \mathbb{R}_{>0}$ is a positive weighting function
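
Since the ground-truth $\nabla_{x}\log p_{t}(x)$ is unknown, practical implementations replace it with the score of the perturbation kernel $p_{t}(x_{t}|x_{0})$ (denoising score matching). A sketch under that assumption, with a VE-type kernel $x_{t} = x_{0} + \sigma(t)\epsilon$ and hypothetical `sigma_fn` / `lambda_fn` callables:

```python
import torch

def sde_score_loss(score_model, x0, T, sigma_fn, lambda_fn):
    """Weighted denoising score matching at a uniformly sampled t in [0, T]."""
    t = torch.rand(x0.shape[0]) * T
    sigma = sigma_fn(t).view(-1, *[1] * (x0.dim() - 1))
    eps = torch.randn_like(x0)
    x_t = x0 + sigma * eps            # VE-type perturbation kernel
    target = -eps / sigma             # score of N(x_t; x0, sigma^2 I)
    lam = lambda_fn(t).view(-1, *[1] * (x0.dim() - 1))
    return (lam * (score_model(x_t, t) - target) ** 2).mean()
```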

Probability flow ODE

| Notation | Meaning |
| --- | --- |
| PF (Probability flow) ODE | the ODE counterpart of an SDE, sharing the same marginal distributions |

Despite being capable of generating high-quality samples, samplers based on Langevin MCMC and SDE solvers do not provide a way to compute the exact log-likelihood of score-based generative models.

It has been proved that it is possible to convert any SDE into an ODE (ordinary differential equation) without changing its marginal distributions $p_{t}(x)$.

Forward Process

With a sequence of involved calculations (using the Fokker–Planck equation and the Dirac delta function), we have:

$$ dx = [f(x,t) - \frac{1}{2}(g^{2}(t)-\sigma_{t}^{2})\nabla_{x}\log p_{t}(x)]dt $$

Reverse Process

The reverse process of PF-ODE is given by:

$$ dx = [f(x,t) - \frac{1}{2}g^{2}(t)\nabla_{x}\log p_{t}(x)]dt $$

[!TIP] When $s_{\theta}(x,t)$ replaces $\nabla_{x}\log p_{t}(x)$, the PF ODE becomes a special case of a neural ODE.

Samplers

Euler

$$ \begin{equation}\left.\frac{d\boldsymbol{x}_t}{dt}\right|_{t=t_{n+1}}\approx \frac{\boldsymbol{x}_{t_{n+1}} - \boldsymbol{x}_{t_n}}{t_{n+1} - t_n}\end{equation} $$ a first-order approximation
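
A minimal Euler step for the PF ODE, where `f`, `g`, and `score_model` are assumed callables for the drift, diffusion coefficient, and learned score:

```python
import torch

def euler_step(score_model, x, t, t_next, f, g):
    """One Euler step of the PF ODE: dx = [f(x,t) - 0.5 g(t)^2 * s_theta(x,t)] dt."""
    dt = t_next - t                   # negative when integrating backwards in time
    return x + (f(x, t) - 0.5 * g(t) ** 2 * score_model(x, t)) * dt
```

Iterating this step from $t = T$ down to $t = 0$ yields a deterministic sampler.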

Heun solver

DPM solver

AMED solver